Everything DAS
Last Updated 5th Feb 2009
The intention of this document is to bring together, and add to, all the documentation available on the WWW for the DAS system. The content on these pages draws from many sources of information and thus has many contributors. Eventually the intention is that this document will be a set of instructions that you can print out and use as reference documentation, or simply as a good read. If you find any errors on these pages, or on pages that they link to, please contact me (Jonathan Warren) to let me know. Suggestions and contributions are also welcome.
What is DAS?
As biological databases grow ever larger with the advent of high-throughput technologies such as sequencers and microarray chips, it is becoming increasingly difficult to download all the data relevant to a research team. DAS gets around this by keeping the data stored with its originators and allowing users around the world to access just the parts they need at any one time. Put another way: DAS lets you view integrated information from multiple sources without those sources needing to be aware of each other. You can also add your own DAS data source, perhaps privately within your own institution, and then view the information it serves in the context of features from other institutions.
DAS stands for Distributed Annotation System. It was originally set up for genomic information, where annotations/features are layered on top of a reference sequence, usually a genome. The idea is that a genome browser such as Ensembl or GBrowse (both DAS clients in this scenario) can display, in a single view, annotations from data sources on the same server/machine the browser is running on alongside annotations from data sources (data served by DAS servers) that could be on the other side of the world (communicating via the WWW).
The DAS system consists of the DAS Registry (www.dasregistry.org) as well as DAS servers and clients. The Registry is there to enable people and computers to easily find the DAS data sources available around the world, and also to help these data sources conform to the specifications. It is important that data served by DAS servers conform to the specifications so that different clients and servers around the world can interoperate.
The 1.6 spec is the latest, and soon to be official, DAS spec. It focuses mainly on genomic annotations but also refers to the extensions specified in the 1.53E spec below.
The 1.53E spec contains up-to-date specifications for servers and clients that exchange information over DAS that is not genome-centric. Types of data include proteins (structures and alignments), molecular interactions and volume map data.
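To make "access just the relevant parts" concrete, here is a minimal sketch of how a client might build a request for the features on one region of a reference sequence. It assumes the DAS 1.5-style `features` command with a `segment=ID:start,stop` argument; the helper name `das_features_url` is hypothetical, and the server and source names are simply the defaults used in the Perl example later in this document.

```python
from urllib.parse import urlencode


def das_features_url(server, source, segment_id, start=None, stop=None):
    """Build a DAS 'features' request URL for one segment (illustrative helper)."""
    segment = str(segment_id)
    # A range is optional: without it the whole segment is requested.
    if start is not None and stop is not None:
        segment += f":{start},{stop}"
    # safe=":," keeps the DAS range syntax unescaped in the query string.
    query = urlencode({"segment": segment}, safe=":,")
    return f"{server.rstrip('/')}/{source}/features?{query}"


if __name__ == "__main__":
    # Server and source taken from the Perl example's defaults.
    print(das_features_url("http://das.sanger.ac.uk/das", "otter_das", "1", 1000, 2000))
```

Only the URL is built here; fetching it and parsing the XML response is left to a DAS client library such as those described below.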
Current Status: DAS Specifications 1.5, 1.53E, 1.6, 2.0 and Future Intentions
Currently DAS 1.5 is the most widely used and supported version, together with 1.53E. DAS 2.0 is quite different and is effectively running in parallel with the other two versions; it is hoped that in the next few years these versions will merge into one (the 1.6 spec already includes some of the commands from the 2.0 spec). If you want your data to be widely accessible, use the 1.6 spec and the 1.53E spec documents as your guide. If your main priority is using the most recent technology and external libraries, DAS 2.0 may be of most interest to you.
Setting up a DAS Server
There are several different options available for setting up a DAS server. All are written in either Perl or Java.
Servers available
Name | Programming Language | Advantages | Disadvantages |
Dazzle | Java | Standard implementation; includes support for extensions (structure, interaction, volume map) | Some people say it can be hard to configure and deploy if you are not used to Java web development |
Proserver | Perl | Standard implementation; includes support for extensions (structure, interaction, volume map) | |
MyDAS | Java | Some people say it's easier to set up and configure than Dazzle | Doesn't currently support extensions |
LDAS | Perl | Very easy to set up? | Limited support for DAS functionality and sources |
Dazzle
Dazzle is currently the standard/default implementation for Java users; however, MyDas (described below) is also popular.
Dazzle Eclipse Tutorial
[DazzzleTutorial.jsp Dazzle Eclipse Tutorial] This tutorial takes you through setting up Dazzle in Eclipse and then shows you how to add your own plugins.
Getting Dazzle
http://biojava.org/wiki/Dazzle#Getting_Dazzle The latest, cutting-edge source code is available from Subversion: http://www.derkholm.net/svn/repos/dazzle/
Using ready made plugins for datasources
http://biojava.org/wiki/Dazzle:plugins and http://biojava.org/wiki/Dazzle:deployment. More examples are needed here, along with tips for using MySQL etc.
Writing your own plugin
http://biojava.org/wiki/Dazzle:writeplugin
To do: explain how to write a plugin using Eclipse, say more about which interfaces need to be implemented, and give a full example that implements all the needed functionality, such as the sources command and coordinate systems.
Deploying an Ensembl Reference Server
link to Ensembl reference server instructions
MyDas
Information about MyDas can be found here.
Proserver
Proserver Page at the Sanger Institute.
Proserver Tutorial
Guide to Proserver
Implementing the latest specs
Proserver example of config to implement sources cmd:
coordinates = TAIR_8,Chromosome,Arabidopsis thaliana -> 1:2000,3000
properties = key1 -> value1 ; key2 -> value2
mapmaster = http://www.gramene.org/das/Arabidopsis_thaliana.TAIR8.reference
capabilities = features -> 1.0
The coordinates data is taken from the coordinates/registry_coordinates.xml file, which is an archived copy of the list of coordinates available in the DAS Registry. Specifying the name (or, strictly, the URI) and a test range is enough; ProServer will pick up the rest from the XML file. If the full data is not picked up, you may need to update the coordinates XML file from the Registry (http://www.dasregistry.org/das/coordinatesystem). If your coordinate system is not in the Registry, an admin can add it for you.
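As a rough illustration of what the coordinates entry carries, the sketch below splits the example value into the coordinate system name and the test range. This is not ProServer's own parser; the function name `parse_coordinates` is hypothetical, and it assumes the value always has the `name -> segment:start,stop` shape shown in the example config.

```python
def parse_coordinates(value):
    """Split 'name -> segment:start,stop' into its parts (illustrative only)."""
    # Left of '->' is the coordinate system name, right of it the test range.
    name, _, test_range = (part.strip() for part in value.partition("->"))
    # The name itself is a comma-separated list of fields.
    fields = [field.strip() for field in name.split(",")]
    # The test range is a segment id followed by start and stop positions.
    segment, _, span = test_range.partition(":")
    start, stop = (int(number) for number in span.split(","))
    return {
        "name": name,
        "fields": fields,
        "segment": segment,
        "start": start,
        "stop": stop,
    }


if __name__ == "__main__":
    entry = "TAIR_8,Chromosome,Arabidopsis thaliana -> 1:2000,3000"
    print(parse_coordinates(entry))
```

The test range is the region the Registry can request to check that the source really serves data in the declared coordinate system.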
Protein Annotations and Ontologies
[extension_ontology.jsp explanation of ontologies for proteins usage in DAS]
Testing your implementation
Validation and Registering of your Server
RelaxNG and other validation in the Registry
The DAS Registry uses RelaxNG to validate the XML responses from DAS servers before allowing them to register as valid DAS sources. RelaxNG is a schema language that plays a similar role to a DTD, except that its schemas are themselves written in an XML syntax that is quick to learn. The Registry uses the documents found at http://www.dasregistry.org/validation/, one for each of the DAS commands: features.rng, sources.rng, alignments.rng, structure.rng, entry_points.rng, interaction.rng, sequence.rng and types.rng. (Note that you may need to right-click and choose "view source" to see anything on these pages in a web browser.)
The DAS Registry
Introduction to the DAS Registry
Connecting to the Registry Programmatically
There are several commands that can be used to query the Registry, including the sources command with the optional parameters label, organism, authority, capability, type and the unique source_id. You can also use the organism, coordinatesystem and lastmodified commands. For examples see Scripting. An example of a Java class written using Dasobert to access the Registry is here: http://www.derkholm.net/svn/repos/dasobert/trunk/doc/examples/ContactRegistry.java
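A minimal sketch of building such a query follows. The endpoint `http://www.dasregistry.org/das/sources` is an assumption (by analogy with the coordinatesystem URL given earlier), the function name is hypothetical, and only the five filter parameters named above are handled; how source_id is passed is not covered here.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Registry's sources command.
REGISTRY_SOURCES = "http://www.dasregistry.org/das/sources"

# Filter parameters named in the text above.
ALLOWED_FILTERS = ("label", "organism", "authority", "capability", "type")


def registry_sources_url(**filters):
    """Build a Registry 'sources' query URL from optional filters (illustrative)."""
    unknown = sorted(set(filters) - set(ALLOWED_FILTERS))
    if unknown:
        raise ValueError(f"unsupported filter(s): {', '.join(unknown)}")
    # Emit parameters in a fixed order so the URL is reproducible.
    params = [(key, filters[key]) for key in ALLOWED_FILTERS if key in filters]
    if not params:
        return REGISTRY_SOURCES
    return REGISTRY_SOURCES + "?" + urlencode(params)


if __name__ == "__main__":
    print(registry_sources_url(organism="Homo sapiens", capability="features"))
```

The resulting URL can be fetched with any HTTP client; the response is the XML sources document, which is what the sources.rng schema describes.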
Setting up a DAS client
Currently Available DAS Clients
Name | Description | Programming Language | Links |
GBrowse | Quote from the GBrowse website: "GBrowse is the most popular viewer in GMOD. For a list of GBrowse and GMOD installations see the GMOD Users page. For a demo of its features, try the WormBase, FlyBase, or Human Genome Segmental Duplication Database web sites." Supports DAS 1.53E, with 1.6 to follow soon | Perl | http://gmod.org/wiki/Gbrowse |
EnsEMBL | EnsEMBL is a web-based genome browser and database system which supports DAS 1.53E, with 1.6 to follow soon | Perl | http://www.ensembl.org/ |
IGB | An application built upon the GenoViz SDK and Genometry for visualization and exploration of genomes and corresponding annotations from multiple data sources | Java | http://genoviz.sourceforge.net/ |
Jalview | A multiple sequence alignment editor and viewer | Java | http://www.jalview.org/ |
Dasty2 | Dasty is a protein DAS client for visualising protein sequence feature information. The client can connect to a reference server and one or many DAS servers. It merges the data from all the servers and displays sequence information as well as annotated feature information from all the available DAS servers in a very user-friendly way | Perl and AJAX | http://www.ebi.ac.uk/dasty/ |
To do: add clients from the workshop.
Writing your own DAS client
A Java DAS Client Library - Dasobert
Examples of client code written in Java using Dasobert can be found here: http://www.derkholm.net/svn/repos/dasobert/trunk/doc/examples/
There is also a tutorial for using Dasobert within Eclipse (it follows on from the Dazzle Eclipse tutorial) here: [DasobertTutorial.jsp Dasobert Eclipse Tutorial]
Example of walking a DAS source using perl
This example was kindly provided by Felix Kokocinski. You can specify a region, or let the script walk through all regions if the server can supply entry points with lengths. The walk is done in slices of, e.g., 20 Mb. It takes quite some time, but works nicely.
# Example script that reads genomic data from a DAS server
# using a defined chunk size,
# writing the data out to a GFF file.
# fsk@sanger.ac.uk, 2008

use strict;
use Bio::Das::Lite;
use Getopt::Long;

# default DAS server address
my $server = "http://das.sanger.ac.uk/das";
# default DAS source name
my $source = 'otter_das';
# proxy name
my $http_proxy = undef;
# genomic chunk size to query
my $max_len = 20000000;

my $chromosome  = undef;
my $start       = 0;
my $end         = 0;
my $gff_file    = undef;
my %transcripts = ();
# optional feature type filter; not set by any option in this example
my $type;

&GetOptions(
    'file=s'       => \$gff_file,
    'chromosome=s' => \$chromosome,
    'start=s'      => \$start,
    'end=s'        => \$end,
    'server=s'     => \$server,
    'source=s'     => \$source,
);

die "Please specify an output file with --file.\n" unless defined $gff_file;

# connect to DAS server
my $das = connect_das("$server/$source", $http_proxy);

# get entry point list/lengths
# requires the DAS server to support the entry-points function
my $chrom_lens = get_entry_points();

open(GFF, '>', $gff_file) or die "Can't open file $gff_file.\n";

if ($chromosome) {
    # query specific region
    get_region($chromosome, $start, $end);
}
else {
    # go through all chromosomes
    foreach my $chrom (keys %$chrom_lens) {
        print "getting $chrom\n";
        get_region($chrom, undef, undef);
        %transcripts = ();
    }
}

close(GFF) or die "Can't close file $gff_file.\n";

################################################

# connect to DAS server
sub connect_das {
    my ($dsn, $proxy) = @_;

    my $das = Bio::Das::Lite->new({
        'timeout'    => 10000,
        'dsn'        => $dsn,
        'http_proxy' => $proxy,
    }) or die "can't connect to DAS server!\n";

    return $das;
}

# look at the region requested
sub get_region {
    my ($chromosome, $start, $end) = @_;

    my $chrom_len = $chrom_lens->{$chromosome};
    my $region    = "";

    if ($start and $end) {
        if ($start > $end) {
            die "Coordinates wrong: $start > $end!\n";
        }
        if (($end - $start) <= $max_len) {
            # get entire region
            $region = ":" . $start . "," . $end;
            get_transcripts($region, $chromosome);
        }
        else {
            go_through_chunks($start, $end, $chromosome, $chrom_len);
        }
    }
    elsif ($chrom_len <= $max_len) {
        # get entire chromosome
        get_transcripts($region, $chromosome);
    }
    else {
        go_through_chunks(1, $chrom_len, $chromosome, $chrom_len);
    }
}

# go through a region in chunks
sub go_through_chunks {
    my ($chunk_start, $chunk_end, $chromosome, $chrom_len) = @_;

    my ($region_start, $region_end);
    my %ids_seen;

    # loop through regions until all is covered
    # keep track of genes to avoid duplicates!
    for ($region_start = $chunk_start, $region_end = $region_start + $max_len;
         $region_start < $chunk_end;
         $region_start = $region_end + 1, $region_end += $max_len) {

        # never run past the requested end or the end of the chromosome
        if ($region_end > $chunk_end) {
            $region_end = $chunk_end;
        }
        if (defined $chrom_len and $region_end > $chrom_len) {
            $region_end = $chrom_len;
        }
        my $region = ":" . $region_start . "," . $region_end;

        # get all transcripts from chunk
        my $new_ids = get_transcripts($region, $chromosome, \%ids_seen);
        %ids_seen = (%ids_seen, %$new_ids);
    }
}

# fetch all available entry-points (chromosomes) and their lengths from server
sub get_entry_points {
    my %chrom_lens;

    my $entry_points = $das->entry_points();
    foreach my $k (keys %$entry_points) {
        foreach my $l (@{ $entry_points->{$k} }) {
            foreach my $segment (@{ $l->{"segment"} }) {
                $chrom_lens{ $segment->{"segment_id"} } = $segment->{"segment_size"};
            }
        }
    }

    return \%chrom_lens;
}

# fetch the data and process it.
# note that this function is quite specific to the way your DAS source is set up.
# the idea is to get together all exons, etc. that belong to a transcript and all transcripts
# that belong to a gene.
sub get_transcripts {
    my ($region, $chromosome, $previous_genes) = @_;
    print STDERR "have chr $chromosome$region\n";

    my %genes        = ();
    my %new_features = ();

    # fetch DAS features; only send a type filter if one was set
    my %query = ('segment' => $chromosome . $region);
    $query{'type'} = $type if defined $type;
    my $response = $das->features(\%query);

    while (my ($url, $features) = each %$response) {
        if (ref $features eq "ARRAY") {
            print STDERR "Received " . scalar(@$features) . " features.\n";

            FEATURES:
            foreach my $feature (@$features) {
                my %notes     = ();
                my $grouphash = $feature->{'group'}->[0];

                # get other notes
                my $i = 0;
                while (defined($feature->{'note'}->[$i])) {
                    my $morenotes = $feature->{'note'}->[$i];
                    my ($morenotes_type, $morenotes_value) = split('=', $morenotes);
                    $morenotes_value =~ s/\&\#39\;/\'/g;
                    $notes{$morenotes_type} = $morenotes_value;
                    $i++;
                }

                # remove duplicates from overlapping regions
                if (defined $previous_genes
                    and exists($previous_genes->{ $grouphash->{'group_type'} })) {
                    next FEATURES;
                }

                # you could do some filtering of the response at this point

                my %gff_element;

                # build structure for exons and general items
                # find type
                my $element_type = $feature->{'type'} || "exon";
                if ($element_type =~ m/(intron|UTR|exon)/) {
                    $element_type = $1;
                }
                my $group_type = $grouphash->{'group_type'};

                # normalise the strand to one of '+', '-' or '.'
                my $strand = $feature->{'orientation'};
                if ($feature->{'orientation'} =~ /^(\+|\-|\.)$/) {
                    # already a valid GFF strand symbol
                }
                elsif ($feature->{'orientation'} == 1) {
                    $strand = '+';
                }
                elsif ($feature->{'orientation'} == -1) {
                    $strand = '-';
                }
                elsif ($feature->{'orientation'} == 0) {
                    $strand = '.';
                }
                else {
                    die "INVALID STRAND SYMBOL: " . $feature->{'orientation'} . "\n";
                }

                my $phase = ".";
                if ($feature->{'phase'}) {
                    $phase = $feature->{'phase'};
                }
                elsif ($element_type eq "exon") {
                    $phase = "0";
                }

                if (!$notes{"Transcriptstatus"}) {
                    die "PROBLEM: $element_type, " . $feature->{'feature_id'} . "\n";
                }

                $gff_element{'seqid'}  = $chromosome;
                $gff_element{'source'} = $notes{"Transcripttype"};
                $gff_element{'type'}   = $element_type;
                $gff_element{'start'}  = $feature->{'start'};
                $gff_element{'end'}    = $feature->{'end'};
                $gff_element{'score'}  = ".";
                $gff_element{'strand'} = $strand;
                $gff_element{'phase'}  = $phase;

                # check for some missing values
                if (!exists $feature->{'feature_id'}) {
                    print STDERR "Missing value for Parent-feature_id\n";
                    $feature->{'feature_id'} = "0";
                }
                if (!exists $notes{"Transcriptstatus"}) {
                    print STDERR "Missing value for Transcriptstatus\n";
                    $notes{"Transcriptstatus"} = "-";
                }
                if (!exists $notes{"Created"}) {
                    print STDERR "Missing value for Created\n";
                    $notes{"Created"} = 0;
                }
                if (!exists $notes{"Lastmod"}) {
                    print STDERR "Missing value for Lastmod\n";
                    $notes{"Lastmod"} = 0;
                }

                $gff_element{'attributes'} = "Parent=" . $feature->{'feature_id'}
                    . ";Status="  . $notes{"Transcriptstatus"}
                    . ";CREATED=" . $notes{"Created"}
                    . ";LASTMOD=" . $notes{"Lastmod"};

                if (!exists $genes{$group_type}) {
                    $genes{$group_type} = 1;
                    my %gff_gene;

                    my $gene_region = $feature->{'target'};
                    my ($gs, $gene_loc)         = split('\=', $gene_region);
                    my ($gene_start, $gene_end) = split('\-', $gene_loc);

                    # build structure for gene
                    $gff_gene{'seqid'}  = $chromosome;
                    $gff_gene{'source'} = $notes{"Genetype"};
                    $gff_gene{'type'}   = "gene";
                    $gff_gene{'start'}  = $gene_start;
                    $gff_gene{'end'}    = $gene_end;
                    $gff_gene{'score'}  = ".";
                    $gff_gene{'strand'} = $strand;
                    $gff_gene{'phase'}  = ".";

                    # get gene description
                    my $description = "";
                    foreach my $gnote (@{ $grouphash->{'note'} }) {
                        my ($gnote_s, $gnote_string) = split('=', $gnote);
                        if ($gnote_s eq "DESCR") {
                            $description = ";Description=" . $gnote_string;
                        }
                    }

                    $gff_gene{'attributes'} = "ID=" . $grouphash->{'group_type'}
                        . $description
                        . ";Status="  . $notes{"Genestatus"}
                        . ";CREATED=" . $notes{"Created"}
                        . ";LASTMOD=" . $notes{"Lastmod"};

                    # print entry for gene
                    print_gff_line(\%gff_gene);
                    %gff_gene = ();
                    $new_features{ $grouphash->{'group_type'} } = 1;
                }

                if (!exists $transcripts{ $feature->{'feature_id'} }) {
                    $transcripts{ $feature->{'feature_id'} } = 1;
                    my %gff_transcript;

                    # build structure for transcript
                    $gff_transcript{'seqid'}  = $chromosome;
                    $gff_transcript{'source'} = $notes{"Transcripttype"};
                    $gff_transcript{'type'}   = "transcript";
                    $gff_transcript{'start'}  = $feature->{'target_start'};
                    $gff_transcript{'end'}    = $feature->{'target_stop'};
                    $gff_transcript{'score'}  = ".";
                    $gff_transcript{'strand'} = $strand;
                    $gff_transcript{'phase'}  = ".";

                    $gff_transcript{'attributes'} = "ID=" . $feature->{'feature_id'}
                        . ";Alias1="  . $feature->{'target_id'}
                        . ";Parent="  . $grouphash->{'group_type'}
                        . ";CREATED=" . $notes{"Created"}
                        . ";LASTMOD=" . $notes{"Lastmod"}
                        . ";Status="  . $notes{"Transcriptstatus"};

                    # print entry for transcript
                    print_gff_line(\%gff_transcript);
                    %gff_transcript = ();
                }
                #else{ print STDERR "_" }

                # print entry for exons, etc.
                if ($feature->{'type_category'} =~ /error/) {
                    print STDERR "Found an error feature:\n";
                    print STDERR join("\t", map { $gff_element{$_} }
                        qw(seqid source type start end score strand phase attributes)), "\n";
                }
                else {
                    print_gff_line(\%gff_element);
                    %gff_element = ();
                }

                $feature = undef;
            }
            @$features = ();
            $features  = undef;
        }
    }

    return \%new_features;
}

# print the different data types as GFF (nine tab-separated columns)
sub print_gff_line {
    my ($element) = @_;

    print GFF join("\t", map { $element->{$_} }
        qw(seqid source type start end score strand phase attributes)), "\n";
}
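The chunked walk at the heart of the script above can be sketched on its own. The version below is a simplified illustration rather than a translation of the Perl loop: it yields non-overlapping, inclusive slices of at most `max_len` bases, and the function name `chunk_region` is hypothetical.

```python
def chunk_region(start, end, max_len):
    """Yield inclusive (chunk_start, chunk_end) pairs covering start..end."""
    if start > end:
        raise ValueError(f"coordinates wrong: {start} > {end}")
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    chunk_start = start
    while chunk_start <= end:
        # The last slice is clamped so it never runs past the region end.
        chunk_end = min(chunk_start + max_len - 1, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + 1


if __name__ == "__main__":
    # A 50-base region walked in slices of 20 bases.
    for chunk_start, chunk_end in chunk_region(1, 50, 20):
        print(f":{chunk_start},{chunk_end}")
```

Even with non-overlapping slices, a feature that spans a slice boundary can be returned by both neighbouring requests, which is why the Perl script keeps track of the IDs it has already seen.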
Acknowledgments
(Some of this document may have been cut and pasted from documentation contributed by the following people):
- Andreas Prlic
- Andy Jenkinson
- Phil Jones
- Tim Hubbard
- Lincoln Stein
- Thomas Down