Everything DAS

From BioDAS
Revision as of 05:10, 22 April 2009

Last Updated 21 April 2009

The intention of this document is to bring together, and add to, all the documentation available on the WWW for the DAS system. The content on these pages draws from many sources of information and thus has many contributors. Eventually the intention is that this document will be a set of instructions that you can print out and use as reference documentation, or simply a good read. If you find any errors on these pages, or on pages that they link to, please contact me (Jonathan Warren) to let me know; suggestions and contributions are also welcome. As this is now in wiki form, you can log in and edit or add things yourself.

What is DAS?

DAS stands for Distributed Annotation System. As biological databases grow ever larger with the advent of high-throughput technologies such as sequencers and microarray chips, it is becoming increasingly difficult to download all the data relevant to a research team. DAS gets around this by keeping the data stored with its originators and allowing users around the world to access just the parts they need at any one time. Put another way: by making use of DAS you can view integrated information from multiple sources, without these sources needing to be aware of each other. You can also add your own DAS data source, perhaps privately in your own institution, and then view the information served from this source in the context of features from other institutions.

DAS was originally set up to be used with genomic information, where annotations/features are layered on top of a reference sequence, usually a genome. The idea is that a genome browser such as Ensembl or GBrowse (both DAS clients in this scenario) can display, in the same view, annotations from data sources on the same server/machine the browser is running on and annotations from data sources (data served by DAS servers) that could be on the other side of the world, communicating via the WWW.

The DAS system consists of the DAS Registry (www.dasregistry.org) as well as DAS servers and clients. The Registry is there to enable people and computers to easily find the DAS data sources available around the world, and also to help these data sources conform to the specifications. It is important that data served by DAS servers conform, so that different clients and servers around the world can interoperate.

1.0 is currently supported, and work is starting on supporting the new 1.6 spec. The 1.6 spec is the latest and soon-to-be-official DAS spec; it mainly focuses on genomic annotations but also refers to the extensions specified in the 1.53E spec. The 1.53E spec contains up-to-date specifications for servers and clients that exchange, using DAS, information that is not genome-centric. Types of data include proteins (structures and alignments), molecular interactions and volume map data.

Current Status/ DAS specifications 1.5, 1.53E, 1.6, 2.0 and Future Intentions

Currently DAS 1.5 is the most widely used and supported version, together with 1.53E. DAS 2.0 is quite different and has really been running in parallel to the other two versions of DAS. After the 2009 workshop it was generally agreed that most of the useful additional features that 2.0 provides are now, or will very soon be, implemented in DAS 1.6E and its subsequent incarnations, and thus DAS 2.0 is now considered redundant. If you wish your data to be widely accessible, use the 1.6 spec and the 1.53E spec documents as your guide.

Setting up a DAS Server

There are several different options available for setting up a DAS server. All are written in either Perl or Java.

Servers available

Name | Programming Language | Advantages | Disadvantages
Dazzle | Java | Standard implementation; includes support for extensions (structure, interaction, vol) | Some people say it can be hard to configure and deploy if you are not used to Java web development
Proserver | Perl | Standard implementation; includes support for extensions (structure, interaction, vol) | (none listed)
MyDAS | Java | Some people say it's easier to set up and configure than Dazzle | Doesn't support extensions currently
LDAS | Perl | Very easy to set up? | Limited support for DAS functionality and sources


Dazzle

Dazzle is currently the standard/default implementation for Java users; however, MyDas (described below) is also popular.

Dazzle Eclipse Tutorial

Dazzle_Tutorial: this tutorial takes you through setting up Dazzle in Eclipse and then shows you how to add your own plugins.

Getting Dazzle

Instructions: http://biojava.org/wiki/Dazzle#Getting_Dazzle
The latest, cutting-edge source code is available from Subversion: http://www.derkholm.net/svn/repos/dazzle/

Using ready made plugins for datasources

http://biojava.org/wiki/Dazzle:plugins
http://biojava.org/wiki/Dazzle:deployment
More examples needed here, and tips for using MySQL etc.?

Writing your own plugin

http://biojava.org/wiki/Dazzle:writeplugin
To do: describe how to write a plugin using Eclipse, say more about which interfaces need to be implemented, and give a full example that implements all the needed functionality, such as the sources command and coordinate systems.

Deploying an Ensembl Reference Server

link to Ensembl reference server instructions

MyDas

Information about MyDas can be found here.

Proserver

Proserver Page at the Sanger Institute.
Proserver Tutorial
Guide to Proserver

Implementing the latest specs

Example ProServer configuration to implement the sources command:


coordinates = TAIR_8,Chromosome,Arabidopsis thaliana -> 1:2000,3000
properties  = key1 -> value1 ; key2 -> value2
mapmaster   = http://www.gramene.org/das/Arabidopsis_thaliana.TAIR8.reference
capabilities = features -> 1.0

The coordinates data is taken from the coordinates/registry_coordinates.xml file, which is an archived copy of the list of coordinates available in the DAS registry. Specifying the name (or, strictly, the URI) and a test range is enough; ProServer will pick up the rest from the XML file. If the full data is not picked up, you may need to update the coordinates XML file from the registry (http://www.dasregistry.org/das/coordinatesystem). If your coordinate system is not in the Registry, an admin can add it for you.

Protein Annotations and Ontologies

Explanation of how ontologies for proteins are used in DAS: extension_ontology.jsp

Testing your implementation

Validation and Registering of your Server

RelaxNG and other validation in the Registry

The DAS Registry uses RelaxNG to validate the XML responses from DAS servers before allowing them to register as valid DAS sources. A RelaxNG schema is essentially a document like a DTD, except that it uses an XML syntax that is quick to learn. The registry uses the schemas found at http://www.dasregistry.org/validation/ and has one for each of the DAS commands: features.rng, sources.rng, alignments.rng, structure.rng, entry_points.rng, interaction.rng, sequence.rng and types.rng. (Note that you may need to right-click and "view source" to see anything on these pages in a web browser.)

The DAS Registry

Introduction to the DAS Registry

The DAS registry can be found at http://www.dasregistry.org and serves as a central place for discovering DAS sources from around the world and for validating those sources. There is a user interface for interrogating the sources, and there are also ways for clients to interrogate them. Support for searching sources based on ontologies is likely to be included in future releases. The number of registered sources is set to increase rapidly to accommodate the Ensembl Genomes project data and the general increase in the number of sequenced genomes. The registry will thus have to be modified in order to cope with this increase in data.

Connecting to the Registry Programmatically

There are several commands that can be used to query the registry. These include the sources command, with the optional parameters label, organism, authority, capability, type and unique source_id. You can also use the organism, coordinatesystem and lastmodified commands. For examples see Scripting. An example of a Java class written using Dasobert to access the Registry is here: http://www.derkholm.net/svn/repos/dasobert/trunk/doc/examples/ContactRegistry.java
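As a rough illustration of the sources command with optional parameters, the following Python sketch builds a query URL. It is not taken from the registry documentation: the function name registry_sources_url is hypothetical, the base URL is the registry sources address quoted elsewhere on this page, and it is an assumption that the filters are passed as ordinary query-string parameters with exactly these spellings. The source_id parameter is left out because this page does not say how it is supplied.

```python
from urllib.parse import urlencode

# Registry sources address as quoted on this page.
REGISTRY_SOURCES_URL = "http://www.dasregistry.org/das1/sources"

# Filter names listed on this page; treating them as query-string
# parameters with these exact spellings is an assumption.
ALLOWED_FILTERS = ("label", "organism", "authority", "capability", "type")


def registry_sources_url(base=REGISTRY_SOURCES_URL, **filters):
    """Return a sources URL carrying the given filters (hypothetical helper)."""
    unknown = sorted(set(filters) - set(ALLOWED_FILTERS))
    if unknown:
        raise ValueError("unknown filter(s): " + ", ".join(unknown))
    # Keep a fixed parameter order so the resulting URL is predictable.
    pairs = [(name, filters[name]) for name in ALLOWED_FILTERS if name in filters]
    query = urlencode(pairs)
    return base + "?" + query if query else base


if __name__ == "__main__":
    print(registry_sources_url(organism="Homo sapiens", capability="features"))
```

The resulting URL could then be fetched with any HTTP client; check the registry's own documentation for the parameter values it actually accepts.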

Adding a large set of data sources to the registry

The registry can automatically load a large set of data sources from the sources.xml that is returned from the sources cmd. If you wish to load a large set of sources you can contact dasregistry@sanger.ac.uk and ask for your data sources to be loaded. Please note that your sources must have valid coordinate systems that are in the registry and a valid sources document. You can do an initial test at http://www.dasregistry.org/validateServer.jsp and select the sources capability for your server.

Discovering DAS sources programmatically

The registry produces its own sources.xml in response to the URL request http://www.dasregistry.org/das1/sources, and clients can use this to get information on the many DAS sources available around the world and what their capabilities are. Currently there is no way for clients to find out whether the registry considers the sources valid.
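A client could read such a sources document along the following lines. This is a minimal Python sketch, not registry code: the sample XML is invented for illustration (the source title, the example.org URLs and the maintainer address are all hypothetical), and the element and attribute names (SOURCE, VERSION, COORDINATES, CAPABILITY, query_uri) are assumed from the DAS sources format, so they should be checked against a real response.

```python
import xml.etree.ElementTree as ET

# Invented sample document; every value here is a placeholder.
SAMPLE_SOURCES_XML = """<SOURCES>
  <SOURCE uri="example_1" title="example_features" description="Hypothetical source">
    <MAINTAINER email="someone@example.org"/>
    <VERSION uri="example_1" created="2009-04-21">
      <COORDINATES uri="http://example.org/coords/1" source="Chromosome"
                   authority="TAIR" version="8"
                   test_range="1:2000,3000">TAIR_8,Chromosome,Arabidopsis thaliana</COORDINATES>
      <CAPABILITY type="das1:features"
                  query_uri="http://example.org/das/example_features/features"/>
      <CAPABILITY type="das1:types"
                  query_uri="http://example.org/das/example_features/types"/>
    </VERSION>
  </SOURCE>
</SOURCES>
"""


def list_sources(xml_text):
    """Return one dict per SOURCE with its title, capabilities and coordinates."""
    root = ET.fromstring(xml_text)
    sources = []
    for source in root.iter("SOURCE"):
        # Map each capability type to the URL that serves it.
        capabilities = {
            cap.get("type"): cap.get("query_uri")
            for cap in source.iter("CAPABILITY")
        }
        coordinates = [(c.text or "").strip() for c in source.iter("COORDINATES")]
        sources.append({
            "title": source.get("title"),
            "capabilities": capabilities,
            "coordinates": coordinates,
        })
    return sources


if __name__ == "__main__":
    for entry in list_sources(SAMPLE_SOURCES_XML):
        print(entry["title"], sorted(entry["capabilities"]))
```

In real use the XML text would come from the registry URL rather than from an inline sample.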

Setting up a DAS client

Currently Available DAS Clients

Name | Description | Programming Language | Links
GBrowse | Quote from the GBrowse website: "GBrowse is the most popular viewer in GMOD. For a list of GBrowse and GMOD installations see the GMOD Users page. For a demo of its features, try the WormBase, FlyBase, or Human Genome Segmental Duplication Database web sites." Spec: DAS 1.53E, and 1.6 soon | Perl | http://gmod.org/wiki/Gbrowse
EnsEMBL | EnsEMBL is a web-based genome browser and database system which supports DAS 1.53E and soon 1.6 | Perl | http://www.ensembl.org/
IGB | An application built upon the GenoViz SDK and Genometry for visualization and exploration of genomes and corresponding annotations from multiple data sources | Java | http://genoviz.sourceforge.net/
Jalview | A multiple sequence alignment editor and viewer | Java | http://www.jalview.org/
Dasty2 | Dasty, a protein DAS client, is implemented for visualising protein sequence feature information. The client is able to connect to a reference server and one or many DAS servers. It merges the data from all the servers, and displays sequence information as well as annotated feature information from all the available DAS servers in a very user-friendly way. | Perl and AJAX | http://www.ebi.ac.uk/dasty/

Writing your own DAS client

A Java DAS Client Library - Dasobert

Examples of client code written in Java using Dasobert can be found here: http://www.derkholm.net/svn/repos/dasobert/trunk/doc/examples/

There is also a tutorial for using Dasobert within Eclipse (it follows on from the Dazzle Eclipse tutorial): Dasobert_Tutorial

Example of walking a DAS source using perl

This example was kindly provided by Felix Kokocinski. You can specify a region, or let the script walk through all regions if the server can supply entry points with lengths. The walk is done in slices of, e.g., 20 Mb. It takes quite some time, but works nicely.


# Example script that reads genomic data from DAS server
# using a defined chunk size
# writing the data out to a gff file.
# fsk@sanger.ac.uk, 2008 
use strict;
use Bio::Das::Lite;
use Getopt::Long;

#default DAS server address
my $server = "http://das.sanger.ac.uk/das";
#default DAS source name
my $source = 'otter_das';
#proxy name
my $http_proxy = undef;
#genomic chunk size to query
my $max_len    = 20000000;


my $chromosome = undef;
 my $start      = 0;
my $end        = 0;
my $gff_file   = undef;
 my %transcripts = ();

my $type;

&GetOptions(
	    'file=s'                 => \$gff_file,
	    'chromosome=s'           => \$chromosome,
	    'start=s'                => \$start,
	    'end=s'                  => \$end,
            'server=s'               => \$server,
            'source=s'               => \$source,
	   );

#connect to DAS server
my $das = connect_das("$server/$source", $http_proxy);

#get entry point list/lengths
#requires the DAS server to support the entry-points function
my $chrom_lens = get_entry_points();

open(GFF, ">$gff_file") or die "Can't open file $gff_file.\n";

if($chromosome){
  #query specific region
  get_region($chromosome, $start, $end);
}
else{
  #go through all chromosomes
  foreach my $chrom (keys %$chrom_lens){
	print "getting $chrom\n";
	get_region($chrom, undef, undef);
	%transcripts = ();
  }
}


close(GFF)or die "Can't close file $gff_file.\n";


  ################################################ 

#connect to DAS server
sub connect_das {
  my ($dsn, $proxy) = @_;

  my $das = Bio::Das::Lite->new({
				 'timeout'    => 10000,
				 'dsn'        => $dsn,
				 'http_proxy' => $proxy,
				}) or die "cant connect to DAS server!\n";

  return $das;
}



#look at the region requested
sub get_region {
  my ($chromosome, $start, $end) = @_;

  my $chrom_len    = $chrom_lens->{$chromosome};
  my $region       = "";

  if( $start and $end){
    if($start > $end){
      die "Coordinates wrong: $start > $end!\n";
    }
    if( ($end - $start) <= $max_len ){
      #get entire region
      my $region = ":".$start.",".$end;
      get_transcripts($region, $chromosome);
    }
    else{
      go_through_chunks($start, $end, $chromosome, $chrom_len);
    }
  }
  elsif( $chrom_len <= $max_len ){
    #get entire chromosome
    get_transcripts($region, $chromosome);
  }
  else{
    go_through_chunks(1, $chrom_len, $chromosome, $chrom_len);
  }

}


#go through a region in chunks
sub go_through_chunks {
  my ($chunk_start, $chunk_end, $chromosome, $chrom_len) = @_;

  my ($region_start, $region_end);
  my %ids_seen;

  #loop through regions until all is covered
  #keep track of genes to avoid duplicates!
  for($region_start = $chunk_start, $region_end = $region_start + $max_len;
      $region_start < $chunk_end;
      $region_start = $region_end + 1, $region_end += $max_len){

    if($region_end > $chrom_len){
      $region_end = $chrom_len;
    }elsif($region_end > $chunk_end){
      $region_end = $chunk_end;
    }
    my $region = ":".$region_start.",".$region_end;

    #get all transcripts from chunk
    my $new_ids = get_transcripts($region, $chromosome, \%ids_seen);
    %ids_seen = (%ids_seen, %$new_ids);
  }

}



#fetch all available entry-points (chromosomes) and their lengths from server
sub get_entry_points {

  my %chrom_lens;

  my $entry_points = $das->entry_points();

  foreach my $k (keys %$entry_points){
	foreach my $l (@{$entry_points->{$k}}){
		foreach my $segment (@{ $l->{"segment"} }){
			$chrom_lens{ $segment->{"segment_id"} } = $segment->{"segment_size"};
		}
  	}
  }

  return \%chrom_lens;
}



#fetch the data and process it.
#note that this function is quite specific to the way your DAS source is set-up.
#the idea is to get together all exons, etc that belong to a transcript and all transcripts
#that belong to a gene.
sub get_transcripts {
  my ( $region, $chromosome, $previous_genes ) = @_;

  print STDERR "have chr $chromosome$region\n";

  my %genes = ();
  my %new_features = ();
  my $response = undef;
 
  #fetch DAS features
  $response = $das->features({
			      'segment' => $chromosome.$region,
			      'type'    => $type,
			     });

  while (my ($url, $features) = each %$response) {

    if(ref $features eq "ARRAY"){
      print STDERR "Received ".scalar @$features." features.\n";

    FEATURES:
      foreach my $feature (@$features) {

	my %notes = ();

	my $grouphash = $feature->{'group'}->[0];

	#get other notes
	my $i = 0;
	my $morenote_entry = '';
	while(defined($feature->{'note'}->[$i])){
	  my $morenotes = $feature->{'note'}->[$i];
	  my ($morenotes_type, $morenotes_value) = split('=', $morenotes);
	  $morenotes_value =~ s/\&\#39\;/\'/g;
	  $notes{$morenotes_type} = $morenotes_value;
	  $i++;
	}

	#remove duplicates from overlapping regions
	if(defined $previous_genes and exists($previous_genes->{$grouphash->{'group_type'}})){
	  next FEATURES;
	}

	#you could do some filtering of the response at this point 
	my %gff_element;

	#build structure for exons and general items
	#find type
	my $element_type = $feature->{'type'} || "exon";
	$element_type    =~ m/((intron)|(UTR)|(exon))/g;
	if($1){ $element_type = $1 }

	my $group_type   = $grouphash->{'group_type'};

	my $strand       = $feature->{'orientation'};
	if($feature->{'orientation'}    =~ /^(\+|\-|\.)$/) {  }
	elsif($feature->{'orientation'} ==  1){ $strand = '+' }
	elsif($feature->{'orientation'} == -1){ $strand = '-' }
	elsif($feature->{'orientation'} ==  0){ $strand = '.' }
	else{ die "INVALID STRAND SYMBOL: ".$feature->{'orientation'}."\n"; }

	my $phase        = ".";
	if($feature->{'phase'}){
	  $phase = $feature->{'phase'};
	}
	elsif($element_type eq "exon"){
	  $phase = "0";
	}

	if(!$notes{"Transcriptstatus"}){
	  die "PROBLEM: $element_type, ".$feature->{'feature_id'}."\n";
	}

	$gff_element{'seqid'}      = $chromosome;
	$gff_element{'source'}     = $notes{"Transcripttype"};
	$gff_element{'type'}       = $element_type;
	$gff_element{'start'}      = $feature->{'start'};
	$gff_element{'end'}        = $feature->{'end'};
	$gff_element{'score'}      = ".";
	$gff_element{'strand'}     = $strand;
	$gff_element{'phase'}      = $phase;

	#check for some missing values
	if(!exists $feature->{'feature_id'}){
	  print STDERR "Missing value for Parent-feature_id\n";
	  $feature->{'feature_id'} = "0";
	}
	if(!exists $notes{"Transcriptstatus"}){
	  print STDERR "Missing value for Transcriptstatus\n";
	  $notes{"Transcriptstatus"} = "-";
	}
	if(!exists $notes{"Created"}){
	  print STDERR "Missing value for Created\n";
	  $notes{"Created"} = 0;
	}
	if(!exists $notes{"Lastmod"}){
	  print STDERR "Missing value for Lastmod\n";
	  $notes{"Lastmod"} = 0;
	}
	$gff_element{'attributes'} = "Parent=".$feature->{'feature_id'}.
	                             ";Status=".$notes{"Transcriptstatus"}.
				     ";CREATED=".$notes{"Created"}.
				     ";LASTMOD=".$notes{"Lastmod"};

	if(!exists $genes{ $group_type }){
	  $genes{ $group_type } = 1;
	  my %gff_gene;

          my $gene_region = $feature->{'target'};
          my ($gs, $gene_loc) = split('\=', $gene_region);
	  my ($gene_start, $gene_end) = split('\-', $gene_loc);

	  #build structure for gene
	  $gff_gene{'seqid'}      = $chromosome;
	  $gff_gene{'source'}     = $notes{"Genetype"};
	  $gff_gene{'type'}       = "gene";
	  $gff_gene{'start'}      = $gene_start;
	  $gff_gene{'end'}        = $gene_end;
	  $gff_gene{'score'}      = ".";
	  $gff_gene{'strand'}     = $strand;
	  $gff_gene{'phase'}      = ".";

	  #get gene description
	  my $description = "";
	  foreach my $gnote (@{$grouphash->{'note'}}){
	    my ($gnote_s, $gnote_string) = split('=', $gnote);
	    if($gnote_s eq "DESCR"){
	      $description = ";Description=".$gnote_string;
	    }
	  }
	  $gff_gene{'attributes'} = "ID=".$grouphash->{'group_type'}.
	                            $description.
				    ";Status=".$notes{"Genestatus"}.
	                            ";CREATED=".$notes{"Created"}.
				    ";LASTMOD=".$notes{"Lastmod"};

	  #print entry for gene
	  print_gff_line(\%gff_gene);
	  %gff_gene = ();

	  $new_features{$grouphash->{'group_type'}} = 1;

	}

	if(!exists $transcripts{ $feature->{'feature_id'} }){
	  $transcripts{ $feature->{'feature_id'} } = 1;
	  my %gff_transcript;

	  #build structure for transcript
	  $gff_transcript{'seqid'}      = $chromosome;
	  $gff_transcript{'source'}     = $notes{"Transcripttype"};
	  $gff_transcript{'type'}       = "transcript";
	  $gff_transcript{'start'}      = $feature->{'target_start'};
	  $gff_transcript{'end'}        = $feature->{'target_stop'};
	  $gff_transcript{'score'}      = ".";
	  $gff_transcript{'strand'}     = $strand;
	  $gff_transcript{'phase'}      = ".";
	  $gff_transcript{'attributes'} = "ID=".$feature->{'feature_id'}.";Alias1=".$feature->{'target_id'}.
	                                  ";Parent=".$grouphash->{'group_type'}.
					  ";CREATED=".$notes{"Created"}.
					  ";LASTMOD=".$notes{"Lastmod"}.
					  ";Status=".$notes{"Transcriptstatus"};

	  #print entry for transcript
	  print_gff_line(\%gff_transcript);
	  %gff_transcript = ();
	}
	#else{ print STDERR "_" } 
	#print entry for exons, etc.
	if($feature->{'type_category'} =~ /error/){
	  print STDERR "Found an error feature:\n";
	  print STDERR $gff_element{'seqid'}."\t";
	  print STDERR $gff_element{'source'}."\t";
	  print STDERR $gff_element{'type'}."\t";
	  print STDERR $gff_element{'start'}."\t";
	  print STDERR $gff_element{'end'}."\t";
	  print STDERR $gff_element{'score'}."\t";
	  print STDERR $gff_element{'strand'}."\t";
	  print STDERR $gff_element{'phase'}."\t";
	  print STDERR $gff_element{'attributes'}."\n";
	} else {
	  print_gff_line(\%gff_element);
	  %gff_element = ();
	}

	$feature = undef;
       }
       @$features = ();
       $features  = undef;
     }
   }
 
   return \%new_features;
}



#print the different data types as GFF
sub print_gff_line {
  my ($element) = @_;

  print GFF $element->{'seqid'}."\t";
  print GFF $element->{'source'}."\t";
  print GFF $element->{'type'}."\t";
  print GFF $element->{'start'}."\t";
  print GFF $element->{'end'}."\t";
  print GFF $element->{'score'}."\t";
  print GFF $element->{'strand'}."\t";
  print GFF $element->{'phase'}."\t";
  print GFF $element->{'attributes'}."\n";
}
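
The chunked walk used by the script above can be summarised with a short Python sketch. It is a re-expression of the idea and not a line-by-line translation of the Perl: here every chunk covers at most max_len bases and chunks never overlap, and the function names iter_chunks and segment_strings are hypothetical. The segment strings use the same id:start,end form that the script passes to the features call.

```python
from typing import Iterator, List, Tuple


def iter_chunks(start: int, end: int, max_len: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (chunk_start, chunk_end) pairs that cover start..end."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if start > end:
        raise ValueError("coordinates wrong: start > end")
    chunk_start = start
    while chunk_start <= end:
        # Never run past the requested end of the region.
        chunk_end = min(chunk_start + max_len - 1, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + 1


def segment_strings(seq_id: str, start: int, end: int,
                    max_len: int = 20_000_000) -> List[str]:
    """Build one DAS-style segment string (id:start,end) per chunk."""
    return ["%s:%d,%d" % (seq_id, s, e) for s, e in iter_chunks(start, end, max_len)]


if __name__ == "__main__":
    for segment in segment_strings("1", 1, 45_000_000):
        print(segment)
```

A feature that spans a chunk boundary can be returned by two consecutive requests, which is why the Perl script keeps track of the identifiers it has already seen.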

Acknowledgments

(some of this document may have been cut and pasted from documentation contributed by the following people):

  • Andreas Prlic
  • Andy Jenkinson
  • Phil Jones
  • Tim Hubbard
  • Lincoln Stein
  • Thomas Down