Retrieving DAS2 sequence and annotation feature records

This document describes the DAS2 retrieval protocol for annotation servers and genomic sequence reference servers. There are four main sets of documents. The sources documents describe the available data sources, the segments documents describe the reference genomic sequences, the features documents contain the features on the sequences, and the types documents characterize the different feature types. The formal schema for the XML responses is a RelaxNG schema in .rnc format.

Genome DAS/2 uses half-open intervals to specify regions along a nucleotide sequence. See the Segment ranges section for more information.

Every document is identified with a URI, which should be an HTTP URL. A document is retrieved by doing a GET request on its URL. The following table summarizes the various types of GET requests used within the genome domain.

Request	Documentation	Information Retrieved	Default Response Type
sources	Overview | Detailed	Information about all data sources or a specific data source	application/x-das-sources+xml
segments	Overview | Detailed	Description of the regions on which features are located	application/x-das-segments+xml
types	Overview | Detailed	Description of all feature types or a single type	application/x-das-types+xml
features	Overview | Detailed	One or more genomic features, including sequence alignments	application/x-das-features+xml

1 GENERAL

This section contains information pertaining to the DAS/2 specification as a whole.

1.1 URIs, URLs and HTTP

This specification makes extensive use of URIs, and more specifically HTTP URLs. While other URLs and URIs are possible, the exchange protocol uses concepts like request action and headers, response code and headers, and query string construction, which only make sense in the context of HTTP and related protocols like HTTPS.

1.2 Content-Type header

A server should include the correct MIME type in the Content-Type header of the response. If it does not, it must respond with "application/xml" and must not respond with "text/xml". Character encoding is determined as per RFC 3023. We recommend that server implementers either not include the charset parameter in the Content-Type header or ensure that it is identical to the encoding in the document's XML declaration.
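As a rough client-side sketch of this rule, the following Python function classifies a Content-Type header value using the four default response types listed in the request table. The function name and return values are hypothetical, and the sketch covers only the base XML formats, not alternative formats a server may offer.

```python
# Default response types from the request table, mapped to document kinds.
DAS2_TYPES = {
    "application/x-das-sources+xml": "sources",
    "application/x-das-segments+xml": "segments",
    "application/x-das-types+xml": "types",
    "application/x-das-features+xml": "features",
}


def classify_content_type(header_value):
    """Classify the Content-Type header value of a DAS/2 XML response.

    Returns the document kind for a DAS/2 MIME type, "xml" for the
    generic application/xml fallback, and raises ValueError for
    anything else (including text/xml).
    """
    # Drop parameters such as charset and normalize case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in DAS2_TYPES:
        return DAS2_TYPES[media_type]
    if media_type == "application/xml":
        return "xml"
    raise ValueError(f"unexpected Content-Type: {media_type}")
```

A client that receives "xml" would still need to inspect the root element to learn which kind of document it has.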

For use during specification development a server may include a "version" value so clients can determine which version of the spec is implemented by the server. Unless others can convince me otherwise this will be removed in the final specification.

1.3 ISO dates

Several elements have 'created' and 'modified' attributes. These dates are formatted in a subset of ISO 8601. Data providers must write the date using one of the following forms

1.4 Global sequence identifiers

1.5 Segment ranges

Segment locations are used in three places in DAS: feature locations, range-based feature filters, and sequence retrieval.

Every location is on a segment, named using its URI. This is either the primary URI given in the segments document or the reference URI either specified by the coordinate system or by the "reference" attribute in the segments document.

Segment ranges are given by start and end positions specified using half-open, zero-based intervals (a.k.a. interbase coordinates). The interval is half-open meaning that the interval from 'start' to 'end' includes the residues from position 'start' up to but not including position 'end'. The first residue in the sequence is at position zero and is specified as the range (0,1). The length of an interval is always equal to end minus start (length=end-start).

This scheme is sometimes called "interbase coordinates" because the numbering system labels the positions in between the bases (residues) rather than the bases themselves. For example, the range (3,6) includes the fourth, fifth, and sixth residues, but not the third or the seventh.

The end coordinate of a range is never less than the start position. The range (5,6) covers the residue at position 5 while (5,5) has a size of zero and refers to the point between positions 4 and 5. Cleavage site annotations may use zero-size annotations like the latter.
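The interval arithmetic above maps directly onto slicing in many programming languages. The following Python sketch is for illustration only; the helper names and the sequence string are made up.

```python
def range_length(start, end):
    """Length of the half-open interval (start, end)."""
    if end < start:
        raise ValueError("end must not be less than start")
    return end - start


def residues(sequence, start, end):
    """Residues covered by the half-open, zero-based range (start, end)."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("range lies outside the sequence")
    return sequence[start:end]


seq = "ACGTTGCA"  # made-up sequence for illustration
print(residues(seq, 0, 1))  # the first residue
print(residues(seq, 3, 6))  # the fourth, fifth and sixth residues
print(range_length(5, 5))   # a zero-length range between positions 4 and 5
```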

Features may optionally be located on a strand. "1" denotes the positive strand, "-1" denotes the negative strand, "0" denotes both strands. If the strand is not specified then the strand is unknown or meaningless for the given feature or sequence type.

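Feature ranges are specified using a compact notation. Given the start and end positions, the range is written "start:end", optionally followed by ":strand", as in the range="1200:2917:1" attribute of the LOC elements in the features examples.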
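A parser for this notation might look like the following Python sketch. It assumes the colon-separated form seen in LOC attributes such as range="1200:2917:1" elsewhere in this specification, with the strand field optional; the FeatureRange type and parse_range function are hypothetical names.

```python
from typing import NamedTuple, Optional


class FeatureRange(NamedTuple):
    start: int
    end: int
    strand: Optional[int]  # 1, -1, 0, or None when not specified


def parse_range(text):
    """Parse 'start:end' or 'start:end:strand' (assumed notation)."""
    parts = text.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"unrecognized range: {text!r}")
    start, end = int(parts[0]), int(parts[1])
    # Interbase coordinates: zero-based, and end is never less than start.
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {text!r}")
    strand = None
    if len(parts) == 3:
        strand = int(parts[2])
        if strand not in (1, -1, 0):
            raise ValueError(f"invalid strand: {text!r}")
    return FeatureRange(start, end, strand)
```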

Here are a few examples of what they look like in feature filters. Note that feature filters do not support strand information in the query range. The query parameters in both cases were encoded for use as a URL query parameter.

Sequence retrieval queries work directly on the segment URI. To get the sequence for a subrange, pass the range using the "range" key of the query, as in the following:

1.6 The link element

The LINK element connects a document or a feature record to some other resource identified by a URL. The LINK element is modeled on the "link" element in HTML 4.0 and has many of the same attributes, with identical meanings.

The optional 'title' attribute is an advisory title providing a hint about what the element links to. The optional 'href' attribute is the URL being linked to. The optional 'type' attribute contains an advisory MIME type describing the expected document content-type. For example, a feature may link to an image in both GIF and PNG formats, letting the client choose a preferred format.

The 'rel' and 'rev' attributes contain space separated terms which characterize the type of the forward and reverse links, respectively. Likely most links will be forward links and so use the 'rel' attribute. This specification does not reserve any link terms. We expect de facto types to evolve through common use.

1.7 Formats and extensibility

This specification defines a base set of XML document formats. A data provider may want to return an alternate format in addition to the base formats. For example, if the XML overhead is too high a data provider may want to develop a more compact binary feature format, or if the feature data is too rich for the simple DAS property table then it would be better represented in an alternate format.

Once implemented the alternative format support is announced through the FORMAT elements of the CAPABILITY element.

For example, the following says that the server implements three formats. The format name "das2xml" is reserved for the formats defined by this specification and must be supported by the server even if not listed. The other two format names are for example only:

To request an alternate format the client must add a "format=<name>" field to the query string of the URL. For example, to request all of the features from the above example but in "das3xml" the client makes a request for:
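A client might add the "format" field with a small helper like the Python sketch below. The function name with_format is hypothetical; the sketch replaces any existing "format" field and re-encodes the other query fields in form-urlencoded style.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(url, format_name):
    """Return url with format=<format_name> set in its query string."""
    parts = urlsplit(url)
    # Keep every existing field except a previous "format" value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "format"]
    query.append(("format", format_name))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_format("http://www.example.org/volvox/1/features", "das2xml"))
```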

Servers may extend the features filter language to add new capabilities as long as those terms do not affect queries without those fields. A server may list support for a query extension using the SUPPORTS tag. In the following the server says it supports the "curation-search" as well as the das2xml and compact-binary formats:

The client implementer must use some other means to discover what additional filters are available for a "curation-search".

A server may support additional capabilities not defined by this specification and list support for it through a new CAPABILITY item. For example, in the following the hypothetical server implements an alternative query language based on XQuery.

The contents of the non-DAS2 CAPABILITY elements are determined by the server implementer and a client implementer must look elsewhere to discover what they mean.

2 OVERVIEWS

This section provides an overview of the DAS/2 documents corresponding to each DAS/2 request. Each document is described in more detail in the Details section of this specification.

2.1 The sources document (overview)

A DAS server supplies information about genomic sequence data sources. The collection of all sources, each data source, and each version of a data source are accessible through a URL. All three classes of URLs return a document of content-type 'application/x-das-sources+xml' though likely with differing amounts of detail. A 'versioned source' request returns information only about a specific version of a data source. A 'source' request returns the list of all the versioned source data for that source. A 'sources' request returns the list of all the source data, including all the versioned source data.

The URLs might not be distinct. For example, a server with only one version of one data source may use the same URL for all three documents, and a server for a single organism may use the same URL for the 'sources' and 'source' documents.

Most servers will list only the data sources provided by that server. Some servers combine the sources documents from other servers into a single document. These registry servers act as a centralized index and reduce configuration and network overhead. A registry server uses the same sources format as an annotation server.

Here is an example of a simple sources document which makes no distinction between the three sources categories. It, like the other new DAS formats, is in XML. All of the DAS elements are in the XML namespace http://biodas.org/documents/das2. This namespace is reserved and authors of DAS extensions may not create new XML elements in it.

Request:

http://www.example.com/das/genome/yeast.xml

Response:

Content-Type: application/x-das-sources+xml

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://biodas.org/documents/das2"
          xml:base="http://www.example.com/das/genome/">

  <SOURCE uri="yeast.xml" title="Saccharomyces cerevisiae (Baker's yeast) genome"
         doc_href="http://www.example.com/yeast.html">
    <VERSION uri="yeast.xml" created="2005-12-05">
      <COORDINATES uri="http://sanger.ac.uk/das-registry/yeast-32-gene"
             taxid="4932" source="Gene_ID" authority="SGD32" version="3"/>
      <CAPABILITY type="features" query_uri="features.xml" />
      <CAPABILITY type="types" query_uri="types.xml"/>
    </VERSION>
  </SOURCE>

</SOURCES>

All identifiers and href attributes in DAS documents follow the XML Base specification when resolving partial identifiers and href attributes. In this case the relative id "yeast.xml" is fully resolved using the xml:base of "http://www.example.com/das/genome/" to "http://www.example.com/das/genome/yeast.xml". If the result after resolving through all the parent xml:base attributes is still a relative URL then it is resolved once more with respect to the URL used to fetch the document.
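The resolution order described here can be sketched in Python with the standard urljoin function. The resolve helper and its parameters are hypothetical: base_chain holds the xml:base values of the enclosing elements, outermost first, and document_url is the URL used to fetch the document.

```python
from urllib.parse import urljoin


def resolve(reference, base_chain, document_url):
    """Resolve a possibly relative identifier or href.

    Each xml:base is itself resolved against the base in effect outside
    it, starting from the URL used to fetch the document, so a relative
    result is always resolved once more against that URL.
    """
    base = document_url
    for xml_base in base_chain:
        base = urljoin(base, xml_base)
    return urljoin(base, reference)


print(resolve("yeast.xml",
              ["http://www.example.com/das/genome/"],
              "http://www.example.com/das/genome/yeast.xml"))
```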

Here is an example of a more complicated sources document with multiple organisms, each with multiple versions. Each of the two source documents (one for each organism) has a distinct URL, as does each of the versions for each organism. This is a pure registry server because the actual annotation data comes from other machines.

Request:

http://www.biodas.org/known_das_servers

Response:

Content-Type: application/x-das-sources+xml

<SOURCES xmlns="http://biodas.org/documents/das2">
  <SOURCE uri="http://das.ensembl.org/das/SPICEDS/" title="das_vega_trans">
    <VERSION uri="http://das.ensembl.org/das/SPICEDS/127/" created="2005-05-23">
      <MAINTAINER email="someone@sanger.ac.uk" />
      <COORDINATES uri="http://sanger.ac.uk/das-registry/zv4-25-chr" taxid="7955"
               source="Chromosome" authority="ZV4" version="25"
               test_range="name=BX255914" />
      <CAPABILITY type="segments"
             query_uri="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62" />
      <CAPABILITY type="features"
           query_uri="http://das.ensembl.org/das/SPICEDS/127/features">
        <SUPPORTS name="das2queries" />
      </CAPABILITY>
      <CAPABILITY type="types"
           query_uri="http://das.ensembl.org/das/SPICEDS/127/types" />
    </VERSION>

    <VERSION uri="http://das.ensembl.org/das/SPICEDS/128/" created="2005-08-13">
      <MAINTAINER email="someone-else@sanger.ac.uk" />
      <COORDINATES uri="http://sanger.ac.uk/das-registry/zv4-26-chr" taxid="7955"
               source="Chromosome" authority="ZV4" version="26"
               test_range="name=BX255914" />
      <CAPABILITY type="segments"
             query_uri="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62" />
      <CAPABILITY type="features"
           query_uri="http://das.ensembl.org/das/SPICEDS/128/features">
        <SUPPORTS name="das2queries" />
      </CAPABILITY>
      <CAPABILITY type="types"
           query_uri="http://das.ensembl.org/das/SPICEDS/128/types" />
      <CAPABILITY type="locks" query_uri="http://das.ensembl.org/das/SPICEDS/128/locks" />
      <CAPABILITY type="writeback"
           query_uri="http://das.ensembl.org/das/SPICEDS/128/locks" />
    </VERSION>
  </SOURCE>

  <SOURCE uri="http://www.example.com/das2/mus/sources.xml" title="Mus musculus">
    <VERSION uri="http://www.example.com/das2/mus/42/sources.xml" created="2006-02-11">
      <MAINTAINER email="pied-piper@hamlet.ac.uk" />
      <COORDINATES uri="http://sanger.ac.uk/das-registry/yeast-12-clone" taxid="10090"
                source="Clone" authority="Ensembl" test_range="name=AL935121" />
      <CAPABILITY type="features"
           query_uri="http://www.example.com/cgi-bin/features-mus-v42.cgi">
        <SUPPORTS name="das2queries" />
      </CAPABILITY>
      <CAPABILITY type="types"
           query_uri="http://www.example.com/das2/mus/v42/types.xml" />
    </VERSION>
  </SOURCE>
</SOURCES>

Each SOURCE uri and VERSION uri is fetchable. Fetching the URL "http://das.ensembl.org/das/SPICEDS/" returns a sources document with the SOURCE record for "das_vega_trans" and both of its VERSION subelements, and fetching "http://das.ensembl.org/das/SPICEDS/128/" returns a sources document with only the second of its VERSION subelements.
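A client reading a sources document will usually want the query URLs of each versioned source. The following Python sketch collects them with the standard ElementTree parser. The function name is hypothetical, and for brevity it honors xml:base only on the root element and keeps one query_uri per capability type.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

DAS2 = "{http://biodas.org/documents/das2}"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def capabilities(sources_xml, document_url):
    """Map each VERSION uri to {capability type: absolute query_uri}."""
    root = ET.fromstring(sources_xml)
    # An absent xml:base leaves the document URL as the base.
    base = urljoin(document_url, root.get(XML_BASE, ""))
    result = {}
    for version in root.iter(DAS2 + "VERSION"):
        caps = {}
        for cap in version.findall(DAS2 + "CAPABILITY"):
            caps[cap.get("type")] = urljoin(base, cap.get("query_uri"))
        result[urljoin(base, version.get("uri"))] = caps
    return result
```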

DAS documents refer to other documents through URLs. There are no restrictions on the internal form of the URLs, other than the query string portion. Server implementers are free to choose URLs which best fit their architectural needs. For example, a simple DAS server may be implemented as a set of XML files hosted by a standard web server while more complex servers with search support may be implemented as CGI scripts or through embedded web server extensions. The URLs do not need to define a hierarchical structure nor even be on the same machine. Compare this to the DAS1 specification where some URLs were constructed by direct string modification of other URLs.

2.2 The segments document (overview)

Features are located on segments. A segment is the largest chunk of contiguous sequence. For fully sequenced organisms a segment may be a chromosome. For partially assembled genomes where the distance between the assembled regions is not known, each assembled region may be its own segment. If a server provides annotations in contig space then each contig is a segment. A specific set of segments is also called a coordinate system.

There are two ways for a versioned source record to describe which coordinate system it uses. The first is through a CAPABILITY element of type "segments". (The CAPABILITY elements list the different DAS interfaces and extensions supported by a server.) Fetching the corresponding query_uri returns a document of content-type 'application/x-das-segments+xml' listing information about each segment.

The second is through a COORDINATES element that uniquely characterizes the coordinate system but requires consulting other sources for details on the segments.

Request:

http://www.biodas.org/das2/h.sapiens/v3/segments.xml

Response:

Content-Type: application/x-das-segments+xml

<?xml version="1.0" encoding="UTF-8"?>
<das:SEGMENTS xmlns:das="http://biodas.org/documents/das2">
 <das:SEGMENT uri="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr1.xml"
     title="Chr1" length="245522847"
     doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=1"/>
 <das:SEGMENT uri="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr2.xml"
     title="Chr2" length="243018229"
     doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=2"/>
</das:SEGMENTS>

Note that, unlike the previous examples, this document defines the namespace abbreviation "das" instead of defining a default namespace.
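A namespace-aware parser treats both spellings identically, so client code should match on the namespace URI rather than on the prefix. A minimal Python sketch using the standard ElementTree parser follows; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree writes qualified names as {namespace-uri}local-name.
DAS2 = "{http://biodas.org/documents/das2}"


def segment_lengths(segments_xml):
    """Map each SEGMENT title to its length, whatever prefix is used."""
    root = ET.fromstring(segments_xml)
    return {segment.get("title"): int(segment.get("length"))
            for segment in root.iter(DAS2 + "SEGMENT")}
```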

2.3 The types document (overview)

Every DAS feature is associated with a particular feature TYPE. DAS types do not describe a formal type system in that DAS types do not derive from other DAS types. The DAS type record exists to group together all features of the same "type" (as defined by the data provider), link that type to an external ontology term, and describe how to depict the features.

A DAS annotation server's sources document must contain a CAPABILITY element of type "types". Fetching the corresponding query_uri returns a document of content-type "application/x-das-types+xml" listing all of the types available on the server. There are no query filter parameters for retrieving the types document.

The following is an example of a DAS annotation server for human, providing features based on Genscan transcript predictions. Transcript prediction features for Genscan are specified as type "http://www.example.org/das2/human/build36/types/genscan_transcript". Genscan predictions also include the exons of the transcript, and these are specified as type "http://www.example.org/das2/human/build36/types/genscan_exon".

Request:

http://www.example.org/das2/human/build36/types

Response:

Content-Type: application/x-das-types+xml

<TYPES xmlns="http://biodas.org/documents/das2"
       xml:base="http://www.example.org/das2/human/build36/types/">
  <TYPE uri="genscan_transcript" 
      title="Genscan transcript predictions"
      doc_href="http://www.example.org/docs/genscan_transcript.html"
      method="GENSCAN 1.0"
      ontology="http://das.biopackages.net/das/ontology/obo/1/ontology/SO/0000673"
      so_accession="SO:0000673">
  </TYPE>
  <TYPE uri="genscan_exon" 
      title="Genscan exon predictions"
      doc_href="http://www.example.org/docs/genscan_exon.html"
      method="GENSCAN 1.0" 
      ontology="http://das.biopackages.net/das/ontology/obo/1/ontology/SO/0000147"
      so_accession="SO:0000147">
  </TYPE>
</TYPES>

2.4 The features document (overview)

The versioned source record for an annotation server must include a CAPABILITY element of type "features". Clients use the corresponding query_uri to get feature information from the server. By default fetching this URL returns a list of all the features for the versioned source as a document of content-type "application/x-das-features+xml". A client may specify a set of filters to retrieve a subset of the features and may ask for the response to be in an alternative format. Servers may respond with an error if there are too many matching features to return.

Here is an example features document for a server which contains a gene and an alignment.

Request:

http://das.biopackages.net/das/genome/yeast/S228C/features.pl

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://biodas.org/documents/das2"
          xml:base="http://www.example.org/volvox/1/">
 <FEATURE uri="feature/cTel54X" type="type/gene" title="tg-3">
   <LOC segment="segment/Chr2" range="1200:2917:1" />
 </FEATURE>

 <FEATURE uri="feature/hit12"
          type="type/est-alignment"
          created="2001-12-15T22:43:36"
          modified="2004-09-26T21:10:15" >

   <LOC segment="segment/Chr3" range="1201:1400:1" />
   <PART uri="feature/hit12.hsp1" />
   <PART uri="feature/hit12.hsp2" />
   <PROP key="est2genomescore" value="180" />
 </FEATURE>

 <FEATURE uri="feature/hit12.hsp1"
          type="type/est-alignment-hsp">
   <LOC segment="segment/Chr3" range="1201:1250:-1" />
   <PARENT uri="feature/hit12"/>
   <PROP  key="est2genomescore" value="180" />
 </FEATURE>

 <FEATURE uri="feature/hit12.hsp2"
          type="type/est-alignment-hsp" >
   <LOC segment="segment/Chr3" range="1351:1400:1" />
   <PARENT uri="feature/hit12" />
   <PROP  key="est2genomescore" value="120" />
 </FEATURE>

</FEATURES>

Each feature has a unique identifier and an identifier linking it to a type record. Both identifiers are URLs and should be directly fetchable. Simple features can be located on a region of a segment. More complex features like a gapped alignment are represented through a parent/part relationship. A feature may have multiple parents and multiple parts.
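The parent/part links can be read into simple lookup tables, as in the Python sketch below. The function names are hypothetical and the uris are kept as written in the document, without xml:base resolution.

```python
import xml.etree.ElementTree as ET

DAS2 = "{http://biodas.org/documents/das2}"


def feature_links(features_xml):
    """Return (parts, parents): dicts from FEATURE uri to linked uris."""
    root = ET.fromstring(features_xml)
    parts, parents = {}, {}
    for feature in root.iter(DAS2 + "FEATURE"):
        uri = feature.get("uri")
        parts[uri] = [e.get("uri") for e in feature.findall(DAS2 + "PART")]
        parents[uri] = [e.get("uri") for e in feature.findall(DAS2 + "PARENT")]
    return parts, parents


def top_level(features_xml):
    """Uris of the features that list no PARENT element."""
    _, parents = feature_links(features_xml)
    return [uri for uri, linked in parents.items() if not linked]
```

Because a feature may have several parents, the links form a graph rather than a strict tree, which is why both directions are kept as lists.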

2.4.1 Feature filters (overview)

An annotation server may contain many features while the client may be interested only in a subset, most likely the features in a given portion of the reference sequence. To help minimize the bandwidth overhead the feature query URL should support the DAS feature filter language. The syntax uses the standard HTML form-urlencoded query syntax. Here is a request for all features on Chr2, assuming the full URL for Chr2 is http://www.example.org/volvox/1/segment/Chr2:

Request:

http://www.example.org/volvox/1/features.cgi?segment=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Fsegment%2FChr2

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://biodas.org/documents/das2"
          xml:base="http://www.example.org/volvox/1/">
 <FEATURE uri="feature/cTel54X" type="type/gene" title="tg-3">
   <LOC segment="segment/Chr2" range="1200:2917:1" />
 </FEATURE>

</FEATURES>

Here is the filter for all EST alignments, assuming those all have the feature type 'http://www.example.org/volvox/1/type/est-alignment':

Request:

http://www.example.org/volvox/1/features.cgi?type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment

Response:

Content-Type: application/x-das-features+xml

<FEATURES xmlns="http://biodas.org/documents/das2"
          xml:base="http://www.example.org/volvox/1/">
 <FEATURE uri="feature/hit12"
          type="type/est-alignment"
          created="2001-12-15T22:43:36"
          modified="2004-09-26T21:10:15" >

   <LOC segment="segment/Chr3" range="1201:1400:1" />
   <PART uri="feature/hit12.hsp1" />
   <PART uri="feature/hit12.hsp2" />
   <PROP key="est2genomescore" value="180" />
 </FEATURE>
</FEATURES>
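Filter URLs like the ones above can be produced with any form-urlencoding routine. The following Python sketch uses the standard urlencode function; the filter_url helper is a hypothetical name.

```python
from urllib.parse import urlencode


def filter_url(query_uri, **filters):
    """Append form-urlencoded feature filters to a features query_uri."""
    if not filters:
        return query_uri
    separator = "&" if "?" in query_uri else "?"
    return query_uri + separator + urlencode(filters)


# Reproduces the EST alignment request shown above.
print(filter_url("http://www.example.org/volvox/1/features.cgi",
                 type="http://www.example.org/volvox/1/type/est-alignment"))
```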

3 DETAILS

This section contains additional detailed information about the DAS/2 documents corresponding to each DAS/2 request. These documents are also described in the Overviews section of this specification.

3.1 The sources document (detailed)

A sources request is a request for information about the data sets available from a DAS server. This may be a list of all data sources, a list of all versions of a given data source, or information about a specific version. All three are done by fetching a sources document given a URL. The returned format is identical for all three cases, except that some portions will require one element instead of a list of zero or more elements.

The sources request does not use query parameters. A future version of DAS/2 may add optional query parameters to the sources request, but for now servers should respond with an HTTP error code 400 "Bad Request" if any query parameters are included.

Response:

Content-Type: application/x-das-sources+xml

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES
    xmlns="http://biodas.org/documents/das2"
    xml:base="http://dev.wormbase.org/das/genome/">

  <MAINTAINER
    name="Yoyodyne DNA Systems"
    email="yoyodyna@example.com"
    href="http://www.example.com/" />

  <SOURCE uri="volvox" title="Volvox Database"
      doc_href="http://www.example.org/volvox_db.pdf">

    <VERSION uri="volvox/build_1" title="Build 1, October 2002"
           writeable="no" created="2002-10-15" modified="2002-10-25T09:56:23">

      <MAINTAINER
        name="Volvox helpdesk"
        email="volvox-help@example.com" />
      <COORDINATES uri="http://ncbi.nlm.nih.gov/das-genomes/human-35"
                   taxid="3066" source="chromosome" authority="NCBI" version="35" />

      <COORDINATES uri="http://embl.ebi.ac.uk/genome/volvox-clone"
                   taxid="2034" source="clone" authority="EMBL" />

      <CAPABILITY type="segments" query_uri="volvox/1/segments">
          <FORMAT name="fasta"/>
          <FORMAT name="raw"/>
      </CAPABILITY>
      <CAPABILITY type="types" query_uri="volvox/1/types">
          <FORMAT name="das2xml" />
      </CAPABILITY>
      <CAPABILITY type="features" query_uri="volvox/1/features">
          <FORMAT name="das2xml" />
      </CAPABILITY>
      <CAPABILITY type="locks" query_uri="volvox/1/locks" />

    </VERSION>
  </SOURCE>
</SOURCES>

The root element is named SOURCES. The MAINTAINER element is optional. A server should provide at least one of the 'name', 'email' or 'href' attributes. The 'name' is short human-readable text, the 'email' is an email address and the 'href' is a URL meant for a human using a web browser. The MAINTAINER element under the SOURCES element designates the maintainer for the server, which may be different than the maintainer for the data sources.

The SOURCES element has zero or more LINK elements linking the sources document to some other document.

The SOURCES element has zero or more SOURCE elements. The 'uri' attribute is a URI and must be a unique identifier within a SOURCES document. It should be fetchable and if fetchable it must respond with a sources document describing the given data source. The 'title' is a short label describing the source to people. The optional 'description' contains up to a paragraph of text (no HTML markup) with more details about the data source. The optional 'doc_href' is a URL for a web browser to display human-readable documentation.

A SOURCE element may have an optional MAINTAINER element. The syntax is the same as the one at the SOURCES level. If present it contains the contact information for the maintainer of the given data source. If not present clients may use the SOURCES MAINTAINER as the SOURCE MAINTAINER.

A data source may have multiple VERSION elements. The definition of what constitutes a new version is left to the data provider. VERSION elements should be listed in creation time order, with the oldest versions first. The 'uri' attribute is a URI and must be a unique identifier within the data source. It should be fetchable and if fetchable it must respond with a sources document describing only the specific versioned data source. The 'title' is a short label describing the version to people. The optional 'description' contains up to a paragraph of text (no HTML markup) with more details about the version. The optional 'doc_href' is a URL for a web browser to display human-readable documentation.

The optional 'created' and 'modified' attributes are ISO timestamps specifying when the version was first made available and most recently modified.

Each VERSION element may contain an optional MAINTAINER element, which has the same syntax and meaning as the MAINTAINER element at the SOURCES and SOURCE level. It contains the contact information for the maintainer of the specific version of the data source, which may be different than the maintainer for the data source or for the server. If the VERSION MAINTAINER is not present, clients may use the SOURCE MAINTAINER instead for contact information and if that does not exist clients may use the SOURCES MAINTAINER.
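The fallback order for contact information can be sketched as follows in Python. The function name is hypothetical, and the VERSION uri is compared as written in the document, without xml:base resolution.

```python
import xml.etree.ElementTree as ET

DAS2 = "{http://biodas.org/documents/das2}"


def maintainer_for(root, version_uri):
    """Return the attributes of the MAINTAINER that applies to a VERSION.

    Looks at the VERSION first, then its SOURCE, then the SOURCES root.
    Returns None when no MAINTAINER exists at any level.
    """
    for source in root.findall(DAS2 + "SOURCE"):
        for version in source.findall(DAS2 + "VERSION"):
            if version.get("uri") != version_uri:
                continue
            for element in (version, source, root):
                # find() with a plain tag only searches direct children.
                maintainer = element.find(DAS2 + "MAINTAINER")
                if maintainer is not None:
                    return dict(maintainer.attrib)
            return None
    raise KeyError(version_uri)
```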

The COORDINATES element characterizes a coordinate system used by the versioned source. Each COORDINATES has a 'uri' attribute which exactly identifies the coordinate system. If two annotation servers have a COORDINATES with the same uri then they annotate the same system. Coordinate system URIs for human and many model organism genomes are currently listed at http://www.biodas.org/wiki/GlobalSeqIDs. This list will soon be incorporated into the Sanger DAS/2 registry.

To help people select an appropriate coordinate system to use, the COORDINATES element contains a set of attributes with additional details. The 'authority' attribute is the name of the organization that determined the coordinate system. It is a name like 'NCBI', 'EMBL', 'Ensembl', 'HUGO_ID', 'IPI' or 'UniProt'. The 'source' attribute refers to the physical dimension of the coordinate system. It is a name like 'Chromosome', 'Clone', 'Contig', 'Gene_ID', 'NT_Contig', 'Protein Sequence', 'Protein Structure', or 'Scaffold'. The Sanger registry contains a full list of these restricted vocabulary terms.

The optional 'version' contains the version name of the build upon which the coordinate system is based. The optional 'created' attribute is the ISO timestamp for the coordinate system and is used when sorting by time, as when identifying the most recent version. The optional 'taxid' attribute is the NCBI taxonomy id of the organism, as a number.

The COORDINATES element may contain an optional 'test_range' attribute used to test that the server is operational. Experience with DAS1 found that the web interface code often did not catch errors at the database interface layer and would return empty results instead of correctly reporting errors. The test_range attribute contains a feature filter query string which can be passed to the "features" CAPABILITY query_uri. The response to that feature filter request must contain at least one feature and should contain no more than a modest number.
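A monitoring client could apply this check roughly as in the Python sketch below. Both function names are hypothetical, fetching the URL is left out, and the sketch tests only the "at least one feature" requirement because the specification does not quantify "modest".

```python
import xml.etree.ElementTree as ET

DAS2 = "{http://biodas.org/documents/das2}"


def build_test_url(features_query_uri, test_range):
    """Attach a test_range query string to a features query_uri."""
    separator = "&" if "?" in features_query_uri else "?"
    return features_query_uri + separator + test_range


def passes_test_range(features_xml):
    """True if the features response contains at least one FEATURE."""
    root = ET.fromstring(features_xml)
    return len(root.findall(DAS2 + "FEATURE")) >= 1


print(build_test_url("http://das.ensembl.org/das/SPICEDS/127/features",
                     "name=BX255914"))
```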

A versioned source may have more than one COORDINATES element. For example, features may be given in both chromosome and contig space, or in human and mouse coordinates.

The CAPABILITY elements describe what sort of queries a client may do with the versioned source data. The query is done through the URL listed in the 'query_uri' field. Different query URLs support different query interfaces. The specific interface is listed in the 'type' field. The specification defines the following query URL types:

'type' value	for information about
segments	description of the regions on which features are located
types	the feature types
features	the features
locks	the locks
writeback	writeback support

A CAPABILITY has zero or more FORMAT elements, each with a 'name' attribute. These list the supported formats for the given capability. To get the document in a given format, use the format's name in the "format" parameter of the query.

A CAPABILITY has zero or more SUPPORTS elements, each with a 'name' attribute. These list the available extensions supported by the given capability.

The next section describes in more detail how the FORMAT and SUPPORTS elements are used.

3.2 The segments document (detailed)

A versioned source entry contains two ways to get information about the top-level segments used as a coordinate system. One is through a global registry of COORDINATES URIs. This is an abstract identifier scheme. The other is through a concrete link to a segments document, which lists information about each segment and how to get the sequence information.

The segments request is done through the query_uri of the "segments" CAPABILITY listed in the versioned source entry, as in the following:

The versioned source may contain multiple segments CAPABILITIES. This occurs when there are multiple top-level coordinate systems for the annotation server, for example, with features annotated in both contig and chromosome coordinates.

When that occurs each segments CAPABILITY must have a 'coordinates' attribute containing a URI linking it to a COORDINATES element, which must also exist in the versioned source. Note that the coordinates URIs should be, but are not required to be, in the global registry. You may make up a URI if none exists. For example:

Request:

http://www.biodas.org/sequence/gallus_gallus/March2004.xml

Response:

Content-Type: application/x-das-segments+xml

<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/release4/">
 <FORMAT name="fasta" />
 <FORMAT name="agp" />
 <SEGMENT uri="2L" title="Chromosome 2L" length="186349"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/2L" />
 <SEGMENT uri="2R" title="Chromosome 2R" length="464030"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/2R" />
 <SEGMENT uri="3L" title="Chromosome 3L" length="419684"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/3L" />
 <SEGMENT uri="3R" title="Chromosome 3R" length="1428"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/3R" />
 <SEGMENT uri="4" title="Chromosome 4" length="43776"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4" />
 <SEGMENT uri="X" title="Chromosome X" length="311673"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/X" />
</SEGMENTS>

Note that in the example the SEGMENT uri attributes are relative URIs and are resolved using the xml:base defined in the SEGMENTS element.
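
A client could resolve the relative SEGMENT URIs along the following lines. This is a minimal sketch, assuming the xml:base attribute appears only on the root SEGMENTS element (nested xml:base attributes are not handled); the document text is a shortened copy of the example response above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

DAS2_NS = "{http://biodas.org/documents/das2}"
# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

DOC = """\
<SEGMENTS xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/release4/">
 <SEGMENT uri="2L" title="Chromosome 2L" length="186349"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/2L" />
 <SEGMENT uri="4" title="Chromosome 4" length="43776"
   reference="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4" />
</SEGMENTS>
"""


def absolute_segment_uris(xml_text):
    """Map each segment title to its absolute URI."""
    root = ET.fromstring(xml_text)
    base = root.get(XML_BASE, "")
    return {
        seg.get("title"): urljoin(base, seg.get("uri"))
        for seg in root.findall(DAS2_NS + "SEGMENT")
    }


segment_uris = absolute_segment_uris(DOC)
print(segment_uris["Chromosome 2L"])
```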

The SEGMENTS element has zero or more FORMAT elements. A FORMAT element has the single attribute named 'name' describing the supported format. This specification defines the following format names:

Format name	Meaning
das2xml	a segments response of type application/x-das-segments+xml
fasta	sequence data in FASTA format
raw	sequence data with only residue names and newlines
agp	Assembly data in AGP format
count	the total number of segments, as a decimal string. Used to get the segment count before potentially requesting tens of thousands of segments.
formats	the standard das2xml format but without SEGMENT elements. Used to get the format listing even when there are a large number of segments.

All versioned sources that have a "segments" capability must support the "das2xml" format. The "das2xml" FORMAT entry does not need to be specified in a FORMAT element. For details of use, see the next section.

The SEGMENTS element has zero or more SEGMENT elements. Each SEGMENT element has several attributes. The 'uri' attribute is a URI and must be unique for each SEGMENT. The 'title' attribute is a short title of at most a few words, meant for people. The 'length' attribute is an integer count of the total number of residues in the segment. The optional 'doc_href' is a URL for a web browser to display human-readable documentation.

The optional 'reference' attribute is a URI. It connects the given segment to the globally agreed upon standard identifier for that segment. For example, the following reference URI is the identifier for Human Chromosome 4, defined by NCBI assembly 34:

A client uses the reference identifier to merge features from different DAS annotation servers into a common view because two segments from different servers with the same reference identifier must be copies of the same underlying segment.

3.2.1 Segments query parameters

The segment URIs must be fetchable. This is used to return the sequence data. Each segment URI must support form-urlencoded query parameters. The optional 'format' query specifies the data format to return, using the values in the recently defined FORMAT elements. If not given the default format is 'das2xml', which returns a segments document of content-type 'application/x-das-segments+xml' containing the details for only that segment. If the format query parameter is "fasta", then the content-type for the returned FASTA document should be "text/plain". FASTA records should have a title containing the segment name but a client may ignore the title.

The segment URI queries support a 'range' query parameter to limit the response to a specific range of the segment. The query is of the form "$start:$end", for example "100:200" and "345:9876". Note that the colon should be escaped for use in a URL.

The response must only include the sequence in the specified range. As usual, the first residue is 0 and the range includes the start position but not the end position. For example, if the sequence is "CATAGGTA" then range=1:3 is the subsequence "AT".
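
The half-open range convention maps directly onto Python slicing, as the sketch below shows. The helper names are made up for this example, and the segment URI passed to segment_query_url is the resolved "2L" URI from the example segments document; urlencode takes care of escaping the colon.

```python
from urllib.parse import urlencode


def subsequence(sequence, range_value):
    """Apply a "$start:$end" range: zero-based, start included, end excluded."""
    start, end = (int(part) for part in range_value.split(":"))
    return sequence[start:end]


def segment_query_url(segment_uri, fmt, range_value):
    """Build a segment request; urlencode escapes the ':' as %3A."""
    return segment_uri + "?" + urlencode({"format": fmt, "range": range_value})


print(subsequence("CATAGGTA", "1:3"))
print(segment_query_url(
    "http://www.biodas.org/das2/sequence/fly/release4/2L", "fasta", "100:200"))
```

The first call prints "AT", matching the example in the text.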

If the server does not support a requested format name then it MUST respond with an HTTP error message with status code 400 "Bad Request", and the message body SHOULD indicate that the requested format is not supported. If the server considers the response too large then it MUST respond with an HTTP error message with status code 413 "Request Entity Too Large", and the message body SHOULD indicate what the server would consider an acceptable response size. It is up to the server to determine how large is "too large".

If the start or end of the range is negative, or greater than the length of the segment, or beyond the limits of the integer data type used for the range, then the server MAY respond with an HTTP error 400 "Bad Request". Servers MUST accept ranges from 0 up to and including the length of the segment, except for the case where the response is considered too large.

3.3 The types document (detailed)

A types request returns information about feature types. This includes the link to an ontology, display information, and a description of the possible alternative formats. The DAS type record is not part of a type system in that DAS type records are not derived from other DAS type records. Instead they link groups of features to an external type system.

The request for all types on the server is done through the query_uri of the "types" CAPABILITY listed in the versioned source entry, as in the following:

In this case the query_uri is a relative URL, with the base URL defined somewhere besides the snippet shown. There must be one and only one types CAPABILITY for a versioned source.

Fetching that URI returns a document of content-type "application/x-das-types+xml" containing all of the types.

Request:

http://www.biodas.org/das/sequence/gallus_gallus/March2004/types

Response:

Content-Type: application/x-das-types+xml

<?xml version="1.0" encoding="UTF-8"?>
<TYPES xmlns="http://biodas.org/documents/das2"
   xml:base="http://www.biodas.org/das/sequence/gallus_gallus/March2004/">

 <TYPE uri="T1" ontology="http://song.sourceforge.net/SO/five_prime_UTR"
   so_accession="SO:0000204" title="5' UTR"
   description="A region at the 5' end of a mature transcript (preceding 
the initiation codon) that is not translated into a protein."
   method="N-SCAN 3.0">
  <PROP key="program_href" value="http://genes.cse.wustl.edu/BrentLab/MB-Lab-Software.html" />
  <ext:n-scan-options xmlns:ext="http://dalkescientific.com/das-extension"
        informant="human_nscan-10-03-2005.zhmm" sequence="refseqs_hg17"
        parameter="x3j11-Jan-2006.txt" />
 </TYPE>

 <TYPE uri="T2" ontology="http://song.sourceforge.net/SO/gene"
   so_accession="SO:0000704" title="strong gene prediction"
   description="Genscan score above 100" method="GENSCAN 1.0">
  <PROP key="program_href" value="http://genes.mit.edu/GENSCAN.html" />
 </TYPE>

 <TYPE uri="T3" ontology="http://song.sourceforge.net/SO/gene"
   so_accession="SO:0000704" title="moderate gene prediction"
   doc_href="http://genes.mit.edu/GENSCANinfo.html"
   description="Genscan score between 50 and 100" method="GENSCAN 1.0">
  <PROP key="program_href" value="http://genes.mit.edu/GENSCAN.html" />
 </TYPE>

 <TYPE uri="T4" ontology="http://song.sourceforge.net/SO/gene"
   so_accession="SO:0000704" title="weak gene prediction"
   doc_href="http://genes.mit.edu/GENSCANinfo.html"
   description="Genscan score between 0 and 50" method="GENSCAN 1.0">
  <PROP key="program_href" value="http://genes.mit.edu/GENSCAN.html" />
 </TYPE>

</TYPES>

The TYPES element has zero or more LINK elements linking the types document to some other document.

The TYPES element has zero or more TYPE elements. Each TYPE element has several attributes. The 'uri' attribute is a URI and must be unique for each TYPE in the document. The URI is used by a feature to identify its type. Each URI should be individually fetchable, returning a type document containing the given type record.

The 'ontology' attribute is a URI identifying the formal sequence ontology term to which the DAS type belongs. Multiple DAS type records may point to the same ontology term, as for example gene predictions from different source programs or different representation styles depending on the score. At present there are no stable ontology URI schemes so this attribute is optional.

The sequence ontology (SO) is widely used but its identifiers are not URIs. The 'so_accession' attribute contains the SO accession number including the leading "SO:", as in "SO:0000316". Note that the leading zeros are important. This field should be interpreted as an opaque string. It is highly recommended that DAS/2 servers use feature types based on the sequence ontology vocabulary.

The optional 'title' attribute is a short title of at most a few words, meant to be human readable. The optional 'description' attribute contains up to a paragraph of text (no HTML markup) with more details about the type record. The optional 'doc_href' is a URL for a web browser to display human-readable documentation.

The optional 'method' attribute states where the data came from. It is at most a few words, meant for people. It may be the name of the database, analysis program, or person who curated the database. Some example method fields are:

All features are available in a DAS-specific XML format. A data provider may want to return an alternate format in addition to the base formats. A simple example is to provide the data in GFF3 format as well as the standard DAS2 XML. When this occurs the server should add a FORMAT entry to the "features" capability so clients know which alternate formats are globally applicable.

Some of the alternative formats are data type specific. For example, a server may provide the ABI trace data for each sequencing read. The data provider may model this as a single feature covering the entire genome where the feature type record links to a new ontology term "http://example.com/das2/feature/sequencing_read". It also declares that the data is available in "abi" format. If the client recognizes the ontology type and format name it can retrieve the trace data for a region of interest by doing a feature query with the appropriate feature filters and adding "&format=abi" to the query string.

A server indicates support for an alternate format using a FORMAT element. A TYPE element has zero or more FORMAT elements, each with a 'name' attribute. These list the supported formats for features of the given type. The reserved format names are "das2xml", "count" and "uris". See the FEATURE (detailed) section for more details.

3.3.1 Extending the type record

Some type records may need to provide additional information, for instance more details about how the features were generated. There are two ways to extend the basic type record.

The first is through a property table. The table is a list of zero or more PROP elements. Each PROP element has a 'key' and 'value' attribute, mapping the given key to value. Duplicate keys are allowed. There are no pre-defined key names though de facto ones will likely arise through use. The order of the PROP elements is arbitrary and must not affect the meaning of the keys or values.

For more complex extensibility use non-DAS XML elements. Any number of XML elements may occur after the property table so long as the outermost elements are not in the DAS namespace. Clients must ignore unknown elements.
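
The sketch below shows one way a client might read both kinds of extension from a type record. It uses a trimmed copy of the "T1" record from the example types document. Because duplicate keys are allowed, values are collected into a list per key; elements outside the DAS namespace are only recorded by tag name here, which stands in for "ignore unless recognized". The function name is an assumption of this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DAS2_NS = "{http://biodas.org/documents/das2}"

TYPE_XML = """\
<TYPE xmlns="http://biodas.org/documents/das2" uri="T1" title="5' UTR">
  <PROP key="program_href"
        value="http://genes.cse.wustl.edu/BrentLab/MB-Lab-Software.html" />
  <ext:n-scan-options xmlns:ext="http://dalkescientific.com/das-extension"
        informant="human_nscan-10-03-2005.zhmm" sequence="refseqs_hg17"
        parameter="x3j11-Jan-2006.txt" />
</TYPE>
"""


def read_extensions(type_xml):
    """Return (properties, foreign_tags) for a single TYPE record."""
    elem = ET.fromstring(type_xml)
    props = defaultdict(list)   # key -> list of values (duplicates allowed)
    foreign = []                # tags of elements outside the DAS namespace
    for child in elem:
        if child.tag == DAS2_NS + "PROP":
            props[child.get("key")].append(child.get("value"))
        elif not child.tag.startswith(DAS2_NS):
            foreign.append(child.tag)
    return dict(props), foreign


properties, foreign_tags = read_extensions(TYPE_XML)
print(properties)
print(foreign_tags)
```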

3.4 The features document (detailed)

Each annotation server provides zero or more features as the result of a feature query, which returns a features document. Each feature has a title, a type, zero or more locations, and many other properties.

The feature query is done through the query_uri of the "features" CAPABILITY listed in the versioned source entry, as in the following:

The CAPABILITY element may list alternative feature formats using the FORMAT element. The syntax details were discussed in the SOURCES (detailed) section. The reserved feature format names are

format name	format description
das2xml	the XML format described here
count	the number of matching features
uris	a list of matching feature URIs

All DAS servers must support the das2xml format. A client may assume the server supports das2xml even if the format name is not listed in the CAPABILITY.

The "count" format returns a document with a single line containing an integer count of the number of feature elements that would be returned. (A complex feature with a parent and a part has a count of 2 because it has two FEATURE elements.) The content-type of this document should be text/plain.

The "uris" format returns a document containing a newline separated list of each feature URI matching the search. The content-type of this document should be text/plain.
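
Both plain-text formats are simple to consume. The following sketch assumes the response body has already been fetched and decoded to a string; the two sample bodies are invented, with URIs formed from the xml:base and feature URIs of the example features document.

```python
def parse_count(body):
    """Parse a "count" response: a single line holding an integer."""
    return int(body.strip())


def parse_uris(body):
    """Parse a "uris" response: one feature URI per line, blank lines skipped."""
    return [line.strip() for line in body.splitlines() if line.strip()]


# Hypothetical response bodies for illustration.
count_body = "2\n"
uris_body = (
    "http://www.biodas.org/das2/sequence/fly/Jun2006/FT_9\n"
    "http://www.biodas.org/das2/sequence/fly/Jun2006/FT_10\n"
)

print(parse_count(count_body))
print(parse_uris(uris_body))
```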

The CAPABILITY element may list the query interfaces supported by the query_uri using the SUPPORTS element. The syntax details were discussed in the SOURCES (detailed) section. The reserved query names are

SUPPORTS name	query description
simple	a request may be made with no feature filters
das2queries	server implements the DAS2 feature filter query language

If only "simple" is listed then the server does not support the DAS2 feature filter query language. This likely means that the server has at most a few thousand features and it's easier for the client to download everything than to make multiple requests against the server.

If "das2queries" is listed then the server implements the feature filter query language defined in this document. There is no extra meaning if "simple" is also listed.

If neither "simple" nor "das2queries" is given then a client may assume the server supports "das2queries".
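
The three rules above reduce to a short decision function. This is only a sketch of the client-side logic; the function name is made up, and the input is assumed to be the list of 'name' values from the SUPPORTS elements of the "features" CAPABILITY.

```python
def supports_filter_queries(supports_names):
    """Decide whether a client may send DAS2 feature filter queries.

    - "das2queries" listed: yes ("simple" alongside it adds nothing).
    - only "simple" listed: no.
    - neither listed: the client may assume "das2queries".
    """
    names = set(supports_names)
    if "das2queries" in names:
        return True
    return "simple" not in names


print(supports_filter_queries(["simple"]))
print(supports_filter_queries(["simple", "das2queries"]))
print(supports_filter_queries([]))
```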

Here is a "features" CAPABILITY example using the FORMAT and SUPPORTS elements:

Fetching the query_uri returns all of the features for the versioned source. If the server considers the response too large then it MUST respond with an HTTP error 413 "Request Entity Too Large", and the message body SHOULD indicate what the server would consider an acceptable response size. It is up to the server to determine how large is "too large".

Request:

http://biodas.org/das/sequence/fly/Jun2006/feature-search.cgi

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES
     xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/Jun2006/">

 <FEATURE uri="./FT_9" type="./transposable_element" title="baggins"
    created="2005-11-24" doc_href="http://www.flybase.org/.bin/fbidq.html?FBgn0063440">
  <ALIAS alias="CG8672" />
  <ALIAS alias="CG3392" />
  <ALIAS alias="CG8672" />
  <ALIAS alias="baggins1" />
  <ALIAS alias="Baggins1" />

  <LOC segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4" range="637:1719:-1" />

  <NOTE>element type: non-LTR retrotransposon (Kapitonov and Jurka, 1997-*)</NOTE>
  <NOTE>total length in bp: 5453 number of copies in genome: 14 in euchromatin of
Release 3 genome annotation, of which zero are full length.</NOTE>

  <PROP key="gene_class" value="transposable element" />
 </FEATURE>

 <FEATURE uri="./FT_10" type="./exon" title="pseudogene CR32011"
    doc_href="http://www.flybase.org/.bin/fbidq.html?FBan0032011">
  <ALIAS alias="FBtr0089182" />
  <ALIAS alias="FBgn0076625" />
  <LOC segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4" range="26993:27101" />
  
  <NOTE>pseudogene CR32011 (CR32011 ,FBgn0052011) is located on 4 and has a length
of 5398 nt on the genomic sequence. The cytologic location is 101F1--102A1 . The
physical map boundaries are 4:26,994..32,391[-].</NOTE>
  <NOTE>Pseudogene related to CG17245</NOTE>
  <NOTE>based on homology to the obsolete gene model for Dmel CG32011-PA amino acids
1..1240 in release 3.1</NOTE>

  <PROP key="gene_region_length" value="5398" />
  <fly:map xmlns:fly="http://flybase.org/"
           physical_map="4: 26,994..32,391[-]"
           cytogenetic_map="4: 101F1-102A1" />
 </FEATURE>

</FEATURES>

The FEATURES element has zero or more LINK elements linking the features document to some other document. The content of the LINK elements may depend on the filters being used. For example, a LINK element may point to an RSS feed which gives updates when new features match the query. Different queries would point to different feeds.

The FEATURES element contains zero or more FEATURE elements. Each FEATURE element has a 'uri' attribute which is a URI that must be unique for each feature on the server. The URI should be fetchable and by default return a document of content-type application/x-das-features+xml containing the information for that feature. If the server supports the "uris" FORMAT then each URI must be independently fetchable. If a feature URI is fetchable then it must support the "format=" query parameter, where the format name is given by the "features" CAPABILITY of the versioned source document and by the FORMAT list of the corresponding feature type record.

The 'type' attribute of the FEATURE element is a URI identifying the type record to which the feature belongs. The URI must be listed in the full types document and should be individually fetchable.

The 'title' attribute contains a short description of the feature. This should contain the feature name or other essential description of the feature. The optional 'created' and 'modified' attributes contain timestamps of when the feature annotation was first created and most recently modified. The optional 'doc_href' is a URL for a web browser to display human-readable documentation.

A FEATURE has zero or more ALIAS elements. Each ALIAS element lists an alternate name for the feature, given by the 'alias' attribute.

A FEATURE has zero or more LOC elements describing where the annotation is located on the segments. The 'segment' attribute of the LOC element contains the URI of the segment on which the annotation is located. If the versioned source document refers to "segments" capabilities then the URI must be listed in one of the segments documents. Otherwise the URI must match the URIs defined for the COORDINATE system stated in the versioned source record.

A LOC record has an optional 'range' attribute. If not given the location is on the entire segment. The range string is in one of the following two formats:
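
As an illustration only, the two range strings that appear in the example features document ("637:1719:-1" and "26993:27101") can be split as shown below. The sketch assumes a "start:end" form with an optional third strand field, which is inferred from those two examples rather than taken from a formal definition; the class and function names are likewise made up.

```python
from typing import NamedTuple, Optional


class LocRange(NamedTuple):
    start: int
    end: int
    strand: Optional[int]  # None when the range string has no third field


def parse_loc_range(text):
    """Split a LOC 'range' value into integers.

    Assumption: the value is "start:end" or "start:end:strand", as seen
    in the example features document.
    """
    parts = text.split(":")
    if len(parts) not in (2, 3):
        raise ValueError("unexpected range string: %r" % text)
    start, end = int(parts[0]), int(parts[1])
    strand = int(parts[2]) if len(parts) == 3 else None
    return LocRange(start, end, strand)


print(parse_loc_range("637:1719:-1"))
print(parse_loc_range("26993:27101"))
```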

A LOC record has an optional "gap" attribute which is a CIGAR string. This can be used to specify details of a gapped alignment. The rationale and syntax of CIGAR strings are explained in "The Ensembl Core Software Libraries", Stabenau et al., Genome Research 2004:

A FEATURE has zero or more LINK elements linking the feature record to an external database entry. This link element is specific to the given feature. For example, it may link to an image icon to be included in graphical displays.

Some features are complex and cannot easily be modeled with a single feature record. Quoting from the "Chado Schema Documentation" (http://www.gmod.org/schema):

Complex annotations are modeled through a directed acyclic graph structure. Each node in the graph is a feature record, and is uniquely labeled by its URI. A node may have zero or more children. A link from a parent to a child indicates that the child is a "part of" the parent. Cycles are not allowed: a node may not be a descendant of itself. Complex features have a single root node. It may be a synthetic and location-less feature if there is no natural physical interpretation for it.

The FEATURE record stores the graph relationships in the PARENT and PART elements. A FEATURE has zero or more PARENT elements. Each PARENT element has a 'uri' attribute with the URI for a parent feature record. A FEATURE has zero or more PART elements. Each PART element has a 'uri' attribute with the URI for a child feature record. Having both parents and parts listed means the graph structure is stored twice. In practice this makes processing complex features easier than when only single direction links are given.
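
The sketch below reads the PARENT and PART links from a small, invented features document and walks upward to the root of a complex feature. The feature URIs and titles are hypothetical, and the loop includes a guard in case a server returns a cycle despite the rule against them.

```python
import xml.etree.ElementTree as ET

DAS2_NS = "{http://biodas.org/documents/das2}"

# Hypothetical complex feature: one gene with two exons.
DOC = """\
<FEATURES xmlns="http://biodas.org/documents/das2">
  <FEATURE uri="gene1" title="gene">
    <PART uri="exon1" />
    <PART uri="exon2" />
  </FEATURE>
  <FEATURE uri="exon1" title="exon 1"><PARENT uri="gene1" /></FEATURE>
  <FEATURE uri="exon2" title="exon 2"><PARENT uri="gene1" /></FEATURE>
</FEATURES>
"""


def load_graph(xml_text):
    """Return (parents, parts): dicts mapping a feature URI to linked URIs."""
    root = ET.fromstring(xml_text)
    parents, parts = {}, {}
    for feat in root.findall(DAS2_NS + "FEATURE"):
        uri = feat.get("uri")
        parents[uri] = [p.get("uri") for p in feat.findall(DAS2_NS + "PARENT")]
        parts[uri] = [p.get("uri") for p in feat.findall(DAS2_NS + "PART")]
    return parents, parts


def find_root(uri, parents):
    """Follow PARENT links upward until a feature with no parent is reached."""
    seen = set()
    while parents.get(uri):
        if uri in seen:
            raise ValueError("cycle detected at " + uri)
        seen.add(uri)
        uri = parents[uri][0]  # any parent leads to the single root
    return uri


parents, parts = load_graph(DOC)
print(find_root("exon2", parents))
print(parts["gene1"])
```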

A FEATURE has zero or more NOTE elements containing human-readable information about the feature in the CDATA section of the element. Clients and servers must treat the notes as an ordered list because the text in a note may refer to another using terms like "previous note."

3.4.1 The feature filter query language

Unless the "features" capability of the versioned source record only lists support for "simple" searches, a DAS server must implement the DAS2 query filter language. The language is based on a set of predicates, specified as a key/value pair. If no predicates are given, the feature query returns all features. Predicates act as filters and reduce the number of features returned.

The query field keys are:

name	takes	matches features ...
link	URI	which have the given link
type	URI	with exactly the given type
segment	URI	on the given segment
coordinates	URI	which are part of the given coordinate system
overlaps	region	which overlap the given region
excludes	region	which have no overlap to the given region
inside	region	which are contained inside the given region
name	string	with a "title" or "alias" matching the given string
note	string	with a "note" matching the given string
prop-*	string	with the property "*" matching the given string

Queries are form-urlencoded requests, meaning the DAS query is encoded in the query portion of an HTTP GET URL. For example, if the feature query URL is 'http://biodas.org/features' and there is a segment named 'http://ncbi.org/human/Chr1' then the logical form of a request for all features on the first 10,000 bases of that segment is

Multiple search terms with the same key are OR'ed together. The following searches for features containing the name or alias of either BC048328 or BC015400

Multiple search terms with different keys are AND'ed together, but only after doing the OR search for each set of search terms with identical keys. The following searches for features which have a name or alias of BC048328 or BC015400 and which are on the segment http://ncbi.org/human/Chr1
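
One way to build such a request is to pass the search terms to urlencode as a list of pairs, which keeps repeated keys and escapes the segment URI. The sketch uses the feature query URL, names and segment URI from the examples in this section; the helper name is an assumption.

```python
from urllib.parse import urlencode


def feature_query_url(query_uri, terms):
    """Append form-urlencoded search terms to a feature query URL.

    `terms` is a list of (key, value) pairs so that a key may repeat.
    """
    if not terms:
        return query_uri
    return query_uri + "?" + urlencode(terms)


url = feature_query_url(
    "http://biodas.org/features",
    [
        ("name", "BC048328"),   # OR'ed with the next term (same key)
        ("name", "BC015400"),
        ("segment", "http://ncbi.org/human/Chr1"),  # AND'ed with the names
    ],
)
print(url)
```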

If any part of a complex feature (that is, one with parents or parts) matches a search term then all of the parents and parts are returned.

The fields which take URIs require exact matches, that is, a character by character match. (For details on the nuances of comparing URIs see http://www.textuality.com/tag/uri-comp-3.html).

The segment query filter takes a URI. This must accept the segment URI and, if known to the server, the equivalent reference identifier for the segment.

If range searches are given then one and only one segment must be given. If there are multiple segment queries then ranges are not allowed.

The string searches may be exact matches, substring, prefix or suffix searches. The query type depends on whether the search value starts and/or ends with a '*'.

The interpretation of "*" or "?" elsewhere in the query string is implementation dependent. Text searches are case-insensitive. The string "ABC" matches "abc", "aBc", "ABC", etc.

A server should collapse multiple whitespace characters into a single space character for search purposes. For example, the query "*a newline*" should match

The 'name' search does a text search of the 'title' and 'alias' fields. A record matches if the name is found in any one of those fields.
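
The text-search rules above can be sketched as follows. The mapping from '*' position to search type is an assumed reading of the description: a leading '*' is taken as a suffix search, a trailing '*' as a prefix search, both as a substring search, and neither as an exact match. Any other "*" or "?" is treated literally here, which is one of several permitted implementation choices. Function names are made up.

```python
import re


def normalize(text):
    """Collapse runs of whitespace to one space and fold case."""
    return re.sub(r"\s+", " ", text).lower()


def matches(pattern, value):
    """Case-insensitive match with optional leading/trailing '*'."""
    pat, val = normalize(pattern), normalize(value)
    leading = pat.startswith("*")
    trailing = len(pat) > 1 and pat.endswith("*")
    core = pat[(1 if leading else 0):(len(pat) - 1 if trailing else len(pat))]
    if leading and trailing:
        return core in val            # substring search
    if leading:
        return val.endswith(core)     # suffix search
    if trailing:
        return val.startswith(core)   # prefix search
    return val == core                # exact match


def name_matches(pattern, title, aliases):
    """A 'name' search succeeds if the title or any alias matches."""
    return any(matches(pattern, field) for field in [title] + list(aliases))


print(matches("*a newline*", "This has a\n   newline in it"))
print(name_matches("cg86*", "baggins", ["CG8672", "Baggins1"]))
```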

The "prop-*" is shorthand for a class of text searches of <PROP> elements. Features may have properties, like

To do a string search for all 'membrane' cellular components, construct the query key by taking the string "prop-" and appending the property key text ("cellular_component"). The query value is the text to search for, in this case:

The rules for multiple searches with the same key also apply to the prop-* searches. To search for all 'membrane' or 'nuclear' cellular components, use two 'prop-cellular_component' terms, as

The range searches are defined with explicit start and end coordinates. The range syntax is in the form

A feature may have several locations. An annotation may have several features in a parent/part relationship. The relationship may have several levels. If a range search matches any feature in the annotation then the search returns all of the features in the annotation.

An 'overlaps' search matches if and only if any feature location of any of the parent or part overlaps the query range and segment.

An 'inside' search matches if and only if at least one feature in the annotation has a location on the query segment and all features which have a location on the query segment have at least one location which starts and ends in the query range.
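
The 'overlaps' and 'inside' definitions can be expressed over half-open intervals as in the sketch below. Here an annotation is modeled as a list of features, and each feature as a list of (segment, start, end) locations; this data layout, the segment label "chr4" and the function names are assumptions of the example, not part of the protocol.

```python
def overlaps(f_start, f_end, q_start, q_end):
    """Half-open intervals overlap when each starts before the other ends."""
    return f_start < q_end and q_start < f_end


def inside(f_start, f_end, q_start, q_end):
    """The location starts and ends within the query range."""
    return q_start <= f_start and f_end <= q_end


def annotation_overlaps(features, segment, q_start, q_end):
    """True if any location of any feature overlaps the query."""
    return any(
        seg == segment and overlaps(start, end, q_start, q_end)
        for locations in features
        for (seg, start, end) in locations
    )


def annotation_inside(features, segment, q_start, q_end):
    """True if at least one feature is on the segment and every such
    feature has at least one location inside the query range."""
    on_segment = [
        [(s, e) for (seg, s, e) in locations if seg == segment]
        for locations in features
    ]
    on_segment = [locs for locs in on_segment if locs]
    return bool(on_segment) and all(
        any(inside(s, e, q_start, q_end) for (s, e) in locs)
        for locs in on_segment
    )


# Hypothetical annotation with two single-location features.
annotation = [[("chr4", 637, 1719)], [("chr4", 700, 800)]]
print(annotation_overlaps(annotation, "chr4", 1000, 2000))
print(annotation_inside(annotation, "chr4", 600, 1800))
print(annotation_inside(annotation, "chr4", 650, 1800))
```

Note that with half-open intervals a feature ending at 1719 does not overlap a query starting at 1719.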

EXPERIMENTAL: An 'excludes' search matches if and only if at least one feature of the annotation is on the query segment and no features are in the query range. This is the complement of the 'overlaps' search, for annotations on the same query segment.

Unlike the other search keys, if there are multiple 'excludes' searches then the results are AND'ed together. That is, if the query has two excludes ranges

3.4.2 Additional feature properties

Some feature records may need to provide additional information. Examples include alignment score information and curation history. There are two ways to extend the basic feature record.

The first is through a property table. The table is a list of zero or more PROP elements. Each PROP element has a 'key' and 'value' attribute, mapping the given key to value. Duplicate keys are allowed. There are no pre-defined key names though de facto ones will likely arise through use. The order of the PROP elements is arbitrary and must not affect the meaning of the keys or values.
