Spec Review
Note: this is NOT the official version of the DAS specification.
Contents
- 1 Distributed Sequence Annotation System (DAS) Spec Review
- 1.1 System Architecture
- 1.2 Client/Server Interactions
- 1.3 Reference Object IDs
- 1.4 The Queries
- 1.4.1 Sources Command
- 1.4.2 Entry Points Command
- 1.4.3 Retrieve the DNA Associated with a Subsequence has been deprecated as the sequence cmd below is more commonly used.
- 1.4.4 Sequence Command
- 1.4.5 Types Command
- 1.4.6 Features Command
- 1.4.7 We are proposing to remove Linking to a Feature from the spec, as implementations tend to use a separate web server anyway
- 1.4.8 Retrieving the Stylesheet
- 1.4.9 Glyphs and Groups
- 1.5 Fetching Sequence Assemblies We are proposing to deprecate this as not many people use it????
- 1.6 Feature Types and Categories
- 1.7 Glyph Types
- 1.8 Other Issues
- 1.9 Changes
Distributed Sequence Annotation System (DAS) Spec Review
Oct 1, 2008
This is a working document and a proposal for a reworked DAS specification. It aims to clarify the DAS spec based on how DAS is used in the community today, and to include commands from the 1.53E spec and some of the 2.0 spec. The document has also been adjusted to reflect the shift in the use of DAS away from a solely genome-centric protocol towards a more open one encompassing other reference/coordinate systems, such as protein sequences and structures. The spec also includes references to the DAS Registry, which is essential for implementing a service-oriented architecture (SOA). Note: this is a technical document, but it should be readable and understandable by people without a deep understanding of broader technical issues and other system architectures. Note that we are proposing to replace the DTDs with XSD schemas.
System Architecture
The Distributed Annotation System is a network of server and client software installations distributed across the web. The DAS protocol is a standard mechanism through which clients can communicate with servers in order to obtain various types of biological data. The protocol defines:
- the communication method
- the query model
- the data format
By enforcing these constraints, DAS allows a client to integrate data from many diverse sources implementing the protocol at a scalable development cost.
The DAS network of servers comprises a registry, several reference servers and several annotation servers. Tying these together are the concepts of reference objects and coordinate systems.
Reference Objects
Reference objects are items of data with stable identifiers that are targets for annotation. At the most abstract level a reference object might be an annotatable concept or idea (e.g. a particular gene), but usually describes a biological unit within which annotations can be positioned. For example, "P15056" refers to a protein sequence upon which annotations can be based. Similarly, "chromosome 21" refers to a DNA sequence.
Individual reference objects can in fact have several versions, and it is important to recognise that annotations based upon different versions of the same reference entity are not necessarily equivalent.
Annotations
Annotations are pieces of information that are always attributed to a reference object. Annotations are usually positional, that is they refer to a specific location within a reference object. An exon within a genomic sequence is an example. Annotations can also be non-positional, in which case they can be considered as information attributed to the whole of the reference object. For example, the description of a protein or gene.
Coordinate Systems
A coordinate system is a stable, logical grouping of reference objects. A coordinate system provides a mechanism to uniquely identify reference objects that share identifiers, such as chromosomes. For example, chromosome 21 might identify several reference objects from different species, but only one within the NCBI 36 human assembly. Thus, "human NCBI 36 chromosomes" is a coordinate system containing 25 reference objects.
Coordinate systems are formally described using four properties:
- The category or type of annotatable entity. For example a chromosome sequence or protein structure
- The authority responsible for defining the coordinate system. For example NCBI or UniProt
- The version, for coordinate systems containing entities that are not versioned (e.g. genomic assemblies)
- The species, for coordinate systems containing only entities from a single organism
Of these, category and authority are required.
Some example coordinate systems:
Category | Authority | Version | Species |
---|---|---|---|
Chromosome | NCBI | 36 | Homo sapiens |
Scaffold | ZFISH | 7 | Danio rerio |
Protein sequence | UniProt | - | - |
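The four properties above can be modelled as a small record. The following is a minimal sketch, not part of the DAS spec: the class name `CoordinateSystem` and the `label()` format are assumptions made for illustration; only the rule that category and authority are required comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoordinateSystem:
    """One DAS coordinate system; category and authority are required."""
    category: str
    authority: str
    version: Optional[str] = None   # e.g. "36" for a genomic assembly
    species: Optional[str] = None   # e.g. "Homo sapiens"

    def __post_init__(self):
        if not self.category or not self.authority:
            raise ValueError("category and authority are required")

    def label(self) -> str:
        """Readable label (hypothetical format, not defined by DAS)."""
        parts = [self.authority + ("_" + self.version if self.version else "")]
        parts.append(self.category)
        if self.species:
            parts.append(self.species)
        return ",".join(parts)


human = CoordinateSystem("Chromosome", "NCBI", "36", "Homo sapiens")
uniprot = CoordinateSystem("Protein sequence", "UniProt")
print(human.label())
print(uniprot.label())
```

Making the record immutable means two sources can be compared for coordinate-system equality with a plain `==`.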
Reference & Annotation Servers
A reference server is a DAS server that provides core data for the reference objects in a particular coordinate system. For example, the reference server for "UniProt Protein sequence" provides the actual sequence for each UniProt entry. It does this by implementing the DAS sequence command. So that clients can discover the available reference objects in a coordinate system, a reference server must also list them via the entry_points command.
Annotation servers are specialized for returning lists of annotations for the reference objects within a coordinate system. This is done by implementing the DAS features command.
In future versions of the spec (i.e. those not focussed entirely on sequence) this will be generalised. That is, reference objects won't be assumed to be sequences and annotations won't be assumed to be sequence features.
Note: The distinction between reference and annotation servers is conceptual rather than physical. That is, a single server instance can in fact play both roles by offering both sequences and annotations of those sequences.
Note: A server may support multiple coordinate systems. Since some coordinate systems are subsets of others (such as chromosomes, contigs and clones), this can mean that annotation servers can potentially serve the same annotations on several coordinate systems.
In my opinion, the registry coordinate systems XML (i.e. http://www.dasregistry.org/das/coordinatesystem) should be used to indicate coordinate systems that are subsets of each other which would avoid this problem - clients could then programmatically choose what to query with and what to avoid.
The DAS Registry
The DAS registry is a special component of DAS, fulfilling the following roles:
- Catalogues and describes the capabilities and coordinate systems of DAS services
- Allows discovery of available DAS services via both human and programmatic interfaces
- Automatically validates registered DAS sources to ensure that they correctly implement the protocol
- Periodically tests DAS sources and notifies their administrators if they are unavailable
- Provides a mechanism for activating or highlighting individual DAS services in clients
Clients
A DAS client typically integrates data from a number of DAS servers, making use of the different data types. For example, a client might implement the following procedure for a particular sequence location:
- Contact DAS registry to find reference and annotation servers for the relevant assembly
- Obtain sequence from the reference server
- Obtain sequence features from each of the annotation servers
- Display the annotations in the context of the sequence
This is best explained by diagram
Client/Server Interactions
The DAS is web-based. Clients query the reference and annotation servers using the HTTP protocol (see RFC2616) by sending a formatted URL request to the server. Servers process the request and return a response in the form of a formatted XML document (see W3C Extensible Markup Language) according to a predefined schema.
The Request
All DAS requests take the form of a hierarchical URL. Each URL has a site-specific prefix, followed by a standardized path and query string. The standardized path begins with the string /das. This is followed by URL components containing the data source name and a command.
Should put some guidance or specify the ability for servers to accept encoded URLs.
How do we get everyone to specify, say, "chromosome1" in exactly the same way, and not "chr1" etc.? By coordinate system and entry_points, I guess. A reference server MUST implement entry_points, regardless of the number of objects (don't expect it to come back quickly). We can always add a "range" parameter later.
For example:
http://das.sanger.ac.uk/das/ccds_mouse/features?segment=1:174405453,174408689
^^^^^^^^^^^^^^^^^^^^^^^ ^^^ ^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
site-specific prefix    das data src   command  arguments
In this case, the site-specific prefix is http://das.sanger.ac.uk. The request begins with the standardized path /das, and the data source, in this case /ccds_mouse. This is followed by the command /features, which requests a list of features, and a query string providing named arguments to the /features command.
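The decomposition described above can be sketched in a few lines. This is an illustrative helper only: the function name `split_das_url` is hypothetical, and it assumes the data-source form of a request (PREFIX/das/SOURCE/COMMAND) with arguments separated by ";" (or "&").

```python
from urllib.parse import urlsplit, unquote


def split_das_url(url):
    """Split a DAS request URL into prefix, data source, command and arguments."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Everything before the standardized "das" path component is site-specific.
    i = segments.index("das")
    prefix = f"{parts.scheme}://{parts.netloc}" + "/".join(segments[:i])
    source, command = segments[i + 1], segments[i + 2]
    args = []
    for pair in parts.query.replace("&", ";").split(";"):
        if pair:
            name, _, value = pair.partition("=")
            args.append((unquote(name), unquote(value)))
    return prefix, source, command, args


print(split_das_url(
    "http://das.sanger.ac.uk/das/ccds_mouse/features?segment=1:174405453,174408689"
))
```

Arguments are returned as a list of pairs rather than a dictionary because a DAS request may repeat the same argument name, for example several segment arguments.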
Thus, a single DAS server hosts one or more DAS sources, allowing it to provide different types of information, and/or information in several coordinate systems. This example source provides consensus CDS transcripts for mouse chromosomes, but the same server provides a number of other sources, including a similar source for human along with sources containing very different types of data.
More information on the format of the request and the various available commands is given below.
The query string portion of the request (everything to the right of the "?" symbol) can instead be POSTed to the URL following conventional HTTP standards. Since some queries can be quite large, this is the recommended way of passing arguments.
SOAP has not been widely adopted for DAS, so should we delete this? Yep!
The request may be replaced with a SOAP-style XML-encapsulated document in future versions of this specification.
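As a rough illustration of passing arguments by POST, the sketch below builds (but does not send) a request with Python's standard library. The helper name `build_post_request` is hypothetical, and the choice of ";" as the separator in the body and of the form content type are assumptions carried over from the query-string syntax, not something this document specifies.

```python
from urllib.parse import quote
from urllib.request import Request


def build_post_request(command_url, args):
    """Build a POST request whose body carries the DAS arguments."""
    body = ";".join(
        f"{quote(name)}={quote(value, safe=':,')}" for name, value in args
    )
    return Request(
        command_url,
        data=body.encode("ascii"),  # supplying data makes urllib use POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_post_request(
    "http://das.sanger.ac.uk/das/ccds_mouse/features",
    [("segment", "1:1,1000"), ("segment", "2:1,1000")],
)
print(req.get_method(), req.data)
```

The request object could then be passed to `urllib.request.urlopen`; that step is left out so the example runs without network access.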
The Response
The response from the server to the client consists of a standard HTTP header with DAS status information within that header, followed optionally by XML content that contains the answer to the query. The DAS status portion of the header consists of three lines. The first is X-DAS-Version and gives the current protocol version number, currently DAS/1.6. The second line is X-DAS-Status and contains a three-digit status code which indicates the outcome of the request. The third is X-DAS-Capabilities, which describes the parts of the spec the server implements.
Here is an example HTTP header: (provided by Web server)
HTTP/1.1 200 OK
Date: Sun, 12 Mar 2000 16:13:51 GMT
Server: Apache/1.3.6 (Unix) mod_perl/1.19
Last-Modified: Fri, 18 Feb 2000 20:57:52 GMT
Connection: close
Content-Type: text/plain
X-DAS-Version: DAS/1.5
X-DAS-Status: 200
X-DAS-Capabilities: error-segment/1.0; unknown-segment/1.0; unknown-feature/1.0;

... data follows...
The defined status codes are listed in Table 1.
Code | Meaning |
---|---|
200 | OK, data follows |
400 | Bad command (command not recognized) |
401 | Bad data source (data source unknown) |
402 | Bad command arguments (arguments invalid) |
403 | Bad reference object (reference sequence unknown) |
404 | Bad stylesheet (requested stylesheet unknown) |
405 | Coordinate error (sequence coordinate is out of bounds/invalid) |
500 | Server error, not otherwise specified |
501 | Unimplemented feature |
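A client will normally inspect X-DAS-Status before attempting to parse the body. The following is a minimal sketch of that check; the names `DAS_STATUS`, `DasError` and `das_status` are hypothetical, and the headers are assumed to arrive as a plain dictionary.

```python
# Status codes and meanings copied from Table 1.
DAS_STATUS = {
    200: "OK, data follows",
    400: "Bad command (command not recognized)",
    401: "Bad data source (data source unknown)",
    402: "Bad command arguments (arguments invalid)",
    403: "Bad reference object (reference sequence unknown)",
    404: "Bad stylesheet (requested stylesheet unknown)",
    405: "Coordinate error (sequence coordinate is out of bounds/invalid)",
    500: "Server error, not otherwise specified",
    501: "Unimplemented feature",
}


class DasError(Exception):
    """Raised when a response lacks a usable X-DAS-Status header."""


def das_status(headers):
    """Return (code, meaning) for the X-DAS-Status header of a response."""
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("x-das-status")
    if raw is None:
        raise DasError("response has no X-DAS-Status header")
    code = int(raw.strip().split()[0])
    return code, DAS_STATUS.get(code, "Unknown status code")


print(das_status({"X-DAS-Version": "DAS/1.6", "X-DAS-Status": "403"}))
```

Header names are lower-cased before lookup because HTTP header names are case-insensitive.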
The HTTP/1.1 protocol allows web clients to request byte-level compression of the response by sending the Accept-Encoding request header. Web servers that are capable of it can reply with a Content-Encoding header and a compressed body. Implementors of DAS clients and servers may wish to implement this HTTP feature.
X-DAS-Capabilities
The X-DAS-Capabilities header provides an extensible list of the capabilities that the server provides. This can be used by clients wishing to make use of optional components of the DAS protocol where they are supported, and by those writing experimental extensions to DAS to signal to clients that those extensions are available. Capabilities have the form CapabilityName/Version and are separated by a semicolon and a space, as in "capabilityA/1.0; capabilityB/1.4; capabilityC/1.0". The following standard capabilities are present in the DAS/1.6 protocol:
Capability Name | Description |
---|---|
dsn/1.0 | Deprecate these: The server supports the deprecated dsn request. |
dna/1.0 | The server supports the deprecated dna request. |
types/1.0 | The server supports the basic types request. |
stylesheet/1.1 | The server supports the basic stylesheet request. |
features/1.0 | The server supports the basic features request. |
entry_points/1.0 | The server supports the basic entry_points request. |
error-segment/1.0 | Server will report requests for invalid segments with an <ErrorSegment> response. |
unknown-segment/1.0 | Server will report requests for unknown or unannotated segments with an <UnknownSegment> response. |
unknown-feature/1.0 | Server will report requests for unknown features with an <UnknownFeature> response. |
feature-by-id/1.0 | The features request will accept the CGI parameter "feature_id", enabling the server to look up annotations based on their ID. |
group-by-id/1.0 | The features request will accept the CGI parameter "group_id", enabling the server to look up annotations based on the ID of a group. |
component/1.0 | Deprecate? The features request will return components of the indicated segment when a category type of "component" is requested. |
supercomponent/1.0 | Deprecate? The features request will return supercomponents of the indicated segment when a category type of "supercomponent" is requested. |
sequence/1.0 | The server supports the sequence request. |
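Because capabilities follow the CapabilityName/Version pattern, the header is easy to split into a lookup table. A minimal sketch, with the hypothetical helper name `parse_capabilities`:

```python
def parse_capabilities(header_value):
    """Map each capability name in an X-DAS-Capabilities value to its version."""
    capabilities = {}
    for item in header_value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        name, _, version = item.partition("/")
        capabilities[name] = version
    return capabilities


caps = parse_capabilities(
    "error-segment/1.0; unknown-segment/1.0; unknown-feature/1.0;"
)
print(caps)
print("feature-by-id" in caps)
```

A client can then test for an optional capability, such as feature-by-id, before sending a request that depends on it.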
Reference Object IDs
The ID used by a client or server to refer to a reference object can contain any set of printable characters (including the space character), but not the colon character (":"), which is reserved for separating reference IDs from sequence ranges (see below). The newline, tab and carriage return characters are also reserved for future use.
A data source that uses the colon character for its internal IDs must map this character to another one on the way in and on the way out. For example:
Client request        Server's internal ID               Response to client
gi-123456         --> gi:123456                     ---> gi-123456
gi-123456:1,1000  --> gi:123456 start=1 stop=1000   ---> gi-123456:1,1000
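The mapping in this example can be sketched as follows. The function names are hypothetical, and the sketch assumes that the replacement character ("-") never occurs in the server's internal IDs; otherwise the mapping would not be reversible.

```python
def parse_segment(segment):
    """Split 'ID' or 'ID:start,stop' into (id, start, stop)."""
    ref, sep, coords = segment.partition(":")
    if not sep:
        return ref, None, None  # whole reference object requested
    start, stop = coords.split(",")
    return ref, int(start), int(stop)


def to_internal(public_id):
    """Map a client-facing ID to the server's internal ID (assumed '-' to ':')."""
    return public_id.replace("-", ":")


def to_public(internal_id):
    """Map an internal ID back to the form the client used."""
    return internal_id.replace(":", "-")


ref, start, stop = parse_segment("gi-123456:1,1000")
print(to_internal(ref), start, stop)
print(to_public("gi:123456"))
```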
In general, DAS mandates that a server must respond in the same coordinate system by which it is queried. This is relevant where annotations can potentially exist in several coordinate systems (such as clones and contigs).
The Queries
This section lists the queries recognized by reference and annotation servers. Each of these queries begins with some site-specific prefix, denoted here as PREFIX.
Sources Command
Retrieve the list of data sources for a server.
DSN command has been deprecated in favour of this sources cmd
Scope: Reference and annotation servers.
Command: sources
Format:
PREFIX/das/sources
Description: This query returns the list of data sources that are available from this server. In particular the following information for a DAS server is important:
- The email address of the maintainer of a DAS source
- The coordinate system of the provided data
- Different properties that allow further description of a source
Arguments: none
Response:
The response to the sources command is the "DASSOURCE" XML-formatted document (root element SOURCES):
<?xml version='1.0' encoding='UTF-8' ?>
<?xml-stylesheet type="text/xsl" href="das.xsl"?>
<SOURCES>
  <SOURCE uri="URI" title="title" doc_href="URL" description="description">
    <MAINTAINER email="email address" />
    <VERSION uri="URI" created="date">
      <COORDINATES uri="uri" source="data type" authority="authority" test_range="ID">coordinate string</COORDINATES>
      <CAPABILITY type="das1:command" query_uri="URL" />
      <PROP name="key" value="value" />
    </VERSION>
  </SOURCE>
</SOURCES>
Format:
Element / attribute | Requirement | Description |
---|---|---|
xml-stylesheet | optional | An XSL stylesheet that, for example, allows a browser to display the XML response nicely. I'm not sure whether this should actually be part of the spec - could be confused with the stylesheet command? We should probably highlight the fact that there is a difference and that this is for display in a web browser, not how to display in a client. |
SOURCES | mandatory | The main container for several DAS sources. |
SOURCE | mandatory, one or many | The description of a DAS data source. |
uri | mandatory | A unique URI for the DAS source. |
title, description | mandatory | The title is the nickname under which a DAS server shall be known and displayed in a view. The description is a free-text description of the provided data. |
doc_href | optional | Points to a web site where more information about a DAS source can be obtained. |
MAINTAINER, email | mandatory | The email address of the maintainer of this DAS source. |
VERSION | mandatory | In principle this would allow hosting several versions of a DAS source (with unique URIs) on a server, but in practice most people provide only the server with the latest data. Different versions of the same source should be considered to be equivalent, that is, the latest version is definitive. The created attribute provides the date on which a DAS server was initially set up. For a DAS registration server this is the date on which a DAS server was published. |
COORDINATES | mandatory, one or many | The description of the coordinate system(s) a DAS source operates on. uri - the unique URI for a DAS coordinate system. For a DAS registration server these should be resolvable and give access to more information, e.g. [1] for the UniProt,Protein Sequence coordinate system. source - the data type. This refers to the "physical dimension" of the data. Currently the following categories are available: Chromosome, Clone, Contig, Gene_ID, NT_Contig, Protein Sequence, Protein Structure. authority - the authority, or institution, that assigns the accession code for this namespace. In the case of genome assemblies, the authority that builds the assembly. |
CAPABILITY | mandatory, one or many | The supported DAS command. type - the type of the DAS command. To distinguish DAS/1 from DAS/2 servers, das1: is used before the name of the command. /features?segment=ID needs to be attached. |
PROP | optional, one or many | A free key-value style property that allows more tags to be added to a server. |
Example Responses
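The sketch below shows how a client might read a sources response with Python's standard library. The sample document and all of its values (URIs, email address, source name) are made up for illustration, and the helper name `summarize_sources` is hypothetical; only the element and attribute names come from the format described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; every value here is invented for the example.
SAMPLE_SOURCES = """
<SOURCES>
  <SOURCE uri="example_src" title="example" description="A made-up source">
    <MAINTAINER email="admin@example.org" />
    <VERSION uri="example_src" created="2008-10-01">
      <COORDINATES uri="http://example.org/cs/1" source="Protein Sequence"
                   authority="UniProt" test_range="P15056">UniProt,Protein Sequence</COORDINATES>
      <CAPABILITY type="das1:features" query_uri="http://example.org/das/example/features" />
      <CAPABILITY type="das1:sequence" query_uri="http://example.org/das/example/sequence" />
    </VERSION>
  </SOURCE>
</SOURCES>
"""


def summarize_sources(xml_text):
    """Return one summary dictionary per SOURCE element."""
    summaries = []
    for source in ET.fromstring(xml_text.strip()).findall("SOURCE"):
        maintainer = source.find("MAINTAINER")
        summaries.append({
            "uri": source.get("uri"),
            "title": source.get("title"),
            "email": maintainer.get("email") if maintainer is not None else None,
            "coordinates": [c.text for c in source.iter("COORDINATES")],
            # Strip the "das1:" prefix to leave the bare command name.
            "commands": [cap.get("type", "").split(":", 1)[-1]
                         for cap in source.iter("CAPABILITY")],
        })
    return summaries


for summary in summarize_sources(SAMPLE_SOURCES):
    print(summary)
```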
Entry Points Command
Retrieve the list of reference objects for a data source.
This Entry_Points cmd is now mandatory for reference servers.
Scope: Reference servers.
Command: entry_points
Format:
PREFIX/das/DSN/entry_points
Description: This query returns the list of sequence entry points available and their sizes in base pairs.
Arguments:
Response:
The response to the entry_points command is the "DASEP" XML-formatted document:
Format:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASEP SYSTEM "http://www.biodas.org/dtd/dasep.dtd">
<DASEP>
  <ENTRY_POINTS href="url" version="X.XX">
    <SEGMENT id="id1" start="start1" stop="stop1" type="type" orientation="+">descriptive text</SEGMENT>
    <SEGMENT id="id2" start="start2" stop="stop2" type="type" orientation="+">descriptive text</SEGMENT>
    <SEGMENT id="id3" start="start3" stop="stop3" type="type" orientation="+">descriptive text</SEGMENT>
    ...
  </ENTRY_POINTS>
</DASEP>
The type attribute is now deprecated, because the registry should specify the type in this coordinate system.
<SEGMENT id="id" size="123456">
In this case, the start is implied to be "1" and the stop is implied to be the same as the length.
Note: The result from the entry points request does not carry sufficient information to reconstruct a complex sequence assembly. Instead, use the features request with a category of "component". See Fetching Sequence Assemblies.
Retrieve the DNA Associated with a Subsequence has been deprecated as the sequence cmd below is more commonly used.
Sequence Command
Retrieve all or part of the sequence for a reference object.
Scope: Reference servers.
Command: sequence
Format:
PREFIX/das/DSN/sequence?segment=RANGE[;segment=RANGE...]
Description: This query returns the sequence (nucleotide or protein) corresponding to the indicated segment.
Arguments:
Here is an example of a valid request that uses the segment argument to fetch three independent segments. The last segment is a subsequence:
http://www.ebi.ac.uk/das-srv/uniprot/das/uniprot/sequence?segment=P00280;segment=P15056;segment=P51587:200,300
Response:
The response to sequence is the "DASSEQUENCE" XML-formatted document.
Format:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASSEQUENCE SYSTEM "http://www.biodas.org/dtd/dassequence.dtd">
<DASSEQUENCE>
  <SEQUENCE id="id" start="start" stop="stop">
atttcttggcgtaaataagagtctcaatgagactctcagaagaaaattgataaatattat
taatgatataataataatcttgttgatccgttctatctccagacgattttcctagtctcc
agtcgattttgcgctgaaaatgggatatttaatggaattgtttttgtttttattaataaa
taggaataaatttacgaaaatcacaaaattttcaataaaaaacaccaaaaaaaagagaaa
aaatgagaaaaatcgacgaaaatcggtataaaatcaaataaaaatagaaggaaaatattc
agctcgtaaacccacacgtgcggcacggtttcgtgggcggggcgtctctgccgggaaaat
tttgcgtttaaaaactcacatataggcatccaatggattttcggattttaaaaattaata
taaaatcagggaaatttttttaaattttttcacatcgatattcggtatcaggggcaaaat
tagagtcagaaacatatatttccccacaaactctactccccctttaaacaaagcaaagag
cgatactcattgcctgtagcctctatattatgccttatgggaatgcatttgattgtttcc
gcatattgtttacaaccatttatacaacatgtgacgtagacgcactgggcggttgtaaaa
cctgacagaaagaattggtcccgtcatctactttctgattttttggaaaatatgtacaat
gtcgtccagtattctattccttctcggcgatttggccaagttattcaaacacgtataaat
aaaaatcaataaagctaggaaaatattttcagccatcacaaagtttcgtcagccttgtta
tgtcaaccactttttatacaaattatataaccagaaatactattaaataagtatttgtat
gaaacaatgaacactattataacattttcagaaaatgtagtatttaagcgaaggtagtgc
acatcaaggccgtcaaacggaaaaatttttgcaagaatca
  </SEQUENCE>
</DASSEQUENCE>
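A client-side view of the sequence command can be sketched in two small helpers: one that assembles the request URL from several segments, and one that extracts the sequences from a response. The helper names and the sample response below are hypothetical; whitespace inside the SEQUENCE element is assumed to be layout only and is removed.

```python
import xml.etree.ElementTree as ET


def sequence_url(prefix, dsn, segments):
    """Build a sequence request URL with one segment argument per entry."""
    query = ";".join("segment=" + segment for segment in segments)
    return f"{prefix}/das/{dsn}/sequence?{query}"


def read_sequences(xml_text):
    """Return {segment id: sequence string} from a DASSEQUENCE document."""
    root = ET.fromstring(xml_text.strip())
    return {
        seq.get("id"): "".join((seq.text or "").split())
        for seq in root.iter("SEQUENCE")
    }


# Made-up response used only to exercise the parser.
SAMPLE_SEQUENCE = """
<DASSEQUENCE>
  <SEQUENCE id="example" start="1" stop="20">
atttcttggc
gtaaataaga
  </SEQUENCE>
</DASSEQUENCE>
"""

print(sequence_url("http://www.ebi.ac.uk/das-srv/uniprot", "uniprot",
                   ["P00280", "P15056", "P51587:200,300"]))
print(read_sequences(SAMPLE_SEQUENCE))
```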
Types Command
Retrieve the types of features offered by a data source.
How do we reconcile the way UCSC and Ensembl use the types command? UCSC uses it as a filter for tracks from one source, whereas Ensembl only has one track per source. We can't deprecate this command, as it is used by UCSC. Make the unit of registration in the registry the combination of the source and type(s), and we will make Ensembl honour it and apply the filter.
Scope: Annotation and reference servers.
Command: types
Format:
PREFIX/das/DSN/types [?segment=RANGE] [;segment=RANGE] [;type=TYPE] [;type=TYPE]
Description: This query returns the types of annotation available for a segment of sequence.
Arguments:
If one or more segment arguments are provided, the list of types returned is restricted to the indicated segments. If no segment argument is provided, then all feature types known to the source are returned.
Response:
The document returned from the types request is an XML-formatted "DASTYPES" document. This is a shortened form of the full features format (see below) and is used to summarize the type and number of each annotation. Annotation types can be grouped into segments, or be totaled across the entire database.
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASTYPES SYSTEM "http://www.biodas.org/dtd/dastypes.dtd">
<DASTYPES>
  <GFF version="1.0" href="url">
    <SEGMENT id="id" start="start" stop="stop" type="type" version="X.XX" label="label">
      <TYPE id="id1" method="method" category="category">Type Count 1</TYPE>
      <TYPE id="id2" method="method" category="category">Type Count 2</TYPE>
      ...
    </SEGMENT>
  </GFF>
</DASTYPES>
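Since each TYPE element carries a count, a client can total the counts across segments. The following sketch assumes the element text is either a whole number or empty; the helper name `type_counts` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up response with two segments, for illustration only.
SAMPLE_TYPES = """
<DASTYPES>
  <GFF version="1.0" href="http://example.org/das/example/types">
    <SEGMENT id="1" start="1" stop="1000">
      <TYPE id="exon" category="transcription">12</TYPE>
      <TYPE id="intron" category="transcription">11</TYPE>
    </SEGMENT>
    <SEGMENT id="2" start="1" stop="1000">
      <TYPE id="exon" category="transcription">3</TYPE>
    </SEGMENT>
  </GFF>
</DASTYPES>
"""


def type_counts(xml_text):
    """Total the per-segment annotation counts for each type ID."""
    totals = Counter()
    for segment in ET.fromstring(xml_text.strip()).iter("SEGMENT"):
        for type_element in segment.findall("TYPE"):
            text = (type_element.text or "").strip()
            totals[type_element.get("id")] += int(text) if text.isdigit() else 0
    return totals


print(type_counts(SAMPLE_TYPES))
```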
Features Command
Retrieve the annotations for all or part of a reference object.
Scope: Reference and annotation servers.
Command: features
Format:
PREFIX/das/DSN/features?segment=REF:start,stop[;segment=REF:start,stop...] [;type=TYPE] [;type=TYPE] [;category=CATEGORY] [;category=CATEGORY] [;categorize=yes|no] [;feature_id=ID] [;group_id=ID]
Description: This query returns the annotations across one or more segments of sequence.
Arguments:
Annotations must be returned using the coordinate system in which they were requested. For example, if a contig ID was used to specify the segment, then the annotation endpoints must use contig coordinates. Servers should return annotations that overlap the segment even if they are not completely contained within it. Annotation servers are no longer allowed to return only those annotations that are completely contained within the indicated segment.
This is confusing. Better as: Servers should return annotations which lie wholly or partially within the query segment. For example:
       ------------------- Query
-----   ---  -------  -----------  -----
  A      B      C          D         E
In the above example, the server should return annotations B and C because they lie wholly within the query segment, and annotation D because it lies partially within the query segment. Annotations A and E lie outside the query segment and should not be returned.
If multiple segment arguments are provided and they happen to overlap, then the annotation server may return the same annotation multiple times, possibly using different coordinate systems. It is the responsibility of the client to merge annotations based on the assembly.
Response:
The document returned from the features request is an XML-formatted "DASGFF" document.
Format:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASGFF SYSTEM "http://www.biodas.org/dtd/dasgff.dtd">
<DASGFF>
  <GFF version="1.0" href="url">
    <SEGMENT id="id" start="start" stop="stop" type="type" version="X.XX" label="label">
      <FEATURE id="id" label="label">
        <TYPE id="id" category="category" reference="yes|no">type label</TYPE>
        <METHOD id="id"> method label </METHOD>
        <START> start </START>
        <END> end </END>
        <SCORE> [X.XX|-] </SCORE>
        <ORIENTATION> [0|-|+] </ORIENTATION>
        <PHASE> [0|1|2|-]</PHASE>
        <NOTE> note text </NOTE>
        <LINK href="url"> link text </LINK>
        <TARGET id="id" start="x" stop="y">target name</TARGET>
        <GROUP id="id" label="label" type="type">
          <NOTE> note text </NOTE>
          <LINK href="url"> link text </LINK>
          <TARGET id="id" start="x" stop="y">target name</TARGET>
        </GROUP>
      </FEATURE>
      ...
    </SEGMENT>
  </GFF>
</DASGFF>
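The rule that servers return annotations lying wholly or partially within the query segment reduces to a simple interval test. The sketch below assumes inclusive, 1-based coordinates; the function name and the coordinates given to annotations A to E are made up to mirror the diagram in the features command description.

```python
def overlaps(feature_start, feature_end, query_start, query_end):
    """True if the feature lies wholly or partially within the query segment."""
    return feature_start <= query_end and feature_end >= query_start


# Hypothetical coordinates: query 100..200, annotations laid out as in the diagram.
query = (100, 200)
annotations = {
    "A": (10, 50),     # entirely left of the query
    "B": (110, 120),   # wholly inside
    "C": (140, 170),   # wholly inside
    "D": (190, 250),   # partially inside
    "E": (300, 350),   # entirely right of the query
}

returned = [name for name, (start, end) in annotations.items()
            if overlaps(start, end, *query)]
print(returned)
```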
Each annotation has an ID that is unique to the server and a structured description of its nature and attributes. Annotations may also be associated with Web URLs that provide additional human-readable information about the annotation.
To look up ontologies you could go here: "http://www.ebi.ac.uk/ontology-lookup/"
category is a broad functional category that can be used to filter, group and sort annotations. "Homology", "variation" and "transcribed" are all valid categories. The existence of these categories allows researchers to add new annotation types if the existing list is inadequate, without entirely losing all semantic value. (We also encourage the use of ECO numbers to represent the method of annotation, e.g. ECO:0000032 "inferred from curated blast match to nucleic acid".)
Example
<TYPE id="SO:0000417" category="inferred from reviewed computational analysis (ECO:0000053)">polypeptide_domain</TYPE>
Examples
<DASGFF>
  <GFF version="1.01" href="http://das.sanger.ac.uk:80/das/pfam/features">
    <SEGMENT id="P08487" version="d1dfb367c112d4820eeffe4eab1d6487" start="1" stop="1291">
      <FEATURE id="C2" label="C2:1090-1177">
        <TYPE id="SO:0000417" category="inferred from reviewed computational analysis (ECO:0000053)">polypeptide_domain</TYPE>
        <START>1090</START>
        <END>1177</END>
        <METHOD id="Pfam">Pfam-A</METHOD>
        <SCORE>1.7e-18</SCORE>
        <NOTE>HMMER Version: 2.3.2</NOTE>
        <LINK href="http://pfam.sanger.ac.uk/family?entry=PF00168">C2</LINK>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
We are proposing to remove Linking to a Feature from the spec, as implementations tend to use a separate web server anyway
Scope: Annotation servers.
Command: link
Format:
PREFIX/das/DSN/link?field=TAG;id=ID
Description: This query can be issued in order to retrieve further human-readable information about an annotation. It is best to pass this URL directly to a browser, as the type of the returned data is not specified (it will typically be an HTML file, but any MIME format is allowed).
Arguments:
Response: A web page.
Retrieving the Stylesheet
Scope: Annotation servers.
Command: stylesheet
Format:
PREFIX/das/DSN/stylesheet
Description: This query can be issued to an annotation server in order to retrieve the server's recommendations on formatting annotations retrieved from it. These recommendations are not normative. A viewer is free to use any display format it chooses.
Arguments: None.
Response:
This document is intended to provide hints to the annotation display client. It maps feature categories and individual types to a series of glyphs known to the display client.
Format:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASSTYLE SYSTEM "http://www.biodas.org/dtd/dasstyle.dtd">
<DASSTYLE>
  <STYLESHEET version="X.XX">
    <CATEGORY id="default">
      <TYPE id="default">
        <GLYPH zoom="high">
          <ID>
            <ATTR>value</ATTR>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
        <GLYPH zoom="medium">
          <ID>
            <ATTR>value</ATTR>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
        <GLYPH zoom="low">
          <ID>
            <ATTR>value</ATTR>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
      </TYPE>
    </CATEGORY>
    <CATEGORY id="group">
      <TYPE id="group_id1">
        <GLYPH zoom="high">
          <ID>
            <ATTR>value</ATTR>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
        ...
      </TYPE>
    </CATEGORY>
    <CATEGORY id="category1">
      <TYPE id="default">
        <GLYPH>
          <ID>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
      </TYPE>
      <TYPE id="type1">
        <GLYPH>
          <ID>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
      </TYPE>
      <TYPE id="type2">
        <GLYPH>
          <ID>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
      </TYPE>
      ...
    </CATEGORY>
    <CATEGORY id="category2">
      <TYPE id="default">
        <GLYPH>
          <ID>
            <ATTR>value</ATTR>
            ...
          </ID>
        </GLYPH>
      </TYPE>
      ...
    </CATEGORY>
    ...
  </STYLESHEET>
</DASSTYLE>
Here is a short stylesheet example:
...
<CATEGORY id="Similarity">
<TYPE id="default">
<GLYPH>
<LINE>
<FGCOLOR>gray</FGCOLOR>
</LINE>
</GLYPH>
</TYPE>
<TYPE id="NN">
<GLYPH>
<BOX>
<HEIGHT>4</HEIGHT>
<FGCOLOR>black</FGCOLOR>
<BGCOLOR>red</BGCOLOR>
</BOX>
</GLYPH>
</TYPE>
<TYPE id="NP">
<GLYPH>
<TOOMANY>
<HEIGHT>4</HEIGHT>
<FGCOLOR>black</FGCOLOR>
<BGCOLOR>blue</BGCOLOR>
</TOOMANY>
</GLYPH>
</TYPE>
<TYPE id="PN">
<GLYPH>
<BOX>
<HEIGHT>3</HEIGHT>
<FGCOLOR>blue</FGCOLOR>
<BGCOLOR>green</BGCOLOR>
</BOX>
</GLYPH>
</TYPE>
<TYPE id="PP">
<GLYPH>
<HEIGHT>4</HEIGHT>
<FGCOLOR>gray</FGCOLOR>
</GLYPH>
</TYPE>
</CATEGORY>
...
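A display client needs to turn a (category, type) pair into a glyph and its attributes. The sketch below does this for a fragment modelled on the example above. The helper name `glyph_for` is hypothetical, and falling back to the TYPE with id "default" in the same category is an assumption about how a client might use default entries, not a rule stated in this document.

```python
import xml.etree.ElementTree as ET

# Fragment based on the "Similarity" example, wrapped in a root element.
SAMPLE_STYLESHEET = """
<STYLESHEET>
  <CATEGORY id="Similarity">
    <TYPE id="default">
      <GLYPH><LINE><FGCOLOR>gray</FGCOLOR></LINE></GLYPH>
    </TYPE>
    <TYPE id="NN">
      <GLYPH><BOX><HEIGHT>4</HEIGHT><FGCOLOR>black</FGCOLOR><BGCOLOR>red</BGCOLOR></BOX></GLYPH>
    </TYPE>
  </CATEGORY>
</STYLESHEET>
"""


def glyph_for(root, category, type_id):
    """Return (glyph name, attributes) for a type, or None if nothing matches."""
    for cat in root.iter("CATEGORY"):
        if cat.get("id") != category:
            continue
        types = {t.get("id"): t for t in cat.findall("TYPE")}
        chosen = types.get(type_id)
        if chosen is None:
            chosen = types.get("default")  # assumed fallback
        if chosen is None:
            return None
        shape = chosen.find("GLYPH")[0]  # first child names the glyph
        return shape.tag, {attr.tag: attr.text for attr in shape}
    return None


root = ET.fromstring(SAMPLE_STYLESHEET.strip())
print(glyph_for(root, "Similarity", "NN"))
print(glyph_for(root, "Similarity", "XX"))
```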
Groups can also have stylesheet entries. If present, they are located in the category named "group". Typically a group will be associated with the "line" glyph, which, as described below, draws connections between the members of a group.
A sample stylesheet used for the WormBase DAS server can be found at http://www.biodas.org/documents/sample_stylesheet.xml.
Glyphs and Groups
Glyphs and their attributes are typically applied to individual features. However, they can be applied to entire groups as well (via the <GROUP> type attribute). In this case, the glyph will apply to the connecting regions between the components of the group. For example, to indicate that the exons in a "transcript" group should be drawn with a yellow box, that the UTRs should be drawn with a blue box, and that the connections between exons should be drawn with a hat-shaped line:
<CATEGORY id="Transcription">
  <TYPE id="exon">
    <GLYPH>
      <BOX>
        <BGCOLOR>yellow</BGCOLOR>
      </BOX>
    </GLYPH>
  </TYPE>
  <TYPE id="utr">
    <GLYPH>
      <BOX>
        <BGCOLOR>blue</BGCOLOR>
      </BOX>
    </GLYPH>
  </TYPE>
</CATEGORY>
<CATEGORY id="group">
  <TYPE id="transcript">
    <GLYPH>
      <LINE>
        <FGCOLOR>black</FGCOLOR>
        <LINE_STYLE>hat</LINE_STYLE>
      </LINE>
    </GLYPH>
  </TYPE>
  ...
Fetching Sequence Assemblies
We are proposing to deprecate this as not many people use it????
Reference servers, but not annotation servers, must represent and serve genome assemblies. The components of an assembly are treated as a set of features with a type category attribute of "component" and a reference attribute of "yes". Intermediate components of the assembly will also have a subparts attribute of "yes". Components that are the parents of the reference sequence in the assembly have a category attribute of "supercomponent".
Moving Down in an Assembly
For those components that have subparts, the start and end of the feature give the feature's position in the requested segment's coordinate system, and the id, start and stop of the <TARGET> element give the feature's position in its native coordinates. For example:
    1        200         400                 1000
    +--------+-----------+-------------------+  chr22

    1        200 220   1 20                  620
    +--------+----  A  --+-------------------+  B

       1     80          280   400
       ------+-----------+--------  C
       ===================  C.1
                 =============  C.2

A request for this assembly will look like the following:
http://www.wormbase.org/db/das/elegans/features?segment=chr22:1,1000;category=component
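As a minimal sketch, a client could assemble a request of this shape as follows. The helper name `features_url` is hypothetical and not part of DAS; only the URL layout (a `segment` argument and an optional `category` argument, separated by a semicolon) is taken from the example request above.

```python
def features_url(base, segment, start, stop, category=None):
    """Build a DAS features request URL (hypothetical helper, not part of the spec)."""
    # Arguments are joined with ';' to match the example request in the text.
    args = [f"segment={segment}:{start},{stop}"]
    if category is not None:
        args.append(f"category={category}")
    return f"{base}/features?" + ";".join(args)


url = features_url(
    "http://www.wormbase.org/db/das/elegans", "chr22", 1, 1000, "component"
)
print(url)
```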
The reference server will return the following (abbreviated) document:
    <SEGMENT id="chr22" start="1" stop="1000">
      <FEATURE id="chr22">
        <START>1</START>
        <STOP>1000</STOP>
        <TYPE id="Contig" category="component" reference="yes" superparts="no" subparts="yes">chr 22</TYPE>
        <TARGET id="chr22" start="1" stop="1000">chr22</TARGET>
        ...
      </FEATURE>
      <FEATURE id="Contig:A">
        <START>1</START>
        <STOP>200</STOP>
        <TYPE id="Contig" category="component" reference="yes" superparts="yes" subparts="no">a contig</TYPE>
        <TARGET id="A" start="1" stop="200">Contig A</TARGET>
        ...
      </FEATURE>
      <FEATURE id="Contig:B">
        <START>400</START>
        <STOP>1000</STOP>
        <TYPE id="Contig" category="component" reference="yes" superparts="yes" subparts="no">a contig</TYPE>
        <TARGET id="B" start="20" stop="620">Contig B</TARGET>
        ...
      </FEATURE>
      <FEATURE id="Contig:C">
        <START>200</START>
        <STOP>400</STOP>
        <TYPE id="Contig" category="component" reference="yes" superparts="yes" subparts="yes">a contig</TYPE>
        <TARGET id="C" start="80" stop="280">Contig C</TARGET>
        ...
      </FEATURE>
    </SEGMENT>
Notice that contig C is marked as having subparts. This is an indication to the client that it should emit a features request that includes segment C:80,280 in order to discover its components (C.1 and C.2). Notice also that chr22 appears as a component of itself with the attributes superparts="no" and subparts="yes". This is a side effect of providing information about the component parent.
Moving Up in an Assembly
It is also desirable for a client to fetch the parent of a segment, so as to accommodate the situation in which the user enters the browser at a contig or sequenced clone, and wants to "zoom out." This situation is complicated by rough draft issues, in which a single rough draft sequence segment may have multiple parents, and some sections of the segment may not belong in the assembly at all. For example:
            A        B       C        D
    contig21----------->   <-----------contig100
            |        |     /        /
            |        |   /        /
    Acc A   ---------------------
            a        b c        d

Here, the segment "Acc A" contains two fragments, one of which is located on contig21 and the other on contig100. To retrieve this information, the client requests the category supercomponent. For segments that are in the middle of the assembly, one or more assembly parents will be returned in addition to subcomponents. The parent <START>, <STOP> and <ORIENTATION> tags are presented in the coordinate system of the requested segment, as always. The start and stop attributes of the <TARGET> tag denote the corresponding segment in the coordinate system of the parent. As always, start is less than stop, for both the feature and the target.
    <SEGMENT id="Acc A" start="1" stop="1000">
      <FEATURE id="contig21_goldenpath_map">
        <START>a</START>
        <STOP>b</STOP>
        <ORIENTATION>+</ORIENTATION>
        <TYPE id="Contig" category="supercomponent" reference="yes" superparts="yes" subparts="yes">a contig</TYPE>
        <TARGET id="contig21" start="A" stop="B"></TARGET>
      </FEATURE>
      <FEATURE id="contig100_goldenpath_map">
        <START>c</START>
        <STOP>d</STOP>
        <ORIENTATION>-</ORIENTATION>
        <TYPE id="Contig" category="supercomponent" reference="yes" superparts="yes" subparts="yes">a contig</TYPE>
        <TARGET id="contig100" start="D" stop="C"></TARGET>
      </FEATURE>
    </SEGMENT>
To continue following the parents upward in the assembly, the client will issue further features requests for the target IDs, in this case "contig21" and "contig100". In the general case, following parents will project the requested segment onto a discontinuous set of regions, potentially on different chromosomes. The client may wish to alert the user and refuse to proceed further when it encounters a segment with multiple parents.
Feature Types and Categories
This is a list of generic feature categories and specific feature types within them. This list was derived from the features currently exported by ACeDB/GFF and is not comprehensive. Suggestions for modifications, additions and deletions are welcomed.
component
This category indicates that the feature is a child component of the reference sequence in the current assembly. When combined with the reference="yes" attribute, this indicates that the feature can be used as a reference point to retrieve subfeatures contained within it (including subcomponents).
supercomponent
This category indicates that the feature is the parent of the reference sequence in the current assembly. When combined with the reference="yes" attribute, this indicates that the feature can be used as a reference point to retrieve features that completely contain the selected range of the reference sequence.
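As a rough sketch of how a client might act on the component category when moving down an assembly, the Python below parses an abbreviated features document (adapted from the chr22 example in the Fetching Sequence Assemblies section) and lists the segments worth requesting next. The function names `follow_up_segments` and `to_target_coord` are hypothetical, the "..." placeholders of the original example are omitted so that the sample parses, and the coordinate mapping assumes a forward-oriented component.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample adapted from the chr22 example; not a complete DAS response.
SAMPLE = """
<SEGMENT id="chr22" start="1" stop="1000">
  <FEATURE id="chr22">
    <START>1</START>
    <STOP>1000</STOP>
    <TYPE id="Contig" category="component" reference="yes" superparts="no" subparts="yes">chr 22</TYPE>
    <TARGET id="chr22" start="1" stop="1000">chr22</TARGET>
  </FEATURE>
  <FEATURE id="Contig:B">
    <START>400</START>
    <STOP>1000</STOP>
    <TYPE id="Contig" category="component" reference="yes" superparts="yes" subparts="no">a contig</TYPE>
    <TARGET id="B" start="20" stop="620">Contig B</TARGET>
  </FEATURE>
  <FEATURE id="Contig:C">
    <START>200</START>
    <STOP>400</STOP>
    <TYPE id="Contig" category="component" reference="yes" superparts="yes" subparts="yes">a contig</TYPE>
    <TARGET id="C" start="80" stop="280">Contig C</TARGET>
  </FEATURE>
</SEGMENT>
"""


def follow_up_segments(segment_xml):
    """Return 'id:start,stop' strings for components that report subparts."""
    root = ET.fromstring(segment_xml)
    segments = []
    for feature in root.iter("FEATURE"):
        ftype = feature.find("TYPE")
        target = feature.find("TARGET")
        if ftype is None or target is None:
            continue
        # Skip the requested segment appearing as a component of itself.
        if target.get("id") == root.get("id"):
            continue
        if ftype.get("category") == "component" and ftype.get("subparts") == "yes":
            segments.append(
                f"{target.get('id')}:{target.get('start')},{target.get('stop')}"
            )
    return segments


def to_target_coord(feature_start, target_start, position):
    """Map a segment-coordinate position into the component's native coordinates.

    Assumes forward orientation; reverse-oriented components need the mirror mapping.
    """
    return target_start + (position - feature_start)


print(follow_up_segments(SAMPLE))
print(to_target_coord(400, 20, 500))
```

Each returned string can be used as the segment argument of a further features request. The second call maps position 500 on chr22 into contig B, whose feature starts at 400 and whose target starts at 20.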
translation
The translation category is used for features that relate to regions of the sequence that are translated into proteins. Features that relate to transcription are separate (see below). Features:
It is recommended, but not required, that the <FEATURE> section contain <LINK> and/or <NOTE> tags that provide further information on the translation feature.
transcription
The transcription category is used for features that relate to regions of the sequence that are transcribed into RNA. Features:
It is recommended, but not required, that the <FEATURE> section contain <LINK> and/or <NOTE> tags that provide further information on the transcription feature.
variation
The variation category is used for features that relate to regions of the sequence that are polymorphic. Features:
It is recommended, but not required, that the <FEATURE> section contain <LINK> and/or <NOTE> tags that provide further information on the variation.
structural
The structural category is used for features that relate to mapping, sequencing and assembly, as well as for various landmarks that carry no intrinsic biological information. Features:
It is recommended, but not required, that the <FEATURE> section contain <LINK> and/or <NOTE> tags that provide further information on the structural feature.
similarity
The similarity category is used for areas that are similar to other sequences. Similarity features should have a <METHOD> tag that indicates the algorithm used for the sequence comparison, and a <TARGET> tag that indicates the target of the match. Features:
repeat
The repeat category is used for areas that contain repetitive DNA. This category is used both for low-complexity regions, such as microsatellites, and for more biologically interesting features, such as transposon insertion sites. Features:
It is recommended, but not required, that the <FEATURE> section contain <LINK> and/or <NOTE> tags that provide further information on the repetitive element.
experimental
The experimental category is a catchall used to flag areas where there is interesting experimental data of one sort or another. It is intended for use with high-throughput functional genomics work, such as knockouts or insertional mutagenesis screens. Features:
It is recommended, but not required, that the <FEATURE> section contain <LINK> and/or <NOTE> tags that provide further information on the nature of the experimental data.
Glyph Types
This section describes a set of generic "glyphs" that can be used by sequence display programs to display the position of features on a sequence map. The annotation server may use these glyphs to send display suggestions to the viewer via the stylesheet document. The current set of glyph ID values is:
Each glyph has a set of attributes associated with it. Attribute values come in the following flavors:
Some attributes are shared by all glyphs. Others are glyph-specific. The following attributes are common to all glyphs:
ARROW
A double-headed arrow with an axis either orthogonal or parallel to the sequence map. Attributes:
ANCHORED_ARROW
An arrow that has an arrowhead at one end, and an "anchor" (typically a diamond or line) at the other. The arrow points in the direction indicated by the <ORIENTATION> tag. Attributes:
BOX
A rectangular box. Attributes:
CROSS
A cross "+". Commonly used for point mutations and other point-like features. Attributes: (no glyph-specific attributes)
DOT
A dot. Commonly used for point mutations and other point-like features. Attributes: (no glyph-specific attributes)
EX
"X" marks the spot. Commonly used for point mutations and other point-like features. Attributes: (no glyph-specific attributes)
HIDDEN
A feature that is invisible, intended to support semantic zooming schemes in which a feature is hidden at particular zooms. Attributes: none.
LINE
A line. Lines are equivalent to arrows with both the northeast and southwest attributes set to "no". Attributes:
SPAN
A spanning region. The recommended representation is a horizontal line with vertical lines at each end. Attributes: (no glyph-specific attributes)
TEXT
A bit of text. Attributes:
PRIMERS
Two inward-pointing arrows connected by a line of a different color. Used for showing primer pairs and a PCR product. The length of the arrows is meaningless. There are no glyph-specific attributes, but in this context the foreground color is the color of the arrows, and the background color is the color of the line that connects them.
TOOMANY
Too many features to be shown. Recommended for use in consolidating sequence homology hits. The recommended visual presentation is a set of overlapping boxes. Attributes:
TRIANGLE
A triangle. Commonly used for point mutations and other point-like features. The triangle is always drawn in the center of its range, but its width and height can be controlled by HEIGHT and LINEWIDTH respectively. Attributes:
Other Issues
The distributed annotation system must have a mechanism for detecting and resolving version skew across reference and annotation servers. Although one such mechanism is currently incorporated into the ACeDB-based prototype, it is largely untested and hence not yet a part of the DAS standard.
Changes
Last modified: 08 Oct 2008