A Distributed Annotation Project

A proposal for a MS degree in Computer Science
Robin Dowell
Washington University in St. Louis
December 1999

The DAS website is at: http://biodas.org/ and contains the full DAS specification.

Introduction

Overview

One goal of any genome project is the elucidation of the primary sequence of DNA contained within a given species. While the availability of the primary sequence itself is valuable, it does not reach its full potential until it has been annotated. Generally defined, annotation is descriptive information or commentary added to text, in this case genomic sequence. As the sequencing projects proceed and finish, the focus shifts toward annotation [30].

Full genome-scale annotation is a difficult problem. Some technical challenges include the sheer volume of data, the heterogeneous and growing types of annotations, the time sensitivity of searches, and the need to present the information in an integrated graphical fashion. A wealth of information is contained in the laboratories and individuals within a research community. Each laboratory wants to use its field of expertise to record insights about a portion of the primary sequence. Without a mechanism for collecting, recording, and disseminating this community-based annotation, a valuable source of information is severely diminished [39].

We propose to design and develop a distributed annotation system (DAS), allowing interested laboratories to develop and maintain annotations which are readily accessible to the community at large. The system needs to be easy to use, readily accessible by the community, and capable of representing annotations graphically.

Background

Annotation can greatly enhance the biological value of sequence. Initially, annotation was done primarily by hand, a tedious process that nevertheless produced an extremely valuable resource. A variety of systems quickly developed for automatic annotation, including Genotator [22], PEDANT [20], the Institute for Genomic Research (TIGR) annotation system [29], MagPie [21], BEAUTY [59], and GeneQuiz [47]. Yet concern over the quality of data, particularly annotations, has been expressed by many [35,26,31,44]. The availability of automatic annotation tools, together with the quality concerns about current annotation, prompted Wheelan and Boguski [57] to suggest that annotation should not be stored in the database but rather should be calculated on demand.

This solution has proven to be unsatisfactory. Even in the best automated environments, high quality data sources require some human intervention in the annotation process. Experimental results are a critical component of annotation, an aspect overlooked by the Wheelan and Boguski definition. In addition, researchers ``want communication at least as much as they want information access'' [33]. This communication is informal and often takes the form of an annotation. To maintain quality, data sources need to be heavily curated to reflect new developments, additional knowledge, and continuing research efforts.

Databases

Currently most annotation is handled by centralized databases. These data sources struggle between providing a rich source of annotation and providing a complete resource. Large comprehensive databases are excellent canonical repositories of sequence, but are less adept at dealing with annotation. Examples of large databases include GenBank [6], the Genome Sequence DataBase (GSDB) [50], EMBL [54], the Integrated Genomic Database (IGD) [46], the Protein Data Bank (PDB) [8], Genome Topographer [15], and the Genome Database (GDB) [32]. Smaller databases frequently arise to address the need for specialization within a particular community. Examples include A C. elegans Database (ACeDB) [16], FlyBase [13], the Saccharomyces Genome Database (SGD) [11], CyanoBase [42], ProDom [14], Online Mendelian Inheritance in Man (OMIM) [45], and the Ribosomal Database Project (RDP) [36]. These smaller databases sometimes suffer from a lack of accessibility, weak query mechanisms, and poor scalability. The proliferation of independent heterogeneous databases further complicates locating and disseminating relevant information.

Currently most databases encourage users to submit annotations and changes to the data centers for inclusion. A curation group then reviews the information and decides what is to be incorporated and how. The centralized nature of this system limits community involvement. Consequently, even in tight-knit communities, many useful annotations maintained in individual laboratories never get incorporated into the official release. Thus far, attempts to improve community involvement have centered around consolidating current database systems. Classical approaches to managing heterogeneous databases include connecting them using the world wide web, organizing them into database federations and data warehouses, or dumping their data into centralized data repositories [37,27].

Many of the biologically relevant databases provide WWW connectivity. For example, Proteome Inc.'s WormPD [24] provides links to AceBrowser, GenBank, the Protein Information Resource (PIR) [5], and Swiss-Prot [4] when available. Links are a one-way navigation tool and require high maintenance. Using the information contained in these cross-linked systems requires familiarity with multiple database systems. These limitations create a barrier to efficient information dissemination and usage [27].

Database federations and data warehouses are similar methods which entail developing a global schema, the basic underlying structure, of the component databases. The principal difference between a federation and a warehouse is the way they integrate data. A federation uses an approach known as lazy integration, where data are retrieved, processed, and returned after each query. In contrast, a warehouse uses an eager approach, retrieving and integrating data in advance of a query. The tradeoff between database federations and warehousing is one of query performance versus data ``freshness''. The need for a strict common schema has led to social and technological difficulties in building and maintaining these systems [58]. One such system in molecular biology is the Sequence Retrieval System (SRS) [18]. SRS deals with semistructured ASCII text through parsing and indexing mechanisms which require a controlled vocabulary (schema). The indices are warehoused but the data are federated, creating a hybrid system.
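The lazy versus eager distinction can be illustrated with a minimal sketch. The class names, the helper make_source, and the placeholder records below are all invented for illustration and do not correspond to any of the systems cited above. The sketch assumes each component database is simply a mapping from a key to a list of records.

```python
from typing import Callable, Dict, List

Table = Dict[str, List[str]]


def make_source(table: Table) -> Callable[[str], List[str]]:
    """Wrap a component database as a function queried at request time."""
    def query(key: str) -> List[str]:
        return list(table.get(key, []))
    return query


class Federation:
    """Lazy integration: every lookup contacts each component source."""

    def __init__(self, sources: List[Callable[[str], List[str]]]) -> None:
        self.sources = sources

    def lookup(self, key: str) -> List[str]:
        results: List[str] = []
        for source in self.sources:
            results.extend(source(key))
        return results


class Warehouse:
    """Eager integration: records are copied and merged before any query."""

    def __init__(self, tables: List[Table]) -> None:
        self.index: Table = {}
        for table in tables:
            for key, records in table.items():
                self.index.setdefault(key, []).extend(records)

    def lookup(self, key: str) -> List[str]:
        return list(self.index.get(key, []))


# Two made-up component databases keyed by a placeholder gene name.
db_a: Table = {"gene-1": ["map position (A)"]}
db_b: Table = {"gene-1": ["phenotype note (B)"]}

federation = Federation([make_source(db_a), make_source(db_b)])
warehouse = Warehouse([db_a, db_b])

# A component database is updated after the warehouse was built.
db_b["gene-1"].append("new allele (B)")

fresh = federation.lookup("gene-1")  # contacts the sources, sees the update
stale = warehouse.lookup("gene-1")   # answers from the earlier snapshot
```

The federation pays the cost of contacting every source on each query but returns the new record, while the warehouse answers from its own index and misses it until it is rebuilt.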

Another method of data integration is through data repositories, monolithic databases open to insertion and/or modification by the community at large [1,9]. In principle, making a database available to community annotation is an open and democratic method of removing the curation bottleneck. In practice, it raises tremendous problems with annotation quality, accountability, data ownership, and data integrity [17]. One of the earliest repository systems was the Worm Community System v2.0 (WCS) [49]. This system allowed community annotations to be written to a central database. However, the WCS failed to gain wide acceptance in the community due to competition from another database system, ACeDB, and from the emerging WWW, among other factors.

C. elegans Databases

A C. elegans database (ACeDB) is one particularly successful community database [16]. It has served as the central repository of phenotyping, bibliographic, mapping, and sequencing information for the Caenorhabditis elegans community since 1990 [55]. The ACeDB database provides a variety of domain specific advantages over more traditional systems. ACeDB allows positional queries through its graphical interface. By smooth transitions between relevant graphical views of the data, annotations are displayed at their appropriate magnification. For these reasons, it has been adopted by a large number of organism-specific communities (a complete list is available at: http://genome.cornell.edu/).

Over time, steps have been taken to overcome many of ACeDB's initial limitations. In 1995, a text-ACE (TACE) [40] was introduced to allow shell scripting interactions. In the years following, WebAce [56] and Ace.pm (Steve Rozen, unpublished) provided richer scripting interaction with the ACE data. More recently, Lincoln Stein has developed two complementary libraries, JADE [51] and AcePerl [52]. They utilize the power of the new ACE servers to provide access directly to ACE objects from Java and Perl, respectively. The database-specific details are hidden by these libraries, providing a much needed level of abstraction. AceBrowser [53], a set of common gateway interface (CGI) scripts, was developed by Lincoln Stein from AcePerl. It provides some of ACeDB's functionality through the WWW.

Visualization

Visual presentation and manipulation of data is a central theme in biology. Much is said about automated annotation of sequence data and warehousing of databases, but in the end much if not most of the useful results derived from sequence data are the result of informative visualizations [48]. It is the acquired understanding of a sequence that may guide many months of experimental work. Quality visualization tools must be readily accessible to the molecular and cellular biologist who may be untrained in computer technology [28].

Ideally, sequence annotations should be presented graphically and in an interactive display. This allows the user to adjust the granularity of information presented and to explore by requesting more details on regions of particular interest. A variety of annotation visualization tools already exist, including Entrez [19], Chromoscope [6], Genome Topographer [15], and Anubis [41]. These tools differ in their portability, file formats, map presentation, and ease of use. At the other end of the spectrum are independent display components. The bioWidgets Consortium [12] and Neomorphic's [23] Java package known as the Genome Software Development Kit (GSDK) are just two examples. These tools simplify programming of visualization components.

The Annotation Question

Annotation is not a task specific to the biological community. Feedback, critiques, and exchanges are an integral part of any communication process. In 1945, Vannevar Bush's proposed Memex machine focused on annotation through ``trail blazing'' [10]. The original hypertext specification [7] included a mechanism for active feedback, allowing browsers to annotate by means of annotation servers. The original Mosaic browser contained an ``Annotate'' feature which allowed users to save comments on a particular page to their local disk. The goal of the WWW designers was to facilitate active two-way communication rather than passive surfing, but for a number of technical reasons, this specification was not widely adopted.

One group particularly interested in annotation is the digital libraries community. Studies of annotation [38,39] demonstrate that in general annotation has value and users are typically aware of what kinds of material they trust as annotation. Libraries exist to serve the research needs of their constituents. Consequently, digital libraries aim to foster informal collaborations and communication through fluid and transient materials, including annotations.

The collective experience with traditional libraries has created an atmosphere directed toward slowly changing materials. Yet it should be noted that ``nothing in the nature of digital technology mandates that a digital library should include only rarely changing, long-lasting documents'' [33]. In general, document management systems do not deal well with versions or custom documents. Yet annotations, as a communication device, are expected to be somewhat transient. Versioning of both the reference and the annotation is critical in fluid environments to avoid skew, the condition resulting when the annotation refers to a different version of the reference document than currently available.
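As a minimal sketch of how versioning allows skew to be detected, assume that every annotation records the identifier and version of the reference it was written against. The field names, the clone identifier, and the integer version numbers below are hypothetical and are not drawn from any existing system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    reference_id: str        # hypothetical identifier of the annotated document
    reference_version: int   # version of the reference the annotator saw
    start: int
    end: int
    note: str


def is_skewed(annotation: Annotation, current_versions: Dict[str, int]) -> bool:
    """True when the annotation refers to a version other than the current one.

    An unknown reference is also treated as skewed, since its coordinates
    cannot be trusted.
    """
    current = current_versions.get(annotation.reference_id)
    return current != annotation.reference_version


def usable(annotations: List[Annotation],
           current_versions: Dict[str, int]) -> List[Annotation]:
    """Keep only the annotations written against the current reference."""
    return [a for a in annotations if not is_skewed(a, current_versions)]


current_versions = {"CLONE1": 3}
annotations = [
    Annotation("CLONE1", 3, 100, 172, "written against the current version"),
    Annotation("CLONE1", 2, 450, 450, "written against an older version"),
]
kept = usable(annotations, current_versions)
```

Here only the first annotation is kept. A fuller system might try to remap the coordinates of a skewed annotation rather than discard it, which this sketch does not attempt.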

The desire for annotation of webpage content has resurfaced recently. Two products have emerged for accomplishing this task. Third Voice [25] is a JavaScript plug-in for Internet Explorer which allows users to annotate webpages by depositing their public and group comments in a centralized annotation database. Alternatively, private annotations can be kept on their local machine. A second group is working on an open source method of annotation called Crit [60]. This method uses a mediation server to retrieve and combine a webpage with its available annotations. This server then presents to a regular web browser a ``value-added'' version of the original hypertext documents. As with the Third Voice model, annotations are kept in a centralized data repository.

System Design

With the rise of computational biology, high throughput annotation is now possible within many laboratories. These laboratories can often annotate entire genomes relatively quickly and efficiently. The difficulty of maintaining high quality annotation is expected to be exacerbated as more laboratories contribute to the annotation process. In addition, valuable discussion, including dissension, is frequently lost in the current process. A new system is needed to address the annotation problem.

In collaboration with Lincoln Stein, I propose to design and develop a distributed annotation system (DAS) which will improve the community's ability to annotate genomic data. With the recent completion of the C. elegans genome, annotation efforts in this community are accelerating rapidly. This makes the C. elegans data a good model for the new system.

The DAS design is modeled after the current WWW. Web browsers are lightweight and are available on all platforms. Web pages are written in a simple standardized language, the hypertext markup language (HTML). This language can be produced in a text editor or by programs. A simple addressing scheme, uniform resource locators (URLs), identifies useful servers. Search engines and hot lists support the rapid location of relevant information. In the DAS design, a user would select from an annotation directory which annotation sources to view. Then the genome information and annotations would be combined in a ``layered'' fashion within the annotation viewer, much like layering transparencies. The principal change to the WWW design is the concept of multiple layers (URLs) integrated into a single view.
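The layering idea can be sketched as follows, assuming each selected annotation source has already returned a list of features for the reference sequence. The Feature type, the source names, and the coordinates are hypothetical placeholders and do not reflect the DAS specification.

```python
from typing import Dict, List, NamedTuple, Tuple


class Feature(NamedTuple):
    start: int
    end: int
    label: str


def layered_view(layers: Dict[str, List[Feature]],
                 start: int, end: int) -> List[Tuple[str, Feature]]:
    """Combine features from several sources into one view of [start, end].

    Each feature stays tagged with the layer it came from, so the viewer can
    still draw and attribute the layers separately.
    """
    view: List[Tuple[str, Feature]] = []
    for source, features in layers.items():
        for feature in features:
            # Keep any feature that overlaps the requested region.
            if feature.start <= end and feature.end >= start:
                view.append((source, feature))
    # Order by position along the sequence, then by source name.
    view.sort(key=lambda item: (item[1].start, item[0]))
    return view


# Layers chosen by the user; names and coordinates are made up.
layers = {
    "sequencing-center": [Feature(100, 900, "gene model")],
    "trna-lab": [Feature(150, 222, "tRNA"), Feature(5000, 5072, "tRNA")],
}
view = layered_view(layers, 1, 1000)
```

The second tRNA falls outside the requested region and is omitted, while the remaining features from both sources appear together in positional order.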

This system is best described as a multi-database system. Unlike federated and warehouse systems, which require a strict controlled vocabulary, this system requires only a semantic-free schema. Integration is instead achieved through a common display language. The system emphasizes the visualization of annotations rather than complex queries. The language will emphasize ``typesetting'' rather than the content of the document. In this way the system should be easily extensible to new annotation types.

Architecture

The basic system is composed of a genome server, one or more annotation servers, an annotation directory server, and an annotation viewer. Analysis of this architecture is presented in Appendix A.

The genome server is responsible for serving genome maps, sequences, and information related to the sequencing process. Initially there will be a single server, but mirror sites will be added as necessary.

Annotation servers are responsible for responding to requests on a region of a clone. A primary annotation server will be maintained by the sequencing center. Third party annotation servers can be built and maintained by any laboratory. Data sources need not have the same schema, but only communicate through a common language. This language will be XML (Extensible Markup Language) based. Versioning will be an integral part of the database reference, allowing for automatic handling of skew in most instances.
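To make the exchange concrete, the sketch below parses a response that an annotation server might return for a region of a clone. The element and attribute names here are invented purely for illustration and are not taken from the DAS specification, which should be consulted for the actual format. The only assumptions carried over from the text are that the response is XML and that it carries the version of the reference it describes.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple, Union

# Hypothetical response; tag and attribute names are illustrative only.
SAMPLE_RESPONSE = """\
<annotations reference="CLONE1" version="3">
  <feature type="tRNA" start="100" end="172" label="tRNA-1"/>
  <feature type="SNP" start="450" end="450" label="snp-1"/>
</annotations>
"""

FeatureRecord = Dict[str, Union[str, int]]


def parse_annotations(xml_text: str) -> Tuple[str, int, List[FeatureRecord]]:
    """Return (reference id, reference version, features) from a response."""
    root = ET.fromstring(xml_text)
    reference = root.get("reference")
    version = int(root.get("version"))
    features: List[FeatureRecord] = [
        {
            "type": element.get("type"),
            "start": int(element.get("start")),
            "end": int(element.get("end")),
            "label": element.get("label"),
        }
        for element in root.findall("feature")
    ]
    return reference, version, features


reference, version, features = parse_annotations(SAMPLE_RESPONSE)
```

Because the client sees only this common language, the server behind it is free to store its annotations in any schema it likes. The version attribute is what the client would compare against the genome server's current version to detect skew.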

The annotation directory server is a small server maintained at the sequencing center that will provide clients with a list of current annotation servers. From this list, a client can select those annotation sources of personal interest.

The annotation viewer will be available in two separate versions: a lightweight, multi-platform stand-alone application and a web-based version. Both versions will allow users to browse the sequence and annotations and to pose simple queries. An important aspect of the viewer will be that there is no hard-coded representation of the features. Instead, annotation types will be dynamically associated with graphical representations using a cascading style sheet [43] approach.
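One way to picture the style sheet approach is as a lookup in which later sheets override earlier ones for a given annotation type. The property names (glyph, color), the default style, and the sheet contents below are hypothetical, and the sketch is not a description of the viewer's actual implementation.

```python
from typing import Dict, List

Style = Dict[str, str]
StyleSheet = Dict[str, Style]

# Fallback drawing style for annotation types that no sheet mentions.
DEFAULT_STYLE: Style = {"glyph": "box", "color": "gray"}


def resolve_style(feature_type: str, stylesheets: List[StyleSheet]) -> Style:
    """Cascade the sheets in order; later sheets override earlier ones."""
    style = dict(DEFAULT_STYLE)
    for sheet in stylesheets:
        style.update(sheet.get(feature_type, {}))
    return style


# A sheet suggested by an annotation server, then a user's own preferences.
server_sheet: StyleSheet = {"tRNA": {"glyph": "arrow", "color": "green"}}
user_sheet: StyleSheet = {"tRNA": {"color": "red"}}

trna_style = resolve_style("tRNA", [server_sheet, user_sheet])
snp_style = resolve_style("SNP", [server_sheet, user_sheet])
```

In this example the tRNA features keep the server's arrow glyph but take the user's color, and SNP features, which neither sheet mentions, fall back to the default. A new annotation type therefore needs only a new style entry, not a change to the viewer.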

Initial Prototype

The initial prototype was developed over existing architecture available in the ACeDB community. The gifaceserver was used as the annotation and sequence server. Sequences are retrieved from a copy of the Genome Sequencing Center's (GSC) version of ACeDB. The initial annotation viewer is a modified version of AceBrowser. The initial third party annotation databases have been built using Todd Lowe's tRNAscan-SE [34], the GSC's SNP data [3], and the GSC's Caenorhabditis briggsae to Caenorhabditis elegans homology data [2].

Subsequent developments will focus on moving from this ACeDB based initial prototype to a full application system. This will depend upon an XML based annotation language. The specification of this language is provided at http://biodas.org/documents/spec.html. It is still under development and revision.

Appendix A: Analysis of Proposed Design

Scalability	Data sources will be distributed across the internet rather than residing in one centralized monolithic system.
Maintainability	Individual annotators will own and maintain their own layers, instead of having them curated by a central team. A pride of ownership will become the driving force for remaining current.
Improvement by Competition	``Market forces'' will ensure that successful layers survive while unsuccessful and outdated layers disappear from the client's hotlist.
Specialization	The divide-and-conquer approach to annotation will permit experts to provide and maintain their own layers without the need to become computer experts. Specialized layers will develop to address individual communities.
Ease of Distribution	Because the client needs only a lightweight browser, local software installation and maintenance overhead will be minimized.
Quality Control	The end user's selection of layers provides the ultimate in selection and control over the annotations presented. Word of mouth and publication will become critical factors in determining who to trust for annotation.
Portability	The separation of sequence and map information from annotation allows them to be stored and represented in a variety of databases and schemas.

Bibliography

1: Alfred V. Aho.
Accessing information from globally distributed knowledge repositories (extended abstract).
Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, 1996.
2: Genome Sequencing Center at Washington University.
C. briggsae Sequencing, 1999.
Available at: http://genome.wustl.edu/gsc/Projects/briggsae.shtml.
3: Genome Sequencing Center at Washington University.
C. elegans Single Nucleotide Polymorphism data, 1999.
Available at: http://genome.wustl.edu/gsc/CEpolymorph/snp.shtml.
4: Amos Bairoch and Rolf Apweiler.
The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999.
Nucleic Acids Research, 27(1):49-54, 1999.
5: Winona C. Barker et al.
The PIR-International Protein Sequence Database.
Nucleic Acids Research, 27(1):39-43, 1999.
6: Dennis A. Benson et al.
GenBank.
Nucleic Acids Research, 27(1):12-17, 1999.
7: Tim Berners-Lee.
Information management: A proposal.
CERN, March 1989.
8: F.C. Bernstein et al.
The Protein Data Bank: a computer-based archival file for macromolecular structures.
J. Mol. Biol., 112(3):535-542, 1977.
9: Phillip A. Bernstein.
Repository System Engineering.
Proceedings of the 1996 ACM SIGMOD international conference on management of data, page 542, 1996.
10: Vannevar Bush.
As We May Think.
Atlantic Monthly, July 1945.
11: Stephen A. Chervitz et al.
Using the Saccharomyces Genome Database (SGD) for analysis of protein similarities and structure.
Nucleic Acids Research, 27(1):74-78, 1999.
12: BioWidgets Consortium.
Available at: http://goodman.jax.org/projects/biowidgets/consortium/ (15 Feb 2000 Link appears broken).
13: FlyBase Consortium.
The FlyBase Database of the Drosophila Genome Projects and community literature.
Nucleic Acids Research, 27(1):85-88, 1999.
14: Florence Corpet, Jerome Gouzy, and Daniel Kahn.
Recent improvements of the ProDom database of protein domain families.
Nucleic Acids Research, 27(1):263-267, 1999.
15: S. Cozza et al.
System Design of the Genome Topographer.
DOE Human Genome Program Contractor-Grantee Workshop V, January 1996.
Available at: http://www.ornl.gov/hgmis/publicat/96santa/informat/cozza.html.
16: Richard Durbin and Jean Thierry-Mieg.
A C. elegans database, 1991.
Documentation, code, and data available from anonymous FTP servers at lirmm.lirmm.fr, cele.mrc-lmb.cam.ac.uk and ncbi.nlm.nih.gov.
17: Ramez Elmasri and Shamkant B. Navathe.
Fundamentals of Database Systems.
Addison-Wesley, Menlo Park, California, 2 edition, 1994.
18: Thure Etzold et al.
SRS: Information Retrieval System for Molecular Biology Data Bank.
Methods in Enzymology, 266:114, 1996.
19: National Center for Biotechnology Information.
Entrez, 1999.
Available at: http://www.ncbi.nlm.nih.gov/Entrez/.
20: D. Frishman and H.W. Mewes.
PEDANTic genome analysis.
Trends in Genetics, 13:415-416, 1997.
21: T. Gaasterland and C.W. Sensen.
MAGPIE: automated genome interpretation.
Trends in Genetics, 12(2):76-78, 1996.
22: N.L. Harris.
Genotator: a workbench for sequence annotation.
Genome Res, 7:754-62, 1997.
23: Neomorphic Inc.
Neomorphic Genome SDK.
Available at: http://www.neomorphic.com/.
24: Proteome Inc.
WormPD, 1999.
Available at: http://www.proteome.com/databases/index.html.
25: Third Voice Inc.
Available at: http://www.thirdvoice.com/.
26: Peter Karp.
Editorial: What we do not know about sequence analysis and sequence databases.
Bioinformatics, 14(9):753-754, 1998.
27: Peter D. Karp.
A strategy for database interoperation.
Journal of Computational Biology, 2:573-586, 1995.
28: David T. Kingsbury.
Computational biology.
ACM Computing Surveys, 28(1), March 1996.
29: E.F. Kirkness and A.R. Kerlavage.
The TIGR human cDNA database.
Methods Mol Biol, 69:261-268, 1997.
30: E.S. Lander.
The new genomics: global views of biology.
Science, 274:536-9, 1996.
31: Thomas Lengauer.
Editorial: The accessibility of data for bioinformatics.
Bioinformatics, 15(2):91-92, 1999.
32: S.I. Letovsky et al.
GDB: the human genome database.
Nucleic Acids Research, 26(1):94-100, 1998.
33: David M. Levy and Catherine C. Marshall.
Going digital: A look at assumptions underlying digital libraries.
Communications of the ACM, 38(4):77-84, April 1995.
34: T.M. Lowe and S.R. Eddy.
tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.
Nucleic Acids Res, 25:955-64, 1997.
35: J. Macauley, H. Wang, and N. Goodman.
A model system for studying the integration of molecular biology databases.
Bioinformatics, 14:575-82, 1998.
36: Bonnie L. Maidak et al.
A new version of the RDP (Ribosomal Database Project).
Nucleic Acids Research, 27(1):171-173, 1999.
37: Victor M. Markowitz.
Heterogeneous molecular biology databases.
Journal of Computational Biology, 2:537-538, 1995.
38: Catherine Marshall.
Annotation: from paper books to the digital library.
Proceedings of the 2nd ACM International Conference on Digital Libraries, pages 131-140, 1997.
39: Catherine C. Marshall.
Toward an ecology of hypertext annotation.
Hypertext 98, pages 40-49, 1998.
40: John Morris.
Tace documentation, 1994.
Available at: http://genome.cornell.edu/acedocs/tace.html.
41: Chris Mungall.
Anubis, 1997.
Available at: http://www.ri.bbsrc.ac.uk/anubis/.
42: Yasukazu Nakamura et al.
Extension of CyanoBase. CyanoMutants: repository of mutant information on Synechocystis sp. strain PCC6803.
Nucleic Acids Research, 27(1):66-68, 1999.
43: Jennifer Niederst.
Web Design in a Nutshell.
O'Reilly and Associates, Sebastopol, California, January 1999.
44: Elizabeth Pennisi.
Keeping genome databases clean and up to date.
Science, 286:447-450, 1999.
45: J. Rashbass.
Online Mendelian Inheritance in Man.
Trends in Genetics, 11(1):291-292, 1995.
46: O. Ritter et al.
Prototype implementation of the integrated genomic database.
Comput. Biomed Res., 27(2):97-115, 1994.
47: M. Scharf et al.
GeneQuiz: a workbench for sequence analysis.
ISMB, 2:348-353, 1994.
48: David B. Searls.
Visualizing the genome.
Available at: http://www.cbil.upenn.edu/genlang/papers/refs.html.
49: L.M. Shoman et al.
The Worm Community System, release 2.0 (WCSr2).
Methods Cell Biol, 4:607-25, 1995.
50: M.P. Skupski et al.
The genome sequence database: towards an integrated functional genomics resource.
Nucleic Acids Res, 27(1):35-8, 1999.
51: L.D. Stein, S. Cartinhour, D. Thierry-Mieg, and J. Thierry-Mieg.
JADE: An approach for interconnecting bioinformatics databases.
Gene, 209:39-43, 1998.
52: L.D. Stein and J. Thierry-Mieg.
Scriptable access to the Caenorhabditis elegans genome sequence and other ACeDB databases.
Genome Res, 8:1308-15, 1998.
53: Lincoln Stein, 1999.
Available at: http://stein.cshl.org/AcePerl/AceBrowser/.
54: G. Stoesser, M.A. Tuli, R. Lopez, and P. Sterk.
The EMBL nucleotide sequence database.
Nucleic Acids Res, 27:18-24, 1999.
55: R. Waterston and J. Sulston.
The genome of Caenorhabditis elegans.
Proc. Natl. Acad. Sci., 92:10836-10840, 1995.
56: WebAce.
Collaboration of many people, information available at http://webace.sanger.ac.uk/.
57: Sarah J. Wheelan and Mark S. Boguski.
Late-Night Thoughts on the Sequence Annotation Problem.
Genome Research, 8(3):168-169, March 1998.
58: Jennifer Widom.
Integrating heterogeneous databases: Lazy or eager?
ACM Computing Surveys, 28, 1996.
59: K.C. Worley, P. Culpepper, B.A. Wiese, and R.F. Smith.
Beauty-X: enhanced blast searches for DNA queries.
Bioinformatics, 14:890-1, 1998.
60: Ka-Ping Yee.
Critlink: Better hyperlinks for the WWW.
Submitted to Hypertext '98, but not accepted, 1998.

Direct comments and questions to <robin@genetics.wustl.edu>