RFC004
TITLE: Annotation ontologies for DAS/2
Author: Thomas Down
Dependencies: none
Version: 1
Date: 16 August 2001

Introduction
------------

Traditionally, mechanisms for annotating biological sequences have
used a simple, flat namespace of `types'.  This is simple to
implement, but gives no indication of semantic relationships between
the different types, let alone between the sets of types offered by
multiple data providers.

DAS 1.0 takes a slightly more sophisticated approach to this issue by
building a two-layer hierarchy (category-type).  This offers some
potential for semantic grouping, especially since a number of
`standard' categories are defined in the 1.0 spec.  However, it only
offers a single level of grouping and, except for the standard
categories, leaves open the issue of determining equivalence between
types from different data sources.

We propose that the standard type system for DAS/2 should be made up
of links into a distributed ontology.  This allows more complex
hierarchies to be described, e.g.:

  http://biodas.org/types.daml#feature       (root type)
    |
    +---> http://biodas.org/types.daml#repeat
            |
            +-----> http://repeatmasker.org/repeats.daml#L1
            |
            +-----> http://repeatmasker.org/repeats.daml#Alu
                      |
                      +----> http://repeatmasker.org/repeats.daml#AluJo
                      |
                      +----> http://repeatmasker.org/repeats.daml#AluJb

This kind of classification is extremely useful.  Many researchers
are not particularly interested in repeats, and would be happy to use
a single rendering style for all repeats (or just turn them off
completely).  Other users might prefer a stylesheet which uses a
different rendering style for each sub-family of Alu elements.
Allowing an arbitrarily complex type structure means that the needs
of both kinds of user (and everyone in between) can be catered for.

Note that the DAS/2 core ontology only defines a few very basic
types.  More specialized knowledge is published elsewhere, hence a
distributed ontology.

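The stylesheet behaviour described above can be sketched as a walk up
the type hierarchy: a client looks for a style on the feature's own
type, then on each ancestor in turn.  This is only an illustration,
not part of the proposal.  The PARENT table is hand-written here to
mirror the example hierarchy (a real client would derive it from the
ontology documents), and the function names and style names are
hypothetical.

```python
# Hand-written parent links mirroring the example hierarchy.
# Assumption: a real client would build this from ontology documents.
FEATURE = "http://biodas.org/types.daml#feature"
REPEAT = "http://biodas.org/types.daml#repeat"
RM = "http://repeatmasker.org/repeats.daml"

PARENT = {
    REPEAT: FEATURE,
    RM + "#L1": REPEAT,
    RM + "#Alu": REPEAT,
    RM + "#AluJo": RM + "#Alu",
    RM + "#AluJb": RM + "#Alu",
}


def ancestors(type_uri):
    """Yield type_uri itself, then each ancestor up to the root."""
    seen = set()
    current = type_uri
    # The 'seen' set guards against accidental cycles in published data.
    while current is not None and current not in seen:
        seen.add(current)
        yield current
        current = PARENT.get(current)


def resolve_style(type_uri, stylesheet, default=None):
    """Return the style of the most specific type the stylesheet names."""
    for candidate in ancestors(type_uri):
        if candidate in stylesheet:
            return stylesheet[candidate]
    return default


# One user styles all repeats alike; another distinguishes Alu sub-families.
coarse = {REPEAT: "grey-box"}
fine = {REPEAT: "grey-box", RM + "#AluJo": "red-box", RM + "#AluJb": "blue-box"}

print(resolve_style(RM + "#AluJo", coarse))  # grey-box
print(resolve_style(RM + "#AluJo", fine))    # red-box
print(resolve_style(RM + "#L1", fine))       # grey-box
```

Because lookup falls back to ancestors, a stylesheet written against
the core types keeps working when a data provider publishes new, more
specific types beneath them.
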
Requirements
------------

We would like to see the following features in a DAS/2 type system.

  - Use an ontology language (preferably an existing system) to define
    the types of annotation served in DAS, and the relationships
    between them.

  - The ontology should also be able to specify which pieces of
    additional information should be transmitted with a given type of
    feature (e.g. a sequence homology feature might have a query
    sequence ID and a percentage sequence identity).

  - The ontology should be distributed.  Anybody should be able to
    publish a new set of concepts, either derived from existing
    concepts or hanging directly from the root type.  When publishing
    new annotation, you can either reuse existing ontology documents,
    or set up your own (or a combination of the two).

  - The type system should still support simple `bareword' types for
    cases where the ontology mechanism is not practical.  Where these
    are used, no information should be inferred about their relation
    to other types.  In particular, two `bareword' types from
    different servers should never be considered equivalent, even if
    they are lexically identical.

  - The type system should work well with existing biological
    ontologies (e.g. GO).  This doesn't necessarily imply that we
    should pull in the whole of GO -- a simpler approach would be to
    define that all features of type `gene' may have a property
    (specified in the definition of the gene type), which is a set of
    links to GO terms.

Mechanisms
----------

The DAML+OIL ontology language is a recent but widely supported
attempt to build a standard language for communicating ontologies of
the type required here.  In particular, it has been developed with a
view to publishing metadata for Semantic Web applications, and has
been endorsed by many advocates of the Semantic Web model.

DAML refers to concepts using URIs with fragment identifiers (e.g.
http://www.biodas.org/types.daml#repeat), and allows one ontology
document to link to many others, which supports the requirement for
a distributed ontology.
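
As an illustration of how a client might tell ontology types from
`bareword' types, the sketch below splits a type string into an
ontology document URL and a concept name, and applies the equivalence
rule from the requirements.  The TypeRef class, its fields, and the
server URLs are hypothetical.  Two choices here are assumptions rather
than part of the proposal: a string counts as an ontology reference
only if it has a URI scheme and a fragment, and identical barewords
from the same server are treated as the same type.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag, urlparse


@dataclass(frozen=True)
class TypeRef:
    server: str  # DAS server the feature came from (hypothetical field)
    raw: str     # type string exactly as transmitted

    def ontology_parts(self):
        """Return (document URL, concept name), or None for a bareword."""
        if not urlparse(self.raw).scheme:
            return None
        document, fragment = urldefrag(self.raw)
        if not fragment:
            return None
        return document, fragment


def same_type(a, b):
    """Decide whether two type references name the same type."""
    pa, pb = a.ontology_parts(), b.ontology_parts()
    if pa is not None and pb is not None:
        # Ontology types are global: the serving host is irrelevant.
        return pa == pb
    if pa is None and pb is None:
        # Barewords are never equivalent across servers.
        # Assumption: within one server, identical strings do match.
        return a.server == b.server and a.raw == b.raw
    # Nothing is inferred between a bareword and an ontology type.
    return False


uri = "http://www.biodas.org/types.daml#repeat"
a = TypeRef("http://server-a.example/das", uri)
b = TypeRef("http://server-b.example/das", uri)
c = TypeRef("http://server-a.example/das", "repeat")
d = TypeRef("http://server-b.example/das", "repeat")

print(a.ontology_parts())  # ('http://www.biodas.org/types.daml', 'repeat')
print(same_type(a, b))     # True
print(same_type(c, d))     # False
```

The comparison is a plain string match on the document URL and
fragment, so spellings that differ only in host name (with or without
a leading `www.', say) would count as different types under this
sketch.
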