RFC004

TITLE: Annotation ontologies for DAS/2
Author: Thomas Down
Dependencies: none
Version: 1
Date: 16 August 2001

Introduction
------------

Traditionally, mechanisms for annotating biological sequences have
used a simple, flat namespace of `types'.  This is simple to implement,
but gives no indication of semantic relationships between the different
types, let alone between the sets of types offered by different data
providers.

DAS 1.0 takes a slightly more sophisticated approach to this issue
by building a two-layer hierarchy (category-type).  This offers some
potential for semantic grouping, especially since a number of `standard'
categories are defined in the 1.0 spec.  However, it offers only a single
level of grouping and, except for the standard categories, leaves open
the question of how to determine equivalence between types from
different data sources.

We propose that the standard type system for DAS/2 should be made up
of links into a distributed ontology.  This allows more complex hierarchies
to be described, e.g.:

   http://biodas.org/types.daml#feature      (root type)
     |
     +---> http://biodas.org/types.daml#repeat
             |
             +-----> http://repeatmasker.org/repeats.daml#L1
             |
             +-----> http://repeatmasker.org/repeats.daml#Alu
                      |
                      +----> http://repeatmasker.org/repeats.daml#AluJo
                      |
                      +----> http://repeatmasker.org/repeats.daml#AluJb

This kind of classification is extremely useful, since while many
researchers are not particularly interested in repeats, and would
be happy to use a single rendering style for all repeats (or just
turn them off completely), other users might prefer a stylesheet
which used different rendering styles for each sub-family of Alu
elements.  Allowing an arbitrarily complex type structure means that
the needs of both users (and everyone in between) can be catered for.
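The two stylesheet scenarios above can be illustrated with a minimal Python
sketch.  It is not part of the proposal: the parent map is hard-coded from the
example hierarchy, and the names `PARENT`, `ancestors`, `resolve_style` and
the style labels are hypothetical.  A real client would presumably build the
parent map from the published ontology documents.

```python
# Hypothetical, hard-coded copy of the example hierarchy (child -> parent).
ROOT = "http://biodas.org/types.daml#feature"
REPEAT = "http://biodas.org/types.daml#repeat"
RM = "http://repeatmasker.org/repeats.daml#"

PARENT = {
    REPEAT: ROOT,
    RM + "L1": REPEAT,
    RM + "Alu": REPEAT,
    RM + "AluJo": RM + "Alu",
    RM + "AluJb": RM + "Alu",
}


def ancestors(type_uri):
    """Yield type_uri itself, then each ancestor up to the root."""
    current = type_uri
    while current is not None:
        yield current
        current = PARENT.get(current)


def resolve_style(type_uri, stylesheet, default="default"):
    """Return the style of the nearest type in the chain that the
    stylesheet mentions, or `default` if none is mentioned."""
    for candidate in ancestors(type_uri):
        if candidate in stylesheet:
            return stylesheet[candidate]
    return default


# A user who wants one style for all repeats ...
coarse = {REPEAT: "grey-box"}
# ... and a user who distinguishes the Alu sub-families.
fine = {REPEAT: "grey-box", RM + "AluJo": "red-box", RM + "AluJb": "blue-box"}

print(resolve_style(RM + "AluJo", coarse))
print(resolve_style(RM + "AluJo", fine))
```

With the coarse stylesheet every repeat falls back to the style attached to
the `repeat` type; with the fine one, the more specific entry is found first.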

Note that the DAS/2 core ontology would define only a few very basic
types.  More specialized knowledge is published elsewhere, hence the
need for a distributed ontology.


Requirements
------------

We would like to see the following features in a DAS/2 type system.

  - Use an ontology language (preferably an existing system) to
    define the types of annotation served in DAS, and the relationships
    between them.

  - The ontology should also be able to specify which pieces of additional
    information should be transmitted with a given type of feature
    (e.g. a sequence homology feature might have a query sequence ID
    and a percentage sequence identity).

  - The ontology should be distributed.  Anybody should be able to
    publish a new set of concepts, either derived from existing
    concepts or hanging directly from the root type.  When publishing
    new annotation, you can either reuse existing ontology documents,
    or set up your own (or a combination of the two).

  - The type system should still support simple `bareword' types for
    cases where the ontology mechanism is not practical.  Where these
    are used, no information should be inferred about their relation
    to other types.  In particular, two `bareword' types from different
    servers should never be considered equivalent, even if they are
    lexically identical.

  - The type system should work well with existing biological
    ontologies (e.g. GO).  This doesn't necessarily imply that
    we should pull in the whole of GO -- a simpler approach would
    be to define that all features of type `gene' may have a property
    (specified in the definition of the gene type), which is a set
    of links to GO terms.
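The `bareword' requirement above can be sketched as a small equivalence test.
This is an illustration under stated assumptions, not a specified algorithm:
the function names and example server URLs are hypothetical, a type name is
treated as an ontology link simply because it contains "://", and identical
barewords from the same server are assumed to denote the same type (the text
above only rules out equivalence across servers).

```python
def is_bareword(type_name):
    """Assumption: any name that is not an absolute URI is a bareword."""
    return "://" not in type_name


def same_type(name_a, server_a, name_b, server_b):
    """Decide whether two advertised types should be treated as equivalent.

    Ontology links are globally meaningful, so identical URIs match
    regardless of which server served them.  Barewords carry no such
    meaning, so they only match within a single server.
    """
    if name_a != name_b:
        return False
    if is_bareword(name_a):
        return server_a == server_b
    return True


# Hypothetical server URLs, used only for illustration.
A = "http://a.example/das"
B = "http://b.example/das"

print(same_type("repeat", A, "repeat", B))
print(same_type("http://biodas.org/types.daml#repeat", A,
                "http://biodas.org/types.daml#repeat", B))
```

The first call compares lexically identical barewords from two servers and
reports them as different; the second compares the same ontology link served
from two places and reports a match.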

Mechanisms
----------

The DAML+OIL ontology language is a recent but widely supported attempt
to build a standard language for communicating ontologies of the type
required here.  In particular, it has been developed with a view to
publishing metadata for Semantic Web applications, and has been endorsed
by many advocates of the Semantic Web model.

DAML refers to concepts using URIs with fragment identifiers (e.g.
http://www.biodas.org/types.daml#repeat), and allows one ontology
document to link to many others, which satisfies the requirement
for a distributed ontology.
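As a rough illustration of how a client might handle such references, the
Python sketch below splits a type URI into the ontology document that defines
it and the concept name within that document, using the standard library's
`urllib.parse.urldefrag`.  The helper names `split_concept` and
`documents_needed` are hypothetical, and nothing here fetches or parses
DAML+OIL itself.

```python
from urllib.parse import urldefrag


def split_concept(type_uri):
    """Split a type URI into (ontology document URL, concept name)."""
    document, fragment = urldefrag(type_uri)
    if not fragment:
        raise ValueError("type URI has no fragment identifier: %r" % type_uri)
    return document, fragment


def documents_needed(type_uris):
    """Return the distinct ontology documents referenced by a set of types,
    i.e. the documents a client would have to retrieve to interpret them."""
    return sorted({split_concept(uri)[0] for uri in type_uris})


types = [
    "http://biodas.org/types.daml#repeat",
    "http://repeatmasker.org/repeats.daml#Alu",
    "http://repeatmasker.org/repeats.daml#AluJo",
]

print(split_concept("http://www.biodas.org/types.daml#repeat"))
print(documents_needed(types))
```

Because several concepts usually live in one document, a client only needs to
retrieve each distinct document once, however many types point into it.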