The Lightweight Distributed Annotation Server (LDAS)

The Lightweight DAS Server (LDAS) is designed for use by small to medium sites. It is ``lightweight'' in the sense that once all the software is installed, annotations can be loaded and updated using tab-delimited text files. It uses only open source (free) software, and does not require a deep knowledge of database design or of the DAS protocol itself.

This server is capable of serving annotations on up to several million features across eukaryotic genomes. It runs on top of the Mysql database engine, the Apache web server, and the Perl programming language. It is designed to be portable between all flavors of Unix and Microsoft Windows.

PREREQUISITES

The following software is required to run the LDAS:

 1) Apache web server 1.3.17 or higher

    Home page      http://www.apache.org
    Source code    http://httpd.apache.org/dist/httpd/apache_1.3.22.tar.gz
    RedHat RPM     to come

 2) Mysql version 3.23 or higher

    Home page       http://www.mysql.org
    Source/binaries http://www.mysql.com/downloads/mysql-3.23.html
    RedHat RPM     to come

 3) Perl 5.6.1 or higher

    Home page      http://www.cpan.org
    Source code    http://www.cpan.org/src/stable.tar.gz
    RedHat RPM     to come

 4) Perl DBI module 1.20 or higher

    Home page      http://www.cpan.org   
    Source code    http://www.cpan.org/modules/by-module/DBI/DBI-1.20.tar.gz

 5) Perl DBD module 1.22 or higher

    Home page      http://www.cpan.org
    Source code    http://www.cpan.org/modules/by-module/DBD/Msql-Mysql-modules-1.2216.tar.gz

 6) Bio::DB::GFF version 0.38 or higher

    Home page      http://www.bioperl.org

    This is part of the "live" bioperl 0.9.X distribution.
    See below for instructions on getting the most recent
    version.

In addition, the following are recommended for those who wish to dramatically increase the performance of their system:

 7) Mod_perl version 1.24 or higher

    Home page      http://perl.apache.org
    Source code    http://www.cpan.org/modules/by-module/Apache/mod_perl-1.26.tar.gz
    RedHat RPM     mod_perl-1.25-2cl.i386.rpm

 8) Apache::DBI 0.88 or higher

    Home page      (none)
    Source code    http://www.cpan.org/modules/by-module/Apache/ApacheDBI-0.88.tar.gz
    RedHat RPM     to come

Note that many Linux systems will have (1), (2) and (3) installed already.

To install the RPM versions of these packages, use the ``rpm'' command or an RPM graphical front end if you have one. To install the source code versions, unpack them with the following command (Unix style):

  % gunzip -c the-package.tar.gz | tar xvf -

This will unpack to one or more directories, at the top of which will be a README or INSTALL file. Follow these instructions to build and install the packages.

Installing Bio::DB::GFF

It is a bit tricky to install Bio::DB::GFF since it is currently part of the development version of bioperl and not directly downloadable as a nice package. You must use anonymous CVS to get this package. Here is the recipe (copied from http://cvs.bioperl.org):

    (1) Make sure that CVS is installed on your system.

    (2) Use the following command (all on one line) to log in to the server:

         % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl login

         When prompted, the password is 'cvs'.

    (3) Check out the bioperl package you are interested in. For most
    users this will be the bioperl-live source tree.  The following
    command should be executed as one line.

         % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl checkout bioperl-live

    The login and checkout procedure should only have to be done
    once. To update the source directories in the future it should be
    possible just to enter the top level directory and issue the
    following command:

         % cvs update

The checkout will create the directory ``bioperl-live''. Now build and install bioperl with the following recipe:

         % cd bioperl-live
         % perl Makefile.PL
         % make
         % make test
         % make install

The last step will probably need to be run as root.

INSTALLING THE LDAS

The lightweight DAS server itself is a small Perl script that runs on top of the Bio::DB::GFF Perl library. The architecture looks like this:

              das script <--> Apache ---------------> Das Client
                 |
             Bio::DB::GFF
                 |
               mysql

1) Before you install LDAS, identify the location of the Apache web server's CGI and configuration directories. Also identify the location in which you would like to place the scripts that load the mysql database from flat files.

Typical locations are:

  Configuration directory:     /usr/local/apache/conf
  CGI script directory:        /usr/local/apache/cgi-bin
  Load script directory:       /usr/local/bin

Users of Apache/mod_perl should indicate the location of their Apache::Registry scripts directory for the CGI scripts. This is /usr/local/apache/cgi-perl on many systems, but it depends on how you have configured mod_perl.

If you change your mind about these locations later, you may reinstall, or just manually move the files around.

2) From within the LDAS directory, run ``perl Makefile.PL'':

   % perl Makefile.PL

You will be asked to indicate the locations of the configuration directory, CGI script directory, and scripts directory. Enter your choices.

3) Make, test and install the LDAS:

   % make
   % make test
   % make install

You may have to be root to run the last step. Currently the ``test'' step only confirms that you have installed Bio::DB::GFF.

The install step copies the contents of the LDAS distribution's ``bin'' subdirectory to the CGI scripts directory, the contents of the ``conf'' subdirectory to the Apache configuration directory, and the contents of ``scripts'' to the load script directory.

Manual Install

If you are using a Microsoft Windows machine without access to ``make'', or you run into problems during the install, here is how to do the install manually:

1) Enter the scripts subdirectory and run Perl on each of the .PLS files you find there:

  % cd scripts
  % perl Das2GFF.PLS
  % perl ldas_load.PLS
  % perl ldas_bulk_load.PLS

This will create three .pl files, each configured for your system. Manually copy them into a directory where you keep executable files and scripts.

2) Enter the bin subdirectory and run Perl on the das.PLS file, passing it the path to the configuration directory. For example:

  % cd bin
  % perl das.PLS C:\Apache\conf

This will create the file ``das.pl''. Rename it ``das'' and copy it into your CGI scripts directory:

  % copy das.pl C:\Apache\cgi-bin\das

3) Enter the conf subdirectory and copy all the files you find there into the chosen configuration directory:

  % cd conf
  % copy * C:\Apache\conf\das.conf\

SETTING UP THE DATABASE

You will need to create one Mysql database for each data source that you wish to serve. The same database can be used for annotations and reference information.

Creating the Database

The database must be writable by you, and readable by the user that Apache runs as (usually the ``nobody'' user). In addition, if you wish to use the fast file-based bulk loader, you will have to have FILE privileges on the server. The following illustrates the steps in setting up a new database called ``dicty'':

Create the dicty database
```
  % mysqladmin -uroot -p create dicty
  Enter password: *******
```
You will most likely have to log in as the Mysql administrator (typically ``root'') in order to do this.

Set up privileges for yourself and ``nobody''

  % mysql -uroot -p dicty
  Enter password: *******

  Welcome to the MySQL monitor.  Commands end with ; or \g.
  Your MySQL connection id is 4 to server version: 3.23.43-log

  Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

  mysql> grant all privileges on dicty.* to lstein@localhost;
  Query OK, 0 rows affected (0.00 sec)

  mysql> grant file on *.* to lstein@localhost;
  Query OK, 0 rows affected (0.00 sec)

  mysql> grant select on dicty.* to nobody@localhost;
  Query OK, 0 rows affected (0.00 sec)

  mysql> quit
  Bye

The first grant command in this example shows all privileges (select, update, create, delete) being granted to users who log in as ``lstein'' from the local machine. You will want to change the user name to your own login.

The second grant command grants file permissions to this user so that he can use the bulk loader. Because of the way Mysql's bulk loading works, the file permission must be granted to all databases (*.*) and not just to a single one.

The third command grants select permissions to the ``nobody'' user. This enables the web server script to read the dicty database, but not to update or otherwise change it.

You may wish to add password protection to the database. If you do this for the ``nobody'' user, you will need to update the ``user'' and ``passwd'' settings in the configuration file.

Creating the Load Files

The LDAS database is loaded from tab-delimited files containing annotation and assembly information. There are actually three types of tables that can be loaded:

reference point information
    This type of information, which is needed both for reference servers and annotation servers, lists the names and lengths of all the landmarks that will be used to describe the positions of annotations. Landmarks are typically sequence accession numbers, such as a GenBank accession number, contig names, supercontig names, or the names of chromosomes. LDAS needs the name and length (in bp) of each reference point that is referred to by the assembly and annotation tables.

assembly information
    This type of information, which is needed for reference servers only, describes how the genome is assembled from smaller fragments. LDAS does not assume or require that the genome be finished, but if there is any assembly information at all, it should be represented here.

annotation information
    This type of information, which is needed for annotation servers only, describes a series of annotations, each of which is represented as a start and end position relative to one of the reference points.

In practice, you can use a different file for each of the reference point, assembly, and annotation tables, put all the information into different sections of a single file, or distribute the information arbitrarily among multiple files.

Load files are plain, tab-delimited text files, such as can be produced by a text editor or a spreadsheet program. The files must have the extension .das.

The different types of information are preceded by a short bracketed identifier. Here is an excerpt from the ``test.das'' file that is included with this distribution:

 [references]
 #id    class           length
 Chr1     Chromosome   10000
 Link_1   Link          6000
 Link_2   Link          5000
 Cont_1a  Contig        5000
 Cont_1b  Contig        5000
 Cont_2a  Contig        9000
 Cont_2b  Contig        8000

 [assembly]
 #id    start   end     class   name    start   end
 Chr1   1       5000    Link    Link_1  1001    6000
 Chr1   5001    10000   Link    Link_2  2001    7000
 Link_1 1001    3500    Contig  Cont_1a 1       2500
 Link_1 3501    5000    Contig  Cont_1b 4500    2001
 Link_1 5001    6000    Contig  Cont_1a 5000    4001
 Link_2 2001    4500    Contig  Cont_2a 1001    3500
 Link_2 4501    7000    Contig  Cont_2b 8000    5501

 [annotations]
 #class name    type       subtype      ref        start stop strand    phase   score   tstart  tend
 Gene   abc-1   exon       curated      Cont_2a    5050 5100     +      .       .
 Gene   abc-1   CDS        curated      Cont_2a    5060 5100     +      0       .
 Gene   abc-1   exon       curated      Cont_2a    5200 5280     +      .       .
 Gene   abc-1   CDS        curated      Cont_2a    5200 5280     +      2       .
 Gene   abc-1   exon       curated      Cont_2a    5300 5380     +      .       .
 Gene   abc-1   CDS        curated      Cont_2a    5300 5360     +      2       .
 EST    yk123.1 similarity ESTWise      Cont_2a    5025 5100     .      .       99      1       76
 EST    yk123.1 similarity ESTWise      Cont_2a    5200 5280     .      .       99      77      157
 .      .       repeat     alu  Cont_2a    5050 5150     .      .       80

As shown in the example, the file is divided into multiple sections, each containing a bracketed [section] identifier. There can be multiple sections in a single load file, or you can create a file that contains a single section only. Blank lines, and lines that begin with the # sign, are ignored. All columns must be separated by tabs, not spaces.
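The rules above (bracketed section headers, ignored blank and # lines, tab-separated columns) can be sketched in a few lines of Python. This is only an illustration of the file layout, not the parser used by the LDAS loaders; the function name parse_load_file and the sample text are made up for the example.

```python
from collections import defaultdict

# The three section identifiers described in this document.
SECTIONS = {"references", "assembly", "annotations"}


def parse_load_file(text):
    """Split .das load-file text into {section_name: [row, ...]}.

    Each row is a list of tab-separated column values. A section may
    appear more than once; its rows are simply accumulated.
    """
    rows = defaultdict(list)
    current = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment lines are ignored
        if stripped.startswith("[") and stripped.endswith("]"):
            name = stripped[1:-1]
            if name not in SECTIONS:
                raise ValueError(f"line {lineno}: unknown section [{name}]")
            current = name
            continue
        if current is None:
            raise ValueError(f"line {lineno}: data before any [section] line")
        # Columns are separated by tabs, never by spaces.
        rows[current].append([col.strip() for col in raw.split("\t")])
    return dict(rows)


sample = (
    "[references]\n"
    "#id\tclass\tlength\n"
    "Chr1\tChromosome\t10000\n"
    "\n"
    "[assembly]\n"
    "Chr1\t1\t5000\tLink\tLink_1\t1001\t6000\n"
    "[references]\n"
    "Link_1\tLink\t6000\n"
)
tables = parse_load_file(sample)
print(tables["references"])
```

Running a check like this before loading can catch a common mistake: columns separated by spaces instead of tabs show up as rows with a single column.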

The [references] section

A section that begins with [references] is a listing of the reference sequences for the database. The references section has three columns:

 Column 1: Reference name
          The name of the reference sequence.

 Column 2: Reference source
          A one-word description of the reference sequence (the
          ``class'' column in the example above).  The source
          description is used in the LDAS configuration file to
          identify reference sequence entries.

 Column 3: Reference length
          The length of the reference sequence, in base pairs.
          This information is necessary even for annotation
          servers so as to be able to handle coordinate translations
          involving the reverse strand.

It is recommended that you use the ``name.version'' identifier for reference sequences, if you can. LDAS recognizes this format and automatically converts it into version information for the DAS protocol.

The [assembly] section

A section that begins with [assembly] is a listing of the genome assembly. Annotation servers do not need to provide this information, but reference servers do. The format is a 7-column list. Each line contains information about where a particular segment of the assembly comes from:

 Column 1:  Reference name
          The name of a reference sequence which is made out of
          an assembly of smaller pieces.

 Columns 2 & 3:  Start and stop positions in reference sequence coordinates
          Two integers indicating the start and stop of a section of
          the assembly of the reference sequence indicated in the
          first column.  The start position should always be less
          than the stop.

 Columns 4 & 5:  Source and name of the target sequence
          A source and name for the smaller sequence that is "assembled
          into" the reference sequence.

 Columns 6 & 7:  Start and stop positions in target sequence coordinates
          Two integers indicating the position of the assembly in
          the frame of reference of the smaller sequence indicated by
          columns 4 & 5. Unlike the endpoints given in reference
          sequence coordinates, the target start position will be
          greater than the stop position if the local assembly was
          built up from the reverse complement of the target sequence.

The following picture illustrates how this works:

     2001   4500 4501             7000
      |         ||                 |  
   -------------------------------------->  Link_2

   ...-----------....> Cont_2a
      |         |
     1001      3500

  Cont_2b <......------------------......... 
                 |                |
               8000              5501

Positions 2001 to 4500 of Link_2 correspond to positions 1001 to 3500 of Cont_2a, so that relationship is described by

 Link_2 2001    4500    Contig  Cont_2a 1001    3500

Positions 4501 to 7000 of Link_2 correspond to positions 8000 down to 5501 of Cont_2b (that is, to the reverse complement of Cont_2b), so that relationship is described by

 Link_2 4501    7000    Contig  Cont_2b 8000    5501
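The coordinate translation implied by an [assembly] line can be written out as a short Python sketch. The function name ref_to_target is hypothetical and the code is not taken from LDAS; it only restates the rule that a target start greater than the target stop means the reverse complement.

```python
def ref_to_target(pos, ref_start, ref_stop, tgt_start, tgt_stop):
    """Map a reference-sequence position to (target position, strand).

    The arguments after pos are columns 2, 3, 6 and 7 of one
    [assembly] line. If tgt_start > tgt_stop, the target was placed
    as its reverse complement, so its coordinates run backwards.
    """
    if not ref_start <= pos <= ref_stop:
        raise ValueError(f"{pos} is outside {ref_start}..{ref_stop}")
    offset = pos - ref_start
    if tgt_start <= tgt_stop:
        return tgt_start + offset, "+"
    return tgt_start - offset, "-"


# Link_2 2001 4500 Contig Cont_2a 1001 3500   (forward)
print(ref_to_target(2001, 2001, 4500, 1001, 3500))  # (1001, '+')
print(ref_to_target(4500, 2001, 4500, 1001, 3500))  # (3500, '+')

# Link_2 4501 7000 Contig Cont_2b 8000 5501   (reverse complement)
print(ref_to_target(4501, 4501, 7000, 8000, 5501))  # (8000, '-')
print(ref_to_target(7000, 4501, 7000, 8000, 5501))  # (5501, '-')
```

Both example lines cover 2500 bases on each side, which is a useful sanity check when writing assembly lines by hand: the reference span and the target span should have the same length.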

The [annotations] Section

This is the longest section of the load file(s). It is a 10 or 12 column table. Each line corresponds to an annotation on one of the reference sequences. An annotation that spans multiple discontinuous sequence ranges, such as an mRNA->genomic alignment, will occupy several lines of the file.

Here are a few lines from the sample file that illustrate annotated exons for the gene named ``abc-1'':

 Gene   abc-1   curated transcript Cont_2a    5050 5380  +      .       .
 Gene   abc-1   curated exon       Cont_2a    5050 5100  +      .       .
 Gene   abc-1   curated exon       Cont_2a    5050 5100  +      .       .
 Gene   abc-1   curated exon       Cont_2a    5200 5280  +      .       .
 Gene   abc-1   curated exon       Cont_2a    5300 5380  +      .       .

 Columns 1 & 2:  Group class and name
        Some annotations correspond to a named biological object.  For these
        annotations, columns 1 and 2 are used to give the annotation a class
        and a name.  In the example above, the class is "Gene" and the name
        is "abc-1".  Giving the annotation a name allows the LDAS server to
        retrieve the annotation when requested.  It also allows you to 
        provide the LDAS server with a URL linking rule for the server to use
        when users request more information about the annotation.

        When a biological object is composed of multiple feature types, as in 
        the example above (1 transcript, 4 exons), each feature type gets a 
        separate line, but shares the same group class and name.  This mechanism is
        also used when a single object spans multiple discontinuous ranges,
        as in an mRNA aligned to the genome:

         EST    yk123.1 ESTWise similarity Cont_2a    5025 5100...
         EST    yk123.1 ESTWise similarity Cont_2a    5200 5280...

        In this example, the EST named "yk123.1" aligns to positions
        5025-5100, and 5200-5280 of contig Cont_2a.

        A group can also consist of just a single feature:

         Knockout G123.1  GeneTrap knockout  Cont_1b 8000 8600....

        For features that are not named, such as anonymous repetitive elements,
        just leave the group class and name blank, or use a single dot character
        ".".

 Columns 3 & 4:  type and subtype
        The type and subtype fields together describe the annotation type.
        The type provides a generic description, such as "exon", and the
        subtype qualifies the description by describing how the annotation
        was made.  For example, in the WormBase database, a type of
        "exon" and a subtype of "curated" means an exon prediction
        that has been examined and confirmed by a human annotator.  An
        exon with a subtype field of "GeneFinder" is used for an exon
        that was predicted by Phil Green's GeneFinder program.

        The choices of type and subtype are up to you.  However, it is recommended
        that whenever possible you use the type fields described in the DAS
        specification (http://www.biodas.org/documents/spec.html).

        NOTE: The type and subtype fields correspond to the method and
        source fields of the GFF (General Feature Format) specification.

 Columns 5, 6 & 7:  Reference sequence and range
         The next three columns give the reference sequence, and the start and
         stop of the annotation in reference sequence coordinates (bp units).  
         The start is always less than the stop.

 Column 8: Strand
         The eighth column gives the strand on which the annotation is located.
         Use "+" for annotations on the forward strand, "-" for annotations on
         the reverse strand, and "." or blank for annotations that are not
         inherently stranded.  Strand is typically relevant for genes and gene
         products.

 Column 9: Phase
         The next column is used to store the phase of annotations that relate
         to protein coding, such as CDS features.  The phase indicates the 
         position of the first base in the codon, and can be one of 0, 1 or 2.  
         Use a "." or blank for annotations that do not relate to protein coding.

 Column 10: Score
         The tenth column contains a score.  The score is a floating point
         number of unspecified units.  For similarity features, the score
         can be used to store the expectation value or percent similarity.
         For gene predictions, the score can be used to store the prediction
         confidence value.  Use "." or blank for annotations that do not have
         scores.

 Columns 11-12: Similarity alignment range
         The last two columns are optional.  If present, they are used to indicate the
         alignment between the reference sequence and the annotated sequence.  The fields
         are typically used for similarity annotations as in the following example:

  EST yk123.1 ESTWise similarity Cont_2a  5200 5280 . . 1.0e-12 77 157

         This example indicates that bases 5200 to 5280 of contig Cont_2a align
         to bases 77-157 of EST yk123.1.  Also note the expectation value score of
         1.0e-12 (read as 1 times 10 to the -12th power).
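The column rules above lend themselves to a quick sanity check before loading. The following Python sketch is an illustration only; check_annotation is a hypothetical helper, not part of LDAS, and it assumes that a one-base feature (start equal to stop) is acceptable.

```python
def check_annotation(cols):
    """Return a list of problems found in one [annotations] row.

    cols is the list of tab-separated column values for the row.
    """
    if len(cols) not in (10, 12):
        return [f"expected 10 or 12 columns, got {len(cols)}"]
    problems = []
    start, stop, strand, phase, score = cols[5:10]
    if not (start.isdigit() and stop.isdigit()):
        problems.append("start and stop must be integers")
    elif int(start) > int(stop):
        # Assumption: start == stop is allowed for one-base features.
        problems.append("start must not be greater than stop")
    if strand not in ("+", "-", ".", ""):
        problems.append(f"bad strand {strand!r}")
    if phase not in ("0", "1", "2", ".", ""):
        problems.append(f"bad phase {phase!r}")
    if score not in (".", ""):
        try:
            float(score)  # accepts values such as 99 or 1.0e-12
        except ValueError:
            problems.append(f"bad score {score!r}")
    return problems


good = ["Gene", "abc-1", "exon", "curated", "Cont_2a",
        "5050", "5100", "+", ".", "."]
print(check_annotation(good))  # []

bad = list(good)
bad[5], bad[6], bad[7] = "5100", "5050", "x"
print(check_annotation(bad))
```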

You can create these data files using any text editor or spreadsheet program, but be sure to save the results as text only, using tabs to delimit the columns. The data files must have the extension .das, and must begin with one of the section identifiers [references], [assembly] or [annotations]. A file can contain several different sections, and can in fact switch back and forth between them.

The expressivity of the annotations table is limited by the fact that an annotation can only belong to a single group. To express more complex relationships, you must factor out intermediate groups. For example, consider a gene that is composed of two alternative transcripts, each of which is composed of a different subset of four exons:

                         Exon1  Exon2  Exon3  Exon4
        transcript a      x              x      x
        transcript b      x       x      x

Under current restrictions, you will have to express these relationships by creating two named Transcript objects, which overlap in range with a Gene object. Exons 1 and 3 will be duplicated in the table:

  Gene        abc-1  curated gene       Cont_2a 5050 5380 ...
  Transcript  abc-1a curated transcript Cont_2a 5050 5380 ...
  Transcript  abc-1b curated transcript Cont_2a 5050 5280 ...
  Transcript  abc-1a curated exon       Cont_2a 5050 5100 ...
  Transcript  abc-1a curated exon       Cont_2a 5200 5280 ...
  Transcript  abc-1a curated exon       Cont_2a 5300 5380 ...
  Transcript  abc-1b curated exon       Cont_2a 5050 5100 ...
  Transcript  abc-1b curated exon       Cont_2a 5050 5100 ...
  Transcript  abc-1b curated exon       Cont_2a 5200 5280 ...

This restriction will be lifted in the DAS/2 server, which will allow much more expressive grouping of annotations.

Loading the Database

There are two database loaders provided with the LDAS distribution:

ldas_load.pl: This script can be used to load both local and remote databases from tab-delimited files. It can be used to initialize and populate an empty database, and to add new information to an existing database.
ldas_bulk_load.pl: This script loads Mysql directly using its file-based interface. It is faster than ldas_load.pl, but only works with local databases. In addition, ldas_bulk_load.pl always reinitializes the database, and cannot be used for incremental loading.

To load the data, first make sure that the Mysql server is running, and that you have created the database using ``mysqladmin create'' as described earlier.

Assume that the database is named ``dicty'' and the file containing its annotations is in ``dicty.das''. Then you can load the database with ldas_load.pl with the following command:

 % ldas_load.pl --create --database dicty dicty.das

The --create option initializes the database, loading the LDAS schema and deleting any conflicting tables.

The --database option specifies the database to load. You can use the abbreviated form ``dicty'' to load the local database named dicty, or the full form to load a remote database:

    dbi:mysql:dicty:remote_hostname
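One plausible way to treat the two forms is sketched below in Python. This is an assumption about the behavior, not a description of the loader's actual code: a name that does not already start with ``dbi:'' is taken to be a local Mysql database. The function name expand_dsn is hypothetical.

```python
def expand_dsn(database):
    """Expand an abbreviated database name into a full DBI-style string.

    Assumption: anything that already starts with 'dbi:' is a full
    form and is returned unchanged; anything else names a local
    Mysql database.
    """
    if database.lower().startswith("dbi:"):
        return database
    return f"dbi:mysql:{database}"


print(expand_dsn("dicty"))
print(expand_dsn("dbi:mysql:dicty:remote_hostname"))
```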

The script provides --user and --pass options if you need to specify a username and password for the database.

These options can be abbreviated as -c, -d and so on. Call the script with the --help option for more usage information.

If the data is distributed among multiple files, you can load them in one fell swoop like this:

 % ldas_load.pl --create --database dicty dicty1.das dicty2.das dicty3.das...

With the ldas_load.pl script (but not the bulk loader!) you can load the database incrementally, adding the contents of additional data files as needed. If you try to load the same file twice, you will see many ``duplicate key'' errors. This is harmless.

All options can be abbreviated to single-letter options. For example, you can use -d instead of --database, and -c instead of --create.

The ldas_bulk_load.pl script is called in the same way with the same arguments. However, it always reinitializes the database, throwing out whatever was there before. Ordinarily the script will warn you when it is about to do this and give you a chance to abort. In this case, the --create option simply turns off this warning.

Do not try to use ldas_bulk_load.pl to load a remote database or to perform an incremental load. It won't work.

There is a very small test data set included in this distribution in the subdirectory ``testdata.''

Testing the Database

The distribution comes with a simple database dumper and query script named ldasdump.pl. Run it with the -h argument to see the usage.

If you've used the test data to load the ``dicty'' database, you can dump out the entire contents of the database like this:

 % ldasdump.pl --database dicty
 Sequence Chr1   Chromosome Component Chr1   1 10000 +1  
 Sequence Link_1 Link       Component Link_1 1 6000  +1  
 Sequence Link_1 Link       Component Chr1   1 5000  +1   1001 6000
 Sequence Link_2 Link       Component Link_2 1 7000  +1         
 ...

Notice that the data does not come out in exactly the same format as it went in. In particular, the various reference sequences and their components appear as various features of type ``Component.''

To extract a certain set of feature types, use the --type argument. For example:

 % ldasdump.pl -d dicty --type exon,intron
 Gene   abc-1   curated exon    Cont_2a 5300    5380    +1              
 Gene   abc-1   curated exon    Cont_2a 5200    5280    +1              
 Gene   abc-1   curated exon    Cont_2a 5050    5100    +1

To extract a range, for example, everything on contig Cont_2a from position 5000 to 5200, list the range(s) after the options:

 % ldasdump.pl -d dicty Cont_2a:5000,5200
 Sequence Cont_2a Component  Contig   Cont_2a  1     9000...
 Gene     abc-1   exon       curated  Cont_2a  5050  5100 ...
 Gene     abc-1   CDS        curated  Cont_2a  5060  5100 ...
 Gene     abc-1   exon       curated  Cont_2a  5200  5280 ...
 Gene     abc-1   CDS        curated  Cont_2a  5200  5280 ...   
 EST      yk123.1 similarity ESTWise  Cont_2a  5025  5100 ...
 EST      yk123.1 similarity ESTWise  Cont_2a  5200  5280 ...
 .        .       repeat     alu      Cont_2a  5050  5150 ...

The query will find everything that overlaps the specified range. The --type and range arguments can be combined.
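The overlap rule can be stated precisely with a small Python sketch: a feature is returned when it starts at or before the end of the requested range and ends at or after its start. The helper names parse_range and overlapping, and the in-memory feature list, are made up for illustration; ldasdump.pl itself queries the database.

```python
def parse_range(spec):
    """Parse a range such as 'Cont_2a:5000,5200' into (ref, start, stop)."""
    ref, _, span = spec.rpartition(":")
    start, stop = (int(x) for x in span.split(","))
    return ref, start, stop


def overlapping(features, spec, types=None):
    """Yield features overlapping the range, optionally filtered by type."""
    ref, start, stop = parse_range(spec)
    for f in features:
        if f["ref"] != ref:
            continue
        if types and f["type"] not in types:
            continue
        # Two closed intervals overlap when each starts before the other ends.
        if f["start"] <= stop and f["stop"] >= start:
            yield f


features = [
    {"type": "exon", "ref": "Cont_2a", "start": 5050, "stop": 5100},
    {"type": "exon", "ref": "Cont_2a", "start": 5200, "stop": 5280},
    {"type": "exon", "ref": "Cont_2a", "start": 5300, "stop": 5380},
    {"type": "repeat", "ref": "Cont_2a", "start": 5050, "stop": 5150},
]
hits = list(overlapping(features, "Cont_2a:5000,5200"))
print([(f["type"], f["start"], f["stop"]) for f in hits])
```

Note that the exon at 5200-5280 is included even though only its first base falls inside the range, while the exon at 5300-5380 is not.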

If you have the Bio::Graphics module installed (see http://www.gmod.org), you can use the -g option to generate a GIF or PNG file of the region, which can then be piped to your favorite image viewer.

For example,

 % ldasdump.pl -g -d dicty Cont_2a:5000,5200 | display -

Setting Up the FASTA File Directory (reference servers only)

If you are running a DAS reference server, the server will be called on to serve up segments of the genomic DNA. Mysql databases do not handle large segments of DNA well, so LDAS keeps the DNA in external FASTA files.

Select a directory to contain the DNA files. It should be on the same machine that the web server runs on, and in a directory that is writable by the web server user (usually ``nobody''). The reason for this is that the very first time the LDAS server needs DNA information, it will construct an index of the contents of the directory and store it in an index file. Thereafter, access to arbitrary sections of DNA will be very fast. This functionality uses Bioperl's Bio::DB::Fasta module, which in turn runs on top of Perl's hash databases.

The FASTA files should contain one entry for each sequenced segment of the genome. These are the lowest-level assembly units listed in the [references] section of the annotation load files, typically corresponding to sequenced clones or to the contigs of a whole genome shotgun effort. In our toy ``dicty'' example the lowest-level entry is Contig. So the FASTA files should contain entries for Cont_1a, Cont_1b, Cont_2a and so forth:

 >Cont_1a
 cgagtatgtcctcaaaaagaagggatggccacaagacagaggtattcttaaaccggacca
 agcggttatccgccgtcacgccgcttgtgcctatgactcgaggtcgcgcacgcagggtat
 gcgtctttatgcatctgcgttgaacctatagtcaatgctgatatcgttggatgtttatat
 ...
 >Cont_1b
 tggaggggctctgttaagttttactgatcacacaggttatcgattggtacgcgtatcttg
 cagtggcggcgaatgtagcttaggtgggaacgttataaagttgagggattaagatattat
 gagagttcctgaaggccgctgccacggaatctcacgtaagacccaccgaagactaattgg
 ...

All the DNAs can be in a single FASTA file, or can be split up among several files for convenience. The FASTA files must then be placed in the designated directory. In the example, we will assume that you wish to place the FASTA files into /var/fasta/dicty.

 % cd /var
 % mkdir -p fasta/dicty
 % chgrp nobody fasta/dicty
 % chmod g+w fasta/dicty
 % cp ~/dicty_genome/*.fasta /var/fasta/dicty
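Before pointing LDAS at the directory, it can be worth checking that every lowest-level reference sequence has a FASTA entry of the expected length. The Python sketch below is a hypothetical pre-flight check, not part of LDAS. It assumes that the length given in the [references] section equals the number of bases in the FASTA entry, and that the lowest-level class is ``Contig'' as in the toy example.

```python
def fasta_lengths(text):
    """Return {sequence_id: number_of_bases} for FASTA-formatted text."""
    lengths = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word after '>'.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_fasta(references, lengths, lowest_class="Contig"):
    """Compare [references] rows (name, class, length) with FASTA entries."""
    problems = []
    for name, cls, length in references:
        if cls != lowest_class:
            continue  # only the lowest-level units need DNA
        if name not in lengths:
            problems.append(f"{name}: no FASTA entry")
        elif lengths[name] != int(length):
            problems.append(
                f"{name}: FASTA has {lengths[name]} bp, expected {length}")
    return problems


fasta = ">Cont_1a\nacgt\nac\n>Cont_1b some description\nacg\n"
refs = [("Chr1", "Chromosome", "10"), ("Cont_1a", "Contig", "6"),
        ("Cont_1b", "Contig", "4"), ("Cont_2a", "Contig", "5")]
print(check_fasta(refs, fasta_lengths(fasta)))
```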

CONFIGURING LDAS TO SERVE ANNOTATIONS

The installation steps will install a single Perl script, named ``das'', in Apache's CGI directory. The exact location of this script depends on the CGI script directory path selected when LDAS was installed, but it is /usr/local/apache/cgi-bin/das by default. DAS clients issue requests to the CGI script by appending the selected data source and desired command to the end of the URL, as in:

   http://www.yourhost.org/cgi-bin/das/dicty/entry_points

In this example, the data source is ``dicty'' and the command is ``entry_points''. There may also be CGI parameters appended to the URL, as in:

 http://www.yourhost.org/cgi-bin/das/dicty/features?segment=Cont_1a:1,5000

(www.yourhost.org is the name of the machine that is running the web server.)
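The URL layout described above (script location, then data source, then command, then optional CGI parameters) can be captured in a small Python helper. The function name das_url is hypothetical; the host name is the placeholder used throughout this document.

```python
from urllib.parse import urlencode


def das_url(base, source, command, **params):
    """Build a DAS request URL of the form base/source/command[?params]."""
    url = f"{base.rstrip('/')}/{source}/{command}"
    if params:
        # Keep ':' and ',' readable in segment specifications.
        url += "?" + urlencode(params, safe=":,")
    return url


base = "http://www.yourhost.org/cgi-bin/das"
print(das_url(base, "dicty", "entry_points"))
print(das_url(base, "dicty", "features", segment="Cont_1a:1,5000"))
```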

The ``data_source'' part of the URL is a symbolic name used to refer to the database. For example, if your annotation data set corresponds to the Dictyostelium genomic assembly, you could choose ``dicty'' as the symbolic name, and refer to the annotations as:

   http://www.yourhost.org/cgi-bin/das/dicty/command

If there are several releases of the genome, each with a different version number, it is suggested that you append the version number to the symbolic name, as in ``dicty3''.

Before the script can start serving annotations, it needs to be configured to handle the annotation data set. This involves creating a custom configuration file. The LDAS configuration files are stored in the directory selected during the installation process, /usr/local/apache/conf/das.conf by default. They are named using the data source name with ``.conf'' appended, as in ``dicty.conf''.

A number of sample .conf files are installed: ``test.conf'' is a simple configuration file suitable for use as a template when creating your own sources. ``elegans.conf'' is a more complicated sample configuration used by the C. elegans WormBase database.

To configure a new data source, create a new .conf file from the ``test.conf'' template:

   % cp test.conf dicty.conf

Now edit the .conf file in your favorite text editor, changing the configuration options as appropriate. The configuration file is divided into a number of sections, each one introduced by a [SECTION] title. Each [SECTION] has one or more options. An option consists of an option name, and one or more option values. The general format is this:

  option_name = option_value1 option_value2

We'll now walk through the configuration file. Please refer to the test.conf file during this walkthrough.

[DATA SOURCE] Section
```
 [DATA SOURCE]
 description = Test annotations
 adaptor     = dbi::mysqlopt
 mapmaster   = http://www.test.org/db/das/dicty
 database    = dbi:mysql:database=dicty;host=localhost
 fasta_files =
 user        =
 passwd      =
```
The first section of the configuration file is introduced by the line ``[DATA SOURCE]'', and has name=value configuration options that describe the name of the database, its type, and other information.

b<description> is a human-readable description of the data source. It should briefly describe the organism and the type of annotations that are available.

b<adaptor> tells the Bio::DB::GFF module what database schema to use when accessing the database. Use ``dbi::mysqlopt'' unless you know you want to use a different schema.

b<mapmaster> is the URL of the reference server for this data source. If you are the reference server, then this URL should be for the das script itself:
```
  http://www.yourhost.org/cgi-bin/das/dicty3
```
b<database> is the address of the Bio::DB::GFF database, in the format expected by the Perl DBI module. You can use the full form shown in the template, but if the database is running on the local machine you may abbreviate it to just the database name, as in:
```
  database = dicty
```
b<fasta_files> indicates the directory where raw DNA FASTA files will be stored (reference servers only).

b<user> and b<passwd> provide the username and password for logging into the Mysql database. It is recommended that you use a username that has select-only privileges.

The [CATEGORIES] section
```
 [CATEGORIES]
 default       = structural
 translation   = stop ATG CDS 5'UTR 3'UTR misc_translated
 transcription = exon intron tRNA mRNA ncRNA 5'Cap TSS PolyA Splice5 Splice3 misc_transcribed
 variation     = insertion deletion substitution misc_variation
 structural    = Component clone primer_left primer_right oligo assembly_tag misc_structural
 similarity    = similarity NN NP PN PP misc_similarity misc_homology
 repeat        = microsatellite inverted tandem transposable_element LINE misc_repeat
 experimental  = knockout expression_tag microarrayed RNAi_result RNAi
               transgenic mutant misc_experimental
```
The DAS protocol requires that every type defined in the annotation tables have a corresponding category. The category is a broad, extensible description of the nature of the annotation. This section matches types to categories.

The ``default'' option lists the category that will be returned when a more specific type is not found. Other options correspond to category names, such as ``translation'' and ``transcription''. The option values are a space-delimited list of type names. You can use just the main type name, such as ``exon'', or the more specific combination of type and subtype in the format ``type:subtype'', as in ``exon:curated''.
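
The type-to-category lookup described above might be sketched as follows. This is an illustration in Python rather than the LDAS implementation; the function name category_for is hypothetical, and the precedence (exact ``type:subtype'' first, then the bare type, then the default) is an assumption that the text implies but does not spell out.

```python
def category_for(feature_type, categories):
    """Return the category for 'type' or 'type:subtype' (hypothetical helper).

    categories maps each option name from [CATEGORIES] to its list of values,
    including the special 'default' option.
    """
    # Invert the section: type name -> category name.
    index = {}
    for category, type_names in categories.items():
        if category == "default":
            continue
        for type_name in type_names:
            index[type_name] = category

    main_type = feature_type.split(":", 1)[0]
    # Assumed precedence: most specific entry wins.
    for key in (feature_type, main_type):
        if key in index:
            return index[key]
    return categories["default"][0]


CATEGORIES = {
    "default": ["structural"],
    "translation": ["stop", "ATG", "CDS"],
    "transcription": ["exon", "intron"],
}
print(category_for("exon:curated", CATEGORIES))
print(category_for("oligo", CATEGORIES))
```

Here ``exon:curated'' falls back to the bare ``exon'' entry, and ``oligo'', which is not listed in this reduced table, receives the default category.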

If you need more space for the list of values, you can continue them on subsequent lines provided that you leave a space in front of each continuation line. For example:
```
 repeat = microsatellite:dinucleotide microsatellite:trinucleotide
        microsatellite:tetranucleotide microsatellite:pentanucleotide
        microsatellite:other
```
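The file format described so far ([SECTION] titles, ``name = value1 value2'' options, and continuation lines that begin with a space) is simple enough to sketch as a small parser. The Python below is only an illustration with a hypothetical function name, parse_conf; it assumes that option lines start in the first column, and it does not attempt comment handling, which this document does not describe.

```python
def parse_conf(text):
    """Parse LDAS-style configuration text into {section: {option: [values]}}."""
    config, section, option = {}, None, None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Leading whitespace marks a continuation of the previous option.
            if section is None or option is None:
                raise ValueError(f"continuation without an option: {line!r}")
            section[option].extend(line.split())
        elif line.startswith("["):
            section = config.setdefault(line.strip()[1:-1], {})
            option = None
        else:
            if section is None:
                raise ValueError(f"option outside of a section: {line!r}")
            # Split on the first '=' only, since URL values contain '=' too.
            name, _, values = line.partition("=")
            option = name.strip()
            section[option] = values.split()
    return config


SAMPLE = """\
[DATA SOURCE]
description = Test annotations
database    = dicty
fasta_files =

[CATEGORIES]
default = structural
repeat  = microsatellite:dinucleotide microsatellite:trinucleotide
        microsatellite:tetranucleotide microsatellite:pentanucleotide
        microsatellite:other
"""

conf = parse_conf(SAMPLE)
print(conf["CATEGORIES"]["repeat"])
```

Values are returned as whitespace-separated lists; a free-text option such as ``description'' can be rejoined with " ".join(...).
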
The [LINKS] section
```
 [LINKS]
 default    = http://stein.cshl.org/cgi-bin/test-cgi.pl?name=$name;class=$class;type=$type
 exon       = http://stein.cshl.org/cgi-bin/test-cgi.pl/exon?name=$name;class=$class
 transcript = http://stein.cshl.org/cgi-bin/test-cgi.pl/transcript?name=$name;class=$class
 insertion  = http://stein.cshl.org/cgi-bin/test-cgi.pl/insertion?name=$name;class=$class
```
The DAS protocol allows the LDAS server to generate a web link for each feature and group of features. The [LINKS] section tells the LDAS server how to create these links.

Each option name is a defined annotation type, and can use the short format (``exon'') or the extended one (``exon:curated''). The value is a URL. When the LDAS server is running, it will scan the URL for the keywords $name, $class and $type, and replace them with the name of the annotation, its class, and its type.

The default option provides a generic link to use when no more specific type is defined.

If you do not want a link to be generated at all, use ``none'' for the URL. The following example generates links for all features except those whose types are ``alu'' or ``LINE'':
```
 default    = http://stein.cshl.org/cgi-bin/test-cgi.pl?name=$name;class=$class;type=$type
 alu        = none
 LINE       = none
```
The stein.cshl.org URL given in the examples is a test CGI script that merely echoes back its arguments. It is useful for debugging, but you will want to replace it with a URL that provides more specific information about the selected feature.
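
The keyword substitution described in this section can be sketched as below. This is a Python illustration with a hypothetical function name, link_for; the lookup order (extended type, short type, default) is an assumption, and values are inserted verbatim because this document does not say whether the server URL-escapes them.

```python
import re


def link_for(feature_type, name, feature_class, links):
    """Expand the [LINKS] template for a feature, or return None for no link."""
    main_type = feature_type.split(":", 1)[0]
    template = None
    # Assumed precedence: extended type, then short type, then default.
    for key in (feature_type, main_type, "default"):
        if key in links:
            template = links[key]
            break
    if template is None or template == "none":
        return None
    values = {"name": name, "class": feature_class, "type": feature_type}
    # A single pass, so substituted text is never rescanned for keywords.
    return re.sub(r"\$(name|class|type)", lambda m: values[m.group(1)], template)


LINKS = {
    "default": "http://stein.cshl.org/cgi-bin/test-cgi.pl?name=$name;class=$class;type=$type",
    "alu": "none",
    "LINE": "none",
}
print(link_for("exon", "exper1", "Sequence", LINKS))
print(link_for("alu", "rep7", "Sequence", LINKS))
```

The feature names and classes passed in the two calls are made up for the example; the second call returns None because ``alu'' is mapped to ``none''.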

The [COMPONENTS] section
```
 [COMPONENTS]
 entry_points = Component:Chromosome Component:Link Component:Contig
 has_subparts = Component:Chromosome Component:Link
 has_superparts = Component:Link Component:Contig
```
The components section is needed by reference servers only (it doesn't hurt if annotation servers have it). It describes how the genome assembly is built up from individually sequenced segments of DNA.

A typical sequence assembly will contain two to three layers of components. The exact terminology used (``contig'', ``raw_contig'', ``link'') depends on the particular sequencing project and assembly strategy. LDAS needs to distinguish those components which are built up from smaller components from those which are the components of larger assemblages. It also needs to mark those components that are sufficiently well-known that they are good entry points for browsing.

The section contains three options:

The b<entry_points> option refers to a list of annotation types that are to be considered entry points into the data. Entry points are used by DAS clients to select the ``top level'' coordinate system to present to users. LDAS uses a type of ``Component'' to describe every component of the genome assembly (chromosomes, contigs, etc), so the various subclasses of components are distinguished by their subtypes, as shown above.

The b<has_subparts> option refers to a list of those component types that have subparts. That is, they are built up from smaller pieces.

The b<has_superparts> option refers to a list of those component types that have superparts. That is, they are used to build larger pieces.

In the example above, Components of subtype Chromosome and Link have subparts, while Components of subtype Link and Contig have superparts. This describes the following assembly strategy:
```
  ----------------------------------------- Chromosome
  --------------- ----------------- ------- Link
  -- ---- ----- - ----- ------ ---- -- ---- Contig
```
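The layering implied by has_subparts and has_superparts can be derived with simple set arithmetic. The following Python sketch (the function name component_roles is hypothetical, and LDAS may not compute the layers this way) classifies each component type as top, intermediate, or bottom level.

```python
def component_roles(has_subparts, has_superparts):
    """Classify component types by their position in the assembly."""
    subs, supers = set(has_subparts), set(has_superparts)
    return {
        "top": sorted(subs - supers),           # built from parts, part of nothing
        "intermediate": sorted(subs & supers),  # both built from parts and a part
        "bottom": sorted(supers - subs),        # a part only
    }


roles = component_roles(
    has_subparts=["Component:Chromosome", "Component:Link"],
    has_superparts=["Component:Link", "Component:Contig"],
)
print(roles)
```

For the example configuration this places Chromosome at the top, Link in the middle, and Contig at the bottom, matching the diagram above.
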
The [FILTER] section
```
 [FILTER]
 include = 
 exclude =
```
The [FILTER] section is used to control what annotations are exported from the database via the DAS protocol. By default, everything that is in the Bio::DB::GFF database will be exported on request. This can be modified by the b<include> and/or b<exclude> options.

The b<include> option is used to limit the exported annotation types to an explicit list. If it is defined, only types that are listed in b<include> will be made available. For example:
```
  include = intron exon CDS Component:Chromosome
```
In this example, only annotations of type ``intron'', ``exon'', ``CDS'' or ``Component:Chromosome'' will be made available by the DAS protocol.

The b<exclude> option acts in the reverse way. If b<exclude> is defined, then any types that appear on the list will be excluded from publication by the DAS protocol. For example:
```
  exclude = Component:Link oligo:proprietary
```
Now all annotations of type and subtype ``Component:Link'' or ``oligo:proprietary'' will be excluded. All other annotation types will be made available.

If both b<include> and b<exclude> lists are present, the LDAS server will take the most restrictive set: those features that are on the include list but not named on the exclude list.
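
The combined include/exclude rule can be written down compactly. The Python below is a sketch rather than the server's actual code; the function names are hypothetical, and it assumes that a bare type name such as ``exon'' in either list also matches any of its subtypes.

```python
def matches(feature_type, type_list):
    """True if the full 'type:subtype' or its bare type appears in type_list."""
    main_type = feature_type.split(":", 1)[0]
    return feature_type in type_list or main_type in type_list


def is_exported(feature_type, include=(), exclude=()):
    """Apply the [FILTER] rules: an empty include list means 'everything'."""
    if include and not matches(feature_type, include):
        return False
    return not matches(feature_type, exclude)


include = ["intron", "exon", "CDS", "Component:Chromosome"]
exclude = ["Component:Link", "oligo:proprietary"]
print(is_exported("exon:curated", include, exclude))
print(is_exported("Component:Link", include, exclude))
```

With both lists present, a type must pass the include test and must not be named on the exclude list, which is the ``most restrictive set'' behavior described above.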

LDAS also installs a configuration file named elegans.conf, which is a richer set of definitions used by the WormBase LDAS server. You will probably want to remove it from the das.conf directory before you take your server live.

Configuring the Stylesheet (annotation servers only)

DAS allows annotation servers to provide hints to DAS browsers as to how the various annotations should be rendered graphically. The hints are known as a stylesheet. LDAS keeps its stylesheets in the das.conf directory, in a file named ``source.style'', where ``source'' is the name of the data source. For example, the stylesheet for the example ``dicty'' database will be named ``dicty.style''.

You do not need to create a specialized stylesheet to run an annotation server. The default stylesheet, contained in the file ``default.style'', provides a reasonable set of defaults (and will be updated regularly as the DAS data model evolves).

However, if you wish to customize the stylesheet, here is an excerpt from default.style with a description of how it works:

 [default]
 glyph   = box
 bump    = 0
 bgcolor = cyan
 fgcolor = black
 font    = sanserif

 [component]
 glyph   = anchored_arrow

 [transcription:high]
 bump    = 1

 [transcription:low]
 bump    = 0

 [transcript:high]
 bump = 1
 connector = hat

 [transcript:low]
 bump = 0
 connector = solid

The stylesheet is divided into multiple bracketed [sections] just like the configuration files. Each section contains attribute/value pairs that describe how a set of annotations are to be rendered.

The [default] section is special, and provides global defaults used for all annotation types.

Other [sections] contain either a category name or a specific annotation type. For example, this will set the appearance of all annotations of category ``transcription'':

 [transcription]

This will set the appearance of all annotations of type ``intron'':

 [intron]

This will set the appearance of all annotations of type ``intron'', subtype ``curated'':

 [intron:curated]

This will set the appearance of annotations of type ``intron'', subtype ``curated'', at high magnification:

 [intron:curated:high]

and this will set the appearance at low magnification:

 [intron:curated:low]

The definitions of ``high'' and ``low'' are DAS browser-dependent. High magnifications are those in which the density of features is low enough to distinguish individual features. Low magnifications are those in which features are densely packed.

It is also possible to assign rendering attributes to a class of group. Just use the group class in the [section].

 [Gene]

There is the possibility of collision between the naming of categories, annotation types, and group classes. In case of collision, the priority is: annotation type, group class, category.
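
The collision priority can be illustrated with a small lookup routine. This Python sketch uses a hypothetical function name, resolve_style, and makes two assumptions beyond the text: a ``type:subtype'' section is tried before the bare type, and only the single highest-priority matching section is layered over [default]. Magnification suffixes (``:high'', ``:low'') are left out for brevity.

```python
def resolve_style(stylesheet, feature_type, group_class=None, category=None):
    """Pick rendering attributes for a feature from parsed stylesheet sections."""
    style = dict(stylesheet.get("default", {}))  # global defaults first
    main_type = feature_type.split(":", 1)[0]
    # Priority from the text: annotation type, then group class, then category.
    for key in (feature_type, main_type, group_class, category):
        if key is not None and key in stylesheet:
            style.update(stylesheet[key])
            break
    return style


STYLESHEET = {
    "default": {"glyph": "box", "bump": "0"},
    "transcription": {"bump": "1"},
    "Gene": {"glyph": "line", "style": "hat"},
    "intron": {"glyph": "line"},
}
print(resolve_style(STYLESHEET, "intron:curated", "Gene", "transcription"))
print(resolve_style(STYLESHEET, "exon", "Gene", "transcription"))
```

In the first call the [intron] section wins over [Gene] and [transcription]; in the second there is no [exon] section, so the group class [Gene] is used.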

Within a configuration section are a number of attribute=value pairs. The full list of attributes and their values is given in the DAS specification: http://www.biodas.org/documents/spec.html#glyphid.

The most frequently used attributes are:

glyph

This is the glyph (graphical element) that is used to display the object. Frequently-used values are ``arrow'', ``anchored_arrow'', ``box'', ``hidden'', and ``line''.

fgcolor

This is the foreground color (color of lines and font). Most English-language color names are accepted.

bgcolor

This is the background color used to fill hollow objects, such as boxes.

font

This is the font used to render labels. Options are ``serif'', ``sanserif'', ``helvetica'', ``times'', and ``courier''.

height

This is the height of the glyph, measured in pixels.

bump

This is a true/false flag. If its value is 1, then the browser will turn on collision control, preventing annotations from overlapping on the screen, but potentially making the display very tall. If its value is 0, then overlapping annotations are allowed.

style

When drawing lines, this is the style to use, one of ``hat'', ``solid'' or ``dashed''. It is most often associated with groups. For example, specifying a glyph type of ``line'' and a style of ``hat'' in the context of a Gene group will produce graphics that look like a traditional gene splicing model:

    [Gene]
    glyph = line
    style = hat

TESTING THE LDAS SERVER

If you have loaded up the ``dicty'' database with the contents of the test.das load file and configured the dicty.conf file as described earlier, you should be able to make queries on the web server. You can do this with your favorite web browser.

Testing the DSN command

Fetch the URL http://www.your.site/cgi-bin/das/dsn.

You should get a screen like this one:

 <?xml version="1.0" standalone="yes"?>
 <!DOCTYPE DASDSN SYSTEM "http://www.biodas.org/dtd/dasdsn.dtd">
 <DASDSN>
    <DSN>
       <SOURCE id="dicty">dicty</SOURCE>
       <MAPMASTER>http://www.test.org/db/das/dicty</MAPMASTER>
       <DESCRIPTION>Test annotations</DESCRIPTION>
    </DSN>
    <DSN>
       <SOURCE id="elegans">elegans</SOURCE>
       <MAPMASTER>http://www.wormbase.org/db/das/elegans</MAPMASTER>
       <DESCRIPTION>C. elegans annotations on chromosome I & II</DESCRIPTION>
    </DSN>
    <DSN>
       <SOURCE id="test">test</SOURCE>
       <MAPMASTER>http://www.test.org/db/das/test</MAPMASTER>
       <DESCRIPTION>Test annotations</DESCRIPTION>
    </DSN>
 </DASDSN>

There should be one <DSN> section for each configuration file in the das.conf directory.
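
If you prefer to check the response from a script rather than by eye, the DSN document is easy to read with a standard XML parser. The Python sketch below works on a trimmed copy of the response (XML declaration and DOCTYPE omitted, two sources only); the function name parse_dsn is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the dsn response shown above.
DSN_XML = """<DASDSN>
  <DSN>
    <SOURCE id="dicty">dicty</SOURCE>
    <MAPMASTER>http://www.test.org/db/das/dicty</MAPMASTER>
    <DESCRIPTION>Test annotations</DESCRIPTION>
  </DSN>
  <DSN>
    <SOURCE id="test">test</SOURCE>
    <MAPMASTER>http://www.test.org/db/das/test</MAPMASTER>
    <DESCRIPTION>Test annotations</DESCRIPTION>
  </DSN>
</DASDSN>"""


def parse_dsn(xml_text):
    """Return one dict per <DSN> element in a dsn response."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": dsn.find("SOURCE").get("id"),
            "mapmaster": dsn.findtext("MAPMASTER"),
            "description": dsn.findtext("DESCRIPTION"),
        }
        for dsn in root.findall("DSN")
    ]


for source in parse_dsn(DSN_XML):
    print(source["id"], source["mapmaster"])
```

Comparing the returned ids against the .conf files in the das.conf directory is a quick way to confirm that every data source is being picked up.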

Testing the entry_points command

Fetch the URL http://www.your.site/cgi-bin/das/dicty/entry_points

You should get a list of entry points that looks like this:

 <?xml version="1.0" standalone="no"?>
 <!DOCTYPE DASEP SYSTEM "http://www.biodas.org/dtd/dasep.dtd">
 <DASEP>
 <ENTRY_POINTS href="http://localhost/perl/das/dicty/entry_points" version="1.0">
 <SEGMENT id="Chr1" size="10000" start="1" stop="10000" class="Sequence" orientation="+" subparts="no">Chr1</SEGMENT>
 <SEGMENT id="Link_1" size="6000" start="1" stop="6000" class="Sequence" orientation="+" subparts="no">Link_1</SEGMENT>
 <SEGMENT id="Link_2" size="7000" start="1" stop="7000" class="Sequence" orientation="+" subparts="no">Link_2</SEGMENT>
 <SEGMENT id="Cont_1a" size="5000" start="1" stop="5000" class="Sequence" orientation="+" subparts="no">Cont_1a</SEGMENT>
 <SEGMENT id="Cont_1b" size="5000" start="1" stop="5000" class="Sequence" orientation="+" subparts="no">Cont_1b</SEGMENT>
 <SEGMENT id="Cont_2a" size="9000" start="1" stop="9000" class="Sequence" orientation="+" subparts="no">Cont_2a</SEGMENT>
 <SEGMENT id="Cont_2b" size="8000" start="1" stop="8000" class="Sequence" orientation="+" subparts="no">Cont_2b</SEGMENT>
 </ENTRY_POINTS>
 </DASEP>

Testing the features command

Fetch the URL http://www.your.site/cgi-bin/das/dicty/features?segment=Cont_1a;type=RNAi

You should get information on the single RNAi experiment that overlaps Contig Cont_1a:

 <?xml version="1.0" standalone="yes"?>
 <!DOCTYPE DASGFF SYSTEM "http://www.biodas.org/dtd/dasgff.dtd">
 <DASGFF>
 <GFF version="1.01" href="http://localhost/perl/das/dicty/features?segment=Cont_1a;type=RNAi">
 <SEGMENT id="Cont_1a" start="1" stop="5000" version="1.0">
    <FEATURE id="exper1" label="exper1">
       <TYPE id="RNAi" category="experimental">RNAi</TYPE>
       <METHOD id="RNAi">RNAi</METHOD>
       <START>1</START>
       <END>2500</END>
       <SCORE>-</SCORE>
       <ORIENTATION>+</ORIENTATION>
       <PHASE>0</PHASE>
       <GROUP id="exper1" />
    </FEATURE>
 </SEGMENT>
 </GFF>
 </DASGFF>
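
The features response can be checked programmatically in the same spirit. The Python sketch below parses a copy of the response above (XML declaration and DOCTYPE omitted) and pulls out the fields most useful for a sanity check; the function name parse_features is hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the features response shown above, minus declaration and DOCTYPE.
FEATURES_XML = """<DASGFF>
<GFF version="1.01" href="http://localhost/perl/das/dicty/features?segment=Cont_1a;type=RNAi">
<SEGMENT id="Cont_1a" start="1" stop="5000" version="1.0">
   <FEATURE id="exper1" label="exper1">
      <TYPE id="RNAi" category="experimental">RNAi</TYPE>
      <METHOD id="RNAi">RNAi</METHOD>
      <START>1</START>
      <END>2500</END>
      <SCORE>-</SCORE>
      <ORIENTATION>+</ORIENTATION>
      <PHASE>0</PHASE>
      <GROUP id="exper1" />
   </FEATURE>
</SEGMENT>
</GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Return one dict per <FEATURE>, tagged with its enclosing segment id."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.findall("FEATURE"):
            type_el = feature.find("TYPE")
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": type_el.get("id"),
                "category": type_el.get("category"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
                "orientation": feature.findtext("ORIENTATION"),
            })
    return features


print(parse_features(FEATURES_XML))
```

For the test data set this yields a single RNAi feature on Cont_1a spanning positions 1 to 2500, in the ``experimental'' category assigned by the [CATEGORIES] section.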

Testing the types command

Fetch the URL http://www.your.site/cgi-bin/das/dicty/types. This should return the list of all annotation types contained in the database:

 <?xml version="1.0" standalone="yes"?>
 <!DOCTYPE DASTYPES SYSTEM "http://www.biodas.org/dtd/dastypes.dtd">
 <DASTYPES>
 <GFF version="1.2" summary="yes" href="http://localhost/perl/das/dicty/types?enumerate=1">
 <SEGMENT>
        <TYPE id="similarity:ESTWise" category="similarity" method="similarity" source="ESTWise" />
        <TYPE id="repeat:alu" category="miscellaneous" method="repeat" source="alu" />
        <TYPE id="RNAi" category="experimental" method="RNAi" source="" />
        <TYPE id="Component:Contig" category="structural" method="Component" source="Contig" />
        <TYPE id="exon:curated" category="transcription" method="exon" source="curated" />
        <TYPE id="CDS:curated" category="translation" method="CDS" source="curated" />
        <TYPE id="Component:Link" category="structural" method="Component" source="Link" />
        <TYPE id="Component:Chromosome" category="structural" method="Component" source="Chromosome" />
 </SEGMENT>
 </GFF>
 </DASTYPES>

What to do next?

If your LDAS server is serving annotations on the Homo sapiens assembly used by the Ensembl server, then you should be able to add the URL of your LDAS server to Ensembl's contig display (http://www.ensembl.org). Bring up the contig display for a region of the genome covered by your annotations, select ``Das Sources'' from the menu, and add the URL for your DAS source, using the proper address for your hostname, and the appropriate data source name (not ``dicty''!): e.g. http://www.your.site/cgi-bin/das/your_source. Your annotations should now appear on the Ensembl display.

If your LDAS server is serving annotations on the C. elegans database using the WormBase assembly, please contact Lincoln Stein (lstein@cshl.org), and he will add your server URL to the list used by the C. elegans ``genome hunter'' display (http://www.wormbase.org).

For other organisms, contact the maintainer of the appropriate reference server (a list will be going up on http://www.biodas.org in early November).

If you are running a reference server, you may be able to use either Robin Dowell's Geodesic or Matthew Pocock's das-client application to view your data. However, at the time this was written, the Geodesic client was being updated to accommodate a few last-minute but important changes to the DAS specification.

Notes for mod_perl Users

If you use the combination of Apache and mod_perl, performance of the LDAS server will be improved dramatically. This is because mod_perl allows the connection between the database and the das CGI script to remain active even when the script is not actively serving data.

To use LDAS with mod_perl, configure an Apache::Registry directory in the way described by the mod_perl documentation. Briefly:

 Alias /perl/ /usr/local/apache/cgi-perl/

 <Location /perl>
   SetHandler      perl-script
   PerlHandler     Apache::Registry
   PerlSendHeader  On
   Options         +ExecCGI
 </Location>

Copy the das script from the cgi-bin directory into the cgi-perl directory so that it is recognized and run by Apache::Registry.

The last step is to tell the das script where to find its configuration files. Create a <Location> section that configures a mod_perl variable named DasConfigFile:

 <Location /perl/das>
    PerlSetVar DasConfigFile conf/das.conf      
 </Location>

If this section is not present, the das script will look in the location specified at installation time. It is a hard-wired variable located towards the top of the das script.

BUG REPORTS

Please report bugs to lstein@cshl.org and to the DAS mailing list: das@ebi.ac.uk

LICENSE AND DISCLAIMER

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See the Artistic License file in the main Perl distribution for specific terms and conditions of use. In addition, the following disclaimers apply:

CSHL makes no representations whatsoever as to the SOFTWARE contained herein. It is experimental in nature and is provided WITHOUT WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR ANY OTHER WARRANTY, EXPRESS OR IMPLIED. CSHL MAKES NO REPRESENTATION OR WARRANTY THAT THE USE OF THIS SOFTWARE WILL NOT INFRINGE ANY PATENT OR OTHER PROPRIETARY RIGHT.

By downloading this SOFTWARE, your Institution hereby indemnifies CSHL against any loss, claim, damage or liability, of whatsoever kind or nature, which may arise from your Institution's respective use, handling or storage of the SOFTWARE.

If publications result from research using this SOFTWARE, we ask that CSHL be acknowledged and/or credit be given to CSHL scientists, as scientifically appropriate.