RFC012

TITLE:        Queryability of data in DAS/2
Author:       Thomas Down
Dependencies: Feature table communication, Assembly communication,
              Directory Services (RFC2)
Version:      1
Date:         7 September 2001

Introduction
------------

DAS 1.0 is designed as a simple protocol for browsing features
annotated on some `entry point' sequence (typically an assembled
chromosome). Mechanisms for querying data were omitted by design.
Here, we give a number of use cases which show how a generalised
query mechanism could significantly improve the usefulness of DAS.
The emphasis remains on using DAS for browser-type applications, but
it is evident that the addition of queryability would also make DAS
applicable to many other tasks.

Note: since the DAS/2 data model for features and assemblies has not
yet been widely discussed, the examples in this document are based on
the familiar DAS 1.0/GFF model.

Use cases
---------

Here, we give a number of typical situations (from the point of view
of a user accessing DAS via a simple browser application) which are
not adequately supported by a `browse-only' protocol:

1. User starts browsing at a `small scale' entry point (e.g. a
   clone), spots an interesting feature, then wishes to scroll to a
   neighbouring region, or simply zoom out and view a larger,
   assembled region.

2. User wishes to start browsing at a unique, identified feature
   (e.g. an Ensembl gene ID). May later want to scroll and/or zoom to
   see the feature in context. Note that the identified feature may
   come from either the reference server or an annotation server.

3. User types some search term (e.g. `Histone') into the client's
   search dialog. The client should pop up a window listing all
   occurrences, and allow the user to flip between them, viewing
   each in context.

Historical note
---------------

Case 1 could potentially be handled in old, `server-side-assembly'
DAS, where it was legal to make a query like:

    features?ref=my_clone;start=-10000;stop=-5000

The server was expected to use overlap tables or some other
implementation to `walk' out from the specified reference landmark.
However, this approach was rejected, since it required all annotation
servers (as well as reference servers) to have intimate access to the
assembly information. It also raised performance and practicality
issues regarding server implementation.

Implementing the use cases with queries
---------------------------------------

1. - Consider a clone entry point (e.g. AL020994).

   - Client issues the query (to the reference server):

         type.category = component && target.id = 'AL020994';

   - Server returns the sequence containing this component-feature
     (in this case the assembly scaffold, `ctg22fin4').

   - Client can repeat this procedure to navigate to a higher level
     (e.g. `chr22').

2. - Client issues a request for feature-by-ID to each active
     annotation server:

         id = 'ENSG0000000123542';

   - One of these should return a sequence containing this feature.
     The client can then use the same procedure as in case 1 to
     navigate to the top level of the assembly.

3. - As case 2, but one (or more) annotation servers may return
     multiple hits. Note that, strictly speaking, case 2 can also
     return multiple hits, since IDs are scoped by reference server.
     Therefore, the client requires some UI for presenting a set of
     search results and allowing the user to switch between them.

Issues
------

- Queryability will significantly increase the complexity of servers.
  Do we need it? [Author's note: obviously, I think the answer to
  this is yes!] Should it be required, or just an option?

- Scope of querying: is the query model simply a `per feature'
  filter, or should (much) more complex structures be allowed?

- What syntax of query language do we use? SQL-like? XQueryX (the
  W3C XML query standard)?
  Or something custom-designed (and maybe rather simpler)?

- We must be careful to avoid disruptions to service by `killer
  queries': queries which use large amounts of system resources to
  execute. The query command should have a return state which
  indicates that the query was syntactically valid, but could not be
  executed within reasonable time or resource limits (where
  `reasonable' is obviously defined by the server developer or
  deployer).
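
As a rough illustration of the query-based navigation in use cases 1
and 2, and of the `killer query' return state raised under Issues,
the following Python sketch models a server's features in memory and
treats a query as a simple `per feature' filter. Everything named
here is an assumption made for illustration and is not part of any
DAS specification: the Feature fields (which stand in for
type.category, target.id and the annotated sequence), the QueryStatus
values, the run_query, containing_sequence and walk_to_top helpers,
and the sample data. The cap on the number of features examined is
only a stand-in for whatever time or resource limit a real server
would enforce, and queries are written as Python predicates rather
than in any proposed query syntax.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class QueryStatus(Enum):
    OK = "ok"
    # Query was valid, but the server refused to finish it (hypothetical state).
    LIMIT_EXCEEDED = "limit-exceeded"


@dataclass(frozen=True)
class Feature:
    id: str
    category: str             # stands in for type.category
    target_id: Optional[str]  # stands in for target.id
    sequence: str             # sequence on which the feature is annotated


def run_query(
    features: List[Feature],
    predicate: Callable[[Feature], bool],
    max_examined: int,
) -> Tuple[QueryStatus, List[Feature]]:
    """Apply a per-feature filter, giving up once max_examined is passed."""
    results: List[Feature] = []
    for examined, feature in enumerate(features, start=1):
        if examined > max_examined:
            # Assumed policy: return no partial results when the limit is hit.
            return QueryStatus.LIMIT_EXCEEDED, []
        if predicate(feature):
            results.append(feature)
    return QueryStatus.OK, results


def containing_sequence(
    features: List[Feature], sequence_id: str, max_examined: int = 1000
) -> Optional[str]:
    """Case 1: find the sequence holding sequence_id as a component."""
    status, hits = run_query(
        features,
        lambda f: f.category == "component" and f.target_id == sequence_id,
        max_examined,
    )
    if status is not QueryStatus.OK or not hits:
        return None
    return hits[0].sequence


def walk_to_top(features: List[Feature], sequence_id: str) -> List[str]:
    """Repeat the component query until no containing sequence is found."""
    path = [sequence_id]
    seen = {sequence_id}
    while True:
        parent = containing_sequence(features, path[-1])
        if parent is None or parent in seen:  # top reached, or a cycle
            return path
        path.append(parent)
        seen.add(parent)


# Hypothetical sample data, reusing the identifiers from the text above.
FEATURES = [
    Feature("component-1", "component", "AL020994", "ctg22fin4"),
    Feature("component-2", "component", "ctg22fin4", "chr22"),
    Feature("ENSG0000000123542", "gene", None, "AL020994"),
]
```

With this data, walk_to_top(FEATURES, "AL020994") climbs from the
clone through the scaffold to the chromosome. A feature-by-ID lookup
(case 2) is the same run_query call with a predicate on the id field,
followed by walk_to_top on the sequence of each hit. A client that
receives LIMIT_EXCEEDED can tell the user that the query was
understood but refused, rather than reporting an empty result.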