Track metadata handling
Revision as of 00:14, 14 February 2007
Background
I have a topic prompted specifically by the ENCODE grant proposal, but one that I think could have broad applicability: how to store and use track 'metadata'. What capabilities do you think we can or should provide relating to metadata? Are there helpful examples at other bioinformatics sites (e.g. NIH DCCs) that you have seen?
For ENCODE, metadata typically includes which cell lines were used for an experiment, which antibodies were used for ChIP-chip, and sometimes the timecourse of an experiment (e.g. at 0, 8, and 24 hrs). ENCODE users may want to locate, for example, all datasets on HeLa cells. More generally at our site, we get mailing list questions asking if we have XX type experimental data on any organism/assembly. We currently keep metadata in trackDb settings and the track description, and have no explicit search mechanisms. --Kate 12:38, 12 February 2007 (PST)
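To make the "all datasets on HeLa cells" request concrete, here is a minimal sketch of an exact-match search over per-track metadata. The track names and the setting names (cellLine, antibody, timepoint) are invented for illustration; they are not existing trackDb settings, and a real mechanism would presumably read them from wherever we end up storing metadata.

```python
# Hypothetical per-track metadata; names and values are made up for illustration.
TRACK_METADATA = {
    "exampleTrackA": {"cellLine": "HeLa", "antibody": "exampleAb1", "timepoint": "0h"},
    "exampleTrackB": {"cellLine": "HeLa", "antibody": "exampleAb2", "timepoint": "8h"},
    "exampleTrackC": {"cellLine": "otherLine", "antibody": "exampleAb1", "timepoint": "24h"},
}


def find_tracks(metadata, **criteria):
    """Return sorted names of tracks whose settings match every criterion exactly."""
    return sorted(
        name
        for name, settings in metadata.items()
        if all(settings.get(key) == value for key, value in criteria.items())
    )


hela_tracks = find_tracks(TRACK_METADATA, cellLine="HeLa")
hela_ab1_tracks = find_tracks(TRACK_METADATA, cellLine="HeLa", antibody="exampleAb1")
```

Exact matching on a fixed set of keys is the simplest possible choice here; it only works if the metadata values come from a controlled vocabulary.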
Discussion
From Daryl:
The HapMap DCC site has great metadata examples. See the Downloads/Documentation section here (the 'Bulk Data Download' link from the main page): http://www.hapmap.org/downloads/index.html.en The Protocols (including versioning) map directly to what we need. We'll also need a mechanism for tracking reagents -- individual cell lines, antibodies, etc. We should also keep track of chip designs in GEO/ArrayExpress. The HapMap DCC uses XML to communicate the metadata, and has gone through many updates of their formats (http://www.hapmap.org/downloads/xml_docs/). There are more metadata examples on the HapMap DCC internal site, but it is down at the moment. I can send the access info later.
The main difference between the HapMap and ENCODE DCCs is going to be the expansion in data types. The output of the HapMap project was primarily diploid genotypes, so this provided a fixed point that allowed many inputs (different genotyping platforms and protocols, different populations and individual samples) and many outputs (analyses -- genotype/allele frequencies, phasing, LD, etc.). The ENCODE DCC will need to be quite a bit more flexible to handle all of the various data types.
Notes on Metadata Handling at the HapMap DCC
--Kate 15:59, 13 February 2007 (PST)
Metadata Format
All HapMap data exchanged between providers and the DCC is formatted as XML, using XML schema files to specify the semantics. An advantage of this approach, they claim, is that file format validity can be verified by the submitter before handing off.
Metadata item identifiers
Each item -- data or documentation -- that is tracked by the DCC is assigned an LSID (Life Sciences Identifier). This is a URL-like string (actually, a URN) that is intended to always link to the item, regardless of changes to web sites.
Here's an example:
urn:lsid:pdb.org:1AFT:1
This is the first version of the 1AFT protein in the Protein Data Bank.
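As a rough illustration of how such a string breaks apart, here is a minimal sketch that checks the urn:lsid: prefix and splits off the authority. The function name split_lsid is made up for this note. As I understand the LSID spec, the full form is urn:lsid:authority:namespace:object with an optional revision; the example above is shorter than that, so this sketch does not try to name the fields after the authority.

```python
def split_lsid(lsid):
    """Split an LSID into (authority, remaining_fields).

    Only the "urn:lsid:" prefix and the authority are interpreted; the
    remaining colon-separated fields are returned as-is, in order.
    """
    parts = lsid.split(":")
    if len(parts) < 4:
        raise ValueError("too few fields for an LSID: %r" % lsid)
    # The "urn" and "lsid" prefix fields are compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("missing urn:lsid: prefix: %r" % lsid)
    if any(field == "" for field in parts):
        raise ValueError("empty field in LSID: %r" % lsid)
    return parts[2], parts[3:]


authority, fields = split_lsid("urn:lsid:pdb.org:1AFT:1")
```

For the example above, the authority comes back as pdb.org and the remaining fields as 1AFT and 1.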
There is supposedly some browser support, at least in development, to translate the URNs. There's an overview website at SourceForge that seems to have mostly broken links (http://lsid.sourceforge.net/), but there are functional links to software: Perl and Java implementations and a Firefox extension (to map URNs to URLs?). IBM also has a long page on LSIDs: http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/ Net gossip is skeptical about how broadly LSIDs are used and how widely supported they are.
Some HapMap metadata types
- Labgroup (Informatics contact, PI, Institution, etc.)
- Data submission (Submitter, Comment...)
- Protocol (type, submitter, short and long descriptions)
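To show what a submitter-side check before hand-off might look like for a Protocol item, here is a minimal sketch. The element names (protocol, type, submitter, short_description, long_description) are assumptions based only on the field list above, not the actual HapMap schema. This is also not XML Schema validation: it only checks that the document is well-formed and that the assumed fields are present and non-empty, since the Python standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET

# Assumed field names, taken from the Protocol bullet above; not the HapMap schema.
REQUIRED_PROTOCOL_FIELDS = ("type", "submitter", "short_description", "long_description")


def missing_protocol_fields(xml_text):
    """Return the assumed required fields that are absent or empty.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for name in REQUIRED_PROTOCOL_FIELDS:
        elem = root.find(name)
        if elem is None or not (elem.text or "").strip():
            missing.append(name)
    return missing


# Made-up submission that leaves out the long description.
EXAMPLE_SUBMISSION = """
<protocol>
  <type>genotyping</type>
  <submitter>example-lab</submitter>
  <short_description>Example protocol</short_description>
</protocol>
"""

problems = missing_protocol_fields(EXAMPLE_SUBMISSION)
```

A real check would validate against the published schema files with an XSD-aware tool; the point here is only that the submitter can run the same check the DCC would.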
Search/Retrieval capabilities
I didn't see anything provided (Daryl?)