DCC pipeline discussion: Difference between revisions
Revision as of 01:47, 5 November 2007
In order to define the functions of the automated pipeline, it is useful to look at the existing manual process of creating ENCODE tracks, as well as what new features will be needed for the ENCODE production phase. At the end of this page is a proposed submission process/pipeline for the DCC that was developed from discussion of the first two sections.
Existing process (ENCODE pilot phase)
1. Data submitter creates a submission package, consisting of data files and documentation files. The data files are in these formats: BED, wiggle, GTF. Submissions vary considerably:
- Usually the files are tarred and gzipped, with arbitrary directory structure.
- Files are named to indicate the experiment.
- Sometimes there is a README.
- Sometimes there are custom track headers in the files; sometimes there are multiple custom tracks in a file.
- Documentation is sometimes HTML, sometimes MS Word. Usually it follows the description page template we provide, but the descriptions are often incomplete.
- Sometimes submitters provide URLs for cell lines and antibodies, or URLs for references (though not in our standard format).
- Often there are multiple tracks for the same experiment (e.g. Signal and Sites).
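Because package layout is arbitrary, a first automated step could simply catalog what a submission contains. The sketch below is a hypothetical helper (not part of any existing DCC tooling) that sorts the members of a gzipped tarball into data files, documentation, and READMEs based on the formats named above.

```python
import tarfile
from pathlib import Path

# Data formats accepted in the pilot phase (from the process description above)
KNOWN_FORMATS = {".bed", ".wig", ".gtf"}
DOC_FORMATS = {".html", ".doc"}

def catalog_submission(tar_path):
    """Sort the files in a submission tarball into data, docs, README, and other."""
    catalog = {"data": [], "docs": [], "readme": [], "other": []}
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip the arbitrary directory structure itself
            name = Path(member.name)
            suffix = name.suffix.lower()
            if suffix in KNOWN_FORMATS:
                catalog["data"].append(member.name)
            elif suffix in DOC_FORMATS:
                catalog["docs"].append(member.name)
            elif name.name.upper().startswith("README"):
                catalog["readme"].append(member.name)
            else:
                catalog["other"].append(member.name)
    return catalog
```

A catalog like this would let the pipeline flag submissions with no documentation or with unrecognized file types before any human looks at them.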
2. Data submitter posts the submission to our FTP site, or posts in their web space, then emails UCSC to notify of the submission.
3. Kate responds to the email, creates a named & dated build dir, transfers the submission to the build dir, and creates an entry in the 'make doc'. She updates the ENCODE portal Data Status page and notifies the submitter that we have received the submission.
4. Kate requests an engineer assignment from Donna and updates the project list (or dev pushQ) to include the new dataset.
5. Kate/developer decide on track group, track types, and track structure -- Should it be wiggle or bedGraph if it's float-valued? Should it have special track display, details, or filtering? Should this be a new track or a new subtrack of an existing track? If it's a new track, should it be a composite? Should it be part of an existing super-track, or should a new super-track be created? Based on these choices and the metadata, labels and table names are chosen.
6. Developer processes files in preparation for loading:
- remove track lines
- split multi-track files
- truncate precision
- trim overlaps
- fix off-by-one
- coordinate conversion
- assign unique item names
- scale scores
- sanity check data distribution with histogram
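The file-cleanup steps in step 6 can be sketched as a single pass over a BED-like file. This is a minimal illustration, not the DCC's actual code; the `one_based` and `precision` parameters are assumptions standing in for the "fix off-by-one" and "truncate precision" steps.

```python
def preprocess_bed_lines(lines, one_based=False, precision=3):
    """Prepare submitted BED-like lines for loading:
    drop custom track/browser headers, optionally shift 1-based starts
    to 0-based, and truncate the precision of float scores."""
    out = []
    for line in lines:
        if line.startswith(("track", "browser", "#")):
            continue  # remove track lines (step 6, "remove track lines")
        fields = line.rstrip("\n").split("\t")
        if one_based:
            # fix off-by-one: convert a 1-based start to 0-based
            fields[1] = str(int(fields[1]) - 1)
        if len(fields) > 4:
            try:
                # truncate precision of the score column
                fields[4] = f"{float(fields[4]):.{precision}f}"
            except ValueError:
                pass  # leave non-numeric scores untouched
        out.append("\t".join(fields))
    return out
```

Splitting multi-track files, trimming overlaps, and coordinate conversion between assemblies would be separate passes, since they need more context than a single line provides.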
7. Developer loads data, including wig encoding and symlinking of wib files.
8. Developer creates track configuration (trackDb):
- ordering
- labels: include submitter, experiment type, distinguishing metadata
  e.g. <submitter> <type> <antibody>: Yale ChIP Pol2
- colors: selected to distinguish and draw attention to similar experiments
- wiggle view limits determined by histogram
- data version: MON YYYY
- original assembly: how data was originally submitted
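Most of the trackDb settings listed in step 8 could be derived mechanically from submission metadata. The sketch below is a hypothetical generator, assuming a metadata dict with `submitter`, `type`, `antibody`, `format`, `color`, and `dataVersion` keys; the exact set of trackDb fields a given ENCODE track needs would still be decided per track.

```python
def trackdb_stanza(meta):
    """Render a trackDb-style stanza from submission metadata.
    Labels follow the <submitter> <type> <antibody> convention above."""
    label = f"{meta['submitter']} {meta['type']} {meta.get('antibody', '')}".strip()
    lines = [
        f"track {meta['track']}",
        f"shortLabel {label}",
        f"longLabel {label} ({meta['dataVersion']})",
        f"type {meta['format']}",
        f"color {meta['color']}",
        f"dataVersion {meta['dataVersion']}",
    ]
    return "\n".join(lines) + "\n"
```

Settings that need human judgment, such as colors chosen to distinguish similar experiments or wiggle view limits read off a histogram, would be filled in afterward by the developer.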
9. Developer edits & installs track description, or updates existing track description to include new subtrack info. Creates or updates super-track description if needed. Optionally passes on for scientific review (e.g. Ting, Rachel, Jim) or technical writing review (Donna).
10. Developer installs on genome-test, and requests review from submitter.
11. Developer posts downloads for any wiggle files.
12. Developer creates pushQ entry, notifies Kate that track is ready.
13. Kate updates internal Project List, external Data Status page, and reviews track.
14. Q/A reviews track and releases. Automation updates the ENCODE Release Log (if track name begins with 'ENCODE').
15. Kate updates the Data Status page.
16. Periodically (ideally quarterly, or when something significant happens), Kate posts a News item on the ENCODE portal, summarizing tracks released or other events. She also emails the ENCODE Consortium mailing list.
17. Periodically (e.g. for NHGRI Progress Reports), Kate collects overall stats on tracks released and generates a report.
New Features for ENCODE Production phase
- Web-based submission process
- Standardized submission package with formal metadata (controlled vocabulary or URL)
- Track structure defined by submission type and metadata
- Track configuration generated automatically from metadata
- Manual tweaking of generated track configuration (by developer or Q/A?)
- Manual editing of track description (by developer, Q/A, scientific lead, tech writer?)
- Interactive query of submissions and status
- Automated notification to submitter if submission has problems
- Automated request for review by submitter
- Submitter acceptance triggers automated creation of pushQ entry.
- Automated notification to submitter that track has been released
- Regular, automated reporting of submissions and status -- quarterly summary to Consortium members, detailed report to NHGRI
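Several of the proposed features amount to tracking each submission through a fixed sequence of states, with notifications fired on transitions. A minimal sketch of that idea, with assumed state names and an assumed `notify` list (none of this reflects an existing implementation):

```python
from enum import Enum

class Status(Enum):
    RECEIVED = "received"
    VALIDATED = "validated"
    LOADED = "loaded"
    REVIEWED = "reviewed"
    RELEASED = "released"

# Linear pipeline: each state has exactly one successor
TRANSITIONS = {
    Status.RECEIVED: Status.VALIDATED,
    Status.VALIDATED: Status.LOADED,
    Status.LOADED: Status.REVIEWED,
    Status.REVIEWED: Status.RELEASED,
}

def advance(submission):
    """Move a submission to its next state, queuing submitter notifications
    at the points the feature list calls for (validation result, release)."""
    nxt = TRANSITIONS.get(submission["status"])
    if nxt is None:
        raise ValueError("submission already released")
    submission["status"] = nxt
    if nxt in (Status.VALIDATED, Status.RELEASED):
        submission.setdefault("notify", []).append(submission["submitter"])
    return submission
```

The interactive status queries and the quarterly/NHGRI reports would then be simple summaries over the stored state of all submissions.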