GenbankAlignments: Difference between revisions
mNo edit summary |
|||
(9 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
The Genbank Alignment Process | |||
== The Genbank Alignment Process == | |||
This is a description of the behind the scenes parts of the Genbank alignment process. If you want doc on how to add a new species to the list of aligned assemblies you want to go here. | This is a description of the behind the scenes parts of the Genbank alignment process. If you want doc on how to add a new species to the list of aligned assemblies you want to go here. | ||
Line 8: | Line 9: | ||
'''GenBank/RefSeq Update Goals''' | '''GenBank/RefSeq Update Goals''' | ||
Incremental update of mRNAs and ESTs for multiple species and assemblies based on daily updates from NCBI. | * Incremental update of mRNAs and ESTs for multiple species and assemblies based on daily updates from NCBI. | ||
Automatically run from cron, possibly every night. | * Automatically run from cron, possibly every night. | ||
Only require manual intervention on non-recoverable errors | * Only require manual intervention on non-recoverable errors | ||
Incremental across GenBank releases; don't force a full realignment every quarter. | * Incremental across GenBank releases; don't force a full realignment every quarter. | ||
Allow removal of older genbank full releases (and still not force a full realignment). | * Allow removal of older genbank full releases (and still not force a full realignment). | ||
Avoid corruption of disk files and databases. | * Avoid corruption of disk files and databases. | ||
Recover from failures state, automatically when possible, making manual recover easy. | * Recover from failures state, automatically when possible, making manual recover easy. | ||
Allow restarting failed steps without restarting the entire process. | * Allow restarting failed steps without restarting the entire process. | ||
Don't require the process to be run at defined intervals. When a run is done, data files will be updated to reflect the current state of the NCBI repository | * Don't require the process to be run at defined intervals. When a run is done, data files will be updated to reflect the current state of the NCBI repository. | ||
'''GenBank/RefSeq Download Step''' | '''GenBank/RefSeq Download Step''' | ||
The download step retrieves files from the NCBI ftp area and store the results in the download/ directory. | The download step retrieves files from the NCBI ftp area and store the results in the download/ directory. | ||
Algorithm | Algorithm | ||
Run the gbDownloadStep script, which: | Run the gbDownloadStep script, which: | ||
* Check if the GenBank release version has change from last version. | |||
Check if the GenBank release version has change from last version. | * If version changed, create a new version directory and download the full release and non-cumulative daily files. | ||
If version changed, create a new version directory and download the full release and non-cumulative daily files. | * If the version is unchanged, download only the new, non-cumulative daily files. | ||
If the version is unchanged, download only the new, non-cumulative daily files. | * Downloads RefSeq files, similar to GenBank download | ||
Downloads RefSeq files, similar to GenBank download | |||
Note that this process isn't mirroring; it doesn't overwrite existing files. This minimizes the danger of leaving the data in an indeterminate state. Only the required subset of files are downloaded. | Note that this process isn't mirroring; it doesn't overwrite existing files. This minimizes the danger of leaving the data in an indeterminate state. Only the required subset of files are downloaded. | ||
'''Directory structure''' | '''Downloads Directory structure''' | ||
The directory structure for GenBank and RefSeq are a subset of the directories at the NCBI ftp site. Release version numbers are added to the database directory names to allow keeping multiple versions. Now that RefSeq is doing versioned releases, it is handled in the same manner as GenBank on an independent releases cycle. | The directory structure for GenBank and RefSeq are a subset of the directories at the NCBI ftp site. Release version numbers are added to the database directory names to allow keeping multiple versions. Now that RefSeq is doing versioned releases, it is handled in the same manner as GenBank on an independent releases cycle. | ||
$gbRoot/data/download/ - downloaded files from NCBI ftp | $gbRoot/data/download/ - downloaded files from NCBI ftp | ||
genbank/${ver}/ | genbank/${ver}/ | ||
README.genbank - version number is parsed from text | :README.genbank - version number is parsed from text | ||
gbrel.txt - release nodes | :gbrel.txt - release nodes | ||
gb*.seq.gz - sequences, in flat-file format, grouped by division. | :gb*.seq.gz - sequences, in flat-file format, grouped by division. | ||
full.md5 - md5 checksums for files downloaded with the full releases. | :full.md5 - md5 checksums for files downloaded with the full releases. | ||
daily-nc/ | :daily-nc/ | ||
nc0825.flat.gz - flat-file update for a 08/25. | ::nc0825.flat.gz - flat-file update for a 08/25. | ||
nc0825.flat.gz.md5 - md5 checksum for update file | ::nc0825.flat.gz.md5 - md5 checksum for update file | ||
refseq.${ver} | refseq.${ver} | ||
release/complete/ | release/complete/ | ||
complete*.rna.gbff.gz - flat-files containing full RefSeq mRNA . | :complete*.rna.gbff.gz - flat-files containing full RefSeq mRNA . | ||
full.md5 - md5 checksum of all files | :full.md5 - md5 checksum of all files | ||
daily/ | :daily/ | ||
rsnc.0101.2002.gbff.gz - Flat-file update for 01/01/2002. | ::rsnc.0101.2002.gbff.gz - Flat-file update for 01/01/2002. | ||
rsnc.0101.2002.gbff.gz.md5 - md5 checksum | ::rsnc.0101.2002.gbff.gz.md5 - md5 checksum | ||
... | ... | ||
'''GenBank/RefSeq Data Processing Step''' | |||
The data processing step extracts data from the downloaded GenBank files into a format that is ready for import into the database. | |||
Run the gbProcessStep script, which: | |||
:Examines each full and daily download files to determine which files need to be created. For each set of source files, check to see if mrna.md5 and est.*.md5 files exist in the appropriate processed/ directory. | |||
:For each missing *.md5 file, run the gbProcessSeqs script, which: | |||
::Parse flat-files with gbToFaRa into data files that are used to update the browser databases. An index file (*.gbidx) is created to location the each sequence and version. All species remain grouped together; spliting by species at this step would generate a very large number of small files. | |||
::Checksum (md5) the data files. The checksum file serves as indicator that the task completed successfully. | |||
Processed Directory structure | |||
$gbRoot/data/processed/ - data extracted from the NCBI flat-files | |||
genbank.${ver}/ | |||
:full/ | |||
::mrna.ra.gz - meta-data for mRNAs | |||
::mrna.fa - fasta sequence data | |||
::mrna.gbidx - index file | |||
::mrna.md5 - checksums of all mRNA files | |||
::est.aa.ra.gz - files for ESTs accessions starting with AA (case insensitive). | |||
::est.aa.fa, est.aa.gbidx, est.aa.cksum | |||
::est.ab.ra.gz, est.ab.fa, est.ab.gbidx, est.ab.cksum | |||
:daily.${date}/ | |||
::mrna.ra.gz, ... | |||
::est.aa.ra, ... | |||
:refseq.${ver}/ | |||
::full/ | |||
::mrna.ra.gz, ... | |||
:daily.${date}/ | |||
::mrna.ra.gz, | |||
'''Genbank index file''' | |||
A GenBank index file is a tab-seperated file in the format: | |||
acc version moddate organism | |||
The name of the file is either mrna.gbidx or est.*.gbidx and is associated with the a *.ra or *.fa files of the same name. The columns are: | |||
acc - GenBank or RefSeq accession | |||
version - Version number, not including the accession | |||
moddate - Modification date, in 2002-22-08 format | |||
organism - Organism name | |||
'''GenBank/RefSeq Alignment Step''' | |||
This step is done for each browser database (species and assembly) that is being updated. | |||
Algorithm | |||
* Select the most current GenBank (and corresponding RefSeq) full releases from the processed/ directory. | |||
* If there is not a corresponding aligned/.../full directory for the release: | |||
:*If there is a previous aligned release for this database, find the aligned sequences that have not changed since that release and the new, full release. Save these in a a temporary directory. | |||
:*Copy new and changed sequences to temporary fasta files for alignment. | |||
*For each processed/.../daily.${ver}/ directory that does not have a completed alignment/ directory: | |||
:*Copy new and changed sequences to temporary fasta files for alignment. | |||
*If the number of sequences requiring alignment exceeds some configured threshold, send e-mail requesting an alignment on the big cluster and stop the automated process. Normally, this should only be required when a new database is built. | |||
*If the number of sequences is below the threshold, run BLAT on the mini-cluster, using the parasol make facility. | |||
*Process the completed alignments: | |||
:*Combine alignments migrated from the previous releases if pending | |||
:*Building an index file | |||
:*Checksum the files. | |||
:*Do some sanity check on the alignment. Check that changed sequences continue to align, at least in most cases and that the number of aligned sequences increases. | |||
'''Directory structure''' | |||
*$gbRoot/data/aligned/ - aligned files | |||
:*genbank.${ver}/ | |||
::*${db}/ - alignments for this genome database (e.g. hg12). | |||
:::*full/ - Alignments corrisponding to the full release. This is a combination alignments migrated from previous releases and new alignments. | |||
::::*mrna.native.psl.gz, mrna.native.oi.gz, mrna.native.alidx, mrna.native.md5 | |||
::::*est.aa.native.psl.gz, est.aa.native.oi.gz, est.aa.native.alidx, est.aa.native.md5, | |||
::::*mrna.xeno.psl.gz, mrna.xeno.oi.gz, mrna.xeno.alidx, mrna.xeno.md5 | |||
::::*est.aa.xeno.psl.gz, est.aa.xeno.oi.gz, est.aa.xeno.alidx, est.aa.xeno.md5, | |||
:::*daily.${date}/ - Alignments for sequence that were new or modified in the daily update. | |||
::*refseq.${ver}/ - BLAT alignments for RefSeq, same structure as used for GenBank, with only native mRNAs. | |||
'''Index file''' | |||
Two alignment index files are always created for the corrisponding processed.gbidx file, for native and xeno, the if there are no sequences to align. This supports easy checking for the alignment being completed. The file is a tab-seperated in the format: | |||
acc version numaligns | |||
The name of the file is either mrna.alidx or est.*.alidx and is associated with the a *.psl file of the same name. The columns are: | |||
*acc - GenBank or RefSeq accession | |||
*version - Version number, not including the accession | |||
*numAligns - Count of the number of alignments for this accession. | |||
'''GenBank/RefSeq Data Database Update Step''' | |||
This step is done for each browser database (species and assembly) on each database server. When building a new database, this process is run on the master database (genbank), and then the database pushed to hgwdev. | |||
'''Genbank Status Table''' | |||
This table (gbStatus) is used to keep track of the current version of the Genbank data this is loaded in the database. While some of this information is redundant to the mrna table, this data is only used by by the update. This table is updated last, so that it can be used to record form failures during update. | |||
The columns of the tables are: | |||
*acc - GenBank accession | |||
*version - GenBank version number. | |||
*modDate - last modified date. | |||
*type - the type of the entry: EST or mRNA | |||
*srcDb - source database: GenBank or RefSeq | |||
*gbSeq - id in gbSeq table. | |||
*numAligns - number of alignments of the accession in the approriate track | |||
*seqRelease - release version where the sequence was obtained | |||
*seqUpdate - update where sequence was obtained (date or full | |||
*metaRelease - release version where the metadata was obtained | |||
*metaUpdate - update where the metadata was obtained (date or full) | |||
*extRelease - release version containing the external file | |||
*extUpdate - update containing the external file (date or full) | |||
*time - time that this entry was inserted | |||
'''Genbank Loaded Table''' | |||
This table contains the releases, updates and partitions that have been loaded. For a given partition, the updates containing gbidx or alidx files are compared to the gbLoaded table. If all of the updates are loaded, there is no need to do any more checking. This saves loading the alidx files and querying the gbStatus and gbSeq tables. | |||
The columns of the tables are: | |||
*srcDb - source database: GenBank or RefSeq | |||
*type - the type of the entry: EST or mRNA | |||
*loadRelease - release version | |||
*loadUpdate - update date or full | |||
*accPrefix - accession prefix for ESTs | |||
*time - time entry was added | |||
'''Algorithm''' | |||
Table updates are done the the following steps. This is designed to allow restarting the update process on a crash from any point. It prevents display of stale data if the update process aborts. There is a window where a sequence that has change will not be in the database. | |||
If the load process crashes, new sequences may be in the other tables, but not in the gbStatus table. To detect this, we check the new sequences against the gbSeq table. These are orphaned sequences that must first be removed before loading. new seqChanged metaChanged deleted orphaned | |||
To minimize the memory require for the update, one partition of the date is loaded at a time. Partitions the RefSeq mRNAs, the GenBank mRNAs, and the GenBank ESTs split on the first two letters of the accession. | |||
*For each partition of the GenBank and RefSeq data: | |||
:*Determine if this partition could have any data to load by comparing the processed gbidx files with the gbLoaded, skipping the partition if there is no missing updates. | |||
:*Delete any accession in etc/ignore.idx from all relevant tables. Doing this upfront prevents a lot of complexity in other code. | |||
:*Compare accession versions stored in gbStatus table to contents of the processed/ and aligned/ directories (gbIndex). classifying each accessions as: | |||
::*new - Accession is not in gbStatus. | |||
::*seqChanged - The sequence and metadata changed. | |||
::*metaChanged - The metadata changed. | |||
::*extChanged - The release containing the external sequence files has changed and the entry has not changed. This is used to migrate fasta file references to the latest release, to allow cleanup of older releases. | |||
::*deleted - The accession is not in the gbIndex. | |||
*Check gbSeq table to see if contains any of the new entries, which become orphaned. These are sequences from a failed load. | |||
*Remove seqChanged, deleted, and orphaned from alignment and orientation information tables. | |||
*Remove deleted and orphaned accessions from the mrna table. | |||
*If this is RefSeq, remove deleted and orphaned accessions from the refSeqStatus and refLink tables. | |||
*Remove deleted and orphanedaccessions from the gbSeq table. | |||
*Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table. This must be the first table update so that orphans can be detected. | |||
*Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table. | |||
*If this is RefSeq, update rows for seqChanged and metaChanged accessions in the refSeqStatus and refLink tables and add new accessions to these tables. | |||
*Add new strings to the unique string tables (author, library, etc), No attempt is made to remove entries that will no longer be referenced. | |||
*Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table. | |||
*Update rows for seqChanged and metaChanged accessions in the mrna table. Add new accessions to the mrna table. | |||
*Add new and seqChanged rows to the alignment and orientation information tables. | |||
*Update sequence fastas references the gbSeq and gbExtFile tables for all relChanged entries. | |||
*Update rows for seqChanged and metaChanged accessions in the gbStatus table. Add new accessions to the table. | |||
'''GenBank/RefSeq Error Handling''' | '''GenBank/RefSeq Error Handling''' | ||
The following approaches are used to make this process as robust as possible: | The following approaches are used to make this process as robust as possible: | ||
The output of the all scripts is logged; error notification is done by e-mail. | *The output of the all scripts is logged; error notification is done by e-mail. | ||
Semaphore files are created when a task script is running, which prevents other tasks from being accidental run at the same time. If a task fails, a file is created indicate that this occurred. Tasks will not run as long as a failed semaphore file is in place. The condition must be manually correct and the semaphores removed. | *Semaphore files are created when a task script is running, which prevents other tasks from being accidental run at the same time. If a task fails, a file is created indicate that this occurred. Tasks will not run as long as a failed semaphore file is in place. The condition must be manually correct and the semaphores removed. | ||
There is a defined data flow between each step. A step can be restart from the beginning to synchronize it with results of the previous step without corrupting data. | *There is a defined data flow between each step. A step can be restart from the beginning to synchronize it with results of the previous step without corrupting data. | ||
Files are written in an atomic manner where needed, first writing the file in the same directory with a temporary name, then renaming it. This is done any time the existance of a file indicates a step is complete. | *Files are written in an atomic manner where needed, first writing the file in the same directory with a temporary name, then renaming it. This is done any time the existance of a file indicates a step is complete. | ||
Data files are not modified after successful creation. | *Data files are not modified after successful creation. | ||
Various verifications and sanity checks are used. | *Various verifications and sanity checks are used. | ||
Ability to explictly exclude incorrect genbank entries (data/ignore.idx files). | *Ability to explictly exclude incorrect genbank entries (data/ignore.idx files). | ||
'''GenBank/RefSeq Annoying Issues''' | '''GenBank/RefSeq Annoying Issues''' | ||
The entire GenBank directory is replace when a new version is release. Daily releases are relative to this. | |||
GenBank daily release don't indicate deleted entries. | *The entire GenBank directory is replace when a new version is release. Daily releases are relative to this. | ||
GenBank daily filenames don't include a year, so daily files between the beginning of the year and next release (probably Jan 15th) will not sort in a simple manner. | *GenBank daily release don't indicate deleted entries. | ||
RefSeq updates it cumulative files daily as well as having separate daily files. There is no concept of a release. | *GenBank daily filenames don't include a year, so daily files between the beginning of the year and next release (probably Jan 15th) will not sort in a simple manner. | ||
RefSeq deleted entries are still in the older daily releases. There no daily records indicating when an entry has been deleted. | *RefSeq updates it cumulative files daily as well as having separate daily files. There is no concept of a release. | ||
Ocassionally, there are incorrect genbank entries that break assumptions in this code. These are skipped by placing them an data/ignore.idx acc. | *RefSeq deleted entries are still in the older daily releases. There no daily records indicating when an entry has been deleted. | ||
MySql ISAM tables don't support foreign keys. Using auto_increment for id columns was a problems because mysqlimport would reset the numbers (or at least not insert zero). | *Ocassionally, there are incorrect genbank entries that break assumptions in this code. These are skipped by placing them an data/ignore.idx acc. | ||
Want to use disk files rather than a database to track genbank repository files. This is faster when we need to look at all entries and makes setup and loading multiple database servers easier. It was also easier to implement. However this proved to be a problem for ESTs, which require large amount of memory to handle. To reduce the memory required, ESTs are partitioned by the first two letters of the accession. | *MySql ISAM tables don't support foreign keys. Using auto_increment for id columns was a problems because mysqlimport would reset the numbers (or at least not insert zero). | ||
Don't handle realigning sequences (say to take advantage of changes to the aligner). | *Want to use disk files rather than a database to track genbank repository files. This is faster when we need to look at all entries and makes setup and loading multiple database servers easier. It was also easier to implement. However this proved to be a problem for ESTs, which require large amount of memory to handle. To reduce the memory required, ESTs are partitioned by the first two letters of the accession. | ||
*Don't handle realigning sequences (say to take advantage of changes to the aligner). | |||
'''Overview of directories''' | '''Overview of directories''' | ||
$gbRoot/ - root directory | $gbRoot/ - root directory | ||
etc/ - configuration files and scripts | :etc/ - configuration files and scripts | ||
ignore.idx - ignore index file. | ::ignore.idx - ignore index file. | ||
genban.conf - configuration file. | ::genban.conf - configuration file. | ||
data/ - data files | :data/ - data files | ||
download/ - downloaded files from NCBI ftp | :download/ - downloaded files from NCBI ftp | ||
genbank.${ver}/ | ::genbank.${ver}/ | ||
genbank.${ver}/daily-nc/ | ::genbank.${ver}/daily-nc/ | ||
refseq.${ver}/cummulative/ | ::refseq.${ver}/cummulative/ | ||
refseq.${ver}/daily/ | ::refseq.${ver}/daily/ | ||
processed/ - data extracted from the NCBI flat-files | :processed/ - data extracted from the NCBI flat-files | ||
genbank.${ver}/ | ::genbank.${ver}/ | ||
full/ | :::full/ | ||
daily.${date}/ | :::daily.${date}/ | ||
refseq.${ver}/ | ::refseq.${ver}/ | ||
full/ | :::full/ | ||
daily.${date}/ | :::daily.${date}/ | ||
aligned/ - aligned sequences | :aligned/ - aligned sequences | ||
${db}/ | ::${db}/ | ||
var/build/ - files associated with download and build steps. Only on build server | :var/build/ - files associated with download and build steps. Only on build server | ||
run/ - semaphore files | ::run/ - semaphore files | ||
logs/ - log files | ::logs/ - log files | ||
build.time - File contain the time that the last download and alignment steps completed, in seconds since 00:00:00 1970-01-01 UTC. This is used by process running on other systems to poll for completion. | ::build.time - File contain the time that the last download and alignment steps completed, in seconds since 00:00:00 1970-01-01 UTC. This is used by process running on other systems to poll for completion. | ||
var/copy/ - files associated with copying to the gbdb server. | :var/copy/ - files associated with copying to the gbdb server. | ||
run/ - semaphore files | ::run/ - semaphore files | ||
logs/ - log files | ::logs/ - log files | ||
build.time - copy of build/build.time from the last completed copy. | ::build.time - copy of build/build.time from the last completed copy. | ||
copy.time - file containing time last copy completed. | ::copy.time - file containing time last copy completed. | ||
var/dbload/$host/ - files associated with that last database load on database server $host. | :var/dbload/$host/ - files associated with that last database load on database server $host. | ||
run/ - semaphore files | ::run/ - semaphore files | ||
logs/ - log files | ::logs/ - log files | ||
copy.time - copy of copy/copy.time from the last completed copy. | ::copy.time - copy of copy/copy.time from the last completed copy. | ||
load.time - file containing time last load completed completed. | ::load.time - file containing time last load completed completed. | ||
'''Realigning Tracks''' | '''Realigning Tracks''' | ||
It maybe necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straight forward when a genome databases is initially being built. It's more complex if one has to sync up multiple systems. | It maybe necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straight forward when a genome databases is initially being built. It's more complex if one has to sync up multiple systems. | ||
*To triger a realignment, on needs to remove the related files for some partation of the data for all updates. These live under either the genbank or refseq alignment directories, for example: | |||
:*data/aligned/genbank.139.0/hg16/ | |||
To triger a realignment, on needs to remove the related files for some partation of the data for all updates. These live under either the genbank or refseq alignment directories, for example: | :*data/aligned/refseq.139.0/hg16/ | ||
data/aligned/genbank.139.0/hg16/ | *To realign native RefSeq mRNAs for hg16, one would remove: | ||
data/aligned/refseq.139.0/hg16/ | :*data/aligned/refseq.139.0/hg16/*/mrna.native.* | ||
To realign native RefSeq mRNAs for hg16, one would remove: | *To realign xeno GeneBank ESTs for hg16, one would remove: | ||
data/aligned/refseq.139.0/hg16/*/mrna.native.* | :*data/aligned/refseq.139.0/hg16/*/est.*.xeno.* | ||
To realign xeno GeneBank ESTs for hg16, one would remove: | *Reload the database with the partation of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported, use -drop and -initialLoad instead. | ||
data/aligned/refseq.139.0/hg16/*/est.*.xeno.* | :*nice bin/gbDbLoadStep -reload -srcDb=genbank -type=mrna $db | ||
Reload the database with the partation of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported, use -drop and -initialLoad instead. | |||
nice bin/gbDbLoadStep -reload -srcDb=genbank -type=mrna $db |
Latest revision as of 18:29, 16 January 2019
The Genbank Alignment Process
This is a description of the behind the scenes parts of the Genbank alignment process. If you want doc on how to add a new species to the list of aligned assemblies you want to go here.
Overview
The genbank alignment process aligns RNA and EST sequences from NCBI, as well as the RefSeq mRNA's to (almost) all the assemblies that UCSC supports. The process is divided into roughly five parts: download, process, align, database load, and dissemination. The first four parts and the beginning of the fifth happen on the genbank-101 machine, the dissemination part includes hgwdev, hgwbeta, and then to our official mirrors (RR, euro, japan).
GenBank/RefSeq Update Goals
- Incremental update of mRNAs and ESTs for multiple species and assemblies based on daily updates from NCBI.
- Automatically run from cron, possibly every night.
- Only require manual intervention on non-recoverable errors
- Incremental across GenBank releases; don't force a full realignment every quarter.
- Allow removal of older genbank full releases (and still not force a full realignment).
- Avoid corruption of disk files and databases.
- Recover from failures state, automatically when possible, making manual recover easy.
- Allow restarting failed steps without restarting the entire process.
- Don't require the process to be run at defined intervals. When a run is done, data files will be updated to reflect the current state of the NCBI repository.
GenBank/RefSeq Download Step
The download step retrieves files from the NCBI ftp area and store the results in the download/ directory. Algorithm Run the gbDownloadStep script, which:
- Check if the GenBank release version has change from last version.
- If version changed, create a new version directory and download the full release and non-cumulative daily files.
- If the version is unchanged, download only the new, non-cumulative daily files.
- Downloads RefSeq files, similar to GenBank download
Note that this process isn't mirroring; it doesn't overwrite existing files. This minimizes the danger of leaving the data in an indeterminate state. Only the required subset of files are downloaded.
Downloads Directory structure
The directory structure for GenBank and RefSeq are a subset of the directories at the NCBI ftp site. Release version numbers are added to the database directory names to allow keeping multiple versions. Now that RefSeq is doing versioned releases, it is handled in the same manner as GenBank on an independent releases cycle.
$gbRoot/data/download/ - downloaded files from NCBI ftp genbank/${ver}/
- README.genbank - version number is parsed from text
- gbrel.txt - release nodes
- gb*.seq.gz - sequences, in flat-file format, grouped by division.
- full.md5 - md5 checksums for files downloaded with the full releases.
- daily-nc/
- nc0825.flat.gz - flat-file update for a 08/25.
- nc0825.flat.gz.md5 - md5 checksum for update file
refseq.${ver}
release/complete/
- complete*.rna.gbff.gz - flat-files containing full RefSeq mRNA .
- full.md5 - md5 checksum of all files
- daily/
- rsnc.0101.2002.gbff.gz - Flat-file update for 01/01/2002.
- rsnc.0101.2002.gbff.gz.md5 - md5 checksum
...
GenBank/RefSeq Data Processing Step
The data processing step extracts data from the downloaded GenBank files into a format that is ready for import into the database.
Run the gbProcessStep script, which:
- Examines each full and daily download files to determine which files need to be created. For each set of source files, check to see if mrna.md5 and est.*.md5 files exist in the appropriate processed/ directory.
- For each missing *.md5 file, run the gbProcessSeqs script, which:
- Parse flat-files with gbToFaRa into data files that are used to update the browser databases. An index file (*.gbidx) is created to location the each sequence and version. All species remain grouped together; spliting by species at this step would generate a very large number of small files.
- Checksum (md5) the data files. The checksum file serves as indicator that the task completed successfully.
Processed Directory structure
$gbRoot/data/processed/ - data extracted from the NCBI flat-files genbank.${ver}/
- full/
- mrna.ra.gz - meta-data for mRNAs
- mrna.fa - fasta sequence data
- mrna.gbidx - index file
- mrna.md5 - checksums of all mRNA files
- est.aa.ra.gz - files for ESTs accessions starting with AA (case insensitive).
- est.aa.fa, est.aa.gbidx, est.aa.cksum
- est.ab.ra.gz, est.ab.fa, est.ab.gbidx, est.ab.cksum
- daily.${date}/
- mrna.ra.gz, ...
- est.aa.ra, ...
- refseq.${ver}/
- full/
- mrna.ra.gz, ...
- daily.${date}/
- mrna.ra.gz,
Genbank index file A GenBank index file is a tab-seperated file in the format:
acc version moddate organism
The name of the file is either mrna.gbidx or est.*.gbidx and is associated with the a *.ra or *.fa files of the same name. The columns are: acc - GenBank or RefSeq accession version - Version number, not including the accession moddate - Modification date, in 2002-22-08 format organism - Organism name
GenBank/RefSeq Alignment Step
This step is done for each browser database (species and assembly) that is being updated. Algorithm
- Select the most current GenBank (and corresponding RefSeq) full releases from the processed/ directory.
- If there is not a corresponding aligned/.../full directory for the release:
- If there is a previous aligned release for this database, find the aligned sequences that have not changed since that release and the new, full release. Save these in a a temporary directory.
- Copy new and changed sequences to temporary fasta files for alignment.
- For each processed/.../daily.${ver}/ directory that does not have a completed alignment/ directory:
- Copy new and changed sequences to temporary fasta files for alignment.
- If the number of sequences requiring alignment exceeds some configured threshold, send e-mail requesting an alignment on the big cluster and stop the automated process. Normally, this should only be required when a new database is built.
- If the number of sequences is below the threshold, run BLAT on the mini-cluster, using the parasol make facility.
- Process the completed alignments:
- Combine alignments migrated from the previous releases if pending
- Building an index file
- Checksum the files.
- Do some sanity check on the alignment. Check that changed sequences continue to align, at least in most cases and that the number of aligned sequences increases.
Directory structure
- $gbRoot/data/aligned/ - aligned files
- genbank.${ver}/
- ${db}/ - alignments for this genome database (e.g. hg12).
- full/ - Alignments corrisponding to the full release. This is a combination alignments migrated from previous releases and new alignments.
- mrna.native.psl.gz, mrna.native.oi.gz, mrna.native.alidx, mrna.native.md5
- est.aa.native.psl.gz, est.aa.native.oi.gz, est.aa.native.alidx, est.aa.native.md5,
- mrna.xeno.psl.gz, mrna.xeno.oi.gz, mrna.xeno.alidx, mrna.xeno.md5
- est.aa.xeno.psl.gz, est.aa.xeno.oi.gz, est.aa.xeno.alidx, est.aa.xeno.md5,
- daily.${date}/ - Alignments for sequence that were new or modified in the daily update.
- refseq.${ver}/ - BLAT alignments for RefSeq, same structure as used for GenBank, with only native mRNAs.
Index file Two alignment index files are always created for the corrisponding processed.gbidx file, for native and xeno, the if there are no sequences to align. This supports easy checking for the alignment being completed. The file is a tab-seperated in the format:
acc version numaligns
The name of the file is either mrna.alidx or est.*.alidx and is associated with the a *.psl file of the same name. The columns are:
- acc - GenBank or RefSeq accession
- version - Version number, not including the accession
- numAligns - Count of the number of alignments for this accession.
GenBank/RefSeq Data Database Update Step
This step is done for each browser database (species and assembly) on each database server. When building a new database, this process is run on the master database (genbank), and then the database pushed to hgwdev.
Genbank Status Table
This table (gbStatus) is used to keep track of the current version of the Genbank data this is loaded in the database. While some of this information is redundant to the mrna table, this data is only used by by the update. This table is updated last, so that it can be used to record form failures during update.
The columns of the tables are:
- acc - GenBank accession
- version - GenBank version number.
- modDate - last modified date.
- type - the type of the entry: EST or mRNA
- srcDb - source database: GenBank or RefSeq
- gbSeq - id in gbSeq table.
- numAligns - number of alignments of the accession in the approriate track
- seqRelease - release version where the sequence was obtained
- seqUpdate - update where sequence was obtained (date or full
- metaRelease - release version where the metadata was obtained
- metaUpdate - update where the metadata was obtained (date or full)
- extRelease - release version containing the external file
- extUpdate - update containing the external file (date or full)
- time - time that this entry was inserted
Genbank Loaded Table
This table contains the releases, updates and partitions that have been loaded. For a given partition, the updates containing gbidx or alidx files are compared to the gbLoaded table. If all of the updates are loaded, there is no need to do any more checking. This saves loading the alidx files and querying the gbStatus and gbSeq tables.
The columns of the tables are:
- srcDb - source database: GenBank or RefSeq
- type - the type of the entry: EST or mRNA
- loadRelease - release version
- loadUpdate - update date or full
- accPrefix - accession prefix for ESTs
- time - time entry was added
Algorithm
Table updates are done the the following steps. This is designed to allow restarting the update process on a crash from any point. It prevents display of stale data if the update process aborts. There is a window where a sequence that has change will not be in the database. If the load process crashes, new sequences may be in the other tables, but not in the gbStatus table. To detect this, we check the new sequences against the gbSeq table. These are orphaned sequences that must first be removed before loading. new seqChanged metaChanged deleted orphaned
To minimize the memory require for the update, one partition of the date is loaded at a time. Partitions the RefSeq mRNAs, the GenBank mRNAs, and the GenBank ESTs split on the first two letters of the accession.
- For each partition of the GenBank and RefSeq data:
- Determine if this partition could have any data to load by comparing the processed gbidx files with the gbLoaded, skipping the partition if there is no missing updates.
- Delete any accession in etc/ignore.idx from all relevant tables. Doing this upfront prevents a lot of complexity in other code.
- Compare accession versions stored in gbStatus table to contents of the processed/ and aligned/ directories (gbIndex). classifying each accessions as:
- new - Accession is not in gbStatus.
- seqChanged - The sequence and metadata changed.
- metaChanged - The metadata changed.
- extChanged - The release containing the external sequence files has changed and the entry has not changed. This is used to migrate fasta file references to the latest release, to allow cleanup of older releases.
- deleted - The accession is not in the gbIndex.
- Check gbSeq table to see if contains any of the new entries, which become orphaned. These are sequences from a failed load.
- Remove seqChanged, deleted, and orphaned from alignment and orientation information tables.
- Remove deleted and orphaned accessions from the mrna table.
- If this is RefSeq, remove deleted and orphaned accessions from the refSeqStatus and refLink tables.
- Remove deleted and orphanedaccessions from the gbSeq table.
- Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table. This must be the first table update so that orphans can be detected.
- Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table.
- If this is RefSeq, update rows for seqChanged and metaChanged accessions in the refSeqStatus and refLink tables and add new accessions to these tables.
- Add new strings to the unique string tables (author, library, etc), No attempt is made to remove entries that will no longer be referenced.
- Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table.
- Update rows for seqChanged and metaChanged accessions in the mrna table. Add new accessions to the mrna table.
- Add new and seqChanged rows to the alignment and orientation information tables.
- Update sequence fastas references the gbSeq and gbExtFile tables for all relChanged entries.
- Update rows for seqChanged and metaChanged accessions in the gbStatus table. Add new accessions to the table.
GenBank/RefSeq Error Handling
The following approaches are used to make this process as robust as possible:
- The output of the all scripts is logged; error notification is done by e-mail.
- Semaphore files are created when a task script is running, which prevents other tasks from being accidental run at the same time. If a task fails, a file is created indicate that this occurred. Tasks will not run as long as a failed semaphore file is in place. The condition must be manually correct and the semaphores removed.
- There is a defined data flow between each step. A step can be restart from the beginning to synchronize it with results of the previous step without corrupting data.
- Files are written in an atomic manner where needed, first writing the file in the same directory with a temporary name, then renaming it. This is done any time the existance of a file indicates a step is complete.
- Data files are not modified after successful creation.
- Various verifications and sanity checks are used.
- Ability to explictly exclude incorrect genbank entries (data/ignore.idx files).
GenBank/RefSeq Annoying Issues
- The entire GenBank directory is replace when a new version is release. Daily releases are relative to this.
- GenBank daily release don't indicate deleted entries.
- GenBank daily filenames don't include a year, so daily files between the beginning of the year and next release (probably Jan 15th) will not sort in a simple manner.
- RefSeq updates it cumulative files daily as well as having separate daily files. There is no concept of a release.
- RefSeq deleted entries are still in the older daily releases. There no daily records indicating when an entry has been deleted.
- Ocassionally, there are incorrect genbank entries that break assumptions in this code. These are skipped by placing them an data/ignore.idx acc.
- MySql ISAM tables don't support foreign keys. Using auto_increment for id columns was a problems because mysqlimport would reset the numbers (or at least not insert zero).
- Want to use disk files rather than a database to track genbank repository files. This is faster when we need to look at all entries and makes setup and loading multiple database servers easier. It was also easier to implement. However this proved to be a problem for ESTs, which require large amount of memory to handle. To reduce the memory required, ESTs are partitioned by the first two letters of the accession.
- Don't handle realigning sequences (say to take advantage of changes to the aligner).
Overview of directories
$gbRoot/ - root directory
- etc/ - configuration files and scripts
- ignore.idx - ignore index file.
- genban.conf - configuration file.
- data/ - data files
- download/ - downloaded files from NCBI ftp
- genbank.${ver}/
- genbank.${ver}/daily-nc/
- refseq.${ver}/cummulative/
- refseq.${ver}/daily/
- processed/ - data extracted from the NCBI flat-files
- genbank.${ver}/
- full/
- daily.${date}/
- refseq.${ver}/
- full/
- daily.${date}/
- genbank.${ver}/
- aligned/ - aligned sequences
- ${db}/
- var/build/ - files associated with download and build steps. Only on build server
- run/ - semaphore files
- logs/ - log files
- build.time - File contain the time that the last download and alignment steps completed, in seconds since 00:00:00 1970-01-01 UTC. This is used by process running on other systems to poll for completion.
- var/copy/ - files associated with copying to the gbdb server.
- run/ - semaphore files
- logs/ - log files
- build.time - copy of build/build.time from the last completed copy.
- copy.time - file containing time last copy completed.
- var/dbload/$host/ - files associated with that last database load on database server $host.
- run/ - semaphore files
- logs/ - log files
- copy.time - copy of copy/copy.time from the last completed copy.
- load.time - file containing time last load completed completed.
Realigning Tracks
It maybe necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straight forward when a genome databases is initially being built. It's more complex if one has to sync up multiple systems.
- To triger a realignment, on needs to remove the related files for some partation of the data for all updates. These live under either the genbank or refseq alignment directories, for example:
- data/aligned/genbank.139.0/hg16/
- data/aligned/refseq.139.0/hg16/
- To realign native RefSeq mRNAs for hg16, one would remove:
- data/aligned/refseq.139.0/hg16/*/mrna.native.*
- To realign xeno GeneBank ESTs for hg16, one would remove:
- data/aligned/refseq.139.0/hg16/*/est.*.xeno.*
- Reload the database with the partation of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported, use -drop and -initialLoad instead.
- nice bin/gbDbLoadStep -reload -srcDb=genbank -type=mrna $db