GenbankAlignments
The Genbank Alignment Process
This page describes the behind-the-scenes parts of the GenBank alignment process. If you want documentation on how to add a new species to the list of aligned assemblies, go here.
Overview
The GenBank alignment process aligns RNA and EST sequences from NCBI, as well as the RefSeq mRNAs, to (almost) all the assemblies that UCSC supports. The process is divided into roughly five parts: download, process, align, database load, and dissemination. The first four parts and the beginning of the fifth happen on the genbank-101 machine; dissemination then proceeds to hgwdev, hgwbeta, and finally to our official mirrors (RR, euro, japan).
GenBank/RefSeq Update Goals
* Incremental update of mRNAs and ESTs for multiple species and assemblies based on daily updates from NCBI.
* Automatically run from cron, possibly every night.
* Only require manual intervention on non-recoverable errors or when a large cluster run is required to do a large alignment.
* Incremental across GenBank releases; don't force a full realignment every quarter.
* Allow removal of older GenBank full releases (and still not force a full realignment).
* Avoid corruption of disk files and databases.
* Recover from failure states automatically when possible, and make manual recovery easy.
* Allow restarting failed steps without restarting the entire process.
* Don't require the process to be run at defined intervals. When a run is done, data files will be updated to reflect the current state of the NCBI repository.
* Include HTS files in the automated download process.
GenBank/RefSeq Annoying Issues
* The entire GenBank directory is replaced when a new version is released. Daily releases are relative to this.
* GenBank daily releases don't indicate deleted entries.
* GenBank daily filenames don't include a year, so daily files between the beginning of the year and the next release (probably Jan 15th) will not sort in a simple manner.
* RefSeq updates its cumulative files daily as well as having separate daily files. There is no concept of a release.
* RefSeq deleted entries are still in the older daily releases. There are no daily records indicating when an entry has been deleted.
* Occasionally, there are incorrect GenBank entries that break assumptions in this code. These are skipped by listing their accessions in data/ignore.idx.
* MySQL ISAM tables don't support foreign keys.
* Using auto_increment for id columns was a problem because mysqlimport would reset the numbers (or at least not insert zero).
* We want to use disk files rather than a database to track GenBank repository files. This is faster when we need to look at all entries and makes setup and loading multiple database servers easier. It was also easier to implement. However, this proved to be a problem for ESTs, which require a large amount of memory to handle. To reduce the memory required, ESTs are partitioned by the first two letters of the accession.
* Realigning sequences (say, to take advantage of changes to the aligner) isn't handled.
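The EST partitioning mentioned above can be illustrated with a minimal sketch. This is not the pipeline's actual code: the function name `partition_by_prefix` and the sample accession strings are made up for illustration. It only shows the idea of grouping accessions by their first two letters so that each group can be handled separately, keeping one partition in memory at a time.

```python
from collections import defaultdict


def partition_by_prefix(accessions, prefix_len=2):
    """Group accessions by their first `prefix_len` characters.

    Hypothetical helper for illustration only; the real pipeline's
    partitioning code may differ.
    """
    partitions = defaultdict(list)
    for acc in accessions:
        partitions[acc[:prefix_len].upper()].append(acc)
    return dict(partitions)


# Made-up accession strings, used only to show the grouping.
sample = ["AA000001.1", "AA000002.1", "BF123456.1", "BG654321.2"]
parts = partition_by_prefix(sample)
for prefix in sorted(parts):
    print(prefix, len(parts[prefix]))
```

Each key of the returned dictionary corresponds to one partition, so a later step could process the partitions one after another instead of loading every EST at once.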
Overview of directories
$gbRoot/ - root directory
  etc/ - configuration files and scripts
    ignore.idx - ignore index file
    genbank.conf - configuration file
  data/ - data files
    download/ - downloaded files from NCBI ftp
      genbank.${ver}/
      genbank.${ver}/daily-nc/
      refseq.${ver}/cummulative/
      refseq.${ver}/daily/
    processed/ - data extracted from the NCBI flat-files
      genbank.${ver}/
        full/
        daily.${date}/
      refseq.${ver}/
        full/
        daily.${date}/
    aligned/ - aligned sequences
      ${db}/
  var/build/ - files associated with the download and build steps; only on the build server
    run/ - semaphore files
    logs/ - log files
    build.time - file containing the time that the last download and alignment steps completed, in seconds since 00:00:00 1970-01-01 UTC. This is used by processes running on other systems to poll for completion.
  var/copy/ - files associated with copying to the gbdb server
    run/ - semaphore files
    logs/ - log files
    build.time - copy of build/build.time from the last completed copy
    copy.time - file containing the time the last copy completed
  var/dbload/$host/ - files associated with the last database load on database server $host
    run/ - semaphore files
    logs/ - log files
    copy.time - copy of copy/copy.time from the last completed copy
    load.time - file containing the time the last load completed
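The way build.time and its copy under var/copy/ can be used for polling might look roughly like the following sketch. It assumes each .time file holds a single integer (seconds since the epoch), possibly followed by a newline; the helper names `read_time` and `copy_needed` are hypothetical and are not part of the GenBank scripts.

```python
from pathlib import Path


def read_time(path):
    """Return the integer timestamp stored in a .time file.

    Returns None if the file is missing or empty. Assumes the file
    holds a single integer, which is an assumption of this sketch.
    """
    p = Path(path)
    if not p.exists():
        return None
    text = p.read_text().strip()
    return int(text) if text else None


def copy_needed(gb_root):
    """True if var/build/build.time is newer than var/copy/build.time."""
    root = Path(gb_root)
    built = read_time(root / "var" / "build" / "build.time")
    copied = read_time(root / "var" / "copy" / "build.time")
    if built is None:
        return False  # no build has completed yet
    return copied is None or built > copied
```

A polling process could call `copy_needed(gb_root)` periodically and start a copy only when it returns True. The same comparison would apply one level down, between copy/copy.time and the copy.time stored under var/dbload/$host/.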
Realigning Tracks
It may be necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straightforward when a genome database is initially being built. It's more complex if one has to sync up multiple systems.
* If automated alignment or update has been enabled for the database, disable it by editing $gbRoot/etc/align.dbs.
* Make sure an automated alignment isn't currently running.
* To trigger a realignment, one needs to remove the related files for some partition of the data for all updates. These live under either the genbank or refseq alignment directories, for example:
    data/aligned/genbank.139.0/hg16/
    data/aligned/refseq.139.0/hg16/
  To realign native RefSeq mRNAs for hg16, one would remove:
    data/aligned/refseq.139.0/hg16/*/mrna.native.*
  To realign xeno GenBank ESTs for hg16, one would remove:
    data/aligned/genbank.139.0/hg16/*/est.*.xeno.*
* Do an initial alignment as described above, restricting with -srcDb and -type.
* Reload the database with the partition of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported; use -drop and -initialLoad instead.
    nice bin/gbDbLoadStep -reload -srcDb=genbank -type=mrna $db
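Before removing aligned files to trigger a realignment, it can help to list exactly what a pattern matches. The following is a minimal dry-run sketch, not part of the GenBank scripts: the function name `files_to_remove` is hypothetical, and it only expands the same kind of wildcard path shown in the examples above without deleting anything.

```python
import glob
import os


def files_to_remove(gb_root, release, db, pattern):
    """List files under data/aligned/<release>/<db>/*/ matching `pattern`.

    Dry run only: the files are listed, never deleted.
    Hypothetical helper for illustration.
    """
    path = os.path.join(gb_root, "data", "aligned", release, db, "*", pattern)
    return sorted(glob.glob(path))
```

For the native RefSeq mRNA example, one might call `files_to_remove(gb_root, "refseq.139.0", "hg16", "mrna.native.*")`, where `gb_root` is the value of $gbRoot, and review the result before removing the files by hand.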