Building GenArk genomes
GenArk
The Genome Archive (GenArk) is a collection of assembly hubs that come pre-loaded with several annotation tracks and gene models, and that can align genomic sequence to the reference assembly using the BLAT alignment tool. The genomes in GenArk are sourced from NCBI RefSeq, the Vertebrate Genomes Project (VGP), and other projects.
Steps from Hiram's Talk (by Gerardo) 2/2/2022
See Recorded Demo here: https://genome-test.gi.ucsc.edu/~hiram/zoomRecordings/2022-02-02.GenArkBuild/video1137793487.mp4
Before starting
Check whether you have the goto and gotos commands. If not, add the following to your .bashrc file:
function goto() {
  export asmId=$1
  export gcX=${asmId:0:3}
  export d0=${asmId:4:3}
  export d1=${asmId:7:3}
  export d2=${asmId:10:3}
  export destDir=""
  if [ "${gcX}" = "GCF" ]; then
    destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}"
  else
    destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}"
  fi
  export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`
  if [ -d "${cdDir}" ]; then
    printf "cd $cdDir\n" 1>&2
    cd "$cdDir"
  else
    printf "# can not find ${destDir}/${asmId}*\n" 1>&2
  fi
}

function gotos() {
  export asmId=$1
  export gcX=${asmId:0:3}
  export d0=${asmId:4:3}
  export d1=${asmId:7:3}
  export d2=${asmId:10:3}
  export destDir="/hive/data/outside/ncbi/genomes/${gcX}/${d0}/${d1}/${d2}"
  export dirCount=`ls -d ${destDir}/${asmId}* 2> /dev/null | wc -l`
  if [ "${dirCount}" -ne 1 ]; then
    printf "# can not find ${destDir}/${asmId}*\n" 1>&2
  else
    export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`
    if [ -d "${cdDir}" ]; then
      printf "cd $cdDir\n" 1>&2
      cd "$cdDir"
    else
      printf "# can not find ${destDir}/${asmId}*\n" 1>&2
    fi
  fi
}
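Both functions locate a directory by slicing the accession into three-digit pieces. The sketch below only prints that sub-directory so you can see what the slicing produces without changing directory. The function name accession_subdir is made up for this illustration, and cut -c is used as a portable stand-in for the ${asmId:0:3} style slices.

```shell
# Print the GCx/nnn/nnn/nnn sub-directory that goto and gotos derive
# from an accession. accession_subdir is a hypothetical helper name.
accession_subdir() {
    acc=$1
    gcX=$(printf '%s\n' "$acc" | cut -c1-3)     # same as ${asmId:0:3}: GCA or GCF
    d0=$(printf '%s\n' "$acc" | cut -c5-7)      # same as ${asmId:4:3}
    d1=$(printf '%s\n' "$acc" | cut -c8-10)     # same as ${asmId:7:3}
    d2=$(printf '%s\n' "$acc" | cut -c11-13)    # same as ${asmId:10:3}
    printf '%s/%s/%s/%s\n' "$gcX" "$d0" "$d1" "$d2"
}

accession_subdir GCA_001708105.1    # prints GCA/001/708/105
```

goto appends this to the refseqBuild or genbankBuild tree, and gotos appends it to /hive/data/outside/ncbi/genomes.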
gotos takes you to the directory holding everything that came from NCBI, accumulated via a weekly rsync. This is the genome we want to work from:
gotos GCA_001708105.1
pwd
/hive/data/outside/ncbi/genomes/GCA/001/708/105/GCA_001708105.1_ASM170810v1
Check how big the genome is, which indicates how long the build is going to take to run:
faSize GCA_001708105.1_ASM170810v1_rna_from_genomic.fna.gz
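If faSize is not on your PATH, a rough size can still be obtained with standard tools. This is only a sketch: fa_total_bases is a made-up helper name, and it reports just a total base count and a record count, not the breakdown that faSize gives.

```shell
# Count sequence records and total bases in FASTA text read from stdin.
# fa_total_bases is a hypothetical helper, not a kent utility.
fa_total_bases() {
    awk '/^>/ { n++; next }                           # header line: one more record
         { gsub(/[ \t\r]/, ""); total += length($0) } # sequence line: add its length
         END { printf "%d bases in %d sequences\n", total, n }'
}

# Usage on a compressed file:
#   gzip -dc GCA_001708105.1_ASM170810v1_rna_from_genomic.fna.gz | fa_total_bases
```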
The command to run
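The runBuild command template (the scientific name should have underscores instead of blanks). Note that the worked example later on this page passes the accession, the full assembly name, the clade, and then the scientific name, so check master.run.list for the argument order actually in use: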
time (./runBuild assemblyName Scientific_Name clade)
Finding the info for runBuild
History of every command used to build the browser:
head kent/src/hg/makeDb/doc/asmHubs/master.run.list
The full assembly name is in the report file:
less *report.txt
Example report.txt:
# Assembly name: ASM170810v1
# Organism name: Komagataella pastoris (budding yeasts)
# Infraspecific name: strain=ATCC 28485
You will need the scientific name, e.g. Komagataella pastoris.
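The header values can be pulled out of the report file instead of being copied by hand. A minimal sketch, assuming the "# Key: value" layout shown in the example report. The helper names report_field and sci_name_arg are invented for this illustration.

```shell
# Print the value of one "# Key: value" header line from an assembly report.
# $1 = report file, $2 = field label, e.g. "Assembly name"
report_field() {
    sed -n "s/^# *$2: *//p" "$1" | head -n 1 | tr -d '\r'
}

# Turn "Komagataella pastoris (budding yeasts)" into the underscore form
# used on the runBuild command line: drop the parenthetical, then
# replace blanks with underscores.
sci_name_arg() {
    printf '%s\n' "$1" | sed -e 's/ *(.*$//' -e 's/ /_/g'
}

# Usage, run in the directory that gotos takes you to:
#   report_field GCA_001708105.1_ASM170810v1_assembly_report.txt "Assembly name"
#   sci_name_arg "Komagataella pastoris (budding yeasts)"
```

The report file name in the usage comment is an assumption. Use whatever *report.txt matches in your directory.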
Find the clade by checking how similar organisms are already listed:
cd ~/kent/src/hg/makeDb/doc/
grep -i "budding yeast" */*.tsv
fungiAsmHub/fungi.orderList.tsv:GCF_001661345.1_Ascru1 budding yeast A.rubescens DSM 1968
fungiAsmHub/fungi.orderList.tsv:GCF_011074885.1_ASM1107488v2 budding yeast B.bruxellensis UCD 2041
fungiAsmHub/fungi.orderList.tsv:GCF_001661335.1_Babin1 budding yeast B.inositovora NRRL Y-12698
fungiAsmHub/fungi.orderList.tsv:GCF_011074865.1_ASM1107486v2 budding yeast B.nanus
fungiAsmHub/fungi.orderList.tsv:GCF_000182965.3_ASM18296v3 budding yeast C.albicans SC5314
fungiAsmHub/fungi.orderList.tsv:GCF_002775015.1_Cand_auris_B11221_V1 budding yeast C.auris
fungiAsmHub/fungi.orderList.tsv:GCF_000026945.1_ASM2694v1 budding yeast C.dubliniensis CD36
...
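When the grep returns many lines, it can help to tally which hub directory the hits come from. The sketch below assumes the clade argument for runBuild is the hub directory name without its AsmHub suffix, as with fungiAsmHub and fungi in this example. The helper name clade_hits is made up.

```shell
# Read grep output of the form "<clade>AsmHub/<file>:<match>" on stdin
# and print each clade with its number of hits.
clade_hits() {
    awk -F/ '{ sub(/AsmHub$/, "", $1); count[$1]++ }
             END { for (c in count) printf "%s %d\n", c, count[c] }' | sort
}

# Usage, from ~/kent/src/hg/makeDb/doc/:
#   grep -i "budding yeast" */*.tsv | clade_hits
```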
Running the runBuild command
The command can be run from anywhere, but it is preferable to accumulate the builds in one directory:
cd /hive/data/genomes/asmHubs/allBuild
Save the command in a file:
cd /hive/data/genomes/asmHubs/allBuild/history
echo './runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris' > GCA_001708105.1.list
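The accession is the first two underscore-separated fields of the full assembly name, so the command line can be assembled from three values rather than typed out. A minimal sketch. The helper name runbuild_cmd is invented, and the argument order follows the example on this page.

```shell
# Print a runBuild command line in the order used in this example:
# accession, full assembly name, clade, scientific name.
# $1 = full assembly name, $2 = clade, $3 = scientific name with underscores
runbuild_cmd() {
    accession=$(printf '%s\n' "$1" | cut -d_ -f1-2)   # e.g. GCA_001708105.1
    printf './runBuild %s %s %s %s\n' "$accession" "$1" "$2" "$3"
}

runbuild_cmd GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris
# prints: ./runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris
```

Redirect the output to the .list file in the history directory to save it.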
Move to a screen:
screen -S screenName
You can detach from the screen, leaving it running, by pressing Ctrl-a then Ctrl-d.
You can reattach to the screen by the following command:
screen -r -d screenName
Run the command:
cd /hive/data/genomes/asmHubs/allBuild
time (./runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris) > GCA_001708105.1.log 2>&1 &
Even before the genome is built, the idKeys directory makes it possible to compare this genome with the others and find the equivalences between all the different genomes. Every assembly has an idKeys directory where this work has already been started.
Checking progress
Look at progress:
goto GCA_001708105.1
Check which track is being built (the build can get stuck here). og is assumed here to be a personal alias for a long listing such as ls -ogrt:
og trackData/
Can see each step happening in the build log:
tail -f build.log
You can confirm that the run has finished when it has made a trackDb file:
GCA_001708105.1_ASM170810v1.trackDb.txt
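The completion check can be scripted. This sketch assumes the trackDb file sits at the top of the build directory that goto takes you to. The helper name build_finished is made up.

```shell
# Report whether a non-empty <asmId>.trackDb.txt exists in the build directory.
# $1 = build directory, $2 = full assembly name
build_finished() {
    if [ -s "$1/$2.trackDb.txt" ]; then
        echo "finished: $2"
    else
        echo "still running or stalled: $2"
        return 1
    fi
}

# Usage, after "goto GCA_001708105.1":
#   build_finished . GCA_001708105.1_ASM170810v1
```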
Post runBuild steps
Add the command:
./runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris
to the master run list in the source tree, keeping the list sorted, to record that this build has been done:
kent/src/hg/makeDb/doc/asmHubs/master.run.list
Commit the changes, e.g.:
git commit -m 'adding fungi per user request, refs #28821'
Add an entry whose name includes the common name and strain. Entries are kept in alphabetical order by the second column. Take the details from *report.txt and follow the nomenclature already used in the clade's orderList.tsv file, e.g.:
GCA_001708105.1_ASM170810v1 budding yeast K.pastoris ATCC 28485
to:
makeDb/doc/fungiAsmHub/fungi.orderList.tsv
The second column is what shows up in the pull-down menu.
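The entry follows a regular pattern: assembly name, then common name, abbreviated scientific name, and strain. A minimal sketch of that pattern, assuming the two columns are separated by a tab. The helper names abbrev_sci_name and order_list_entry are invented for this illustration.

```shell
# "Komagataella pastoris" -> "K.pastoris"
abbrev_sci_name() {
    printf '%s\n' "$1" | awk '{ printf "%s.%s\n", substr($1, 1, 1), $2 }'
}

# Print one orderList.tsv line.
# $1 = full assembly name, $2 = common name, $3 = scientific name,
# $4 = strain (optional)
order_list_entry() {
    short=$(abbrev_sci_name "$3")
    printf '%s\t%s %s%s\n' "$1" "$2" "$short" "${4:+ $4}"
}

order_list_entry GCA_001708105.1_ASM170810v1 "budding yeast" "Komagataella pastoris" "ATCC 28485"
```

Compare the result against neighbouring lines in the file before committing, since existing entries may not all follow the same pattern.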
Commit the changes, e.g.:
git commit -m 'adding yeast K.pastoris per user request refs #28821'
Then run the following make in makeDb/doc/*AsmHub to get the symlinks constructed and make this assembly appear in the outside world:
time (make) > dbg 2>&1 &
In the same directory, verify that it worked:
time (make verifyTestDownload) >> test.down.log 2>&1 &
This goes through each assembly in that assembly hub and makes sure it has enough tracks to be a legitimate assembly hub.
You should be able to see the assembly for yourself in the hgdownload-test listing and file directory. Examples:
https://hgdownload-test.gi.ucsc.edu/hubs/GCA/011/100/615/GCA_011100615.1/
https://hgdownload-test.gi.ucsc.edu/hubs/primates/index.html
Pushing to the RR
Verify that you can log in to hgdownload and dynablat. The first time, you may need to update your keys. If it doesn't work, you may already have a key that needs to be deleted. Adding "date" to the end of the command is a way to check your access without staying logged in to that server.
ssh qateam@hgdownload.soe.ucsc.edu date
ssh qateam@dynablat-01.soe.ucsc.edu date
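The two checks can be wrapped so that a failure is reported rather than leaving you at a password prompt. A sketch only: check_login is a made-up helper, and the SSH variable is a hypothetical override that lets you substitute another command for ssh. BatchMode and ConnectTimeout are standard OpenSSH options.

```shell
# Try to run "date" on a remote host without prompting for a password.
# $1 = user@host. Set SSH to override the ssh command (hypothetical hook).
check_login() {
    if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=10 "$1" date > /dev/null 2>&1; then
        printf 'ok\t%s\n' "$1"
    else
        printf 'FAILED\t%s\n' "$1"
        return 1
    fi
}

# Usage:
#   check_login qateam@hgdownload.soe.ucsc.edu
#   check_login qateam@dynablat-01.soe.ucsc.edu
```

With BatchMode=yes, ssh fails straight away if key authentication does not work, which is the case described above.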
Until we have a larger dynamic BLAT machine, we cannot push all the mammal dynamic BLAT server files. Enable dynablat for the assembly if the clade is not mammal:
vim ~/kent/src/hg/makeDb/doc/asmHubs/sendToHgdownload.sh
Edit:
if [ 1 -eq 0 ]; then
to:
if [ 1 -eq 1 ]; then
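Before and after editing, it can be useful to list the switch lines and their current state. The sketch below only reports and never modifies the file. The helper name dynablat_switch_state is invented, and it will show every line of this form, so deciding which one controls dynablat is still up to you.

```shell
# Show, with line numbers, every "if [ 1 -eq 0 ]; then" or
# "if [ 1 -eq 1 ]; then" line in the given script.
# $1 = path to sendToHgdownload.sh
dynablat_switch_state() {
    grep -n 'if \[ 1 -eq [01] ]; then' "$1"
}

# Usage:
#   dynablat_switch_state ~/kent/src/hg/makeDb/doc/asmHubs/sendToHgdownload.sh
```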
Run the final RR push from ~/kent/src/hg/makeDb/doc/*AsmHub:
time (make sendDownload ) >> send.down.log 2>&1 &
Revert the sendToHgdownload.sh setting back to the original:
git checkout ../asmHubs/sendToHgdownload.sh
Then verify the download:
time (make verifyDownload) > verify.down.log 2>&1 &