Building GenArk genomes: Difference between revisions

(Updating the building genArk genomes procedure, refs #29046)
 
(3 intermediate revisions by 2 users not shown)
Line 12: Line 12:
<nowiki>
<nowiki>
function goto() {
function goto() {
export destDir=""
export asmId=$1
export asmId=$1
case "${asmId}" in
    GC*)
      gcX=${asmId:0:3}
      d0=${asmId:4:3}
      d1=${asmId:7:3}
      d2=${asmId:10:3}
      if [ "${gcX}" = "GCF" ]; then
      destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}"
      else
      destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}"
      fi
      export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`


export gcX=${asmId:0:3}
      if [ -d "${cdDir}" ]; then
export d0=${asmId:4:3}
        destDir="${cdDir}/trackData"
export d1=${asmId:7:3}
      else
export d2=${asmId:10:3}
        printf "# can not find ${destDir}/${asmId}*\n" 1>&2
 
      fi
export destDir=""
      ;;
    *) destDir="/hive/data/genomes/${asmId}/bed"
      if [ -d "/hive/data/genomes/${asmId}/trackData" ]; then
        destDir="/hive/data/genomes/${asmId}/trackData"
      fi
      ;;
esac


if [ "${gcX}" = "GCF" ]; then
if [ -d "${destDir}" ]; then
   destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}"
   printf "cd $destDir\n" 1>&2
  cd "$destDir"
else
else
   destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}"
   printf "# can not find ${destDir}\n" 1>&2
fi
fi


export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`
if [ -d "${cdDir}" ]; then
  printf "cd $cdDir\n" 1>&2
  cd "$cdDir"
else
  printf "# can not find ${destDir}/${asmId}*\n" 1>&2
fi
}
}


function gotos() {
function gotos() {
Line 50: Line 62:
export dirCount=`ls -d ${destDir}/${asmId}* 2> /dev/null | wc -l`
export dirCount=`ls -d ${destDir}/${asmId}* 2> /dev/null | wc -l`
if [ "${dirCount}" -ne 1 ]; then
if [ "${dirCount}" -ne 1 ]; then
  printf "# can not find ${destDir}/${asmId}*\n" 1>&2
  printf "# can not find ${destDir}/${asmId}*\n" 1>&2
else
else
  export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`
  export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`
  if [ -d "${cdDir}" ]; then
  if [ -d "${cdDir}" ]; then
    printf "cd $cdDir\n" 1>&2
    printf "cd $cdDir\n" 1>&2
    cd "$cdDir"
    cd "$cdDir"
  else
  else
    printf "# can not find ${destDir}/${asmId}*\n" 1>&2
    printf "# can not find ${destDir}/${asmId}*\n" 1>&2
  fi
  fi
fi
fi
}
}
Line 65: Line 77:




Directory that everything came from ncbi, accumulating via weekly rsync and this is the genome we want to work from:
===Finding the info for the runBuild command===
  gotos GCA_001708105.1
Hiram recommends checking if a genome browser already exists in the source tree (grep number part of the GCA/GCF ID: 002776525).
pwd
 
/hive/data/outside/ncbi/genomes/GCA/001/708/105/GCA_001708105.1_ASM170810v1
Example:
  $ grep 002776525 ~/kent/src/hg/makeDb/doc/*AsmHub/*.tsv
 
Hiram has list of genomes pre-ready to be run in the allBuild directory. There are GenBank versions or RefSeq versions. Decide which one is best or most up to date, RefSeq is always first choice.
 
'''Genbank''': /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.*


How big and long its going to take to run:
'''RefSeq''': /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*
faSize GCA_001708105.1_ASM170810v1_rna_from_genomic.fna.gz


grep number part of the GCA/GCF ID: 002776525 to find genomes that are pre-ready to be run:
$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*


===The command to run===
$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.* 
'''The runBuild command template (scientific name should have underscores instead of blanks):'''
/hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.primates.txt:./runBuild GCA_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles 2019_12_12
'''time (./runBuild assemblyName Scientific_Name clade)'''


If it doesn't show up in the list of genomes pre-ready to be run, check the 'master' listings from NCBI:
$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/assembly_summary*.txt


===Finding the info for runBuild===
===Running the runBuild command===


History of every command used to build the browser:
Move to a screen:
  head kent/src/hg/makeDb/doc/asmHubs/master.run.list
  $ screen -S screenName


Full assembly name here:
The command can be run anywhere but preferably accumulated into one directory.
  less *report.txt
  $ cd /hive/data/genomes/asmHubs/allBuild


Example report.txt:
The runBuild command template (scientific name should have underscores instead of blanks):
<pre>
  '''time (./runBuild assemblyName clade Scientific_Name)'''
<nowiki>
# Assembly name: ASM170810v1
# Organism name: Komagataella pastoris (budding yeasts)
# Infraspecific name:  strain=ATCC 28485
</nowiki>
</pre>


Going to need the scientific name, ex. Komagataella pastoris
Running the command:
$ time (./runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea) >> GCA_001632725.1.log 2>&1


Find what the clade. See what others are called:
===Checking progress===
cd ~kent/src/hg/makeDb/doc/
grep -i budding yeasts */*.tsv
<pre>
<nowiki>
fungiAsmHub/fungi.orderList.tsv:GCF_001661345.1_Ascru1 budding yeast A.rubescens DSM 1968
fungiAsmHub/fungi.orderList.tsv:GCF_011074885.1_ASM1107488v2 budding yeast B.bruxellensis UCD 2041
fungiAsmHub/fungi.orderList.tsv:GCF_001661335.1_Babin1 budding yeast B.inositovora NRRL Y-12698
fungiAsmHub/fungi.orderList.tsv:GCF_011074865.1_ASM1107486v2 budding yeast B.nanus
fungiAsmHub/fungi.orderList.tsv:GCF_000182965.3_ASM18296v3 budding yeast C.albicans SC5314
fungiAsmHub/fungi.orderList.tsv:GCF_002775015.1_Cand_auris_B11221_V1 budding yeast C.auris
fungiAsmHub/fungi.orderList.tsv:GCF_000026945.1_ASM2694v1 budding yeast C.dubliniensis CD36
...
</nowiki>
</pre>


Check progress by looking at what track is getting built in the trackData/ directory for the assembly. Use goto command to find the directory:
  $ goto GCA_001632725.1


$ ls
addMask  allGaps  assemblyGap  augustus  chromAlias  cpgIslands  cytoBand  gapOverlap  gc5Base  idKeys  ncbiGene  repeatMasker  simpleRepeat  tandemDups  windowMasker  xenoRefGene


===Running the runBuild command===
You can confirm the run finished when it makes a trackDb:
GCA_001632725.1_ASM163272v1.trackDb.txt


The command can be run anywhere but preferably accumulated into one directory:
You will get an email when the run finishes: '''genArk build done: GCA_001632725.1_ASM163272v1'''
cd /hive/data/genomes/asmHubs/allBuild


Saved the command in a file:
cd /hive/data/genomes/asmHubs/allBuild/history
echo './runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris' > GCA_001708105.1.list


===Post runBuild steps===


Move to a screen:
Saved the command in a file:
  screen -S screenName
  $ cd /hive/data/genomes/asmHubs/allBuild/history
$ echo './runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea' > GCA_001632725.1.list


You can exit the screen by doing Ctrl-a then Ctrl-d
After run finishes, you will need to add and commit the runBuild command to the master run list in the source tree and ordered by GC identifier:


You can reattach to the screen by the following command:
Edit: '''~/kent/src/hg/makeDb/doc/asmHubs/master.run.list'''
screen -r -d screenName


Run the command:
Add:  
cd /hive/data/genomes/asmHubs/allBuild
time (./runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris) > GCA_001708105.1.log 2>&1 &


Even before building the genome, it allows to compare the genome with the other ones to find out the equivalence between all the different genomes. Everything has an idKeys directory where this work has started before
./runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea


Commit changes, ex.
$ git commit -m 'adding GCA_001632725.1_ASM163272v1 per user request, refs #29545'


===Checking progress===


Look at progress:
Add and commit an entry with the assembly name and common name in the orderList.tsv file. Entry should be in alphabetical order (case insensitive) by the second column which is the common name. You can use Hiram’s commonNames.pl script to find a common name:  
goto GCA_001708105.1


Check what track is getting built (can get stuck here):
Place the name in a temporary file, in this case called '1'
  og trackData/
$ echo GCA_001632725.1_ASM163272v1 > 1
  $ ~/kent/src/hg/makeDb/doc/asmHubs/commonNames.pl 1
GCA_001632725.1_ASM163272v1 asian clam (DE_01 2016)


Can see each step happening in the build log:
Edit: '''~/kent/src/hg/makeDb/doc/invertebrateAsmHub/invertebrate.orderList.tsv'''
tail -f build.log


You can confirm the run finished when it makes a trackDb:
Add:  
  GCA_001708105.1_ASM170810v1.trackDb.txt
  GCA_001632725.1_ASM163272v1    Asian clam (DE_01 2016)




===Post runBuild steps===
Add the command:
./runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris
In the master run list in the source tree, record that this has been done and sorted:
kent/src/hg/makeDb/doc/asmHubs/master.run.list
Commit changes, ex.
Commit changes, ex.
  git commit -m 'adding fungi per user request, refs #28821'
  $ git commit -m 'adding Corbicula_fluminea Asian clam (DE_01 2016) refs #29545'


Add an entry with a common name and strain to the name. Second column by alphabetical order. Use *report.txt and used the nomenclature in the clade orderList.tsv file. ex:
=== Pushing to the RR ===
GCA_001708105.1_ASM170810v1    budding yeast K.pastoris ATCC 28485
to:
  makeDb/doc/fungiAsmHub/fungi.orderList.tsv
The second column would show up in the pull down menu.


Commit changes, ex.
git commit -m 'adding yeast K.pastoris per user request refs #28821'
   
   
Then run the following make in makeDb/doc/*AsmHub to get the symlinks constructed and make this assembly appear in the outside world:
Run the following make in makeDb/doc/[animalGroup]AsmHub to get the symlinks constructed and make this assembly appear in the outside world:
  time (make) > dbg 2>&1 &
  $ time (make) > dbg 2>&1 &
 
Check for errors:
$ grep -i error dbg
$ tail dbg
 
This make sets up a number of file links to get pushed out.


Same directory, verify that it worked:
Same directory, verify that it worked:
  time (make verifyTestDownload) >> test.down.log 2>&1 &
  $ time (make verifyTestDownload) >> test.down.log 2>&1 &
Its going through each assembly in that  assembly hub and making sure it has enough tracks to be a legitimate assembly hub.
Its going through each assembly in that  assembly hub and making sure it has enough tracks to be a legitimate assembly hub.
The test.down.log file should end in the line something like:
  $ tail test.down.log
  # checked 598 hubs, 598 success,  0 fail, total tracks: 10288, 2023-06-17 12:40:15


And if you accumulate that test.down.log file, you can see the changes
over time via:
  $ grep check test.down.log
for each of those 'checked' lines.


You should be able to see for yourself the assembly in the hgDownload-test listing and file directory. Example below:
https://hgdownload-test.gi.ucsc.edu/hubs/GCA/011/100/615/GCA_011100615.1/
https://hgdownload-test.gi.ucsc.edu/hubs/primates/index.html




Get log in keys (Required for the first time):
Verify you can login to hgdownload and dynablat. For the first time, you may need to update your keys. If it doesn't work, you may already have a key and you may need to delete it. The command with "date" afterward is a way to check your access without actually staying in that server.
ssh qateam@hgdownload.soe.ucsc.edu
  ssh qateam@hgdownload.soe.ucsc.edu date
  ssh qateam@hgdownload.soe.ucsc.edu date


$ ssh qateam@dynablat-01.soe.ucsc.edu date


ssh qateam@dynablat-01.soe.ucsc.edu
ssh qateam@dynablat-01.soe.ucsc.edu date
'''Until we have a larger dynamic blat machine we can not push all the mammal dynamic blat server files'''. Enable dynablat for the assembly if the clade is not mammal:
makeDb/doc/asmHubs/sendToHgdownload.sh
Edit:
if [ 1 -eq 0 ]; then
To


  if [ 1 -eq 1 ]; then
Run RR push from ~/kent/src/hg/makeDb/doc/[animalGroup]AsmHub:
  $ time (make sendDownload ) >> send.down.log 2>&1 &


Run:
check for errors:
time (make sendDownload ) >> send.down.log 2>&1 &
  $ grep -i error send.down.log
  $ tail send.down.log


Revert the sendToHgdownload.sh setting back to original:
Then final check that it is on hgdownload:
  git checkout ../asmHubs/sendToHgdownload.sh
  $ time (make verifyDownload) > verify.down.log 2>&1 &


Run:
Should have a valid last 'checked' line
  time (make verifyDownload) > verify.down.log 2>&1 &
  $ tail verify.down.log

Latest revision as of 21:10, 2 January 2025

GenArk

The Genome Archive (GenArk) is a collection of assembly hubs that come pre-loaded with several annotation tracks, gene models, and the ability to align genomic sequence to the reference assembly using the BLAT alignment tool. The genomes in GenArk are sourced from NCBI RefSeq, the Vertebrate Genomes Project (VGP), and other projects.


Steps from Hiram's Talk (by Gerardo) 2/2/2022

See Recorded Demo here: https://genome-test.gi.ucsc.edu/~hiram/zoomRecordings/2022-02-02.GenArkBuild/video1137793487.mp4

Before starting

Check whether you have the goto and gotos commands. If not, add the following to your .bashrc file:


function goto() {

export destDir=""
export asmId=$1
case "${asmId}" in
    GC*)
      gcX=${asmId:0:3}
      d0=${asmId:4:3}
      d1=${asmId:7:3}
      d2=${asmId:10:3}
      if [ "${gcX}" = "GCF" ]; then
        destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}"
      else
        destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}"
      fi
      export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`

      if [ -d "${cdDir}" ]; then
        destDir="${cdDir}/trackData"
      else
        printf "# can not find ${destDir}/${asmId}*\n" 1>&2
      fi
      ;;
    *) destDir="/hive/data/genomes/${asmId}/bed"
      if [ -d "/hive/data/genomes/${asmId}/trackData" ]; then
        destDir="/hive/data/genomes/${asmId}/trackData"
      fi
      ;;
esac

if [ -d "${destDir}" ]; then
   printf "cd $destDir\n" 1>&2
   cd "$destDir"
else
   printf "# can not find ${destDir}\n" 1>&2
fi

}

function gotos() {
export asmId=$1

export gcX=${asmId:0:3}
export d0=${asmId:4:3}
export d1=${asmId:7:3}
export d2=${asmId:10:3}

export destDir="/hive/data/outside/ncbi/genomes/${gcX}/${d0}/${d1}/${d2}"

export dirCount=`ls -d ${destDir}/${asmId}* 2> /dev/null | wc -l`
if [ "${dirCount}" -ne 1 ]; then
  printf "# can not find ${destDir}/${asmId}*\n" 1>&2
else
  export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`
  if [ -d "${cdDir}" ]; then
    printf "cd $cdDir\n" 1>&2
    cd "$cdDir"
  else
    printf "# can not find ${destDir}/${asmId}*\n" 1>&2
  fi
fi
}
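
The substring offsets used by goto and gotos may be easier to follow on a concrete accession. The following is a minimal sketch, not part of the original .bashrc functions: asm_path is a hypothetical helper name, and it uses cut instead of bash substring expansion so that it also runs under plain sh.

```shell
# Hypothetical helper (not in the original .bashrc): print the
# GCx/ddd/ddd/ddd directory fragment that goto and gotos derive
# from an accession such as GCA_001708105.1.
asm_path() (
  asmId=$1
  gcX=$(printf '%s' "$asmId" | cut -c1-3)    # same as ${asmId:0:3}
  d0=$(printf '%s' "$asmId" | cut -c5-7)     # same as ${asmId:4:3}
  d1=$(printf '%s' "$asmId" | cut -c8-10)    # same as ${asmId:7:3}
  d2=$(printf '%s' "$asmId" | cut -c11-13)   # same as ${asmId:10:3}
  printf '%s/%s/%s/%s\n' "$gcX" "$d0" "$d1" "$d2"
)

asm_path GCA_001708105.1
```

For GCA_001708105.1 this prints GCA/001/708/105, which is the fragment both functions append to their base directory.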


Finding the info for the runBuild command

Hiram recommends first checking whether a genome browser already exists in the source tree. To do this, grep for the numeric part of the GCA/GCF ID (002776525 in this example).

Example:

$ grep 002776525 ~/kent/src/hg/makeDb/doc/*AsmHub/*.tsv
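
If you start from a full accession or assembly name, the numeric part can be extracted rather than typed by hand. This is only a sketch; asm_digits is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: strip the GCA_/GCF_ prefix and the version
# suffix, leaving the numeric part used in the grep commands.
asm_digits() (
  rest=${1#GC?_}                  # drop the leading "GCA_" or "GCF_"
  printf '%s\n' "${rest%%.*}"     # drop ".version" and anything after it
)

asm_digits GCA_002776525.5_ASM277652v5
```

The result can then be passed to grep, for example: grep "$(asm_digits GCA_002776525.5)" ~/kent/src/hg/makeDb/doc/*AsmHub/*.tsv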

Hiram has a list of genomes that are ready to be run in the allBuild directory. There are GenBank versions and RefSeq versions. Decide which one is best or most up to date; RefSeq is always the first choice.

GenBank: /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.*

RefSeq: /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*

Grep for the numeric part of the GCA/GCF ID (here 002776525) to find genomes that are ready to be run:

$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*
$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.*  
/hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.primates.txt:./runBuild GCA_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles	2019_12_12

If it doesn't show up in the lists of genomes that are ready to be run, check the 'master' listings from NCBI:

$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/assembly_summary*.txt
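
The three searches above can be chained so that the RefSeq list is tried first, then GenBank, then the NCBI master listings. This is a rough sketch under assumptions: find_asm is a hypothetical name, the optional second argument (a different reports directory) exists only to make the function easy to try out, and stopping at the first list with a match is an illustrative choice.

```shell
# Sketch only: search the lists in the order described above.
# fa_reports defaults to the reports path used on this page.
find_asm() {
  fa_digits=$1
  fa_reports=${2:-/hive/data/outside/ncbi/genomes/reports}
  for fa_pattern in "newAsm/rs.todo.*" "newAsm/gb.todo.*" "assembly_summary*.txt"; do
    # $fa_pattern is left unquoted on purpose so the shell expands the glob
    if grep -H -- "$fa_digits" "$fa_reports"/$fa_pattern 2> /dev/null; then
      return 0    # stop at the first list that has a match
    fi
  done
  printf '# %s not found under %s\n' "$fa_digits" "$fa_reports" 1>&2
  return 1
}
```

Calling find_asm 002776525 prints the matching lines prefixed by the file they came from, in the same form as the grep output shown above.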

Running the runBuild command

Move to a screen:

$ screen -S screenName

The command can be run from anywhere, but it is preferable to keep all runs in one directory:

$ cd /hive/data/genomes/asmHubs/allBuild

The runBuild command template (scientific name should have underscores instead of blanks):

time (./runBuild assemblyName clade Scientific_Name)

Running the command:

$ time (./runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea) >> GCA_001632725.1.log 2>&1
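
Because the scientific name must use underscores instead of blanks, it can help to generate the command line rather than type it. The following is a small sketch; run_build_cmd is a hypothetical helper that only prints the command and does not run anything.

```shell
# Hypothetical helper: print a runBuild command line, replacing
# blanks in the scientific name with underscores.
# Argument order follows the template: assemblyName clade Scientific_Name
run_build_cmd() {
  rb_asm=$1
  rb_clade=$2
  rb_name=$(printf '%s' "$3" | tr ' ' '_')
  printf './runBuild %s %s %s\n' "$rb_asm" "$rb_clade" "$rb_name"
}

run_build_cmd GCA_001632725.1_ASM163272v1 invertebrate "Corbicula fluminea"
```

The printed line is also the text that later goes into the history file and master.run.list.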

Checking progress

Check progress by looking at which track is being built in the trackData/ directory for the assembly. Use the goto command to find the directory:

$ goto GCA_001632725.1
$ ls
addMask  allGaps  assemblyGap  augustus  chromAlias  cpgIslands  cytoBand  gapOverlap  gc5Base  idKeys  ncbiGene  repeatMasker  simpleRepeat  tandemDups  windowMasker  xenoRefGene

You can confirm that the run has finished once it has written a trackDb file:

GCA_001632725.1_ASM163272v1.trackDb.txt

You will get an email when the run finishes: genArk build done: GCA_001632725.1_ASM163272v1
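
If you would rather test for the trackDb file from a script than by eye, a check along these lines may work. It is a sketch under a stated assumption: the page does not say which directory the trackDb file is written to, so the directory is passed in as an argument. build_done is a hypothetical name.

```shell
# Sketch: succeed once <dir>/<asmName>.trackDb.txt exists.
# Which directory holds the file is an assumption left to the caller.
build_done() {
  [ -f "$1/$2.trackDb.txt" ]
}
```

Usage would look like: build_done /path/to/build/dir GCA_001632725.1_ASM163272v1 && echo finished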


Post runBuild steps

Save the command in a file:

$ cd /hive/data/genomes/asmHubs/allBuild/history
$ echo './runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea' > GCA_001632725.1.list

After the run finishes, add the runBuild command to the master run list in the source tree, keeping the list ordered by GC identifier, and commit the change:

Edit: ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list

Add:

./runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea

Commit the changes, e.g.:

$ git commit -m 'adding GCA_001632725.1_ASM163272v1 per user request, refs #29545'


Add and commit an entry with the assembly name and common name in the clade's orderList.tsv file. The entry should be placed in alphabetical order (case insensitive) by the second column, which is the common name. You can use Hiram's commonNames.pl script to find a common name.

Place the assembly name in a temporary file, in this case called '1':

$ echo GCA_001632725.1_ASM163272v1 > 1
$ ~/kent/src/hg/makeDb/doc/asmHubs/commonNames.pl 1
GCA_001632725.1_ASM163272v1	asian clam (DE_01 2016)

Edit: ~/kent/src/hg/makeDb/doc/invertebrateAsmHub/invertebrate.orderList.tsv

Add:

GCA_001632725.1_ASM163272v1    Asian clam (DE_01 2016)


Commit the changes, e.g.:

$ git commit -m 'adding Corbicula_fluminea Asian clam (DE_01 2016) refs #29545'
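
Before committing, it may be worth confirming that the orderList.tsv file is still in case-insensitive order by its second column. The check below is a sketch, assuming the file is tab-separated and that every line is a data line (no headers or comments); order_list_sorted is a hypothetical name.

```shell
# Sketch: exit 0 if a tab-separated orderList file is already in
# case-insensitive order by its second column (the common name).
# -c checks without sorting, -f folds case, -s skips the whole-line
# tie-break so only column 2 is compared.
order_list_sorted() {
  LC_ALL=C sort -c -s -f -t "$(printf '\t')" -k2,2 "$1" 2> /dev/null
}
```

For example: order_list_sorted ~/kent/src/hg/makeDb/doc/invertebrateAsmHub/invertebrate.orderList.tsv || echo "out of order"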

Pushing to the RR

Run the following make in makeDb/doc/[animalGroup]AsmHub to get the symlinks constructed and make this assembly appear in the outside world:

$ time (make) > dbg 2>&1 &

Check for errors:

$ grep -i error dbg
$ tail dbg

This make sets up a number of file links to get pushed out.

Same directory, verify that it worked:

$ time (make verifyTestDownload) >> test.down.log 2>&1 &

This goes through each assembly in that assembly hub and makes sure it has enough tracks to be a legitimate assembly hub. The test.down.log file should end with a line something like:

  $ tail test.down.log
  # checked 598 hubs, 598 success,   0 fail, total tracks: 10288, 2023-06-17 12:40:15

If you keep appending to the test.down.log file, you can see how the results change over time by listing each of the 'checked' lines:

  $ grep check test.down.log
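
The fail count in the most recent 'checked' line is the number that matters. The following sketch pulls it out of a log; it assumes the line layout shown in the example above, and last_check_ok is a hypothetical name.

```shell
# Sketch: print the fail count from the last '# checked' line of a
# verify log, and succeed only when that count is 0.
last_check_ok() {
  lc_fails=$(grep '^ *# checked' "$1" | tail -n 1 |
    sed -n 's/.*[^0-9]\([0-9][0-9]*\) fail.*/\1/p')
  printf 'fail count: %s\n' "${lc_fails:-unknown}"
  [ "$lc_fails" = "0" ]
}
```

For example: last_check_ok test.down.log || echo "check the log before pushing"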

You should now be able to see the assembly in the hgdownload-test listing and file directory, for example:

https://hgdownload-test.gi.ucsc.edu/hubs/GCA/011/100/615/GCA_011100615.1/
https://hgdownload-test.gi.ucsc.edu/hubs/primates/index.html
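
The per-assembly URL follows the same GCx/ddd/ddd/ddd layout as the build directories. This sketch builds it from an accession; the pattern is inferred from the single example URL above, so treat it as an assumption, and hub_test_url is a hypothetical name.

```shell
# Hypothetical helper: build the hgdownload-test URL for an assembly,
# following the layout of the example URL on this page.
hub_test_url() (
  # keep only the accession: GCA_011100615.1_xxx -> GCA_011100615.1
  acc=$(printf '%s' "$1" | cut -d_ -f1,2)
  gcX=$(printf '%s' "$acc" | cut -c1-3)
  d0=$(printf '%s' "$acc" | cut -c5-7)
  d1=$(printf '%s' "$acc" | cut -c8-10)
  d2=$(printf '%s' "$acc" | cut -c11-13)
  printf 'https://hgdownload-test.gi.ucsc.edu/hubs/%s/%s/%s/%s/%s/\n' \
    "$gcX" "$d0" "$d1" "$d2" "$acc"
)

hub_test_url GCA_011100615.1
```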


Verify that you can log in to hgdownload and dynablat. The first time, you may need to update your keys. If it doesn't work, you may already have a key that needs to be deleted. Appending "date" to the ssh command is a way to check your access without staying logged in to that server.

$ ssh qateam@hgdownload.soe.ucsc.edu date
$ ssh qateam@dynablat-01.soe.ucsc.edu date


Run RR push from ~/kent/src/hg/makeDb/doc/[animalGroup]AsmHub:

$ time (make sendDownload ) >> send.down.log 2>&1 &

Check for errors:

  $ grep -i error send.down.log
  $ tail send.down.log

Then final check that it is on hgdownload:

$ time (make verifyDownload) > verify.down.log 2>&1 &

The log should end with a valid 'checked' line:

$ tail verify.down.log