KnownGene build
Latest revision as of 15:54, 14 August 2024
Build UniProt and Protein databases
I haven't been doing this recently. We need to look into whether the work Max has done with uniprot should replace this.
JC 8/14/24: I rebuilt UniProt a couple of releases ago - it took a very long time because it's much larger now. I didn't get around to rebuilding Protein. Replacing this with content from the uniprot otto job is a good idea, but there may be some additional tables that we'll still need to construct from the otto data. I know accToTaxon is used by hgLinkIn (used by UniProt to generate links to our site), and that's not part of the otto build.
Consider updating underlying databases
- If it's been a while since we updated BioCyc versions for this species, consider doing so (we download the public BioCyc flat file .tar.gz distribution for mouse and human)
JC 8/14/24: Last I checked BioCyc, they were no longer providing any public data files for mouse or human. The only data they provide for free is EcoCyc (E. coli). We can try reaching out to them about a free license for our display and link purposes, but their site seems pretty explicit that data access requires a paid subscription.
Initialize work directory
- Set version variable
export GENCODE_VERSION=V39
- Start a screen.
screen -S knownGene$GENCODE_VERSION
- Create and cd into work directory of the form /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
export db=hg38
mkdir -p /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
cd /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
- Set PATH to include $HOME/kent/src/hg/utils/otto/knownGene
PATH="$HOME/kent/src/hg/utils/otto/knownGene:$PATH"
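- Copy buildEnv.sh from previous build on this db (the gencodeVM* glob below matches mouse versions; with a human version such as V39 the work directory is gencodeV39, so use gencodeV* instead)
olddir=`ls -trd /hive/data/genomes/$db/bed/gencodeVM*/build | tail -n 2 | head -1`
cp $olddir/buildEnv.sh buildEnv.sh
Edit buildEnv.sh so that it has the correct values, then source it:
. buildEnv.sh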
- Find Table and File list from previous build
cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.files.txt .
cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.tables.txt .
- Confirm existing assembly tables are in a knownGene* database (sort syntax is a bashism - if using tcsh, sort the tables before the diff)
hgsql ${oldKnownDb} -Ne "show tables" > ${oldKnownDb}.tables.txt
diff <(sort ${PREV_GENCODE_VERSION}.tables.txt) <(sort ${oldKnownDb}.tables.txt)
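For shells without process substitution, the same order-insensitive comparison can be done with sorted temporary files. The following is only a minimal sketch and is not part of the build scripts; the function name compare_table_lists is hypothetical.

```shell
# Hypothetical helper: compare two table lists without regard to line order.
# Prints lines found only in the first file prefixed with "< " and lines
# found only in the second file prefixed with "> ". Prints nothing on a match.
compare_table_lists() {
    local a_sorted b_sorted
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || return 2
    # Use the same collation for sort and comm so comm sees sorted input.
    LC_ALL=C sort "$1" > "$a_sorted"
    LC_ALL=C sort "$2" > "$b_sorted"
    LC_ALL=C comm -23 "$a_sorted" "$b_sorted" | sed 's/^/< /'
    LC_ALL=C comm -13 "$a_sorted" "$b_sorted" | sed 's/^/> /'
    rm -f "$a_sorted" "$b_sorted"
}
```

It would be called as compare_table_lists ${PREV_GENCODE_VERSION}.tables.txt ${oldKnownDb}.tables.txt, and empty output means the two lists agree.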
Setting environment variables
The environment variables used in the build are set in the script buildEnv.sh. All the other scripts assume that this script has been sourced in the current shell. You have to edit this by hand. Most of the variables don't change. The hairiest ones are the other assemblies for the blast tables.
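Because every later step assumes buildEnv.sh has been sourced, a quick pre-flight check can catch a forgotten or half-edited file. This is a sketch under assumptions: the function name require_build_vars is hypothetical, and the variable names in the example call are only the ones that appear in commands on this page; buildEnv.sh may define others.

```shell
# Hypothetical pre-flight check: report which of the named variables are
# unset or empty. Returns 0 if all are set, 1 otherwise.
require_build_vars() {
    local name value missing=0
    for name in "$@"; do
        # Indirect lookup via eval; only pass trusted variable names.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "buildEnv check: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call after ". buildEnv.sh" (left as a comment so that sourcing
# this sketch has no side effects):
# require_build_vars db GENCODE_VERSION PREV_GENCODE_VERSION oldGeneDir oldKnownDb tempDb || exit 1
```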
Running the build
To run the build, execute hg/utils/otto/knownGene/buildKnown.sh.
buildKnown.sh &
tail -f doKnown.log
It builds into the knownGene${GENCODE_VERSION} database. It does the following steps:
- Extracting Gencode data
- Building initial knownGene table
- Adding primary reference tables
- Building final knownGene core tables
- Building bigGenePred
- Building GTF file
NB: The GTF step is not currently a part of the script. It can be manually run with
cd /hive/data/genomes/$db/goldenPath/bigZips/genes
genePredToGtf -utr ${tempDb} knownGene ${db}.knownGene.gtf
rm -f ${db}.knownGene.gtf.gz
gzip ${db}.knownGene.gtf
Copying over tables
drop chromInfo and history from knownGene database
hgsql knownGene${GENCODE_VERSION} -Ne "drop table if exists chromInfo, history"
hgsql knownGene${GENCODE_VERSION} -Ne "show tables" | egrep "knownGene|kgXref" > ${GENCODE_VERSION}.tables.txt
hgsql knownGene${GENCODE_VERSION} -Ne "show tables" | egrep -v "knownGene|kgXref" >> ${GENCODE_VERSION}.tables.txt
look for unexpected differences between this release and the last one
diff ${PREV_GENCODE_VERSION}.tables.txt ${GENCODE_VERSION}.tables.txt
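The two-pass listing that writes the knownGene and kgXref tables to the top of ${GENCODE_VERSION}.tables.txt can be expressed as a single filter, which makes it easy to try on a saved table list. This is a sketch only; the function name order_core_tables_first is hypothetical, and it copies the substring matching of the egrep commands above (so any table whose name contains knownGene or kgXref is treated as a core table).

```shell
# Hypothetical filter: read table names on stdin and print those matching
# knownGene or kgXref first, followed by all remaining names.
order_core_tables_first() {
    local all
    all=$(cat)
    # "|| true" keeps an empty match from being reported as a failure.
    printf '%s\n' "$all" | grep -E 'knownGene|kgXref' || true
    printf '%s\n' "$all" | grep -Ev 'knownGene|kgXref' || true
}
```

With hgsql available, the input would come from hgsql knownGene${GENCODE_VERSION} -Ne "show tables".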
drop old tables
hgsql $db -Ne "drop table knownGene, kgXref;"
grep -v "ToKg" ${PREV_GENCODE_VERSION}.tables.txt | egrep -vw "knownGene|kgXref" | awk '{printf "drop table %s;\n", $1}' > toDrop.lst
cat toDrop.lst | hgsql $db
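Since the drop list is generated rather than typed, it is worth previewing before piping it into hgsql. The sketch below wraps the same grep and awk pipeline in a function; the name make_drop_statements is hypothetical, and the "if exists" clause is an addition (an assumption that tolerating already-missing tables is acceptable here), not what the command above emits.

```shell
# Hypothetical helper: print one drop statement per table in the given list,
# skipping names that contain "ToKg" and the knownGene and kgXref tables
# themselves (matched as whole words, so knownGeneMrna is still listed).
make_drop_statements() {
    grep -v 'ToKg' "$1" | grep -Evw 'knownGene|kgXref' |
        awk '{printf "drop table if exists %s;\n", $1}'
}
```

Redirecting its output to toDrop.lst and reading the file before running cat toDrop.lst | hgsql $db gives a chance to spot a table that should not be dropped.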
check for orphans and drop them (or build them) if appropriate
hgsql $db -Ne "show tables like 'known%'" > orphan.lst
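One way to triage orphan.lst is to separate the tables the new release will supply anyway from those it will not. The sketch below prints the names in the first list that are absent from the second; the function name list_orphans is hypothetical, and treating "not in ${GENCODE_VERSION}.tables.txt" as the test for a true orphan is an assumption, not a rule stated in this procedure.

```shell
# Hypothetical helper: print names from the first list (e.g. orphan.lst)
# that do not appear as an exact line in the second list (e.g. the new
# release's tables list).
list_orphans() {
    # -x: whole-line match, -F: fixed strings, -f: patterns from file,
    # -v: keep lines that match none of them.
    grep -vxFf "$2" "$1" || true
}
```

Each name it prints still needs a human decision on whether to drop the table or build it.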
copy tables from knownGene database to assembly database
copyFilesToAssembly.sh ${GENCODE_VERSION}.tables.txt knownGene${GENCODE_VERSION} > copyScript.txt
cat copyScript.txt | hgsql $db
Edit trackDb to add new trackDb
cd $HOME/kent/src/hg/makeDb/trackDb/*/$db
vi trackDb.ra
    include knownGene.ra beta,public
    include knownGene.alpha.ra alpha
sed "s/$PREV_GENCODE_VERSION/$GENCODE_VERSION/g" knownGene.ra > knownGene.alpha.ra
cp knownGene$PREV_GENCODE_VERSION.html knownGene$GENCODE_VERSION.html
git add knownGene.alpha.ra knownGene$GENCODE_VERSION.html trackDb.ra
git commit -m "$GENCODE_VERSION knownGene trackDb"
git push
cd ../..
make DBS=$db alpha
cd $dir
NB: In the above process, don't forget to edit the new knownGene*.html file before committing it. All mentions of the version number will need to change,
the statistics will need to be updated with the ones posted for this release on the GENCODE website ("Immunoglobulin/T-cell receptor gene segments" is the
sum of the two listed values), and the references section might need to be updated if there's a new paper that should be cited. Don't forget to amend
the credits section too, as needed.
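Because the new .html file starts as a plain copy of the old one, every mention of the previous version is still in it until edited. A quick search before committing can help; this is a sketch with a hypothetical function name, and it only finds the literal version string (for example V38), not other spellings such as "release 38".

```shell
# Hypothetical helper: print "lineNumber:text" for every line of the file
# that still contains the previous version string. Prints nothing if the
# string no longer occurs.
find_stale_version() {
    # $1 = previous version string, $2 = file to check
    grep -nF -- "$1" "$2" || true
}
```

For example, find_stale_version "$PREV_GENCODE_VERSION" knownGene$GENCODE_VERSION.html should print nothing once the page has been fully updated.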
Then edit knownGeneArchive.ra, copying in the settings from the previous version of knownGene as a new subtrack. Be sure to adjust settings as appropriate for the release (parent, priority, externalDb). Remember to also set it to release alpha (the same release status as the new knownGene track) unless the entire archive track set is set to release alpha (it might be included that way in trackDb.ra).
Adding IsPcr server
On hgwdev, drop old records in blatServers and targetDb.
hgsql hgcentraltest -Ne "delete from blatServers where db like '${db}Kg%'"
hgsql hgcentraltest -Ne "delete from targetDb where name like '${db}Kg%'"
Ask cluster-admin to start an untranslated, -stepSize=5 gfServer on /gbdb/$db/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit
genIspcrMail.sh
send to cluster-admin
cluster-admin will say something like this:
Starting untrans gfServer for mm39KgSeqV38 on host blat1b port 17921
where blat1b is the serverName and the port is 17921
Add this info to blatServers and targetDb tables in hgcentral.
addIspcrToCentral.sh serverName port
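To avoid retyping the host and port, they can be pulled out of the cluster-admin reply. This sketch assumes the reply contains a line shaped like the example above ("... gfServer for NAME on host HOST port PORT"); the function name parse_gfserver_reply is hypothetical, and a differently worded reply will simply produce no output.

```shell
# Hypothetical helper: read the reply text on stdin and print "host port"
# taken from the first line of the form
#   ... gfServer for NAME on host HOST port PORT
parse_gfserver_reply() {
    sed -n 's/.*gfServer for .* on host \([^ ]*\) port \([0-9][0-9]*\).*/\1 \2/p' |
        head -n 1
}
```

The two printed fields correspond to the serverName and port arguments of addIspcrToCentral.sh.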
all.joiner changes
I haven't added anything to this recently.
The relevant identifiers are:
knownGeneId
joinerCheck all.joiner -identifier=knownGeneId -keys -database=${db}
Bundle up logs and check them in
Make a short log file in makeDb/doc/ucscGenes/.
Redmine ticket files and tables
NB: You'll probably want to generate the ${GENCODE_VERSION}.files.txt file, since there's nothing in the above procedure that does it. Unless the list of files changes, the easiest way is to just make a copy of the previous version's file and update the version numbers inside it (it's like 6 or 7 lines).
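The copy-and-update step can be scripted with the same sed substitution used for knownGene.ra. The sketch below is hedged accordingly: both function names are hypothetical, and report_missing_files assumes the files list holds one path per line, which this page does not confirm.

```shell
# Hypothetical helper: print the previous files list with every occurrence
# of the old version string replaced by the new one.
bump_files_list() {
    # $1 = previous version, $2 = new version, $3 = previous files list
    sed "s/$1/$2/g" "$3"
}

# Hypothetical helper: print "missing: PATH" for each listed path that does
# not exist (assumes one path per line).
report_missing_files() {
    local path
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        [ -e "$path" ] || printf 'missing: %s\n' "$path"
    done < "$1"
    return 0
}
```

A typical use would be bump_files_list "$PREV_GENCODE_VERSION" "$GENCODE_VERSION" ${PREV_GENCODE_VERSION}.files.txt > ${GENCODE_VERSION}.files.txt, followed by report_missing_files ${GENCODE_VERSION}.files.txt to confirm that the list of files really did not change.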
Post release push "other species" blast tables
Load the other species blastTab tables.
buildLoadOther.sh