QAing UCSC Genes
Latest revision as of 23:04, 9 September 2019
The UCSC Genes set is created at UCSC for human and mouse. For rat, fly, worm, and yeast, an outside gene set is used, but we still build the knownGene-related tables. The gene set is built once for the initial release of a new assembly, then updated sporadically after that.
wiki pages about known genes
QAing UCSC Genes
UCSC Genes tables
UCSC Genes Staging Process (has a lot of information that is helpful during the QA process)
Post-Release-Checklist
Redmine tips
For such a large track, it is extremely helpful to create a new Redmine ticket for each problem found and relate it to the main ticket. It is also useful (especially as the release of the track nears) to have a "next gene build" ticket to keep track of problems that aren't going to be dealt with until the next build.
Determining the tables involved
The complete list of tables (for the assembly, proteome and uniProt databases) *should* be in the push queue. However, you should look for tables that shouldn't be pushed (e.g., history, chromInfo, tableDescriptions), and tables that are missing. Compare the current list to the previous list in UCSC Genes tables. Keep that list up to date so that it can be used in future releases.
joinerCheck
Fun fact: the Known Genes track was the original impetus to develop all.joiner. Jim identified the identifiers below that need to be checked before releasing the track (we don't have to run it for every single table!). He said that most of the remaining identifiers that involve knownGene are probably there to show table relationships in the browser. The all.joiner identifiers to check with joinerCheck -keys are:
- knownGeneId
- knownIsoformCluster
- UniProtId
- UniProtAccession
A new identifier needs to be added (and checked) for each release:
- knownGeneOld{N}Id
Make sure that the latest kg{N}ToKg{N+1} table appears in the related tables when knownGene is selected in the Table Browser. If it doesn't show up, it probably needs to be added to the knownGeneId identifier.
Also run joinerCheck -times for the assembly database. There may be problems for tracks that have nothing to do with UCSC Genes that can be ignored.
Note that the two UniProt* identifiers above can't be run with a database specified, since the checks go across databases. The identifiers generate a lot of output, only some of which may be pertinent to the new track.
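The identifier checks above can be collected into a small script. This is only a sketch: it assumes joinerCheck accepts the -keys, -times, -identifier= and -database= options (run joinerCheck with no arguments to confirm against its usage message) and that all.joiner lives at the path shown, which may differ in your checkout. The function name and the example arguments are made up for illustration. It prints the commands rather than running them, so they can be reviewed first.

```shell
# Print (not run) the joinerCheck commands for one gene build.
# Assumptions: the joinerCheck flags used below and the all.joiner
# location; confirm both against your own source tree.
JOINER="${JOINER:-$HOME/kent/src/hg/makeDb/schema/all.joiner}"

joiner_check_cmds() {
    # $1 = assembly database, $2 = N in knownGeneOld{N}Id
    db="$1"
    oldN="$2"
    # Identifiers that are checked within the assembly database.
    for id in knownGeneId knownIsoformCluster "knownGeneOld${oldN}Id"; do
        echo "joinerCheck -keys -database=$db -identifier=$id $JOINER"
    done
    # The UniProt identifiers span databases, so no -database is given.
    for id in UniProtId UniProtAccession; do
        echo "joinerCheck -keys -identifier=$id $JOINER"
    done
    echo "joinerCheck -times -database=$db $JOINER"
}

# Example with placeholder values: database mm10, old gene set number 6.
joiner_check_cmds mm10 6
```

Once the printed commands look right, they can be piped to sh, with the output saved for the Redmine ticket.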
Makedoc
The script used to generate UCSC genes can be found in the source tree in kent/src/hg/makeDb/doc/ucscGenes/; there won't be an entry for it in the assembly.txt makedoc.
QAing the uniProt and proteome databases
See UCSC_Genes_Staging_Process#Details_About_UniProt_and_Proteome_Databases for background information on these databases.
- Find out which versions of spYYMMDD and proteinsYYMMDD were used to build the gene set. Make sure the symlinks are going to the correct place on hgwdev.
- If a new download of proteins from uniProt was done for this build, you will need to QA the tables in the spYYMMDD and proteinsYYMMDD databases. Compare the tables present and the row counts for each to the previous databases. Note that the history and tableDescriptions tables *should* be included in the push of these databases.
- Check the README files in 4 places:
    /usr/local/apache/htdocs-hgdownload/goldenPath/proteinDB/
    /usr/local/apache/htdocs-hgdownload/goldenPath/proteinDB/proteinsYYMMDD/database/
    /usr/local/apache/htdocs-hgdownload/goldenPath/uniProt/
    /usr/local/apache/htdocs-hgdownload/goldenPath/uniProt/spYYMMDD/database/
- Make sure that the UniProt license info is current. Include the new assembly in the goldenPath/proteinDB/proteinsYYMMDD/database/README.txt (and edit it out of the previous file, if applicable). After the push to the RR, you will need to update the "Protein database for <assembly>" link in downloads.html.
QAing the Gene Set
If this is a human or mouse build, the gene set needs more checks than when we are using a gene set from another source. The person who built the track, and possibly Jim, should help decide whether the gene set looks good.
Look at the numbers of genes in the old-to-new table like so:
```
mysql> select count(*), status from kg6ToKg7 group by status;
+----------+------------+
| count(*) | status     |
+----------+------------+
|     5556 | compatible |
|    74657 | exact      |
|       88 | none       |
|      621 | overlap    |
+----------+------------+
4 rows in set (0.05 sec)
```
Compare the counts to the previous old-to-new table. If the numbers are very different, there may be a problem in the build. Get the developer's opinion if you're not sure what is expected.
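To make the comparison with the previous old-to-new table less error-prone, the two sets of counts can be lined up side by side. The following is a rough sketch, not an established QA script: it assumes the counts have been dumped to two tab-separated files (count, then status), and the function name and file names are hypothetical.

```shell
# Compare per-status counts from two old-to-new tables.
# Each input file holds "count<TAB>status" lines, for example from:
#   hgsql -Ne "select count(*), status from kg6ToKg7 group by status" mm10
# (the table and database names here are only placeholders).
compare_status_counts() {
    # $1 = counts from the previous table, $2 = counts from the new table
    awk -F'\t' '
        NR == FNR { prev[$2] = $1; seen[$2] = 1; next }
                  { cur[$2]  = $1; seen[$2] = 1 }
        END {
            for (s in seen) {
                o = prev[s] + 0; n = cur[s] + 0
                # Percent change is undefined when the status is new.
                pct = (o > 0) ? sprintf("%+.1f%%", 100 * (n - o) / o) : "n/a"
                printf "%s\t%d\t%d\t%s\n", s, o, n, pct
            }
        }' "$1" "$2" | sort
}
```

Called as compare_status_counts prevCounts.tab newCounts.tab, it prints one line per status with the old count, the new count, and the percent change. What counts as "very different" is still a judgment call for the builder.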
Instructions from Jim: Look at about 5 genes for each of these major cases, and make sure the changes make sense:
- genes that were dropped (status "none")
- genes that have compatible extensions, meaning that the new gene is the same as the old gene, except at the ends (status "compatible")
- genes that overlap with old genes, meaning that there is a chunk in the middle somewhere that has changed (status "overlap")
- genes that are completely new in this build
New genes with status "compatible" or "overlap" will have a note about the way in which they have changed. Most genes have more than one note. You can see the different kinds of notes with mysql queries like this one:
```
mysql> select oldId, newId, note from kg6ToKg7 where status="overlap" order by rand() limit 5;
+------------+------------+-------------------------------------------------------------------------------+
| oldId      | newId      | note                                                                          |
+------------+------------+-------------------------------------------------------------------------------+
| uc021uza.1 | uc002qac.3 | differentNumberOfBlocks,basesRemovedFromCoding,intronsChanged,basesAddedToUTR |
| uc001jwp.1 | uc001jwn.2 | differentNumberOfBlocks,intronsChanged,basesRemovedFromUTR,basesAddedToUTR    |
| uc010wjq.2 | uc010wju.2 | differentNumberOfBlocks,intronsChanged,basesRemovedFromUTR                    |
| uc011hhv.2 | uc011hht.2 | differentNumberOfBlocks,basesAddedToCoding,intronsChanged,basesAddedToUTR     |
| uc001miw.2 | uc010rcb.1 | intronsChanged                                                                |
+------------+------------+-------------------------------------------------------------------------------+
5 rows in set (0.02 sec)
```
Finding the genes that are completely new to this build will take a couple of steps. You will want to compare the newIds of the kg#ToKg# table with the names of the knownGene table. Use commands similar to the following:
```
$ hgsql -Ne "select distinct(newId) from kg6ToKg7 order by newId" mm10 > oldToNew
$ hgsql -Ne "select name from knownGene order by name" mm10 > new
$ diff oldToNew new > newIds
```
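Plain diff output carries "<" and ">" markers and line-number lines, and it will also report the blank newId that belongs to dropped genes. If a clean list of IDs is preferred, comm can produce one. This is a sketch of an alternative, not the established procedure; the function name is made up, and it expects the same two files as the commands above.

```shell
# List IDs present in knownGene but absent from the newId column.
# $1 = file of newIds from the kg#ToKg# table, $2 = file of knownGene names.
new_only_ids() {
    ids_a="$(mktemp)"
    ids_b="$(mktemp)"
    # Drop blank lines (dropped genes have an empty newId), then sort and
    # de-duplicate in the C locale so both files collate the same for comm.
    grep -v '^$' "$1" | LC_ALL=C sort -u > "$ids_a"
    grep -v '^$' "$2" | LC_ALL=C sort -u > "$ids_b"
    # -1 hides lines unique to file 1, -3 hides lines common to both,
    # leaving only the lines unique to file 2.
    LC_ALL=C comm -13 "$ids_a" "$ids_b"
    rm -f "$ids_a" "$ids_b"
}
```

Running new_only_ids oldToNew new > newIds gives a newIds file with one bare ID per line, which works directly with shuf.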
Look up a few of the IDs from your output file ($ shuf -n 5 newIds). Also note the use of "distinct(newId)" in the first query above. This is because there are likely multiple oldIds that map to the same newId, so the same newId may show up multiple times. You will also want to look at some of these new IDs that are essentially conglomerates of multiple old IDs. Look up the old IDs associated with the new ID. Use commands similar to the following:
```
mysql> select newId,count(*) from kg6ToKg7 group by newId having count(*)>1 order by count(*) desc,newId;
+------------+----------+
| newId      | count(*) |
+------------+----------+
|            |       19 | (this row is from IDs that have been removed from the gene set, and thus the newId field is blank)
| uc029voj.2 |        6 |
| uc033icd.1 |        5 |
| uc008xei.2 |        4 |
| ...        |      ... |
| uc033jow.1 |        2 |
| uc033joy.1 |        2 |
+------------+----------+
296 rows in set (0.10 sec)

mysql> select oldId,oldChrom,oldStart,oldEnd,newId from kg6ToKg7 where newId="uc029voj.2";
+------------+----------+-----------+-----------+------------+
| oldId      | oldChrom | oldStart  | oldEnd    | newId      |
+------------+----------+-----------+-----------+------------+
| uc029voj.1 | chr5     | 124629188 | 124636080 | uc029voj.2 |
| uc029vok.1 | chr5     | 124636068 | 124637613 | uc029voj.2 |
| uc029vol.1 | chr5     | 124639263 | 124644988 | uc029voj.2 |
| uc029vom.1 | chr5     | 124644986 | 124646007 | uc029voj.2 |
| uc029von.1 | chr5     | 124646603 | 124646733 | uc029voj.2 |
| uc029voo.1 | chr5     | 124712207 | 124722146 | uc029voj.2 |
+------------+----------+-----------+-----------+------------+
6 rows in set (0.00 sec)
```
Search some of the IDs in the Browser and visually inspect them to make sure that everything looks right. You will certainly want to look at some of the ones with higher counts. It helps to view GENCODE, Ensembl, Old UCSC Genes and the native mRNAs track along with the new UCSC Genes track for direct comparison.
Pick some of each kind of gene: noncoding, coding, etc., and look at the entry for each one in kgXref and ensure that the accessions and descriptions look correct. Query the kgTxInfo table to find the various categories and IDs from them:
```
mysql> select count(*),category from kgTxInfo group by category;
+----------+---------------+
| count(*) | category      |
+----------+---------------+
|        5 | antibodyParts |
|     1119 | antisense     |
|    45455 | coding        |
|     4021 | nearCoding    |
|    11042 | noncoding     |
+----------+---------------+
5 rows in set (0.08 sec)

mysql> select category,name from kgTxInfo where category="coding" order by rand() limit 5;
+----------+------------+
| category | name       |
+----------+------------+
| coding   | uc009idq.1 |
| coding   | uc008lwi.2 |
| coding   | uc033jsw.1 |
| coding   | uc007fux.1 |
| coding   | uc007jkx.2 |
+----------+------------+
5 rows in set (0.12 sec)
```
Also compare the number of genes in knownCanonical between old and new sets.
featureBits: there should be no bases in gaps. Coverage should be comparable to that of the previous build. Try to get the builder to comment on what to expect (like this: http://redmine.soe.ucsc.edu/issues/20#note-12).
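The gap check can be scripted so that a non-zero overlap is hard to miss. This sketch assumes featureBits prints a summary line of the form "N bases of M (P%) in intersection" on stderr, and that giving it two tables reports their intersection; verify both on hgwdev before relying on it. The helper names are hypothetical.

```shell
# Extract the leading base count from a featureBits summary line such as
#   "0 bases of 2897316137 (0.000%) in intersection"
# (format assumed from typical featureBits output; verify on your system).
bases_from_summary() {
    echo "$1" | awk '{ print $1 }'
}

# Succeed only when the summary reports zero overlapping bases.
no_bases_in_gaps() {
    [ "$(bases_from_summary "$1")" = "0" ]
}

# Intended use on hgwdev (the summary is assumed to go to stderr,
# hence the 2>&1; hg19 is only an example database):
#   summary="$(featureBits hg19 knownGene gap 2>&1)"
#   no_bases_in_gaps "$summary" || echo "knownGene overlaps gaps: $summary"
```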
QAing the many related tables
Compare row counts of each table in the old vs. new gene set and look for anomalously high or low counts that may indicate a problem. Here is an example loop to run on hgwdev to get counts for all tables on both machines (pushQlist is the list of tables to check):
```
$ for table in `cat pushQlist`
do
    echo hgwdev:
    hgsql -e "select count(*) as $table from $table" hg19
    echo hgwbeta:
    hgsql -h hgwbeta -e "select count(*) as $table from $table" hg19
    echo
done
```
Compare the hgGene and hgNear pages for a few transcripts (perhaps one transcript that didn't change, one that did change, and a completely new transcript). Note, hgNear uses knownCanonical. Many tables should have the same counts as knownGene: kgXref, kgColor, kgTargetAli, kgTxInfo, knownIsoforms, knownGeneMrna, and knownGeneTxMrna. Also, knownGenePep and knownGeneTxPep should have an equal number of rows.
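The "same count as knownGene" rule can be checked mechanically once the counts are in a file. A minimal sketch, assuming a tab-separated list of table name and row count that includes knownGene itself; the function name and the way the list is produced are illustrative only.

```shell
# Report tables whose row count differs from knownGene.
# Input on stdin: "table<TAB>count" lines, one of which must be knownGene.
# One hypothetical way to build the input on hgwdev:
#   for t in knownGene kgXref kgColor kgTargetAli kgTxInfo knownIsoforms \
#            knownGeneMrna knownGeneTxMrna; do
#       printf '%s\t%s\n' "$t" "$(hgsql -Ne "select count(*) from $t" hg19)"
#   done > counts.tab
check_counts_match_knowngene() {
    awk -F'\t' '
        { count[$1] = $2 }
        END {
            if (!("knownGene" in count)) { print "knownGene missing"; exit 2 }
            bad = 0
            for (t in count)
                if (count[t] != count["knownGene"]) {
                    printf "MISMATCH\t%s\t%s\t(knownGene: %s)\n", t, count[t], count["knownGene"]
                    bad = 1
                }
            exit bad
        }'
}
```

Running check_counts_match_knowngene < counts.tab prints nothing and returns 0 when every listed table matches. knownGenePep and knownGeneTxPep only need to match each other, so they are best compared separately.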
Common re-build problem: The UCSC Genes tables are usually built in a temporary database and then moved into the real assembly database when the build is complete. If a problem is found in a new gene build and the set is re-built, it's possible that the knownGeneOld#, kgXrefOld#, and kg#ToKg# tables will be built with the wrong data. Check the counts and contents of these tables to be sure they use the correct old gene set.
Don't forget to QA the Old UCSC Genes track, the Alt Events track, and the blastTabs. You may need to coordinate the track release with changes in the otherOrgs.ra files in kent/src/hg/hgGene/hgGeneData/$org/$db/otherOrgs.ra.
PCR Target
Human and Mouse have a "UCSC Genes" target on the PCR page. Choosing this target runs PCR against a .2bit file made from the sequences in the kgTargetAli table (/gbdb/<db>/targetDb/kgTargetSeq.2bit).
- Run twoBitInfo on kgTargetSeq.2bit and confirm the count (wc -l output.tab) is the same as knownGene count.
- Grep for a new version of a gene in the output file, make sure it's there (uc009vis.3 instead of uc009vis.2, for instance).
- Do the same count and gene checks on kgTargetAli table.
- Make sure the blat server is actually using the correct 2bit file (hgsql -e "select * from targetDb where name like 'dbKgSeq%'" hgcentraltest; check that the seqFile is the correct one).
- Find a set of primers that work for the new gene set but don't work for the old gene set, so you can use it to test that the new target is working on hgwbeta and the RR.
- Cleaning up PCR gfServers
  - When releasing a new version of kgTargetSeq, the previous gfServers should be turned off so that orphan gfServers aren't left running. Previously we didn't check for this, so there were many gfServers running that were never used anymore.
  - Only the DNA version of the gfServer is needed, not the translated protein version.
  - It might be nice for future gfServer archaeologists if the 2bit had the assembly in its name for the gfServer step (kgTargetSeq.2bit -> kgTargetSeqMm10.2bit); cluster-admin would appreciate this as well. Otherwise you can have a situation where both mouse and human have the same 2bit name.
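The count and grep checks on the twoBitInfo output can be combined into one helper. This is a sketch under assumptions: twoBitInfo is taken to write one "name<TAB>size" line per sequence, sequence names are assumed to contain the transcript ID, and the function name and example values are hypothetical.

```shell
# Check a twoBitInfo listing against knownGene.
# $1 = twoBitInfo output file, $2 = expected knownGene row count,
# $3 = a transcript ID that should be present in its new version.
# Hypothetical inputs on hgwdev:
#   twoBitInfo /gbdb/mm10/targetDb/kgTargetSeq.2bit output.tab
#   hgsql -Ne "select count(*) from knownGene" mm10
check_target_listing() {
    n="$(wc -l < "$1" | tr -d ' ')"
    if [ "$n" -ne "$2" ]; then
        echo "count mismatch: $n sequences vs $2 knownGene rows"
        return 1
    fi
    # -F matches the ID literally, so the dot is not treated as a wildcard.
    if ! grep -qF "$3" "$1"; then
        echo "missing $3"
        return 1
    fi
    echo "ok"
}
```

For example, check_target_listing output.tab 61642 uc009vis.3 (with your own count and ID) prints "ok" only when both checks pass. The same idea applies to the kgTargetAli table if its names are dumped to a file first.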
FASTA Alignments
If we provide CDS FASTA Alignments for this assembly, a new set should be generated (usually Brian Raney does this) and QAed. These are linked on downloads.html with text like "FASTA alignments of 45 vertebrate genomes with Human for CDS regions" (which goes to http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz46way/alignments/).
Check that the genes within the file correspond to the new gene set. Check the md5sum.txt file.
hgGene Page Source Information
Click on the following link to view a sample hgGene page annotated with the sources of the different components:
File:Hg19uc002ypa.2.pdf
Gene Sorter Column Sources
Name | Description | Source
# | Item Number in Displayed List/Select Gene | n/a
Name | Gene Name/Select Gene | kgXref.geneSymbol
UCSC ID | UCSC Transcript ID | knownGene.name
UniProtKB | UniProtKB Protein Display ID | kgXref.spDisplayID or kgXref.spID_organism
UniProtKB Acc | UniProtKB Protein Accession | kgXref.spID
RefSeq | NCBI RefSeq Gene Accession | kgXref.refseq
Entrez Gene | NCBI Entrez Gene/LocusLink ID | knownToLocusLink
GenBank | GenBank mRNA Accession | kgXref.refseq or kgXref.mRNA
Ensembl | Ensembl Transcript ID | knownToEnsembl
GNF Atlas 2 ID | ID of Associated GNF Atlas 2 Expression Data | knownToGnfAtlas2
Gene Category | High Level Gene Category - Coding, Antisense, etc. | kgTxInfo.category
CDS Score | Coding potential score from txCdsPredict | kgTxInfo.cdsScore
VisiGene | UCSC VisiGene In Situ Image Browser | knownToVisiGene
Allen Brain | Allen Brain Atlas In Situ Images of Adult Mouse Brains | knownToAllenBrain & allenBrainUrl
U133 ID | ID of Associated Affymetrix U133 Expression Data | knownToU133
GNF Atlas 2 | GNF Expression Atlas 2 Data from U133A and GNF1H Chips | gnfAtlas2
Max GNF Atlas 2 | Maximum Expression Value of GNF Expression Atlas 2 | calculated?
GNF Atlas 2 Delta | Normalized Difference in GNF Expression Atlas 2 from Selected Gene | gnfAtlas2Distance
BLASTP | NCBI BLASTP Bit Score | knownBlastTab.bitScore
BLASTP | NCBI BLASTP E-Value | knownBlastTab.eValue
%ID | NCBI BLASTP Percent Identity | knownBlastTab.identity
5' UTR Fold | 5' UTR Fold Energy (Estimated kcal/mol) | foldUtr5.energy
3' UTR Fold | 3' UTR Fold Energy (Estimated kcal/mol) | foldUtr3.energy
Exon Count | Number of Exons (Including Non-Coding) | knownGene.exonCount
Intron Size | Size of biggest (or optionally smallest) intron | knownGene exonStarts - exonEnds
Genome Position | Genome Position/Link to Genome Browser | (knownGene.txStart + txEnd)/2
Mouse | Mouse Ortholog (Best Blastp Hit to UCSC Known Genes) | mmBlastTab
Rat | Rat Ortholog (Best Blastp Hit to UCSC Known Genes) | rnBlastTab
Zebrafish | Danio rerio Ortholog (Best Blastp Hit to Ensembl) | drBlastTab
Drosophila | D. melanogaster Ortholog (Best Blastp Hit to FlyBase Proteins) | dmBlastTab
C. elegans | C. elegans Ortholog (Best Blastp Hit to WormPep) | ceBlastTab
Yeast | Saccharomyces cerevisiae Ortholog (Best Blastp Hit to RefSeq) | scBlastTab
Pfam Domains | Protein Family Domain Structure | knownToPfam → pfamDesc
Superfamily | Protein Superfamily Assignments | ucscScop & scopDesc
PDB | Protein Data Bank | kgProtMap2 & sp###### database
Gene Ontology | Gene Ontology (GO) Terms Associated with Gene | kgProtMap2 & sp###### database
M. Vidal P2P | Human Protein-Protein Interaction Network from Marc Vidal | humanVidalP2P
E. Wanker P2P | Human Protein-Protein Interaction Network from Erich Wanker | humanWankerP2P
HPRD P2P | Human Protein-Protein Interaction Network from the Human Protein Reference Database | humanHprdP2P
Description | Short Description Line/Link to Details Page | kgXref.description
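The "Intron Size" and "Genome Position" columns are derived from knownGene coordinates rather than read from a single field, so it can help to recompute a few values by hand when spot-checking the Gene Sorter. The sketch below is an illustration only: it assumes exonStarts and exonEnds are comma-separated lists with a trailing comma, as they appear in knownGene, and that the midpoint is rounded down (the actual rounding used by the Gene Sorter is not stated here). The function names and the toy coordinates are made up.

```python
def parse_coords(field):
    """Parse a knownGene-style comma-separated list such as '100,300,' into ints."""
    return [int(value) for value in field.split(",") if value]


def intron_sizes(exon_starts, exon_ends):
    """Intron i runs from the end of exon i to the start of exon i + 1."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    return [starts[i + 1] - ends[i] for i in range(len(starts) - 1)]


def genome_position(tx_start, tx_end):
    """Midpoint of the transcript, (txStart + txEnd) / 2, rounded down (assumption)."""
    return (tx_start + tx_end) // 2


# Toy transcript with three exons (made-up coordinates).
sizes = intron_sizes("100,300,900,", "200,400,1000,")
print(sizes)                       # [100, 500]
print(max(sizes), min(sizes))      # 500 100
print(genome_position(100, 1000))  # 550
```

A single-exon transcript yields an empty list, so `max` and `min` need a guard before being applied to real data.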
Table Descriptions
Annotated details page: File:Hg19uc002ypa.2.pdf
This section attempts to describe how each table used in or related to UCSC Genes is used.
UCSC Genes & Gene Sorter (GS) Table Descriptions
- allenBrainGene - "Human Cortex Gene Expression" link in "Sequence & Links to Tools & Databases" section of hgGene
- allenBrainUrl - w/ knownToAllenBrain creates GS column, "Allen Brain"
- bioCycMapDesc - BioCyc description name in "Biochem & Signaling Pathways" section of hgGene
- bioCycPathway - BioCyc pathway name in "Biochem & Signaling Pathways" section of hgGene
- ccdsKgMap - CCDS in the "Other names for this Gene" section of hgGene
- ceBlastTab - C. elegans info in "Orthologous Genes in Other Species" section of hgGene
- cgapAlias - links cgapID with kgXref.geneSymbol to pull info for gene
- cgapBiocDesc - BioCarta description in "Biochem & Signaling Pathways" section of hgGene
- cgapBiocPathway - BioCarta pathway name in "Biochem & Signaling Pathways" section of hgGene
- dmBlastTab - D. melanogaster info in "Orthologous Genes in Other Species" section of hgGene
- drBlastTab - zebrafish info in "Orthologous Genes in Other Species" section of hgGene
- foldUtr3 - 3' info in "mRNA Secondary Structure of 3' and 5' UTRs" section of hgGene
- foldUtr5 - 5' info in "mRNA Secondary Structure of 3' and 5' UTRs" section of hgGene
- gnfAtlas2 - separate track, QA'd with that track, but also determines the "Microarray Expression Data" section of hgGene and the Gene Sorter column "GNF Atlas 2"
- gnfAtlas2Distance - Gene Sorter column "GNF Atlas 2 Delta" & "Expression (GNF Atlas2)" "sort by" option
- humanHprdP2P - Gene Sorter column "HPRD P2P" & "sort by"
- humanVidalP2P - Gene Sorter column "M. Vidal Protein-to-Protein" & "sort by"
- humanWankerP2P - Gene Sorter column "E. Wanker Protein-to-Protein" & "sort by"
- keggMapDesc - KEGG pathway description in "Biochem & Signaling Pathways" section of hgGene
- keggPathway - KEGG pathway name in "Biochem & Signaling Pathways" section of hgGene
- kg4ToKg5 - allows an old ID from the previous gene set to be searched for in the new gene set; users can also check the kg4ToKg5 table directly to find corresponding gene IDs
- kgAlias - "Alternate Gene Symbols" in "Other Names for This Gene" section of hgGene
- kgColor - colors the gene in browser
- kgProtAlias - intermediate table?
- kgProtMap2 - needed for Scop Domains in "Protein Domain & Structure Information" section of hgGene and for the Protein Data Bank column in GS to work properly; also involved with the proteome browser (which is not being released with hg19 and is being phased out)
- kgSpAlias - duplicate of kgAlias w/ extra field, spID, that is blank in all records
- kgTxInfo - table info in the "Gene Model Information" section of hgGene
- kgXref - "Alternate Gene Symbols" in the "Other Names for This Gene" section of hgGene
- knownAlt - separate track, "Alt Events"; needs to be QA'd separately
- knownBlastTab - Gene Sorter columns: "%ID" = knownBlastTab.identity, "BLASTP E-Value" = knownBlastTab.eValue, "BLASTP Bits" = knownBlastTab.bitScore
- knownCanonical - best transcript from each clusterId (note, GS only works with genes in this table)
- knownGene - primary table
- knownGeneMrna - "mRNA" link in "Sequence & Links to Tools & Databases" section of hgGene
- knownGenePep - "protein" link in "Sequence & Links to Tools & Databases" section of hgGene
- knownIsoforms - transcript grouped into clusters named by clusterId
- knownToAllenBrain - w/ allenBrainUrl creates Gene Sorter "Allen Brain" column/link
- knownToEnsembl - used in link to Ensembl
- knownToGnf1m - similar to knownToGnfAtlas2; not sure what it's for
- knownToGnfAtlas2 - "Microarray Expression Data" section, Gene Sorter column "GNF Atlas 2 ID"
- knownToHprd - creates the "HPRD" link in "Sequence & Links to Tools & Databases" section of hgGene
- knownToLocusLink - used in link to Entrez Gene, see issues below
- knownToPfam - Pfam Domains in "Protein Domain & Structure info" of hgGene & Gene Sorter column: Pfam Domains
- knownToRefSeq - used in link to RefSeq in "Other Names for This Gene" section of hgGene
- knownToSuper - contains scop domain info with gene name & start/end
- knownToTreefam - used in link to Treefam website in "Sequence & Links to Tools & Databases" section of hgGene
- knownToU133 - Gene Sorter column "U133 ID"
- knownToVisiGene - used in link to VisiGene
- mmBlastTab - mouse info in "Orthologous Genes in Other Species" section of hgGene
- pfamDesc - Pfam description in "Protein Domain & Structure Info" section of hgGene and in "Pfam Domains" column of Gene Sorter
- rnBlastTab - rat info in "Orthologous Genes in Other Species" section of hgGene
- scBlastTab - S. cerevisiae info in "Orthologous Genes in Other Species" section of hgGene
- scopDesc - acc and description in "SCOP Domains" of "Prot Domain & Structure Info" section of hgGene
- spMrna - intermediate table? Doesn't seem to directly affect hgGene or GS
- ucscScop - from ucscID gets scop domainName
Click for more information about blastTabs
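Several of the descriptions above imply relationships that can be checked with simple joins, for example that every knownCanonical transcript belongs to its own cluster in knownIsoforms, and that the Gene Sorter only covers transcripts listed in knownCanonical. The sketch below runs such checks against a toy in-memory SQLite database so it is runnable anywhere. The tables are cut down to the columns needed (real tables have more), the rows are made up, and the queries would have to be adapted before running them against a real assembly database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified toy versions of the real tables (assumed column names).
    CREATE TABLE knownGene (name TEXT);
    CREATE TABLE knownIsoforms (clusterId INTEGER, transcript TEXT);
    CREATE TABLE knownCanonical (clusterId INTEGER, transcript TEXT);
    INSERT INTO knownGene VALUES ('uc001aaa.1'), ('uc001aab.1'), ('uc001aac.1');
    INSERT INTO knownIsoforms VALUES
        (1, 'uc001aaa.1'), (1, 'uc001aab.1'), (2, 'uc001aac.1');
    INSERT INTO knownCanonical VALUES (1, 'uc001aaa.1'), (2, 'uc001aac.1');
""")

# Canonical transcripts missing from their own cluster (expect none).
orphans = conn.execute("""
    SELECT c.clusterId, c.transcript
    FROM knownCanonical c
    LEFT JOIN knownIsoforms i
      ON i.clusterId = c.clusterId AND i.transcript = c.transcript
    WHERE i.transcript IS NULL
""").fetchall()

# Clusters that do not have exactly one canonical transcript (expect none).
bad_clusters = conn.execute("""
    SELECT i.clusterId, COUNT(DISTINCT c.transcript)
    FROM (SELECT DISTINCT clusterId FROM knownIsoforms) i
    LEFT JOIN knownCanonical c ON c.clusterId = i.clusterId
    GROUP BY i.clusterId
    HAVING COUNT(DISTINCT c.transcript) != 1
""").fetchall()

# knownGene transcripts that are not canonical, so not shown in the Gene Sorter.
non_canonical = conn.execute("""
    SELECT g.name
    FROM knownGene g
    LEFT JOIN knownCanonical c ON c.transcript = g.name
    WHERE c.transcript IS NULL
    ORDER BY g.name
""").fetchall()

print(orphans, bad_clusters, non_canonical)
```

With the toy rows above, the first two result lists are empty and the third contains only the non-canonical isoform of cluster 1.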
UCSC Genes Tables in other Databases
Proteome DB (e.g. proteins090821)
- spReactomeEvent - "Reactome" info in "Biochemical and Signaling Pathways" section of hgGene (linked through, and dependent on, spID in kgXref)
- spReactomeId - "Reactome" link in "Sequence & Links to Tools & Databases" section of hgGene (unsure)
Tables Related to UCSC Genes That are Separate tracks
- affyU133
- allenBrainAli
- exoniphy (no longer building as of Feb. 2012)
- gnfAtlas2
- nibbImageProbes
- vgAllProbes
No longer UCSC Genes Tables
- knownToCdsSnp - populated the Cds Snp column in Gene Sorter; being dropped on all assemblies because too many issues were found
- knownToGnf1h - part of GNF Atlas 1, which is not on hg19
Proteome Browser Tables (no longer releasing)
- pbAnomLimit
- pbResAvgStd
- pepCCntDist
- pepExonCntDist
- pepHydroDist
- pepIPCntDist
- pepMolWtDist
- pepPi
- pepPiDist
- pepResDist
- pepMwAa