QAing UCSC Genes
wiki pages about known genes
- http://genomewiki.cse.ucsc.edu/genecats/index.php/UCSC_Genes_Staging_Process
- http://genomewiki.cse.ucsc.edu/genecats/index.php/QAing_UCSC_Genes
- http://genomewiki.cse.ucsc.edu/genecats/index.php/Post-Release-Checklist
Determining the tables involved
The complete list of tables (for the assembly, proteome and uniProt databases) *should* be in the push queue. However, you should look for tables that shouldn't be pushed (e.g., history, chromInfo, tableDescriptions), and tables that are missing. Compare the current list to the previous list in UCSC_Genes_Staging_Process#Tables_in_the_Assembly_Database. Keep that list up to date so that it can be used in future releases.
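The comparison described above is essentially a set difference. The following is a minimal sketch, assuming both table lists have already been collected as plain lists of names; the function name and the sample lists are hypothetical and only illustrate the idea.

```python
# Tables the text names as examples that should not be pushed.
NEVER_PUSH = {"history", "chromInfo", "tableDescriptions"}


def compare_table_lists(previous, current):
    """Return (missing, added, flagged) as sorted lists of table names."""
    prev, cur = set(previous), set(current)
    missing = sorted(prev - cur)        # in the previous list, absent now
    added = sorted(cur - prev)          # new this time; confirm they belong
    flagged = sorted(cur & NEVER_PUSH)  # listed, but should not be pushed
    return missing, added, flagged


if __name__ == "__main__":
    # Illustrative lists only, not a real push queue entry.
    previous = ["knownGene", "kgXref", "knownCanonical"]
    current = ["knownGene", "kgXref", "history"]
    missing, added, flagged = compare_table_lists(previous, current)
    print("missing from current list:", missing)
    print("new in current list:", added)
    print("should not be pushed:", flagged)
```

Anything reported as missing or new still needs a human decision; the sketch only surfaces the differences.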
HGGeneCheck
HGGeneCheck takes a long time to run (usually over a week), and it doesn't usually find any errors. But since it doesn't take a lot of effort and it may find something, we still run it.
This program basically does this:
For all rows in the table knownGene|sgdGene|bdgpGene|sangerGene|rgdGene2, view details page.
- Loops over all assemblies in props file.
- For all pages viewed, check for non-200 return code.
- Doesn't click into any links.
- Doesn't check for HGERROR.
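To make the behavior described above concrete, here is a rough Python sketch of the same loop. It is not the HGGeneCheck source (which is Java); the function names are hypothetical, and the details-page URL pattern is an assumption that may differ from what the real program requests.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urlopen(url, timeout=60) as resp:
            return resp.status
    except HTTPError as err:  # must precede URLError, its parent class
        return err.code
    except URLError:
        return None


def check_details_pages(machine, db, gene_ids, fetch=fetch_status):
    """Request one details page per gene and collect the non-200 results."""
    failures = []
    for gene_id in gene_ids:
        # ASSUMPTION: hgGene URL form; the real program may build it differently.
        url = f"http://{machine}/cgi-bin/hgGene?db={db}&hgg_gene={gene_id}"
        status = fetch(url)
        if status != 200:
            failures.append((gene_id, status))
    return failures


if __name__ == "__main__":
    # Offline demonstration with a stub in place of real HTTP requests.
    def stub(url):
        return 500 if "uc999zzz.1" in url else 200

    print(check_details_pages("hgwdev.cse.ucsc.edu", "rn4",
                              ["uc002ypa.2", "uc999zzz.1"], fetch=stub))
```

Like the program described above, the sketch only looks at the return code: it follows no links and does not scan the page body for HGERROR.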
If your gene table isn't listed above, it needs to be added to the HGGeneCheck.java program first (in the source code at kent/java/src/edu/ucsc/genome/qa/cgiCheck/). This program defaults to running on hg17, but if you supply a props file you can specify a database for it to run on.
Sample props file (note that, unlike TrackCheck, HGGeneCheck will not run if you include a zoomcount line):
machine hgwdev.cse.ucsc.edu
quick false
dbSpec rn4
Note that you can change the machine name to something other than hgwdev (e.g., hgwdev-demo5, if there are CGI changes that have not yet been checked into the master branch but are still in development).
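Because a stray zoomcount line stops HGGeneCheck from running, it can be worth checking a props file before starting a run that takes a week. The sketch below assumes the whitespace-separated "key value" layout shown in the sample; the function names are hypothetical and this is not part of the QA tools.

```python
def parse_props(text):
    """Parse 'key value' lines into a dict, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def props_problems(props):
    """Return a list of reasons this props file is unsuitable for HGGeneCheck."""
    problems = []
    if "zoomcount" in props:
        problems.append("zoomcount line present: HGGeneCheck will not run")
    return problems


if __name__ == "__main__":
    sample = "machine hgwdev.cse.ucsc.edu\nquick false\ndbSpec rn4\n"
    props = parse_props(sample)
    print(props)
    print(props_problems(props) or "no problems found")
```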
Sample command to run:
nohup HGGeneCheck props > HGrobot.out &
joinerCheck
Fun fact: the Known Genes track was the original impetus to develop all.joiner. Jim identified the identifiers below that need to be checked before releasing the track (we don't have to run it for every single table!). He said that most of the remaining identifiers that involve knownGene are probably there to show table relationships in the browser. The all.joiner identifiers to check with joinerCheck -keys are:
- knownGeneId
- knownIsoformCluster
- refSeqId
- UniProtId
- UniProtAccession
A new identifier needs to be added (and checked) for each release:
- knownGeneOld{N}Id
Also run joinerCheck -times for the assembly database. There may be problems for tracks that have nothing to do with UCSC Genes that can be ignored.
Note that the two UniProt* identifiers above can't be run with a database specified, since the checks go across databases. The identifiers generate a lot of output, only some of which may be pertinent to the new track.
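The full set of joinerCheck runs can be written out ahead of time. This sketch only builds the command lines and does not execute them. The -identifier= and -database= option spellings and the all.joiner path are assumptions to confirm against joinerCheck's usage message, and running the knownGeneOld{N}Id check with a database is also an assumption.

```python
# Identifiers from the list above that can be checked within one database.
PER_DATABASE_IDS = ["knownGeneId", "knownIsoformCluster", "refSeqId"]
# Identifiers whose checks cross databases, so no database is given.
CROSS_DATABASE_IDS = ["UniProtId", "UniProtAccession"]


def joiner_check_commands(db, old_n, joiner="all.joiner"):
    """Return the joinerCheck invocations as lists of arguments."""
    ids = PER_DATABASE_IDS + [f"knownGeneOld{old_n}Id"]
    cmds = [["joinerCheck", "-keys", f"-identifier={name}",
             f"-database={db}", joiner] for name in ids]
    cmds += [["joinerCheck", "-keys", f"-identifier={name}", joiner]
             for name in CROSS_DATABASE_IDS]
    cmds.append(["joinerCheck", "-times", f"-database={db}", joiner])
    return cmds


if __name__ == "__main__":
    # hg19 and N=5 are example values only.
    for cmd in joiner_check_commands("hg19", 5):
        print(" ".join(cmd))
```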
Makedoc
The script used to generate UCSC genes can be found in the source tree in kent/src/hg/makeDb/doc/ucscGenes/; there won't be an entry for it in the assembly.txt makedoc.
Comparison to previous build
Look at the numbers of genes in the old-to-new table like so:
mysql> select count(*), status from kg5ToKg6 group by status;
+----------+------------+
| count(*) | status     |
+----------+------------+
|    48057 | compatible |
|    26658 | exact      |
|     1010 | none       |
|     1889 | overlap    |
+----------+------------+
4 rows in set (0.00 sec)
Compare the counts to the previous old-to-new table. If the numbers are very different, there may be a problem in the build. Get the developer's opinion if you're not sure what is expected.
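Since the total number of genes changes between builds, comparing each status as a fraction of its table is often easier than comparing raw counts. The following is a minimal sketch of that comparison; the function names and the 5-point threshold are arbitrary choices, and the "previous" counts in the example are invented placeholders, not real data.

```python
def status_fractions(counts):
    """Convert {status: count} into {status: fraction of all rows}."""
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


def flag_shifts(old_counts, new_counts, threshold=0.05):
    """Return statuses whose share of the table moved by more than threshold."""
    old_frac = status_fractions(old_counts)
    new_frac = status_fractions(new_counts)
    flagged = {}
    for status in sorted(set(old_frac) | set(new_frac)):
        delta = new_frac.get(status, 0.0) - old_frac.get(status, 0.0)
        if abs(delta) > threshold:
            flagged[status] = round(delta, 3)
    return flagged


if __name__ == "__main__":
    # Counts from the kg5ToKg6 query shown above.
    current = {"compatible": 48057, "exact": 26658, "none": 1010, "overlap": 1889}
    # PLACEHOLDER numbers for the previous old-to-new table; substitute real ones.
    previous = {"compatible": 10000, "exact": 50000, "none": 1000, "overlap": 2000}
    print(flag_shifts(previous, current))
```

A flagged status is only a prompt to ask the developer whether the shift is expected; it is not evidence of a build problem by itself.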
Look at about 5 genes for each of these major cases, and make sure the changes make sense:
- genes that were dropped (status "none")
- genes that have compatible extensions, meaning that the new gene is the same as the old gene, except at the ends (status "compatible")
- genes that overlap with old genes, meaning that there is a chunk in the middle somewhere that has changed (status "overlap")
- genes that are completely new in this build
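One way to pick the sample genes for the cases above is to let MySQL choose rows at random. This sketch only generates the query text. The status column appears in the query shown earlier, but the newId column name used to find completely new genes is an assumption about the mapping table's schema, so check it with "describe" first.

```python
def sample_queries(mapping_table, n=5):
    """Return {case: SQL} queries that pick n random genes for each case."""
    queries = {}
    for status in ("none", "compatible", "overlap"):
        queries[status] = (
            f"select * from {mapping_table} where status='{status}' "
            f"order by rand() limit {n}"
        )
    # Completely new genes have no row in the old-to-new table.
    # ASSUMPTION: the new gene ID is stored in a column named newId.
    queries["new"] = (
        f"select k.name from knownGene k left join {mapping_table} m "
        f"on k.name = m.newId where m.newId is null "
        f"order by rand() limit {n}"
    )
    return queries


if __name__ == "__main__":
    for case, sql in sample_queries("kg5ToKg6").items():
        print(f"-- {case}\n{sql};\n")
```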
New genes with status "compatible" or "overlap" will have a note about the way in which they have changed. You can see the different kinds of notes and their counts with mysql queries like this one:
mysql> select count(*), note from kg5ToKg6 where status="overlap" group by note;
+----------+-------------------------------------+
| count(*) | note                                |
+----------+-------------------------------------+
|       84 | Bases added to coding region        |
|       54 | Bases added to UTR                  |
|      130 | Bases removed from coding region    |
|        3 | Bases removed from UTR              |
|      992 | Different number of exons           |
|      237 | Intron boundaries have changed      |
|       90 | No longer considered protein coding |
|      299 | Now considered protein coding       |
+----------+-------------------------------------+
8 rows in set (0.00 sec)
Also compare the number of genes in knownCanonical between old and new sets.
hgGene Page Source Information
Click on the following link to view a sample hgGene page annotated with the sources of the different components:
File:Hg19uc002ypa.2.pdf
Gene Sorter Column Sources
| Name | Description | Source |
|---|---|---|
| # | Item Number in Displayed List/Select Gene | n/a |
| Name | Gene Name/Select Gene | kgXref.geneSymbol |
| UCSC ID | UCSC Transcript ID | knownGene.name |
| UniProtKB | UniProtKB Protein Display ID | kgXref.spDisplayID or kgXref.spID_organism |
| UniProtKB Acc | UniProtKB Protein Accession | kgXref.spID |
| RefSeq | NCBI RefSeq Gene Accession | kgXref.refseq |
| Entrez Gene | NCBI Entrez Gene/LocusLink ID | knownToLocusLink |
| GenBank | GenBank mRNA Accession | kgXref.refseq or kgXref.mRNA |
| Ensembl | Ensembl Transcript ID | knownToEnsembl |
| GNF Atlas 2 ID | ID of Associated GNF Atlas 2 Expression Data | knownToGnfAtlas2 |
| Gene Category | High Level Gene Category - Coding, Antisense, etc. | kgTxInfo.category |
| CDS Score | Coding potential score from txCdsPredict | kgTxInfo.cdsScore |
| VisiGene | UCSC VisiGene In Situ Image Browser | knownToVisiGene |
| Allen Brain | Allen Brain Atlas In Situ Images of Adult Mouse Brains | knownToAllenBrain & allenBrainUrl |
| U133 ID | ID of Associated Affymetrix U133 Expression Data | knownToU133 |
| GNF Atlas 2 | GNF Expression Atlas 2 Data from U133A and GNF1H Chips | gnfAtlas2 |
| Max GNF Atlas 2 | Maximum Expression Value of GNF Expression Atlas 2 | calculated? |
| GNF Atlas 2 Delta | Normalized Difference in GNF Expression Atlas 2 from Selected Gene | gnfAtlas2Distance |
| BLASTP | NCBI BLASTP Bit Score | knownBlastTab.bitScore |
| BLASTP | NCBI BLASTP E-Value | knownBlastTab.evalue |
| %ID | NCBI BLASTP Percent Identity | knownBlastTab.identity |
| 5' UTR Fold | 5' UTR Fold Energy (Estimated kcal/mol) | foldUtr5.energy |
| 3' UTR Fold | 3' UTR Fold Energy (Estimated kcal/mol) | foldUtr3.energy |
| Exon Count | Number of Exons (Including Non-Coding) | knownGene.exonCount |
| Intron Size | Size of biggest (or optionally smallest) intron | knownGene exonStarts - exonEnds |
| Genome Position | Genome Position/Link to Genome Browser | (knownGene.txStart + txEnd)/2 |
| Mouse | Mouse Ortholog (Best Blastp Hit to UCSC Known Genes) | mmBlastTab |
| Rat | Rat Ortholog (Best Blastp Hit to UCSC Known Genes) | rnBlastTab |
| Zebrafish | Danio rerio Ortholog (Best Blastp Hit to Ensembl) | drBlastTab |
| Drosophila | D. melanogaster Ortholog (Best Blastp Hit to FlyBase Proteins) | dmBlastTab |
| C. elegans | C. elegans Ortholog (Best Blastp Hit to WormPep) | ceBlastTab |
| Yeast | Saccharomyces cerevisiae Ortholog (Best Blastp Hit to RefSeq) | scBlastTab |
| Pfam Domains | Protein Family Domain Structure | knownToPfam → pfamDesc |
| Superfamily | Protein Superfamily Assignments | ucscScop & scopDesc |
| PDB | Protein Data Bank | kgProtMap2 & sp###### database |
| Gene Ontology | Gene Ontology (GO) Terms Associated with Gene | kgProtMap2 & sp###### database |
| M. Vidal P2P | Human Protein-Protein Interaction Network from Marc Vidal | humanVidalP2P |
| E. Wanker P2P | Human Protein-Protein Interaction Network from Erich Wanker | humanWankerP2P |
| HPRD P2P | Human Protein-Protein Interaction Network from the Human Reference Protein Database | humanHprdP2P |
| Description | Short Description Line/Link to Details Page | kgXref.description |
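Two of the sources listed above, Intron Size and Genome Position, are computed from knownGene coordinates instead of being read from a column. The sketch below shows one way to reproduce those values when spot-checking a Gene Sorter row. It assumes exonStarts and exonEnds arrive as comma-separated strings with a trailing comma, as in genePred-style tables, and the use of integer division for the midpoint is a choice made here, not something the source specifies.

```python
def parse_coords(field):
    """Parse a comma-separated coordinate list such as '100,300,' into ints."""
    return [int(x) for x in field.split(",") if x]


def intron_sizes(exon_starts, exon_ends):
    """Each intron runs from the end of one exon to the start of the next."""
    starts, ends = parse_coords(exon_starts), parse_coords(exon_ends)
    return [starts[i + 1] - ends[i] for i in range(len(starts) - 1)]


def midpoint(tx_start, tx_end):
    """Genome Position as listed above: (txStart + txEnd) / 2."""
    return (tx_start + tx_end) // 2


if __name__ == "__main__":
    # Made-up three-exon transcript for illustration.
    sizes = intron_sizes("100,300,600,", "200,400,700,")
    print("introns:", sizes)
    print("biggest:", max(sizes), "smallest:", min(sizes))
    print("position:", midpoint(100, 700))
```

A single-exon transcript has no introns, so the list comes back empty and max() or min() would need a guard.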
Table Descriptions
Annotated details page: File:Hg19uc002ypa.2.pdf
This section attempts to describe how each table used in, or related to, UCSC Genes is used.
UCSC Gene & GS Table Descriptions
- allenBrainGene - "Human Cortex Gene Expression" link in "Sequence & Links to Tools & Databases" section of hgGene
- allenBrainUrl - w/ knownToAllenBrain creates GS column, "Allen Brain"
- bioCycMapDesc - BioCyc description name in "Biochem & Signaling Pathways" section of hgGene
- bioCycPathway - BioCyc pathway name in "Biochem & Signaling Pathways" section of hgGene
- ccdsKgMap - CCDS in the "Other names for this Gene" section of hgGene
- ceBlastTab - C. elegans info in "Orthologous Genes in Other Species" section of hgGene
- cgapAlias - links cgapID with kgXref.geneSymbol to pull info for gene
- cgapBiocDesc - BioCarta description in "Biochem & Signaling Pathways" section of hgGene
- cgapBiocPathway - BioCarta pathway name in "Biochem & Signaling Pathways" section of hgGene
- dmBlastTab - D. melanogaster info in "Orthologous Genes in Other Species" section of hgGene
- drBlastTab - zebrafish info in "Orthologous Genes in Other Species" section of hgGene
- foldUtr3 - 3' info in "mRNA Secondary Structure of 3' and 5' UTRs" section of hgGene
- foldUtr5 - 5' info in "mRNA Secondary Structure of 3' and 5' UTRs" section of hgGene
- gnfAtlas2 - separate track, QA'd with that track but also determines the "Microarray expression Data" section of hgGene and the Gene Sorter column, "GNF Atlas 2"
- gnfAtlas2Distance - Gene Sorter column "GNF Atlas 2 Delta" & "Expression (GNF Atlas2)" "sort by" option
- humanHprdP2P - Gene Sorter column "HPRD P2P" & "sort by"
- humanVidalP2P - Gene Sorter column "M. Vidal Protein-to-Protein" & "sort by"
- humanWankerP2P - Gene Sorter column "E. Wanker Protein-to-Protein" & "sort by"
- keggMapDesc - KEGG pathway description in "Biochem & Signaling Pathways" section of hgGene
- keggPathway - KEGG pathway name in "Biochem & Signaling Pathways" section of hgGene
- kg4ToKg5 - maps IDs from the previous gene set to the new one, so that an old ID can be searched for in the new gene set; users can also query the kg4ToKg5 table directly to find corresponding gene IDs.
- kgAlias - "Alternate Gene Symbols" in "Other Names for This Gene" section of hgGene
- kgColor - colors the gene in browser
- kgProtAlias - intermediate table?
- kgProtMap2 - needed for SCOP Domains in the "Protein Domain & Structure Information" section of hgGene and for the Protein Data Bank column in GS to work properly; also used by the Proteome Browser (which is being phased out and is not being released with hg19)
- kgSpAlias - duplicate of kgAlias w/ extra field, spID, that is blank in all records
- kgTxInfo - table info in the "Gene Model Information" section of hgGene
- kgXref - "Alternate Gene Symbols" in the "Other Names for This Gene" section of hgGene
- knownAlt - separate track, "Alt Events"; needs to be QA'd separately
- knownBlastTab - Gene Sorter columns: "%ID" = knownBlastTab.identity, "BLASTP E-Value" = knownBlastTab.eValue, "BLASTP Bits" = knownBlastTab.bitScore
- knownCanonical - best transcript from each clusterId (note, GS only works with genes in this table)
- knownGene - primary table
- knownGeneMrna - "mRNA" link in "Sequence & Links to Tools & Databases" section of hgGene
- knownGenePep - "protein" link in "Sequence & Links to Tools & Databases" section of hgGene
- knownIsoforms - transcript grouped into clusters named by clusterId
- knownToAllenBrain - w/ allenBrainUrl creates Gene Sorter "Allen Brain" column/link
- knownToEnsembl - used in link to Ensembl
- knownToGnf1m - similar to knownToGnfAtlas2; not sure what it's for
- knownToGnfAtlas2 - "Microarray Expression Data" section, Gene Sorter column "GNF Atlas 2 ID"
- knownToHprd - creates the "HPRD" link in "Sequence & Links to Tools & Databases" section of hgGene
- knownToLocusLink - used in link to Entrez Gene, see issues below
- knownToPfam - Pfam Domains in "Protein Domain & Structure info" of hgGene & Gene Sorter column: Pfam Domains
- knownToRefSeq - used in link to RefSeq in "Other Names for This Gene" section of hgGene
- knownToSuper - contains scop domain info with gene name & start/end
- knownToTreefam - used in link to Treefam website in "Sequence & Links to Tools & Databases" section of hgGene
- knownToU133 - Gene Sorter column "U133 ID"
- knownToVisiGene - used in link to VisiGene
- mmBlastTab - mouse info in "Orthologous Genes in Other Species" section of hgGene
- pfamDesc - Pfam description in "Protein Domain & Structure Info" section of hgGene and in "Pfam Domains" column of Gene Sorter
- rnBlastTab - rat info in "Orthologous Genes in Other Species" section of hgGene
- scBlastTab - S. cerevisiae info in "Orthologous Genes in Other Species" section of hgGene
- scopDesc - acc and description in "SCOP Domains" of "Protein Domain & Structure Info" section of hgGene
- spMrna - intermediate table? Doesn't seem to directly affect hgGene or GS
- ucscScop - from ucscID gets scop domainName
Click for more information about blastTabs
UCSC Genes Tables in other Databases
Proteome DB (e.g. proteins090821)
- spReactomeEvent - "Reactome" info in "Biochemical and Signaling Pathways" section of hgGene (linked through, and so dependent on, spID in kgXref)
- spReactomeId - "Reactome" link in "Sequence & Links to Tools & Databases" section of hgGene (unsure?)
Tables Related to UCSC Genes That are Separate tracks
- affyU133
- allenBrainAli
- exoniphy - created by Adam Siepel of Cornell for each assembly (2nd choice is to lift from previous assembly)
- gnfAtlas2
- nibbImageProbes
- vgAllProbes
No longer UCSC Genes Tables
- knownToCdsSnp - populated the Cds Snp column in Gene Sorter; being dropped on all assemblies because too many issues were found.
- knownToGnf1h - part of GNF Atlas 1, which is not on hg19
Proteome Browser Tables (no longer releasing)
- pbAnomLimit
- pbResAvgStd
- pepCCntDist
- pepExonCntDist
- pepHydroDist
- pepIPCntDist
- pepMolWtDist
- pepPi
- pepPiDist
- pepResDist
- pepMwAa