SNP Track QA: Difference between revisions

From Genecats
(→‎AT RELEASE BE SURE TO:: downloads.html link)
(Adding a note that the page is no longer relevant, refs #27751)
 
(30 intermediate revisions by 5 users not shown)
Starting with snp132, the SNP track was split into 4 tracks, [http://genome.ucsc.edu/goldenPath/newsarch.html#041811.2 announced here]. If you have any general questions about snps this is a good resource: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpsnpfaq.  In addition to the SNP-specific checks on this page, be sure to follow the [[New_track_checklist | new track checklist]], too.
<h2><span style=background-color:yellow> <br><br> This page is no longer relevant since the SNP track is now using bigBeds.</span><br><br></h2>


* look at mysql tables
Starting with snp132, the SNP track was split into 4 tracks, [http://genome.ucsc.edu/goldenPath/newsarch.html#041811.2 announced here]. Angie gave a [http://genomewiki.ucsc.edu/index.php/Image:VariationGenecats20110406.pptx presentation] [http://www.ustream.tv/recorded/13819247 (video)] on the big changes made starting with snp132. If you have any general questions about SNPs this is a good resource: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpsnpfaq.
** If possible, compare to old SNP tables for this species. Look for big jumps in the number of different func types of SNPs as well as the number and type of exceptions.
** check that func types in snp### table are documented in the html, both in the methods and in the check boxes. If there are new func types, make sure they are displayed correctly in the browser
** check that the weights column in snp### are only 1,2 or 3
** look at the exceptionDesc table: spot-check the counts, make sure the exception messages show up on hgc pages, make sure the descriptions make sense. Ideally, this table would be visible in the Table Browser when the track is selected, but it isn't right now.  See [http://redmine.soe.ucsc.edu/issues/3606 Redmine #3606] for current status of this problem.


In addition to the SNP-specific checks on this page, be sure to follow the [[New_track_checklist | new track checklist]], too.


*  check settings (hgTrackUi and clicking on individual snp)
==MySQL tables==
** make sure that all settings that you can select for are present in mysql tables and vice versa
* If possible, compare to old SNP tables for this species. Look for big jumps in the number of different func types of SNPs as well as the number and type of exceptions. Here's an example:
** make sure that methods section of the description page mentions all func types that you can select in trackUi
<pre>
** check that can turn different gene tracks on in the trackUi
mysql> SELECT COUNT(DISTINCT (func)) FROM snp146;
</pre>
* Look for SNPs that may have been erroneously dropped between the old and new versions. You can get a list of all new and dropped rsIDs by comparing the name column of the two tables. For example:
<pre>
$ hgsql -Ne "SELECT name FROM snp146" hg19 | sort -u > hg19.snp146.names
$ hgsql -Ne "SELECT name FROM snp144" hg19 | sort -u > hg19.snp144.names
$ diff hg19.snp144.names hg19.snp146.names > hg19.snp144.snp146.diffs
</pre>


** look at featureBits/Coverage for main snp table as well as the dbCoding and the ortho table. Also check coverage of haplo chromosomes.
* Check that func types in snp### table are documented in the html, both in the methods and in the check boxes. If there are new func types, make sure they are displayed correctly in the browser.
** see how coverage compares to old snp track if possible. Also when open up the track, most snps should match to the same place (though some will change)


** make sure appropriate entries in hgFindSpec (this is very important for snps since the table is so large). Check that entries in hgFindSpec work by searching for some snps.
* Check that the values in the weights column of snp### are only 1, 2, or 3.
** make sure track "turns off" when viewing large regions (table too large to load for large regions)
* Look at the snp###ExceptionDesc table:
** look at large snp insertions/deletions and check that display correctly in hgTracks and hgTrackUi
** Make sure the exception messages show up on hgc pages
** Make sure the descriptions make sense
** Spot-check the counts. You can also compare these exceptions to those from the previous SNP release. To do so, you can use commands like:
<pre>
$ hgsql -Ne "select exception, count from snp146ExceptionDesc" hg19 | sort > hg19.snp146ExceptionDesc.counts
$ hgsql -Ne "select exception, count from snp144ExceptionDesc" hg19 | sort > hg19.snp144ExceptionDesc.counts
$ join -j 1 hg19.snp144ExceptionDesc.counts hg19.snp146ExceptionDesc.counts | column -t -s ' '
</pre>
The output of these commands will be a three column file, where column 1 is the exception name, column 2 is the count of that exception from the old SNP track, and column 3 is the count from the new SNP track. For example:
<pre>
AlleleFreqSumNot1          1842    1752
DuplicateObserved          591680  390122
FlankMismatchGenomeEqual    182246  222451
</pre>


*   description
==Settings==
** make sure the source files listed are correct and exist at ncbi
* Make sure that there are no values in the snp### table that are not represented in the track's filter controls.
** Note that the controls in the cgi are hard coded, so you may find that controls exist for values not present in the snp### table.
** Here's an example (repeat 4x where $field is class, valid, func, molType):
<pre>
mysql> SELECT DISTINCT($field) FROM snp###;
</pre>


* downloads - If this is a human assembly, make sure there are downloads for "Masked FASTA Files (human assemblies only)". Check that the FASTA files are actually masked with SNPs. Look for some lines with y, w, m, etc. You can use a command like:
* Try some table browser queries that use the checkboxes on the filter page.
  zcat chr1.subst.fa.gz | head -300 | grep -i '[^atgcn]'
* Try turning on selected gene tracks on the track controls page, and make sure results show up in the "UCSC's predicted function relative to selected gene tracks:" section.
* Try the various color options.
 
==Details==
* Check the following types of SNPs:
** A SNP with Orthologous alleles in Chimp/Rhesus/Orangutan. Only done for SNP tracks on human assemblies. Can be found with a simple <tt>hgsql -e "SELECT * FROM snp146OrthoPt4Pa2Rm3 LIMIT 1\G" hg38</tt>.
** A SNP with coding annotations by dbSNP. Find one to check by doing <tt>hgsql -e "SELECT * FROM snp146CodingDbSnp LIMIT 1\G" hg38</tt>.
** A SNP with chimera and ls-snp links in your testing. While the lsSnpPdb table is quite outdated (last updated 2010-12-03), these linkouts still show up on details pages for SNPs in hg19, so it is useful to check that the linkouts still work as expected. To find a SNP in the lsSnpPdb table that is still in the current release, you can use a MySQL command like the following: <tt>select chrom,chromStart,chromEnd,name,pdbId from snp146,lsSnpPdb where snpId = name limit 10;</tt>
** (In any of the above examples, you can replace snp146 with whatever SNP track you may be QA-ing.)
* Make sure the tracks "turn off" when viewing large regions (these tables are too large to load for large regions). Note that this doesn't apply to some of the smaller SNP tracks, such as the "Flagged" track, which for hg38.snp146Flagged has only ~150,000 items.
* Pay special attention to the PAR regions and haplotype chromosomes.  We shouldn't necessarily exclude snps that map to more than one position from sets that are uniquely mapped.
 
==Description==
* Note that most of the description is the same on all 4 snp tracks -- this is kept in one file (snp132.shared.html for snp132) that is included in the other html with a line like:
  <nowiki><!--#insert file="sharedText.html"--></nowiki>
* Make sure the source files listed are correct and exist at NCBI.
* Make sure that "Interpreting and Configuring the Graphical Display" section mentions all func types that you can select in TrackUi.
 
==Ask Admins to "myisampack" Tables==
 
As the SNP tables grow and grow, we have started to ask the admins to "myisampack" the tables before release to reduce their final size. After the tables are looking good on Dev (or Beta if that's where you've done your QA), then you can send a request to the cluster admins requesting that they "myisampack" the tables on Dev. You can also cc the engineer in charge of these tables on this email as well. Here's an example email:
<pre>
Hello Cluster Admin,
 
Can you myisampack then flush the following tables on hgwdev:
 
hg19.snp146
hg19.snp146CodingDbSnp
hg19.snp146Common
hg19.snp146ExceptionDesc
hg19.snp146Flagged
hg19.snp146Mult
hg19.snp146OrthoPt4Pa2Rm3
hg19.snp146Seq
 
Thanks!
</pre>
 
The process can take quite a bit of time (~4 hours or more for the tables on a single assembly), so similar to the push request for the /gbdb/ files, don't send the request hoping for a quick turn-around.
 
Note that if you've pushed the tables from Dev to Beta for QA, you will need to re-push the tables to Beta after the admins have run "myisampack" on the tables.


* be sure to include a snp with chimera and ls-snp links in your testing. To find such a SNP, look in the '''lsSnpPdb''', and then check the links under "Mappings to PDB protein structures" on the track details page.
==snp###.fa files==
These files will be located in /gbdb/$db/snp/. Be sure to request the push of these files from hgwdev to hgnfs1/genome-euro well in advance of the release. These files can be upwards of 33GB (and will likely keep growing!), and it can take 12 hours to transfer that much data from UCSC to genome-euro.


* check with Galt to make sure that the current snp track will be used in genome graphs. (rsID is one of the accepted [http://genome.ucsc.edu/goldenPath/help/hgGenomeHelp.html#Format genome graphs formats].) When rsIDs are entered, their positions are looked up in the snp table. Galt said in April 2011 that currently: "It starts looking for snp134, then snp133, ... down to about snp125. The first one it finds that exists in the db is returned and used for resolving user symbols that might be rsIds."
==Downloads==
If this is a human assembly, make sure there are downloads for "Masked FASTA Files (human assemblies only)" (/usr/local/apache/htdocs-hgdownload/goldenPath/hg19/snp###Mask). Check that the FASTA files are actually masked with SNPs. Look for some lines with y, w, m, etc. You can use a command like:
  zcat chr1.subst.fa.gz | head -300 | grep -i '[^atgcn]'


==GWAS Catalog==
Ping Jonathan (or whoever is currently in charge of the GWAS Catalog track) to ask about updating the snpTrack and snpVersion parameters in trackDb to include the most recent version numbers.


==AT RELEASE BE SURE TO:==
* make the old SNP track (if any) hidden by default, and check to see if any old SNP tracks should be dropped from the RR.
* Make the old SNP track (if any) hidden by default, and check to see if any old SNP tracks should be dropped from the RR.
* announce the release on genome-announce.
* Announce the release on genome-announce.
* when you push the snp-masked downloads, add a link to downloads.html.
* When you push the snp-masked downloads, add a link to downloads.html.
* Once all of the snp###* tables are on mysqlrr, they should then be automatically pushed to hgdownload (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/) on the next upcoming Sunday. Once on hgdownload, genome-mysql should sync nightly.  


[[Category:Browser QA tracks]]
[[Category:Browser QA]]

Latest revision as of 00:28, 17 August 2022



This page is no longer relevant since the SNP track is now using bigBeds.


Starting with snp132, the SNP track was split into 4 tracks, announced here. Angie gave a presentation (video) on the big changes made starting with snp132. If you have any general questions about SNPs this is a good resource: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpsnpfaq.

In addition to the SNP-specific checks on this page, be sure to follow the new track checklist, too.

MySQL tables

  • If possible, compare to old SNP tables for this species. Look for big jumps in the number of different func types of SNPs as well as the number and type of exceptions. Here's an example:
mysql> SELECT COUNT(DISTINCT (func)) FROM snp146;
  • Look for SNPs that may have been erroneously dropped between the old and new versions. You can get a list of all new and dropped rsIDs by comparing the name column of the two tables. For example:
$ hgsql -Ne "SELECT name FROM snp146" hg19 | sort -u > hg19.snp146.names
$ hgsql -Ne "SELECT name FROM snp144" hg19 | sort -u > hg19.snp144.names
$ diff hg19.snp144.names hg19.snp146.names > hg19.snp144.snp146.diffs
  • Check that func types in snp### table are documented in the html, both in the methods and in the check boxes. If there are new func types, make sure they are displayed correctly in the browser.
  • Check that the values in the weights column of snp### are only 1, 2, or 3.
  • Look at the snp###ExceptionDesc table:
    • Make sure the exception messages show up on hgc pages
    • Make sure the descriptions make sense
    • Spot-check the counts. You can also compare these exceptions to those from the previous SNP release. To do so, you can use commands like:
$ hgsql -Ne "select exception, count from snp146ExceptionDesc" hg19 | sort > hg19.snp146ExceptionDesc.counts
$ hgsql -Ne "select exception, count from snp144ExceptionDesc" hg19 | sort > hg19.snp144ExceptionDesc.counts
$ join -j 1 hg19.snp144ExceptionDesc.counts hg19.snp146ExceptionDesc.counts | column -t -s ' '

The output of these commands will be a three column file, where column 1 is the exception name, column 2 is the count of that exception from the old SNP track, and column 3 is the count from the new SNP track. For example:

AlleleFreqSumNot1           1842     1752
DuplicateObserved           591680   390122
FlankMismatchGenomeEqual    182246   222451
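
The joined counts can also be screened automatically rather than by eye. The following is only a minimal sketch, not part of the established QA procedure: it reads the three-column output of the join command above on stdin and prints the exceptions whose count changed by more than a chosen factor, with the new/old ratio appended. The function name flag_exception_jumps and the default factor of 1.5 are illustrative assumptions.

```shell
# Hypothetical helper: flag exceptions whose count changed by more than
# a given factor between the old (column 2) and new (column 3) release.
# Input: whitespace-separated "exception oldCount newCount" lines on stdin.
flag_exception_jumps() {
    factor="${1:-1.5}"   # assumed threshold; tune it for the release at hand
    awk -v f="$factor" '
        NF == 3 && $2 > 0 && $3 > 0 {
            ratio = $3 / $2
            if (ratio > f || ratio < 1 / f)
                printf "%s\t%d\t%d\t%.2f\n", $1, $2, $3, ratio
        }'
}

# Example (file names as in the commands above):
#   join -j 1 hg19.snp144ExceptionDesc.counts hg19.snp146ExceptionDesc.counts \
#       | flag_exception_jumps 1.5
```

Keep in mind that join only reports exceptions present in both files, so an exception type that appears or disappears between releases will not show up here and still needs a manual look.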

Settings

  • Make sure that there are no values in the snp### table that are not represented in the track's filter controls.
    • Note that the controls in the cgi are hard coded, so you may find that controls exist for values not present in the snp### table.
    • Here's an example (repeat 4x where $field is class, valid, func, molType):
mysql> SELECT DISTINCT($field) FROM snp###;
  • Try some table browser queries that use the checkboxes on the filter page.
  • Try turning on selected gene tracks on the track controls page, and make sure results show up in the "UCSC's predicted function relative to selected gene tracks:" section.
  • Try the various color options.
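
To avoid typing the DISTINCT query four times, the statements can be generated in a loop. This is a sketch under assumptions: snp_filter_queries is a hypothetical helper name, and the table name passed in (snp146 in the example) stands for whichever SNP table is under QA. The four field names are the ones listed above.

```shell
# Hypothetical helper: print one DISTINCT query per filterable field
# (class, valid, func, molType) for the given SNP table.
snp_filter_queries() {
    table="$1"
    for field in class valid func molType; do
        printf 'SELECT DISTINCT(%s) FROM %s;\n' "$field" "$table"
    done
}

# Example: run each generated query with the same hgsql invocation style
# used elsewhere on this page.
#   snp_filter_queries snp146 | while read -r q; do hgsql -Ne "$q" hg19; done
```

The values returned for each field can then be compared against the checkboxes on the track controls page.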

Details

  • Check the following types of SNPs:
    • A SNP with Orthologous alleles in Chimp/Rhesus/Orangutan. Only done for SNP tracks on human assemblies. Can be found with a simple hgsql -e "SELECT * FROM snp146OrthoPt4Pa2Rm3 LIMIT 1\G" hg38.
    • A SNP with coding annotations by dbSNP. Find one to check by doing hgsql -e "SELECT * FROM snp146CodingDbSnp LIMIT 1\G" hg38.
    • A SNP with chimera and ls-snp links in your testing. While the lsSnpPdb table is quite outdated (last updated 2010-12-03), these linkouts still show up on details pages for SNPs in hg19, so it is useful to check that the linkouts still work as expected. To find a SNP in the lsSnpPdb table that is still in the current release, you can use a MySQL command like the following: select chrom,chromStart,chromEnd,name,pdbId from snp146,lsSnpPdb where snpId = name limit 10;
    • (In any of the above examples, you can replace snp146 with whatever SNP track you may be QA-ing.)
  • Make sure the tracks "turn off" when viewing large regions (these tables are too large to load for large regions). Note that this doesn't apply to some of the smaller SNP tracks, such as the "Flagged" track, which for hg38.snp146Flagged has only ~150,000 items.
  • Pay special attention to the PAR regions and haplotype chromosomes. We shouldn't necessarily exclude snps that map to more than one position from sets that are uniquely mapped.

Description

  • Note that most of the description is the same on all 4 snp tracks -- this is kept in one file (snp132.shared.html for snp132) that is included in the other html with a line like:
 <!--#insert file="sharedText.html"-->
  • Make sure the source files listed are correct and exist at NCBI.
  • Make sure that "Interpreting and Configuring the Graphical Display" section mentions all func types that you can select in TrackUi.

Ask Admins to "myisampack" Tables

As the SNP tables grow and grow, we have started to ask the admins to "myisampack" the tables before release to reduce their final size. After the tables are looking good on Dev (or Beta if that's where you've done your QA), then you can send a request to the cluster admins requesting that they "myisampack" the tables on Dev. You can also cc the engineer in charge of these tables on this email as well. Here's an example email:

Hello Cluster Admin,

Can you myisampack then flush the following tables on hgwdev:

hg19.snp146
hg19.snp146CodingDbSnp
hg19.snp146Common
hg19.snp146ExceptionDesc
hg19.snp146Flagged
hg19.snp146Mult
hg19.snp146OrthoPt4Pa2Rm3
hg19.snp146Seq

Thanks!

The process can take quite a bit of time (~4 hours or more for the tables on a single assembly), so similar to the push request for the /gbdb/ files, don't send the request hoping for a quick turn-around.

Note that if you've pushed the tables from Dev to Beta for QA, you will need to re-push the tables to Beta after the admins have run "myisampack" on the tables.
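
The table list for the request email can be built from the database instead of typed by hand. This is a rough sketch: prefix_tables is a hypothetical helper that reads table names on stdin and prepends the database name, and the SHOW TABLES pattern in the usage comment assumes that every table for the release starts with the same snp### prefix, as in the example email above.

```shell
# Hypothetical helper: prefix each table name read from stdin with "db."
# so the output can be pasted into the myisampack request email.
prefix_tables() {
    db="$1"
    sed "s/^/${db}./"
}

# Example (assumes all release tables share the snp146 prefix):
#   hgsql -Ne "SHOW TABLES LIKE 'snp146%'" hg19 | prefix_tables hg19
```

Review the generated list before sending it, since the pattern will also pick up any unrelated table that happens to share the prefix.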

snp###.fa files

These files will be located in /gbdb/$db/snp/. Be sure to request the push of these files from hgwdev to hgnfs1/genome-euro well in advance of the release. These files can be upwards of 33GB (and will likely keep growing!), and it can take 12 hours to transfer that much data from UCSC to genome-euro.

Downloads

If this is a human assembly, make sure there are downloads for "Masked FASTA Files (human assemblies only)" (/usr/local/apache/htdocs-hgdownload/goldenPath/hg19/snp###Mask). Check that the FASTA files are actually masked with SNPs. Look for some lines with y, w, m, etc. You can use a command like:

  zcat chr1.subst.fa.gz | head -300 | grep -i '[^atgcn]'
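
Note that the grep pattern above also matches the FASTA header line (for example ">chr1"), so a hit does not by itself prove that the sequence is masked. A possible variant, offered as a sketch only, skips header lines and counts the sequence lines that contain something other than A, C, G, T or N. The function name count_masked_lines is a hypothetical choice, and unlike the command above it scans the whole file instead of the first 300 lines.

```shell
# Hypothetical helper: count sequence lines in a gzipped FASTA file that
# contain at least one character other than A, C, G, T or N (any case).
# Header lines (starting with ">") are skipped before counting.
count_masked_lines() {
    gzip -dc "$1" | grep -v '^>' | grep -ci '[^acgtn]'
}

# Example:
#   count_masked_lines chr1.subst.fa.gz
```

A count of 0 for a chromosome known to carry SNPs would suggest that the file was not masked.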

GWAS Catalog

Ping Jonathan (or whoever is currently in charge of the GWAS Catalog track) to ask about updating the snpTrack and snpVersion parameters in trackDb to include the most recent version numbers.

AT RELEASE BE SURE TO:

  • Make the old SNP track (if any) hidden by default, and check to see if any old SNP tracks should be dropped from the RR.
  • Announce the release on genome-announce.
  • When you push the snp-masked downloads, add a link to downloads.html.
  • Once all of the snp###* tables are on mysqlrr, they should then be automatically pushed to hgdownload (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/) on the next upcoming Sunday. Once on hgdownload, genome-mysql should sync nightly.