Assembly QA Part 1 DEV Steps
Revision as of 22:11, 8 November 2016
This page is currently a draft in progress. For now, use Releasing an assembly instead.
Dev 1.0. Getting started in hive
During this assembly release process, you will be generating a lot of output, and you'll need a place to put everything. The use of the "hive" directory is encouraged as the best location because of ample space.
Dev 1.1. Make a directory in your hive
mkdir /hive/users/userName/assemblies/assemblyName
e.g.: mkdir /hive/users/cath/assemblies/manPen1
Dev 1.2. Optional: Create an alias to your new dir
When you add an alias to your .bashrc file, you can type that alias on the command line as a shortcut for the associated command. A "shortcut" alias gives you fast access to your hive directory for this assembly.
To do this, follow the steps below:
- In your terminal, connect to hgwdev and type "cd" (go to your home directory).
- Confirm the location of .bashrc. Type "ls -a" in your home directory to list all files, including hidden files whose names begin with ".", and confirm that your .bashrc file is there.
- Open your .bashrc file for editing. If you're using the vi editor, you can type "vi .bashrc" to edit the file. Add an alias by typing in the line below, then save your changes.
alias hive='cd /hive/users/yourUserName/assemblies/yourAssembly'
e.g., alias hive='cd /hive/users/cath/assemblies/manPen1'
Dev 2.0: Getting started in Redmine
- Find your assembly in the associated Assembly Redmine ticket.
- If there is no Redmine ticket for your assembly, you should create one.
- If a Redmine ticket for your assembly already exists, read through it carefully.
- Report any issues found during the QA process in the Redmine ticket.
Dev 2.1. Redmine: Set 'assignee' as yourself
Dev 2.2. Redmine: Set the engineer as "watcher"
Dev 3.0: Check "minimal browser" criteria
Under construction: Jairo
Does this assembly have the required tracks?
Visit this page to check that the assembly contains the required tracks to be considered a minimal browser on the RR.
To add explanation: GenBank mRNAs & ESTs (/cluster/data/genbank/data/organism.lst), Blat servers
Dev 4.0: Getting started in the PushQ
Dev 4.1. PushQ: Set 'assignee' as yourself
- Find your assembly in the PushQ
- Click on the link in the "Queue ID" column
- Click the "lock" button at the top of the page to "unlock" the fields for editing.
- Add your name to the "Reviewer" column.
- Press the "Submit" button to save your edits.
Dev 4.2. PushQ: Check validity of alignment tables
section under construction - cath
This step ensures that all associated alignment (chain/net/liftOver) files are to other VALID assemblies on the RR.
- Go to the pushQ and click on the "Gateway" link at the top of the page.
- In the Gateway Queue is a list of assemblies. Click on your assembly to enter the "Track Push Queue for yourAssembly."
- Search for Chain/Net on the page and note the file names.
- Next, go to /gbdb and see what liftOver files exist for your assembly. For example,
cd /gbdb/manPen1/liftOver
or, from the location /gbdb on hgwdev:
ls -d */liftOver/*hg38*
If your assembly has chain/net/liftOver to/from an assembly that is *not* on the RR (and not in the pushQ as an upcoming new assembly), you do not need to QA them or push them to the RR. Drop the relevant row(s) from your sub-pushQ by going to the track entry, clicking lock and then clicking the delete button.
Below is a description of alignment files, for example, human (hg38) and lizard (anoCar2).
Hiram has suggested reading this file to learn more about the file types:
less /hive/data/genomes/hg38/bed/lastzAnoCar2.2015-02-05/axtChain/netChains.csh
- LiftOver Files
- A liftOver file is a chain file; it is a subset of all chains used in creating the net file.
- hg38ToAnoCar2.over.chain.gz: These files are required for the liftOver utility.
- The file names reflect the assembly conversion data they contain, in the format <db1>To<Db2>.over.chain.gz. For example, a file named hg38ToAnoCar2.over.chain.gz contains the liftOver data needed to convert hg38 coordinates to the anoCar2 assembly.
- Chain Files
- Chain files contain all possible chains, before a subset of "best" chains is filtered into the liftOver file.
- hg38.anoCar2.all.chain.gz: chained blastz alignments.
- The chain format is described on the chain help page (http://genome.ucsc.edu/goldenPath/help/chain.html).
- Net Files
- hg38.anoCar2.net.gz: "net" file.
- This file describes rearrangements between the species and the best lizard match to any part of the human genome. The net format is described on the net help page.
- Axt Files
- hg38.anoCar2.net.axt.gz: chained and netted alignments.
- These are the best chains in the human genome, with gaps in the best chains filled in by next-best chains where possible. The axt format is described on the axt help page.
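When checking which assemblies your liftOver files point to, the <db1>To<Db2>.over.chain.gz naming convention can be unpacked on the command line. The sketch below is only an illustration: liftOverTarget is a hypothetical helper name (not an existing UCSC tool), and it assumes the first database name does not itself contain the text "To".

```shell
# Sketch: print the target assembly encoded in a liftOver file name of the
# form <db1>To<Db2>.over.chain.gz.
# liftOverTarget is a hypothetical helper, not part of the kent tools.
# Assumption: <db1> does not itself contain the text "To".
liftOverTarget() (
    # Strip the directory and the fixed suffix: hg38ToAnoCar2
    base=$(basename "$1" .over.chain.gz)
    # Keep the text after the first "To": AnoCar2
    target=${base#*To}
    # Lower-case the first letter to recover the database name: anoCar2
    first=$(printf '%s' "$target" | cut -c1 | tr '[:upper:]' '[:lower:]')
    rest=$(printf '%s' "$target" | cut -c2-)
    printf '%s%s\n' "$first" "$rest"
)

liftOverTarget hg38ToAnoCar2.over.chain.gz    # prints: anoCar2
```

The printed name can then be checked by hand against the assemblies on the RR and in the pushQ.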
Dev 4.3. PushQ: Copy sub-pushQ tracks to your track checklist
NEED TO ADD STEPS HERE. Not part of assembly checklist template yet. Will QA tracks as the very last DEV step before moving on to BETA steps.
Dev 5.0: Compare Chrom Sizes
CHRIS V TO EDIT
- Ignore this if assembly is the first for a species.
- For a new assembly version, compare the chrom sizes from the last assembly to this new assembly version. You are not checking annotations on the reference sequence, you are just checking the number of base pairs per chrom/contig, and making sure that nothing has changed drastically (i.e., millions of base pairs different). Also take a look for general differences, such as chrom labels or number of chrom/contigs.
- Output the chrom sizes for the old and new assemblies into two files and sort each file; the commands are shown below.
- Compare the sorted files.
Add note about viewing http://genome.ucsc.edu/cgi-bin/hgGateway and clicking on "View Sequences" button - bring up 2 windows side by side
hgwdev > hgsql -Ne "select chrom, size from chromInfo" $oldDb > oldChromSizes
  ($oldDb is the old assemblyName, e.g., "panTro4")
hgwdev > hgsql -Ne "select chrom, size from chromInfo" $newDb > newChromSizes
  ($newDb is the new assemblyName, e.g., "panTro5")
hgwdev > sdiff -s oldChromSizes newChromSizes
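For assemblies with many contigs, the sdiff output can be long. As a rough sketch (not part of the official checklist), the two files can also be summarized with awk so that only large size changes and chroms present in one file are reported. compareChromSizes is a hypothetical helper name, and the sketch assumes each file has one tab-separated "chrom size" pair per line, which is what hgsql -N prints.

```shell
# Sketch: report chroms whose size changed by at least a given number of
# bases, plus chroms present in only one of the two files.
# compareChromSizes is a hypothetical helper, not an existing UCSC tool.
# Assumption: both files hold tab-separated "chrom<TAB>size" lines.
# Usage: compareChromSizes oldChromSizes newChromSizes 1000000
compareChromSizes() {
    awk -F'\t' -v min="$3" '
        NR == FNR { old[$1] = $2; next }      # first file: remember old sizes
        {
            seen[$1] = 1
            if (!($1 in old)) { print "new-only\t" $1 "\t" $2; next }
            diff = $2 - old[$1]
            if (diff < 0) diff = -diff        # absolute difference in bases
            if (diff >= min) print "changed\t" $1 "\t" old[$1] "\t" $2
        }
        END {
            # chroms that were in the old file but never seen in the new one
            for (c in old) if (!(c in seen)) print "old-only\t" c "\t" old[c]
        }
    ' "$1" "$2" | LC_ALL=C sort
}
```

Each output line starts with "changed", "new-only", or "old-only", so changes in chrom labels or in the number of chroms/contigs show up alongside size differences.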
Dev 6.0: Gateway Page Checks
Dev 6.1. Check default position
Dev 6.3. Organism image check
Dev 6.4. Accession ID check
Assemblies/sequences, from various organizations, are submitted to the mother ship GenBank.
Those assemblies might be included in RefSeq if criteria are met.
The QA check is to go to NCBI and double-check that the accession ID is correct.
- RefSeq assemblies:
- use accession ID: GCF_000002315.4 (e.g., galGal5)
- are delivered with chrMt (if it exists)
- are delivered with NCBI gene predictions
- GenBank assemblies:
- use accession ID: GCA_000001305.2
- are delivered without a chrMt
- do not have gene predictions
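The GCF_/GCA_ prefixes listed above are enough to tell the two kinds of accession apart at a glance. A minimal sketch, using a hypothetical helper name (accessionSource), is shown here; it only looks at the prefix and does not replace checking the accession at NCBI.

```shell
# Sketch: classify an assembly accession by its prefix.
# accessionSource is a hypothetical helper, not an existing UCSC or NCBI tool.
accessionSource() {
    case "$1" in
        GCF_*) echo "RefSeq" ;;     # e.g., GCF_000002315.4 (galGal5)
        GCA_*) echo "GenBank" ;;    # e.g., GCA_000001305.2
        *)     echo "unknown" ;;    # anything else needs a closer look
    esac
}

accessionSource GCF_000002315.4    # prints: RefSeq
```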
For the UCSC Genome Browser, it is preferable to use RefSeq assemblies (in part due to 'more data'). This is a "learn as we go" direction; historically GenBank was preferred.
Helpful article: Nature, 2012 A beginner's guide to eukaryotic genome annotation
Dev 7.0: md5sum Checks
Under construction by Jairo
Dev 7.1. bigZips: md5sum
Dev 7.2. bigZips: readme
Dev 7.3. bigZips: corruption
Dev 7.4. database: readme
Dev 7.5. liftOver: md5sum
Dev 7.6. liftOver: README
Dev 7.7. liftOver: corruption
Dev 8.0: liftOver files exist?
Dev 8.1. liftOver: new-to-old
Dev 8.2. liftOver: old-to-new
Dev 9.0: downloads permissions check
Dev 10.0: Do Track QA for all relevant tracks
- Follow the New Track Checklist on the wiki.
- NOTE TO SELF - Explain which tracks don't need checking, this is confusing for new employees.
🔵 Done with DEV steps? Go to Assembly QA Part 2: BETA Steps