DoSameSpeciesLiftOver.pl: Difference between revisions
(note about bin/blat) |
(add NCBI assembly link) |
||
Line 74: | Line 74: | ||
This example is going to use two Soy bean assemblies, one from '''genbank''' and one from '''refseq''' | This example is going to use two Soy bean assemblies, one from '''genbank''' and one from '''refseq''' | ||
assemblies at NCBI. | assemblies at [https://www.ncbi.nlm.nih.gov/assembly NCBI]. | ||
mkdir -p /data/genomes/genbank /data/genomes/refseq | mkdir -p /data/genomes/genbank /data/genomes/refseq |
Revision as of 22:31, 25 April 2018
Licensing
For commercial use of these toolsets, please note the license considerations for the kent source tools at the: Genome Store
This process also uses the blat command. For commercial license please see: Kent Informatics
Prerequisites
This discussion assumes you are familiar with Unix shell command line programming and scripting. You will be encountering and interacting with csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multiple compute cluster system or equivalent in a cloud computing environment.
This entire discussion assumes the bash shell is the user's unix shell.
Parasol Job Control System
The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.
Install scripts and kent command line utilities
This is a bit of a kludge at this time (April 2018), we are working on a cleaner distribution of these scripts. As was mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts are going to reside in /data/bin/ and /data/scripts/. This is merely a style custom to keep scripts separate from binaries, this is not strictly necessary to keep them separate.
mkdir -p /data/scripts /data/bin chmod 755 /data/scripts /data/bin rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/ git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \ --prefix=kent/ HEAD src/hg/utils/automation \ | tar vxf - -C /data/scripts --strip-components=5 \ --exclude='kent/src/hg/utils/automation/incidentDb' \ --exclude='kent/src/hg/utils/automation/configFiles' \ --exclude='kent/src/hg/utils/automation/ensGene' \ --exclude='kent/src/hg/utils/automation/genbank' \ --exclude='kent/src/hg/utils/automation/lastz_D' \ --exclude='kent/src/hg/utils/automation/openStack' wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/utils/bedSingleCover.pl'
PATH setup
Add or verify the two directories /data/bin and /data/scripts are added to the shell PATH environment. This can be added simply to the .bashrc file in your home directory:
echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc
Note: /data/bin/blat has been added to the PATH for access to the blat command. Then, source that file to add that to this current shell:
. $HOME/.bashrc
Verify you see those pathnames on the PATH variable:
echo $PATH /data/bin:/data/bin/blat:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
Obtain genome sequences
This example is going to use two Soy bean assemblies, one from genbank and one from refseq assemblies at NCBI.
mkdir -p /data/genomes/genbank /data/genomes/refseq cd /data/genomes/genbank rsync -L -a -P \ rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Glycine_max/all_assembly_versions/GCA_000004515.3_Glycine_max_v2.0/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz ./ cd /data/genomes/refseq rsync -L -a -P \ rsync://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Glycine_max/all_assembly_versions/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz ./ cd /data/genomes ls -og genbank/*.gz refseq/*.gz -r--r--r--. 1 296231629 Jun 13 2016 genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz -r--r--r--. 1 296228780 Jan 6 2015 refseq/GCF_000004515.3_V1.1_genomic.fna.gz
Working directories
Organize your work in a directory hierarchy for convenience of bookeeping and shell script automation for numerous sequences.
The target sequence name is GCA_000004515.3_Glycine_max_v2.0 and the query sequence name is GCF_000004515.3_V1.1.
Convert the fasta sequence to .2bit files and calculate chrom.sizes files.
mkdir /data/genomes/GCA_000004515.3_Glycine_max_v2.0 mkdir /data/genomes/GCF_000004515.3_V1.1 cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0 faToTwoBit /data/genomes/genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz GCA_000004515.3_Glycine_max_v2.0.2bit twoBitInfo GCA_000004515.3_Glycine_max_v2.0.2bit stdout | sort -k2,2nr > GCA_000004515.3_Glycine_max_v2.0.chrom.sizes ls -og -rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit -rw-rw-r--. 1 19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes cd mkdir /data/genomes/GCF_000004515.3_V1.1 faToTwoBit /data/genomes/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes ls -og -rw-rw-r--. 1 286264183 Apr 13 17:30 GCF_000004515.3_V1.1.2bit -rw-rw-r--. 1 23246 Apr 13 17:31 GCF_000004515.3_V1.1.chrom.sizes
Construct ooc file
The blat operation in this procedure works much more efficiently when a pre-computed ooc file is constructed to use for all blat comparisons. This is a file that counts up over used 11-mer tiles for blat to eliminate them from the initial consideration for alignment, thereby limiting the amount of alignment that has to take place. We base the repMatch parameter on the size of the genome compared to UCSC hg19 sequence. A genome of that size used -repMatch=1024. We want to adjust that parameter in proportion to that size. Measure the size of the target genome:
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0 twoBitToFa GCA_000004515.3_Glycine_max_v2.0.2bit stdout | faSize stdin 978416860 bases (23046524 N's 955370336 real 521703227 upper 433667109 lower) in 1189 sequences in 1 files
Note the number of real bases 955370336 to use in this proportion calculation:
calc \( 955370336 / 2861349177 \) \* 1024 ( 955370336 / 2861349177 ) * 1024 = 341.901377
Round down the answer to the nearest 50 for the -repMatch=300 use in blat:
blat GCA_000004515.3_Glycine_max_v2.0.2bit /dev/null /dev/null -tileSize=11 \ -makeOoc=GCA_000004515.3_Glycine_max_v2.0.ooc -repMatch=300 Loading GCA_000004515.3_Glycine_max_v2.0.2bit Counting GCA_000004515.3_Glycine_max_v2.0.2bit Writing GCA_000004515.3_Glycine_max_v2.0.ooc Wrote 64902 overused 11-mers to GCA_000004515.3_Glycine_max_v2.0.ooc