TRF Simple Repeats: Difference between revisions
(initial contents) |
|||
(4 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
==Partition== | ==Partition== | ||
The TRF program can be obtained from [[http://tandem.bu.edu/trf/trf.html TRF/Gary Benson]] | |||
The TRF/SimpleRepeats operations is similar to the Repeat Masker procedure. Partition the sequence into bits, run trf on each bit, collect the results: | The TRF/SimpleRepeats operations is similar to the Repeat Masker procedure. Partition the sequence into bits, run trf on each bit, collect the results: | ||
mkdir -p /data/genomes/ricCom1/bed/simpleRepeat/run.cluster | mkdir -p /data/genomes/ricCom1/bed/simpleRepeat/run.cluster | ||
Line 24: | Line 26: | ||
./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed | ./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed | ||
... etc. | ... etc. | ||
Typical parasol operation: | |||
para make jobList | |||
para check | |||
para time > run.time | |||
cat run.time | |||
Completed: 8 of 8 jobs | |||
CPU time in finished jobs: 4059s 67.66m 1.13h 0.05d 0.000 y | |||
IO & Wait Time: 407s 6.78m 0.11h 0.00d 0.000 y | |||
Average job time: 558s 9.30m 0.16h 0.01d | |||
Longest finished job: 1391s 23.18m 0.39h 0.02d | |||
Submission to last job: 1395s 23.25m 0.39h 0.02d | |||
Each partition bit is processed by the script '''TrfRun.csh''': | Each partition bit is processed by the script '''TrfRun.csh''': | ||
Line 72: | Line 86: | ||
popd | popd | ||
rm -rf $tmpDir | rm -rf $tmpDir | ||
==post-process Filter== | |||
UCSC uses the repeats of period less than or equal to 12 for the masking of the genome sequence: | |||
cd /data/genomes/ricCom1/bed/simpleRepeat | |||
cat /data/genomes/ricCom1/bed/simpleRepeat/TrfPart/???/*.bed > simpleRepeat.bed | |||
endsInLf simpleRepeat.bed | |||
if ($status) then | |||
echo Uh-oh -- simpleRepeat.bed fails endsInLf. Look at /hive/data/genomes/ricCom1/bed/simpleRepeat/TrfPart/ bed files. | |||
exit 1 | |||
endif | |||
awk '{if ($5 <= 12) print;}' simpleRepeat.bed > trfMask.bed | |||
==Final 2bit masking== | |||
WIth the '''trfMask.bed''' file and the decision of whether to use the [[RepeatMasker]] or [[Window Masker]] result: | |||
twoBitMask -type=.bed ricCom1.unmasked.2bit bed/windowMasker/cleanWMask.bed.gz \ | |||
ricCom1.wmsk.sdust.2bit | |||
twoBitMask ricCom1.wmsk.sdust.2bit -add bed/simpleRepeat/trfMask.bed ricCom1.2bit | |||
# you can safely ignore the warning about fields >= 13 | |||
twoBitToFa ricCom1.2bit stdout | faSize stdin > faSize.ricCom1.2bit.txt | |||
Resulting masking: | |||
350621860 bases (13662715 N's 336959145 real 185500504 upper 151458641 lower) in 25763 sequences in 1 files | |||
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 | |||
%43.20 masked total, %44.95 masked real | |||
==Construct bigBed track files== | |||
Rearranging the simpleRepeat.bed file a bit with this perl script '''bedToBed4+.pl'''. | |||
It truncates the repeat string to 16 characters to use as the ''name'' of | |||
the item in the track: | |||
#!/usr/bin/env perl | |||
use strict; | |||
use warnings; | |||
open (FH, "<simpleRepeat.bed") or die "can not read simpleRepeat.bed"; | |||
while (my $line = <FH>) { | |||
chomp $line; | |||
my @a = split('\s+', $line); | |||
my $name = substr($a[15],0,16); | |||
printf "%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%s\t%s\n", | |||
$a[0], $a[1], $a[2], $name, $a[4], $a[5], $a[6], $a[7], $a[8], $a[9], $a[10], $a[11], $a[12], $a[13], $a[14], $a[15]; | |||
} | |||
close (FH); | |||
Sort the output of that script into the result file: | |||
./bedToBed4+.pl | sort -k1,1 -k2,2n > ricCom1.bed16.bed | |||
bedToBigBed -tab -type=bed4+12 -as=simpleRepeat.as ricCom1.bed16.bed \ | |||
../../chrom.sizes ricCom1.simpleRepeat.bb | |||
Using the source tree [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/lib/simpleRepeat.as simpleRepeat.as] definition | |||
of the fields in the track. | |||
==trackDb.txt== | |||
The track hub trackDb.txt entry: | |||
track simpleRepeat_ | |||
shortLabel Simple Repeats | |||
longLabel Simple Tandem Repeats by TRF | |||
group varRep | |||
priority 149.3 | |||
visibility dense | |||
type bigBed 4 + | |||
bigDataUrl bbi/ricCom1.simpleRepeat.bb | |||
html ../trackDescriptions/simpleRepeat | |||
==See also== | |||
Please note specific details for running each of the masking operations: | |||
* [[RepeatMasker]] | |||
* [[Window Masker]] | |||
[[Category:Technical FAQ]] | |||
[[Category:User Developed Scripts]] | |||
[[Category:Assembly/Track Hubs]] |
Latest revision as of 20:36, 2 July 2015
Partition
The TRF program can be obtained from [TRF/Gary Benson]
The TRF/SimpleRepeats operations is similar to the Repeat Masker procedure. Partition the sequence into bits, run trf on each bit, collect the results:
mkdir -p /data/genomes/ricCom1/bed/simpleRepeat/run.cluster cd /data/genomes/ricCom1/bed/simpleRepeat/run.cluster rm -rf /data/genomes/ricCom1/TrfPart simplePartition.pl /data/genomes/ricCom1/ricCom1.unmasked.2bit 50000000 /data/genomes/ricCom1/TrfPart rm -f /data/genomes/ricCom1/bed/simpleRepeat/TrfPart ln -s /data/genomes/ricCom1/TrfPart /data/genomes/ricCom1/bed/simpleRepeat/TrfPart
Cluster run
gensub2 template file:
#LOOP ./TrfRun.csh /data/genomes/ricCom1/TrfPart/$(path1).bed #ENDLOOP
gensub2 constructs the jobList:
gensub2 /data/genomes/ricCom1/TrfPart/partitions.lst single template jobList
Typical jobList commands:
./TrfRun.csh /data/genomes/ricCom1/TrfPart/000/000.lst.bed ./TrfRun.csh /data/genomes/ricCom1/TrfPart/001/001.lst.bed ./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed ... etc.
Typical parasol operation:
para make jobList para check para time > run.time cat run.time Completed: 8 of 8 jobs CPU time in finished jobs: 4059s 67.66m 1.13h 0.05d 0.000 y IO & Wait Time: 407s 6.78m 0.11h 0.00d 0.000 y Average job time: 558s 9.30m 0.16h 0.01d Longest finished job: 1391s 23.18m 0.39h 0.02d Submission to last job: 1395s 23.25m 0.39h 0.02d
Each partition bit is processed by the script TrfRun.csh:
#!/bin/csh -ef set finalOut = $1 set inLst = $finalOut:r set inLft = $inLst:r.lft # Use local disk for output, and move the final result to $finalOut # when done, to minimize I/O. set tmpDir = `mktemp -d -p /scratch/tmp doSimpleRepeat.cluster.XXXXXX` pushd $tmpDir foreach spec (`cat $inLst`) # Remove path and .2bit filename to get just the seq:start-end spec: set base = `echo $spec | sed -r -e 's/^[^:]+://'` # If $spec is the whole sequence, twoBitToFa removes the :start-end part, # which causes liftUp to barf later. So tweak the header back to # seq:start-end for liftUp's sake: twoBitToFa $spec stdout \ | sed -e "s/^>.*/>$base/" \ | trfBig -trf=/cluster/bin/trf stdin /dev/null -bedAt=$base.bed -tempDir=/scratch/tmp end # Due to the large chunk size, .lft files can have thousands of lines. # Even though the liftUp code does &lineFileClose, somehow we still # run out of filehandles. So limit the size of liftSpecs: split -a 3 -d -l 500 $inLft SplitLft. # Lift up: foreach splitLft (SplitLft.*) set bedFiles = `awk '{print $2 ".bed"};' $splitLft` endsInLf -zeroOk $bedFiles cat $bedFiles \ | liftUp -type=.bed tmpOut.$splitLft $splitLft error stdin end cat tmpOut.* > tmpOut__bed # endsInLf is much faster than using para to {check out line}: endsInLf -zeroOk tmpOut* # Move final result into place: mv tmpOut__bed $finalOut popd rm -rf $tmpDir
post-process Filter
UCSC uses the repeats of period less than or equal to 12 for the masking of the genome sequence:
cd /data/genomes/ricCom1/bed/simpleRepeat cat /data/genomes/ricCom1/bed/simpleRepeat/TrfPart/???/*.bed > simpleRepeat.bed endsInLf simpleRepeat.bed if ($status) then echo Uh-oh -- simpleRepeat.bed fails endsInLf. Look at /hive/data/genomes/ricCom1/bed/simpleRepeat/TrfPart/ bed files. exit 1 endif awk '{if ($5 <= 12) print;}' simpleRepeat.bed > trfMask.bed
Final 2bit masking
WIth the trfMask.bed file and the decision of whether to use the RepeatMasker or Window Masker result:
twoBitMask -type=.bed ricCom1.unmasked.2bit bed/windowMasker/cleanWMask.bed.gz \ ricCom1.wmsk.sdust.2bit twoBitMask ricCom1.wmsk.sdust.2bit -add bed/simpleRepeat/trfMask.bed ricCom1.2bit # you can safely ignore the warning about fields >= 13 twoBitToFa ricCom1.2bit stdout | faSize stdin > faSize.ricCom1.2bit.txt
Resulting masking:
350621860 bases (13662715 N's 336959145 real 185500504 upper 151458641 lower) in 25763 sequences in 1 files Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 %43.20 masked total, %44.95 masked real
Construct bigBed track files
Rearranging the simpleRepeat.bed file a bit with this perl script bedToBed4+.pl.
It truncates the repeat string to 16 characters to use as the name of the item in the track:
#!/usr/bin/env perl use strict; use warnings; open (FH, "<simpleRepeat.bed") or die "can not read simpleRepeat.bed"; while (my $line = <FH>) { chomp $line; my @a = split('\s+', $line); my $name = substr($a[15],0,16); printf "%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%s\t%s\n", $a[0], $a[1], $a[2], $name, $a[4], $a[5], $a[6], $a[7], $a[8], $a[9], $a[10], $a[11], $a[12], $a[13], $a[14], $a[15]; } close (FH);
Sort the output of that script into the result file:
./bedToBed4+.pl | sort -k1,1 -k2,2n > ricCom1.bed16.bed bedToBigBed -tab -type=bed4+12 -as=simpleRepeat.as ricCom1.bed16.bed \ ../../chrom.sizes ricCom1.simpleRepeat.bb
Using the source tree simpleRepeat.as definition of the fields in the track.
trackDb.txt
The track hub trackDb.txt entry:
track simpleRepeat_ shortLabel Simple Repeats longLabel Simple Tandem Repeats by TRF group varRep priority 149.3 visibility dense type bigBed 4 + bigDataUrl bbi/ricCom1.simpleRepeat.bb html ../trackDescriptions/simpleRepeat
See also
Please note specific details for running each of the masking operations: