TRF Simple Repeats: Difference between revisions

From genomewiki
Jump to navigationJump to search
(initial contents)
 
(post-process filter)
Line 24: Line 24:
  ./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed
  ./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed
  ... etc.
  ... etc.
Typical parasol operation:
para make jobList
para check
para time > run.time
cat run.time
Completed: 8 of 8 jobs
CPU time in finished jobs:      4059s      67.66m    1.13h    0.05d  0.000 y
IO & Wait Time:                  407s      6.78m    0.11h    0.00d  0.000 y
Average job time:                558s      9.30m    0.16h    0.01d
Longest finished job:            1391s      23.18m    0.39h    0.02d
Submission to last job:          1395s      23.25m    0.39h    0.02d


Each partition bit is processed by the script '''TrfRun.csh''':
Each partition bit is processed by the script '''TrfRun.csh''':
Line 72: Line 84:
  popd
  popd
  rm -rf $tmpDir
  rm -rf $tmpDir
==post-process Filter==
UCSC uses the repeats of period less than or equal to 12 for the masking of the genome sequence:
cd /data/genomes/ricCom1/bed/simpleRepeat
cat /data/genomes/ricCom1/bed/simpleRepeat/TrfPart/???/*.bed > simpleRepeat.bed
endsInLf simpleRepeat.bed
if ($status) then
  echo Uh-oh -- simpleRepeat.bed fails endsInLf.  Look at /hive/data/genomes/ricCom1/bed/simpleRepeat/TrfPart/ bed files.
  exit 1
endif
awk '{if ($5 <= 12) print;}' simpleRepeat.bed > trfMask.bed
==Final 2bit masking==
WIth the '''trfMask.bed''' file and the decision of whether to use the [[RepeatMasker]] or [[Window Masker]] result:
twoBitMask -type=.bed ricCom1.unmasked.2bit bed/windowMasker/cleanWMask.bed.gz \
          ricCom1.wmsk.sdust.2bit
twoBitMask ricCom1.wmsk.sdust.2bit -add bed/simpleRepeat/trfMask.bed ricCom1.2bit
#  you can safely ignore the warning about fields >= 13
twoBitToFa ricCom1.2bit stdout | faSize stdin > faSize.ricCom1.2bit.txt
Resulting masking:
350621860 bases (13662715 N's 336959145 real 185500504 upper 151458641 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%43.20 masked total, %44.95 masked real

Revision as of 18:50, 23 April 2013

Partition

The TRF/SimpleRepeats operations is similar to the Repeat Masker procedure. Partition the sequence into bits, run trf on each bit, collect the results:

mkdir -p /data/genomes/ricCom1/bed/simpleRepeat/run.cluster
cd /data/genomes/ricCom1/bed/simpleRepeat/run.cluster

rm -rf /data/genomes/ricCom1/TrfPart
simplePartition.pl /data/genomes/ricCom1/ricCom1.unmasked.2bit 50000000 /data/genomes/ricCom1/TrfPart
rm -f /data/genomes/ricCom1/bed/simpleRepeat/TrfPart
ln -s /data/genomes/ricCom1/TrfPart /data/genomes/ricCom1/bed/simpleRepeat/TrfPart

Cluster run

gensub2 template file:

#LOOP
./TrfRun.csh /data/genomes/ricCom1/TrfPart/$(path1).bed
#ENDLOOP

gensub2 constructs the jobList:

gensub2 /data/genomes/ricCom1/TrfPart/partitions.lst single template jobList

Typical jobList commands:

./TrfRun.csh /data/genomes/ricCom1/TrfPart/000/000.lst.bed
./TrfRun.csh /data/genomes/ricCom1/TrfPart/001/001.lst.bed
./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed
... etc.

Typical parasol operation:

para make jobList
para check
para time > run.time
cat run.time
Completed: 8 of 8 jobs
CPU time in finished jobs:       4059s      67.66m     1.13h    0.05d  0.000 y
IO & Wait Time:                   407s       6.78m     0.11h    0.00d  0.000 y
Average job time:                 558s       9.30m     0.16h    0.01d
Longest finished job:            1391s      23.18m     0.39h    0.02d
Submission to last job:          1395s      23.25m     0.39h    0.02d

Each partition bit is processed by the script TrfRun.csh:

#!/bin/csh -ef

set finalOut = $1

set inLst = $finalOut:r
set inLft = $inLst:r.lft

# Use local disk for output, and move the final result to $finalOut
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doSimpleRepeat.cluster.XXXXXX`
pushd $tmpDir

foreach spec (`cat $inLst`)
  # Remove path and .2bit filename to get just the seq:start-end spec:
  set base = `echo $spec | sed -r -e 's/^[^:]+://'`

  # If $spec is the whole sequence, twoBitToFa removes the :start-end part,
  # which causes liftUp to barf later.  So tweak the header back to
  # seq:start-end for liftUp's sake:
  twoBitToFa $spec stdout \
  | sed -e "s/^>.*/>$base/" \
  | trfBig -trf=/cluster/bin/trf stdin /dev/null -bedAt=$base.bed -tempDir=/scratch/tmp
end

# Due to the large chunk size, .lft files can have thousands of lines.
# Even though the liftUp code does &lineFileClose, somehow we still
# run out of filehandles.  So limit the size of liftSpecs:
split -a 3 -d -l 500 $inLft SplitLft.

# Lift up:
foreach splitLft (SplitLft.*)
  set bedFiles = `awk '{print $2 ".bed"};' $splitLft`
  endsInLf -zeroOk $bedFiles
  cat $bedFiles \
  | liftUp -type=.bed tmpOut.$splitLft $splitLft error stdin
end
cat tmpOut.* > tmpOut__bed

# endsInLf is much faster than using para to {check out line}:
endsInLf -zeroOk tmpOut*

# Move final result into place:
mv tmpOut__bed $finalOut

popd
rm -rf $tmpDir

post-process Filter

UCSC uses the repeats of period less than or equal to 12 for the masking of the genome sequence:

cd /data/genomes/ricCom1/bed/simpleRepeat
cat /data/genomes/ricCom1/bed/simpleRepeat/TrfPart/???/*.bed > simpleRepeat.bed
endsInLf simpleRepeat.bed
if ($status) then
  echo Uh-oh -- simpleRepeat.bed fails endsInLf.  Look at /hive/data/genomes/ricCom1/bed/simpleRepeat/TrfPart/ bed files.
  exit 1
endif
awk '{if ($5 <= 12) print;}' simpleRepeat.bed > trfMask.bed

Final 2bit masking

WIth the trfMask.bed file and the decision of whether to use the RepeatMasker or Window Masker result:

twoBitMask -type=.bed ricCom1.unmasked.2bit bed/windowMasker/cleanWMask.bed.gz \
         ricCom1.wmsk.sdust.2bit
twoBitMask ricCom1.wmsk.sdust.2bit -add bed/simpleRepeat/trfMask.bed ricCom1.2bit
#   you can safely ignore the warning about fields >= 13
twoBitToFa ricCom1.2bit stdout | faSize stdin > faSize.ricCom1.2bit.txt

Resulting masking:

350621860 bases (13662715 N's 336959145 real 185500504 upper 151458641 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%43.20 masked total, %44.95 masked real