TRF Simple Repeats
From genomewiki
Jump to navigationJump to search
Partition
The TRF/SimpleRepeats operations is similar to the Repeat Masker procedure. Partition the sequence into bits, run trf on each bit, collect the results:
mkdir -p /data/genomes/ricCom1/bed/simpleRepeat/run.cluster cd /data/genomes/ricCom1/bed/simpleRepeat/run.cluster rm -rf /data/genomes/ricCom1/TrfPart simplePartition.pl /data/genomes/ricCom1/ricCom1.unmasked.2bit 50000000 /data/genomes/ricCom1/TrfPart rm -f /data/genomes/ricCom1/bed/simpleRepeat/TrfPart ln -s /data/genomes/ricCom1/TrfPart /data/genomes/ricCom1/bed/simpleRepeat/TrfPart
Cluster run
gensub2 template file:
#LOOP ./TrfRun.csh /data/genomes/ricCom1/TrfPart/$(path1).bed #ENDLOOP
gensub2 constructs the jobList:
gensub2 /data/genomes/ricCom1/TrfPart/partitions.lst single template jobList
Typical jobList commands:
./TrfRun.csh /data/genomes/ricCom1/TrfPart/000/000.lst.bed ./TrfRun.csh /data/genomes/ricCom1/TrfPart/001/001.lst.bed ./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed ... etc.
Each partition bit is processed by the script TrfRun.csh:
#!/bin/csh -ef set finalOut = $1 set inLst = $finalOut:r set inLft = $inLst:r.lft # Use local disk for output, and move the final result to $finalOut # when done, to minimize I/O. set tmpDir = `mktemp -d -p /scratch/tmp doSimpleRepeat.cluster.XXXXXX` pushd $tmpDir foreach spec (`cat $inLst`) # Remove path and .2bit filename to get just the seq:start-end spec: set base = `echo $spec | sed -r -e 's/^[^:]+://'` # If $spec is the whole sequence, twoBitToFa removes the :start-end part, # which causes liftUp to barf later. So tweak the header back to # seq:start-end for liftUp's sake: twoBitToFa $spec stdout \ | sed -e "s/^>.*/>$base/" \ | trfBig -trf=/cluster/bin/trf stdin /dev/null -bedAt=$base.bed -tempDir=/scratch/tmp end # Due to the large chunk size, .lft files can have thousands of lines. # Even though the liftUp code does &lineFileClose, somehow we still # run out of filehandles. So limit the size of liftSpecs: split -a 3 -d -l 500 $inLft SplitLft. # Lift up: foreach splitLft (SplitLft.*) set bedFiles = `awk '{print $2 ".bed"};' $splitLft` endsInLf -zeroOk $bedFiles cat $bedFiles \ | liftUp -type=.bed tmpOut.$splitLft $splitLft error stdin end cat tmpOut.* > tmpOut__bed # endsInLf is much faster than using para to {check out line}: endsInLf -zeroOk tmpOut* # Move final result into place: mv tmpOut__bed $finalOut popd rm -rf $tmpDir