CPG Islands

From genomewiki
Jump to navigationJump to search

UCSC uses a local copy of the CPG Island Ext software. The same as with the Genscan this works with hard masked sequence.

Hard mask

All masked sequence is converted to N's. UCSC runs this as a cluster job, although it is a simple loop if you are willing to wait. The gensub2 template is:

#LOOP
./runOne.csh $(root1) ../hardMaskedFa/$(lastDir1)/$(file1)
#ENDLOOP

The runOne.csh script:

#!/bin/csh -ef
set chrom = $1
set result = $2
twoBitToFa /data/genomes/ricCom1/ricCom1.2bit:$chrom stdout \
 | maskOutFa stdin hard $result

C-shell script syntax here to manually construct a directory hierarchy to contain all the resulting hard masked fasta files and the chr.list file to use with gensub2:

mkdir -p /data/genomes/ricCom1/bed/cpgIslands/run.hardMask
cd /data/genomes/ricCom1/bed/cpgIslands/run.hardMask
mkdir -p ../hardMaskedFa
set perDirLimit = 4000
set ctgCount = `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | wc -l`
set subDirCount = `echo $ctgCount | awk '{printf "%d", 1+$1/4000}'`
@ dirCount = 0
set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
@ perDirCount = 0
mkdir ../hardMaskedFa/$dirName
/bin/rm -f chr.list
/bin/touch chr.list
foreach chrom ( `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | cut -f1` )
  if ($perDirCount < $perDirLimit) then
    @ perDirCount += 1
  else
    @ dirCount += 1
    set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
    set perDirCount = 1
    mkdir ../hardMaskedFa/$dirName
  endif
  echo $dirName/$chrom.fa >> chr.list
end

The gensub2 constructs the jobList:

gensub2 chr.list single template jobList

Typical parasol cluster run:

para make jobList
para check
para time > run.time
cat run.time
Completed: 25763 of 25763 jobs
CPU time in finished jobs:         86s       1.43m     0.02h    0.00d  0.000 y
IO & Wait Time:                 65591s    1093.18m    18.22h    0.76d  0.002 y
Average job time:                   3s       0.04m     0.00h    0.00d
Longest finished job:               6s       0.10m     0.00h    0.00d
Submission to last job:           261s       4.35m     0.07h    0.00d

CPG Island cluster run

The gensub2 template:

#LOOP
./runCpg.csh $(root1) $(lastDir1) {check out exists results/$(lastDir1)/$(root1).cpg}
#ENDLOOP

The runCpg.csh script:

#!/bin/csh -ef
set chrom = $1
set dir = $2
set result = $3
set resultDir = $result:h
mkdir -p $resultDir
set seqFile = hardMaskedFa/$dir/$chrom.fa
set C = `faCount $seqFile | egrep -v "^#seq|^total" | awk '{print $2 - $7}'`
if ( $C > 200 ) then
    cpg_lh $seqFile > $result
else
    touch $result
endif

Typical parasol operation:

cd /data/genomes/ricCom1/bed/cpgIslands
mkdir -p results
rm -f file.list
find ./hardMaskedFa -type f > file.list
gensub2 file.list single gsub jobList
para make jobList
para check
para time > run.time
cat run.time
Completed: 25763 of 25763 jobs
CPU time in finished jobs:         73s       1.21m     0.02h    0.00d  0.000 y
IO & Wait Time:                 66879s    1114.66m    18.58h    0.77d  0.002 y
Average job time:                   3s       0.04m     0.00h    0.00d
Longest finished job:               7s       0.12m     0.00h    0.00d
Submission to last job:           353s       5.88m     0.10h    0.00d

Collect results, create bigBed file

cd /data/genomes/ricCom1/bed/cpgIslands
catDir -r results \
     | awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
     | sort -k1,1 -k2,2n > cpgIsland.bed
bedToBigBed -tab -type=bed4+6 -as=cpgIslandExt.as cpgIsland.bed \
    ../../chrom.sizes ricCom1.cpgIslandExt.bb

Using the cpgIslandExt.as table definition from the source tree

table cpgIslandExt
"Describes the CpG Islands (includes observed/expected ratio)"
   (
   string chrom;       "Reference sequence chromosome or scaffold"
   uint   chromStart;  "Start position in chromosome"
   uint   chromEnd;    "End position in chromosome"
   string name;        "CpG Island"
   uint   length;      "Island Length"
   uint   cpgNum;      "Number of CpGs in island"
   uint   gcNum;       "Number of C and G in island"
   float  perCpg;      "Percentage of island that is CpG"
   float  perGc;       "Percentage of island that is C or G"
   float  obsExp;      "Ratio of observed(cpgNum) to expected(numC*numG/length) CpG in island"
   )

Track hub trackDb.txt

track cpgIslandExt_
shortLabel CpG Islands
longLabel CpG Islands (Islands < 300 Bases are Light Green)
group regulation
priority 90
visibility pack
color 0,100,0
altColor 128,228,128
type bigBed 4 +
bigDataUrl bbi/ricCom1.cpgIslandExt.bb
html ../trackDescriptions/cpgIslandExt