Gene Set Summary Statistics

From genomewiki

gene sets measured

  • hg17 - knownGenes version 2
  • hg18 - knownGenes version 3
  • mm8 - knownGenes version 2
  • mm9 - knownGenes version 3

The min, max and mean measurements are per gene

summary of gene and exon counts

db     gene count   total exon count   min exon count   max exon count   mean exon count
hg17   39368        405720             1                149              10
hg18   56722        519308             1                2899             9
hg19   82960        742493             1                5065             9
mm8    31863        314628             1                313              10
mm9    49220        417114             1                610              8

summary of exon size statistics

db     sum exon sizes   min exon size   max exon size   mean exon size
hg17   106839720        1               18172           263
hg18   146371091        1               36861           282
hg19   221924089        1               205012          299
mm8    83159087         4               17497           264
mm9    117671086        1               29698           282
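
The mean columns in the two tables above can be cross-checked from the other columns: mean exon size is the sum of exon sizes divided by the total exon count, and mean exon count is the total exon count divided by the gene count. The sketch below is only an illustration of that arithmetic using the hg17 figures; the function name mean_rounded is hypothetical and is not part of the original method.

```shell
#!/bin/sh
# mean_rounded SUM COUNT
# Print SUM / COUNT rounded to the nearest integer.
# (Hypothetical helper, used here only to cross-check the tables.)
mean_rounded() {
    awk -v sum="$1" -v count="$2" 'BEGIN {
        printf "%d\n", int(sum / count + 0.5)
    }'
}

# hg17: sum of exon sizes / total exon count
mean_rounded 106839720 405720    # prints 263
# hg17: total exon count / gene count
mean_rounded 405720 39368        # prints 10
```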

summary of intron size statistics

db     sum intron sizes   min intron size   max intron size   mean intron size
hg17   2223224397         6                 1096450           6069
hg18   2784923600         1                 1047320           6023
hg19   4127555916         1                 1160410           6260
mm8    1476081990         9                 1347550           5220
mm9    2055504784         1                 1253430           5589

Top five exon count genes

db   gene name (exon count)
hg17 NM_004543 (149) AF535142 (146) AF535142 (146) NM_033071 (146) AF495910 (146)
hg18 uc001yrq.1 (2899) uc002zvw.1 (322) uc002umr.1 (313) uc002stk.1 (217) uc002umt.1 (194)
mm8 NM_011652 (313) NM_028004 (192) NM_007738 (118) NM_134448 (99) DQ067088 (99)
mm9 uc007pgj.1 (610) uc008kfn.1 (313) uc008kfo.1 (192) uc008jqv.1 (157) uc009rrh.1 (118)

Top five largest CDS extent genes

db   gene name (CDS extent size: thickEnd-thickStart)
hg17 NM_014141 (2298740) NM_000109 (2217347) CR749820 (2138880) NM_004006 (2089394) X14298 (2089394)
hg18 uc003weu.1 (2298740) uc004ddb.1 (2217347) uc001pak.1 (2138880) uc004dda.1 (2089394) uc003wqd.1 (2055833)
hg19 uc021ott.2 (2307732) uc003weu.2 (2298740) uc004ddb.1 (2217347) uc001pak.2 (2138880) uc004dda.1 (2089394)
mm8 NM_007868 (2253366) NM_001004357 (2238304) NM_053011 (2055883) AK134694 (1988713) NM_053171 (1639258)
mm9 uc009tri.1 (2253366) uc009bst.1 (2238325) uc007zfr.1 (2189582) uc008jon.1 (2055883) uc008mpv.1 (1988713)

Top five smallest transcript genes

db   gene name (transcript size: txEnd-txStart)
hg17 AF241539 (168) AF277175 (176) AY459291 (240) AY605064 (243) AF503918 (258)
hg18 uc004buj.1 (20) uc001dcm.1 (22) uc001seo.1 (22) uc001sqn.1 (22) uc002wpa.1 (22)
hg19 uc031pxj.1 (19) uc021qzo.1 (19) uc021pfi.1 (20) uc021oot.1 (20) uc021qbd.1 (20)
mm8 AJ319753 (217) BC107019 (231) BC016221 (286) NM_130876 (303) NM_130873 (304)
mm9 uc007bma.1 (22) uc007gmr.1 (22) uc007khz.1 (22) uc007pay.1 (22) uc007qpn.1 (22)

Custom Track of Small Exons and Introns

Custom track: Hg18 small exons and introns on the UCSC Genes track

These are exons of size less than 22 bases, and introns of size less than 12 bases. The score column contains the size and thus you can filter smaller subsets via the score column in the table browser.

These small exons and introns exist to preserve the coding frame boundaries found in the mRNAs when they are mapped onto the reference genome coordinates.
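
A track of this kind could be produced from the exon and intron bed files described under Methods. The following is a minimal sketch, not the command actually used to build the custom track: the function name small_features is hypothetical, and it assumes six-column bed input (chrom, start, end, name, score, strand) separated by tabs or spaces. It keeps features shorter than a given limit and writes the size into the score column.

```shell
#!/bin/sh
# small_features FILE LIMIT
# Print bed features shorter than LIMIT bases, replacing the score
# (column 5) with the feature size. Hypothetical helper; assumes
# six-column bed input.
small_features() {
    awk -v limit="$2" 'BEGIN { OFS = "\t" }
    {
        size = $3 - $2    # bed coordinates are half-open: size = end - start
        if (size < limit) print $1, $2, $3, $4, size, $6
    }' "$1"
}

# Example use, matching the limits quoted above:
#   small_features exons.bed 22
#   small_features introns.bed 12
```

Because the size is stored in the score column, smaller subsets can be selected later with an ordinary score filter.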

Histogram graphs

Hg17 hg18.exonCount.png (the caption on this graph is incorrect: the X axis is Exon Count per Gene)

Mm8 mm9.exonCount.png (the caption on this graph is incorrect: the X axis is Exon Count per Gene)

Hg17 hg18.exonSize.png

Mm8 mm9.exonSize.png

Hg17 hg18 exonsTo300.png

Mm8 mm9 exonsTo300.png

Hg17 hg18.intronsTo170.png

Mm8 mm9.intronsTo170.png

Hg17 hg18.intronSize.png

Mm8 mm9.intronSize.png

Methods

  • From the table browser, request three different bed files for the knownGenes track:
  1. whole gene
  2. exons only
  3. introns only
  • From those bed files, stats can be extracted
  1. gene count from: 'wc -l wholeGene.bed'
  2. exon count stats from:
 STATS=`ave -col=10 wholeGene.bed -tableOut | grep -v "^#"`
 MIN=`echo $STATS | cut -d' ' -f1`
 MAX=`echo $STATS | cut -d' ' -f5`
 MEAN=`echo $STATS | cut -d' ' -f6 | awk '{printf "%d", $1+0.5}'`
 COUNT=`echo $STATS | cut -d' ' -f8 | awk '{printf "%d", $1}'`
  • for exon or intron size stats (set BED to exons.bed or introns.bed and run once for each file):
 BED=exons.bed
 STATS=`awk '{print $3-$2}' $BED \
      | ave -col=1 stdin -tableOut | grep -v "^#"`
 MIN=`echo $STATS | cut -d' ' -f1`
 MAX=`echo $STATS | cut -d' ' -f5 | awk '{printf "%d", $1}'`
 MEAN=`echo $STATS | cut -d' ' -f6 | awk '{printf "%d", $1+0.5}'`
 SUM_SIZE=`awk '{sum += $3-$2} END{printf "%d", sum}' $BED`
  • top five exon count genes
sort -k10nr wholeGene.bed | head -5
  • top five CDS size genes
awk '{cdsSize=$8-$7
if (cdsSize > 0) {printf "%s\t%s\t%s\t%s\t%d\n", $1,$2,$3,$4,cdsSize}
}' wholeGene.bed | sort -k5nr | head -5
  • top five smallest transcript genes
awk '{size=$3-$2
if (size > 0) {printf "%s\t%s\t%s\t%s\t%d\n", $1,$2,$3,$4,size}
}' wholeGene.bed | sort -k5n | head -5
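
The size statistics above depend on the ave utility. Where that tool is not installed, the same minimum, maximum, rounded mean, and sum can be computed with awk alone. This is a sketch under that assumption, not the method used for the tables on this page; the function name size_stats and its output layout are hypothetical.

```shell
#!/bin/sh
# size_stats FILE
# Print min, max, rounded mean, sum and count of feature sizes
# (end - start) for a bed file, tab-separated on one line.
# Hypothetical helper; prints nothing for an empty file.
size_stats() {
    awk '{
        size = $3 - $2
        if (NR == 1 || size < min) min = size
        if (NR == 1 || size > max) max = size
        sum += size
    }
    END {
        # %.0f is used for the sum because some awk versions
        # clamp large values when formatting with %d
        if (NR > 0)
            printf "%d\t%d\t%d\t%.0f\t%d\n", min, max, int(sum / NR + 0.5), sum, NR
    }' "$1"
}

# Example use:
#   size_stats exons.bed
#   size_stats introns.bed
```

Rounding with int(x + 0.5) mirrors the $1+0.5 rounding used for MEAN in the commands above.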