VGP Assembly gap analysis
VGP assembly gap annotation vs. AGP file
I'm curious about the lack of gap annotation in the VGP genome assemblies. Looking at the AGP files supplied with the assemblies, for example:
GCA_007474595.1_mLynCan4_v1.p
the chr*.comp.agp.gz files have no references at all to the gaps in the assembly. Of the 24 assemblies I have seen to date, only one has any gap annotation in the AGP files:
GCF_900246225.1_fAstCal1.2
with the following characteristics of the annotated gaps:
Assembly ID (UCSC browser link) | assembly size (bp) | sequence count | number of gaps | total gap size (bp) | smallest gap (bp) | largest gap (bp) | median size (bp) | mean size (bp) | common name (VGP link) |
---|---|---|---|---|---|---|---|---|---|
GCF_900246225.1_fAstCal1.2 | 880,445,564 | 249 | 115 | 11,500 | 100 | 100 | 100 | 100.0 | eastern happy |
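The actual gaps in this assembly far exceed what the AGP file annotates: scanning the sequence finds 490 gaps totalling 1,186,660 bases (see the fAstCal1.2 row in the table below), against the 115 gaps and 11,500 bases listed in the AGP.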
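For anyone who wants to check the AGP side of this comparison, here is a minimal sketch of counting annotated gaps. It assumes the standard tab-separated AGP layout, where column 5 is the component type (`N` or `U` for gaps), column 6 is the gap length, and column 8 is the linkage flag (`no` for a non-bridged gap). The function names and the example lines are made up for illustration; they are not taken from any VGP file.

```python
import statistics

def agp_gaps(lines):
    """Yield (gap_length, linkage) for every gap line in AGP text."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        # Gap lines carry component type N (known size) or U (unknown size).
        if len(fields) >= 8 and fields[4] in ("N", "U"):
            yield int(fields[5]), fields[7]

def summarize(lengths):
    """Return the same statistics as the table columns for a set of gap lengths."""
    lengths = sorted(lengths)
    if not lengths:
        return {"count": 0, "total": 0}
    return {
        "count": len(lengths),
        "total": sum(lengths),
        "smallest": lengths[0],
        "largest": lengths[-1],
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
    }

# Hypothetical AGP content, for illustration only.
example_agp = [
    "##agp-version\t2.1\n",
    "chr1\t1\t1000\t1\tW\tctg1\t1\t1000\t+\n",
    "chr1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends\n",
    "chr1\t1101\t2000\t3\tW\tctg2\t1\t900\t+\n",
    "chr1\t2001\t2500\t4\tN\t500\tcontig\tno\tna\n",
]

gaps = list(agp_gaps(example_agp))
summary = summarize(length for length, _ in gaps)
non_bridged = sum(1 for _, linkage in gaps if linkage == "no")
print(summary, "non-bridged:", non_bridged)
```

For a real file, the list can be replaced with a handle from `gzip.open(path, "rt")`, where `path` points at one of the `chr*.comp.agp.gz` files.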
The table below lists the actual gaps (any run of unknown nucleotides) found in the assembly sequences. Clearly, some of these assemblies contain substantial gaps, and I suspect some of them are non-bridged.
Gap statistics
Assembly ID (UCSC browser link) | assembly size (bp) | sequence count | number of gaps | total gap size (bp) | smallest gap (bp) | largest gap (bp) | median size (bp) | mean size (bp) | common name (VGP link) |
---|---|---|---|---|---|---|---|---|---|
GCA_003957555.2_bCalAnn1_v1.p | 1,059,706,240 | 160 | 429 | 16,096,800 | 1 | 664,828 | 100 | 37,521.7 | Anna's hummingbird |
GCA_003957565.2_bTaeGut1_v1.p | 1,058,012,133 | 135 | 312 | 2,334,250 | 1 | 535,815 | 200 | 7,481.6 | zebra finch |
GCA_004027225.1_bStrHab1_v1.p | 1,165,639,803 | 100 | 362 | 27,568,500 | 1 | 9,491,740 | 500 | 76,156.1 | owl parrot |
GCA_004115265.2_mRhiFer1_v1.p | 2,075,785,400 | 135 | 158 | 7,546,100 | 1 | 833,261 | 499 | 47,760.1 | greater horseshoe bat |
GCA_007364275.1_fArcCen1 | 932,947,025 | 189 | 744 | 16,903,500 | 1 | 1,459,710 | 100 | 22,719.7 | flier cichlid |
GCA_007399415.1_rGopEvg1_v1.p | 2,298,564,209 | 383 | 562 | 29,560,300 | 1 | 3,059,780 | 100 | 52,598.4 | Goode's thornscrub tortoise |
GCA_007474595.1_mLynCan4_v1.p | 2,408,900,816 | 67 | 848 | 2,778,740 | 1 | 378,177 | 100 | 3,276.8 | Canada lynx |
GCA_900324465.2_fAnaTes1.2 | 555,641,398 | 50 | 266 | 3,606,500 | 25 | 482,892 | 100 | 13,558.3 | climbing perch |
GCA_900324485.2_fMasArm1.2 | 591,935,101 | 122 | 238 | 13,231,300 | 2 | 607,827 | 300 | 55,593.9 | zig-zag eel |
GCA_901699155.1_bStrTur1.1 | 1,178,928,410 | 357 | 894 | 3,876,370 | 2 | 156,350 | 100 | 4,336.0 | turtle dove |
GCA_901709675.1_fSynAcu1.1 | 324,331,233 | 87 | 43 | 13,700 | 100 | 500 | 200 | 318.6 | greater pipefish |
GCA_901765095.1_aMicUni1.1 | 4,685,923,413 | 1,080 | 2,452 | 47,437,300 | 1 | 2,824,530 | 100 | 19,346.4 | tiny Cayenne caecilian |
GCF_004115215.1_mOrnAna1.p.v1 | 1,858,552,590 | 305 | 522 | 15,159,800 | 1 | 1,622,070 | 100 | 29,041.8 | platypus |
GCF_004126475.1_mPhyDis1_v1.p | 2,117,764,065 | 141 | 831 | 22,966,500 | 1 | 1,480,060 | 500 | 27,637.1 | pale spear-nosed bat |
GCF_900246225.1_fAstCal1.2 | 880,445,564 | 249 | 490 | 1,186,660 | 19 | 211,000 | 25 | 2,421.8 | eastern happy |
GCF_900634415.1_fCotGob3.1 | 609,391,784 | 322 | 445 | 2,553,980 | 9 | 500,417 | 100 | 5,739.3 | channel bull blenny |
GCF_900634625.1_fParRan2.1 | 551,012,959 | 156 | 1,523 | 10,561,500 | 1 | 153,246 | 4,304 | 6,934.7 | Indian glassy fish |
GCF_900634775.1_fGouWil2.1 | 937,150,793 | 441 | 1,160 | 14,353,500 | 1 | 2,682,550 | 100 | 12,373.7 | blunt-snouted clingfish |
GCF_900700375.1_fDenClu1.1 | 567,401,054 | 460 | 464 | 4,648,390 | 13 | 518,837 | 100 | 10,018.1 | denticle herring |
GCF_900747795.1_fErpCal1.1 | 3,811,038,701 | 1,885 | 5,614 | 238,295,000 | 3 | 1,733,650 | 5,046 | 42,446.5 | reedfish |
GCF_900963305.1_fEcheNa1.1 | 544,229,245 | 38 | 140 | 599,356 | 13 | 248,230 | 100 | 4,281.1 | live sharksucker |
GCF_900964775.1_fSclFor1.1 | 784,563,014 | 72 | 145 | 41,360 | 13 | 14,756 | 100 | 285.2 | Asian bonytongue |
GCF_901000725.2_fTakRub1.2 | 384,126,662 | 128 | 402 | 3,688,530 | 10 | 428,276 | 100 | 9,175.5 | torafugu |
GCF_901001135.1_aRhiBiv1.1 | 5,319,239,201 | 1,330 | 3,573 | 33,955,800 | 4 | 771,063 | 100 | 9,503.5 | two-lined caecilian |
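
Gap lengths like those in the table above can be collected directly from the sequence with a sketch along these lines. It treats every run of `N` or `n` in a FASTA record as one gap, which is my reading of "any sequence of unknown nucleotides"; it is not necessarily the exact procedure behind the numbers here. The function name and the toy input are hypothetical.

```python
def n_run_lengths(fasta_lines):
    """Yield the length of each run of N/n bases in FASTA text.

    Runs may span line breaks within a record, but never cross
    from one sequence record into the next.
    """
    run = 0
    for line in fasta_lines:
        if line.startswith(">"):
            if run:  # close a run that reached the end of the previous record
                yield run
                run = 0
            continue
        for base in line.strip():
            if base in "Nn":
                run += 1
            elif run:
                yield run
                run = 0
    if run:  # close a run that reached the end of the input
        yield run

# Toy input, for illustration only.
example_fasta = [
    ">chr1\n",
    "ACGTNN\n",
    "NNAC\n",
    "nGT\n",
    ">chr2\n",
    "NNNA\n",
]

lengths = list(n_run_lengths(example_fasta))
print(len(lengths), sum(lengths), min(lengths), max(lengths))
```

The per-base loop is simple but slow on multi-gigabase assemblies, so a production version would want something faster, such as a regular expression over each line with the run length carried across line breaks.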