Genome completion status
Which metazoan species currently have genomic data available? That is hard to say, because the process is difficult to track. There are no announcements, maintained lists, or publications; sequencing centers rarely update their websites or indicate specific future plans. Consequently few researchers know which species have genomic data, and they typically undersample in comparative genomics projects. Sampling species more densely often overturns working hypotheses of feature evolution.
Sequencing centers post raw trace reads on a day-by-day basis at NCBI's trace archives. NCBI performs some quality control and adds them to the accruing database, which is searchable with blastn. Later the center may assemble them into contigs and post them to the "wgs" division of GenBank (more rarely to "gss" or "htgs"). Depending on the coverage and finishing effort, these contigs can be hosted as a genome by a browser center such as UCSC. It may take 2-3 years for data to complete its migration from trace sequencing to contigs to genome. More rarely, traces are withheld and a genome assembly appears abruptly, as with elephantfish.
Further complications include multiple subspecies for gorilla, orang and gibbon, personal human genomes, diploid genomes, areas of confused taxonomy (alpaca vs vicuna) and so on. NISC and the sequencing centers do not always work from the same individual animal or even the same subspecies, so each trace compilation has to be checked separately. Transcript programs can use yet other individual animals or subspecies.
To annotate at kbp scales (adequate for exons and small genes), one can reliably use the traces or contigs and not wait (years) for a genome browser to appear.
If the exon or feature is about 1000 bp (or occurs in pieces of that size), the trace archives work quite well, especially for establishing presence. Absence is not as informative because data might simply be missing due to low coverage. No vertebrate species is truly complete yet, including human. It requires a few million traces before a given incoming genome is worth checking for a given feature.
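The point about absence being uninformative at low coverage can be made roughly quantitative. The sketch below uses a Poisson (Lander-Waterman style) approximation, which is not part of the original text: if reads land uniformly at random, the chance that a given base is covered by no read is about exp(-c), where c is the fold coverage. The read length (700 bp) and genome size (3 Gbp) are illustrative assumptions only, not figures from any particular project.

```python
import math

# Illustrative assumptions, not values taken from any sequencing project.
ASSUMED_READ_LEN_BP = 700
ASSUMED_GENOME_SIZE_BP = 3_000_000_000


def fold_coverage(n_traces, read_len_bp, genome_size_bp):
    """Average number of reads covering a base, ignoring trimming and vector."""
    return n_traces * read_len_bp / genome_size_bp


def prob_base_unsampled(coverage):
    """Poisson approximation: probability that a given base has zero reads."""
    return math.exp(-coverage)


if __name__ == "__main__":
    for millions in (1, 4, 12):
        c = fold_coverage(millions * 1_000_000,
                          ASSUMED_READ_LEN_BP, ASSUMED_GENOME_SIZE_BP)
        print(f"{millions:>2} million traces: {c:.2f}x coverage, "
              f"P(base unsampled) = {prob_base_unsampled(c):.2f}")
```

Under these assumptions a project with only a few million traces still leaves a large fraction of bases unsampled, which is why a failed trace search at that stage says little about true absence.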
Not every trace makes it into a contig or assembly -- singletons are often omitted, often millions of them. Sequencing often continues after a release, as is happening now with elephant and guinea pig. Consequently the trace archive at NCBI is always the resource of last resort, i.e., if a feature is missing from assembled traces, it is best to go back to the trace archive because all the original data is there. However, NCBI posting of traces to its blast database can lag trace inputs from the sequencing centers by a week or more, which can amount to 1.5 million traces in an active project.
There are also "cdna" species like the marsupial Trichosurus vulpecula, which have rather complete coverage of coding genes but no genome project underway. Such sequences can furnish critical close-in query material to improve the sensitivity of trace blast (which is not sensitive at any evolutionary distance if the feature is evolving rapidly).
In a concrete comparative genomics research project, it is important to document which species were considered. For that it is convenient to enter annotation data in a column next to its species, in a spreadsheet containing all species with genomic data (provided below and illustrated below that with a concrete coding indel example). This allows last-minute updating prior to paper submission.
Finally, PCR can be used on species currently lacking genomic or cdna projects when it is critical to augment sampling density. Flying lemur would be a good choice in primate-oriented projects because it appears to be the immediate outgroup (hence a great improvement over distant mouse).
Two recent papers illustrate these concepts and explain methods of contemporary comparative genomics in greater detail:
Janecka JE, Miller W, Pringle TH, Wiens F, Zitzmann A, Helgen KM, Springer MS, Murphy WJ. Molecular and genomic data identify the closest living relative of primates. Science. 2007 Nov 2;318(5851):792-4. PMID: 17975064

Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W. Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res. 2007 Apr;17(4):413-21. PMID: 17322288
The table is correct as of 01 Feb 08. Traces indicated in millions, eg
Trc12 means 12 million traces but no wgs contigs or assembly available
Wgs08 means wgs division of GenBank contains short assembled contigs searchable with tBlastn
Mar06 etc means the March 2006 assembly is the most recent available at UCSC

Mar06  homSap  Homo sapiens  (human)
Mar06  panTro  Pan troglodytes  (chimp)
Trc04  gorGor  Gorilla gorilla  (gorilla)
Jul07  ponPyg  Pongo pygmaeus  (orang_abelii)
Trc19  nomLeu  Nomascus leucogenys  (gibbon)
Jan06  macMul  Macaca mulatta  (rhesus)
Trc12  papHam  Papio hamadryas  (baboon)
Trc17  tarSyr  Tarsius syrichta  (tarsier)
Jun07  calJac  Callithrix jacchus  (marmoset)
Dec06  otoGar  Otolemur garnettii  (bushbaby)
Wgs08  micMur  Microcebus murinus  (mouse_lemur)
Trc00  cynVol  Cynocephalus volans  (flying_lemur)
Dec06  tupBel  Tupaia belangeri  (treeshrew)
Jul07  musMus  Mus musculus  (mouse)
Nov04  ratNor  Rattus norvegicus  (rat)
Wgs08  speTri  Spermophilus tridecemlineatus  (ground_squirrel)
Trc07  dipOrd  Dipodomys ordii  (kangaroo_rat)
Wgs08  cavPor  Cavia porcellus  (guinea_pig)
May05  oryCun  Oryctolagus cuniculus  (rabbit)
Wgs08  ochPri  Ochotona princeps  (pika)
May05  canFam  Canis familiaris  (dog)
Mar06  felCat  Felis catus  (cat)
Aug06  bosTau  Bos taurus  (cow)
Trc10  turTru  Tursiops truncatus  (dolphin)
Trc06  susScr  Sus scrofa  (pig)
Trc11  vicVic  Vicugna vicugna  (vicugna)
Jan07  equCab  Equus caballus  (horse)
Wgs08  myoLuc  Myotis lucifugus  (microbat)
Trc08  pteVam  Pteropus vampyrus  (macrobat)
Wgs08  sorAra  Sorex araneus  (shrew)
Wgs08  eriEur  Erinaceus europaeus  (hedgehog)
May05  loxAfr  Loxodonta africana  (elephant)
Trc09  proCap  Procavia capensis  (hyrax)
Jul05  echTel  Echinops telfairi  (tenrec)
May05  dasNov  Dasypus novemcinctus  (armadillo)
Trc09  choHof  Choloepus hoffmanni  (sloth)
Trc10  macEug  Macropus eugenii  (wallaby)
Jan06  monDom  Monodelphis domestica  (opossum)
Mar07  ornAna  Ornithorhynchus anatinus  (platypus)
May06  galGal  Gallus gallus  (chicken)
Trc15  taeGut  Taeniopygia guttata  (finch)
Feb07  anoCar  Anolis carolinensis  (lizard)
Aug05  xenTro  Xenopus tropicalis  (frog)
Jul07  danRer  Danio rerio  (zebrafish)
Oct04  takRub  Takifugu rubripes  (fugu)
Feb04  tetNig  Tetraodon nigroviridis  (pufferfish)
Feb06  gasAcu  Gasterosteus aculeatus  (stickleback)
Apr06  oryLat  Oryzias latipes  (medaka)
Wgs08  calMil  Callorhinchus milii  (elephantfish)
Mar07  petMar  Petromyzon marinus  (lamprey)
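The status codes in the species table follow a small fixed pattern, so they can be decoded mechanically when the table is loaded into a spreadsheet or script. The following is a minimal sketch, assuming each row reads "status code binomial (common_name)"; the function names parse_status and parse_row are hypothetical, chosen here for illustration. The legend does not spell out what the two digits after "Wgs" mean, so the sketch keeps them uninterpreted.

```python
import re

MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

# status, six-letter code, binomial name, (common_name)
ROW_RE = re.compile(r"(\S+)\s+(\S+)\s+(.+?)\s+\((\S+)\)\s*$")


def parse_status(code):
    """Decode a status code such as Trc12, Wgs08 or Mar06."""
    m = re.fullmatch(r"([A-Za-z]{3})(\d{2})", code)
    if m is None:
        raise ValueError(f"unrecognized status code: {code!r}")
    prefix, digits = m.group(1), m.group(2)
    if prefix == "Trc":
        # Traces only; the digits give the count in millions.
        return {"stage": "traces", "million_traces": int(digits)}
    if prefix == "Wgs":
        # Contigs in the wgs division; digits left uninterpreted (assumption).
        return {"stage": "wgs_contigs", "suffix": digits}
    if prefix in MONTHS:
        # Assembly date, e.g. Mar06 is the March 2006 assembly.
        return {"stage": "assembly", "month": prefix,
                "year": 2000 + int(digits)}
    raise ValueError(f"unrecognized status prefix: {prefix!r}")


def parse_row(line):
    """Split one table row into a record with decoded status fields."""
    m = ROW_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    status, code, species, common = m.groups()
    record = {"code": code, "species": species, "common": common}
    record.update(parse_status(status))
    return record


if __name__ == "__main__":
    for row in ("Mar06 homSap Homo sapiens (human)",
                "Trc19 nomLeu Nomascus leucogenys (gibbon)",
                "Wgs08 micMur Microcebus murinus (mouse_lemur)"):
        print(parse_row(row))
```

Keeping the decoded stage in its own column makes it easy to filter, for example, all species that so far exist only as traces and therefore need a trace archive search rather than a browser lookup.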
A coding indel example (a coding exon from gene SPC25 on human chr2) illustrates the usefulness of multiple genomes in timing and understanding evolution of insertions and deletions.
homSap MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
panTro MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
ponPyg MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
macMul MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
calJac MVEDELALFDKSLNEFWNKFKST--DTTFQMAGLRDTYKDSLKAFA
tarSyr MVEDELTLFDKSINEFWNKFKST--DTANQMMGLRDTYKDSVKAFA
otoGar MVEDQLALLDKNINEFWNKFKST--DTAGQMAGLRDTYKDSIKTFA
micMur MVEDELVLFDKTVNEFWNKFKST--DTSCHMVGLRDTYKDSLKAFA
cynVol .................NKFTST--DTSCQMMGLRGTNK.......
tupBel MVEDELALFDKGINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
ratNor MGEDELAAFEKSINEFGDKFRYRLSDNRSQVLGLKDAFKDSIRALS
cavPor MVEDELALFDKSINEFGNKFRNTLSDTPCQMLGLRDACKDSIKTLA
speTri MMEDELARFDKSINEFGNKFRNTFSDTRCQMVGLRDVFKDSIEALA
dipOrd MVEDELAHFDKSISEFGSKFRNTLSDTPSQTVGLRDAYKDSIKALS
oryCun MVEDELALFDKSINEFGSKFRSTLSDAPCQMVGLRDAYKDSVKSLT
ochPri MVEDELALFDKSINEFGSKFRSTLSDTPCQMVGLREACKDSVRLLT
canFam MIDDELAQFDKSISEFWSKFKGTVSDTSSQMVGLRETYKDSIKACA
felCat MIEDELALFDKSINEFWNKFKSTLSDTSCQMMGLRDTYKDSIKALT
equCab MVEDELALFDKSINEFWNKFKNTVSDTSCQMVGLRDAYKDSIKAFA
myoLuc MVEDELALLDKNINEFWNKFKSNVNDTSCQMVGLRDNYKDISKAFT
pteVam MVEDELALLDKSINEFWNKFKSSVSDTSCQMMALRDSYKDINKAFT
bosTau MVEDELALFDKSINEFWNKFKSTVSDTSCQMVGLRETYKDSIKAFA
turTru MVEDELALFDKSINEFWNKFRSTVSDTSCQMVGLRDTYKDSIKAFA
susScr MVEDELALFDKSINEFWNRFKSTVSDTSCQMVGLRENYKDSLKAFA
oviAri MVEDELALFDKSLNEFWNKFKSTVNDTSCQMVGLREAYKDSIKAFA
eriEur MVEDELALFDKSINEFWNKFKGTVSDTSFQMVGLRDTYKDSIKIFT
sorAra MVEDELVLFEKSINEFVNEFESTASDTTCQVVGPRDADKDSIKALA
dasNov MIEDELALFDKSINEFWNKFKGTVSDNSCQMVGLRDTYKDSIKAFA
choHof MIEDELALFDKSINEFWNKFKSAVSDTSCQMVGLRDTYKDSIKAFA
loxAfr MIEDELVQFDKSINEFWNKFINTASDTSCQMVGLRDAYKDSMKAFA
proCap MIEDELRQFDKSINEFWNKFINTTSDTSCQMAGLRDAYKDSMKAFA
echTel MIEDELLQFDKSMNEFRNKHFNTLNDTSGQMMGLRDTYRDSMKAFA
monDom MSHIKTEEELDLFNKSINDFWNKFRNTTLNEHCSQMVGLRDTYKDSIEALT
macEug MSHIKTEEELDIFEKSISDFWNRFRNTAFNEPYSQVVGVRDTYKYSIETLT
triVul MSHIKTEEELDIFNKSINDFWNRFRNTTFNEHYSQVVGLRDTYKNSIEALT
ornAna MSHIKTEEELALFDKSIDEFWTKFKNTWISEYSCQTVTLRDAHKEAIKALT
galGal MSAVKTEDEITVVEREMKEFWTELKSVYGTEQINQTLALRDSCKESINVLS
taeGut MGNAQAEDEVALFEKDMKEFWIQFKISYGTEQNNQTMKEFWIQFKISYGTE
anoCar MAKAKEEDELTMLEKGIEELCTQIETTYCRQSLEKTSGPRNKCYKSGPRNK
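Tallying which species carry the gap is the first step in reading an alignment like this one. The sketch below is a minimal illustration, using six rows copied from the alignment; it assumes "-" marks an alignment gap and "." marks missing data (as the partial cynVol row suggests), and the helper names are hypothetical. It only groups species by gap state. Deciding whether the event was an insertion or a deletion, and on which branch it occurred, still requires placing the groups on a phylogeny with outgroups.

```python
# Six rows copied from the SPC25 exon alignment; '-' is a gap,
# '.' is treated as missing data (assumption).
ALIGNMENT = """\
homSap MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
macMul MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
cynVol .................NKFTST--DTSCQMMGLRGTNK.......
tupBel MVEDELALFDKGINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
canFam MIDDELAQFDKSISEFWSKFKGTVSDTSSQMVGLRETYKDSIKACA
"""


def parse_alignment(text):
    """Return {species_code: aligned_sequence} from 'code sequence' lines."""
    rows = {}
    for line in text.splitlines():
        if line.strip():
            name, seq = line.split()
            rows[name] = seq
    return rows


def gap_columns(seq):
    """Zero-based column indices at which the row has a gap character."""
    return [i for i, ch in enumerate(seq) if ch == "-"]


def split_by_gap(rows):
    """Partition species into those with and without any gap column."""
    with_gap = [name for name, seq in rows.items() if gap_columns(seq)]
    without_gap = [name for name, seq in rows.items() if not gap_columns(seq)]
    return with_gap, without_gap


if __name__ == "__main__":
    rows = parse_alignment(ALIGNMENT)
    with_gap, without_gap = split_by_gap(rows)
    print("gap present:", with_gap)
    print("gap absent: ", without_gap)
    print("gap columns in homSap:", gap_columns(rows["homSap"]))
```

In this subset the two-residue gap is shared by human, rhesus and flying lemur, while treeshrew, mouse and dog have residues at those columns. That is the pattern that makes densely sampled close outgroups so useful for timing the event.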