Bison: mitochondrial genomics
Introduction to bison conservation genomics
(to be continued)
Phylogeny: bison and yak are sister groups
(to be continued)
Interpreting bison CYTB variation
(to be continued)
Interpreting yak CYTB variation
Although the mitochondrial genome encodes proteins built from the usual 20 amino acids, only a subset of physicochemically similar residues (the reduced alphabet) ever appears at a given position in a given protein. This subset describes the acceptable substitutions that do not significantly disrupt protein function. The reduced alphabet is discovered with greater sensitivity when both the number of available species and the number of sequences per species are high. For mitochondrial proteins, that sensitivity is 1 in 10,000 (0.01% occurrence frequency) for a given amino acid.
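As a minimal sketch of the idea, the reduced alphabet at one position can be tallied from a set of pre-aligned protein sequences. The function name `reduced_alphabet`, the `min_freq` cutoff (defaulted here to the 1 in 10,000 figure quoted above) and the toy sequences are illustrative assumptions, not part of the original protocol.

```python
from collections import Counter

def reduced_alphabet(aligned_seqs, position, min_freq=0.0001):
    """Return {residue: frequency} at a 1-based alignment column.

    Gaps ('-') and unknown residues ('X', '?') are ignored; residues rarer
    than min_freq are dropped. All names here are hypothetical.
    """
    column = [s[position - 1] for s in aligned_seqs if len(s) >= position]
    counts = Counter(r for r in column if r not in "-X?")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {r: n / total for r, n in counts.most_common() if n / total >= min_freq}

# Toy alignment (made-up sequences, not real CYTB data)
toy = ["MTNVRK", "MTNIRK", "MTNVRK", "MTNVRK", "MT-VRK"]
alphabet = reduced_alphabet(toy, 4)
print(alphabet)
```

With real data the input would be thousands of orthologs, so the cutoff only becomes meaningful once the column holds on the order of 10,000 residues.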
Interpretive certainty is never attained without experimentation but improves (up to a point) with more sequence data. Here it is important to check whether certain less common substitutions have persisted over evolutionary time in a phylogenetically coherent manner (i.e. within a sub-clade) or are novel adaptations, perhaps in conjunction with a co-evolving residue at another site (or in another protein, perhaps even a nuclear-encoded one). After these considerations, the remaining rare changes are either deleterious mutations or sequencing errors. Polymorphism significance can be pursued at the X-ray structural level for only 3 of the 13 mitochondrial proteins (CYTB, COX2, COX1), and even this is complicated in the case of CYTB by its oligomeric association with 3 nuclear-encoded proteins.
Aligning CYTB from the 70 complete yak mitochondrial genomes available on 1 Dec 2010 shows variation at just 9 sites along the protein (i.e. 9 nsSNPs). These are quickly found when the web alignment tool retains input sequence order, displays residues identical to the top sequence as dots, gaps fragmentary data correctly, and allows a wide display permitting effective cross-species comparisons.
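The dot-style display described above is easy to imitate on the desktop once sequences are already aligned. The following is only a sketch under that assumption: `dot_view` and `variable_columns` are hypothetical helper names, the sequences are made up, and all rows are assumed to have equal length.

```python
def dot_view(aligned_seqs):
    """Show residues identical to the top sequence as dots; keep gaps visible."""
    reference = aligned_seqs[0]
    lines = [reference]
    for seq in aligned_seqs[1:]:
        lines.append("".join(
            "." if (a == b and a != "-") else b
            for a, b in zip(reference, seq)
        ))
    return lines

def variable_columns(aligned_seqs):
    """Return 1-based columns where any sequence differs from the top one (gaps ignored)."""
    reference = aligned_seqs[0]
    return [
        i + 1
        for i in range(len(reference))
        if any(s[i] not in (reference[i], "-") for s in aligned_seqs[1:])
    ]

# Toy input (hypothetical sequences); input order is retained
toy = ["MTNVRKA", "MTNLRKA", "MTNVRKT", "MT-VRKA"]
view = dot_view(toy)
sites = variable_columns(toy)
print("\n".join(view))
print(sites)
```

Treating gaps as missing data rather than as differences keeps fragmentary entries from inflating the list of variable sites.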
Yak and bison -- despite being sister species -- share variation at only one site, position 98. Here yak is exclusively valine with the exception of a single deleterious occurrence (see below) of leucine, whereas bison have a mix of valine and alanine (which is otherwise very rare at this position in mammals), i.e. the ancestral residue was valine. Thus no lineage sorting occurred at any amino acid position in CYTB at the time these species diverged. Lineage sorting may however be important in the overall evolution of the Bovini: 53 ancient polymorphisms (at the DNA level) are said to have persisted since Bos and Bison diverged from Bubalus 5–8 million years ago.
The summary table of yak CYTB amino acid polymorphisms below arises from an alignment of 5000 full-length mammalian cytochrome b orthologs. Red indicates a deleterious mutation, green a possibly acceptable change of restricted distribution, and blue a near-neutral substitution. The smallish yak population sampled (21 wild and 48 domestic, added in Aug 2010 to the 3-4 previously available) already contains 5 deleterious alleles in CYTB, which represents only 10% of the mitochondrial proteome.
A017T    A084T    V098L    I188T    I192T    V195A    D214N    V329M    I348F
927 A    4994 A   4522 V   4309 I   94 I     4528 V   4429 D   4610 V   4232 I
4018 S   3 T      430 I    667 S    4353 L   427 I    512 N    188 T    651 V
46 T     1 P      34 M     14 I     505 M    25 T     43 E     133 A    63 T
3 L      1 V      11 A     1 T      31 T     4 G      8 S      44 I     45 M
3 M      -        1 L      -        3 F      4 M      2 Y      22 M     4 N
1 F      -        1 N      -        2 V      1 A      1 H      2 G      2 F
1 P      -        -        -        1 A      -        -        1 E      1 A
-        -        -        -        1 S      -        -        -        -

(analysis to be continued)
A017T: 927 A, 4018 S, 46 T, 3 L, 3 M, 1 F, 1 P
(analysis to be continued)
A084T: 4994 A, 3 T, 1 P, 1 V
V098L: At position 98, the reduced alphabet consists of valine about 90% of the time regardless of mammalian clade, with the similar (branched-chain aliphatic) isoleucine having substantial dispersed representation at nearly 9%. The 430 species in which isoleucine occurs are scattered incoherently within mammal clades, meaning that it has arisen independently many times. V098I may be slightly suboptimal, as there is an evident bias (at some level) against equal occurrence. It likely co-exists with valine in most non-bottlenecked populations of mammals and would be observed if enough individuals of a given species were sequenced.
However leucine, the seemingly similar third aliphatic residue, occurs only once despite being but a single base change away from the dominant residue. Were leucine a near-neutral substitution, its incidence would be vastly higher. Thus the change V098L reported for yak represents a deleterious mutation, an unprecedented adaptation (e.g. to high altitude), or a sequencing error in GenBank entry ACU82101. The same can be said for the more overtly radical change V098N in lemur AAS00156.
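Which residues lie one base change away from valine can be enumerated directly from the vertebrate mitochondrial code. The sketch below rebuilds that code from the standard 64-letter amino acid string in TCAG order; the helper names `change_type` and `neighbours` are illustrative assumptions.

```python
# Vertebrate mitochondrial code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AAS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODE = dict(zip(CODONS, AAS))
PURINES = {"A", "G"}

def change_type(old, new):
    """Purine<->purine or pyrimidine<->pyrimidine is a transition, else a transversion."""
    return "transition" if (old in PURINES) == (new in PURINES) else "transversion"

def neighbours(residue):
    """Map each amino acid reachable by one base change to the kinds of change needed."""
    found = {}
    for codon, aa in CODE.items():
        if aa != residue:
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                target = CODE[mutant]
                if target != residue:
                    found.setdefault(target, set()).add(change_type(codon[pos], base))
    return found

val = neighbours("V")
for aa in sorted(val):
    print(aa, sorted(val[aa]))
```

Under this table isoleucine, methionine and alanine are each reachable from a valine codon by a transition, leucine by a first-position transversion, and asparagine is not reachable by any single base change.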
V098L: 4522 V (most common amino acid at position 98 of CYTB), 430 I, 34 M, 11 A (bison), 1 L (yak), 1 N (lemur)
(analysis to be continued)
I188T: 4309 I, 667 S, 14 I, 1 T
(analysis to be continued)
I192T: 94 I, 4353 L, 505 M, 31 T, 3 F, 2 V, 1 A, 1 S
(analysis to be continued)
V195A: 4528 V, 427 I, 25 T, 11 X, 4 G, 4 M, 1 A
(analysis to be continued)
D214N: 4429 D, 512 N, 43 E, 8 S, 4 X, 2 Y, 1 H
(analysis to be continued)
V329M: 4610 V, 188 T, 133 A, 44 I, 22 M, 2 G, 1 E
(analysis to be continued)
I348F: 4232 I, 651 V, 63 T, 45 M, 4 N, 2 F, 1 A
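The percentages quoted for position 98 can be rechecked from the listed counts. This is a small worked example only; the dictionary simply restates the V098L tallies, and the `frequencies` helper is a hypothetical name.

```python
# Residue counts at CYTB position 98, as listed for V098L
counts_98 = {"V": 4522, "I": 430, "M": 34, "A": 11, "L": 1, "N": 1}

def frequencies(counts):
    """Convert raw residue counts to percentages of the column total."""
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}

total_98 = sum(counts_98.values())
freq_98 = frequencies(counts_98)
for residue, pct in freq_98.items():
    print(f"{residue}: {pct:.2f}%")
```

The same calculation applies to any of the other eight variable sites by swapping in their counts.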
Kilo-sequence alignment tricks
New sequencing technologies have greatly increased the amount of mammalian mitochondrial genomic data available at GenBank. Five years ago, it was acceptable to publish population-level D-loop sequences accompanied by a few fragmentary coding reads; today, a publication might offer 60-70 entire mitochondrial genomes. This favors evolutionary study of mitochondrial proteins over comparative genomics of nuclear genome products because the latter is still restricted to around 50 species (Dec 2010), almost all incompletely sequenced.
Many long-standing issues such as introgression, historic bottlenecks, population mixing, accrual of deleterious coding variants, hard polytomies, and lineage sorting during speciation can now be approached and resolved, especially with the increasing sequencing of end-Pleistocene frozen DNA. This may allow more enlightened management of endangered species such as bison whose populations reached rock bottom -- recovering numbers is not enough if genomic integrity is still at risk.
However, the flood of data raises real problems for extracting meaningful information: it is not instructive to align the tens of thousands of sequences available for each of the 13 mitochondrial proteins. That would give an intractable array of 3789 amino acids by 12500 sequences, enough to fill 20 x 100 = 2000 screens on the largest possible computer monitor. The data must somehow be distilled down to take-away information.
This section explains a practical desktop protocol for extracting the 'reduced phylogenetic alphabet' at each residue of the mitochondrial proteome. The method depends heavily on current capabilities of Blastp at NCBI and so may not be completely stable to changes made there over time.
First note that tBlastn cannot be used against the nr or wgs nucleotide databases at NCBI (or with Blat at UCSC) since the significantly different genetic code of mammalian mitochondria is no longer supported as a parameter option. Other oddities involve missing terminal nucleotides that are added before translation. However, mitochondrial DNA is usually translated sensibly in GenBank protein entries.
The vertebrate mitochondrial code:

TTT F Phe      TCT S Ser      TAT Y Tyr      TGT C Cys
TTC F Phe      TCC S Ser      TAC Y Tyr      TGC C Cys
TTA L Leu      TCA S Ser      TAA * Ter      TGA W Trp
TTG L Leu      TCG S Ser      TAG * Ter      TGG W Trp

CTT L Leu      CCT P Pro      CAT H His      CGT R Arg
CTC L Leu      CCC P Pro      CAC H His      CGC R Arg
CTA L Leu      CCA P Pro      CAA Q Gln      CGA R Arg
CTG L Leu      CCG P Pro      CAG Q Gln      CGG R Arg

ATT I Ile i    ACT T Thr      AAT N Asn      AGT S Ser
ATC I Ile i    ACC T Thr      AAC N Asn      AGC S Ser
ATA M Met i    ACA T Thr      AAA K Lys      AGA * Ter      (Bos can use ATA as initiation codon)
ATG M Met i    ACG T Thr      AAG K Lys      AGG * Ter

GTT V Val      GCT A Ala      GAT D Asp      GGT G Gly
GTC V Val      GCC A Ala      GAC D Asp      GGC G Gly
GTA V Val      GCA A Ala      GAA E Glu      GGA G Gly
GTG V Val i    GCG A Ala      GAG E Glu      GGG G Gly

AAs   = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG
Start = --------------------------------MMMM---------------M------------
Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
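Where a nucleotide sequence has to be translated locally rather than taken from a GenBank protein entry, the table above is enough to build a small translator. This is a minimal sketch: `translate_mito` is a hypothetical helper, the Start string is rebuilt by run length, and completion of truncated stop codons is not attempted.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
AAS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
# Start line of the table: ATT, ATC, ATA, ATG and GTG are marked as initiators
STARTS = "-" * 32 + "MMMM" + "-" * 15 + "M" + "-" * 12

CODE = dict(zip(CODONS, AAS))
START_CODONS = {codon for codon, flag in zip(CODONS, STARTS) if flag == "M"}

def translate_mito(dna):
    """Translate a coding sequence with the vertebrate mitochondrial code.

    An initiator codon in first position is read as Met; translation stops
    at the first stop codon; unrecognised codons become 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        aa = CODE.get(codon, "X")
        if i == 0 and codon in START_CODONS:
            aa = "M"
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Made-up fragment: ATA start, TGA read as Trp, AGA read as stop
print(translate_mito("ATAACCTGAAGATTT"))
```

The two differences that matter most against the standard nuclear code are visible in the example: TGA encodes tryptophan and AGA terminates translation.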