USH2A SNPs: Difference between revisions
In short, software such as Sift and PolyPhen fails to use all available information, always a bad idea in feature annotation. Such tools have value in quick screening of millions of alleles for flaming anomalies but are not particularly useful for specific genes because curational judgement on more extensive information will always outperform them. This is illustrated above in comparative genomics products not available to these tools but very useful in making the best possible annotation call. It is exceedingly important to make all-out use of bioinformatics in view of the high costs and delays of experimental validation.
Revision as of 13:35, 31 July 2009
USH2A
Usherin (USH2A), a 71-exon coding gene located on human chromosome 1q41, encodes a 5202-residue multi-domain protein composed of a signal peptide, a PDZ1 binding domain (for USH1C and WHRN), 1 laminin N-terminal domain, 10 laminin EGF-like domains, 4 fibronectin type-III domains (for collagen IV and fibronectin), and 2 laminin G-like domains followed by 31 additional fibronectin type-III domains, all tethered to the cytoplasmic exterior by a single transmembrane domain.
The usherin gene is expressed in the basement membrane of many (but not all) cell types, notably in ear interstereocilia ankles and below retinal pigment epithelial cells (Bruch's layer). When normal function is disrupted by mutations in both copies, the result is non-vestibular sensorineural deafness and degeneration of retinal photoreceptor cells, together called Usher syndrome type IIA.
Initially, only the first 21 exons were studied but later it emerged that the gene was much longer and mutations along the entire length of the protein all led to the same disease: 125, 163, 230, 268, 303, 334, 346, 352, 478, 536, 595, 644, 713, 759, 1212, 1349, 1486, 1572, 1665, 1757, 2080, 2086, 2106, 2169, 2238, 2265, 2266, 2292, 2562, 2875, 2886, 3088, 3099, 3115, 3124, 3144, 3199, 3411, 3504, 3521, 3590, 3835, 3868, 3893, 4054, 4115, 4232, 4433, 4439, 4487, 4592, 4624, 4795, 5031.
This note evaluates a tentative new SNP in USH2A with comparative genomics. The mutation occurs as a non-hotspot G-->A transition causing a seemingly innocuous S-->N amino acid change at position 3743. This is just downstream from a glycosylation motif and very near known FN3 interdomain contact residues and a cytokine receptor motif (according to its annotation at SwissProt). This residue lies in the 22nd fibronectin domain, which is split across exons 56 and 57.
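The codon arithmetic behind this change can be checked with a short sketch. It assumes the standard genetic code and coding-strand codons; the function names are illustrative only. The note does not state which serine codon occupies position 3743, but a single G-->A change can only turn AGT or AGC into an asparagine codon. For comparison, every single-base route from serine to threonine turns out to be a transversion.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

PURINES = {"A", "G"}


def substitution_kind(old, new):
    """Transition = purine<->purine or pyrimidine<->pyrimidine, else transversion."""
    return "transition" if (old in PURINES) == (new in PURINES) else "transversion"


def single_base_routes(from_aa, to_aa):
    """List every one-nucleotide change turning a from_aa codon into a to_aa codon."""
    routes = []
    for codon, aa in CODON_TABLE.items():
        if aa != from_aa:
            continue
        for i, old in enumerate(codon):
            for new in BASES:
                if new == old:
                    continue
                mutant = codon[:i] + new + codon[i + 1:]
                if CODON_TABLE[mutant] == to_aa:
                    routes.append((codon, mutant, f"{old}>{new}",
                                   substitution_kind(old, new)))
    return routes


ser_to_asn = single_base_routes("S", "N")
ser_to_thr = single_base_routes("S", "T")

for route in ser_to_asn + ser_to_thr:
    print(*route)
```

Only the two AGY serine codons reach asparagine in one step; the four TCN codons would need two changes.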
This change will be shown significant (not plausibly neutral). It could represent an adaptive innovation but is more likely deleterious. The gene is single-copy so there are no prospects for compensation by a second gene. Consequently the mutation, if present on both alleles, could well result in a new form of Usher syndrome type IIA.
Background
USH2A, while resembling just another domain scramble, actually traces back to pre-bilaterians in a coherent manner. Nearly full-length ortholog candidates can readily be recovered from sea anemone and hydra. Despite good representation in cnidarians (where its normal function may be similar), the gene seems completely lost (or majorly diverged) from all arthropods and lophotrochozoans. It's not clear whether the need for a basement membrane organizer has been lost or whether some other gene has taken over USH2A's role in these clades.
Focusing now on the fibronectin FN3 domains (that envelop S3743), these are an ancient and exceedingly common domain in bilaterians, with 2% of the human proteome containing them (400 genes), often in multiple tandem copies having a role in cell adhesion. However they are not particularly well conserved in primary sequence, though the tertiary structure likely holds up well enough for the structure at residue 3743 to be determined with both serine and asparagine present.
Here the best blastp match elsewhere within the human proteome to the FN3 domain containing residue 3743 is a fibronectin domain of PTPRQ, a dimly related protein tyrosine phosphatase with merely 28% of the fibronectin residues matching.
Internally, the best match to the other 30 FN3 domains of USH2A is not noticeably better, suggesting very substantial divergence since these domains duplicated from a common source (either as internal tandems or domain shuffles). If, as suggested above, USH2A had already assumed its contemporary domain structure in pre-bilaterian metazoa, ample time has passed to produce the observed divergences between the individual domains.
While comparative genomics of intron positions and phases in a 71-exon protein is tedious to pursue curationally, the fibronectin domain containing S3743 falls across parts of three exons, whose phases are 12 and 21. This suggests that subsequent to the ancient intronation era, simple internal tandem duplications might not result in either a coherent reading phase or domain. Thus the domain structure of USH2A, while appearing somewhat arbitrary in its FN3 multiplicities, actually may be quite constrained by intronation against both contraction and expansion (in addition to whatever individual functional domain constraints exist).
As can be seen below, the internal fibronectin repeats are most often T threonine at the position corresponding to S3743 though other residues, not including the asparagine of S3743N, also occur. Here the numbering of better matches within the full length protein indicates they do not always correspond in match quality order to the linear order of the FN3 repeat within the protein.
FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
           WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
FBN.. 3702 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS

FBN22    1 WSLPEKPNGLVSQYQLSRNGN-LLFLGGSEEQNFTDKNLEPNS
           WS+PEK NG++ +YQ+ + G L+ ++ + T L+P +
FBN.. 3610 WSVPEKSNGVIKEYQIRQVGKGLIHTDTTDRRQHTVTGLQPYT

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLL-FLGGSEEQNFTDKNLEPNS
           W PE+ NG++ Y+L RN L F N+TD+ L P S
FBN.. 4285 WIPPEQSNGIIQSYRLQRNEMLYPFSFDPVTFNYTDEELLPFS

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
           W P K NG+++ Y + +G L N T +L P +
FBN.. 2553 WQHPRKSNGVITHYNIYLHGRLYLRTPGNVTNCTVMHLHPYT

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
           W P PNG + Y+L R+G +++ G E + D L P
FBN.. 4464 WKPPRNPNGQIRSYELRRDGTIVYTG--LETRYRDFTLTPGV

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
           W+ P+K NG+++QY L +G L++ G E+N+T +L +
FBN.. 2075 WNPPKKANGIITQYCLYMDGRLIYSG--SEENYTVTDLAVFT

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKN-LEPNS
           W P + NG + Y L RNG F G S +F+DK ++P
FBN.. 3521 WRKPIQSNGPIIYYILLRNGIERFRGTS--LSFSDKEGIQPFQ

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
           W+ P PNG+V++Y + N L G + +F ++L P +
FBN.. 3040 WTSPSNPNGVVTEYSIYVNNKLYKTGMNVPGSFILRDLSPFT

FBN22    1 WSLPEKPNGLVSQYQLSRN-------GNLLFLGGSEEQNFTDKN--LEPNS
           W P PNGLV + + R L+ L S F DK L P +
FBN.. 2644 WQPPTHPNGLVENFTIERRVKGKEEVTTLVTLPRSHSMRFIDKTSALSPWT

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
           WS P + NG++ Y + +G L + G + + F + L+P +
FBN.. 4087 WSEPMRTNGVIKTYNIFSDGFLEYSGLNRQ--FLFRRLDPFT

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEE----QNFTDKNLEPNS
           WS P+ PN Y L R+G ++ + Q F D +L P +
FBN.. 1074 WSPPDSPNAHWLTYSLLRDGFEIYTTEDQYPYSIQYFLDTDLLPYT

FBN22    1 WSLPEKPNGLVSQYQLSRNG------NLLFLGGSEEQNFTDK--NLEPNS
           W PEKPNG++ Y + R ++LF+ F D+ L P +
FBN.. 3887 WMPPEKPNGIIINYFIYRRPAGIEEESVLFVWSEGALEFMDEGDTLRPFT

FBN22    1 WSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNS
           WS P NG +++Y L R N L G + +L+P S
FBN.. 4376 WSPPTVQNGKITKY-LVRYDNKESLAG-QGLCLLVSHLQPYS

FBN22    1 WSLPEKPNGLVSQYQL--------SRNGNLLFLGGSEEQNFTDKNLEPNS
           W+ P +PNG V Y+L R N + + +F D L P +
FBN.. 4657 WTGPLQPNGKVLYYELYRRQIATQPRKSNPVLIYNGSSTSFIDSELLPFT

FBN22    1 WSLPEKPNGLVSQYQL------SRNGNLLFLGGSEE----QNFTDKNLEPNS
           W P + NG + Y L R ++ + + Q++ L+P
FBN.. 4552 WDPPVRTNGDIINYTLFIRELFERETKIIHINTTHNSFGMQSYIVNQLKPFH

FBN22    1 WSLPEKPNGLVSQYQLSRN-------GN--------LLFLGGSEEQN---FTDKNLEPNS 42
           WS P PNG + +Y++ R GN ++F + E+N + D L+P +
FBN.. 4175 WSEPVNPNGKIIRYEVIRRCFEGKAWGNQTIQADEKIVFTEYNTERNTFMYNDTGLQPWT 4234
Pseudogene issues
Long isoform USH2A transcripts are over 15,000 bp in length. Consequently position 3743 is not even represented in the set of all human direct transcripts. Even should a retrogene arise from retropositioning, it is unlikely that the process would extend upstream so many exons. Unsurprisingly no processed pseudogenes are evident in any mammalian genome (tblastn of wgs division of GenBank). Thus no potential for confusion exists in locating orthologs of USH2A even in distant species with incomplete genomes.
Paralog issues
No close paralog exists in the human proteome according to the UCSC GeneSorter track. The nearest matches are to other proteins containing laminin or fibronectin domains. No potential for confusion with other genes exists within vertebrates; however comparative genomics at and before teleost fish divergence needs more careful treatment because of whole genome and domain expansion.
Tandem domain repeat issues
In proteins with multiple copies of a given domain, both expansion and contraction can occur over evolutionary timescales resulting in different numbers of repeats in different clades. Under these circumstances it can be difficult to establish orthologs of a given domain. However here the fibronectin domains diverged early on and the 22nd domain seems to be present in all vertebrates with genome projects as a single-copy domain (meaning here no recent duplications or losses).
The alignment of fibronectin domains in human USH2A shows pockets of conservation (notably LEPNSRY about S3743 in the 22nd FN3 domain) and certain conserved anchor residues but on the whole is mediocre due to gaps necessitated by length differences. The second alignment -- just of FN3 domains containing verified pathogenic mutations -- shows these sites are highly correlated with conserved residues (8 of 24 are represented, two sites multiple times in separate FN3 domains). Possibly some of the best conserved sites cannot be mutated without much more far-reaching effects on all tissues in which USH2A is expressed.
Analysis of the full set of fibronectin domains somewhat strengthens the case for S3743N pathogenicity (it lies in a conserved patch with two nearby sites proven pathogenic, and the similar hydroxyl residue T is the most abundant residue here and in deeper phylogeny) but not overwhelmingly (the 8th domain has N at the homologous position).
Some scepticism is in order for pathogenicity of Y4487C and Q4592H in the 30th and 31st FN3 domains in view of their position in apparently unconstrained loop positions with no observed interdomain conservation. Yet tblastn of both at GenBank wgs shows remarkable phylogenetic conservation (data not shown), similar to domain 22.
What is needed here is online alignment software that accepts a fasta sequence array (in effect a simple relational db) and outputs something whose rows are individual Logos. More simply, it could use a fasta header naming scheme to collapse a conventional alignment to the index species. Many human genes have internally repeated domains and many others have full-length paralogs (which can be treated like tandem repeats). Evaluating coding SNPs is a huge issue in genomic medicine and even slight improvements in forecasting could have significant benefits.
Fasta sequences for the 35 fibronectin domains of USH2A are shown below (as delineated at SwissProt). Those containing a mutation known to give rise to USH2A Syndrome contain * in their header -- some are found in patients but are of uncertain pathogenicity. The mutation itself is flanked by spaces in the fasta sequence itself for readability. There are 22 known sites in 18 different fibronectin domains according to this analysis. Note still other pathogenic mutations occur interstitially (between FN3 domains).
>01*1058-1143 86aa P1059L uncertain pathogenicity
P P PRGQVQSSSAINLSWSPPDSPNAHWLTYSLLRDGFEIYTTEDQYPYSIQYFLDTDLLPYTKYSYYIETTNVHGSTRSVAVTYKT
>02*1145-1238 94aa P1212L
PGVPEGNLTLSYIIPIGSDSVTLTWTTLSNQSGPIEKYILSCAPLAGGQPCVSYEGHETSATIWNLV P FAKYDFSVQACTSGGCLHSLPITVTT
>03:1242-1357 116aa
PPQRLSPPKMQKISSTELHVEWSPPAELNGIIIRYELYMRRLRSTKETTSEESRVFQSSGWLSPHSFVESANENALKPPQTMTTITGLEPYTKYEFRVLAVNMAGSVSSAWVSERT
>04:1367-1462 96aa
PPSVFPLSSYSLNISWEKPADNVTRGKVVGYDINMLSEQSPQQSIPMAFSQLLHTAKSQELSYTVEGLKPYRIYEFTITLCNSVGCVTSASGAGQT
>05:1871-1949 79aa
GAVVNLASVSSGAVRVNLDGCLSTDSAVNCRGNDSILVYQGKEQSVYEGGLQPFTEYLYRVIASHEGGSVYSDWSRGRT
>06:1954-2051 98aa
PQSVPTPSRVRSLNGYSIEVTWDEPVVRGVIEKYILKAYSEDSTRPPRMPSASAEFVNTSNLTGILTGLLPFKNYAVTLTACTLAGCTESSHALNIST
>07.2052-2138 87aa
PQEAPQEVQPPVAKSLPSSLLLSWNPPKKANGIITQYCLYMDGRLIYSGSEENYIVTDLAVFTPHQFLLSACTHVGCTNSSWVLLYT
>08.2142-2236 95aa
PPEHVDSPVLTVLDSRTIHIQWKQPRKISGILERYVLYMSNHTHDFTIWSVIYNSTELFQDHMLQYVLPGNKYLIKLGACTGGGCTVSEASEALT
>09*2241-2325 85aa A2249D
PEGVPAPK A HSYSPDSFNVSWTEPEYPNGVITSYGLYLDGILIHNSSELSYRAYGFAPWSLHSFRVQACTAKGCALGPLVENRTL
>10*2328-2432 105aa R2354H
PPEGTVNVFVKTQGSRKAHVRWEAPF R PNGLLTHSVLFTGIFYVDPVGNNYTLLNVTKVMYSGEETNLWVLIDGLVPFTNYTVQVNISNSQGSLITDPITIAMPP
>11:2435-2528 94aa
PDGVLPPRLSSATPTSLQVVWSTPARNNAPGSPRYQLQMRSGDSTHGFLELFSNPSASLSYEVSDLQPYTEYMFRLVASNGFGSAHSSWIPFMT
>12:2533-2619 87aa
PGPVVPPILLDVKSRMMLVTWQHPRKSNGVITHYNIYLHGRLYLRTPGNVTNCTVMHLHPYTAYKFQVEACTSKGCSLSPESQTVWT
>13:2621-2718 98aa
PGAPEGIPSPELFSDTPTSVIISWQPPTHPNGLVENFTIERRVKGKEEVTTLVTLPRSHSMRFIDKTSALSPWTKYEYRVLMSTLHGGTNSSAWVEVT
>14*2724-2812 89aa A2795S
PAGVQPPVVTVLEPDAVQVTWKPPLIQNGDILSYEIHMPDPHITLTNVTSAVLSQKVTHLIPFTNYSVTIV A CSGGNGYLGGCTESLPT
>15.2821-2920 100aa
PQNVGPLSVIPLSESYVVISWQPPSKPNGPNLRYELLRRKIQQPLASNPPEDLNRWHNIYSGTQWLYEDKGLSRFTTYEYMLFVHNSVGFTPSREVTVTT
>16.2925-3015 91aa
PERGANLTASVLNHTAIDVRWAKPTVQDLQGEVEYYTLFWSSATSNDSLKILPDVNSHVIGHLKPNTEYWIFISVFNGVHSINSAGLHATT
>17:3020-3105 86aa
PQGMLPPEVVIINSTAVRVIWTSPSNPNGVVTEYSIYVNNKLYKTGMNVPGSFILRDLSPFTIYDIQVEVCTIYACVKSNGTQITT
>18*3110-3200 91aa R3124G uncertain pathogenicity
PSDIPTPTIRGITS R SLQIDWVSPRKPNGIILGYDLLWKTWYPCAKTQKLVQDQSDELCKAVRCQKPESICGHICYSSEAKVCCNGVLYNP
>19:3404-3494 91aa
PASMEATEHCGRCDFNFTSHICTVIRGSHNSTGKASIEEMCSSAEETIHTGSVNTYSYTDVNLKPYMTYEYRISAWNSYGRGLSKAVRART
>20*3499-3585 87aa P3504T W3521R T3571M
PQGVS P PTWTKIDNLEDTIVLN W RKPIQSNGPIIYYILLRNGIERFRGTSLSFSDKEGIQPFQEYSYQLKAC T VAGCATSSKVVAAT
>21:3590-3676 87aa
PESILPPSITALSAVALHLSWSVPEKSNGVIKEYQIRQVGKGLIHTDTTDRRQHTVTGLQPYTNYSFTLTACTSAGCTSSEPFLGQT
>22*3677-3767 91aa S3743N uncertain pathogenicity
LQAAPEGVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPN S RYTYKLEVKTGGGSSASDDYIVQT
>23:3768-3862 95aa
PMSTPEEIYPPYNITVIGPYSIFVAWIPPGILIPEIPVEYNVLLNDGSVTPLAFSVGHHQSTLLENLTPFTQYEIRIQACQNGSCGVSSRMFVKT
>24*3863-3960 98aa G3895E
PEAAPMDLNSPVLKALGSACIEIKWMPPEKPN G IIINYFIYRRPAGIEEESVLFVWSEGALEFMDEGDTLRPFTLYEYRVRACNSKGSVESLWSLTQT
>25*3961-4062 102aa T3976M S4054I
LEAPPQDFPAPWAQA T SAHSVLLNWTKPESPNGIISHYRVVYQERPDDPTFNSPTVHAFTVKGTSHQAHLYGLEPFTTYRIGVVAANHAGEIL S PWTLIQTL
>26*4066-4150 85aa R4115C uncertain pathogenicity
PSGLRNFIVEQKENGRALLLQWSEPMRTNGVIKTYNIFSDGFLEYSGLN R QFLFRRLDPFTLYTLTLEACTRAGCAHSAPQPLWT
>27*4154-4258 105aa P4232R
PPDSQLAPTVHSVKSTSVELSWSEPVNPNGKIIRYEVIRRCFEGKAWGNQTIQADEKIVFTEYNTERNTFMYNDTGLQ P WTQCEYKIYTWNSAGHTCSSWNVVRT
>28*4265-4351 87aa T4337M
GLSPPVISYVSMNPQKLLISWIPPEQSNGIIQSYRLQRNEMLYPFSFDPVTFNYTDEELLPFSTYSYALQAC T SGGCSTSKPTSITT
>29*4356-4439 84aa T4425M T4439I
PSEVSPPDLWAVSATQMNVCWSPPTVQNGKITKYLVRYDNKESLAGQGLCLLVSHLQPYSQYNFSLVAC T NGGCTASVSKSAW T
>30*4444-4528 85aa Y4487C
PENMDSPTLQVTGSESIEITWKPPRNPNGQIRSYELRRDGTIV Y TGLETRYRDFTLTPGVEYSYTVTASNSQGGILSPLVKDRTS
>31*4529-4627 99aa Q4592H
PSAPSGMEPPKLQARGPQEILVNWDPPVRTNGDIINYTLFIRELFERETKIIHINTTHNSFGM Q SYIVNQLKPFHRYEIRIQACTTLGCASSDWTFIQT
>32:4633-4730 98aa
LMQPPPHLEVQMAPGGFQPTVSLLWTGPLQPNGKVLYYELYRRQIATQPRKSNPVLIYNGSSTSFIDSELLPFTEYEYQVWAVNSAGKAPSSWTWCRT
>33*4732-4825 94aa L4795R P4818L
PAPPEGLRAPTFHVISSTQAVVNISAPGKPNGIVSLYRLFSSSAHGAETVLSEGMATQQTLHG L QAFTNYSIGVEACTCFNCCSKG P TAELRTH
>34:4826-4927 102aa
PAPPSGLSSPQIGTLASRTASFRWSPPMFPNGVIHSYELQFHVACPPDSALPCTPSQIETKYTGLGQKASLGGLQPYTTYKLRVVAHNEVGSTASEWISFTT
>35:4928-5014 87aa
QKELPQYRAPFSVDSNLSVVCVNWSDTFLLNGQLKEYVLTDGGRRVYSGLDTTLYIPRTADKTFFFQVICTTDEGSVKTPLIQYDTS
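The header convention used in this listing (domain number, residue range, length, then any mutations, with the mutated residue set off by spaces in the sequence) lends itself to a mechanical consistency check. The following is a minimal sketch, assuming one header line followed by one sequence line per record; the function names and regular expressions are illustrative and not part of any existing tool. Two records are copied from the listing as sample input.

```python
import re

# Two records copied from the listing; mutated residues are flanked by spaces.
LISTING = """\
>20*3499-3585 87aa P3504T W3521R T3571M
PQGVS P PTWTKIDNLEDTIVLN W RKPIQSNGPIIYYILLRNGIERFRGTSLSFSDKEGIQPFQEYSYQLKAC T VAGCATSSKVVAAT
>22*3677-3767 91aa S3743N uncertain pathogenicity
LQAAPEGVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPN S RYTYKLEVKTGGGSSASDDYIVQT
"""

HEADER = re.compile(r">(\d+)[*:.](\d+)-(\d+) (\d+)aa(.*)")
MUTATION = re.compile(r"\b([A-Z])(\d+)([A-Z])\b")


def parse_listing(text):
    """Parse header/sequence line pairs into one dictionary per FN3 domain."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    domains = []
    for header, sequence in zip(lines[0::2], lines[1::2]):
        number, start, end, length, notes = HEADER.match(header).groups()
        domains.append({
            "domain": int(number),
            "start": int(start),
            "end": int(end),
            "length": int(length),
            "sequence": sequence.replace(" ", ""),  # drop the readability spaces
            "mutations": [(ref, int(pos), alt)
                          for ref, pos, alt in MUTATION.findall(notes)],
        })
    return domains


def consistency_problems(domain):
    """Return a list of mismatches between a header and its sequence."""
    problems = []
    if domain["end"] - domain["start"] + 1 != domain["length"]:
        problems.append("range and stated length disagree")
    if len(domain["sequence"]) != domain["length"]:
        problems.append("sequence and stated length disagree")
    for ref, pos, alt in domain["mutations"]:
        if domain["sequence"][pos - domain["start"]] != ref:
            problems.append(f"{ref}{pos}{alt}: reference residue not at that position")
    return problems


domains = parse_listing(LISTING)
for d in domains:
    print(d["domain"], d["mutations"], consistency_problems(d) or "consistent")
```

Run over the whole listing, the same parser would also give the counts of annotated sites and of domains carrying them.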
Known structural variations
There are no known issues with alternative splicing that would affect the fibronectin domain under consideration here. As noted earlier, a short version of the protein studied initially does not contain residue 3743 at all.
Structural significance
The 3D structure of the 22nd FN3 domain could be evaluated using best-blastp to a structurally determined FN3 domain in PDB, then modelling the FN3 domain in question by submitting it to SwissModel with both S3743 and N3743. Here the percent identity to an already-determined structure is mediocre but perhaps still sufficient.
If the serine at 3743 is on the surface and involved in a binding interaction with a second (unknown) protein, then the effect of the 3743N substitution would be very difficult to evaluate because asparagine and serine are of similar bulk and polarity. While S <--> N is a benign substitution at many positions in many proteins, at residue 3743 it appears that the hydroxyl lacking in asparagine is critical because, to the extent that any substitution at all is tolerated, it is threonine. (Bulk too may play a role because tyrosine is never seen.)
>pdb|1X5L|A Related structures Chain A, Solution Structure Of The Second Fn3 Domain Of Eph Receptor
Identities = 27/93 (29%), Positives = 44/93 (47%), Gaps = 16/93 (17%)
Query GVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNF----------TDKNLEPNSRYTYKLEVKTGGG
      V V R T+V L W PE+PNG++ +Y++ + E Q++ T L+P +RY +++ +T G
Sbjct QV-VVIRQERAGQTSVSLLWQEPEQPNGIILEYEIK-----YYEKDKEMQSYSTLKAVTTRATVSGLKPGTRYVFQVRARTSAG
Functional significance
Here human individuals homozygous for S3743N could be examined for early loss of hearing accompanied by initial loss of mid-peripheral and night vision. They need not be homozygous because compound mutations would suffice, that is, a different USH2A mutation on the other chr1 allele.
Alternatively, since mouse has a reasonably conserved orthologous fibronectin domain, the effect of S3743N could be considered as a knockin. Here, even if the mouse gene did serve as a disease model for other alleles, symptoms for S3743N might or might not develop within the 2-3 year lifespan of a laboratory mouse.
For the immediate term, comparative genomics is the best available guide. Here it is clear that S3743 is immensely conserved over several billion years of evolutionary time in those clades observable via genome projects (transcripts are too rare in this long gene to sample species diversity further). This establishes that N3743 is not part of the acceptable reduced alphabet at this residue, though T3743 at one time appears to have been acceptable in the teleost ancestor (and indeed is retained to the present day in early diverging deuterostomes).
The difference alignment below of the exon containing S3743 shows overall conservation well above human proteome average but not extraordinary inflexibility at most positions. The fibronectin portion is evolving as well, no doubt through drift, internal adaptive change, and co-evolutionary response to binding partner change.
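A difference alignment of this kind can be derived from an ordinary alignment by replacing every residue that matches the index species with a dot. The sketch below assumes equal-length, already aligned rows; the function name is illustrative. The three sample rows are copied from the exon alignment in this note.

```python
# Aligned exon rows copied from the comparative alignment in this note.
ROWS = {
    "homSap": "GVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR",
    "gorGor": "GVWVTPRHIIINSTTVELYWSLPEKPNGLISQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR",
    "musMus": "GVWVTPRHIIINSTTVELYWNPPERPNGLISQYQLRRNGSLLLVGGRDNQSFTDSNLEPGSR",
}


def difference_row(reference, other):
    """Show only residues that differ from the reference; matches become dots."""
    if len(reference) != len(other):
        raise ValueError("rows must be aligned to the same length")
    return "".join("." if a == b else b for a, b in zip(reference, other))


reference = ROWS["homSap"]
print("homSap", reference)
for species in ("gorGor", "musMus"):
    print(species, difference_row(reference, ROWS[species]))
```

Printing the index species in full and every other row as differences makes invariant columns, such as the one at S3743, stand out as unbroken runs of dots.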
Consequently S3743N -- despite its innocuous-appearing nature (i.e. high Dayhoff matrix score) -- is likely to have significant non-adaptive impacts on either the standalone structure of USH2A protein or its interaction with other proteins in the basement membrane. If selective pressure did not exist to maintain S3743, then what would account for its constancy despite copious variation in nearby residues over the same time span?
The large number of known loci throughout this protein that give rise to Usher Syndrome 2A suggests that not only does this protein play an exceedingly important structural role highly sensitive to seemingly minor mutation perturbations but also that no other gene product is able to compensate for its absence.
Of the 22 known disease-causing point mutations that lie within an FN3 domain, none is situated at a position homologous to S3743; the closest are P1212L at -3 of the 2nd FN3 domain, T3571M at +10 of the 20th FN3 domain, and T4425M at +10 of the 29th FN3 domain.
In a sense, the real mystery with USH2A is how nearly all of its 33 FN3 domains could be so mission-critical that a slight perturbation in one cannot be compensated for by the strength of interaction in the remaining 32. The answer may be in the observation that this is not a disease noticed at birth but one that develops over a decade for hearing and two or more for vision, sensory systems where we are perhaps more aware of functional loss than in other basement membranes of USH2A expression.
Speculatively, USH2A protein may not be replenished over the lifespan in the basement membranes of these terminally differentiated cell types, and slight dysfunctionalities might lead to slight enhancements in turnover rate, over decades leading to excessive loss of cell matrix structure and perhaps death or inability of the hosting cell to carry out its other functions. The pattern of vision loss, first mid-peripheral and then rod vision, may have some information about the structure of the retina and its support cells vis-a-vis USH2A expression in Bruch membranes.
Comparative genomics
The alignments below show the orthologous exon from 46 species. While no variation at S3743 occurs at any mammal or bird USH2A, lizard is possibly anomalous with asparagine in its best matching FN3 domain as are some fish with arginine and early-diverging deuterostomes and cnidaria with threonine.
However the lizard situation is bioinformatically uncertain because the 3 exons centering on 3743 are missing from the UCSC genome assembly upon whole USH2A blat, whereas the best matching domain is present in AAWZ01000661 upon tblastn at wgs, with the asparagine supported by 4 raw trace reads. The putative relevant exon itself is unexpectedly diverged, causing it to fall to the bottom of the alignment tree in conflict with phylogenetic position. It further has an unusual one-residue deletion 6 amino acids prior to 3743.
Consequently the Anolis feature may not represent the orthologous exon of a functioning gene copy. However it provides support for the idea that some fibronectin domains in some species can tolerate asparagine at paralogous position. Thus while N3743, if valid, detracts only mildly from the story of invariant S3743 (with T3743 tolerated), the divergence time with mammals is some 310 myr ago.
The arginine anomaly in four teleost fish but not zebrafish cannot be a read or assembly error. S3743R is not at all a conservative substitution. Parsimoniously, it represents a single event in a late-diverging clupeomorph fish. Since it has persisted in descendant lineages, it may represent adaptive change. Note shark has T3743 in the alignment shown here, as do amphioxus and sea urchin. Lamprey genome is incomplete here.
In summary, S3743 has been fixed for billions of years of branch length within mammals and beyond. The reduced alphabet here is very restricted (outside of rapidly evolving teleost fish) with T3743 probably ancestral and nearly the full extent of admitted variation. Note asparagine codons, like threonine codons, lie a single base change away from serine codons (a transition for asparagine, a transversion for threonine) and so need no second mutational step with its intermediate barrier. Asparagine is thus mutationally accessible yet not observed, which implies that a small amino acid with hydroxyl at this position is critical to proper functionality of USH2A. Hydrogen-bonding capability (e.g. asparagine) is likely not sufficient in a substituent for serine.
Thus S3743N, though it could be an adaptive functional innovation, is most likely a maladaptive mutation. The symptoms of Usher Syndrome 2A are the likeliest outcome in the homozygote given the situation at the many known other disease alleles, though the penetrance and age of onset remain unpredictable.
As a cautionary note, a distinction must be observed between significant impact to normal function and significant impact to fitness. For example, sickle-cell hemoglobin evidently disrupts normal protein function, yet it adds to fitness (malarial resistance) in the heterozygote. Here allele population statistics are illuminating. Prion disease complicates that: amyloid and dementia surely add to neither normal function nor fitness, yet age of onset of familial CJD is so late that harmful alleles rarely came into play during lifespans typical of almost all of human evolution. This has allowed certain lethal alleles to attain substantial frequencies through founder effect and drift.
............................................................^. hatch marks S3743 site USH2A_homSap GVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR USH2A_panTro GVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR USH2A_gorGor GVWVTPRHIIINSTTVELYWSLPEKPNGLISQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR USH2A_ponAbe GVWVTPRHIIINSTTVELYWSLPEKPNGLISQYQLSRNGNLLFLGGSEKQNFTDKNLEPNSR USH2A_nomLeu GVWVTPRHIIINSTTVELYWSLPEKPNGLISQYQLSRNGNLLFLGGSEEQKFTDKNLEPNSR USH2A_macMul GVWVTPRHIIINSTTVELYWSLPEKPNGLISQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR USH2A_calJac GVWVTPRHIIINSTTVELYWSLPEKPNGLISQYQLSRNGNLLFLGGSEEQNFTDKNLQPNSR USH2A_tarSyr GVWVTPRHIIINSTTVELYWSPPEKPNGLISQYQLSRNGTLLFLGGSEEQNFTDKNLEPHSR USH2A_micMur GVWVTPRHIIINSTTVELYWSPPEKPNGLISQYQLRRNGTLLFLGGSEEQNFTDKNLEPNSR USH2A_tupBel GVWVTPRHIIINSTTVELYWSLPKKPNGLISQYQLSRNGTLLFLGGSEEQNFTDKNLEPDSR USH2A_musMus GVWVTPRHIIINSTTVELYWNPPERPNGLISQYQLRRNGSLLLVGGRDNQSFTDSNLEPGSR USH2A_ratNor GVWVTPRHIIINSTTVELYWNPPERPNGVISQYRLRRNGSLLLVGGRDDQSFTDKNLEPNSR USH2A_dipOrd GVWVTPRHIIINSTAVELYWSPPEKPNGLISQYQLSRNGSVLFLGGREEQMFTDTNLEPNSR USH2A_cavPor GVWVTPRHTVINSTSVELYWSPPEKPNGLISQYRLSRNGTLLFVGGGEEQNFTDKHLEPNSR USH2A_speTri GVWVTPRHMIINSTTVELYWSPPEKPNGLISQYQLSRNGTLLLLGGSEERNFTDKHLEPNSR USH2A_oryCun GVWVTPRHIIINSTTVELYWTPPEKPNGLISQYQLNRNGIVVFLGGSKEQNFTDRNLKPNSR USH2A_ochPri GVWVSPRHIVINCTAVILYWSPPEKPNGIISQYQLIRNETVLYLGSGKEQNFTDGNLEPNSR USH2A_vicPac GVWVTPRHIIINPTTVELYWSPPEKPNGLISQYQLSRNGTLVFLGGSEEQNFTDKNLEPNSR USH2A_susScr GVWVTPRHIIVNSTTVELYWSLPEKPNGLISQYQLSRNGTVVFLGGSEERNFTDKNLEPNSR USH2A_turTru GVWVTPRHIIINSTTVELYWSLPEKPNGLISQYQLSRNGSLVFLGGSEEQNFTDKNLEPNSR USH2A_bosTau GVWVTPRHIVVNSTTVELFWSPPEKPNGLVSQYQLSRNGSLIFLGGSEEHNFTDKNLEPNSR USH2A_equCab GVWMTPRHIIINSTTVELYWSPPENPNGLISQYQLSRNGTLVFLGGSEEQNFTDKNLEPNSR USH2A_felCat GVWVTPRHIIINSTTVELYWSPPEKPNGLISQYQLSRNGTLVFLGGNEEQNFTDKNLEPNSR USH2A_canFam GVWVTPRHIIINSTTVELYWNPPEKPNGLISQYQLSRNGTLVFLGGSEEQNFTDKNLEPNSR USH2A_myoLuc GVWATPRHIIINATAVELYWRPPERPNGLISRYQLIRNGTSVFLGGSEDQHFTDHNLAPNSR USH2A_pteVam 
             GVWVTPQHIIINSTAVELCWSPPEEPNGLISQYRLSRDGNLVFLAGAEEHCFTDKNLEPNSR
USH2A_loxAfr GVWLTPRHIIINPTTVELYWSQPEKPNGLISRYHLRRNGTLVLLGGSEEQNFTDKNLEPNSR
USH2A_proCap GVWMTPRHIVINSTTVELHWSLPEKPNGHISQYRLRRNGTLVFQGGGEEQNFTDTNLEPNSR
USH2A_dasNov GVWVTPGHIIINSTTVELYWSQPEKPNGLISHYQLSRNGTLIFLAGREEQSFTDKNLEPNSR
USH2A_choHof GVWVTPQHIIINSTTVELYWSQPEKPNGLISQYQLSRNGTSVFQGGREEQHFTDKNLEPSSR
USH2A_monDom GVWSIPRHIIINSTTVELYWNEPEKPNGLISKYQLHRNGTVIFLGGREDQNFTDDSLEPKSS
USH2A_ornAna GVWSKPQHITVSSTTVELYWSQPEKPNGVISQYRLIRNGTEIFAGTRDSLNFTDDSLESNSR
USH2A_galGal GVWPKPHHIIVSSTEVEIYWSEPEIPNGLITQYRLFRDEEQIFLGGSRDLNFTDVNLQPNSR
USH2A_taeGut GVWPKPHHIIVSSTEVEMYWSEPEEPNGLITHYRLFRDGEQIFLGGSTARNFTDVNLQPNSR
USH2A_anoCar GVWSQPRHVIVSSKIVELYWDEPEEPNGIISLYRLFRNGEEIFMGGELNLNFTD-TVQPNNR  4 traces, not in assembly
USH2A_xenTro GVWSNPYHVTINESVLELYWSEPETPNGIVSQYRLILNGEVISLRSGECLNFTDVGLQPNSR
USH2A_tetNig GVWSKPRHLTVNASAVELHWDPPQQPQGLVSQYRLKRDGRAVFTGDHLQRNYTDAGLQPQRR
USH2A_takRub GVWSKPRHLIVTTAVVELYWDPPQQPHGHISQYKLKRDGQTVFTGDHDDQNYTDTGLRPHRR
USH2A_gasAcu GVWSSPRHVVINTSAVELYWDQPLQPNGHISQYRLNRDGDTIFTGDHREQNYTDTGLLPNRR
USH2A_oryLat GVWSKPRHLIINTSAVELYWDQPSQPNGLISQYRLIRDGLTVFTGARRDQNYTDTGLEPKRR
USH2A_danRer GVWSMPRHIQLNSSAVELHWSDPLKLNGLLSGYRLLRDGELVFTADGGKMSYTDAGLQPNTR
USH2A_calMil GIWPKPCHVIVNSSTVELYWTEPEKPNGIITQFRLLRDNAVIYTGTRRNRNYTDAGLQPDTR
USH2A_braFlo QEVSRPRFVVVSSTEIEVYWSEPGRPNGIITQYQLVRDGSVIYSGG--DMNFTDSGLTPSTT  XM_002214612 aligns over 2807 aa
USH2A_strPur EGLMQPTHVVVSSTILELYWFEPSQPNGVITSYILYRDDELVYSGNNSVLTYVDTGLTPNTR  XM_788345 aligns over 5030 aa
USH2A_nemVec SQQPAPVITVSSSRRLDLAWSPPDNPNGIILRYELYRNGTEVYRG--VIRGYNDTNLQPDTL  XM_001638773 aligns over 3005 aa
USH2A_hydMag SQQGAPFVLFQTSRLINIGWFPPDNLNGILIKYELYRDRTKIFVG--LDNNYTDNNLKPYTY  XM_002165140
             ............................................................^.

USH_homSap   GVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR
USH_panTro   ..............................................................
USH_gorGor   .............................I................................
USH_macMul   .............................I................................
USH_calJac   .............................I...........................Q....
USH_ponAbe   .............................I..................K.............
USH_nomLeu   .............................I....................K...........
USH_turTru   .............................I.........S.V....................
USH_tupBel   .......................K.....I.........T...................D..
USH_susScr   ..........V..................I.........TVV.......R............
USH_tarSyr   .....................P.......I.........T...................H..
USH_micMur   .....................P.......I.....R...T......................
USH_vicPac   ............P........P.......I.........T.V....................
USH_felCat   .....................P.......I.........T.V....N...............
USH_canFam   ....................NP.......I.........T.V....................
USH_equCab   ...M.................P..N....I.........T.V....................
USH_bosTau   .........VV.......F..P.................S.I.......H............
USH_cavPor   ........TV....S......P.......I...R.....T...V..G........H......
USH_speTri   ........M............P.......I.........T..L......R.....H......
USH_oryCun   ....................TP.......I.....N...IVV.....K......R..K....
USH_dipOrd   ..............A......P.......I.........SV.....R...M...T.......
USH_loxAfr   ...L........P........Q.......I.R.H.R...T.VL...................
USH_dasNov   ......G..............Q.......I.H.......T.I..A.R...S...........
USH_choHof   ......Q..............Q.......I.........TSV.Q..R...H........S..
USH_proCap   ...M.....V........H.........HI...R.R...T.V.Q..G.......T.......
USH_musMus   ....................NP..R....I.....R...S..LV..RDN.S...S....G..
USH_ratNor   ....................NP..R...VI...R.R...S..LV..RDD.S...........
USH_myoLuc   ...A........A.A.....RP..R....I.R...I...TSV......D.H...H..A....
USH_pteVam   ......Q.......A...C..P..E....I...R...D...V..A.A..HC...........
USH_monDom   ...SI...............NE.......I.K...H...TVI....R.D.....DS...K.S
USH_ochPri   ....S....V..C.A.I....P......II.....I..ETV.Y..SGK......G.......
USH_ornAna   ...SK.Q..TVS.........Q......VI...R.I...TEI.A.TRDSL....DS..S...
USH_galGal   ...PK.H...VS..E..I...E..I....IT..R.F.DEEQI.....RDL....V..Q....
USH_taeGut   ...PK.H...VS..E..M...E..E....ITH.R.F.D.EQI.....TAR....V..Q....
USH_xenTro   ...SN.Y.VT..ESVL.....E..T...I....R.IL..EVIS.RSG.CL....VG.Q....
USH_anoCar   ...SQ...V.VS.KI.....DE..E...II.L.R.F...EEI.M..ELNL....-TVQ..N.
             ............................................................^.
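The second alignment block above writes each species as differences from the human reference row, with a dot marking identity. As a minimal sketch of that convention, the Python below converts between full and dot notation using the human and loxAfr rows copied from the alignment. The function names to_dot and from_dot are illustrative only and do not belong to any tool used on this page.

```python
def to_dot(reference: str, seq: str) -> str:
    """Replace every residue identical to the reference with '.'."""
    if len(reference) != len(seq):
        raise ValueError("aligned rows must have equal length")
    return "".join("." if r == s else s for r, s in zip(reference, seq))


def from_dot(reference: str, dotted: str) -> str:
    """Restore the full aligned row from its dot notation."""
    if len(reference) != len(dotted):
        raise ValueError("aligned rows must have equal length")
    return "".join(r if d == "." else d for r, d in zip(reference, dotted))


# Rows copied from the alignment above.
HOM_SAP = "GVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR"
LOX_AFR = "GVWLTPRHIIINPTTVELYWSQPEKPNGLISRYHLRRNGTLVLLGGSEEQNFTDKNLEPNSR"

dotted = to_dot(HOM_SAP, LOX_AFR)
print(dotted)

# Round trip: the dot row plus the reference recovers the original row.
restored = from_dot(HOM_SAP, dotted)
print(restored == LOX_AFR)
```

The dot form makes lineage-specific substitutions stand out by eye, while the round trip shows that no information is lost as long as the reference row is kept alongside.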
Allele assessment by PolyPhen, Sift etc. is unsatisfactory for S3743N of USH2A
Note that the popular SNP evaluation tool PolyPhen is not helpful here. It builds its alignment from blastp matches in SwissProt rather than in NCBI wgs (whole genome shotgun) data, so it finds a meagre 6 homologs for USH2A versus the 46 collected here, and consequently estimates the reduced alphabet at position 3743 poorly. Orthology is assumed, not established. This is highly problematic, since the matches could include lineage-specific gene duplicates that have neofunctionalized, with divergent constraints on the critical residue.
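The idea of a reduced alphabet at one alignment column can be sketched in a few lines of Python. The rows below are a subset copied from the alignment above, chosen only for brevity; the column is located from the caret in the alignment's marker row. The name column_alphabet is illustrative and not taken from any existing tool.

```python
from collections import Counter

# A subset of rows copied from the alignment above (subset chosen for brevity).
ROWS = {
    "homSap": "GVWVTPRHIIINSTTVELYWSLPEKPNGLVSQYQLSRNGNLLFLGGSEEQNFTDKNLEPNSR",
    "loxAfr": "GVWLTPRHIIINPTTVELYWSQPEKPNGLISRYHLRRNGTLVLLGGSEEQNFTDKNLEPNSR",
    "monDom": "GVWSIPRHIIINSTTVELYWNEPEKPNGLISKYQLHRNGTVIFLGGREDQNFTDDSLEPKSS",
    "galGal": "GVWPKPHHIIVSSTEVEIYWSEPEIPNGLITQYRLFRDEEQIFLGGSRDLNFTDVNLQPNSR",
    "xenTro": "GVWSNPYHVTINESVLELYWSEPETPNGIVSQYRLILNGEVISLRSGECLNFTDVGLQPNSR",
    "tetNig": "GVWSKPRHLTVNASAVELHWDPPQQPQGLVSQYRLKRDGRAVFTGDHLQRNYTDAGLQPQRR",
    "danRer": "GVWSMPRHIQLNSSAVELHWSDPLKLNGLLSGYRLLRDGELVFTADGGKMSYTDAGLQPNTR",
    "calMil": "GIWPKPCHVIVNSSTVELYWTEPEKPNGIITQFRLLRDNAVIYTGTRRNRNYTDAGLQPDTR",
}

# Marker row from the alignment: the caret flags the column of S3743.
MARKER = "." * 60 + "^" + "."


def column_alphabet(rows, column):
    """Count the residues observed at one zero-based alignment column."""
    return Counter(seq[column] for seq in rows.values())


column = MARKER.index("^")
alphabet = column_alphabet(ROWS, column)
for residue, count in alphabet.most_common():
    print(residue, count)
```

The more homologs the tally includes, the better it reflects what the position actually tolerates, which is the point of working from 46 sequences rather than 6.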
Sequences are not considered in their phylogenetic tree context, so the cumulative branch length (evolutionary time) over which a residue has been conserved is unavailable as a metric. Experimental literature on fibronectin domains and their binding partners is ignored, and the residues found at positions paralogous to S3743 in other FN3 domains, both within USH2A and in other proteins, are not taken into account.
The known Usher 2A Syndrome causative alleles that lie in other USH2A fibronectin domains could be scored with PolyPhen to assess its accuracy (typically about 70% might be correctly identified). In this particular case, none of them lies in the 22nd FN3 domain and none lies at a position homologous to S3743. However, 3 do lie in the conserved patch around 3743. The tendency of residues important to normal function to occur in patches has not yet been systematically evaluated but may be useful in allele interpretation.
PolyPhen thus treats S3743N as borderline benign, based primarily on the general innocuousness of S->N. Its algorithm proceeds with neither a valid description of the reduced alphabet at this position (S,T: hydroxyl) nor knowledge of the subsequent fixation of S in amniotes. It is inappropriate to use BLOSUM substitution matrices here: they are broad averages over unrelated proteins and so are greatly enriched for bland substitutions such as S->N, which are indeed generally neutral (by design of the genetic code). PolyPhen's rules require 50% identity to the PDB match and an exact S at the aligned position, whereas 25% identity is already sufficient and an S/T match is quite satisfactory, so critical structural information is discarded.
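The contrast between a generic substitution matrix and a position-specific alphabet can be made concrete with a short Python sketch. The three scores are the standard BLOSUM62 entries for serine. The membership test against the alphabet {S, T} described above is only an illustrative stand-in for curated, position-specific evidence; it is an assumption of this sketch and not how PolyPhen or SIFT computes its score.

```python
# Standard BLOSUM62 entries for serine against S, T and N (excerpt only).
BLOSUM62_FROM_S = {"S": 4, "T": 1, "N": 1}

# Reduced alphabet at position 3743 as described in the text (hydroxyl side chains).
OBSERVED_AT_3743 = frozenset("ST")


def generic_score(mutant: str) -> int:
    """Score a change away from serine with the generic matrix excerpt."""
    return BLOSUM62_FROM_S[mutant]


def within_observed_alphabet(mutant: str, observed=OBSERVED_AT_3743) -> bool:
    """Illustrative check: is the mutant residue one already seen at this column?"""
    return mutant in observed


for mutant in ("T", "N"):
    print(
        f"S->{mutant}: BLOSUM62 = {generic_score(mutant)}, "
        f"within observed alphabet = {within_observed_alphabet(mutant)}"
    )
```

The matrix gives S->T and S->N the same score, so on its own it cannot separate a change that stays within the observed alphabet from one that leaves it.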
In short, software such as SIFT and PolyPhen fails to use all available information, always a bad idea in feature annotation. Such tools have value for quickly screening millions of alleles for glaring anomalies, but they are not particularly useful for specific genes, because curatorial judgement drawing on more extensive information will always outperform them. This is illustrated above by the comparative genomics products that are not available to these tools yet are very useful in making the best possible annotation call. It is exceedingly important to make all-out use of bioinformatics in view of the high costs and delays of experimental validation.