Sulfatase evolution: ARSK: Difference between revisions

From genomewiki

Revision as of 15:13, 4 September 2010

Introduction to sulfatases

Sulfatases are an old and deeply diverged family of hydrolases that remove sulfate moieties from a variety of small and large molecules. Despite the apparent simplicity of this reaction, the sulfatase domain fold is perhaps the largest known for any enzyme, and an unprecedented formylglycine post-translational modification of an encoded cysteine, serine or threonine is critical to activity. The fold is closely related to that of alkaline phosphatases, though primary sequence alignability has almost completely dissipated.

The 17 human paralogs reside either in lysosomes or the endoplasmic reticulum. Mutations in these genes result in diseases that provide important clues as to natural substrates (which accumulate in lysosomal storage diseases). However, only 8 of the 17 genes have an associated disease at OMIM as of Sept 2010. Functions of the remaining sulfatases have yet to be discovered, perhaps because the accumulating metabolite is not toxic or has an alternative catabolic pathway. Such diseases could be recessive and hence rare in the case of unassigned autosomal sulfatases.

ARSK is such a gene. First described in 2003 as SULFX, the substrate and function of ARSK remain unknown -- it has not yet been the focus of a single experimental paper. ARSK is a fairly typical sulfatase of 536 amino acids encoded by eight exons on human chr 5 with a conventional CPSRA formylglycine motif. It lacks overt membrane insertional regions and GPI terminal motif so is presumably soluble.

ARSK is however peculiar in several bioinformatic respects. Although clearly a full-length duplicate of an ancestral sulfatase, its opaque evolutionary relationship to non-orthologous sulfatases makes it difficult to place in the sulfatase gene tree. Its closest affinity (percent identity in the low 20s) is perhaps with IDS, which removes the sulfate from iduronate, though the ARSK substrate may have drifted off to something else entirely during the 600+ million years since gene duplication.

A second unusual feature is that the 7 introns within the coding region of ARSK do not bear any relationship in position or phase to those of other human sulfatases. This suggests that the gene duplication event leading to ARSK and other sulfatases preceded the main era of gene intronation, i.e. sulfatases initially had no introns (as in bacterial genes) and were subsequently independently intronated in early eukaryotes. Once established, the introns of ARSK have been stable over billions of years of gene tree branch length.

The phylogenetic distribution of ARSK also raises many questions. Within deuterostomes, orthologs are readily located in representatives of all major subclades with the exception of echinoderms and tunicates. ARSK has evolved quite conservatively here, with the human protein still having 54% and 52% identity over 500 residues to Branchiostoma (amphioxus) and Saccoglossus (acornworm) respectively, despite divergences that preceded the Cambrian. Intron positions and phases are precisely preserved beyond two minor fission events, leaving no doubt of orthology within deuterostomes. However, ARSK is otherwise completely missing from other eumetazoans (ecdysozoa, lophotrochozoa, and cnidaria). Also missing from Trichoplax and sponge genomes, it makes its last eukaryotic appearance (two diverged paralogs) in Monosiga, a marine choanoflagellate, before fading into fungal and bacterial sequences of uncertain affinities.
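As a rough illustration of how identity figures of this kind are computed, the sketch below scores ungapped percent identity over a single exon. It assumes the third coding exon of human and amphioxus ARSK as listed in the reference sequences on this page; these two exons happen to have the same length, so no alignment step is needed. The function name percent_identity is purely illustrative. Whole-protein values such as those quoted above require a proper gapped alignment and will differ from any single-exon figure.

```python
def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Third coding exon, copied from the ARSK_homSap and ARSK_braFlo entries.
HUMAN_EXON3 = "AMWSGLFTHLTESWNNFKGLDPNYTTWMDVMERHGYRTQKFGKLDYTSGHHSIS"
AMPHIOXUS_EXON3 = "ALWSGLHTHVTQSWNNYKGLPKNYPTWQVRLEQQGYHTQVYGKTDYVSGDHSES"

identity = percent_identity(HUMAN_EXON3, AMPHIOXUS_EXON3)
print(f"exon 3 identity, human vs amphioxus: {identity:.1f}%")
```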

A final oddity of ARSK observed early on is its close proximity to an apparently unrelated gene, TTC37 (twenty tetratricopeptide repeats 37): only 144 bp separate the two genes. These are transcribed divergently and could well share a bidirectional promoter or overlap in 5' UTR. This relationship is by no means restricted to the human genes -- it is readily traced back throughout vertebrates. The putative chaperone function of TTC37 remains unspecified, though in June 2010 a disease was assigned to it: trichohepatoenteric syndrome (THES) -- an "autosomal-recessive disorder characterized by life-threatening diarrhea in infancy, immunodeficiency, liver disease, trichorrhexis nodosa, facial dysmorphism, hypopigmentation, and cardiac defects". This does not immediately suggest why ARSK and TTC37 should be so closely linked.

ARSK phylogenetic distribution

As noted above, ARSK has evolved with above-average conservation in deuterostomes and especially vertebrates. Its apparent loss in echinoderms and tunicates -- which in itself is not unusual -- might be overturned if more genomes were sequenced from these clades. Among all earlier diverging lineages, it occurs only in Monosiga, which requires several independent losses as the species tree topology now stands. Monosiga thus illustrates the importance of thorough genomic sampling.
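The count of independent losses can be sketched as a small parsimony exercise. The example below is only an illustration under stated assumptions: the species tree topology is a simplified guess (several of these branchings are debated), the gene is assumed present in the common ancestor of all listed taxa and never regained once lost, and the presence/absence calls are those given in the text. The names TREE, PRESENT and count_losses are hypothetical.

```python
# Assumed, simplified species tree as nested tuples (an illustration only).
TREE = (
    "Monosiga",
    (
        "sponge",
        (
            "Trichoplax",
            (
                "cnidaria",
                (
                    ("ecdysozoa", "lophotrochozoa"),
                    (
                        ("echinoderms", "hemichordates"),
                        ("cephalochordates", ("tunicates", "vertebrates")),
                    ),
                ),
            ),
        ),
    ),
)

# Taxa in which the text reports an ARSK ortholog.
PRESENT = {"Monosiga", "hemichordates", "cephalochordates", "vertebrates"}


def count_losses(node, present):
    """Return (gene_found_below, minimum_losses_below) for a subtree.

    A loss is charged once for each maximal subtree that lacks the gene
    while a sister subtree retains it (no regain allowed).
    """
    if isinstance(node, str):
        return (node in present, 0)
    results = [count_losses(child, present) for child in node]
    if not any(found for found, _ in results):
        return (False, 0)
    losses = sum(n for _, n in results)
    losses += sum(1 for found, _ in results if not found)
    return (True, losses)


found, losses = count_losses(TREE, PRESENT)
print(f"minimum independent losses under the assumed tree: {losses}")
```

Under this particular topology the two protostome clades share a single loss, so grouping choices directly change the total; a different tree would give a different count.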

However, even without the Monosiga sequences, it is certain that ARSK did not arise in deuterostomes by horizontal gene transfer (from bacteria), by gene duplication and rapid divergence from a sulfatase existing in the bilaterian ancestor, or de novo from, say, junk DNA. These options are all ruled out by its 7 immensely conserved GT-AG coding introns that do not resemble anything from these other sources. (A very high percentage of exons are precisely conserved between human and sponge, implying the main era of intron creation occurred much earlier.)

The Monosiga sequences below do have some problematic aspects. In part these arise from the poor quality GenBank pipeline models XM_001747506 and XM_001750805. The former even skips over a small exon containing the catalytic site motif CPSRT as well as omitting a long distal exon; the latter model has a half dozen macro errors. Missing exons can however be located using Blastx of the enveloping contig ABFJ01000822 against a small database of validated ARSK orthologs. Ambiguity nonetheless remains in earlier exons with weak homology; this can only be resolved by transcript sequencing.

Despite these uncertainties, it is clear that intron positions and phases do not match those of deuterostomes very well. In particular, the latter all have a phase 1 intron starting one residue after the CCPSR motif. This motif and the following WSG pattern are easily recognized in Monosiga but there is no possibility of a GT-AG intron in the anticipated position:

agcccctgtatgctgtcccagccgaacttcgacttggtcgggccgtcacgt (The cg of the tcg codon for the S that follows CCPSRT sits where deuterostomes have the splice donor GT.)
  A  P  V  C  C  P  S  R  T  S  T  W  S  G  R  H
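The reading frame and the missing donor site can be checked with a short sketch. It assumes the standard genetic code, that the displayed DNA is read starting from its second nucleotide (the frame that reproduces the peptide shown), and that "phase 1, one residue after the motif" means the intron would begin after the first nucleotide of the second codon following CCPSR. The helper names translate and candidate_donor are illustrative only.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Monosiga genomic DNA as displayed above.
DNA = "agcccctgtatgctgtcccagccgaacttcgacttggtcgggccgtcacgt"
FRAME_OFFSET = 1  # assumption: the peptide starts at the second nucleotide


def translate(dna: str, offset: int) -> str:
    """Translate complete codons of dna starting at the given offset."""
    seq = dna.upper()
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3)
    )


def candidate_donor(dna: str, offset: int, protein: str, motif: str = "CCPSR") -> str:
    """Dinucleotide where a phase 1 intron one residue past the motif would start."""
    motif_end = protein.index(motif) + len(motif)   # residue index just past the motif
    split_codon = offset + 3 * (motif_end + 1)      # skip one full residue
    return dna[split_codon + 1:split_codon + 3]     # after 1 nt of the split codon


protein = translate(DNA, FRAME_OFFSET)
donor = candidate_donor(DNA, FRAME_OFFSET, protein)
print(protein)
print(f"dinucleotide at the expected donor position: {donor}")
```

With these assumptions the dinucleotide at the expected position is cg rather than gt, which is the point made in the text.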

Despite unresolved intron issues, back-Blastp of Monosiga to the 17 sulfatases of human at GenBank gives ARSK_homSap as best match by a wide margin. When the target is all 'non-redundant' deuterostome sequences at GenBank, the best match is acornworm ARSK_sacKow followed by a long list of ARSK in other species. After these, much weaker matches to IDS sulfatases appear.

CholineSulfate.png

Curiously, when Blastp of Monosiga (or any bona fide ARSK) is restricted to sulfatase-rich bacteria, top matches are typically annotated as IDS-like or choline-sulfatases. This is consistent with a deep ancestral ARSK/IDS gene whose substrate was (and still is, in bacteria) choline sulfate. That can be inferred from the phylogenetic breadth of bacterial choline sulfatases.

After gene duplication and divergence, one copy may have changed its substrate over time in eukaryotes to iduronate (i.e. became IDS). The other copy became ARSK; its substrate today is unknown but might well still include choline sulfate. Later, this hypothesis goes, that molecule stopped being made in various lineages (e.g. arthropods) or was metabolized more effectively some other way, resulting in the subsequent loss of ARSK in those lineages (under the evolutionary principle of 'use it or lose it').


Overlap of ARSK transcription start with TTC37

Recent papers establish that the transcription factor TFEB regulates the expression of lysosomal proteins (e.g. ARSK) via a CLEAR element in their promoter. These elements reportedly occur in the ARSK promoter sequence at positions -272 (TTCACGTGAC), -296 (CGCATGCGCC) and -348 (CCCACCTGGA).

These motifs are fairly short, so showing that their occurrence is not accidental requires verification via conservation in comparative genomics, as they cannot plausibly have arisen in human. Indeed these motifs are deeply conserved in vertebrates, as is quickly seen from the 46-way alignment of vertebrate genomes at UCSC. The phastCons track already identifies their conservation as statistically significant on a genomewide scale.

Note the motif CCCACCTGGA is not quite right as the last nucleotide does not belong in the motif and similarly for the other two -- it is unlikely that TFEB has changed its binding specificity only in humans or great apes. These motifs are better represented by profiles than by absolute sequence requirements.

The three motif sequences are non-palindromic and so do not apply to TTC37 itself, which lies on the opposite strand from ARSK -- even though one of the motifs lies within a 5'UTR exon of the TTC37 gene and another immediately precedes its transcription start. It is not known if TTC37 has upstream regulatory regions of its own and where these lie relative to the start of ARSK. Thus the two genes are not able to evolve completely independently but the constraints may not be too severe.
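Whether a motif reads the same on both strands is easy to test by comparing it with its reverse complement. The sketch below applies that test to the three 10-mers quoted above; the function names are illustrative, and the position keys are simply the promoter coordinates given in the text.

```python
# Reported CLEAR-like motifs in the ARSK promoter, keyed by position.
MOTIFS = {
    -272: "TTCACGTGAC",
    -296: "CGCATGCGCC",
    -348: "CCCACCTGGA",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


for position, motif in MOTIFS.items():
    print(position, motif, reverse_complement(motif), is_palindromic(motif))
```

None of the full 10-mers equals its reverse complement, so as written they read differently on the TTC37 strand.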

Metazoan genes are not organized into transcriptional unit operons as in bacteria. They may be coordinately regulated but this is rarely accomplished via physical proximity on a chromosome. Although the normal function of TTC37 is not entirely clear from its disease phenotype and the natural substrate of ARSK sulfatase has not been established, it does not appear that the two genes have coordinated expression or related functions. Their proximity and divergent transcription from the short shared intergenic region may simply have arisen from a chromosomal reorganization in early vertebrates. Now the genes are slightly intertwined, meaning their separation by a further chromosomal rearrangement could be disadvantageous, causing the accidental adjacency to be conserved.


ARSKtranscription.png


ARSKtranscriptCons.png

Alignment of diverse ARSK sequences

ARSKalign.png

ARSK reference sequences

Only a sampler of vertebrate ARSK sequences is shown. The gene is present and well conserved in all 46 vertebrate genomes sequenced to date (reference sequences are pre-compiled at the proteinFasta link of the UCSC description page for ARSK). An ARSK ortholog is absent in all 30 sequenced non-deuterostome bilaterians as well as both cnidarian genomes, Trichoplax and sponge genomes, and numerous unicellular eukaryote genomes. Not all these genome assemblies have complete coverage but it is unlikely that all of a large 8 exon gene with conventional autosomal location (in vertebrates) would be missing so consistently.
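The entries below list one exon per row, flanked by start and end phase digits, with a 00 marker where two exons joined at a phase 0 intron share a row. A minimal sketch of reading this layout follows, assuming that the flanking digits count codon nucleotides carried across each intron (so end and start phases at a junction sum to 0 or 3) and that 00 is the only in-row marker; other markers, such as the one in the amphioxus entry, are not handled. The function name parse_exon_rows is illustrative, and the rows are copied from the human entry.

```python
# Rows of the ARSK_homSap entry: start phase, residues, end phase.
HUMAN_ARSK_ROWS = [
    "0 MLLLWVSVVAALALAVLAPGAGEQRRRAAKAPNVVLVVSDSF 0",
    "0 DGRLTFHPGSQVVKLPFINFMKTRGTSFLNAYTNSPICCPSRA 1",
    "2 AMWSGLFTHLTESWNNFKGLDPNYTTWMDVMERHGYRTQKFGKLDYTSGHHSIS 2",
    "1 NRVEAWTRDVAFLLRQEGRPMVNLIRNRTKVRVMERDWQNTDKAVNWLRKEAINYTEPFVIYLGLNLPHPYPSPSSGENFGSSTFHTSLYWLEK 00 VSHDAIKIPKWSPLSEMHPVDYYSSYTKNCTGRFTKKEIKNIRAFYYAMCAETDAML 1",
    "2 GEIILALHQLDLLQKTIVIYSSDHGELAMEHRQFYKMSMYEASAHVPLLMMGPGIKAGLQVSNVVSLVDIYPTML 1",
    "2 DIAGIPLPQNLSGYSLLPLSSETFKNEHKVKNLHPPWILSEFHGCNVNASTYMLRTNHWKYIAYSDGASILPQLF 1",
    "2 DLSSDPDELTNVAVKFPEITYSLDQKLHSIINYPKVSASVHQYNKEQFIKWKQSIGQNYSNVIANLRWHQDWQKEPRKYENAIDQWLKTHMNPRAV* 0",
]


def parse_exon_rows(rows):
    """Return a list of (start_phase, residues, end_phase), one per exon."""
    exons = []
    for row in rows:
        tokens = row.split()
        start, end = int(tokens[0]), int(tokens[-1])
        # A '00' token separates two exons joined at a phase 0 intron.
        segments = [t for t in tokens[1:-1] if t != "00"]
        for i, segment in enumerate(segments):
            seg_start = start if i == 0 else 0
            seg_end = end if i == len(segments) - 1 else 0
            exons.append((seg_start, segment.rstrip("*"), seg_end))
    return exons


exons = parse_exon_rows(HUMAN_ARSK_ROWS)
protein = "".join(residues for _, residues, _ in exons)
# Phases on either side of each intron should sum to 0 or 3.
phases_consistent = all(
    (prev[2] + nxt[0]) % 3 == 0 for prev, nxt in zip(exons, exons[1:])
)

print(len(exons), "exons,", len(exons) - 1, "introns,", len(protein), "residues")
print("phases consistent:", phases_consistent)
```

For the human rows this gives 8 exons, 7 introns and 536 residues, in line with the figures quoted in the introduction.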

>ARSK_homSap Homo sapiens (human) 536 aa 8 exons
0 MLLLWVSVVAALALAVLAPGAGEQRRRAAKAPNVVLVVSDSF 0
0 DGRLTFHPGSQVVKLPFINFMKTRGTSFLNAYTNSPICCPSRA 1
2 AMWSGLFTHLTESWNNFKGLDPNYTTWMDVMERHGYRTQKFGKLDYTSGHHSIS 2
1 NRVEAWTRDVAFLLRQEGRPMVNLIRNRTKVRVMERDWQNTDKAVNWLRKEAINYTEPFVIYLGLNLPHPYPSPSSGENFGSSTFHTSLYWLEK 00 VSHDAIKIPKWSPLSEMHPVDYYSSYTKNCTGRFTKKEIKNIRAFYYAMCAETDAML 1
2 GEIILALHQLDLLQKTIVIYSSDHGELAMEHRQFYKMSMYEASAHVPLLMMGPGIKAGLQVSNVVSLVDIYPTML 1
2 DIAGIPLPQNLSGYSLLPLSSETFKNEHKVKNLHPPWILSEFHGCNVNASTYMLRTNHWKYIAYSDGASILPQLF 1
2 DLSSDPDELTNVAVKFPEITYSLDQKLHSIINYPKVSASVHQYNKEQFIKWKQSIGQNYSNVIANLRWHQDWQKEPRKYENAIDQWLKTHMNPRAV* 0

>ARSK_canFam Canis familiaris (dog) NM_001048117
0 MLLLWLSVFAASALAAPDRGAGGRRRGAAGGWPGAPNVVLVVSDSF 0
0 DGRLTFYPGSQAVKLPFINLMKAHGTSFLNAYTNSPICCPSRA 1
2 AMWSGLFTHLTESWNNFKGLDPNYTTWMDIMEKHGYRTQKFGKLDYTSGHHSIS 2
1 NRVEAWTRDVAFLLRQEGRPMINLIPKKTKVRVMEGDWKNTDRAVNWLRKEASNSTQPFVLYLGLNLPHPYPSPSSGENFGSSTFHTSLYWLKK 00 VSYDAIKIPKWSPLSEMHPVDYYSSYTKNCTGKFTKKEIKNIRAFYYAMCAETDAML 1
2 GEIILALRQLDLLQNTIVIYTSDHGELAMEHRQFYKMSMYEASAHIPLLMMGPGIKANQQVSNVVSLVDIYPTML 1
2 DIAGAPLPQNLSGYSLLPLSSEMFWNEHKLKNLHPPWILSEFHGCNVNASTYMLRTNQWKYIAYSDGTSVLPQLF 1
2 DLFSDPDELTNIATKFPEVTYSLDQKLRSIINYPKVSASVHQYNKEQFIKWKQSVGQNYSNVIANLRWHQDWLKEPRKYESAINQWLKTPH* 0

>ARSK_monDom Monodelphis domestica (opossum) XM_001364779 (wrong N-terminus)
0 MPWWSLGVVLMVTTSADLALTAPALWAGGLEERGGPPNVVLVMSDSF 0
0 DGRLTFHPGNQTVALPFINFMKKRGTLFLNAYTNSPICCPSRA 1
2 AMWSGLFTHLTESWNNFKGLDQNYTTWMDLLQKYGYHTQKFGKLDYTSGHHSIS 2
1 NRVEAWTRDVDFLLRQEGRPMVNLIPNKMKTRIMEEDWQNTDKATNWLRKEAINFTQPFVLYLGLNLPHPYPSPYMGENFGASTFQTSPYWLER 00 VFYKAIKIPEWSPLSEMHPVDYYSSYTKNCTGQFTKKEIRDIRAFYYAMCAETDAML 1
2 GEIILTLHQLSLLQKTIVLFTSDHGELAMDHRQFYKMSMYEASSHIPLVMMGPGIKANLHIPDIVSLVDIYPTLL 1
2 DIAGIPLHQNLSGYSLIPLTSEAANNNSPAAMQRPPWILSEFHGCNVNASTYMLRIDKWKYIAYSDGISSPPQLF 1
2 DLSSDPDELTNIATRFPEITLSLDQKLRSIINYPRVSASVHQYNKRQFISWKDSLGQNYTEVIANLRWHQDWLKEPLKYENAINQWLKTNTNM* 0

>ARSK_galGal Gallus gallus (chicken) NM_001031415
0 MGSGGPLLLLRGLLLVGAAYCAAPRPPRHSSRPNVLLVACDSF 0
0 DGRLTFYPGNQTVDLPFINFMKRHGSVFLNAYTNSPICCPSRA 1
2 AMWSGLFTHLTESWNNFKGLDPDYVTWMDLMQKHGYYTQKYGKLDYTSGHHSVS 2
1 NRVEAWTRDVEFLLRQEGRPKVNLTGDRRHVRVMKTDWQVTDKAVTWIKKEAVNLTQPFALYLGLNLPHPYPSPYAGENFGSSTFLTSPYWLEK 00 VKYEAIKIPTWTALSEMHPVDYYSSYTKNCTGEFTKQEVRRIRAFYYAMCAETDAML 1
2 GEIISALQDTDLLKKTIIMFTSDHGELAMEHRQFYKMSMYEGSSHVPLLVMGPGIRKQQQVSAVVSLVDIYPTML 1
2 DLARIPVLQNLSGYSLLPLLLEKAEDEVPRRGPRPSWVLSEFHGCNVNASTYMLRTDQWKYITYSDGVSVPPQLF 1
2 DLSADPDELTNVAIKFPETVQSLDKILRSIVNYPKVSSTVQNYNKKQFISWKQSLGQNYSNVIANLRWHQDWLKEPKKYEDAIDRWLSQREQRK* 0

>ARSK_xenLae Xenopus laevis (frog)
0 MIQKCIALSLFLFSALPEDNIVRALSLSPNNPKSNVVMVMSDAF 0
0 DGRLTLLPENGLVSLPYINFMKKHGALFLNAYTNSPICCPSRA 1
2 AMWSGLFPHLTESWNNYKCLDSDYPTWMDIVEKNGYVTQRLGKQDYKSGSHSLS 2
1 NRVEAWTRDVPFLLRQEGRPCANLTGNKTQTRVMALDWKNVDTATAWIQKAAQNHSQPFFLYLGLNLPHPYPSETMGENFGSSTFLTSPYWLQK 00 VPYKNVTIPKWKPLQSMHPVDYYSSYTKNCTAPFTEQEIRDIRAYYYAMCAEADGLL 1
2 GEIISALNDTGLLGRTYVVFTSDHGELAMEHRQFYKMSMYEGSSHIPLLIMGPRISPGQQISTVVSLVDLYPTML 1
2 EIAGVQIPQNISGYSLMPLLSASSNKNVSPSISVHPNWAMSEFHGSDANASTYMLWDNYWKYVAYADGDSVAPQLF 1
2 DLSSDPDELTNVAGQVPEKVQEMDKKLRSIVDYPKVSASVHVYNKQQFALWKASVGANYTNVIANLRWHADWNKRPRAYEMAIEKWIKSTRQH* 0

>ARSK_takRub Takifugu rubripes (fugu) 8 exons 504 aa single copy retained after whole genome duplication
0 MSVKLSALILLFLAFHQVLARNRTRPNFLVVMSDAF 0
0 DGRLTFDPGSKVVKLPFINYLRELGVTFINAYTNSPICCPSRA 1
2 AMWSGQFVHLTQSWNNYKCLDANATTWMDLLEVNGYLTKMMGKLDYTSGSHSvs 0
1 NRVEAWTRDVQFLLRQEGRPVTQLVGNMSTVRIMGKDWENIDKATQWIQQRAESSQQPFALYLGLNLPHPYKTESLGPTAGGSTFRTSPHWLEK 00 VSSEHVTVPKWLPGAAMHPVDFYSTFTKNCSGFFTEEEIMNIRAFYYAMCAEADAML 1
2 GQLISALRETHLLNNTVVIFTADHGELAMEHRQFYKMSMFEGSSHVPLLFMGPGLMSGVEADQLVSLVDIYPTVL 1
2 DLADVPPVGSLSGYSLLPLLSTCSSCPGRPHPDWVLSEYHGCNANASTYMLRSGRWKYIAYADGLRVPPQLF 1
2 DMILDKEELHNVVFKFSEVSAQLDKLLRSIVHYPEVSAAVHRYNKESFVAWRHTLGRNYSQVISSLRWHVDWQRNPLANERAIDEWLYGSF* 0

>ARSK_braFlo Branchiostoma floridae (amphioxus) XM_002594507
0 MRMKLDCSAGFLLFWWFTSAVGGTRDDRKNIVFVICDSM 0
0 DGRLIGRGQDSVVDLPNLNYMVQNGVNFRSTYTNSPICCPSRS 1
2 ALWSGLHTHVTQSWNNYKGLPKNYPTWQVRLEQQGYHTQVYGKTDYVSGDHSES 2
1 NRVEAWTRNVNFTLAQEGRPTPVLV 12 GSSSTDRIQLKDWASTDLASHWLLHEAPKQQKPWLLYLGLNLPHPYPTPSMGKNFGGSTFMTSPYWLKK VNSSKVTIPKWLPFSRMHPVDYYSSATKNCTSDFTRDEIMKIREYYYGMCAETDAML 1
2 GQVLDALKASGQADSTYVFFTSDHGELAMEHRQFYKMSMYEASAHIPMVLTGPEVPAGKAVDDLTSLVDVFPTFM 1
2 DIANASQPPGLNGTSLLPLLRNSSDRVDRPDWVLSQYHGCNVNMSTYMLRTGSLKYVAFGDGPNQVSSQLF 1
2 DLDKDPDELHNLAEERQDLASQLDDKLRKLVDYPTVTREVQKYNRDSFMAWKAKLGSRYKDEIANLRWWKDWQKDPQGNQEKVEEWLNNVVS* 0

>ARSK_sacKow Saccoglossus kowalevskii (acornworm) XM_002732823
0 MFSMMQSSILITVLLFTCTCIPRGNEGKPNNVLFIICDAM 0
0 DGRLVGNNLTAVNMANINNRLVSHGVTFTNAYTNSPICCPSRS 1
2 ALWSGLYTHITHAWNNHEGLPADYPTWKIKLEKAGYDSKILGKTDYVSGRHTLS 2
1 NRVEAWTRNVNFTLAQEGRPTPVLVGNKTTIRVKDVDWDNIDKAKDWLENRKSSKATKPFLLYIGINLPHPYSTPGEGEHPGGSTFMTSPYW LQYVDMSKVTIPKWTPLDKMHPVDYYESATKNCTSHFTKDEIRKIRAYYYGMCAEVDGMV 1
2 GEILDQLDSLGLTNTTQVIFTSDHGEMAMEHRQFYKMTMYEASSHVPLIITNPTVPSRQGVAVNDPVSLVDIFPTLM 1
2 DMAAIHHPVGLNGTSLMPYLEGKSHVKKPDWVLSQYHGCNVNMSAYMLRRQEWKYITYGNGKQVAPHLF 1
2 NLDEDPDELHDYANERHDIIAEMDNKLRSIIDYADITNEVSRYNKESFSSWKTSIGDKYSDTIANLRWWKDWQKDPNGNEQRIEEWLKSVE* 0

>ARSK1_monBre Monosiga brevicollis (choanoflagellate) ABFJ01000822 (XM_001747506: bad gene model)
0 MGNPIRGGSLLIVAASLLVCATLGTAKQPNILFVIDESTDAKAYFAKNPEKAPMPLPNLR 0
0 VPAHVMNSYHYHR 1  
2 PPVCCPSRTSTWSGRH 0
0 FVTGAWNNYEGLPE 0
0 NYDLKYSDVLHKGGYNVGIFGKTDFTAGGHTVDARVTAWTNKVNFPFTLQNGSAGWYDETGPLVRTVNVSK 0
0 VVHVSDWNHANQTAKFIADAATHDEPWLAYVGFDIVHP 1
2 NYVSSPYWLDQVDMDKVTVPEWIPLDQLHPEDFQATMKKNMANLTHDPAFIKSVRQHYYGMIAE 2
1 YDAILGVVLDAVEASGEADNTY 0
0 IFVTSDHGDMNMEHQQYYK 0
0 MTYYDPSARVPLIVTGPTVQANVTYENLTSHLDFFPTFLELANV 1 
2 VQLEGRSLVPILRTGVDAGRPNVALSQFHGDEIHLSWFMI 1
2 RKDDYKYVTFGSGKEVAPRLFNMREDPLEMNDLAPSNPSLVAELDAELRSYWDYPSIASTAESYNK 1
2 DSFALLRASFNDEDKFKAYLATLRWSTSWSYDPEGSYAAIEAWLKTPNSTFEWAFP* 0

>ARSK2_monBre Monosiga brevicollis (choanoflagellate) ABFJ01001665 (XM_001750805: horrible gene model)
0 MDRWTIVLVAVAIWCLAVGSHGLGSAAEPESRLRLGMTNSSRPNIVFLICESIDAKTFDEDSPVPLPNIRKLIQ 0
0 GGVSFKTHYVSAPVCAPSRTSIQQGRHVH 1
1 AAIAWNNYEGMAPDYDMKIGDVLGRTGYDVNILGKTDWT
1 IGGHALWNWWQCFTM 2
1 YTQFPYNVTNGGWNEQPETQAGE 1
2 GDVTPGNRSHDVDWMFVEQNVAYIRNHSQSQPFFVYQ 0
0 GMDIVHPP 0
0 LGMPSDQTCEKFYNMINESDVTVPDWAPLDELHPCDLQSVMLK 1
2 DNATAVTNFYSKDRRRRVRR 2
1 IYYAMIAEFDAMVGEYMQAVEDA 1
2 GLPLCKKIKLREDVKSLSGLPFQMEHQQFYKM 0
0 VPLVIAGPGIKADTETLPTQHVDLYPTFMDF 1
2 GQVPASMRPEGLDGISLVPRVVEQKPLANTSFAISQFHGADLGMSWYLIRYQ 0
0 NWKLVTYGTGQEVAPQLFDMVNDPGETHDVHAQHPDLVAQLDALLRSRIDYPSVSLDVATYNL 1
2 APAKKQFK 0
0 AIHDYTQQGDDELSFVPGDIITLVSVPPGEEIEGWLTGELNGRTGLFPDNFVEELPYVTCLAIFFSIPMFLFGCRHTSRPRTP* 0