Classification of Non-Animals and Invertebrates Based on Amino Acid Composition of Complete Mitochondrial Genomes

Amino acid compositions were predicted from data sets of 47 non-animal and 58 invertebrate animal complete mitochondrial genomes, which were chosen alphabetically based on scientific names without considering biological characteristics. Using Ward’s clustering method with amino acid composition or nucleotide content as traits, non-animals were classified into Plantae, Chromalveolata, and Fungi, and invertebrates were classified into Animalia and primitive groups, Amoebozoa, Excavata, Protista, and Choanozoa. A combined sample set of primitive eukaryotes was also examined by cluster analysis using amino acid composition and nucleotide content. Some Amoebozoa comprised a single cluster, whereas other Amoebozoa were grouped with other organisms (Excavata, Protista, Chromalveolata, Fungi, and Plantae), indicating their close relationships. Choanozoa (choanoflagellates; Monosiga brevicollis), considered the closest living relatives of animals, were found to be instead closely related to Fungi (Smittium culisetae, Pleurotus ostreatus, and Epidermophyton floccosum) and Excavata (Malawimonas jakobiformis). Our results demonstrate that amino acid composition and nucleotide content are useful indices for characterizing non-animal and invertebrate complete mitochondrial genomes.


Introduction
Methods for determining the nucleotide sequences of genes were first developed in the 1970s (Sanger & Coulson, 1975; Maxam & Gilbert, 1977). A comprehensive analysis of the Haemophilus influenzae genome was carried out in 1995 (Fleischmann et al., 1995), with a draft of the complete human genome obtained in 2001 (Lander et al., 2001; Venter et al., 2001). Because nucleotide mutations are associated with biological evolution, nucleotide and amino acid sequence data have been used to construct an enormous number of phylogenetic trees (Dayhoff, Park, & McLaughlin, 1977; Sogin, Elwood, & Gunderson, 1986; Doolittle & Brown, 1994; Maizels & Weiner, 1994; DePouplana, Turner, Steer, & Schimmel, 1998; Woese & Fox, 1977; Weisburg, Barns, Pelletier, & Lane, 1991) that have helped us to understand biological evolution. Because nucleotide and amino acid substitution rates differ among genes, however, universal phylogenetic trees accurately modeling true phylogenies cannot be reconstructed based on current knowledge levels. For instance, different analytical methods, such as Ward's clustering (Ward, 1963) and neighbor-joining (Saitou & Nei, 1987), have yielded different phylogenetic trees using the same data set, with analysis of different traits yielding different results (Sorimachi & Okayasu, 2013). Although we cannot presently construct phylogenetic trees that are universally representative of actual phylogenies, the scientific validity of phylogenetic trees cannot be denied.
Phylogenetic analyses have primarily utilized nucleotide and amino acid sequence data, with nucleotide content and amino acid composition rarely used to investigate biological phenomena such as evolution. Studies based on nucleotide or amino acid sequences are applicable to genes or regions of relatively small length, but not to entire genomes consisting of huge numbers of nucleotides and many genes. Nevertheless, simple comparison of sequence differences between genes, both within and among species, is of course still useful. Sueoka (1961) was the first to analyze bacterial cellular amino acid composition. More recently, our laboratory has independently analyzed cellular amino acid composition of bacteria, archaea, and eukaryotes (Sorimachi, 1999). Graphical representations and diagrammatic approaches to the study of complicated biological systems can provide intuitive pictures and useful insights (Chou, 1990; Qi X. Q., Won, & Qi Z. H., 2007). With the aid of certain graphical representations, simple patterns representing complicated organisms can be easily discerned from huge genomic data sets. For example, when radar charts are used to visualize cellular amino acid compositions, their star-shaped patterns are similar among various organisms, with any differences appearing to reflect biological evolution (Sorimachi, 1999). In addition, amino acid compositions deduced from complete genomes resemble those obtained from amino acid analyses of cell lysates (Sorimachi et al., 2001).
Intra-species nucleotide content was first analyzed by Chargaff, who reported that G = C, A = T, and (G + A) = (C + T). This rule has been dubbed Chargaff's first parity rule (Chargaff, 1950), and is understandable based on the double-stranded structure of DNA (Watson & Crick, 1953). This rule is also applicable to each single strand of nuclear DNA from individual species, a case that has been termed Chargaff's second parity rule (Rudner, Karkas, & Chargaff, 1968). Because these rules are based on values normalized to 1 (i.e., G + C + A + T = 1), nucleotide contents are expressed by their ratios. The second parity rule is more difficult to understand, because it is difficult to imagine how G and C or T and A pairs are formed in a single DNA strand. This puzzle has recently been solved mathematically using similarity of the forward and reverse strands and homogeneity of the DNA strand over genome structure (Sorimachi, 2009). Although Chargaff's parity rules were originally formulated as intra-species phenomena, they can be expanded to encompass inter-species relationships using data from a large number of complete genomes (Mitchell & Bridge, 2006). These results indicate that nucleotide content, similar to amino acid composition (Sorimachi & Okayasu, 2003), can be used to characterize whole genomes (Sorimachi & Okayasu, 2004a).
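The normalization and parity relationships described above can be illustrated with a minimal sketch. It is not the procedure used in this study; the function names and the toy strand are hypothetical and introduced only for illustration. The sketch computes nucleotide content normalized to 1 for a single strand, and shows that combining a strand with its reverse complement (i.e., a double-stranded molecule) forces G = C and A = T, as stated by the first parity rule.

```python
from collections import Counter

# Watson-Crick pairing used to build the complementary strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def nucleotide_content(strand):
    """Return the fraction of each base, so that G + C + A + T = 1."""
    counts = Counter(base for base in strand.upper() if base in COMPLEMENT)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("strand contains no A, C, G or T")
    return {base: counts[base] / total for base in "GCAT"}


def reverse_complement(strand):
    """Reverse complement of a strand made only of A, C, G and T."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def parity_deviation(content):
    """Larger of |G - C| and |A - T|; zero means the parity rule holds exactly."""
    return max(abs(content["G"] - content["C"]),
               abs(content["A"] - content["T"]))


# Hypothetical toy strand, far shorter than any real genome.
forward = "ATGGCGTACGTTAGCCGATAAGGC"
single = nucleotide_content(forward)
duplex = nucleotide_content(forward + reverse_complement(forward))

print("single strand:", single, "deviation:", parity_deviation(single))
print("double strand:", duplex, "deviation:", parity_deviation(duplex))
```

For the double strand the deviation is zero by construction. Whether a single strand also comes close to zero, as the second parity rule states, is an empirical property of long natural sequences and cannot be expected of a short toy strand like this one.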
We have recently demonstrated the existence of natural selection in vertebrate evolution using phylogenetic trees derived from amino acid composition or nucleotide content of complete mitochondrial genomes (Sorimachi & Okayasu, 2013). Vertebrate mitochondrial DNA contains 13 genes, whereas gene number varies in plant and invertebrate mitochondrial DNA. In the present study, we investigated evolution in other eukaryotes using phylogenetic trees constructed from amino acid composition or nucleotide content data.

Materials and Methods
Mitochondrial genome data were obtained from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/sites). As in an earlier study, organisms were chosen according to the alphabetical order of their scientific names without considering their characteristics (Sorimachi & Okayasu, 2008). Nucleotide contents of complete mitochondrial genomes were calculated from their complete corresponding single-strand DNA (Sorimachi & Okayasu, 2008) and normalized to 1 (G + C + T + A = 1). Predicted amino acid compositions were estimated for mitochondrial genome coding regions. Ratios of individual amino acids to total amino acids were calculated as percentages of total amino acids, and the presentation order of each amino acid on radar charts was determined based on their HPLC elution orders (Sorimachi, 1999). In our previous studies (Sorimachi & Okayasu, 2013; Sorimachi, Okayasu, Ohhira, Masawa, & Fukasawa, 2013), phylogenetic trees obtained from Ward's clustering method (Ward, 1963) using amino acid compositions or nucleotide contents predicted from complete vertebrate mitochondrial genomes differed slightly from those obtained from the neighbor-joining method (Saitou & Nei, 1987) using 16S rRNA sequences. However, the basic results were consistent between the two methods. Therefore, Ward's clustering method was used in the present study. Classifications based on Ward's clustering method (Ward, 1963) were conducted using multivariate software developed by ESMI (Tokyo, Japan).
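To make the clustering step concrete, the following is a minimal pure-Python sketch of Ward's minimum-variance agglomeration applied to composition vectors. It is an illustration under stated assumptions, not the ESMI software used in this study: the function names, the genome labels, and the four toy nucleotide-content vectors (ordered G, C, A, T) are all hypothetical. At each step the two clusters whose merger gives the smallest increase in within-cluster sum of squares are joined.

```python
from itertools import combinations


def merge_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    (members_a, centroid_a), (members_b, centroid_b) = cluster_a, cluster_b
    n_a, n_b = len(members_a), len(members_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return n_a * n_b / (n_a + n_b) * sq_dist


def ward_clustering(profiles):
    """Agglomerate labelled trait vectors with Ward's criterion.

    profiles maps a label to a list of trait values (e.g. 4 nucleotide
    fractions or 20 amino acid percentages). Returns the merges in order,
    each as (members_a, members_b, cost).
    """
    # Each cluster is (tuple of member labels, centroid vector).
    clusters = [((label,), list(vector)) for label, vector in profiles.items()]
    merges = []
    while len(clusters) > 1:
        # combinations() yields index pairs with i < j.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        cost = merge_cost(clusters[i], clusters[j])
        (members_a, centroid_a), (members_b, centroid_b) = clusters[i], clusters[j]
        n_a, n_b = len(members_a), len(members_b)
        # Size-weighted mean of the two centroids.
        centroid = [(n_a * x + n_b * y) / (n_a + n_b)
                    for x, y in zip(centroid_a, centroid_b)]
        merges.append((members_a, members_b, cost))
        del clusters[j]  # delete the larger index first so i stays valid
        del clusters[i]
        clusters.append((members_a + members_b, centroid))
    return merges


# Hypothetical nucleotide contents (G, C, A, T), each normalized to 1.
profiles = {
    "genome_A": [0.10, 0.12, 0.40, 0.38],
    "genome_B": [0.11, 0.12, 0.39, 0.38],
    "genome_C": [0.24, 0.26, 0.25, 0.25],
    "genome_D": [0.25, 0.25, 0.26, 0.24],
}
merges = ward_clustering(profiles)
for members_a, members_b, cost in merges:
    print(members_a, "+", members_b, "cost =", round(cost, 6))
```

The same function accepts 20-element amino acid composition vectors unchanged, since the number of traits only affects the length of each vector. The exhaustive pair search is adequate for data sets of the size considered here (tens of genomes) but scales poorly to much larger ones.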

Amino Acid Compositions Encoded in Mitochondrial DNA
In a previous study of vertebrate mitochondrial DNA, encoded amino acid compositions were similar among species. Contents (percentages) of some specific amino acids differed significantly among species, however, allowing classification of vertebrates into terrestrial and aquatic species (Sorimachi & Okayasu, 2013). To expand the scope of our investigation using amino acid compositions, we examined a wider range of eukaryote species, i.e., liverworts (Marchantia polymorpha), fungi (Monoblepharella and Epidermophyton floccosum), magnoliopsida (Arabidopsis thaliana), algae (Cyanidioschyzon merolae), and protists (Phytophthora sojae) (Figure 1). As shown in Figure 1, amino acid compositions differed among the various species examined. In every species, both Leu and Ile contents were high compared with other amino acids. Leu content was higher than that of Ile in some species, with the opposite true in other cases. In vertebrate mitochondrial DNA, Leu contents are usually higher than those of both Ile and Phe, giving rise in the case of Monoblepharella to a "pennon shape" in the radar plot (Sorimachi & Okayasu, 2013). In addition, Asp and Ser contents were high; these high Asp and Ser contents, coupled with low Glu levels, were the cause of a characteristic "V-shape" pattern observed in representations of amino acid composition. These results indicate that non-animal mitochondrial DNA can be classified based on amino acid composition.

Cluster Analysis of Non-Animal Mitochondria
As new knowledge accumulated after Linnaeus formulated his well-known historical classification of organisms, five-kingdom (Whittaker, 1969), three-domain (Woese, Kandler, & Wheelis, 1990), and six-kingdom (Cavalier-Smith, 1998) theories were subsequently proposed. In the present study, we have primarily followed the six-kingdom classification of Cavalier-Smith (1998). Cluster analysis of non-animal mitochondria was carried out based on Ward's method using contents of 20 amino acids as traits. As shown in Figure 2, two major clusters were obtained. The first major cluster in the tree consists of three sub-clusters: ( 1

Comparison of Amino Acid Compositions in Major Clusters
The sub-cluster in Figure 2 consisting of Protista (Phytophthora sojae) and Chromalveolata (Phytophthora ramorum, Saprolegnia ferax, and Phytophthora infestans) is characterized by higher Ile content relative to Leu content, with radar plots displaying a "mountain-shape" because of high Phe, Lys, and Asp content and lower Glu, Ser, and Gly content (Figure 3). The sub-cluster comprising Fungi (Smittium culisetae, Pleurotus ostreatus, and Epidermophyton floccosum), Plantae (Cyanidioschyzon merolae and Chondrus crispus), and Chromalveolata (Ochromonas danica and Chrysodidymus synuroideus) is characterized by higher Leu content than Ile content and by the formation of a "V-shape" reflecting Asp, Glu, and Ser content. In the third sub-cluster, Fungi (Podospora anserina, Gibberella zeae, and Moniliophthora perniciosa) have lower Phe and Lys content than do Chromalveolata (Rhodomonas salina and Cafeteria roenbergensis). The other sub-cluster, which includes Plantae (Pseudendoclonium akinetum, Porphyra purpurea, Prototheca wickerhamii, Oltmannsiellopsis viridis, Ostreococcus tauri, Marchantia polymorpha, Chlorokybus atmophyticus, Mesostigma viride, Nephroselmis olivacea, Physcomitrella patens, Chara vulgaris, and Chaetosphaeridium globosum) and a representative of Chromalveolata (Thalassiosira pseudonana), is characterized by higher Leu content than Ile content (Figure 5). In Fungi (Monoblepharella and Allomyces macrogynus), also included in this sub-cluster, Glu content is lower than that of both Asp and Ser, giving rise to a "V-shape" relationship among Asp, Glu, and Ser contents on the plot. Conversely, the "V-shape" is absent, with a weak convex shape noted in some cases, in Plantae and Chromalveolata of this sub-cluster (Dictyota dichotoma, Pylaiella littoralis, Fucus vesiculosus, Laminaria digitata, and Desmarestia viridis).
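The composition features compared above (Leu versus Ile content, and the "V-shape" in which Glu is lower than both Asp and Ser) can be expressed as a short sketch. This is only an illustration: the function names and the two toy protein sets are hypothetical, amino acids are listed by one-letter code in alphabetical order rather than the HPLC elution order used for the radar charts, and real input would be the translated coding regions of a mitochondrial genome.

```python
from collections import Counter

# The 20 standard amino acids by one-letter code (alphabetical order,
# an assumption of this sketch; the radar charts use HPLC elution order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(proteins):
    """Percentage of each amino acid over all protein sequences given."""
    counts = Counter(residue
                     for sequence in proteins
                     for residue in sequence.upper()
                     if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def describe_profile(composition):
    """Flag the two features discussed in the text for one composition."""
    return {
        # Leu (L) content higher than Ile (I) content.
        "leu_higher_than_ile": composition["L"] > composition["I"],
        # "V-shape": Glu (E) lower than both Asp (D) and Ser (S).
        "v_shape": (composition["E"] < composition["D"]
                    and composition["E"] < composition["S"]),
    }


# Hypothetical toy peptides standing in for predicted protein sets.
sample_one = amino_acid_composition(["MLLLIISSDDEFK"])
sample_two = amino_acid_composition(["MIIILEEEDS"])

print(describe_profile(sample_one))
print(describe_profile(sample_two))
```

Boolean flags of this kind only summarize a profile. The cluster analysis itself uses the full 20-value composition vector as traits.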

Cluster Analysis Based on Nucleotide Content
Using nucleotide content calculated from complete mitochondrial genomes, Ward's clustering method yielded results similar to those based on amino acid composition (Figure 6). Three major clusters were obtained: two clusters consisting of Chromalveolata, Fungi, and Plantae, and a clearly separated cluster comprising land plants and Fungi. This is consistent with results obtained using amino acid composition (Figure 2).
Figure 5. The third major cluster from Figure 2, a phylogenetic tree based on Ward's clustering using amino acid composition

Cluster Analysis of Invertebrate Mitochondria
In the present study, the above seven species plus the remaining 51 invertebrates were categorized based on amino acid composition using Ward's clustering method (Figure 8 and Supplemental Figure 1). Members of kingdom Animalia were completely separated from other kingdoms based on Ward's clustering (Figure 8 and Supplemental Figure 1), and amino acid compositions of Animalia differed accordingly from those of other groups consisting of Amoebozoa, Excavata, Choanozoa, and Protista. The latter group is characterized by Ile content higher than or similar to Leu content, and by Asp content higher than that of Lys and Glu, with the exception of Dictyostelium (Figure 7). In animal mitochondria, high Leu, Ile, and Phe contents are characteristic.
The highest observed amino acid percentages were those for Leu, giving rise to a "pen-point shape" for the representation of the content of these three amino acids. High Ser and Gly contents were also noted, resulting in a "knife-edge shape" in the plots (Supplemental Figure 1). Both characteristic shapes resemble a flying bird.

Cluster Analysis of Primitive Organisms
In the trees generated by cluster analysis, organisms belonging to one of the major clusters comprising Chromalveolata, Fungi, and Plantae are clearly separated from other plants (Figures 2 and 6), and the cluster consisting of Protista, Choanozoa, Excavata, and Amoebozoa is also separated from Animalia (Figure 8 and Supplemental Figure 1). Because these organisms seem to be primitive in Plantae or Animalia, Ward's clustering based on amino acid composition was carried out with a combined sample set (Figure 9). Amoebozoa (Polysphondylium pallidum, Dictyostelium discoideum, and Dictyostelium citrinum) formed a single cluster, while other Amoebozoa (Physarum polycephalum) and Excavata (Reclinomonas americana) fell into a cluster that included Fungi. Protista (Tetrahymena pyriformis), Choanozoa (Monosiga brevicollis), and Excavata (Malawimonas jakobiformis) were placed into another cluster consisting of Plantae (Cyanidioschyzon merolae) and Fungi (Smittium culisetae, Pleurotus ostreatus, and Epidermophyton floccosum). When a sample set including many other plants was analyzed, similar results were obtained (Supplemental Figure 2).

Discussion
In our previous study related to vertebrate evolution, vertebrates were clearly differentiated into two groups, terrestrial and aquatic, based on a data set chosen according to the alphabetical order of species names without considering species characteristics (Sorimachi & Okayasu, 2013). Species selection criteria are important for classification, because sampling affects classification results. We indeed found that when we varied which samples were chosen, differences were observed in the tree topologies recovered from the analyses (unpublished data).
Using amino acid composition as the examined trait in cluster analysis generated a better classification than did nucleotide content (Figures 2 and 6). The number of traits associated with amino acid composition (20) was greater than the number representing nucleotide content (4), resulting in good differentiation among organisms. Although amino acid composition provided better classification results than nucleotide content in the vertebrate evolutionary analysis, a significant separation between terrestrial and aquatic vertebrates was also obtained in the cluster analysis using nucleotide content (Sorimachi & Okayasu, 2013). In general, an increased number of traits reduces the probability of coincidental similarity in cluster analyses, resulting in better species classification. In contrast, an increase in the number of samples increases the probability of coincidental similarity, worsening species classification.
Terrestrial and aquatic vertebrates were completely separated in vertebrate phylogenetic trees based on amino acid composition or nucleotide content, with the exception of hagfish (Eptatretus burgeri), which fell into the terrestrial group (Sorimachi & Okayasu, 2013). This anomalous placement seems to be a consequence of its primitive characteristics (Janvier, 2010). Amino acid compositions predicted from complete vertebrate mitochondrial genomes were very similar to one another, although some amino acid contents differed significantly between the two groups and can thus be used to characterize them. Clear separation between terrestrial and aquatic vertebrates was thus obtained (Sorimachi & Okayasu, 2013). In contrast, species such as algae, mosses, and fungi are evolutionarily highly diverged, as can be seen by their large amino acid composition differences (Figures 1 and 3-5). This means that evolutionary divergence in Plantae and Fungi may proceed in several different directions, resulting in multiple characteristic changes in amino acid composition. Consequently, the probability of coincidental similarity may increase, such that classification results obtained for Plantae, Chromalveolata, and Fungi were worse than for vertebrates. The latter have diverged in just two directions, terrestrial and aquatic, providing a good separation.
Within kingdom Plantae, members of Angiospermae were completely separated from liverworts (Marchantiophyta and Bryophyta) among land plants (Embryophyta), while land plants were separated from algae (Figure 2). Land plants and algae have thus evolved independently under natural selection in terrestrial and aquatic spheres, respectively, as observed in vertebrate evolution (Sorimachi & Okayasu, 2013). In addition, Magnoliopsida was completely separated from Liliopsida in Angiospermae, although amino acid compositions were similar between both groups (Figure 4). Although terrestrial and aquatic vertebrates were clearly separated in our previous study (Sorimachi & Okayasu, 2013), amino acid compositions were substantially similar between the two groups and among various vertebrates (Sorimachi & Okayasu, 2013). These clear separations may be due to significant changes in amino acid composition between the two groups or among many organisms. Characteristic differences in amino acid composition are indeed observed between terrestrial and aquatic vertebrates (Sorimachi & Okayasu, 2013). Similarly, bacteria are classified into two groups, "S-type" represented by Staphylococcus aureus and "E-type" represented by Escherichia coli, based on differences in amino acid compositions (Sorimachi & Okayasu, 2004b; Okayasu & Sorimachi, 2009). This phenomenon has been confirmed by other bioinformatics analyses (Qi X. Q., Won, & Qi Z. H., 2007).
Choanoflagellates (e.g., Monosiga brevicollis), which have a unique cellular morphology consisting of a single flagellum surrounded by a "collar" of microvilli, are unicellular aquatic flagellates. This cellular morphology is very similar to that of the collared cells (choanocytes) of sponges. On this basis, colonial choanoflagellates are thought to be the closest living relatives of animals. Recent molecular phylogenetic studies have investigated the relationship between Choanozoa and Metazoa (King et al., 2008), but it is difficult to assert that the evolutionary process leading from a highly differentiated unicellular organism to an ancient multi-cellular organism with organized tissues has been elucidated, because of the large difference between the two kinds of organism (Lavrov, Forget, Kelly, & Lang, 2005). Based on cluster analysis using amino acid composition in our current study, Monosiga brevicollis was found to be closely related to Excavata (Malawimonas jakobiformis), Fungi (Smittium culisetae, Pleurotus ostreatus, and Epidermophyton floccosum), and Plantae (Cyanidioschyzon merolae) (Figure 9) rather than to animals including sponges (Figure 8). Differing results between our study and those of others are due to differences in algorithms and analyzed traits, as phylogenetic trees are not universally representative (unpublished data).

Conclusions
The ratios of individual amino acids to total amino acids, or of individual nucleotides to total nucleotides, predicted from complete mitochondrial genomes consisting of a huge number of nucleotides can characterize a whole organism. As these values are independent of species and genome size, these indices are very useful for genome research, as well as single gene research. Indeed, Ward's clustering method using amino acid compositions or nucleotide contents predicted from complete mitochondrial genomes provided consistent phylogenetic trees.

Figure 1. Radar charts of amino acid composition predicted from complete mitochondrial genomes. Values represent percentages of total amino acids

Figure 3. The first major cluster from Figure 2, a phylogenetic tree based on Ward's clustering using amino acid composition

Figure 4. The second major cluster from Figure 2, a phylogenetic tree based on Ward's clustering using amino acid composition

Figure 7. Radar charts of amino acid composition predicted from complete mitochondrial genomes. Values correspond to percentages of total amino acids