Sunday, October 30, 2022

OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees


Summary

Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species (a phenomenon observed among several important families of genes such as transporters and transcription factors) are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, a software that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.

Introduction

Molecular evolution studies, such as species tree inference, genome-wide surveys of selection, evolutionary rate estimation, measures of gene–gene coevolution, and others often rely on single-copy orthologs (SC-OGs), a group of homologous genes that originated via speciation and are present in single copy among species of interest [1–6]. In contrast, paralogs (homologous genes that originated via duplication and are often members of large gene families) are typically absent from these studies (Fig 1). Gene families of orthologs and paralogs often encode functionally significant proteins such as transcription factors, transporters, and olfactory receptors [7–10]. The exclusion of SC-OGs nested within larger gene families has not only hindered our understanding of their evolution and phylogenetic informativeness but is also artificially reducing the number of gene markers available for molecular evolution studies. Furthermore, as the number of species and/or their evolutionary divergence increases in a dataset, the number of SC-OGs decreases [11,12]; case in point, no SC-OGs were identified in a dataset of 42 plants [11]. As the number of available genomes across the tree of life continues to increase, identifying SC-OGs present in many taxa will become more difficult.


Fig 1. Cartoon depiction of three classes of paralogs: outparalogs, inparalogs, and coorthologs.

(A) Paralogs refer to related genes that have originated via gene duplication, such as genes M, N, and O. (B) Outparalogs and inparalogs refer to paralogs that are related to one another via a duplication event that took place prior to or after a speciation event, respectively. With respect to the speciation event that led to the split of taxa A, B, and C from D, genes M, N, and O are outparalogs because they arose prior to the speciation event; genes O1 and O2 in taxa A, B, and C are inparalogs because they arose after the speciation event. Species-specific inparalogs are paralogous genes observed only in 1 species, strain, or organism in a dataset, such as genes N1 and N2 in species A. Species-specific inparalogs N1 and N2 in species A are also coorthologs of gene N in taxa B, C, and D; the same is true for inparalogs O1 and O2 from species A, which are coorthologs of gene O from species D. (C) Cartoon depiction of SNAP-OGs identified by OrthoSNAP.


https://doi.org/10.1371/journal.pbio.3001827.g001

In light of these issues, several methods have been developed to account for paralogs in specific types of molecular evolution studies, for example, in species tree reconstruction [13]. Methods such as SpeciesRax, STAG, ASTRAL-PRO, and DISCO can be used to infer a species tree from a set of SC-OGs and gene families composed of orthologs and paralogs [11,14–16]. Other methods such as PHYLDOG [17] and guenomu [18] jointly infer the species and gene trees but require considerable computational resources, which has hindered their use for large datasets. Other software, such as PhyloTreePruner, can conduct species-specific inparalog trimming [19]. Agalma, as part of a larger automated phylogenomic workflow, can prune gene trees into maximally inclusive subtrees wherein each species, strain, or organism is represented by 1 sequence [20]. Similarly, OMA identifies subgroups of SC-OGs using graph-based clustering of sequence similarity scores [21]. Although these methods have expanded the numbers of gene markers used in species tree reconstruction, they were not designed to facilitate the retrieval of as broad a set of SC-OGs as possible for downstream molecular evolution studies such as surveys of selection. Furthermore, the phylogenetic information content of these gene families remains unknown, calling into question their usefulness.

To address this need and measure the information content of subgroups of single-copy orthologous genes, we developed OrthoSNAP, a novel algorithm that identifies SC-OGs nested within larger gene families via tree decomposition and species-specific inparalog pruning. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they were retrieved using a splitting and pruning procedure. The efficacy of OrthoSNAP and the information content of SNAP-OGs were examined across 7 eukaryotic datasets, which include species with complex evolutionary histories (e.g., whole-genome duplication) or complex gene sequence data (e.g., transcriptomes, which typically have multiple transcripts per protein-coding gene). These analyses revealed that OrthoSNAP can substantially increase the number of orthologs for downstream analyses such as phylogenomics and surveys of selection. Furthermore, we found that the information content of SNAP-OGs was statistically indistinguishable from that of SC-OGs, suggesting that the inclusion of SNAP-OGs in downstream analyses is likely to be as informative. These analyses indicate that SNAP-OGs identified by OrthoSNAP hold promise for increasing the number of markers used in molecular evolution studies, which can, in turn, be used for constructing and interpreting the tree of life.

Results

OrthoSNAP is a novel tree traversal algorithm that conducts tree splitting and species-specific inparalog pruning to identify SC-OGs nested within larger gene families (Fig 1C). OrthoSNAP takes as input a gene family phylogeny and associated FASTA file and can output individual FASTA files populated with sequences from SNAP-OGs as well as the associated Newick tree files (Fig 2). During tree traversal, OrthoSNAP can account for tree uncertainty by collapsing poorly supported branches. In a set of 7 eukaryotic datasets that contained 9,821 SC-OGs, we used OrthoSNAP to identify an additional 10,704 SNAP-OGs. Using a combination of multivariate statistics and phylogenetic measures, we demonstrate that SNAP-OGs and SC-OGs have similar phylogenetic information content in all 7 datasets. This observation was consistent across datasets where the identification of large numbers of SC-OGs is challenging: flowering plants that have complex patterns of gene duplication and loss (15 SC-OGs and 653 SNAP-OGs), a lineage of budding yeasts wherein half of the species have undergone an ancient whole-genome duplication event (2,782 SC-OGs and 1,334 SNAP-OGs), and a dataset of transcriptomes where many genes are represented by multiple transcripts (390 SC-OGs and 2,087 SNAP-OGs). Lastly, similar patterns of support were observed among the 252 SC-OGs and the 1,428 SNAP-OGs in a contentious branch in the tree of life. Taken together, these results suggest that OrthoSNAP is helpful for expanding the set of gene markers available for molecular evolutionary studies, even in datasets where inference of orthology has historically been difficult due to complex evolutionary history or complex data characteristics.


Fig 2. Cartoon depiction of OrthoSNAP workflow.

(A) OrthoSNAP takes as input 2 files: a FASTA file of a gene family with multiple homologs observed in 1 or more species and the associated gene family tree. The outputted file(s) will be individual FASTA files of SNAP-OGs. Depending on user arguments, individual Newick tree files can also be outputted. (B) A cartoon phylogenetic tree that depicts the evolutionary history of a gene family and 5 SNAP-OGs therein. While identifying SNAP-OGs, OrthoSNAP also identifies and prunes species-specific inparalogs (e.g., species2|gene2-copy_0 and species2|gene2-copy_1), retaining only the inparalog with the longest sequence, a practice common in transcriptomics. Note that OrthoSNAP requires sequence naming schemes to be the same in both files and to follow the convention in which a species, strain, or organism identifier and a gene identifier are separated by a pipe (or vertical bar; “|”) character.


https://doi.org/10.1371/journal.pbio.3001827.g002
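The naming convention described in the Fig 2 caption can be illustrated with a minimal Python sketch. This is not OrthoSNAP's own code; the function names `split_header` and `group_by_taxon` are hypothetical and chosen only for illustration. The sketch assumes that the identifier before the first pipe is the taxon and that everything after it is the gene identifier.

```python
from collections import defaultdict


def split_header(header: str) -> tuple[str, str]:
    """Split a 'taxon|gene' label (optionally prefixed with '>') at the first pipe."""
    taxon, sep, gene = header.lstrip(">").strip().partition("|")
    if not sep or not taxon or not gene:
        raise ValueError(f"label does not follow 'taxon|gene': {header!r}")
    return taxon, gene


def group_by_taxon(headers):
    """Map each taxon identifier to the list of its gene identifiers."""
    groups = defaultdict(list)
    for header in headers:
        taxon, gene = split_header(header)
        groups[taxon].append(gene)
    return dict(groups)


# Labels taken from the example in the Fig 2 caption, plus one hypothetical label.
headers = [">species1|gene1", ">species2|gene2-copy_0", ">species2|gene2-copy_1"]
by_taxon = group_by_taxon(headers)
```

A taxon that maps to more than 1 gene identifier, such as species2 here, is a candidate for the inparalog pruning described in the caption.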

SC-OGs and SNAP-OGs have similar information content

To compare SC-OGs and SNAP-OGs, we first independently inferred orthologous groups of genes in 3 eukaryotic datasets of 24 budding yeasts (none of which have undergone whole-genome duplication), 36 filamentous fungi (Aspergillus and Penicillium species), and 26 mammals including humans, dogs, pigs, elephants, sloths, and others (S1 Table). There was variation in the number of SC-OGs and SNAP-OGs in each lineage (S1 Fig and S2 Table). Interestingly, the ratio of SNAP-OGs to SC-OGs among budding yeasts, filamentous fungi, and mammals was 0.83 (1,392:1,668), 0.46 (2,035:4,393), and 5.53 (1,775:321), respectively, indicating that SNAP-OGs can substantially increase the number of gene markers in certain lineages. The number of SNAP-OGs identified in a gene family with multiple homologs in 1 or more species also varied (S2 Fig).
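As a quick check of the arithmetic, the reported ratios can be reproduced from the counts given in the text. The short Python sketch below uses only those counts; the variable names are illustrative.

```python
# (SNAP-OGs, SC-OGs) per dataset, as reported in the text.
counts = {
    "budding yeasts": (1392, 1668),
    "filamentous fungi": (2035, 4393),
    "mammals": (1775, 321),
}

# Ratio of SNAP-OGs to SC-OGs, rounded to 2 decimal places.
ratios = {name: round(snap / sc, 2) for name, (snap, sc) in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

A ratio above 1, as in mammals, means that OrthoSNAP recovered more SNAP-OGs than there were SC-OGs to begin with.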

Similar orthogroup occupancy and best-fitting models of substitutions were observed among SC-OGs and SNAP-OGs (S3 Fig and S3 Table), raising the question of whether SC-OGs and SNAP-OGs have similar information content. To answer this, the information content among multiple sequence alignments and phylogenetic trees from SC-OGs and SNAP-OGs (S4 Fig and S4 Table) was compared across 9 properties (Robinson–Foulds distance [22], relative composition variability [23], and average bootstrap support, for example) using multivariate analysis and statistics as well as information theory-based phylogenetic measures. Principal component analysis enabled qualitative comparisons between SC-OGs and SNAP-OGs in reduced dimensional space and revealed a high degree of similarity (Figs 3 and S5). Multivariate statistics, namely multifactor analysis of variance, facilitated a quantitative comparison of SC-OGs and SNAP-OGs and revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1; S5 Table) and no interaction between the 9 properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Similarly, multifactor analysis of variance using an additive model, which assumes each factor is independent and there are no interactions (as observed here), also revealed no differences between SC-OGs and SNAP-OGs (p = 0.65, F = 0.21, df = 1). Next, we calculated tree certainty, an information theory-based measure of tree congruence from a set of gene trees, and found similar levels of congruence among phylogenetic trees inferred from SC-OGs and SNAP-OGs (S6 Table). Taken together, these analyses demonstrate that SC-OGs and SNAP-OGs have similar phylogenetic information content.


Fig 3. SC-OGs and SNAP-OGs have similar phylogenetic information content.

To evaluate similarities and differences between SC-OGs (orange dots) and SNAP-OGs (blue dots), we examined each gene’s phylogenetic information content by measuring 9 properties of multiple-sequence alignments and phylogenetic trees. We performed these analyses on 12,764 gene families from 3 datasets: 24 budding yeasts (1,668 SC-OGs and 1,392 SNAP-OGs) (A), 36 filamentous fungi (4,393 SC-OGs and 2,035 SNAP-OGs) (B), and 26 mammals (321 SC-OGs and 1,775 SNAP-OGs) (C). Principal component analysis revealed striking similarities between SC-OGs and SNAP-OGs in all 3 datasets. For example, the centroids (i.e., the mean across all metrics and genes) for SC-OGs and SNAP-OGs, each depicted as an opaque and larger dot, are very close to one another in reduced dimensional space. Supporting this observation, multifactor analysis of variance with interaction effects of the 6,630 SNAP-OGs and 6,634 SC-OGs revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1) and no interaction between the 9 properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Multifactor analysis of variance using an additive model yielded similar results wherein SC-OGs and SNAP-OGs do not differ (p = 0.65, F = 0.21, df = 1). There are also very few outliers of individual SC-OGs and SNAP-OGs, which are represented as translucent dots, in all 3 panels. For example, SNAP-OG outliers at the top of panel C are driven by high treeness/RCV values, which are associated with a high signal-to-noise ratio and/or low composition bias [23]; SNAP-OG outliers at the right of panel C are driven by high average bootstrap support values, which are associated with greater tree certainty [74]; and the single SC-OG outlier observed in the bottom right of panel C is driven by a SC-OG with a high degree of violation of a molecular clock [78], which is associated with lower tree certainty [79].
Multiple-sequence alignment and phylogenetic tree properties used in principal component analysis and abbreviations thereof are as follows: average bootstrap support (ABS), degree of violation of the molecular clock (DVMC), relative composition variability (RCV), Robinson–Foulds distance (RF distance), alignment length (Aln. len.), the number of parsimony informative sites (PI sites), saturation, treeness (tness), and treeness/RCV (tness/RCV). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).


https://doi.org/10.1371/journal.pbio.3001827.g003

We next aimed to determine if SC-OGs and SNAP-OGs have greater phylogenetic information content than a random null expectation. Groups of genes reflecting a random null expectation were constructed by randomly selecting a single sequence from representative species in multicopy orthologous genes (hereafter referred to as Random-GGs for random combinations of orthologous and paralogous groups of genes) in the budding yeast (N = 647), filamentous fungi (N = 999), and mammalian (N = 954) datasets. Random-GGs were aligned and trimmed, and phylogenetic trees were inferred from the resulting multiple sequence alignments. Phylogenetic information content was also calculated for Random-GGs. Across each dataset, significant differences were observed among SC-OGs, SNAP-OGs, and Random-GGs (p < 0.001, F = 189.92, df = 4; multifactor ANOVA). Further examination of differences revealed that Random-GGs are significantly different from SC-OGs and SNAP-OGs (p < 0.001 for both comparisons; Tukey honest significant differences (THSD) test) in the budding yeast dataset. In contrast, SC-OGs and SNAP-OGs are not significantly different (p = 0.42; THSD). The same was also true for the datasets of filamentous fungi and mammals: Random-GGs were significantly different from SC-OGs and SNAP-OGs (p < 0.001 for each comparison in each dataset; THSD), whereas SC-OGs and SNAP-OGs were not significantly different (p = 1.00 for the filamentous fungi dataset; p = 0.42 for the dataset of mammals; THSD). Principal component analysis revealed that Robinson–Foulds distances (a measure of tree accuracy wherein lower values represent greater tree accuracy) and relative composition variability (a measure of alignment composition bias wherein lower values represent less compositional bias) often drove differences among Random-GGs, SC-OGs, and SNAP-OGs across the datasets.
In all datasets, SC-OGs and SNAP-OGs outperformed the null expectation in tree accuracy and were less compositionally biased (Table 1). These findings suggest that SNAP-OGs and SC-OGs are similar in phylogenetic information content and outperform the null expectation.
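The construction of a Random-GG can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: it assumes sequence labels follow the “taxon|gene” convention, and the function name `random_gene_group` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def random_gene_group(labels, seed=None):
    """Pick one sequence label at random for each taxon in a multicopy group."""
    rng = random.Random(seed)
    by_taxon = defaultdict(list)
    for label in labels:
        taxon = label.partition("|")[0]
        by_taxon[taxon].append(label)
    # Sorting makes the draw reproducible for a given seed.
    return [rng.choice(sorted(copies)) for _, copies in sorted(by_taxon.items())]


# Hypothetical multicopy orthologous group with 3 taxa.
multicopy = ["A|n1", "A|n2", "A|o1", "B|n", "B|o", "C|n", "C|o"]
random_gg = random_gene_group(multicopy, seed=1)
```

Because the selection ignores the gene tree, a Random-GG will often mix orthologs and paralogs, which is what makes it a suitable null expectation.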

SC-OGs and SNAP-OGs have similar performances in complex datasets

Complex biological processes and datasets pose a serious challenge for identifying markers for molecular evolution studies. To test the efficacy of OrthoSNAP in scenarios of complex evolutionary histories and datasets, we executed the same workflow described above (ortholog calling, sequence alignment, trimming, tree inference, and SNAP-OG detection) on 3 new datasets: (1) 30 plants known to have complex histories of gene duplication and loss [24–26]; (2) 30 budding yeast species wherein half of the species originated from a hybridization event that gave rise to a whole-genome duplication followed by complex patterns of loss and duplication [27–30]; and (3) 20 choanoflagellate transcriptomes, which contain thousands more transcripts than genes [31,32]; for orthology inference software, multiple transcripts per gene appear similar to artificial gene duplicates.

Corroborating previous results, OrthoSNAP successfully identified SNAP-OGs that can be used downstream for molecular evolution analyses. Specifically, using a species-occupancy threshold of 50% in the plant, budding yeast, and choanoflagellate datasets, 653, 1,334, and 2,087 SNAP-OGs were identified, respectively (Table 2). In comparison, 15 SC-OGs were identified in the plant dataset; 2,782 in the budding yeast dataset; and 390 in the choanoflagellate dataset. (Note that there are likely more SC-OGs than SNAP-OGs in budding yeasts because their genomes are relatively small and therefore do not have as many duplicate gene copies compared to other lineages, such as plants. Nonetheless, OrthoSNAP still substantially increases the number of markers in a phylogenomic data matrix.) To explore the impact of orthogroup occupancy, SNAP-OGs were also identified using a minimum occupancy threshold of 4 taxa. This resulted in the identification of substantially more SNAP-OGs: 15,854 in plants; 4,199 in budding yeasts; and 11,556 in choanoflagellates. Furthermore, these were substantially more than the number of SC-OGs identified using a minimum orthogroup occupancy of 4 taxa: 200 in plants; 3,566 in budding yeasts; and 2,438 in choanoflagellates. These findings support previous observations that incorporating OrthoSNAP into ortholog identification workflows can substantially increase the number of available loci.

SC-OGs and SNAP-OGs have similar patterns of support in a contentious branch in the tree of life

To further evaluate the information content of SNAP-OGs, we compared patterns of support among SC-OGs and SNAP-OGs in a difficult-to-resolve branch in the tree of life. Specifically, we evaluated the support between 3 hypotheses concerning deep evolutionary relationships among eutherian mammals: (1) Xenarthra (eutherian mammals from the Americas) and Afrotheria (eutherian mammals from Africa) are sister to all other Eutheria [33,34]; (2) Afrotheria are sister to all other Eutheria [35,36]; and (3) Xenarthra are sister to a clade of both Afrotheria and Eutheria (Fig 4A). Resolution of this conflict has important implications for understanding the historical biogeography of these organisms. To do so, we first obtained protein-coding gene sequences from 6 Afrotheria, 2 Xenarthra, 12 other Eutheria, and 8 outgroup taxa from NCBI (S7 Table), which represent all annotated and publicly available genome assemblies at the time of this study (S8 Table). Using the protein translations of these gene sequences as input to OrthoFinder, we identified 252 SC-OGs shared across taxa; application of OrthoSNAP identified an additional 1,428 SNAP-OGs, which represents a greater than 5-fold increase in the number of gene markers for this dataset (S8 Table). There was variation in the number of SNAP-OGs identified per orthologous group of genes (S6 Fig). The highest number of SNAP-OGs identified in an orthologous group of genes was 10, which was a gene family of olfactory receptors; olfactory receptors are known to have expanded in the evolutionary history of eutherian mammals [8]. The best-fitting substitution models were similar between SC-OGs and SNAP-OGs (S7 Fig).


Fig 4. SC-OGs and SNAP-OGs display similar patterns of support in a contentious branch concerning deep evolutionary relationships among eutherian mammals.

(A) Two leading hypotheses for the evolutionary relationships among Eutheria, which have implications for the evolution and biogeography of the clade, are that Afrotheria and Xenarthra are sister to all other Eutheria (hypothesis 1; blue) and that Afrotheria are sister to all other Eutheria (hypothesis 2; pink). The third possible, but less well-supported, topology is that Xenarthra are sister to Eutheria and Afrotheria. (B) Comparison of gene support frequency (GSF) values for the 3 hypotheses among 252 SC-OGs and 1,428 SNAP-OGs using an α level of 0.01 revealed no differences in support (p = 0.26, Fisher’s exact test with Benjamini–Hochberg multitest correction). Comparison after accounting for gene tree uncertainty by collapsing bipartitions with ultrafast bootstrap approximation support lower than 75 (SC-OGs collapsed vs. SNAP-OGs collapsed) also revealed no differences (p = 0.05; Fisher’s exact test with Benjamini–Hochberg multitest correction). (C) Examination of the distribution of frequency of topology support using gene-wise log-likelihood scores revealed no difference between SNAP-OG and SC-OG support for the 3 topologies (p = 0.52; Fisher’s exact test). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).


https://doi.org/10.1371/journal.pbio.3001827.g004

Two independent tests examining support between alternative hypotheses of deep evolutionary relationships among eutherian mammals revealed similar patterns of support between SC-OGs and SNAP-OGs. More specifically, no differences were observed in gene support frequencies (the number of genes that support 1 of 3 possible hypotheses at a given branch in a phylogeny) without or with accounting for single-gene tree uncertainty by collapsing branches with low support values (p = 0.26 and p = 0.05, respectively; Fisher’s exact test with Benjamini–Hochberg multitest correction; Fig 4B and S9 Table). A second test of single-gene support was conducted wherein individual gene log likelihoods were calculated for each of the 3 possible topologies. The frequency of gene-wise support for each topology was then determined. No differences were observed in gene support frequency using the log likelihood approach (p = 0.52; Fisher’s exact test). Examination of patterns of support in a contentious branch in the tree of life using 2 independent tests revealed that SC-OGs and SNAP-OGs are similar, which further supports the observation that they contain similar phylogenetic information.

In summary, 415,129 orthologous groups of genes across 7 eukaryotic datasets contained 9,821 SC-OGs; application of OrthoSNAP identified an additional 10,704 SNAP-OGs, thereby more than doubling the number of gene markers. Comprehensive comparison of the phylogenetic information content among SC-OGs and SNAP-OGs revealed no differences in phylogenetic information content. Strikingly, this observation held true across datasets with complex evolutionary histories and when conducting hypothesis testing in a difficult-to-resolve branch in the tree of life. These findings suggest that SNAP-OGs may be useful for diverse studies of molecular evolution, including genome-wide surveys of selection, phylogenomic investigations, gene–gene coevolution analyses, and others.

Discussion

Molecular evolution studies typically rely on SC-OGs. Recently developed methods can integrate gene families of orthologs and paralogs into species tree inference but are not designed to broadly facilitate the retrieval of gene markers for molecular evolution analyses. Furthermore, the phylogenetic information content of gene families of orthologs and paralogs remains unknown. This observation underscores the need for algorithms that can identify SC-OGs nested within larger gene families, which can, in turn, be incorporated into diverse molecular evolution analyses, and for a comprehensive assessment of their phylogenetic properties.

To address this need, we developed OrthoSNAP, a tree splitting and pruning algorithm that identifies SNAP-OGs, which refers to SC-OGs nested within larger gene families wherein species-specific inparalogs have also been pruned. Comprehensive examination of the phylogenetic information content of SNAP-OGs and SC-OGs from 7 empirical datasets of diverse eukaryotic species revealed that their content is similar. Inclusion of SNAP-OGs increased the size of all 7 datasets, sometimes substantially. We note that our results are qualitatively similar to those reported recently by Smith and colleagues [37], who retrieved SC-OGs nested within larger families from 26 primates and examined their performance in gene tree and species tree inference. Three noteworthy differences are that we also conduct species-specific inparalog trimming, provide a user-friendly command-line software for SNAP-OG identification, and evaluate the phylogenetic information content of SNAP-OGs and SC-OGs across 7 diverse phylogenomic datasets. We also note that our algorithm can account for diverse types of paralogy (outparalogs, inparalogs, and species-specific inparalogs), whereas other software like PhyloTreePruner, which only conducts species-specific inparalog trimming [19], and Agalma, which identifies single-copy outparalogs and inparalogs [20], can account for some, but not all, types of paralogs (S10 Table). Another difference between OrthoSNAP and other approaches is that Agalma and PhyloTreePruner both require rooted phylogenies. In contrast, OrthoSNAP will automatically midpoint root phylogenies or accept prerooted phylogenies as input. Furthermore, these algorithms are not designed to handle transcriptomic data wherein multiple transcripts per gene will be interpreted as multicopy orthologs.
Thus, OrthoSNAP allows for greater user flexibility and accounts for more diverse scenarios, leading to, at least in some instances, the identification of more loci for downstream analyses (S8 Fig). Notably, these software tools also differ from sequence similarity graph-based inferences of subgroups of single-copy orthologous genes, such as the algorithm implemented in OMA [21]. In other words, OrthoSNAP identifies subgroups of single-copy orthologous genes by examining evolutionary histories, rather than sequence similarity values. Moreover, examination of evolutionary histories facilitates the identification of species-specific inparalogs. Lastly, our results, together with other studies, demonstrate the utility of SC-OGs that are nested within larger families [15,20,37,38].

Despite the ability of OrthoSNAP to identify additional loci for molecular evolution analyses, there were instances wherein SNAP-OGs were not identified in multicopy orthologous groups of genes. We discuss 3 reasons why SNAP-OGs could not be identified among some genes: gene families with sequence data from <50% of the taxa; gene families with complex evolutionary histories (for example, HGT and duplication/loss patterns); and gene families with evolutionary histories that differ from the species tree (for example, due to analytical factors, such as sampling and systematic error, or biological factors, such as lineage sorting or introgression/hybridization [39–41]). Notably, the first reason can, but does not always, result in an inability to infer SNAP-OGs and can be, to a certain extent, addressed (e.g., by lowering the orthogroup occupancy threshold in OrthoSNAP), whereas the other 2 reasons are more challenging because they often result in a genuine absence of SC-OGs. Furthermore, the exact number of SC-OGs (either those nested within multicopy orthologs or not) for any given group of organisms is not known, making it difficult to determine how many SNAP-OGs and SC-OGs one should expect to recover. Notably, this issue has long challenged researchers, even when ortholog identification is conducted by also taking genome synteny into account [27].

Next, we discuss some practical considerations when using OrthoSNAP. In the present study, we inferred orthology information using OrthoFinder [42], but multiple other approaches can be used upstream of OrthoSNAP. For example, other graph-based algorithms such as OrthoMCL and OMA [21,43] or sequence similarity-based algorithms such as orthofisher [44] can be used to infer gene families. Similarly, sequence similarity search algorithms like BLAST+ [45], USEARCH [46], and HMMER [47] can be used to retrieve homologous sets of sequences that are used as input for OrthoSNAP. Other considerations should also be taken into account during the multicopy tree inference step. For example, inferring phylogenies for all orthologous groups of genes may be a computationally expensive task. Rapid tree inference software, such as FastTree or IQTREE with the “-fast” parameter [48,49], may expedite these steps (but users should be aware that this may result in a loss of accuracy in inference; [50]).

We suggest employing “best practices” when inferring groups of putatively orthologous genes, including SNAP-OGs. Specifically, orthology information can be further scrutinized using phylogenetic methods. Orthology inference errors may occur upstream of OrthoSNAP; for example, SNAP-OGs may be susceptible to erroneous inference of orthology during upstream clustering of putatively orthologous genes. One way to identify putatively spurious orthology inference is by identifying long terminal branches [51]. Terminal branches of outlier length can be identified using the “spurious_sequence” function in PhyKIT [52]. Other tools, such as PhyloFisher, UPhO, and other orthology inference pipelines, employ similar strategies to refine orthology inference [53–55]. Lastly, we acknowledge that future iterations of OrthoSNAP may benefit from incorporating additional layers of information, such as sequence similarity scores or synteny. Even though OrthoSNAP did identify SNAP-OGs in some complex datasets where synteny has previously been very helpful, such as the budding yeast dataset, other ancient and rapidly evolving lineages may benefit from synteny analysis to dissect complex relationships of orthology [51,56–58].
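The idea of flagging terminal branches of outlier length can be illustrated with a minimal Python sketch. It is not PhyKIT's implementation: the function name `flag_long_tips`, the median-based cutoff, and the default `factor` of 20 are assumptions made purely for illustration, and the branch lengths are made up.

```python
from statistics import median


def flag_long_tips(tip_lengths, factor=20.0):
    """Return tip labels whose terminal branch exceeds factor x the median length.

    `factor` is a hypothetical, user-tunable cutoff.
    """
    cutoff = factor * median(tip_lengths.values())
    return sorted(label for label, length in tip_lengths.items() if length > cutoff)


# Hypothetical terminal branch lengths (substitutions per site).
tip_lengths = {"A|g1": 0.10, "B|g1": 0.12, "C|g1": 0.09, "D|g1": 5.00}
flagged = flag_long_tips(tip_lengths)
```

Flagged sequences are candidates for manual inspection or removal before SNAP-OGs are used in downstream analyses.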

Taken together, we suggest that OrthoSNAP is useful for retrieving single-copy orthologous groups of genes from gene family data and that the identified SNAP-OGs have similar phylogenetic information content compared to SC-OGs. In combination with other phylogenomic toolkits, OrthoSNAP may be helpful for reconstructing the tree of life and expanding our understanding of the tempo and mode of evolution therein.

Methods

OrthoSNAP algorithm description and usage

We next describe how OrthoSNAP identifies SNAP-OGs. OrthoSNAP requires 2 files as input: one is a FASTA file that contains 2 or more homologous sequences in 1 or more species and the other is the corresponding gene family phylogeny in Newick format. In both the FASTA and Newick files, users must follow a naming scheme, whereby species, strain, or organism identifiers and gene sequence identifiers are separated by a vertical bar (also known as a pipe character or “|”), which allows OrthoSNAP to determine which sequences were encoded in the genome of each species, strain, or organism. After initiating OrthoSNAP, the gene family phylogeny is first midpoint rooted (unless the user specifies that the inputted phylogeny is already rooted) and then SNAP-OGs are identified using a tree-traversal algorithm. To do so, OrthoSNAP will loop through the internal branches in the gene family phylogeny and evaluate the number of distinct taxon identifiers among children terminal branches. If the number of unique taxon identifiers is greater than or equal to the orthogroup occupancy threshold (default: 50% of total taxa in the inputted phylogeny; users can specify an integer threshold), then all children branches and termini are examined further; otherwise, OrthoSNAP will examine the next internal branch. Next, OrthoSNAP will collapse branches with low support (default: 80, which is motivated by using ultrafast bootstrap approximations [59] to evaluate bipartition support; users can specify an integer threshold) and conduct species-specific inparalog trimming whereby the longest sequence is maintained, a practice common in transcriptomics. However, users can specify whether the shortest sequence or the median sequence (in the case of 3 or more sequences) should be kept instead. Users can also pick which species-specific inparalog to keep based on branch lengths (the longest, shortest, or median branch length in the case of having 3 or more sequences). Species-specific inparalogs are defined as sequences encoded in the same genome that are sister to one another or belong to the same polytomy [19]. The resulting set of sequences is examined to determine if each species, strain, or organism is represented by 1 sequence and to ensure these sequences have not yet been assigned to a SNAP-OG. If so, they are considered a SNAP-OG; if not, OrthoSNAP will examine the next internal branch. When SNAP-OGs are identified, FASTA files of SNAP-OG sequences are outputted. Users can also output the subtree of the SNAP-OG using an additional argument.
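The naming scheme, the occupancy check, and the choice of which species-specific inparalog to keep can be illustrated with a short Python sketch. It is not OrthoSNAP's code: the function names are hypothetical, ties are broken alphabetically as an assumption, and the median of an even number of sequences is taken as the lower middle value, also as an assumption.

```python
def taxon_of(name):
    """Extract the taxon identifier from a "taxon|gene" sequence name."""
    return name.split("|", 1)[0]

def meets_occupancy(names, total_taxa, fraction=0.5):
    """Check whether the distinct taxa among names reach the occupancy threshold."""
    taxa = {taxon_of(name) for name in names}
    return len(taxa) >= fraction * total_taxa

def choose_inparalog(seq_lengths, keep="longest"):
    """Pick one identifier from a group of species-specific inparalogs.

    seq_lengths maps "taxon|gene" names to sequence lengths.
    keep is "longest", "shortest", or "median" (median needs 3 or more).
    """
    # Sort by length, then by name so that ties resolve deterministically.
    ranked = sorted(seq_lengths, key=lambda name: (seq_lengths[name], name))
    if keep == "longest":
        return ranked[-1]
    if keep == "shortest":
        return ranked[0]
    if keep == "median":
        if len(ranked) < 3:
            raise ValueError("median requires 3 or more sequences")
        return ranked[(len(ranked) - 1) // 2]
    raise ValueError(f"unknown option: {keep}")

# Three inparalogs from taxon A with different sequence lengths.
group = {"A|g1": 300, "A|g2": 450, "A|g3": 120}
kept_by_default = choose_inparalog(group)
```

The same selection logic would apply to branch lengths instead of sequence lengths by passing a dictionary of branch lengths.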

The principles of the OrthoSNAP algorithm are also described using the following pseudocode:

FOR internal branch in midpoint rooted gene family phylogeny:

  1. > IF orthogroup occupancy among children termini is greater than or equal to orthogroup occupancy threshold;
  2. >> Collapse poorly supported bipartitions and trim species-specific inparalogs;
  3. >> IF each species, strain, or organism among the trimmed set of species, strains, or organisms is represented by only one sequence and no sequences being examined have been assigned to a SNAP-OG yet;
  4. >>> Sequences represent a SNAP-OG and are outputted to a FASTA file
  5. >> ELSE
  6. >>> examine next internal branch
  7. > ELSE
  8. >> examine next internal branch

ENDFOR
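The pseudocode can be turned into a small runnable sketch. The version below is a deliberate simplification and not the OrthoSNAP implementation: the tree is a nested tuple of "taxon|gene" leaf names that is assumed to be already rooted, the step that collapses poorly supported bipartitions is omitted, internal nodes are visited root first (an assumption about traversal order), and all function names are hypothetical.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def internal_nodes(node):
    """Yield internal nodes in preorder, starting with the root."""
    if isinstance(node, str):
        return
    yield node
    for child in node:
        yield from internal_nodes(child)

def trim_inparalogs(node, seq_len):
    """Merge same-taxon leaves that share a parent, keeping the longest."""
    if isinstance(node, str):
        return node
    children = [trim_inparalogs(child, seq_len) for child in node]
    best = {}
    for leaf in (c for c in children if isinstance(c, str)):
        taxon = leaf.split("|")[0]
        if taxon not in best or seq_len[leaf] > seq_len[best[taxon]]:
            best[taxon] = leaf
    kept = list(best.values()) + [c for c in children if not isinstance(c, str)]
    return kept[0] if len(kept) == 1 else tuple(kept)

def find_snap_ogs(tree, seq_len, occupancy):
    """Return lists of leaf names that form single-copy subgroups."""
    assigned, groups = set(), []
    for node in internal_nodes(tree):
        taxa = {leaf.split("|")[0] for leaf in leaves(node)}
        if len(taxa) < occupancy:
            continue  # too few taxa: examine next internal branch
        kept = leaves(trim_inparalogs(node, seq_len))
        kept_taxa = [leaf.split("|")[0] for leaf in kept]
        single_copy = len(kept_taxa) == len(set(kept_taxa))
        if single_copy and not assigned.intersection(kept):
            groups.append(sorted(kept))
            assigned.update(kept)
    return groups

# Toy gene family: an old duplication plus an inparalog pair in taxon A.
toy_tree = ((("A|1", "B|1"), "C|1"), (("A|2", "A|3"), ("B|2", "C|2")))
toy_len = {"A|1": 90, "B|1": 95, "C|1": 92, "A|2": 100, "A|3": 150,
           "B|2": 88, "C|2": 91}
snap_ogs = find_snap_ogs(toy_tree, toy_len, occupancy=2)
```

In this toy case the root is rejected because taxon A appears twice after trimming, and each of its two child subtrees then yields one single-copy group, with the longer inparalog "A|3" retained.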

To enhance the user experience, arguments or default values are printed to the standard output, a progress bar informs the user of how much of the analysis has been completed, and the number of SNAP-OGs identified as well as the names of the outputted FASTA files are printed to the standard output.

Development practices and design principles to ensure long-term software stability

Archival instabilities among software threaten the reproducibility of bioinformatics research [60]. To ensure long-term stability of OrthoSNAP, we implemented previously established rigorous development practices and design principles [44,52,61,62]. For example, OrthoSNAP features a refactored codebase, which facilitates debugging, testing, and future development. We also implemented a continuous integration pipeline to automatically build, package, and install OrthoSNAP across Python versions 3.7, 3.8, and 3.9. The continuous integration pipeline also conducts 57 unit and integration tests, which span 95.90% of the codebase and ensure faithful function of OrthoSNAP.

Dataset generation

To generate a dataset for identifying SNAP-OGs and comparing them to SC-OGs, we first identified putative groups of orthologous genes across 4 empirical datasets. To do so, we first downloaded proteomes for each dataset, which were obtained from publicly available repositories on NCBI (S1 and S7 Tables) or figshare [51]. Each dataset varied in its sampling of sequence diversity and in the evolutionary divergence of the sampled taxa. The dataset of 24 budding yeasts spans approximately 275 million years of evolution [51]; the dataset of 36 filamentous fungi spans approximately 94 million years of evolution [63]; the dataset of 26 mammals spans approximately 160 million years of evolution [64]; and the dataset of 28 eutherian mammals, which was used to study the contentious deep evolutionary relationships among eutherian mammals, concerns an ancient divergence that occurred approximately 160 million years ago [65]. Putatively orthologous groups of genes were identified using OrthoFinder, v2.3.8 [42], with default parameters, which resulted in 46,645 orthologous groups of genes with at least 50% orthogroup occupancy (S8 Table).

To infer the evolutionary history of each orthologous group of genes, we first individually aligned and trimmed each group of sequences using MAFFT, v7.402 [66], with the “auto” parameter and ClipKIT, v1.1.3 [61], with the “smart-gap” parameter, respectively. Thereafter, we inferred the best-fitting substitution model using Bayesian information criterion and the evolutionary history of each orthologous group of genes using IQ-TREE2, v2.0.6 [49]. Bipartition support was examined using 1,000 ultrafast bootstrap approximations [59].

To identify SNAP-OGs, the FASTA file and associated phylogenetic tree for each gene family with multiple homologs in 1 or more species were used as input for OrthoSNAP, v0.0.1 (this study). Across 40,011 gene families with multiple homologs in 1 or more species in all datasets, we identified 6,630 SNAP-OGs with at least 50% orthogroup occupancy (S1 Fig and S8 Table). Unaligned sequences of SNAP-OGs were then individually aligned and trimmed using the same strategy as described above. To determine gene families that were SC-OGs, we identified orthologous groups of genes with at least 50% orthogroup occupancy in which each species, strain, or organism was represented by only 1 sequence; 6,634 orthologous groups of genes were SC-OGs.

Measuring and comparing information content among SC-OGs and SNAP-OGs

To compare the information content of SC-OGs and SNAP-OGs, we calculated 9 properties of multiple sequence alignments and phylogenetic trees associated with robust phylogenetic signal in the budding yeasts, filamentous fungi, and mammalian datasets (S4 Table). More specifically, we calculated information content from phylogenetic trees such as measures of tree certainty (average bootstrap support), accuracy (Robinson–Foulds distance; [67]), signal-to-noise ratios (treeness; [68]), and violation of clock-like evolution (degree of violation of a molecular clock or DVMC; [69]). Information content was also measured among multiple sequence alignments by examining alignment length and the number of parsimony-informative sites, which are associated with robust and accurate inferences of evolutionary histories [70], as well as biases in sequence composition (RCV; [68]). Lastly, information content was also evaluated using metrics that consider characteristics of phylogenetic trees and multiple sequence alignments such as the degree of saturation, which refers to multiple substitutions in multiple sequence alignments that underestimate the distance between 2 taxa [71], and treeness/RCV, a measure of signal-to-noise ratios in phylogenetic trees and sequence composition biases [68]. For tree accuracy, phylogenetic trees were compared to species trees reported in previous studies [51,63,64]. All properties were calculated using functions in PhyKIT, v1.1.2 [52]. The function used to calculate each metric and additional information are described in S4 Table.
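As an illustration of one of these alignment properties, the sketch below counts parsimony-informative sites using the standard definition: a column is informative when at least 2 character states each occur in at least 2 sequences. It is not the PhyKIT implementation; the function name and the choice to ignore only gap and missing-data symbols are assumptions.

```python
from collections import Counter

def count_parsimony_informative_sites(alignment, ignore="-?"):
    """Count columns with at least 2 states that each appear 2 or more times.

    alignment: list of equal-length aligned sequences (strings).
    ignore: characters treated as gaps or missing data (an assumption).
    """
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    informative = 0
    for column in zip(*alignment):
        counts = Counter(ch.upper() for ch in column if ch not in ignore)
        states_seen_twice = sum(1 for n in counts.values() if n >= 2)
        if states_seen_twice >= 2:
            informative += 1
    return informative

# Columns 2 and 4 are informative; column 1 is constant and column 3
# has a state that appears only once.
toy_alignment = ["ACGT", "ACGA", "ATGA", "ATCT"]
n_informative = count_parsimony_informative_sites(toy_alignment)
```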

Principal component analysis across the 9 properties that summarize phylogenetic information content was used to qualitatively compare SC-OGs and SNAP-OGs in reduced dimensional space. Principal component analysis, visualization, and determination of property contribution to each principal component were carried out using factoextra, v1.0.7 [72], and FactoMineR, v2.4 [73], in the R, v4.0.2 (https://cran.r-project.org/), programming environment. Statistical analysis using a multifactor ANOVA was used to quantitatively compare SC-OGs and SNAP-OGs using the res.aov() function in R.

Information theory-based approaches were used to evaluate incongruence among SC-OGs and SNAP-OGs phylogenetic trees. More specifically, we calculated tree certainty and tree certainty-all [74–76], which are conceptually similar to entropy values and are derived from examining support among a set of gene trees and the 2 most supported topologies or all topologies that occur with a frequency of ≥5%, respectively. More simply, tree certainty values range from 0 to 1 in which low values are indicative of low congruence among gene trees and high values are indicative of high congruence among gene trees. Tree certainty and tree certainty-all values were calculated using RAxML, v8.2.10 [77].
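The entropy-like character of these measures can be made concrete with a small sketch. It assumes the two-bipartition form of internode certainty, 1 + p1·log2(p1) + p2·log2(p2), where p1 and p2 are the relative frequencies of the 2 most prevalent conflicting bipartitions, and it takes the relative tree certainty to be the mean over internodes. The function names are hypothetical and this is not how RAxML is invoked or implemented.

```python
from math import log2

def internode_certainty(freq_top, freq_second):
    """Certainty for one internode from the counts of its 2 top bipartitions."""
    total = freq_top + freq_second
    if total <= 0:
        raise ValueError("at least one bipartition must be observed")
    value = 1.0
    for freq in (freq_top, freq_second):
        p = freq / total
        if p > 0:  # by convention, 0 * log2(0) contributes nothing
            value += p * log2(p)
    return value

def relative_tree_certainty(internode_counts):
    """Average internode certainty across all internodes of a tree."""
    values = [internode_certainty(a, b) for a, b in internode_counts]
    return sum(values) / len(values)

# One fully congruent internode and one evenly split internode.
example = relative_tree_certainty([(100, 0), (50, 50)])
```

An internode supported by every gene tree scores 1, and one split evenly between two conflicting bipartitions scores 0, which matches the interpretation of low and high values given above.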

To examine patterns of support in a contentious branch concerning deep evolutionary relationships among eutherian mammals, we calculated gene support frequencies and ΔGLS. Gene support frequencies were calculated using the “polytomy_test” function in PhyKIT, v1.1.2 [52]. To account for uncertainty in gene tree topology, we also examined patterns of gene support frequencies after collapsing bipartitions with ultrafast bootstrap approximation support lower than 75 using the “collapse” function in PhyKIT. To calculate gene-wise log-likelihood values, partition log-likelihoods were calculated using the “wpl” parameter in IQ-TREE2 [49], which required as input a phylogeny in Newick format that represented either hypothesis 1, 2, or 3 (Fig 4A) and a concatenated alignment of SC-OGs and SNAP-OGs with partition information. Thereafter, the log-likelihood values were used to assign genes to the topology they best supported. Inconclusive genes, defined as having a gene-wise log-likelihood difference of less than 0.01, were removed.
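The assignment step can be sketched as follows. The input layout (a dictionary of per-gene log-likelihoods keyed by hypothesis name) and the function name are hypothetical, and treating the gene-wise difference as the gap between the best and second-best hypothesis is an assumption made for this illustration.

```python
def assign_genes(gene_lnl, min_delta=0.01):
    """Assign each gene to its best-supported hypothesis.

    gene_lnl: dict mapping gene name to {hypothesis name: log-likelihood}.
    min_delta: genes whose best and second-best log-likelihoods differ by
        less than this value are treated as inconclusive and dropped.
    Returns a dict mapping gene name to (hypothesis, delta).
    """
    assignments = {}
    for gene, lnls in gene_lnl.items():
        ranked = sorted(lnls.items(), key=lambda item: item[1], reverse=True)
        (best_name, best), (_, runner_up) = ranked[0], ranked[1]
        delta = best - runner_up
        if delta < min_delta:
            continue  # inconclusive gene
        assignments[gene] = (best_name, delta)
    return assignments

# Toy values: g2 barely distinguishes H1 from H2 and is removed.
toy_lnl = {
    "g1": {"H1": -1000.0, "H2": -1005.5, "H3": -1010.0},
    "g2": {"H1": -2000.000, "H2": -2000.005, "H3": -2003.0},
    "g3": {"H1": -510.0, "H2": -500.0, "H3": -505.0},
}
supported = assign_genes(toy_lnl)
```

Counting how many genes fall under each hypothesis then gives a simple summary of support for the contentious branch.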

The same methodologies (orthology inference, multiple sequence alignment, trimming, tree inference, SNAP-OG identification, and phylogenetic information content calculations) were also applied to 3 additional datasets that represent complex datasets: 30 plants (with a history of extensive gene duplication and loss events), 30 budding yeast species (15 of which experienced whole-genome duplication), and 20 choanoflagellate transcriptomes (where typically multiple transcripts correspond to a single protein-coding gene) [31,32].

Supporting information

S5 Fig. Quality of representation and contributions of properties of phylogenetic information content during principal component analysis.

Principal component analysis was used to qualitatively compare the similarities and differences between SNAP-OGs and SC-OGs (Fig 3). The leftmost figure in each panel of budding yeasts (A), filamentous fungi (B), and mammals (C) represents the quality of representation for each property across all principal components. The next 2 figures depict the contribution of each property (or variable) to the first and second dimension in reduced dimensional space. The red dashed line represents equal contributions from each variable. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

https://doi.org/10.1371/journal.pbio.3001827.s005

(TIF)

S8 Fig. Cartoon comparison of different tree decomposition algorithms.

Using the phylogeny presented in Fig 1B (panel A) and Fig 2B (panel B), different tree decomposition algorithms are compared. (A) OrthoSNAP will identify 4 SNAP-OGs, whereas DISCO and the maximally inclusive strategies will each identify 3 subgroups of orthologous genes. PhyloTreePruner will not identify any subgroups of single-copy orthologous genes. (B) OrthoSNAP will identify 5 subgroups of single-copy orthologous genes (light blue) by identifying maximally inclusive subgroups (subtrees where each taxon is represented by a single sequence) and maximally inclusive subgroups after species-specific inparalog trimming (species-specific inparalogs are shown in orange). In contrast, DISCO and maximally inclusive strategies will identify 3 SC-OGs, in part, because they do not account for species-specific inparalogs. PhyloTreePruner, which only prunes species-specific inparalogs, will not identify any subgroups of single-copy orthologous genes due to the presence of more ancient duplication events.

https://doi.org/10.1371/journal.pbio.3001827.s008

(TIF)

S1 Table. Species and accession numbers for proteomes used in each dataset.

This table details the species used for the budding yeasts, filamentous fungi, and mammalian datasets. All proteomes from budding yeasts were downloaded from Shen and colleagues [51]. Proteomes from filamentous fungi and mammals were downloaded from NCBI, and their accessions and assembly names are provided.

https://doi.org/10.1371/journal.pbio.3001827.s009

(XLSX)

References

  1. Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. pmid:14574403
  2. Jeffares DC, Tomiczek B, Sojo V, dos Reis M. A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome. 2015. p. 65–90.
  3. Steenwyk JL, Phillips MA, Yang F, Date SS, Graham TR, Berman J, et al. A gene coevolution network provides insight into eukaryotic cellular and genomic structure and function. bioRxiv. 2021; 2021.07.09.451830.
  4. Li Z, De La Torre AR, Sterck L, Cánovas FM, Avila C, Merino I, et al. Single-Copy Genes as Molecular Markers for Phylogenomic Studies in Seed Plants. Genome Biol Evol. 2017;9:1130–1147. pmid:28460034
  5. Dong Y, Chen S, Cheng S, Zhou W, Ma Q, Chen Z, et al. Natural selection and repeated patterns of molecular evolution following allopatric divergence. Elife. 2019;8. pmid:31373555
  6. Wu J, Yonezawa T, Kishino H. Rates of Molecular Evolution Suggest Natural History of Life History Traits and a Post-K-Pg Nocturnal Bottleneck of Placentals. Curr Biol. 2017;27:3025–3033.e5. pmid:28966093
  7. Malnic B, Godfrey PA, Buck LB. The human olfactory receptor gene family. Proc Natl Acad Sci. 2004;101:2584–2589. pmid:14983052
  8. Niimura Y, Matsui A, Touhara K. Extreme expansion of the olfactory receptor gene repertoire in African elephants and evolutionary dynamics of orthologous gene groups in 13 placental mammals. Genome Res. 2014;24:1485–1496. pmid:25053675
  9. Ozcan S, Johnston M. Function and regulation of yeast hexose transporters. Microbiol Mol Biol Rev. 1999;63:554–569. pmid:10477308
  10. Wingender E, Schoeps T, Dönitz J. TFClass: an expandable hierarchical classification of human transcription factors. Nucleic Acids Res. 2013;41:D165–D170. pmid:23180794
  11. Emms DM, Kelly S. STAG: Species Tree Inference from All Genes. bioRxiv. 2018;267914.
  12. Thomas GWC, Dohmen E, Hughes DST, Murali SC, Poelchau M, Glastad K, et al. Gene content evolution in the arthropods. Genome Biol. 2020;21:15. pmid:31969194
  13. Smith ML, Hahn MW. New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet. 2021;37:174–187. pmid:32921510
  14. Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Thorne J, editor. Mol Biol Evol. 2020;37:3292–3307. pmid:32886770
  15. Willson J, Roddur MS, Liu B, Zaharias P, Warnow T. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition. Hahn M, editor. Syst Biol. 2021. pmid:34450658
  16. Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss. bioRxiv. 2021; 2021.03.29.437460.
  17. Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013;23:323–330. pmid:23132911
  18. de Oliveira Martins L, Posada D. Species Tree Estimation from Genome-Wide Data with guenomu. 2017. p. 461–478.
  19. Kocot KM, Citarella MR, Moroz LL, Halanych KM. PhyloTreePruner: A phylogenetic tree-based approach for selection of orthologous sequences for phylogenomics. Evol Bioinform Online. 2013;2013:429–435. pmid:24250218
  20. Dunn CW, Howison M, Zapata F. Agalma: an automated phylogenomics workflow. BMC Bioinformatics. 2013;14:330. pmid:24252138
  21. Train C-M, Glover NM, Gonnet GH, Altenhoff AM, Dessimoz C. Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference. Bioinformatics. 2017;33:i75–i82. pmid:28881964
  22. Schuh RT, Polhemus JT. Analysis of Taxonomic Congruence among Morphological, Ecological, and Biogeographic Data Sets for the Leptopodomorpha (Hemiptera). Syst Biol. 1980;29:1–26.
  23. Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. pmid:12878457
  24. Defoort J, Van de Peer Y, Carretero-Paulet L. The evolution of gene duplicates in angiosperms and the impact of protein-protein interactions and the mechanism of duplication. Golding B, editor. Genome Biol Evol. 2019. pmid:31364708
  25. De Smet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van de Peer Y. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc Natl Acad Sci. 2013;110:2898–2903. pmid:23382190
  26. Panchy N, Lehti-Shiu M, Shiu S-H. Evolution of Gene Duplication in Plants. Plant Physiol. 2016;171:2294–2316. pmid:27288366
  27. Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature. 2006;440:341–345. pmid:16541074
  28. Wolfe KH. Origin of the Yeast Whole-Genome Duplication. PLoS Biol. 2015;13:e1002221. pmid:26252643
  29. Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997;387:708–713. pmid:9192896
  30. Marcet-Houben M, Gabaldón T. Beyond the Whole-Genome Duplication: Phylogenetic Evidence for an Ancient Interspecies Hybridization in the Baker’s Yeast Lineage. Hurst LD, editor. PLoS Biol. 2015;13:e1002220. pmid:26252497
  31. Richter DJ, Fozouni P, Eisen MB, King N. Gene family innovation, conservation and loss on the animal stem lineage. Elife. 2018;7. pmid:29848444
  32. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–652. pmid:21572440
  33. Hallström BM, Kullberg M, Nilsson MA, Janke A. Phylogenomic Data Analyses Provide Evidence that Xenarthra and Afrotheria Are Sister Groups. Mol Biol Evol. 2007;24:2059–2068. pmid:17630282
  34. Wildman DE, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, et al. Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci. 2007;104:14395–14400. pmid:17728403
  35. Murphy WJ. Resolution of the Early Placental Mammal Radiation Using Bayesian Phylogenetics. Science. 2001;294:2348–2351. pmid:11743200
  36. Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O’Brien SJ. Molecular phylogenetics and the origins of placental mammals. Nature. 2001;409:614–618. pmid:11214319
  37. Smith ML, Vanderpool D, Hahn MW. Using all gene families vastly expands data available for phylogenomic inference in primates. bioRxiv. 2021; 2021.09.22.461252.
  38. van der Heijden RT, Snel B, van Noort V, Huynen MA. Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics. 2007;8:83. pmid:17346331
  39. Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346:1320–1331. pmid:25504713
  40. Steenwyk JL, Lind AL, Ries LNA, dos Reis TF, Silva LP, Almeida F, et al. Pathogenic Allodiploid Hybrids of Aspergillus Fungi. Curr Biol. 2020;30:2495–2507.e7. pmid:32502407
  41. Meleshko O, Martin MD, Korneliussen TS, Schröck C, Lamkowski P, Schmutz J, et al. Extensive Genome-Wide Phylogenetic Discordance Is Due to Incomplete Lineage Sorting and Not Ongoing Introgression in a Rapidly Radiated Bryophyte Genus. Mol Biol Evol. 2021;38:2750–2766. pmid:33681996
  42. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. pmid:31727128
  43. Li L, Stoeckert CJ, Roos DS. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. pmid:12952885
  44. Steenwyk JL, Rokas A. orthofisher: a broadly applicable tool for automated gene identification and retrieval. Comeron JM, editor. G3 (Bethesda). 2021;11. pmid:34544141
  45. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. pmid:20003500
  46. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. pmid:20709691
  47. Eddy SR. Accelerated Profile HMM Searches. Pearson WR, editor. PLoS Comput Biol. 2011;7:e1002195. pmid:22039361
  48. Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5. pmid:20224823
  49. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Teeling E, editor. Mol Biol Evol. 2020;37:1530–1534. pmid:32011700
  50. Zhou X, Shen X-X, Hittinger CT, Rokas A. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets. Mol Biol Evol. 2018;35:486–503. pmid:29177474
  51. Shen X-X, Opulente DA, Kominek J, Zhou X, Steenwyk JL, Buh KV, et al. Tempo and Mode of Genome Evolution in the Budding Yeast Subphylum. Cell. 2018;175:1533–1545.e20. pmid:30415838
  52. Steenwyk JL, Buida TJ, Labella AL, Li Y, Shen X-X, Rokas A. PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data. Schwartz R, editor. Bioinformatics (Oxford, England). 2021. pmid:33560364
  53. Tice AK, Žihala D, Pánek T, Jones RE, Salomaki ED, Nenarokov S, et al. PhyloFisher: A phylogenomic package for resolving eukaryotic relationships. Hejnol A, editor. PLoS Biol. 2021;19:e3001365. pmid:34358228
  54. Ballesteros JA, Hormiga G. A New Orthology Assessment Method for Phylogenomic Data: Unrooted Phylogenetic Orthology. Mol Biol Evol. 2016;33:2117–2134. pmid:27189539
  55. Yang Y, Smith SA. Orthology Inference in Nonmodel Organisms Using Transcriptomes and Low-Coverage Genomes: Improving Accuracy and Matrix Occupancy for Phylogenomics. Mol Biol Evol. 2014;31:3081–3092. pmid:25158799
  56. Shen X-X, Steenwyk JL, LaBella AL, Opulente DA, Zhou X, Kominek J, et al. Genome-scale phylogeny and contrasting modes of genome evolution in the fungal phylum Ascomycota. Sci Adv. 2020;6:eabd0079. pmid:33148650
  57. Steenwyk JL, Opulente DA, Kominek J, Shen X-X, Zhou X, Labella AL, et al. Extensive loss of cell-cycle and DNA repair genes in an ancient lineage of bipolar budding yeasts. Kamoun S, editor. PLoS Biol. 2019;17:e3000255. pmid:31112549
  58. Vakirlis N, Sarilar V, Drillon G, Fleiss A, Agier N, Meyniel J-P, et al. Reconstruction of ancestral chromosome architecture and gene repertoire reveals principles of genome evolution in a model yeast genus. Genome Res. 2016;26:918–932. pmid:27247244
  59. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018;35:518–522. pmid:29077904
  60. Mangul S, Martin LS, Eskin E, Blekhman R. Improving the usability and archival stability of bioinformatics software. Genome Biol. 2019;20:47. pmid:30813962
  61. Steenwyk JL, Buida TJ, Li Y, Shen X-X, Rokas A. ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. Hejnol A, editor. PLoS Biol. 2020;18:e3001007. pmid:33264284
  62. Steenwyk JL, Buida TJ, Gonçalves C, Goltz DC, Morales G, Mead ME, et al. BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data. Stajich J, editor. Genetics. 2022. pmid:35536198
  63. Steenwyk JL, Shen X-X, Lind AL, Goldman GH, Rokas A. A Robust Phylogenomic Time Tree for Biotechnologically and Medically Important Fungi in the Genera Aspergillus and Penicillium. Boyle JP, editor. MBio. 2019;10. pmid:31289177
  64. Tarver JE, dos Reis M, Mirarab S, Moran RJ, Parker S, O’Reilly JE, et al. The Interrelationships of Placental Mammals and the Limits of Phylogenetic Inference. Genome Biol Evol. 2016;8:330–344. pmid:26733575
  65. Luo Z-X, Yuan C-X, Meng Q-J, Ji Q. A Jurassic eutherian mammal and divergence of marsupials and placentals. Nature. 2011;476:442–445. pmid:21866158
  66. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30:772–780. pmid:23329690
  67. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–147.
  68. Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. pmid:12878457
  69. Liu L, Zhang J, Rheindt FE, Lei F, Qu Y, Wang Y, et al. Genomic evidence reveals a radiation of placental mammals uninterrupted by the KPg boundary. Proc Natl Acad Sci. 2017;114:E7282–E7290. pmid:28808022
  70. Shen X-X, Salichos L, Rokas A. A Genome-Scale Investigation of How Sequence, Function, and Tree-Based Gene Properties Influence Phylogenetic Inference. Genome Biol Evol. 2016;8:2565–2580. pmid:27492233
  71. Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, et al. Resolving Difficult Phylogenetic Questions: Why More Sequences Are Not Enough. Penny D, editor. PLoS Biol. 2011;9:e1000602. pmid:21423652
  72. Kassambara A, Mundt F. factoextra. R package, v. 1.0.5. 2017.
  73. Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25:1–18.
  74. Salichos L, Rokas A. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature. 2013;497:327–331. pmid:23657258
  75. Salichos L, Stamatakis A, Rokas A. Novel Information Theory-Based Measures for Quantifying Incongruence among Phylogenetic Trees. Mol Biol Evol. 2014;31:1261–1271. pmid:24509691
  76. Kobert K, Salichos L, Rokas A, Stamatakis A. Computing the Internode Certainty and Related Measures from Partial Gene Trees. Mol Biol Evol. 2016;33:1606–1617. pmid:26915959
  77. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. pmid:24451623
  78. Song S, Liu L, Edwards SV, Wu S. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Natl Acad Sci. 2012;109:14942–14947. pmid:22930817
  79. Doyle VP, Young RE, Naylor GJP, Brown JM. Can We Identify Genes with Increased Phylogenetic Reliability? Syst Biol. 2015;64:824–837. pmid:26099258