Application of a novel amino-acid substitution model to determining land plant evolution

Publications

Data-specific substitution models improve protein-based phylogenetics

Publication. Brazão, João; Foster, Peter G.; Cox, Cymon J.

Calculating amino-acid substitution models that are specific to individual protein data sets is often difficult because of the computational burden of estimating large numbers of rate parameters. In this study, we tested the computational efficiency and accuracy of five methods used to estimate substitution models, namely Codeml, FastMG, IQ-TREE, P4 (maximum likelihood), and P4 (Bayesian inference). Data-specific substitution models were estimated from simulated alignments of different lengths that were generated from a known simulation model and simulation tree. Each resulting data-specific substitution model was used to calculate the maximum-likelihood score of the simulation tree on the simulated data from which the model was estimated, and this score was compared with the maximum-likelihood score of the known simulation model and simulation tree on the same simulated data. Additionally, the commonly used empirical models cpREV and WAG were assessed in the same way. Data-specific models performed better than the empirical models, which under-fitted the simulated alignments, differed most from the simulation model in maximum-likelihood score, clustered further from the simulation model in principal component analysis ordination, and inferred less accurate trees. Data-specific models and the simulation model had statistically indistinguishable maximum-likelihood scores, indicating that the five methods were reasonably accurate at estimating substitution models by this measure. Nevertheless, tree statistics showed differences between optimal maximum-likelihood trees. Only the trees inferred using data-specific models generated with IQ-TREE and P4 (maximum likelihood) were not significantly different from the trees derived from the simulation model in each analysis, indicating that these two methods were the most accurate at estimating data-specific models.
To show the benefits of using data-specific protein models, several published data sets were re-analysed using IQ-TREE-estimated models. These newly estimated models were a better fit to the data than the empirical models used by the original authors, often inferred longer trees, and resulted in different tree topologies in more than half of the re-analysed data sets. The results of this study show that software availability and high computational burden are not limitations to generating better-fitting data-specific amino-acid substitution models for phylogenetic analyses.

2023-08. Scientific article

Open access

Applying data-specific substitution models and mitigating the effects of among-lineage heterogeneity to infer better protein-based phylogenies

Publication. Brazão, João Miguel Rodrigues da Silva; Cox, Cymon J.

A major challenge in the phylogenetic reconstruction of evolutionary relationships lies in understanding the impact of model fit on the accuracy of phylogenetic trees. The work conducted in this thesis aims to infer better trees by using amino-acid substitution models that are specific to the study data, and to evaluate strategies to mitigate the effects of systematic bias. Several software programmes for calculating data-specific models were evaluated, with IQ-TREE exhibiting the best features. These models consistently showed a better fit to the data than the pre-computed empirical models, indicating their greater robustness against biases caused by poorer-fitting models. Data-specific substitution models combined with more complex heterogeneous models or data-partitioning strategies helped to reduce systematic bias. Among the methods evaluated to identify heterogeneous data, the matched-pairs test of marginal symmetry combined with the Benjamini-Hochberg method exhibited the highest statistical power, identifying composition-heterogeneous sequences that biased the relationships among Archaeplastida and Bryophyta. By contrast, the process of evolution underlying the emergence of land plants from charophyte algae was shown to be composition-homogeneous among lineages in the analyses of nuclear and chloroplast data. Tree-homogeneous and tree-heterogeneous analyses using these data robustly recovered the green algae Zygnematophyceae as the group most closely related to land plants. However, analyses of mitochondrial data placed Charophyceae as the sister-group to land plants, a result that was shown not to be caused by compositional heterogeneity among lineages. Nevertheless, further analyses identified a weak signal favouring Zygnematophyceae as the sister-group of land plants in buried-sites and slower-evolving sites data partitions.
The cause of the incongruence between the nuclear plus chloroplast data and the mitochondrial data remains unknown but may be biological in nature rather than due to systematic bias, perhaps a result of evolutionary processes such as horizontal gene transfer.
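
The abstract above pairs the matched-pairs test of marginal symmetry with the Benjamini-Hochberg method. As a minimal sketch of the second step only, assuming that each pairwise test has already produced a p-value, the step-up procedure might look like the following. The function name `benjamini_hochberg`, the default `alpha` of 0.05, and the example p-values are illustrative assumptions and are not taken from the thesis.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which hypotheses are rejected while controlling the false discovery rate.

    Returns one boolean per input p-value, in the original order.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Rank the p-values from smallest to largest, remembering their positions.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Every hypothesis ranked at or below max_k is rejected.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, e.g. one per pair of sequences tested.
    example = [0.001, 0.008, 0.039, 0.041, 0.6]
    print(benjamini_hochberg(example))
```

In this sketch a rejected hypothesis would mark a sequence pair as compositionally heterogeneous; how such pairs are then mapped to individual sequences is not specified in the abstract and is left out here.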

2024. Doctoral thesis

Open access

Funding entity

Fundação para a Ciência e a Tecnologia

Grant number

SFRH/BD/134422/2017