Recoding amino acids to a reduced alphabet may increase or decrease phylogenetic accuracy

Foster, Peter G; Schrempf, Dominik; Szöllősi, Gergely J; Williams, Tom A; Cox, Cymon; Embley, T Martin

Publication

2022 Journal article

dc.contributor.author	Foster, Peter G
dc.contributor.author	Schrempf, Dominik
dc.contributor.author	Szöllősi, Gergely J
dc.contributor.author	Williams, Tom A
dc.contributor.author	Cox, Cymon
dc.contributor.author	Embley, T Martin
dc.date.accessioned	2022-12-21T14:13:29Z
dc.date.available	2023-08-07T00:30:16Z
dc.date.issued	2022
dc.description.abstract	Common molecular phylogenetic characteristics such as long branches and compositional heterogeneity can be problematic for phylogenetic reconstruction when using amino acid data. Recoding alignments to reduced alphabets before phylogenetic analysis has often been used both to explore and potentially decrease the effect of such problems. We tested the effectiveness of this strategy on topological accuracy using simulated data on four-taxon trees. We simulated alignments in phylogenetically challenging ways to test the phylogenetic accuracy of analyses using various recoding strategies together with commonly used homogeneous models. We tested three recoding methods based on amino acid exchangeability, and another recoding method based on lowering the compositional heterogeneity among alignment sequences as measured by the Chi-squared statistic. Our simulation results show that on trees with long branches where sequences approach saturation, accuracy was not greatly affected by exchangeability-based recodings, but Chi-squared-based recoding decreased accuracy. We then simulated sequences with different kinds of compositional heterogeneity over the tree. Recoding often increased accuracy on such alignments. Exchangeability-based recoding was rarely worse than not recoding, and often considerably better. Recoding based on lowering the Chi-squared value improved accuracy in some cases but not in others, suggesting that low compositional heterogeneity by itself is not sufficient to increase accuracy in the analysis of these alignments. We also simulated alignments using site-specific amino acid profiles, making sequences that had compositional heterogeneity over alignment sites. Exchangeability-based recoding coupled with site-homogeneous models had poor accuracy for these data sets but Chi-squared-based recoding on these alignments increased accuracy. We then simulated data sets that were compositionally both site- and tree-heterogeneous, like many real data sets. The effect on the accuracy of recoding such doubly problematic data sets varied widely, depending on the type of compositional tree heterogeneity and on the recoding scheme. Interestingly, analysis of unrecoded compositionally heterogeneous alignments with the NDCH or CAT models was generally more accurate than homogeneous analysis, whether recoded or not. Overall, our results suggest that making trees for recoded amino acid data sets can be useful, but they need to be interpreted cautiously as part of a more comprehensive analysis. The use of better-fitting models like NDCH and CAT, which directly account for the patterns in the data, may offer a more promising long-term solution for analyzing empirical data. [Compositional heterogeneity; models of evolution; phylogenetic methods; recoding amino acid data sets.]	pt_PT
dc.description.sponsorship	ALG-01-0145-FEDER022121
dc.description.version	info:eu-repo/semantics/publishedVersion	pt_PT
dc.identifier.doi	10.1093/sysbio/syac042	pt_PT
dc.identifier.issn	1063-5157
dc.identifier.uri	http://hdl.handle.net/10400.1/18697
dc.language.iso	eng	pt_PT
dc.peerreviewed	yes	pt_PT
dc.publisher	Oxford University Press	pt_PT
dc.relation	Reconstructing a dated tree of life using phylogenetic incongruence
dc.relation	Algarve Centre for Marine Sciences
dc.title	Recoding amino acids to a reduced alphabet may increase or decrease phylogenetic accuracy	pt_PT
dc.type	journal article
dspace.entity.type	Publication
oaire.awardTitle	Reconstructing a dated tree of life using phylogenetic incongruence
oaire.awardTitle	Algarve Centre for Marine Sciences
oaire.awardURI	info:eu-repo/grantAgreement/EC/H2020/714774/EU
oaire.awardURI	info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F04326%2F2020/PT
oaire.citation.startPage	syac042	pt_PT
oaire.citation.title	Systematic Biology	pt_PT
oaire.fundingStream	H2020
oaire.fundingStream	6817 - DCRRNI ID
person.familyName	Cox
person.givenName	Cymon
person.identifier.ciencia-id	6B15-9771-1D04
person.identifier.orcid	0000-0002-4927-979X
person.identifier.rid	D-1303-2012
person.identifier.scopus-author-id	7402112716
project.funder.identifier	http://doi.org/10.13039/501100008530
project.funder.identifier	http://doi.org/10.13039/501100001871
project.funder.name	European Commission
project.funder.name	Fundação para a Ciência e a Tecnologia
rcaap.rights	openAccess	pt_PT
rcaap.type	article	pt_PT
relation.isAuthorOfPublication	82c3689c-60b6-440d-9d7b-49e6dbd6861b
relation.isAuthorOfPublication.latestForDiscovery	82c3689c-60b6-440d-9d7b-49e6dbd6861b
relation.isProjectOfPublication	e0fe093a-abb9-42e5-8691-e887456c2f10
relation.isProjectOfPublication	fafa76a6-2cd2-4a6d-a3c9-772f34d3b91f
relation.isProjectOfPublication.latestForDiscovery	e0fe093a-abb9-42e5-8691-e887456c2f10
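
Illustrative note on the abstract: the two ideas it relies on, recoding an amino acid alignment to a reduced alphabet and measuring compositional heterogeneity with a Chi-squared statistic, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code. The six-group "Dayhoff" scheme is used only as a familiar example of an exchangeability-based recoding (this record does not say which schemes the paper tested), the statistic is the ordinary contingency-table Chi-squared over sequences and states, and the names `DAYHOFF6`, `make_recoding_table`, `recode`, and `composition_chi2` are hypothetical.

```python
from collections import Counter

# Assumption: the commonly cited six-group Dayhoff scheme, used here only as
# an example of an exchangeability-based reduced alphabet.
DAYHOFF6 = {
    "AGPST": "0",
    "DENQ": "1",
    "HKR": "2",
    "ILMV": "3",
    "FWY": "4",
    "C": "5",
}


def make_recoding_table(groups):
    """Expand {residue group: symbol} into a per-residue lookup table."""
    table = {}
    for residues, symbol in groups.items():
        for aa in residues:
            table[aa] = symbol
    return table


def recode(seq, table, missing="-"):
    """Recode one aligned sequence; unknown characters and gaps become `missing`."""
    return "".join(table.get(ch, missing) for ch in seq.upper())


def composition_chi2(seqs, states):
    """Chi-squared statistic of the sequences-by-states count table.

    Expected counts are row_total * column_total / grand_total, so a value
    of 0 means every sequence has the same composition.
    """
    counts = [Counter(ch for ch in s if ch in states) for s in seqs]
    row_totals = [sum(c.values()) for c in counts]
    col_totals = {st: sum(c[st] for c in counts) for st in states}
    grand = sum(row_totals)
    if grand == 0:
        return 0.0
    chi2 = 0.0
    for c, row in zip(counts, row_totals):
        for st in states:
            expected = row * col_totals[st] / grand
            if expected > 0:
                chi2 += (c[st] - expected) ** 2 / expected
    return chi2
```

For example, `recode(seq, make_recoding_table(DAYHOFF6))` maps each residue to one of six symbols, and calling `composition_chi2` on an alignment before and after recoding (with the 20 amino acid letters and with `"012345"` as states, respectively) gives one way to compare compositional heterogeneity across the two alphabets.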

Files

Original bundle

Name: Foster+22_SystBiology_preprint.pdf
Size: 490.86 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 3.46 KB
Format: Item-specific license agreed to upon submission

Collections

CCM2-Articles (in indexed journals or proceedings)