RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models

Bioinformatics, Vol. 22 no. 21 2006, pages 2688–2690
APPLICATIONS NOTE
doi:10.1093/bioinformatics/btl446
Phylogenetics

Alexandros Stamatakis
Swiss Federal Institute of Technology Lausanne, School of Computer and Communication Sciences, Lab Prof. Moret, STATION 14, CH-1015 Lausanne, Switzerland

Received on May 15, 2006; revised on July 21, 2006; accepted on August 16, 2006
Advance Access publication August 23, 2006
Associate Editor: Keith A Crandall

ABSTRACT

Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+G yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2–3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively.

Availability: icwww.epfl.ch/~stamatak

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Phylogenetic inference with the maximum likelihood (ML) method is NP-hard (Chor and Tuller, 2005). Despite the algorithmic complexity and the high computational cost of ML, significant progress has been achieved with the release of fast and accurate programs such as PHYML (Guindon and Gascuel, 2003), IQPNNI (Minh et al., 2005), MrBayes (Ronquist and Huelsenbeck, 2003), GARLI (Zwickl, 2006) and RAxML (Stamatakis et al., 2005). Most of these programs allow for inference of 1000-taxon trees on a single CPU in <24 h.

This paper describes the new version of RAxML [Randomized axelerated maximum likelihood for high performance computing (RAxML-VI-HPC, v2.0.1)], which is significantly faster than the previous versions of RAxML due to simple, yet very efficient technical optimizations and a slight alteration of the search algorithm. In addition, RAxML has been parallelized with MPI to enable parallel bootstrapping and multiple inferences on distinct starting trees on PC clusters. Moreover, it implements bifurcating and multifurcating constraint trees and the capability to assign and estimate separate model parameters for individual genes of multi-gene alignments (mixed/partitioned models).

The main focus is on the computation of huge trees (≥1000 taxa) for real-world data and the comparative performance study with GARLI, IQPNNI, MrBayes and PHYML. Since the efficiency of the novel optimizations in RAxML-VI-HPC increases with the number of taxa, less significant performance improvements will be observed on smaller datasets. Performance comparisons of RAxML with other popular ML programs on smaller datasets, including simulated alignments, can be found in Hordijk and Gascuel (2005), Stamatakis et al. (2005) and Zwickl (2006). Finally, the experimental study also shows that the GTR+CAT approximation [see Stamatakis (2006) for a detailed description] can be efficiently deployed as a replacement for the significantly more compute- and memory-intensive GTR+G model.

Some of the largest published ML-based analyses to date have been conducted using RAxML (Robertson et al., 2005; Ley et al., 2005, 2006). On-going work includes the computation of a backbone tree for Bacteria with 9000 taxa, a phylogeny for Acer with 582 taxa, and the analysis of a mammalian multi-gene alignment comprising 2182 sequences.

Footnote: CAT and G cannot be used simultaneously in the same analysis.

2 OPTIMIZATIONS OF RAxML

A detailed description of the optimizations listed below is provided in the on-line supplement. The main improvements cover:

- An efficient mechanism to store and re-store topologies and branch lengths via rearrangement descriptors.
- A consistent re-use of partial likelihood vectors.
- A dynamic adaptation of the rearrangement distance.
- Low-level optimization of the GTR+CAT and GTR+G likelihood functions.
- An efficient re-implementation of Maximum Parsimony starting tree computations.

An important and generally applicable insight from those optimizations is that storing and re-storing an unrooted tree topology with 2n − 3 branch lengths and 2n − 2 nodes can become a major performance bottleneck for trees with >1000 taxa. It is thus important to store alternative topologies as a sequence of topological changes applied to the current topology rather than as complete data objects. Only the consistent avoidance of storage operations reveals the actual power of the Lazy Subtree Rearrangement (LSR) mechanism introduced in Stamatakis et al. (2005).

Another issue which becomes important for huge trees is to determine a 'good' rearrangement distance, i.e. re-insertion radius for the LSR moves. In RAxML-VI the algorithm initially determines the best rearrangement distance by applying distances of 5, 10, ..., 25 for one iteration of LSRs to the starting tree. The minimum rearrangement distance which yields the best likelihood improvement on the starting tree is then selected for the inference. Despite the extra computations which are performed, a 'good' rearrangement distance pays off in terms of likelihood units for huge alignments with large evolutionary diameters (e.g. the 6722 and 7769 taxa alignments, see Supplementary Table 2).

3 RESULTS AND DISCUSSION

The exact experimental set-up as well as the results are described in detail in the on-line supplement. Table and Figure numbers also refer to the on-line supplement.

Results in Supplementary Table 2 show that RAxML-VI-HPC clearly outperforms RAxML-V in terms of inference times. In addition, due to the usage of a 'good' rearrangement setting it also yields significantly better log-likelihood values on the larger and more diverse datasets ≥4000 taxa. Supplementary Figure 3 shows the significant computational advantages of the GTR+CAT over the GTR+G implementation in RAxML-VI.

Supplementary Tables 3–6 indicate that RAxML-VI-HPC outperforms other current sequential phylogeny programs on huge datasets with respect to inference times, memory consumption as well as final log-likelihood values. In addition, the performance advantage with respect to run-times increases with growing alignment size (Supplementary Table 5). Another important result is that the GTR+CAT approximation (Supplementary Table 3) can be used to significantly reduce memory consumption and still yield significantly better GTR+G likelihood values (Supplementary Table 4) than competing programs.

GARLI terminated within approximately the same time as RAxML-VI-HPC on the six smaller datasets and yielded the second-best likelihood score in all cases. This is an astonishing achievement for several reasons: GARLI implements a genetic search algorithm and was executed under GTR+G. Moreover, it maintains a whole population of trees in memory, including some intelligently selected (Zwickl, 2006) partial likelihood vectors as well as all tree topologies. Thus, it is expected to be slower than the RAxML hill-climbing algorithm. This extraordinary performance is due to the sophisticated implementation of the likelihood function and promising algorithmic ideas (Zwickl, 2006) such that the forthcoming publication about GARLI is surely something to look forward to. Note that the parallel genetic search algorithm of GARLI performs a distinct and more thorough search that yields, e.g. better final trees on the 1000 taxon alignment (Zwickl, 2006). However, the focus of the current study is on the strictly sequential versions of all programs.

The performance of the new version of MrBayes is also remarkable. Given that it has to maintain four distinct Markov chains, the relatively low memory consumption in combination with acceptable likelihood values after 60 h under GTR+G, the performance is quite impressive. As Bayesian inference conceptually differs from pure ML-based inference, a comparison based on likelihood scores is certainly not fair since it uses MrBayes as an ML heuristic. MrBayes has mainly been included owing to its popularity.

IQPNNI and PHYML both suffer from a relatively inefficient technical implementation. The high memory consumption of IQPNNI and PHYML is due to a different memory organization which uses two likelihood vectors per branch (3n − 6 vectors) instead of one per inner node (n − 2 vectors).

Moreover, PHYML uses NNI moves which only exploit a very small fraction of the search space. A solution to this problem has been proposed by Hordijk and Gascuel (2005). However, the respective program is currently only available as a proof-of-concept implementation (W. Hordijk and O. Gascuel, personal communication) and cannot be used for large trees owing to numerical problems.

In the final analysis, it can be stated that technical implementation aspects are becoming increasingly important and can yield significant performance improvements. In addition, in all programs there exist excellent algorithmic ideas which in the optimal case could significantly advance the field when merged into one program.

4 CONCLUSION AND FUTURE WORK

The new version VI of RAxML has been presented, which incorporates efficient technical optimizations, parallel OpenMP- and MPI-based implementations, and a mixed model implementation. A thorough experimental study on large real-world datasets shows that RAxML can find better trees with a significantly lower memory consumption within similar or less time than the best competing program.

Future work will mainly cover the development of new methods for rapid bootstrapping. Although RAxML and GARLI allow for inference of huge trees with ML in reasonable times, conducting a full biological analysis still requires at least 100 or 1000 bootstraps, which places the computational burden much higher than for the inference of a single ML tree.

ACKNOWLEDGEMENTS

The author would like to thank Derrick Zwickl, Wim Hordijk, Olivier Gascuel, B.Q. Minh, L.S. Vinh and Bret Larget for useful discussions on experimental set-up and their programs. He would also like to thank Usman Roshan, Charles Robertson, Josh Wilcox, Robin Gutell and Daniel Dalevi for providing the alignment data. Funding to pay the Open Access publication charges for this article was provided by Swiss Confederation Funding.

Conflict of Interest: none declared.

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

REFERENCES

Chor,B. and Tuller,T. (2005) Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics, 21, 97–106.
Guindon,S. and Gascuel,O. (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol., 52, 696–704.
Hordijk,W. and Gascuel,O. (2005) Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood. Bioinformatics, 21, 4338–4347.
Ley,R. et al. (2005) Obesity alters gut microbial ecology. Proc. Natl Acad. Sci. USA, 102, 11070–11075.
Ley,R.E. et al. (2006) Unexpected diversity and complexity of the Guerrero Negro hypersaline microbial mat. Appl. Envir. Microbiol., 72, 3685–3695.
Minh,B.Q. et al. (2005) pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics, 21, 3794–3796.
Robertson,C. et al. (2005) Phylogenetic diversity and ecology of environmental Archaea. Curr. Opin. Microbiol., 8, 638–642.
Ronquist,F. and Huelsenbeck,J. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572–1574.
Stamatakis,A. (2006) Phylogenetic models of rate heterogeneity: a high performance computing perspective. In Proceedings of the IPDPS2006, Rhodos, Greece.
Stamatakis,A. et al. (2005) RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics, 21, 456–463.
Zwickl,D. (2006) Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD thesis, University of Texas at Austin, TX.
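
Illustrative sketch (not part of the original article). The rearrangement-distance selection described in Section 2 can be outlined in a few lines of Python. This is only a sketch under stated assumptions, not RAxML's implementation: the function name select_rearrangement_distance and the callback lsr_iteration are hypothetical, and the callback stands in for "run one iteration of LSRs on the starting tree with this distance and return the resulting log-likelihood".

```python
from typing import Callable, Iterable, Tuple


def select_rearrangement_distance(
    start_lnl: float,
    lsr_iteration: Callable[[int], float],  # hypothetical: distance -> log-likelihood
    candidates: Iterable[int] = range(5, 30, 5),  # 5, 10, 15, 20, 25 as in the text
) -> Tuple[int, float]:
    """Return (distance, improvement) for the smallest candidate distance
    that attains the best log-likelihood improvement over the starting tree."""
    best_distance = None
    best_gain = float("-inf")
    for distance in sorted(candidates):
        gain = lsr_iteration(distance) - start_lnl
        # Strict comparison: on ties the smaller distance already found is kept,
        # matching "the minimum rearrangement distance which yields the best
        # likelihood improvement".
        if gain > best_gain:
            best_distance, best_gain = distance, gain
    if best_distance is None:
        raise ValueError("no candidate distances given")
    return best_distance, best_gain


# Usage with made-up log-likelihoods standing in for real LSR runs.
fake_lnl = {5: -1000.0, 10: -990.0, 15: -985.0, 20: -985.0, 25: -986.0}
distance, gain = select_rearrangement_distance(-1010.0, fake_lnl.__getitem__)
```

Passing the LSR step in as a callback keeps the sketch independent of any tree data structure; in a real program each call would be the expensive part, which is the extra computation the text says pays off on large, diverse alignments.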
Bioinformatics, Volume 22 (21): 3 – Aug 23, 2006

Publisher: Oxford University Press
Copyright: © 2006 The Author(s)
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btl446
PMID: 16928733
Journal

Bioinformatics, Oxford University Press

Published: Aug 23, 2006
