Empirical codon models for comparative re-sequencing data
Disciplines
Biology (30%); Computer Sciences (70%)
Keywords
Phylogeny, Statistical Modelling, Population Genetics, Markov Process, Evolutionary Bioinformatics, Next Generation Sequencing
Darwinian selection is an important source of evolutionary innovation and a major force behind the divergence of species. Consequently, a wide variety of methods have been developed to detect genes that have been subject to selection, including comparative or phylogenetic methods that use patterns of substitutions between species. For example, standard likelihood ratio tests for positive selection have been developed that are based on codon substitution models. However, when applied to closely related species such as primates, these tests lack power, and therefore very few genes show signs of positive selection. Incorporating additional information, such as patterns of intraspecific polymorphism, promises to improve the detection of positive selection. With the emergence of new sequencing technologies, these data are now becoming available. Indeed, for several species, including human, Drosophila and Arabidopsis, 1,000 genomes will be available soon. However, it remains unclear whether the probabilistic methodologies previously used in phylogenetics and in population genetics are suitable for analysing these data sets.
The proposed project includes both a theoretical and an applied component that will provide bioinformatic tools and biological knowledge on the evolution of protein-coding genes. The theoretical component aims at developing new empirical codon models. We will develop new algorithms to estimate empirical models that take into account substantial site-specific and lineage-specific rate variation in comparative polymorphism data (i.e., sequences from several species and multiple individuals). This part of the project will therefore be geared towards implementing the model and testing it, using computer simulations and empirical data sets such as the mammalian phylogeny, to identify its underlying properties. The software developed will be made available to the public as open source software.
In the applied component, we will use the empirical codon models and their extensions to understand the evolutionary processes on the Drosophila phylogeny. Taking advantage of the 12 Drosophila genomes data and the 1000 D. melanogaster project, we will first analyse the melanogaster subgroup. Furthermore, D. ananassae populations will be sequenced to study another clade with different substitution patterns. Tests for Darwinian selection will be performed on both clades.
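The likelihood ratio tests for positive selection mentioned above compare the maximised log-likelihood of a null codon model with that of an alternative model that allows positive selection. The following is only a minimal sketch of that comparison, not the project's software: the function name `lrt_p_value` and the log-likelihood values in the usage line are hypothetical, and the test statistic is assumed to follow a chi-square distribution with 1 or 2 degrees of freedom, a common approximation that can be conservative when parameters lie on a boundary.

```python
import math


def lrt_p_value(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) of a likelihood ratio test.

    lnl_null, lnl_alt: maximised log-likelihoods of the nested null model
    and of the alternative model (hypothetical inputs in this sketch).
    df: difference in the number of free parameters (1 or 2 supported here).
    """
    # Test statistic: twice the log-likelihood difference, floored at zero
    # to guard against small negative values from imperfect optimisation.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        # Chi-square survival function with 1 degree of freedom.
        p = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        # Chi-square survival function with 2 degrees of freedom.
        p = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only supports df = 1 or df = 2")
    return stat, p


# Hypothetical log-likelihoods for one gene, for illustration only.
stat, p = lrt_p_value(lnl_null=-1000.0, lnl_alt=-995.0, df=2)
print(f"2*deltaLnL = {stat:.2f}, p = {p:.4f}")
```

For closely related species the two log-likelihoods tend to lie close together because few substitutions have accumulated, which is one way to see why such tests lack power in that setting.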
Darwinian selection is an important source of evolutionary innovation and a major force behind the divergence of species. The selective forces acting upon a protein are highly informative about its biological function and evolutionary history. Consequently, a wide variety of methods have been developed to detect genes that have been subject to selection. However, when applied to closely related species such as primates, these tests lack power. With the advent of next generation sequencing (NGS) technologies, genomes from closely related species and from multiple individuals of the same species are increasingly available. Indeed, for several species, including human, Drosophila and Arabidopsis, 1,000 genomes are available. These large amounts of data offer a great opportunity to study speciation and the evolutionary history of populations, provided we can properly model the process of evolution within and between species simultaneously. For our FWF project, we have developed polymorphism-aware phylogenetic models (PoMos). In these models, a substitution is modeled as a mutational event followed by gradual fixation. Polymorphisms can either be observed in the present (at the tips of the phylogeny) or be ancestral (present at inner nodes). With this approach, we naturally account for incomplete lineage sorting and shared ancestral polymorphisms, which are major concerns for species tree estimation. Our method can accurately and time-efficiently estimate the parameters describing evolutionary patterns for phylogenetic trees of any shape (species trees, population trees, or any combination of those). We have implemented a software package called IQ-TREE-PoMo. We demonstrated that PoMo is suitable for inferring large-scale phylogenies from population data of primate species. The new software will enable biologists to better understand speciation events.
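The idea of a substitution as a mutation followed by gradual fixation can be illustrated with a small rate matrix. The sketch below is not the IQ-TREE-PoMo implementation. It makes several assumptions that are not stated in the text above: a virtual population of N = 10 individuals, at most two alleles per site at any time, a single symmetric mutation rate `mu`, and neutral Moran-type drift at rate i(N - i)/N between neighbouring allele counts. The function names are hypothetical.

```python
from itertools import combinations

ALLELES = "ACGT"


def pomo_states(n):
    """List fixed states (a, None, n) and polymorphic states (a, b, i),
    where i is the number of copies of allele a and n - i of allele b."""
    states = [(a, None, n) for a in ALLELES]
    for a, b in combinations(ALLELES, 2):
        states.extend((a, b, i) for i in range(1, n))
    return states


def pomo_rate_matrix(n, mu):
    """Build an illustrative rate matrix (assumed scaling, see lead-in)."""
    states = pomo_states(n)
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    q = [[0.0] * size for _ in range(size)]

    def target(a, b, i):
        # Map an allele count to its state; counts of n or 0 mean fixation.
        if i == n:
            return index[(a, None, n)]
        if i == 0:
            return index[(b, None, n)]
        return index[(a, b, i)]

    for a, b in combinations(ALLELES, 2):
        # Mutation: a fixed population gains one copy of the other allele.
        q[index[(a, None, n)]][index[(a, b, n - 1)]] = mu
        q[index[(b, None, n)]][index[(a, b, 1)]] = mu
        # Drift: the count of allele a moves up or down by one copy.
        for i in range(1, n):
            drift = i * (n - i) / n
            src = index[(a, b, i)]
            q[src][target(a, b, i + 1)] = drift
            q[src][target(a, b, i - 1)] = drift

    # Diagonal entries make every row sum to zero.
    for k in range(size):
        q[k][k] = -sum(q[k])
    return states, q


states, q = pomo_rate_matrix(n=10, mu=0.01)
print(len(states))  # 4 fixed + 6 * 9 polymorphic states
```

In such a matrix, a path from one fixed state to another necessarily passes through polymorphic states, so polymorphisms at the tips or at inner nodes of a tree are ordinary states of the same process.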
- University of St. Andrews - 100%
- Aparup Das, National Institute of Malaria Research (ICMR) - India
- Maria Anisimova, Eidgenössische Technische Hochschule Zürich - Switzerland
- Ian Holmes, University of California Berkeley - USA
- Rasmus Nielsen, University of California Berkeley - USA
- Nick Goldman, EMBL Outstation Hinxton
Research Output
- 492 Citations
- 12 Publications
- 2016: Schrempf D. Reversible polymorphism-aware phylogenetic models and their application to tree inference. Journal of Theoretical Biology, pp. 362-370. Journal Article. DOI 10.1016/j.jtbi.2016.07.042
- 2015: De Maio N. PoMo: An Allele Frequency-based Approach for Species Tree Estimation. Preprint, 016360. DOI 10.1101/016360
- 2015: De Maio N. PoMo: An Allele Frequency-Based Approach for Species Tree Estimation. Systematic Biology, pp. 1018-1031. Journal Article. DOI 10.1093/sysbio/syv048
- 2017: Bertl J. Approximate maximum likelihood estimation for population genetic inference. Statistical Applications in Genetics and Molecular Biology, pp. 291-312. Journal Article. DOI 10.1515/sagmb-2017-0016
- 2017: Bergman J. Inference in population genetics using forward and backward, discrete and continuous time processes. Journal of Theoretical Biology, pp. 166-180. Journal Article. DOI 10.1016/j.jtbi.2017.12.008
- 2019: Schrempf D. Polymorphism-Aware Species Trees with Advanced Mutation Models, Bootstrap, and Rate Heterogeneity. Molecular Biology and Evolution, pp. 1294-1301. Journal Article. DOI 10.1093/molbev/msz043
- 2019: Rogers J. The comparative genomics and complex population history of Papio baboons. Science Advances. Journal Article. DOI 10.1126/sciadv.aau6947
- 2012: De Maio N. Estimating Empirical Codon Hidden Markov Models. Molecular Biology and Evolution, pp. 725-736. Journal Article. DOI 10.1093/molbev/mss266
- 2014: Marmoset Genome Sequencing and Analysis Consortium. The common marmoset genome provides insight into primate biology and evolution. Nature Genetics, pp. 850-857. Journal Article. DOI 10.1038/ng.3042
- 2016: Schrempf D. An alternative derivation of the stationary distribution of the multivariate neutral Wright–Fisher model for low mutation rates with a view to mutation rate estimation from site frequency data. Theoretical Population Biology, pp. 88-94. Journal Article. DOI 10.1016/j.tpb.2016.12.001
- 2013: De Maio N. Linking Great Apes Genome Evolution across Time Scales Using Polymorphism-Aware Phylogenetic Models. Molecular Biology and Evolution, pp. 2249-2262. Journal Article. DOI 10.1093/molbev/mst131
- 2018: Schrempf D. Polymorphism-aware species trees with advanced mutation models, bootstrap and rate heterogeneity. Preprint, 483479. DOI 10.1101/483479