Algebraic methods for reconstruction of species-phylogenies
Disciplines
Biology (40%); Computer Sciences (20%); Mathematics (40%)
Keywords
Phylogenetic trees,
Gene tree invariants,
Species trees,
Clade invariants,
Multispecies coalescent model,
Deep coalescences
Various methods for constructing a phylogenetic "species tree" from a set of discordant gene trees have been developed in recent decades. To name just a few that are relevant for this project: rooted triple consensus (Ewing et al.), pseudo-maximum likelihood (Liu et al.) and "minimize deep coalescences" (MDC, Maddison). Notably, no true maximum likelihood method is known so far. The first goal of my project is to understand, by algebraic and geometric means, the algebraic statistical models that describe the probabilities of specific (rooted or unrooted) gene trees, or of clades, under a multispecies coalescent on a given species tree. Such models are algebraic varieties, and thus a natural problem is to understand their defining equations, the so-called invariants of the models. Partial results in this direction have been obtained by Allman et al., who systematically derived some of the linear invariants for these models. Their considerations led to the result that the underlying species tree can (in principle) be reconstructed if one knows the probabilities of unrooted gene trees or of clades on gene trees. However, many of the linear as well as higher-order invariants, even for four- and five-taxon species trees, remain unexplained. A profound understanding of these invariants, and of the geometry of the associated algebraic varieties, is not only of theoretical interest but may also help in developing algorithmic techniques for species tree reconstruction, as the results by Allman et al. indicate. Methods from computer algebra may serve as a tool for this task. A first step towards a deeper understanding of the invariants of gene tree probabilities is to understand the invariants of the probabilities of "gene triplet trees". I expect that these varieties have a much simpler description, by linear and quadratic invariants only, which should be easy to deduce from the combinatorics of the underlying species tree.
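To make the notion of a triplet invariant concrete, the following minimal sketch evaluates the classical rooted gene triplet probabilities under the multispecies coalescent for a three-taxon species tree ((a,b),c). It assumes one sampled lineage per species and an internal branch length t measured in coalescent units; the function name and the dictionary keys are illustrative choices, not notation from the project. The two discordant topologies are equally probable, which is the simplest example of a linear invariant.

```python
import math


def triplet_probabilities(t):
    """Rooted gene triplet probabilities for the species tree ((a,b),c).

    t is the length of the internal branch in coalescent units
    (an assumption of this sketch; one lineage is sampled per species).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability of each of the two discordant topologies.
    discordant = math.exp(-t) / 3.0
    return {
        "ab|c": 1.0 - 2.0 * discordant,  # matches the species tree
        "ac|b": discordant,
        "bc|a": discordant,
    }


probs = triplet_probabilities(1.0)

# Linear invariant: the two discordant topologies have equal probability,
# so this difference vanishes for every branch length t.
linear_invariant = probs["ac|b"] - probs["bc|a"]
```

The invariant does not depend on t, so it holds on the whole model for this species tree topology, while the inequality between the concordant and the discordant probabilities identifies which topology is the species tree.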
Moreover, by marginalization these invariants are connected to the invariants of gene tree and clade probabilities, and hence may lead to a better understanding of the latter. Studying the varieties of triple probabilities is also interesting from the point of view of Liu's pseudo-maximum likelihood method: I expect that an understanding of the invariants of triple probabilities can help to judge in which situations the application of this method is justified, and how precise its results can be expected to be. A second part of my project concerns the connection between the multispecies coalescent model and Maddison's "minimize deep coalescences" principle. It is relatively easy to see that tropicalization of gene tree probabilities yields an expression which encodes (a slight modification of) Maddison's deep coalescence costs. Using this, I want to carry over results for maximum likelihood estimation under the multispecies coalescent to MDC, and vice versa. One such result is the fact that optimality under the MDC criterion is in general not inherited by subtrees. This is an important theoretical result, which calls into question the reliability of the MDC method when outgroup species are used to determine the location of the root of a phylogenetic tree. I expect that a similar negative result also holds for maximum likelihood estimation.
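As an illustration of the deep coalescence cost mentioned above, the following sketch counts extra lineages in the commonly used form: for every branch of the species tree, the number of gene lineages leaving that branch towards the root, minus one. It is not the tropicalized expression or the modified cost referred to in the text. It assumes rooted trees on the same taxon set with one gene lineage per species, and the nested-tuple tree encoding and all function names are hypothetical choices made for this example.

```python
def leaves(tree):
    """Return the set of taxon names below a node (trees are nested tuples)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def lineages_in(gene_tree, cluster):
    """Count the maximal gene-tree clades whose taxa all lie in `cluster`.

    This equals the number of gene lineages that leave the species-tree
    branch above `cluster` when going backwards in time.
    """
    if leaves(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(lineages_in(child, cluster) for child in gene_tree)


def non_root_clusters(tree, is_root=True):
    """Yield the taxon set of every non-root node of a species tree."""
    if not is_root:
        yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from non_root_clusters(child, False)


def deep_coalescence_cost(gene_tree, species_tree):
    """Sum the extra lineages over all branches of the species tree."""
    return sum(
        lineages_in(gene_tree, cluster) - 1
        for cluster in non_root_clusters(species_tree)
    )


species = (("a", "b"), "c")
cost_concordant = deep_coalescence_cost((("a", "b"), "c"), species)
cost_discordant = deep_coalescence_cost((("a", "c"), "b"), species)
```

For the three-taxon example, the gene tree that matches the species tree needs no extra lineage, while the discordant gene tree forces two lineages through the branch above (a,b) and therefore has cost one.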