Evolution and Function of the Environmental Protein Sequence Universe
Disciplines
Computer Sciences (100%)
Keywords
- Bioinformatics, Clustering, Computational biology, Network analysis, Protein sequencing analysis, PVC superphylum
Protein sequences are generated in large quantities by DNA sequencing and represent one of the most important reservoirs of molecular biological data. As blueprints for the structure and function of their encoded proteins, and through their evolutionary relationships, protein sequences point to the molecular functions and biological roles of their gene products. During the last decade, the sequencing of metagenomes directly from environmental samples without cultivation has significantly expanded the known protein sequence universe. However, although hundreds of metagenomes have been deeply sequenced and now account for the majority of protein sequences stored in databases, the environmental protein universe remains largely unstructured and awaits targeted use in computational biology. The central aim of this proposal is to investigate the fundamental evolutionary structures behind the environmental protein sequences previously obtained. We will cluster the entire protein sequence universe, including metagenomes, into evolutionarily related families. Based on established concepts, such as orthology or protein domains, this project will develop novel clustering methods for large protein networks. Based on this large-scale evolutionary reconstruction, we will investigate the function of protein families in the environmental protein sequence universe. We will comprehensively determine the relative abundances of protein families in different environments. We expect to discover many associations that will not only link known protein families to specific habitat types but will also establish connections between families of unknown function and the environment. The abundance matrix of protein families in different environments will be studied further to assess how well environmental co-occurrence profiles predict functional interactions between protein families.
We expect to develop a novel method that will significantly extend current principles for the prediction of protein interactions. In a case study, we will use the structured environmental protein sequence universe to investigate the phylogenetic and ecological diversity of the monophyletic PVC superphylum (Planctomycetes, Verrucomicrobia, Chlamydiae, Lentisphaerae, etc.), a bacterial clade with exceptional physiologies and major medical, ecological and biotechnological importance. Although this proposal focuses mainly on fundamental biological questions, it also covers broader aspects, such as developing novel and universal methods and resources in computational biology and improving our knowledge of biotechnologically and medically important bacteria.
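The co-occurrence idea in the proposal above can be illustrated with a minimal sketch. It assumes a tiny, made-up abundance matrix (protein families by environments) and uses Pearson correlation with a fixed threshold as the association measure. The family names, the values, the threshold, and the choice of correlation are all illustrative assumptions, not the method developed in the project.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / sqrt(var_x * var_y)


def cooccurrence_pairs(abundance, threshold=0.9):
    """Return family pairs whose profiles correlate at or above the threshold."""
    pairs = []
    for (fam_a, prof_a), (fam_b, prof_b) in combinations(sorted(abundance.items()), 2):
        r = pearson(prof_a, prof_b)
        if r >= threshold:
            pairs.append((fam_a, fam_b, r))
    return pairs


# Hypothetical relative abundances in four environments (toy values).
abundance = {
    "FAM_A": [0.9, 0.1, 0.8, 0.0],
    "FAM_B": [0.8, 0.2, 0.7, 0.1],
    "FAM_C": [0.0, 0.9, 0.1, 0.8],
}

pairs = cooccurrence_pairs(abundance)
for fam_a, fam_b, r in pairs:
    print(f"{fam_a} and {fam_b} co-occur (r = {r:.3f})")
```

In this toy example only FAM_A and FAM_B are flagged, because their profiles rise and fall in the same environments. A real analysis would have to deal with compositional data, sampling depth, and multiple testing, none of which are handled here.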
This project investigated the architecture of the protein sequence universe. Proteins are essential biomolecules for the structure and function of all cellular organisms as well as viruses. DNA sequencing generates a massive stream of molecular biological data. Protein sequences are inferred from these data and determine the structure and function of their gene products. In combination with their evolutionary relationships, these sequences point to the molecular functions and biological roles of proteins. The entirety of proteins is referred to as the protein universe. Massive efforts in metagenomic projects, which sequence DNA directly from environmental samples without cultivation steps, have markedly expanded the known protein sequence universe. While data from metagenomic studies now dominate protein databases, these data are still unstructured and largely not used efficiently. The main goal of this project was to investigate the fundamental evolutionary structures of the environmental protein sequence universe. Building upon established concepts, such as orthology or sequence similarity, new methods for organizing the protein sequence universe were analyzed. Particular focus was on general phenomena in similarity networks of high-dimensional data, which occur not only in protein sequence networks but also in natural language processing and automatic music recommendation systems. A large variety of concepts and methods from computational biology and machine learning were refined and applied, resulting in general insights into the structure of the protein sequence universe, as well as a repertoire of algorithms, methods, and tools for the efficient use of high-dimensional data. Their use is not limited to biology, because the underlying concepts apply to all domains dealing with high-dimensional spaces. The developed methods are therefore relevant to multiple scientific domains and technical disciplines.
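The publications listed for this project name hubness as one such phenomenon in high-dimensional similarity networks. A commonly used way to quantify it is the skewness of the k-occurrence distribution, that is, how unevenly points appear in the k-nearest-neighbor lists of other points. The following minimal sketch computes it on random Gaussian data with Euclidean distance. The data, the sizes, and the value of k are illustrative assumptions, and the code is not taken from the project's software.

```python
import random
from math import dist


def k_occurrence(points, k):
    """Count how often each point occurs among the k nearest neighbors of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: dist(points[i], points[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment; larger positive values indicate stronger hubs."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def random_points(n, dim, rng):
    """Draw n points with independent standard normal coordinates (toy data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(42)
k = 5
counts_low = k_occurrence(random_points(200, 3, rng), k)
counts_high = k_occurrence(random_points(200, 100, rng), k)
skew_low = skewness(counts_low)
skew_high = skewness(counts_high)
print(f"k-occurrence skewness in 3 dimensions:   {skew_low:.2f}")
print(f"k-occurrence skewness in 100 dimensions: {skew_high:.2f}")
```

On data of this kind the skewness is typically larger in the high-dimensional case, which reflects a few points becoming neighbors of very many others. The brute-force neighbor search used here scales quadratically with the number of points and is only suitable for small examples.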
This project also delivered results that suggest further studies considering deep learning for protein sequence vector representations. Such representations in combination with approximate neighbor search algorithms could resolve the problem of computational bottlenecks due to expensive similarity search in ever-growing sequence databases.
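The combination of vector representations and approximate neighbor search mentioned above can be sketched as follows. Two stand-ins are assumed here and should not be read as the project's approach: a 2-mer count vector replaces a learned deep-learning embedding, and random-hyperplane locality-sensitive hashing serves as one simple example of an approximate neighbor search. The class name `HyperplaneIndex`, the sequence identifiers, and the sequences themselves are hypothetical.

```python
import random
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
DIM = len(AMINO_ACIDS) ** 2


def embed(seq):
    """Toy stand-in for a learned embedding: counts of adjacent residue pairs."""
    vec = [0.0] * DIM
    for a, b in zip(seq, seq[1:]):
        if a in INDEX and b in INDEX:
            vec[INDEX[a] * len(AMINO_ACIDS) + INDEX[b]] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class HyperplaneIndex:
    """Random-hyperplane LSH: vectors with the same sign pattern share a bucket."""

    def __init__(self, n_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n_planes)]
        self.buckets = {}
        self.vectors = {}

    def _key(self, vec):
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, name, seq):
        vec = embed(seq)
        self.vectors[name] = vec
        self.buckets.setdefault(self._key(vec), []).append(name)

    def query(self, seq, top=3):
        """Score only the candidates in the query's bucket, not the whole database."""
        vec = embed(seq)
        candidates = self.buckets.get(self._key(vec), [])
        scored = sorted(((cosine(vec, self.vectors[n]), n) for n in candidates), reverse=True)
        return [(name, score) for score, name in scored[:top]]


# Hypothetical toy database of short sequences.
database = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "P3": "GSHMLEDPVDAFKWLLGRA",
}

index = HyperplaneIndex()
for name, seq in database.items():
    index.add(name, seq)

hits = index.query(database["P1"])
print(hits)
```

Because only one bucket is inspected, a query can miss true neighbors that fall on the other side of a hyperplane. This trade of recall for speed is what makes the search approximate; practical systems usually use several hash tables or a dedicated library.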
- Arthur Flexer, ÖFAI - Österreichisches Forschungsinstitut für Artificial Intelligence, associated research partner
- Christian Von Mering, University of Zurich - Switzerland
Research Output
- 30061 Citations
- 13 Publications
- 1 Software
- 1 Funding

- Feldbauer R (2020). scikit-hubness: Hubness Reduction and Approximate Neighbor Search. Journal of Open Source Software, p. 1957. Journal Article. DOI: 10.21105/joss.01957
- Virtanen P (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, pp. 261-272. Journal Article. DOI: 10.1038/s41592-019-0686-2
- Feldbauer R (2019). Deep learning for extremely fast protein similarity search. Austrian High Performance Computing Meeting 2019. Conference Proceeding Abstract.
- Feldbauer R (2019). scikit-hubness: Hubness Reduction and Approximate Neighbor Search. Preprint. DOI: 10.48550/arxiv.1912.00706
- Feldbauer R (2020). DeepNOG: fast and accurate protein orthologous group assignment. Bioinformatics, pp. 5304-5312. Journal Article. DOI: 10.1093/bioinformatics/btaa1051
- Weinmaier T (2016). ConsPred: a rule-based (re-)annotation framework for prokaryotic genomes. Bioinformatics, pp. 3327-3329. Journal Article. DOI: 10.1093/bioinformatics/btw393
- Flexer A (2016). An Empirical Analysis of Hubness in Unsupervised Distance-Based Outlier Detection. pp. 716-723. Conference Proceeding Abstract. DOI: 10.1109/icdmw.2016.0106
- Feldbauer R (2016). Centering Versus Scaling for Hubness Reduction. Springer Nature, pp. 175-183. Book Chapter. DOI: 10.1007/978-3-319-44778-0_21
- Feldbauer R (2018). Fast Approximate Hubness Reduction for Large High-Dimensional Data. pp. 358-367. Conference Proceeding Abstract. DOI: 10.1109/icbk.2018.00055
- Feldbauer R (2018). Protein vector representations for fast similarity search. German Conference on Bioinformatics 2018. Conference Proceeding Abstract.
- Schnitzer D (2015). The Unbalancing Effect of Hubs on K-Medoids Clustering in High-Dimensional Spaces. pp. 1-8. Conference Proceeding Abstract. DOI: 10.1109/ijcnn.2015.7280303
- Feldbauer R (2018). A comprehensive empirical comparison of hubness reduction in high-dimensional spaces. Knowledge and Information Systems, pp. 137-166. Journal Article. DOI: 10.1007/s10115-018-1205-y
- Eichinger V (2015). EffectiveDB—updates and novel features for a better annotation of bacterial secreted proteins and Type III, IV, VI secretion systems. Nucleic Acids Research. Journal Article. DOI: 10.1093/nar/gkv1269

- NVIDIA GPU Grant Program (2018). Type: Capital/infrastructure (including equipment). Start of funding: 2018