Analysis of rare variants from sequencing data
Disciplines
Biology (100%)
Keywords
Next generation sequencing, Rare Variants, Association Testing, Burden Tests, Gene Set Enrichment Analysis
Many large-scale sequencing studies are now under way, addressing one of the major questions in human genetics: how, and to what extent, can the study of low-frequency variants advance insights into disease etiology? The development of analytical tools, however, is barely keeping up with the deluge of human sequencing data. For example, single-SNP disease associations are commonly tested with logistic regression. This approach is powerful for common variants and is therefore broadly used in GWAS, but its power to detect associations with rare variants is modest. One alternative is to assess the combined effects of specific sets of rare variants, for example all coding variants in a particular gene. These burden tests take into account the overall variant load within specified genomic regions of interest and are therefore better able to detect signals in the presence of multiple rare causal alleles. This is a very active area of research: within the last three years, more than 20 burden tests have been proposed. However, the properties of these tests are still not fully understood, and the comparisons provided in the original publications are often too simplistic or cover only a small range of genetic architectures. Furthermore, the few published method-neutral comparisons either have used simulations that do not reflect the properties of real data (e.g. an excess of singletons beyond neutral expectations) or do not cover a wide range of methods. As a result, analysts of sequence data have to make best-guess decisions when choosing a rare variant analysis method to address a given genetic hypothesis. Aim 1 of this project is to fill this gap by performing an extensive method-neutral evaluation of different burden tests based on realistic sequence data. Our results will guide investigators toward the most powerful approach for identifying rare variants associated with disease.
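To make the idea of a burden test concrete, the following is a minimal sketch, not one of the published tests evaluated in the project. It assumes a simple collapsing scheme: rare alleles within one gene are summed per individual, and the mean burden in cases is compared with that in controls by permuting case-control labels. The function names, the 1% minor allele frequency cutoff, and the choice of a permutation test are illustrative assumptions.

```python
import random


def burden_scores(genotypes, maf_threshold=0.01):
    """Sum rare minor alleles per individual across one gene.

    genotypes: list of per-individual lists holding minor-allele counts
    (0, 1 or 2), one entry per variant. The 1% cutoff is an assumption.
    """
    n_individuals = len(genotypes)
    n_variants = len(genotypes[0])
    rare_variants = []
    for j in range(n_variants):
        # Minor allele frequency of variant j in the pooled sample.
        freq = sum(g[j] for g in genotypes) / (2 * n_individuals)
        if 0 < freq <= maf_threshold:
            rare_variants.append(j)
    return [sum(g[j] for j in rare_variants) for g in genotypes]


def burden_permutation_test(genotypes, is_case, n_perm=2000,
                            maf_threshold=0.01, seed=0):
    """Return (observed difference in mean burden, permutation p-value)."""
    scores = burden_scores(genotypes, maf_threshold)

    def mean_difference(labels):
        cases = [s for s, c in zip(scores, labels) if c]
        controls = [s for s, c in zip(scores, labels) if not c]
        return sum(cases) / len(cases) - sum(controls) / len(controls)

    observed = mean_difference(is_case)
    rng = random.Random(seed)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        # Shuffling labels breaks any genotype-phenotype association.
        rng.shuffle(labels)
        if abs(mean_difference(labels)) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Because the score pools alleles across variants, several rare causal alleles in the same gene contribute to a single test statistic, which is the property the paragraph above attributes to burden tests. A sum of this kind implicitly assumes that the pooled alleles act in the same direction.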
One interesting feature of burden tests is the integration of functional information at the gene or locus level. A logical next step in mining genome-wide sequence data is to analyze them at the gene set or even the pathway level. For common variants, gene set enrichment analysis (GSEA) is broadly used to test whether pathways are enriched for association signals. Aim 2 of the proposed project is to extend GSEA to take full advantage of sequence-specific properties, such as the extensive ascertainment of rare variants, and to compare its power with that of the extended burden test approach outlined above. In Aim 3, we will further extend the method proposed in Aim 2 by taking into account the a priori known relationships between genes and variants. Completion of these three aims will result in research tools of high strategic value and impact, and will enhance the value of many ongoing and future large-scale sequencing experiments.
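As a rough illustration of how a gene set analysis operates on gene-level results, the sketch below computes an unweighted, Kolmogorov-Smirnov-like running-sum enrichment score over a ranked gene list. It is not the extended method proposed in Aim 2. The function name and the assumption that genes are ranked from strongest to weakest association (for example by a gene-level burden statistic) are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers ordered from strongest to weakest
    association. gene_set: identifiers belonging to the pathway.
    Returns the running-sum value with the largest absolute deviation
    from zero (positive if members cluster near the top of the ranking).
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in members:
            running += 1.0 / n_hit   # step up at each set member
        else:
            running -= 1.0 / n_miss  # step down at each non-member
        if abs(running) > abs(best):
            best = running
    return best
```

In practice the score would be compared against a null distribution, for example one obtained by permuting phenotype labels and recomputing the gene ranking. That step is omitted here to keep the sketch short.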
Many large-scale sequencing studies are now under way, addressing one of the major questions in human genetics: how, and to what extent, can the study of low-frequency variants advance insights into disease etiology? The development of analytical tools, however, is barely keeping up with the deluge of human sequencing data. For example, single-SNP disease associations are commonly tested with logistic regression. This approach is powerful for common variants and is therefore broadly used in GWAS, but its power to detect associations with rare variants is modest. One alternative is to assess the combined effects of specific sets of rare variants, for example all coding variants in a particular gene. These burden tests take into account the overall variant load within specified genomic regions of interest and are therefore better able to detect signals in the presence of multiple rare causal alleles. This is a very active area of research: within the last three years, more than 20 burden tests have been proposed. However, the properties of these tests are still not fully understood, and the comparisons provided in the original publications are often too simplistic or cover only a small range of genetic architectures. Furthermore, the few published method-neutral comparisons either have used simulations that do not reflect the properties of real data (e.g. an excess of singletons beyond neutral expectations) or do not cover a wide range of methods. As a result, analysts of sequence data have to make best-guess decisions when choosing a rare variant analysis method to address a given genetic hypothesis. In this project we have filled this gap by performing an extensive method-neutral evaluation of different burden tests based on realistic sequence data.
Surprisingly, in sample sizes comparable to those of contemporary sequencing studies (3K case-control individuals), we found that although gene-based association methods augment the power of single-variant tests by preferentially detecting loci at which rare variants drive the causal architecture, their absolute power is very low. All gene-based methods evaluated in this study have limited power, even to detect loci explaining as much as 1% of the variance in phenotypic liability. Based on estimates of the variance explained by known rare and common variant signals (the strongest single common variant association for T2D, mapping near TCF7L2, explains ~1% of phenotypic variance), it seems probable that for any given complex trait, at best a handful of loci will have effects on this scale. The full potential of exome sequencing to provide biological insights into disease will therefore depend largely on the detection of loci with smaller aggregate effects, and will require far larger sample sizes than these. This result also disproves the hypothesis that complex diseases, such as T2D, are heavily driven by rare variants of large effect and that small samples of a few hundred individuals are sufficient to identify these variants.
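To give a sense of scale for "1% of the variance in phenotypic liability", the sketch below uses the standard additive-model relation for a single biallelic variant under Hardy-Weinberg proportions: variance explained = 2p(1-p)β², where p is the allele frequency and β the per-allele effect in standard deviation units of liability. The function names and the example frequencies are illustrative assumptions and are not taken from the study.

```python
import math


def liability_variance_explained(maf, beta):
    """Fraction of liability variance explained by one additive variant.

    Assumes Hardy-Weinberg proportions and liability scaled to unit
    variance, so the genotype variance is 2 * maf * (1 - maf).
    """
    return 2 * maf * (1 - maf) * beta ** 2


def effect_for_variance(maf, variance):
    """Per-allele effect (in liability SD units) needed to explain `variance`."""
    return math.sqrt(variance / (2 * maf * (1 - maf)))


if __name__ == "__main__":
    # Illustrative allele frequencies, chosen only to contrast common and rare.
    for maf in (0.3, 0.01, 0.001):
        beta = effect_for_variance(maf, 0.01)
        print(f"MAF {maf}: per-allele effect of {beta:.2f} SD explains 1% of variance")
```

The required per-allele effect grows quickly as the allele frequency falls, so a locus can reach this level of variance explained through rare variants only if individual effects are large or if many rare alleles act together.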
- University of Michigan - 100%
- Medizinische Universität Innsbruck - 100%
Research Output
- 5504 Citations
- 10 Publications
- Lee S (2016). Improving power for rare variant tests by integrating external controls. Preprint, 081711. DOI: 10.1101/081711
- Loh P (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Journal Article, Nature Genetics, pages 1443-1448. DOI: 10.1038/ng.3679
- Lee S (2015). An efficient resampling method for calibrating single and gene-based rare variant association analysis in case–control studies. Journal Article, Biostatistics, pages 1-15. DOI: 10.1093/biostatistics/kxv033
- Lee S (2017). Improving power for rare-variant tests by integrating external controls. Journal Article, Genetic Epidemiology, pages 610-619. DOI: 10.1002/gepi.22057
- Quick C (2019). Sequencing and Imputation in GWAS: Cost-Effective Strategies to Increase Power and Genomic Coverage Across Diverse Populations. Preprint, 548321. DOI: 10.1101/548321
- Das S (2016). Next-generation genotype imputation service and methods. Journal Article, Nature Genetics, pages 1284-1287. DOI: 10.1038/ng.3656
- Weissensteiner H (2016). mtDNA-Server: next-generation sequencing data analysis of human mitochondrial DNA in the cloud. Journal Article, Nucleic Acids Research. DOI: 10.1093/nar/gkw247
- Moutsianas L (2015). The Power of Gene-Based Rare Variant Methods to Detect Disease-Associated Variation and Test Hypotheses About Complex Disease. Journal Article, PLOS Genetics. DOI: 10.1371/journal.pgen.1005165
- Mahajan A (2015). Identification and Functional Characterization of G6PC2 Coding Variants Influencing Glycemic Traits Define an Effector Transcript at the G6PC2-ABCB11 Locus. Journal Article, PLOS Genetics. DOI: 10.1371/journal.pgen.1004876
- Quick C (2020). Sequencing and imputation in GWAS: Cost-effective strategies to increase power and genomic coverage across diverse populations. Journal Article, Genetic Epidemiology, pages 537-549. DOI: 10.1002/gepi.22326