Key Points
- Interactions between genetic loci might reduce the power to detect genetic effects in genetic association studies, if these interactions are not allowed for.
- Statistical interaction corresponds to a departure from the additive effects of two or more variables in a linear model describing the relationship between an outcome and predictor variables.
- A variety of methods can be used to test for statistical interaction between predictor variables that encode the genotype and an outcome variable corresponding to the disease phenotype.
- Logistic regression is one method that can be used either to test for interaction, or to test for association while allowing for interaction.
- Given genome-wide data, an exhaustive search is feasible for investigating two-way interactions (that is, all pairwise combinations of loci) but not for investigation of higher-order interactions.
- Filtering approaches allow one to reduce the number of loci considered and thus the number of interaction tests performed.
- Data-mining or machine-learning methods, such as random forests and Multifactor Dimensionality Reduction (MDR), can allow one to search through the space of possible interactions.
- Bayesian model selection approaches offer an alternative approach for searching through the space of possible interactions.
- The biological interpretation of statistical interactions is complex. The degree to which statistical interaction implies interaction or synergism in a causal sense might be extremely limited.
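The point about exhaustive searches can be made concrete by counting the tests involved. The sketch below is illustrative only: the panel size of 500,000 SNPs is a hypothetical figure chosen here, not one taken from the text, and the helper name `n_interaction_tests` is invented for this example.

```python
import math


def n_interaction_tests(n_loci: int, order: int) -> int:
    """Number of distinct combinations of `order` loci drawn from `n_loci`."""
    return math.comb(n_loci, order)


if __name__ == "__main__":
    n_snps = 500_000  # hypothetical panel size, an assumption for illustration
    for order in (2, 3, 4):
        count = n_interaction_tests(n_snps, order)
        print(f"order {order}: {count:.3e} tests")
```

The count grows by roughly a factor of the panel size with each additional locus in the combination, which is why filtering or search strategies become attractive beyond pairwise analysis.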
Abstract
Following the identification of several disease-associated polymorphisms by genome-wide association (GWA) analysis, interest is now focusing on the detection of effects that, owing to their interaction with other genetic or environmental factors, might not be identified by using standard single-locus tests. In addition to increasing the power to detect associations, it is hoped that detecting interactions between loci will allow us to elucidate the biological and biochemical pathways that underpin disease. Here I provide a critical survey of the methods and related software packages currently used to detect the interactions between genetic loci that contribute to human genetic disease. I also discuss the difficulties in determining the biological relevance of statistical interactions.
Acknowledgements
Support for this work was provided by the Wellcome Trust (Grant reference 074524). I thank J. Barrett for assistance with interpretation of the WTCCC Crohn's results, and the WTCCC for making their data freely available. I also thank J. Moore for useful discussions of data-mining methods in general and MDR in particular, and K. Keen for pointing out the origins of the term epistasis.
Supplementary information
Supplementary Box S1
Different models of interaction (PDF 253 kb)
Supplementary Box S2
Effects – interacting, independent or otherwise (PDF 294 kb)
Supplementary Table 1
Top pairwise interactions as detected from a --fast-epistasis analysis of the WTCCC Crohn's disease and control data using PLINK (PDF 164 kb)
Related links
DATABASES
OMIM
FURTHER INFORMATION
Nature Reviews Genetics Series on Genome-wide association studies
Glossary
- Data mining
  The process of extracting hidden patterns and potentially useful information from large amounts of data.
- Machine learning
  The ability of a program to learn from experience, that is, to modify its execution on the basis of newly acquired information. A major focus of machine-learning research is to automatically produce models (rules and patterns) from data.
- Bayesian model selection
  A statistical approach for selecting models by incorporating both prior distributions for parameters of the models and the observed experimental data.
- Maximum likelihood
  A statistical approach that is used to make inferences about the combination of parameter values that gives the greatest probability of obtaining the observed data.
- Saturated
  A term for a statistical model that is as full as possible (saturated) with parameters. Such a model is sometimes useful as it serves as a benchmark to quantify how well a simpler model (one with fewer parameters) fits the data.
- Penetrance
  The probability of displaying a particular phenotype (for example, succumbing to a disease) given that one has a specific genotype.
- Marginal effects
  The average effects (for example, penetrances) of a single variable, averaged over the possible values taken by other variables. These could be calculated for one locus of a two-locus system as the average of the two-locus penetrances, averaged over the three possible genotypes at the other locus.
- Logistic regression model
  A statistical model that is used when the outcome is binary. It relates the log odds of the probability of an event to a linear combination of the predictor variables.
- Multinomial regression
  A statistical approach, similar to logistic regression, which is used when the outcome takes one of several possible categorical values.
- Confounding
  A phenomenon whereby the measure of association between two variables is distorted because other variables, associated with both variables of interest, are not controlled for in the calculation.
- Empirical Bayes procedure
  A hierarchical model in which the hyperparameter is not a random variable but is estimated by another (often classical) method.
- Information theory
  A branch of applied mathematics involving the quantification of information.
- Entropy
  A key measure used in information theory that quantifies the uncertainty associated with a random variable. For example, a variable indicating the outcome from a toss of a coin will have less entropy than a variable indicating the outcome from a roll of a die (two versus six equally likely outcomes).
- Permutation
  This method is often used in hypothesis testing. An empirical distribution of a test statistic is obtained by permuting the original sample many times and recalculating the value of the test statistic in each permuted data set. Each permuted sample is considered to be a sample of the population under the null hypothesis.
- Multiple testing
  An analysis in which multiple independent hypotheses are tested. If a large number of tests are performed, the significance level (p value) of any particular test must be interpreted in light of this fact, as the overall combined probability of making a type I error will increase.
- Bonferroni correction
  The simplest correction of individual p values for multiple hypothesis testing can be calculated using $p_{\text{corrected}} = 1 - (1 - p_{\text{uncorrected}})^{n}$, in which $n$ is the number of hypotheses tested. This formula assumes that the hypotheses are all independent, and simplifies to $p_{\text{corrected}} = n\,p_{\text{uncorrected}}$ when $n\,p_{\text{uncorrected}} \ll 1$.
- Q–Q plot
  A quantile–quantile plot is a diagnostic plot that can be used to compare the distribution of observed test statistics with the distribution expected under the null hypothesis. Those tests that lie significantly above the line of equality between observed and expected quantiles are considered significant in the context of the number of tests performed.
- High-dimensional data
  Data that contain information on a large number of variables, albeit possibly measured in a small number of subjects or replicates.
- Cross-validation
  This approach involves partitioning a data set into smaller subsamples, performing an analysis in one subsample and using the other subsample to measure or validate how well the analysis has performed. To reduce variability, multiple rounds of cross-validation are often performed using different partitions of the data and the validation results are averaged over the rounds.
- Overfitting
  The phenomenon in which a complex model might provide a good fit to the current data set but is overfitted to the random quirks present in that particular data set and therefore cannot be generalized to future data sets in the way that a simpler model might be.
- Bootstrap samples
  These are data sets obtained by taking a random sample of the original data, usually with replacement. One then applies the same analysis as was applied to the real data. This is repeated many times, allowing one to assess the variability in results incurred owing to random sampling.
- Frequentist
  A statistical approach for testing hypotheses by assessing the strength of evidence for the hypothesis provided by the data.
- Burn-in period
  In Markov chain Monte Carlo analysis, a period at the start of the computation in which the values taken by the parameters are ignored when constructing the posterior distribution.
- Compositional epistasis
  The blocking of one allelic effect by an allele at another locus.
- Statistical epistasis
  The average effect of substitution of alleles at combinations of loci, with respect to the average genetic background of the population.
- Functional epistasis
  The molecular interactions that proteins and other genetic elements have with one another.
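Two of the glossary entries above lend themselves to a short numerical illustration: the multiple-testing correction formula and the coin-versus-die entropy comparison. The following Python sketch is an assumption-laden illustration rather than code from the article; the function names are invented here, and the first function uses a numerically stable rearrangement of 1 − (1 − p)^n that is not part of the original definition.

```python
import math
from typing import Iterable


def corrected_p_independent(p_uncorrected: float, n_tests: int) -> float:
    """Corrected p value 1 - (1 - p)**n, assuming n independent tests.

    Written with log1p/expm1 so that very small p values do not lose precision.
    """
    if p_uncorrected >= 1.0:
        return 1.0
    return -math.expm1(n_tests * math.log1p(-p_uncorrected))


def corrected_p_linear(p_uncorrected: float, n_tests: int) -> float:
    """Simplified form n * p, capped at 1; close to the above when n * p << 1."""
    return min(1.0, n_tests * p_uncorrected)


def entropy_bits(probabilities: Iterable[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    p, n = 1e-8, 1000  # illustrative values only
    print(corrected_p_independent(p, n), corrected_p_linear(p, n))

    coin = [0.5, 0.5]
    die = [1 / 6] * 6
    print(entropy_bits(coin), entropy_bits(die))
```

With equally likely outcomes the entropy is the base-2 logarithm of the number of outcomes, so the fair coin comes out lower than the fair die, matching the glossary example.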
About this article
Cite this article
Cordell, H. Detecting gene–gene interactions that underlie human diseases. Nat Rev Genet 10, 392–404 (2009). https://doi.org/10.1038/nrg2579
DOI: https://doi.org/10.1038/nrg2579