Test for Interaction between Two Unlinked Loci


Am J Hum Genet. 2006 November; 79(5): 831–845.

Published online 2006 September 21.

PMCID: PMC1698572


Jinying Zhao,^* Li Jin, and Momiao Xiong

From the Human Genetics Center, University of Texas Health Science Center at Houston, Houston (J.Z.; M.X.); and School of Life Science, Fudan University (L.J.; M.X.), and Chinese Academy of Sciences and German Max Planck Society Partner Institute of Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (L.J.), Shanghai

Address for correspondence and reprints: Dr. Momiao Xiong, Human Genetics Center, University of Texas Health Science Center at Houston, P.O. Box 20334, Houston, TX 77225. E-mail: Momiao.Xiong/at/uth.tmc.edu

^*Present affiliation: Division of Cardiology, Emory University School of Medicine, Atlanta.

Received April 7, 2006; Accepted August 14, 2006.


Abstract

Despite the growing consensus on the importance of testing gene-gene interactions in genetic studies of complex diseases, the effect of gene-gene interactions has often been defined as a deviance from genetic additive effects, which is essentially treated as a residual term in genetic analysis and leads to low power in detecting the presence of interacting effects. To what extent the definition of gene-gene interaction at population level reflects the genes' biochemical or physiological interaction remains a mystery. In this article, we introduce a novel definition and a new measure of gene-gene interaction between two unlinked loci (or genes). We developed a general theory for studying linkage disequilibrium (LD) patterns in a disease population under two-locus disease models. The properties of the LD measure in a disease population as a function of the measure of gene-gene interaction between two unlinked loci were also investigated. We examined how interaction between two loci creates LD in a disease population and showed that the mathematical formulation of the new definition for gene-gene interaction between two loci was similar to that of the LD between two loci. This finding motivated us to develop an LD-based statistic to detect gene-gene interaction between two unlinked loci. The null distribution and type I error rates of the LD-based statistic for testing gene-gene interaction were validated using extensive simulation studies. We found that the new test statistic was more powerful than the traditional logistic regression under three two-locus disease models and demonstrated that the power of the test statistic depends on the measure of gene-gene interaction. We also investigated the impact of using tagging SNPs for testing interaction on the power to detect interaction between two unlinked loci. Finally, to evaluate the performance of our new method, we applied the LD-based statistic to two published data sets. Our results showed that the P values of the LD-based statistic were smaller than those obtained by other approaches, including logistic regression models.

Complex diseases are typically caused by multiple factors, including multiple genes, primarily through nonlinear gene-gene interactions and gene-environment interactions. Gene-gene interaction is an important but complex concept.¹ Despite growing recognition of the importance of gene interactions in genetic studies of complex diseases, classical genetic analysis either ignores gene interactions or defines the effect of gene interactions as a deviance from genetic additive effects, which is essentially treated as a residual term in genetic analysis.² Fisher³ mathematically defined the effect of gene interactions as a statistical deviance from the additive effects of single genes, which is often referred to as “statistical interaction” between genes. This was further developed by Cockerham⁴ and Kempthorne⁵ into the modern representation that treats statistical gene interactions as interaction terms in a regression model or a generalized linear model on allelic effects.²^,⁶^–¹¹ Modeling a trait as an additive combination of its single-locus main effects and interaction terms is likely to limit the power to detect interaction.

In the past several years, combinatorial partitioning¹² and various data-mining methods¹^,¹³^–²¹ have been explored to detect gene-gene interaction. The limitations of these methods include (1) the lack of clear biological interpretation of gene-gene interaction, (2) the requirement of intensive computation, and (3) the fact that the power to detect gene-gene interaction may depend on the data structure.

To overcome these limitations, we propose to define interaction between two unlinked loci (or genes) for a qualitative trait as the deviance of the penetrance for a haplotype at two loci from the product of the marginal penetrance of the individual alleles that span the haplotype. This definition of gene-gene interaction between two unlinked loci measures the dependence of the penetrance at one marker locus on the genotypes at another locus, which is not derived from the additive model. Interaction between two unlinked loci will result in deviation of the penetrance of the two-locus haplotype from independence of the marginal penetrance of the alleles at an individual locus, which in turn will create linkage disequilibrium (LD) even if two loci are unlinked. The level of LD created depends on the magnitude of interaction between two unlinked loci. Therefore, it is possible to develop statistics for detection of interaction between two unlinked loci by use of deviations from LD. Such statistics for interaction detection between two unlinked loci have advantages, as follows. First, since interaction between two unlinked loci can be characterized by LD between two interacting loci, the LD-based statistics for detection of interaction between two unlinked loci will have a clear biological interpretation. Second, they will not treat interaction as a residual term in the model and can implicitly consider nonlinear interaction between two unlinked loci. Hence, LD-based statistics for detection of interaction between two unlinked loci will have higher power than that of the traditional Fisher's method. Third, computation of LD-based statistics is much faster than logistic regression models; thus, they are particularly suitable for genomewide association studies.

To date, formal statistics for testing gene interactions by use of LD among loci are not yet developed, although several empirical studies to assess the role of gene interaction by use of LD have been conducted.²²^–²⁵ These studies assessed deviations from equilibrium in the affected population to indicate interaction between two unlinked loci. These empirical studies for testing interaction between two unlinked loci have limitations. Most of the LD-based empirical studies are descriptive. They separately tested deviation from equilibrium in cases and controls but did not provide a unified statistic to test gene interaction by assessing difference in LD between cases and controls. Furthermore, they did not examine the null distributions, type I error rates, and power of the test statistics. As a consequence, in the presence of complex LD patterns in populations, these LD-based empirical studies for identifying gene interactions may have high false-positive rates.

The main purpose of this article is to develop statistics with high power for detection of interaction between two unlinked loci. To accomplish this, we first develop general theory to study LD patterns under two-locus disease models. We then develop a novel definition of gene interaction and a measure of interaction between two unlinked disease loci under the framework of LD analysis. The pattern of LD between two unlinked loci created by gene-gene interaction provides a foundation for developing statistics for detection of interaction. This motivates us to develop the LD-based statistics for testing interactions between two unlinked loci. We also investigate type I error rates of the LD-based statistics. Furthermore, we explore the possibility of using two unlinked tagging SNPs (tSNPs) for detecting interaction between two disease loci that are in LD with the chosen tSNPs. To investigate the impact of using tSNPs on interaction detection, we evaluate the power of directly using interacting disease loci and of using tSNPs that are in high LD with the interacting disease loci to detect interaction. To evaluate the performance of the new statistic, we also apply it to two real examples. We conclude with a discussion of the advantages and potential limitations of the proposed statistic.

Methods

LD Generated by Gene-Gene Interactions

To investigate the LD pattern generated by gene-gene interaction, we assume that two disease-susceptibility loci are in Hardy-Weinberg equilibrium (HWE) and are unlinked. Let D₁ and d₁ be the two alleles at the first disease locus, with frequencies P_D₁ and P_d₁, respectively. Let D₂ and d₂ be the two alleles at the second disease locus, with frequencies P_D₂ and P_d₂, respectively. Alleles D₁ and d₁ can be indexed by 1 and 2, respectively. At the first disease locus, let D₁D₁ be genotype 11, D₁d₁ be genotype 12, and d₁d₁ be genotype 22. The genotypes at the second disease locus are similarly defined. Two-locus genotypes are simply denoted by ijkl for individuals carrying the haplotypes ik and jl arranged from left to right. Let f_ijkl be the penetrance of the individuals with haplotypes ik and jl arranged from left to right. Let P₁₁, P₁₂, P₂₁, and P₂₂ be the frequencies of haplotypes H_D₁D₂, H_D₁d₂, H_d₁D₂, and H_d₁d₂ in the general population, respectively. Let P^A₁₁, P^A₁₂, P^A₂₁, and P^A₂₂ be their corresponding haplotype frequencies in the disease population. Let P^A_D₁, P^A_d₁, P^A_D₂, and P^A_d₂ be the frequencies of the alleles D₁, d₁, D₂, and d₂ in the disease population, respectively.

For ease of discussion, we introduce a concept of haplotype penetrance. Consider a haplotype with allele i at the first disease locus and allele k at the second disease locus. Then, the penetrance of haplotype H_ik is defined as

$$h_{ik} = \sum_{j=1}^{2}\sum_{l=1}^{2} P_{jl}\, f_{ijkl}.$$

Let δ=P₁₁-P_D₁P_D₂ be the LD measure in the general population. In appendix A, we show that haplotype frequencies in the disease population can be expressed as

$$P^A_{ik} = \frac{P_{ik}\,h_{ik}}{P_A}, \qquad i,k = 1,2, \tag{1}$$

where P_A denotes disease prevalence and is given by

$$P_A = P_{11}h_{11} + P_{12}h_{12} + P_{21}h_{21} + P_{22}h_{22}.$$

Now, we calculate the LD measure in the disease population under a general two-locus disease model. The measure of LD in the disease population is defined as δ^A=P^A₁₁P^A₂₂-P^A₁₂P^A₂₁. We can show (appendix A) that it can be given by

$$\delta^A = \frac{h_{11}h_{22}\,\delta + P_{12}P_{21}\,I}{P_A^2}, \tag{2}$$

where I=h₁₁h₂₂-h₁₂h₂₁, which is defined as a measure of interaction between two unlinked loci and quantifies the magnitude of interaction. Absence of interaction between two unlinked loci is then defined as

$$I = h_{11}h_{22} - h_{12}h_{21} = 0. \tag{3}$$

Under this definition, in the absence of interaction, two unlinked loci in the disease population will be in linkage equilibrium.

From equation (2), we can see that, if h₁₁h₂₂≠h₁₂h₂₁, even if two loci are in linkage equilibrium in the general population, two loci will be in LD in the disease population. LD in the disease population is created by the interaction between two unlinked loci. This provides a basis for testing interaction between two unlinked loci, as shown in the “Test Statistic” section.
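As a numerical illustration of how interaction creates LD among cases, the following minimal sketch computes the haplotype penetrances, the disease-population haplotype frequencies, δ^A, and I for two unlinked loci that are in linkage equilibrium in the general population. It assumes that the haplotype penetrance is the average h_ik = Σ_j Σ_l P_jl f_ijkl and that P^A_ik = P_ik h_ik / P_A. The function names and the two penetrance functions (`union_model` and `multiplicative_model`) are hypothetical choices made for illustration and are not taken from table 1.

```python
from itertools import product

ALLELES = (0, 1)  # 0 = risk allele (D), 1 = alternative allele (d)

def equilibrium_haplotypes(p1, p2):
    """Haplotype frequencies P[i][k] for two loci in linkage equilibrium."""
    f1, f2 = (p1, 1 - p1), (p2, 1 - p2)
    return [[f1[i] * f2[k] for k in ALLELES] for i in ALLELES]

def haplotype_penetrance(P, f):
    """h[i][k]: disease probability for a carrier of haplotype ik, averaged
    over the second haplotype jl, which is drawn with frequency P[j][l]."""
    return [[sum(P[j][l] * f(i, j, k, l) for j, l in product(ALLELES, ALLELES))
             for k in ALLELES] for i in ALLELES]

def disease_population(P, f):
    """Return prevalence, case haplotype frequencies, delta_A and I."""
    h = haplotype_penetrance(P, f)
    prevalence = sum(P[i][k] * h[i][k] for i, k in product(ALLELES, ALLELES))
    PA = [[P[i][k] * h[i][k] / prevalence for k in ALLELES] for i in ALLELES]
    delta_A = PA[0][0] * PA[1][1] - PA[0][1] * PA[1][0]
    interaction = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return prevalence, PA, delta_A, interaction

def union_model(i, j, k, l, f=1.0):
    """Hypothetical model: penetrance f if the genotype carries D1 or D2."""
    return f if 0 in (i, j, k, l) else 0.0

def multiplicative_model(i, j, k, l, base=0.01, r1=2.0, r2=3.0):
    """Hypothetical model: risk multiplies per risk allele, locus by locus."""
    n1 = (i == 0) + (j == 0)  # risk alleles at locus 1
    n2 = (k == 0) + (l == 0)  # risk alleles at locus 2
    return base * r1 ** n1 * r2 ** n2

P = equilibrium_haplotypes(0.3, 0.1)
for name, model in [("union", union_model), ("multiplicative", multiplicative_model)]:
    prevalence, PA, delta_A, interaction = disease_population(P, model)
    print(f"{name}: I = {interaction:.4f}, delta_A = {delta_A:.4f}")
```

Under these assumptions, the multiplicative model gives I = 0 and no LD among cases, whereas the union model gives a nonzero I and therefore nonzero δ^A even though δ = 0 in the general population.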

Define h_D₁=P(Affected|D₁) and h_D₂=P(Affected|D₂). In appendix A, we show that equation (3) implies that

$$\frac{h_{11}}{P_A} = \frac{h_{D_1}}{P_A}\cdot\frac{h_{D_2}}{P_A}.$$

Similar to linkage equilibrium, where the frequency of a haplotype is equal to the product of the frequencies of the component alleles of the haplotype, absence of interaction between two unlinked loci implies that the proportion of individuals carrying a haplotype in the disease population is equal to the product of the proportions of individuals carrying the component alleles of the haplotype in the disease population, if we assume that the disease is caused by only two investigated disease loci. In other words, interaction between two disease-susceptibility loci occurs when contribution of one locus to the disease depends on another locus.

Suppose that the first locus postulated above is a disease-susceptibility locus and that the second is a marker locus that does not predispose carriers to a disease phenotype. Let f_ij be the penetrance of the genotype ij at the disease-susceptibility locus. Then, we have h₁₁=P_D₁f₁₁+P_d₁f₁₂, h₂₂=P_D₁f₂₁+P_d₁f₂₂, h₁₂=P_D₁f₁₁+P_d₁f₁₂, and h₂₁=P_D₁f₂₁+P_d₁f₂₂, which implies that

$$\delta^A = \frac{h_{11}h_{22}}{P_A^2}\,\delta.$$

That is, the measure of LD between a disease locus and a marker locus in the disease population (δ^A) can be expressed in terms of the measure of LD in the general population and a multiplicative factor. If the disease locus and the marker locus are unlinked, then the disease and marker loci will be in linkage equilibrium. This demonstrates that, in the absence of interaction between the unlinked marker and the disease loci, LD in the disease population cannot be created.
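The reasoning behind this statement can be spelled out in two lines. The following is a sketch, assuming that the haplotype penetrance is the average h_ik = Σ_j Σ_l P_jl f_ijkl, that case haplotype frequencies are P^A_ik = P_ik h_ik / P_A, and that the marker does not affect penetrance (f_ijkl = f_ij); P_j denotes the frequency of allele j at the disease locus.

```latex
\begin{align*}
h_{ik} &= \sum_{j}\sum_{l} P_{jl}\, f_{ij} = \sum_{j} P_{j}\, f_{ij}
  \quad\Longrightarrow\quad h_{11} = h_{12}, \qquad h_{21} = h_{22},\\
I &= h_{11}h_{22} - h_{12}h_{21} = h_{11}h_{22} - h_{11}h_{22} = 0,\\
\delta^A &= \frac{P_{11}P_{22}\,h_{11}h_{22} - P_{12}P_{21}\,h_{12}h_{21}}{P_A^2}
          = \frac{h_{11}h_{22}}{P_A^2}\,\bigl(P_{11}P_{22} - P_{12}P_{21}\bigr)
          = \frac{h_{11}h_{22}}{P_A^2}\,\delta .
\end{align*}
```

Because the haplotype penetrance does not depend on the marker allele, the interaction measure vanishes identically, and δ^A is zero whenever δ is zero.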

To further understand the measure of interaction between two unlinked loci, we examined the interactions between two unlinked loci under six two-locus disease models. Results are listed in table 1, in which the values represent the penetrances of the given genotypes.²⁶^–²⁸ The measure of interaction between two unlinked loci depends not only on penetrance but also on the frequencies of the disease alleles.

Table 1.

Interaction between Two Unlinked Disease Loci under Six Two-Locus Disease Models

Indirect Interaction between Two Unlinked Marker Loci

In the previous section, we studied interaction between two unlinked disease loci. Now, we consider two marker loci, each of which is in LD with one of the two interacting loci. Although there is no physiological interaction between the two marker loci, if each marker locus is in LD with one of the two unlinked interacting loci, we still can observe LD between two unlinked marker loci in the disease population. Assume that marker M₁ is in LD with disease locus D₁ and that marker M₂ is in LD with disease locus D₂. Furthermore, we assume that two disease loci, D₁ and D₂, are unlinked. Let δ^A_M be the LD measure between two marker loci in the disease population. Let δ_i be the LD measure between marker M_i and disease locus D_i (i=1,2) in the general population. Then, we can show (appendix B) that

$$\delta^A_M = \frac{\delta_1\,\delta_2}{P_{D_1}P_{d_1}P_{D_2}P_{d_2}}\;\delta^A, \tag{4}$$

where δ^A is the measure of LD between two unlinked disease loci in the disease population. It is clear that, when the marker loci are the disease loci themselves, δ^A_M is reduced to δ^A. Equation (4) can also be written in terms of the measure of interaction between two unlinked loci:

$$\delta^A_M = \frac{\delta_1\,\delta_2}{P_A^2}\; I.$$

Since |δ_i| ≤ P_{D_i}P_{d_i}, the absolute value of the LD measure between two unlinked marker loci in the disease population, |δ^A_M|, will be less than or equal to the absolute value of the LD measure between two unlinked disease loci in the disease population, |δ^A|.

Equation (4) shows that the LD between unlinked marker loci in the disease population is proportional to the product of LD between each marker locus and its linked disease locus, δ₁δ₂. Since the criteria for tSNP selection are based on only one pairwise LD between the marker and disease loci, the LD between tSNPs and interacting loci may not be large enough to ensure that indirect interaction between two unlinked marker loci will be detected. Thus, if the interacting disease loci are not selected as tSNPs, many loci with interactions will be missed. This will have a profound implication on tSNP selection.
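To make the attenuation concrete, here is a small sketch. It assumes the relation δ^A_M = δ^A · δ₁δ₂ / (P_D₁P_d₁P_D₂P_d₂), which is consistent with the properties stated above (proportionality to δ₁δ₂ and reduction to δ^A when the markers are the disease loci themselves). The function name `marker_ld_in_cases` and all numerical values are hypothetical and chosen only for illustration.

```python
def marker_ld_in_cases(delta_a, delta_1, p_d1, delta_2, p_d2):
    """LD between two unlinked markers among cases.

    delta_a : LD between the two disease loci among cases
    delta_i : LD between marker i and disease locus i in the general population
    p_di    : frequency of the risk allele at disease locus i
    """
    scale_1 = delta_1 / (p_d1 * (1.0 - p_d1))  # equals 1 when marker 1 is the disease locus
    scale_2 = delta_2 / (p_d2 * (1.0 - p_d2))  # equals 1 when marker 2 is the disease locus
    return delta_a * scale_1 * scale_2

# Illustrative values only: LD among cases at the disease loci and risk-allele frequencies.
delta_a, p_d1, p_d2 = -0.03, 0.3, 0.1
for fraction in (1.0, 0.8, 0.5):
    # Each marker retains `fraction` of the maximal LD with its disease locus.
    d1 = fraction * p_d1 * (1.0 - p_d1)
    d2 = fraction * p_d2 * (1.0 - p_d2)
    print(fraction, marker_ld_in_cases(delta_a, d1, p_d1, d2, p_d2))
```

Because the two scaling factors multiply, a marker pair that each retains half of the maximal LD with its disease locus carries only one quarter of the case LD signal.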

Test Statistic

In the previous section, we showed that interaction between unlinked loci will create LD. Intuitively, we can test interaction by comparing the LD levels between two unlinked loci in cases and in controls. Precisely, if we denote the estimators of the LD measures in cases and controls by δ̂^A and δ̂^N, respectively, then the test statistic can be defined as

$$T_I = \frac{\left(\hat{\delta}^A - \hat{\delta}^N\right)^2}{\hat{V}_A + \hat{V}_N}, \tag{5}$$

where

$$\hat{\delta}^A = \hat{P}^A_{11} - \hat{P}^A_{D_1}\hat{P}^A_{D_2}, \qquad \hat{\delta}^N = \hat{P}^N_{11} - \hat{P}^N_{D_1}\hat{P}^N_{D_2},$$

and n_A and n_G denote the numbers of sampled individuals in cases and controls, respectively. P^A₁₁, P^A_D₁, P^A_D₂, P^N₁₁, P^N_D₁, and P^N_D₂ are defined as before, and P̂^A₁₁, P̂^A_D₁, P̂^A_D₂, P̂^N₁₁, P̂^N_D₁, and P̂^N_D₂ are their estimators. The variance of the LD measure is the large-sample variance,²⁹ and V̂_A and V̂_N are the estimators of the variances V_A and V_N of δ̂^A and δ̂^N, respectively. This statistic will be referred to as the “LD-based statistic” throughout the article. We can show that test statistic T_I is asymptotically distributed as a central χ²₍₁₎ distribution under the null hypothesis of no interaction between two unlinked loci (appendix C).

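In theory, when there is no interaction between two unlinked loci, the LD between them should be zero. Thus, we can use a case-only design to study interaction between two loci. In this case, equation (5) will be reduced to

$$T_I = \frac{\left(\hat{\delta}^A\right)^2}{\hat{V}_A}, \tag{6}$$

where δ̂^A and V̂_A are the estimators of the LD measure in cases and of its variance.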

However, in practice, background LD between two unlinked loci may exist in the population because of many unknown factors. Therefore, using equation (6) to test for interaction will increase type I error rates. The test statistic defined in equation (5) is more robust than that in equation (6). In appendix C, we showed that, for an admixed population, if differences in allele frequencies between two subpopulations at each of the two loci in cases and controls are the same, test statistic T_I in equation (5) is still a valid test for interaction between two unlinked loci.
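The case-control and case-only statistics can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes phase-known haplotype counts as input (in practice haplotype frequencies would have to be estimated from genotype data), it uses the standard large-sample variance of the LD coefficient with the number of sampled haplotypes in the denominator, and it obtains the χ²₍₁₎ P value from the complementary error function. All function names are hypothetical.

```python
import math

def ld_estimate(n11, n12, n21, n22):
    """Return (delta_hat, variance_hat) from counts of the four haplotypes."""
    n = n11 + n12 + n21 + n22          # number of sampled haplotypes
    p1 = (n11 + n12) / n               # frequency of allele 1 at locus 1
    p2 = (n11 + n21) / n               # frequency of allele 1 at locus 2
    delta = n11 / n - p1 * p2
    # Assumed large-sample variance of the LD estimator.
    variance = (p1 * (1 - p1) * p2 * (1 - p2)
                + (1 - 2 * p1) * (1 - 2 * p2) * delta
                - delta ** 2) / n
    return delta, variance

def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

def ld_contrast_test(case_counts, control_counts):
    """Case-control statistic: squared LD difference over summed variances."""
    d_a, v_a = ld_estimate(*case_counts)
    d_n, v_n = ld_estimate(*control_counts)
    stat = (d_a - d_n) ** 2 / (v_a + v_n)
    return stat, chi2_1df_pvalue(stat)

def case_only_test(case_counts):
    """Case-only statistic: valid only if there is no background LD."""
    d_a, v_a = ld_estimate(*case_counts)
    stat = d_a ** 2 / v_a
    return stat, chi2_1df_pvalue(stat)

# Hypothetical haplotype counts (n11, n12, n21, n22) for 100 cases and 100 controls.
print(ld_contrast_test((60, 40, 40, 60), (50, 50, 50, 50)))
print(case_only_test((60, 40, 40, 60)))
```

With the same case counts, the case-only statistic is larger than the case-control statistic because its denominator omits the control variance; this is also why it is more sensitive to background LD.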

Results

Patterns of Pairwise LD under Two-Locus Disease Models

Knowledge about differences in LD patterns between disease and general populations is crucial for association studies of complex diseases. To illustrate how the differences in LD patterns between disease and general populations are influenced by disease models, we examined the LD patterns between unlinked loci by assuming several two-locus disease models. We first studied the LD between two unlinked loci under three two-locus disease models: the union of dominant and dominant (Dom ∪ Dom), the union of recessive and recessive (Rec ∪ Rec), and threshold models (table 1). Figure 1 shows the LD between two unlinked loci, which is generated by the joint actions of two disease loci, as a function of the allele frequency at the first locus, under the assumption that the allele frequency at the second locus P_D₂=0.1 and penetrance parameter f=1. Figure 1 shows that, although two unlinked loci are in linkage equilibrium in the general population, LD between them does exist in the disease population. The LD in the disease population depends on the disease models and the allele frequencies at the two loci.

Figure 1.

LD between two unlinked loci in a disease population under three two-locus disease models as a function of allele frequency at the first locus, under the assumption that the allele frequency at the second locus equals 0.1.

Pairwise Interaction Measure

The proposed measure of interaction between two unlinked loci quantifies the magnitude of interaction between two unlinked loci. To further explore the properties of the interaction measure between two unlinked loci, we investigated the impact of the two-locus disease models on the measure of interaction. Figure 2 plots the measure of interaction between two unlinked loci under six two-locus disease models (table 1) as a function of penetrance parameter f, under the assumption that the allele frequencies at the two loci are 0.3 and 0.8 (fig. 2A) or 0.2 and 0.4 (fig. 2B). The figure shows that the measure of interaction is a monotonic function of the penetrance parameter. The measure of interaction depends on both the disease models and the allele frequencies at the two loci. However, the relationship between the measure of interaction and disease models is complex. For example, when the allele frequencies at two loci are 0.2 and 0.4, the measure of interaction for the Dom ∪ Dom model is much larger than that for the Rec ∪ Rec model, whereas when the allele frequencies at two loci are 0.3 and 0.8, the measure of interaction for the Dom ∪ Dom model is smaller than that for the Rec ∪ Rec model. This may partially explain why gene-gene interaction detected in one population cannot be replicated in another population, because allele frequencies are different between populations.

Figure 2.

Measure of interaction between two unlinked loci as a function of the penetrance parameter under six two-locus disease models, under the assumption that allele frequencies at the first and second loci equal either 0.3 and 0.8, respectively (A), or 0.2 (more ...)

Null Distribution of Test Statistics

In the previous sections, we have shown that, when sample size is large enough to apply large-sample theory, distribution of the statistic T_I for testing interaction between two unlinked loci under the null hypothesis of no interaction is asymptotically a central χ²₍₁₎ distribution. To examine the validity of this statement, we performed a series of simulation studies. The computer program SNaP³⁰ was used to generate two-locus genotype data of the sample individuals. A total of 10,000 individuals who were equally divided into cases and controls were generated in the general population. From each group of the cases and controls, 100–500 individuals were randomly sampled; 10,000 simulations were repeated.

Figure 3A and 3B plots the histograms of the test statistic T_I for testing gene-gene interaction between two unlinked loci with sample sizes n_A=n_G=150 and n_A=n_G=250, respectively. It can be seen that the distributions of the test statistic T_I are similar to the theoretical central χ²₍₁₎ distribution. Table 2 shows that the estimated type I error rates of the statistic T_I for testing interaction were not appreciably different from the nominal levels α=0.05, α=0.01, and α=0.001.
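The check described above can be imitated on a small scale. The sketch below does not use SNaP; it draws haplotypes directly with Python's `random` module under the null hypothesis (independent alleles at the two loci, identical frequencies in cases and controls) and counts how often the statistic exceeds 3.841, the upper 5% point of the χ²₍₁₎ distribution. The allele frequencies, sample size, number of replicates, seed, and the large-sample variance formula are illustrative assumptions, and the function names are hypothetical.

```python
import random

CHI2_1DF_CRIT_05 = 3.841  # upper 5% point of the chi-square distribution with 1 df

def sample_counts(rng, n_haplotypes, p1, p2):
    """Draw haplotypes with independent alleles; return [n11, n12, n21, n22]."""
    counts = [0, 0, 0, 0]
    for _ in range(n_haplotypes):
        a = 0 if rng.random() < p1 else 1   # allele at locus 1
        b = 0 if rng.random() < p2 else 1   # allele at locus 2
        counts[2 * a + b] += 1
    return counts

def ld_estimate(counts):
    """LD estimate and its assumed large-sample variance from haplotype counts."""
    n11, n12, n21, n22 = counts
    n = n11 + n12 + n21 + n22
    p1, p2 = (n11 + n12) / n, (n11 + n21) / n
    delta = n11 / n - p1 * p2
    variance = (p1 * (1 - p1) * p2 * (1 - p2)
                + (1 - 2 * p1) * (1 - 2 * p2) * delta - delta ** 2) / n
    return delta, variance

def t_statistic(case_counts, control_counts):
    d_a, v_a = ld_estimate(case_counts)
    d_n, v_n = ld_estimate(control_counts)
    return (d_a - d_n) ** 2 / (v_a + v_n)

def type1_error(n_individuals=150, p1=0.3, p2=0.4, replicates=2000, seed=1):
    """Fraction of null replicates in which the statistic exceeds the 5% cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replicates):
        cases = sample_counts(rng, 2 * n_individuals, p1, p2)
        controls = sample_counts(rng, 2 * n_individuals, p1, p2)
        if t_statistic(cases, controls) > CHI2_1DF_CRIT_05:
            rejections += 1
    return rejections / replicates

print(type1_error())
```

If the asymptotic approximation holds at this sample size, the printed proportion should lie close to 0.05, up to Monte Carlo error.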

Figure 3.

Null distribution of the test statistic T_I by use of 150 individuals (A) or 250 individuals (B) from both the cases and the controls in a homogeneous population.

Table 2.

Type I Error Rates of the Test Statistic T_I in Testing Interaction between Two Unlinked Loci in a Homogeneous Population

To examine the impact of population substructure on the null distribution of the test statistic T_I, we performed a series of simulations. We assumed that allele frequencies at the first locus were 0.7 and 0.3 in population 1 and 0.3 and 0.7 in population 2. The allele frequencies at the second locus were assumed to be 0.2 and 0.8 in population 1 and 0.8 and 0.2 in population 2. From each population, 10,000 individuals were sampled, and these individuals were mixed to form an admixed population, which was then equally divided into cases and controls. Three hundred individuals were randomly sampled from each group of the cases and controls, and 10,000 simulations were repeated. Figure 4 shows the histogram of test statistic T_I. It can be seen that the distribution of T_I is similar to the theoretical central χ²₍₁₎ distribution, which shows that population admixture has only a mild impact on the null distribution of test statistic T_I.

Figure 4.

Null distribution of the test statistic T_I by use of 300 individuals from both the cases and the controls in an admixed population.

Power Evaluation

To further evaluate the performance of the proposed statistic in testing gene-gene interaction, we compared the power of the LD-based statistic with that of the logistic model. We considered three types of genotype coding (genetic covariate variables). For a recessive model, homozygous wild-type, heterozygous, and homozygous mutant genotypes were coded as 0, 0, and 1, respectively. For a dominant model, these three genotypes were coded as 0, 1, and 1. For an additive model, they were coded as 0, 1, and 2. We considered two loci, denoted as G and H, respectively. Power for the logistic regression model in testing gene-gene interaction was calculated using the software QUANTO.³¹ Figure 5A, 5B, and 5C presents the power comparisons between logistic regression model and LD-based statistic under the three genetic interaction models: recessive × recessive, dominant × dominant, and additive × additive. We can see that the power of both logistic regression and the new LD-based statistic in detecting gene-gene interaction was a monotonic function of the interaction odds ratio, a widely used measure in quantifying the strength of interaction between two loci. This implies that the proposed new interaction measure and test statistic are closely related to the traditional interaction measure. Figure 5A, 5B, and 5C also shows that the power of the test statistic T_I is much higher than that of the logistic regression model.
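The three coding schemes can be written down directly. In the sketch below, a genotype is given as the number of mutant alleles (0, 1, or 2), and the interaction covariate is taken to be the product of the two coded genotypes, which is the usual way a logistic model represents an interaction term. This product form and the function names are assumptions, since the text does not spell out the exact model specified in QUANTO.

```python
# Codes for (homozygous wild-type, heterozygous, homozygous mutant).
CODING = {
    "recessive": (0, 0, 1),
    "dominant": (0, 1, 1),
    "additive": (0, 1, 2),
}

def code_genotype(n_mutant_alleles, model):
    """Map a genotype (0, 1, or 2 mutant alleles) to its covariate value."""
    return CODING[model][n_mutant_alleles]

def interaction_covariate(g, h, model):
    """Assumed interaction term: product of the coded genotypes at loci G and H."""
    return code_genotype(g, model) * code_genotype(h, model)

# Covariate values for an individual who is heterozygous at G and homozygous mutant at H.
for model in CODING:
    print(model, code_genotype(1, model), code_genotype(2, model),
          interaction_covariate(1, 2, model))
```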

Figure 5.

Power of the test statistic T_I and logistic regression analysis as a function of interaction odds ratio (R_GH) under three different models. A, Recessive × recessive model, under the assumption that the risk allele frequencies at both loci G and (more ...)

Pairwise LD is widely used in tSNP selection³²; that is, the chosen tSNPs show LD (measured by r²) above a preset threshold with the nearby SNPs that were not selected. This approach ensures enough power to detect a disease locus. We now investigate whether the selected threshold can ensure enough power to detect interaction between two unlinked loci. Figure 6A, 6B, and 6C shows the power of the statistic T_I for detecting interaction between two unlinked disease loci (using two tSNPs) as a function of the interaction measure under three two-locus disease models: Dom ∪ Dom, Dom ∪ Rec, and Rec ∪ Rec (table 1). For simplicity of presentation, we assume that each of the two unlinked marker loci has an equal correlation coefficient with one of the two unlinked interacting disease loci. We fix the allele frequency at the second locus and change the allele frequency at the first locus to produce the changing measure of interaction between two loci. Several remarkable features emerge from figure 6A, 6B, and 6C. First, in many cases, power increases as the measure of interaction increases. Second, using neighboring tSNPs has much lower power than does using the two interacting disease loci themselves directly. Third, the magnitude of r² has a large impact on the power of interaction detection.

Figure 6.

Power of the test statistic T_I as a function of the interaction measure between two unlinked loci under a two-locus disease model. A, Dom ∪ Dom, under the assumption that the number of individuals in both cases and controls are 500, penetrance (more ...)

In figure 6A, 6B, and 6C, we studied the power as a function of the measure of interaction. However, in practice, a measure of interaction cannot be directly observed. To provide more practically useful information for tSNP selection and association studies, we plot, in figure 7A, 7B, and 7C, the power of statistic T_I for interaction detection of two unlinked loci as a function of the allele frequency at the first locus under three two-locus disease models: Dom ∪ Dom, Dom ∪ Rec, and Rec ∪ Rec (table 1). Like figure 6A, 6B, and 6C, figure 7A, 7B, and 7C demonstrates that using tSNPs to detect interaction between two disease loci has much lower power than does using the disease loci themselves. Figure 7A, 7B, and 7C also shows that allele frequencies have a large impact on the power of interaction detection, although the patterns of the impact are different under different two-locus disease models.

Figure 7.

Power of the test statistic T_I as a function of allele frequency at the first locus under a two-locus disease model. A, Dom ∪ Dom, under the assumptions that the number of individuals in both cases and controls are 500, penetrance parameter f (more ...)

Application to Real Data Examples

The proposed LD-based statistic was also applied to two real data sets. The first data set is a case-control study. It includes 398 white patients with breast cancer and 372 matched controls from the Ontario Familial Breast Cancer Registry.³³ A total of 19 SNPs from 18 key genes from the pathways of DNA repair, cell cycle, carcinogen/estrogen metabolism, and immune system were typed. All SNPs were in HWE. Under a codominant model, multivariate logistic analysis found significant gene-gene interactions between four pairs of genes: XPD and IL10, GSTP1 and COMT, COMT and CCND1, and BARD1 and XPD.³³ We used the statistic T_I to test interactions between these four pairs of genes. The results are summarized in table 3. Table 3 also includes the crude P values obtained by Onay et al.³³ When calculating the crude P values, Onay et al.³³ included all the main effects and only the interaction term of interest in their multivariate logistic regression model. Using our LD-based statistic, we also found these four pairs of interactions to be significant, but with much smaller P values. Moreover, two pairs of significant interactions, XPD (Lys751Gln) with IL10 (G−1082A) and GSTP1 (Ile241Val) with COMT (Met108/158Val), remained significant after adjustment for multiple testing by use of Bonferroni correction. In contrast, all four pairs of significant interactions identified by logistic regression became nonsignificant after adjustment for multiple comparisons by use of the same Bonferroni correction procedure. Onay et al.³³ noted that these four identified interactions can be justified by experiments and by the biological relationships of the genes.³³^–³⁷

Table 3.

Comparison of P Values for Testing Gene-Gene Interactions (Example 1)

The second data set was a birth cohort study that recorded the incidence of hospital admission with malaria and severe malaria from Kilifi District Hospital on the coast of Kenya in Africa.³⁸ A total of 2,104 children from the study were genotyped for both hemoglobin (Hb) and α⁺-thalassemia genes to test their interaction. The Hb gene has two alleles, A and S. The mutant S causes sickle cell disease. The normal and mutant alleles in the gene α⁺-thalassemia are denoted by α and −. We applied the proposed statistic T_I to these data to test interaction between the Hb and α⁺-thalassemia genes. The results are summarized in table 4. For comparison, table 4 also lists P values obtained by Poisson regression analysis performed by Williams et al.³⁸ We can see that the P values of the test statistic T_I were smaller than those of the Poisson regression analysis. The structural variant HbS and α⁺-thalassemia are each protective against severe Plasmodium falciparum malaria. However, if they were inherited together, protection against malaria was lost. The negative epistasis between these two genes can be explained by their biochemical functions.³⁸ The malaria-protective effect of HbAS comes from allele HbS, which might increase binding of hemichromes to the erythrocyte membrane, leading to opsonization and accelerating the removal of infected erythrocytes by phagocytosis. However, coexistence of α⁺-thalassemia with HbS reduces the concentration of HbS, which in turn reduces the protective effect of HbS against malaria.

Table 4.

Comparison of P Values for Testing Gene-Gene Interaction between the Hb and α⁺-Thalassemia Genes (Example 2)

Discussion

Understanding how genomic information underlies the development of complex diseases is one of the greatest challenges in the 21st century. In the past several decades, genetic studies of human disease have focused on a “locus-by-locus” paradigm.³⁹ However, biological information is processed in complex networks. The disease emerges as the result of interactions between genes and between a gene and environments. Studying one individual gene or polymorphism at a time to explore the cause of the disease and ignoring the interaction between loci (genes) are unlikely to deeply unravel the mechanism of disease. With the imminent completion of the International HapMap Project, development of statistical methods for detecting gene-gene interaction is of great importance. The purpose of this article is to present a new statistic for identifying interaction between two unlinked loci.

Association studies rely heavily on the LD pattern between pairs of loci. Knowledge about the difference in LD between the disease and general populations is essential for understanding the interaction between two loci and their association with the disease. However, little is known about how multiple-locus disease models influence the pattern of LD in the disease population and how the interaction between two functional SNPs generates LD in a disease population. Therefore, before presenting the new statistic for detection of the interaction between two unlinked loci, we first developed the general theory to study LD patterns in a disease population under two-locus disease models. We introduced a new concept of haplotype penetrance and developed a measure of interaction between two unlinked loci. Surprisingly, the formula for calculating the interaction measure was very similar to that for calculating the LD measure. The proposed measure of interaction characterizes the contribution of interaction between two loci to the cause of disease. We also investigated how two-locus disease models and population parameters affect the measure of interaction between two unlinked loci. Intuitively, interaction indicates the joint action of two genes in the development of disease. This implies that some haplotypes spanned by the interacting loci occur more often in the disease population than expected. In other words, the interaction between two unlinked loci generates LD in the disease population, and the level of LD generated by gene-gene interaction depends on the magnitude of the interaction. We have rigorously proved that the measure of LD between two unlinked loci generated by their interaction is proportional to the measure of the interaction, which motivated us to propose a statistic that tests for interaction between two unlinked loci by comparing the LD in the disease population with that in the general population.
Here, we should point out that, after finishing this manuscript, we noticed that a similar statistic was proposed to test association between a single gene and disease.⁴⁰ Zaykin et al.⁴¹ called it the “LD contrast test.” However, this LD contrast test was originally designed to test the association of SNPs by assuming a single disease model. It has not been extended to testing gene-gene interaction.

To use the proposed LD-based statistic to test gene-gene interaction between two unlinked loci, we first examined its distribution under the null hypothesis of no interaction. Through extensive simulation studies (under the assumption of large-sample theory), we showed that the null distribution of the proposed LD-based statistic in both homogeneous and admixed populations was close to a central χ²₍₁₎ distribution. We also calculated type I error rates of the LD-based statistic by simulation; they were close to the nominal significance levels. We then investigated the power of the new statistic to detect gene-gene interaction by analytic methods. The analysis showed that power is a function of the interaction measure, which implies that the new statistic can indeed be used to test interaction between two unlinked loci. However, the power of the proposed statistic is a complex function. For example, in addition to the measure of interaction, it also depends on allele frequencies. Moreover, when the measure of interaction lies beyond a certain range, power is no longer an increasing function of the interaction measure (data not shown). Power comparison with logistic regression analysis demonstrated that the LD-based test statistic has much higher power to detect interaction than does the logistic regression method.
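The type I error check described above can be illustrated with a small simulation. The following Python sketch is not the simulation code used in this study. It assumes phase-known haplotypes, draws the two loci independently in both the case and the control samples (so the true LD is zero in each, although allele frequencies differ between them), and uses an assumed form of the contrast, (δ̂_A − δ̂)² divided by the sum of the two estimated variances, because the statistic T_I is not restated in this section. The variance is the standard large-sample variance of an LD estimate; all function names, sample sizes, and allele frequencies are hypothetical.

```python
import random

CHI2_1_CRIT_05 = 3.841  # upper 5% point of a chi-square distribution with 1 df


def ld_stats(haps):
    """Estimate LD and its large-sample variance from phase-known haplotypes.

    haps is a list of (a, b) pairs, where a = 1 codes allele D1 at locus 1
    and b = 1 codes allele D2 at locus 2 (0 codes the other allele).
    """
    n = len(haps)
    p11 = sum(1 for a, b in haps if a == 1 and b == 1) / n
    p1 = sum(a for a, _ in haps) / n
    p2 = sum(b for _, b in haps) / n
    delta = p11 - p1 * p2
    # Standard large-sample variance of an LD estimate based on n haplotypes.
    var = (p1 * (1 - p1) * p2 * (1 - p2)
           + (1 - 2 * p1) * (1 - 2 * p2) * delta
           - delta ** 2) / n
    return delta, var


def ld_contrast(case_haps, control_haps):
    """Assumed LD-contrast statistic: squared LD difference over summed variances."""
    d_a, v_a = ld_stats(case_haps)
    d_g, v_g = ld_stats(control_haps)
    return (d_a - d_g) ** 2 / (v_a + v_g)


def simulate_haps(rng, n_hap, p1, p2):
    """Draw n_hap haplotypes with independent loci (true LD = 0)."""
    return [(int(rng.random() < p1), int(rng.random() < p2)) for _ in range(n_hap)]


def type1_error(n_rep=1000, n_hap=500, seed=1):
    """Fraction of null replicates in which the contrast exceeds the 5% cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        # Marginal effects only: allele frequencies differ, LD is zero in both.
        cases = simulate_haps(rng, n_hap, 0.4, 0.3)
        controls = simulate_haps(rng, n_hap, 0.3, 0.2)
        if ld_contrast(cases, controls) > CHI2_1_CRIT_05:
            rejections += 1
    return rejections / n_rep
```

Under these assumptions, `type1_error()` should return a value near the nominal level of 0.05. In real genotype data, haplotype frequencies would have to be estimated rather than counted, which this sketch does not attempt.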

The widely used strategies for tSNP selection are based on a single-disease-gene model. The criterion for tSNP selection is the LD level between the tSNP and the disease-susceptibility locus, which ensures a certain power to detect association of a single disease locus with the disease. Our theoretical analysis and power studies demonstrated that tSNPs selected in this way are highly unlikely to ensure that interactions between two unlinked loci will be detected.

To further evaluate its performance for detection of interaction between two loci, the proposed LD-based statistic was applied to two published data sets. Our results showed that, in general, P values of the test statistic T_I were much smaller than those of other approaches, including logistic regression analysis.

Like all population-based methods for association studies, the proposed LD-based statistic for testing gene-gene interaction between two unlinked loci also suffers from the attribution-of-causality confound in situations of pleiotropy or overlapping clinical conditions. The detected interaction for a particular disease could actually relate to other diseases that may share common etiological effects with the disease of interest and are only indirectly associated with the disease of interest. Similar to population structure, epistatic selection will also create LD between two unlinked loci. If epistatic selection between two unlinked loci is irrelevant to the disease of interest, the level of LD created by epistatic selection in both cases and controls will be similar, and, in this case, the impact of epistatic selection on the false-positive rate is limited. However, when epistatic selection underlies the phenotypes that are indirectly associated with the disease of interest, it will cause confounding.

Similar to most models for LD, the proposed test statistic and measure of interaction between two unlinked loci require the assumption of HWE. Deviation from HWE will affect the false-positive rates. The measure of interaction in the presence of Hardy-Weinberg disequilibrium (HWD) is a complicated function of penetrance, allele frequencies, and the measure of HWD. A detailed analysis of the impact of HWD on the test for interaction is needed.
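Because the test relies on HWE, it may be useful to check each locus for departure from HWE before testing for interaction. The following Python sketch shows the usual one-degree-of-freedom goodness-of-fit statistic for a biallelic locus. It is an illustration only and is not part of the analyses reported in this article; the function name and the genotype counts in the usage note are hypothetical.

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit statistic for HWE at a biallelic locus.

    n_aa, n_ab, n_bb are observed genotype counts. The statistic is compared
    with a chi-square distribution with 1 df (3.841 at the 5% level).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele a
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For example, `hwe_chi2(25, 50, 25)` is zero because the counts match HWE proportions exactly, whereas a sample with no heterozygotes, such as `hwe_chi2(50, 0, 50)`, gives a large value.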

In recent years, increasingly detailed and comprehensive evidence has shown that genetic and molecular interactions govern cell behaviors, including cell division, differentiation, and death, and are primary factors in the development of disease. In many cases, single-locus analysis fails to unravel the mechanism of disease. The locus-by-locus paradigm for genetic studies of complex diseases should give way to a new paradigm that incorporates gene-gene interaction.

The results in this article are preliminary. Interaction between two linked loci or high-order interactions among multiple loci have not been studied. Gene-gene interaction is an important but complex concept. There are several ways to define gene-gene interaction. How the definition of gene-gene interaction on a population level reflects the genes' biochemical or physiological interaction is still a mystery. We hope that this work provides further motivation to conduct theoretical research in deciphering genetic and physiological meaning of gene-gene interactions and to develop more statistical methods for testing gene-gene interaction. In the coming years, the integration of gene-gene interaction into genomewide association analysis will be a major task in genetic studies of complex diseases.

Acknowledgments

We thank three anonymous reviewers for helpful comments on the manuscript, which led to much improvement of the article. M.X. is supported by National Institutes of Health (NIH)–National Institute of Arthritis and Musculoskeletal and Skin Diseases grant P01 AR052915-01A1, NIH grants HL74735 and ES09912, and Shanghai Commission of Science and Technology grant 04dz14003. J.Z. is supported by NIH grant ES09912.

Appendix A

By definition, we have

Similarly, we can obtain the remaining formulas in equation (1) in the text.

By definition, the measure of LD in the disease population is given by

By definition, we have

Similarly, we obtain

Multiplying equation (A1) by equation (A2) yields

which implies that

Appendix B

Assume that marker locus M₁ has two alleles, M₁ and m₁, and the marker locus M₂ has two alleles, M₂ and m₂. Let the frequencies of the haplotypes D₁M₁, D₁m₁, d₁M₁, and d₁m₁ be P_D₁M₁, P_D₁m₁, P_d₁M₁, and P_d₁m₁, respectively. The frequencies of the haplotypes D₂M₂, D₂m₂, d₂M₂, and d₂m₂ can be similarly defined. Let the frequencies of the haplotypes M₁M₂, M₁m₂, m₁M₂, and m₁m₂ in the disease population be q^A₁₁, q^A₁₂, q^A₂₁, and q^A₂₂, respectively. Then, we have

Similarly, we have

and

Thus, after some algebra, we can obtain the LD between two marker loci in the disease population:

Recall that the LD between two unlinked disease loci in the disease population is given by

Therefore, the LD between two unlinked marker loci in the disease population can be rewritten as

Appendix C

It is well known that the estimators of the haplotype frequencies P̂₁₁, P̂₁₂, and P̂₂₁ are asymptotically distributed as a multivariate normal distribution N[P,(1/2n_G)Σ], where P=[P₁₁,P₁₂,P₂₁]^T and Σ=diag(P₁₁,P₁₂,P₂₁)-PP^T. Let P^A=[P^A₁₁,P^A₁₂,P^A₂₁]^T. Similarly, the estimator P̂^A is asymptotically distributed as N[P^A,(1/2n_A)Σ^A], where

Since δ̂ = P̂₁₁ - (P̂₁₁ + P̂₁₂)(P̂₁₁ + P̂₂₁) is a function of the haplotype frequencies P̂₁₁, P̂₁₂, and P̂₂₁, the estimated measure of LD, δ̂, is asymptotically distributed as shown by Serfling⁴²:

where

However, we can show that

First, we note that ∂δ/∂P_D₁D₂=1-P_D₁-P_D₂, ∂δ/∂P_D₁d₂=-P_D₂, and ∂δ/∂P_d₁D₂=-P_D₁. Let V=CΣC^T. After some algebra, we have

Since (1-P_D₁-P_D₂)P_D₁D₂-P_D₂P_D₁d₂-P_D₁P_d₁D₂=δ-P_D₁P_D₂, we have

Note that

Substituting equation (C3) into equation (C2) yields

Collecting the coefficient of δ in the above equation (C4), we obtain

Substituting equation (C5) into equation (C4), we have

which proves equation (C1). Similarly, δ̂_A is asymptotically distributed as N[δ_A,(1/2n_A)V_A]. Under the null hypothesis of no interaction between two unlinked loci, we have δ_A=δ=0. Therefore, the statistic T_I is asymptotically distributed as a central χ²₍₁₎ distribution under the null hypothesis.
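The delta-method step used above can be checked numerically. The following Python sketch evaluates V=CΣC^T directly from the partial derivatives given in the text and from Σ=diag(P)-PP^T, and compares it with the standard closed-form large-sample variance of an LD estimate, P_D₁(1-P_D₁)P_D₂(1-P_D₂)+(1-2P_D₁)(1-2P_D₂)δ-δ². That this closed form is what equation (C1) states is an assumption, since the equation is not reproduced in this text; the function names are hypothetical.

```python
def ld_variance_delta_method(p11, p12, p21):
    """V = C Sigma C^T for delta = P11 - (P11 + P12)(P11 + P21)."""
    p = [p11, p12, p21]
    p_d1 = p11 + p12  # frequency of allele D1
    p_d2 = p11 + p21  # frequency of allele D2
    # Partial derivatives of delta with respect to P11, P12, and P21.
    c = [1 - p_d1 - p_d2, -p_d2, -p_d1]
    # With Sigma = diag(P) - P P^T:
    # C Sigma C^T = sum_i c_i^2 P_i - (sum_i c_i P_i)^2.
    quad = sum(ci * ci * pi for ci, pi in zip(c, p))
    lin = sum(ci * pi for ci, pi in zip(c, p))
    return quad - lin ** 2


def ld_variance_closed_form(p11, p12, p21):
    """Standard closed-form large-sample variance of the LD estimate."""
    p_d1 = p11 + p12
    p_d2 = p11 + p21
    delta = p11 - p_d1 * p_d2
    return (p_d1 * (1 - p_d1) * p_d2 * (1 - p_d2)
            + (1 - 2 * p_d1) * (1 - 2 * p_d2) * delta
            - delta ** 2)
```

For haplotype frequencies (0.3, 0.2, 0.1), both functions give 0.05, so the matrix form and the closed form agree for this example.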

Now, we show that, under some assumption, the statistic T_I is still a valid test in the admixed population. Consider an admixed population that is mixed from two subpopulations with proportions α and (1-α). It is known that the measure of LD in the admixed population is given by

where P^(k)_{D_i} and δ^(k) are the frequency of the allele D_i and the measure of LD between two loci in the kth subpopulation (k=1,2), respectively. If we assume that

where P^A(k)_{D_i} is the frequency of the allele D_i in the kth disease subpopulation, then we have

Therefore, under the assumption (C6), the statistic T_I is also asymptotically distributed as a central χ²₍₁₎ distribution under the null hypothesis of no interaction between two unlinked loci in the admixed population.
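The effect of admixture on LD can be illustrated with a short calculation. The following Python sketch uses the standard two-subpopulation result δ=αδ⁽¹⁾+(1-α)δ⁽²⁾+α(1-α)(P⁽¹⁾_D₁-P⁽²⁾_D₁)(P⁽¹⁾_D₂-P⁽²⁾_D₂), which we assume is the expression referred to above, and checks it against LD computed directly from the mixed haplotype frequencies. The function names and numerical values are hypothetical.

```python
def admixed_ld_direct(alpha, sub1, sub2):
    """LD in a mixture of two subpopulations, from mixed haplotype frequencies.

    Each subpopulation is a tuple (p_d1, p_d2, delta): the frequencies of
    alleles D1 and D2 and the LD between the two loci in that subpopulation.
    """
    (a1, b1, d1), (a2, b2, d2) = sub1, sub2
    h1 = a1 * b1 + d1  # frequency of haplotype D1D2 in subpopulation 1
    h2 = a2 * b2 + d2  # frequency of haplotype D1D2 in subpopulation 2
    h = alpha * h1 + (1 - alpha) * h2
    a = alpha * a1 + (1 - alpha) * a2
    b = alpha * b1 + (1 - alpha) * b2
    return h - a * b


def admixed_ld_formula(alpha, sub1, sub2):
    """Same quantity from the standard admixture decomposition."""
    (a1, b1, d1), (a2, b2, d2) = sub1, sub2
    return (alpha * d1 + (1 - alpha) * d2
            + alpha * (1 - alpha) * (a1 - a2) * (b1 - b2))
```

With α=0.5 and subpopulations (0.2, 0.3, 0) and (0.6, 0.7, 0), both functions give 0.04, so admixture alone creates LD between unlinked loci even when each subpopulation has none. A contrast of LD between cases and controls removes this term only if it is the same in both samples.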

References

1. Cook NR, Zee RY, Ridker PM (2004) Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med 23:1439–1453 doi: 10.1002/sim.1749.

2. Hansen TF, Wagner GP (2001) Modeling genetic architecture: a multilinear theory of gene interaction. Theor Popul Biol 59:61–86 doi: 10.1006/tpbi.2000.1508.

3. Fisher RA (1918) The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb 3:399–433.

4. Cockerham CC (1954) An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics 39:859–882.

5. Kempthorne O (1954) The correlation between relatives in a random mating population. Proc R Soc Lond B 143:103–113.

6. Wagner GP, Laubichler MD, Bagheri-Chaichian H (1998) Genetic measurement theory of epistatic effects. Genetica 102–103:569–580.

7. Hosmer DW, Lemeshow S (2000) Applied logistic regression. John Wiley & Sons, New York.

8. Cheverud JM, Routman EJ (1995) Epistasis and its contribution to genetic variance components. Genetics 139:1455–1461.

9. Kooperberg C, Ruczinski I (2005) Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol 28:157–170 doi: 10.1002/gepi.20042.

10. Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L (2001) Sequence analysis using logic regression. Genet Epidemiol Suppl 1 21:S626–S631.

11. Ruczinski I, Kooperberg C, LeBlanc M (2003) Logic regression. J Comput Graph Stat 12:475–511 doi: 10.1198/1061860032238.

12. Nelson MR, Kardia SL, Ferrell RE, Sing CF (2001) A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res 11:458–470 doi: 10.1101/gr.172901.

13. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69:138–147.

14. Moore JH, Hahn LW (2002) A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pac Symp Biocomput 53–64.

15. Bastone L, Reilly M, Rader DJ, Foulkes AS (2004) MDR and PRP: a comparison of methods for high-order genotype-phenotype associations. Hum Hered 58:82–92 doi: 10.1159/000083029.

16. Williams SM, Ritchie MD, Phillips JA 3rd, Dawson E, Prince M, Dzhura E, Willis A, Semenya A, Summar M, White BC, Addy JH, Kpodonu J, Wong LJ, Felder RA, Jose PA, Moore JH (2004) Multilocus analysis of hypertension: a hierarchical approach. Hum Hered 57:28–38 doi: 10.1159/000077387.

17. Soares ML, Coelho T, Sousa A, Batalov S, Conceicao I, Sales-Luis ML, Ritchie MD, Williams SM, Nievergelt CM, Schork NJ, Saraiva MJ, Buxbaum JN (2005) Susceptibility and modifier genes in Portuguese transthyretin V30M amyloid polyneuropathy: complexity in a single-gene disease. Hum Mol Genet 14:543–553 doi: 10.1093/hmg/ddi051.

18. Foulkes AS, De Gruttola V, Hertogs K (2004) Combining genotype groups and recursive partitioning: an application to human immunodeficiency virus type 1 genetics data. Appl Stat 53:311–323.

19. Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS (2004) Multifactor-dimensionality reduction shows a two-locus interaction associated with type 2 diabetes mellitus. Diabetologia 47:549–554 doi: 10.1007/s00125-003-1321-3.

20. Coffey CS, Hebert PR, Ritchie MD, Krumholz HM, Gaziano JM, Ridker PM, Brown NJ, Vaughan DE, Moore JH (2004) An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics 5:49 doi: 10.1186/1471-2105-5-49.

21. Tsai CT, Lai LP, Lin JL, Chiang FT, Hwang JJ, Ritchie MD, Moore JH, Hsu KL, Tseng CD, Liau CS, Tseng YZ (2004) Renin-angiotensin system gene polymorphisms and atrial fibrillation. Circulation 109:1640–1646 doi: 10.1161/01.CIR.0000124487.36586.26.

22. Moore JH, Williams SM (2002) New strategies for identifying gene-gene interactions in hypertension. Ann Med 34:88–95 doi: 10.1080/07853890252953473.

23. Williams SM, Addy JH, Phillips JA 3rd, Dai M, Kpodonu J, Afful J, Jackson H, Joseph K, Eason F, Murray MM, Epperson P, Aduonum A, Wong LJ, Jose PA, Felder RA (2000) Combinations of variations in multiple genes are associated with hypertension. Hypertension 36:2–6.

24. Zhu X, Bouzekri N, Southam L, Cooper RS, Adeyemo A, McKenzie CA, Luke A, Chen G, Elston RC, Ward R (2001) Linkage and association analysis of angiotensin I–converting enzyme (ACE)–gene polymorphisms with ACE concentration and blood pressure. Am J Hum Genet 68:1139–1148.

25. Takahashi N, Murakami H, Kodama K, Kasagi F, Yamada M, Nishishita T, Inagami T (2000) Association of a polymorphism at the 5′-region of the angiotensin II type 1 receptor with hypertension. Ann Hum Genet 64:197–205 doi: 10.1046/j.1469-1809.2000.6430197.x.

26. Xiong M, Zhao J, Boerwinkle E (2002) Generalized T² test for genome association studies. Am J Hum Genet 70:1257–1268.

27. Neuman RJ, Rice JP (1992) Two-locus models of disease. Genet Epidemiol 9:347–365 doi: 10.1002/gepi.1370090506.

28. Schork NJ, Boehnke M, Terwilliger JD, Ott J (1993) Two-trait-locus linkage analysis: a powerful strategy for mapping complex genetic traits. Am J Hum Genet 53:1127–1136.

29. Weir BS (1990) Genetic data analysis. Sinauer Associates, Sunderland, MA.

30. Nothnagel M (2002) Simulation of LD block-structured SNP haplotype data and its use for the analysis of case-control data by supervised learning methods. Am J Hum Genet Suppl 71:A2363.

31. Gauderman WJ (2002) Sample size requirements for association studies of gene-gene interaction. Am J Epidemiol 155:478–484 doi: 10.1093/aje/155.5.478.

32. Byng MC, Whittaker JC, Cuthbert AP, Mathew CG, Lewis CM (2003) SNP subset selection for genetic association studies. Ann Hum Genet 67:543–556 doi: 10.1046/j.1529-8817.2003.00055.x.

33. Onay VU, Briollais L, Knight JA, Shi E, Wang Y, Wells S, Li H, Rajendram I, Andrulis IL, Ozcelik H (2006) SNP-SNP interactions in breast cancer susceptibility. BMC Cancer 6:114 doi: 10.1186/1471-2407-6-114.

34. Wang XW, Vermeulen W, Coursen JD, Gibson M, Lupold SE, Forrester K, Xu G, Elmore L, Yeh H, Hoeijmakers JH, Harris CC (1996) The XPB and XPD DNA helicases are components of the p53-mediated apoptosis pathway. Genes Dev 10:1219–1232.

35. Fabbro M, Savage K, Hobson K, Deans AJ, Powell SN, McArthur GA, Khanna KK (2004) BRCA1-BARD1 complexes are required for p53^Ser-15 phosphorylation and a G₁/S arrest following ionizing radiation-induced DNA damage. J Biol Chem 279:31251–31258 doi: 10.1074/jbc.M405372200.

36. Lu F, Gladden AB, Diehl JA (2003) An alternatively spliced cyclin D1 isoform, cyclin D1b, is a nuclear oncogene. Cancer Res 63:7056–7061.

37. Mitrunen K, Hirvonen A (2003) Molecular epidemiology of sporadic breast cancer: the role of polymorphic genes involved in oestrogen biosynthesis and metabolism. Mutat Res 544:9–41 doi: 10.1016/S1383-5742(03)00016-4.

38. Williams TN, Mwangi TW, Wambua S, Peto TE, Weatherall DJ, Gupta S, Recker M, Penman BS, Uyoga S, Macharia A, Mwacharo JK, Snow RW, Marsh K (2005) Negative epistasis between the malaria-protective effects of α⁺-thalassemia and the sickle cell trait. Nat Genet 37:1253–1257 doi: 10.1038/ng1660.

39. Marchini J, Donnelly P, Cardon LR (2005) Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 37:413–417 doi: 10.1038/ng1537.

40. Nielsen DM, Ehm MG, Zaykin DV, Weir BS (2004) Effect of two- and three-locus linkage disequilibrium on the power to detect marker/phenotype associations. Genetics 168:1029–1040 doi: 10.1534/genetics.103.022335.

41. Zaykin DV, Meng Z, Ehm MG (2006) Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am J Hum Genet 78:737–746.

42. Serfling RJ (1980) Approximation theorems of mathematical statistics. John Wiley & Sons, New York.

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics