Main

Performing agnostic searches for associations between pairs of variables in large-scale data, using either common statistical techniques or machine-learning algorithms, faces the problem of multiple comparisons. This problem is particularly acute in genetic association studies, in which contemporary cohorts have access to millions of genetic variants as well as a broad range of clinical factors and biomarkers for each individual. With billions of candidate associations, identifying a true association of small magnitude is extremely challenging. Standard analysis approaches currently consist of examining the data in one dimension (i.e., testing a single outcome with each of the millions of candidate genetic predictors) and applying univariate statistical tests—the so-called genome-wide association study (GWAS) approach1,2. To increase power, GWAS relies on increasing the sample size to reach the multiple-comparisons-adjusted significance level. The largest studies to date, including hundreds of thousands of individuals across dozens of cohorts, have decreased the limit of detectable effect sizes. For example, researchers have reported genetic variants explaining less than 0.01% of the total variation in body mass index3.

In addition to the substantial financial costs of collecting and genotyping large cohorts, this brute-force approach has practical limits. More importantly, this approach does not leverage the large amount of additional phenotypic and genomic information measured in many studies. Joint analyses of multiple phenotypes with each predictor of interest (for example, multivariate analysis of variance (MANOVA) and MultiPhen)4,5,6 offer a gain in power but have three major drawbacks. First, a significant result can be interpreted only as an association with any one of the phenotypes. Although this information is useful for screening purposes, it is insufficient to identify specific genotype–phenotype associations6. Second, such analyses make the replication process difficult, because association signals in the discovery sample depend on many parameters including the phenotypic correlation and the effect of the genotype on each phenotype. Third, joint tests have lower power than do univariate tests when only a small proportion of the phenotypes are associated with the tested genetic variant. This lower power is a simple problem of dilution: a small number of true associations mixed with many null phenotypes decreases the power.

In this work, we developed covariates for multiphenotype studies (CMS), a method that improves association-test power in multiphenotype studies while providing the resolution of univariate tests. When testing for association between a genotype and a phenotype, CMS allows the other collected correlated phenotypes to serve as covariates. The core of the method is a principled approach to selecting a set of these covariates that are correlated with the phenotype but not with the genotype, thereby decreasing phenotypic variance independently of the genotype and concomitantly increasing power. Via application of CMS to simulated and real data, we found that CMS scales to thousands of phenotypes, produces gains in power equivalent to those resulting from a two- to threefold increase in sample size, and outperforms other recently proposed multiphenotype approaches with univariate resolution, including a Bayesian approach (multivariate Bayesian imputation-based association mapping (mvBIMBAM7)) and dimensionality-reduction approaches (principal component analysis8 and probabilistic estimation of expression residuals (PEER9)).

Results

Covariates as a proxy for unmeasured causal factors

The objective of this work was to develop a method that keeps the resolution of univariate analysis in testing for association between outcome Y and candidate predictor X, but takes advantage of other available covariates C = (C1, C2,...Cm) to increase power. Consider the inclusion of covariates correlated with the outcome in a standard regression framework. This inclusion may increase the signal-to-noise ratio between the outcome and the candidate predictor when testing Y ~ X + CL, where CL ⊆ C. Selection of which covariates Ci are relevant to a specific association test is usually based on causal assumptions10,11. Epidemiologists and statisticians commonly recommend inclusion of two types of covariates in testing for association between X and Y: (i) those that are potential causal factors of the outcome and independent of X and (ii) those that may confound the association signal between X and Y, i.e., variables such as principal components (PCs) of genotypes or covariates that capture undesired structures in the data that can lead to false associations12. All other variables that vary with the outcome because of shared risk factors are usually ignored. However, those variables carry information about the outcome and, more precisely, about the risk factors of the outcome. Because they potentially share dependencies with the outcome, they can be used as proxies for unmeasured risk factors. As such, they can be incorporated in CL to improve the detection of associations between X and Y. However, when these variables depend on the predictor X, using them as covariates can lead to both false-positive and false-negative results depending on the underlying causal structure of the data.

The presence of interdependent explanatory variables, also known as multicollinearity13, can induce bias in the estimation of the predictor's effect on the outcome. We have recently discussed this issue in the context of GWAS adjusting for heritable covariates14. To illustrate this collider bias, consider first the simple case of two independent covariates U1 and U2 that are true risk factors of Y. In testing for association between X and Y, adjusting for U1 and U2 can increase power, because the residual variance of Y after the adjustment is smaller while the effect of X is unchanged (Fig. 1a), i.e., the ratio of the outcome variance explained by X over the residual variance is larger after removal of the effects of U1 and U2. However, in practice, true risk factors of the outcome are rarely known. Consider instead the more realistic scenario in which U1 and U2 are unknown, but a covariate C, which also depends on those risk factors, has been measured. Because of their shared etiology, Y and C display a positive correlation, and when X is not associated with C, adjusting Y for C increases the power to detect (Y,X) associations (Fig. 1b). Problems arise when C is associated with X. In that case, adjusting Y for C biases the estimation of the effect of X on Y, thereby decreasing the power when the effect of X is concordant between C and Y (Fig. 1c), and inducing a false signal when X is not associated with Y (Fig. 1d). The same principles apply when multiple covariates correlated with the outcome are included.
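To make the scenarios in Figure 1 concrete, the following minimal simulation (not from the paper; the sample size and effect sizes are arbitrary illustrative choices) contrasts adjustment for a covariate that is independent of X (as in Fig. 1b) with adjustment for a covariate that also depends on X (as in Fig. 1d):

```python
# Minimal simulation (not from the paper) contrasting Fig. 1b and Fig. 1d:
# adjusting Y for a covariate C helps when C is independent of X, but biases
# the test when C also depends on X. Effect sizes are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000

def p_value_x(y, x, c=None):
    """Two-sided P value for X in a linear model of Y, optionally adjusted for C."""
    cols = [np.ones(len(y)), x] if c is None else [np.ones(len(y)), x, c]
    design = np.column_stack(cols)
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    df = len(y) - design.shape[1]
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
    return 2 * stats.t.sf(abs(beta[1] / se), df)

u1, u2 = rng.normal(size=(2, n))   # unmeasured shared risk factors
x = rng.normal(size=n)             # predictor (e.g., standardized genotype)

# Fig. 1b: C depends only on U1 and U2 -> adjustment increases power
y = 0.05 * x + 0.6 * u1 + 0.6 * u2 + rng.normal(size=n)
c = 0.7 * u1 + 0.7 * u2 + rng.normal(size=n)
print("1b unadjusted vs adjusted:", p_value_x(y, x), p_value_x(y, x, c))

# Fig. 1d: Y is independent of X, but C depends on X -> adjustment induces a false signal
y0 = 0.6 * u1 + 0.6 * u2 + rng.normal(size=n)
cx = 0.3 * x + 0.7 * u1 + 0.7 * u2 + rng.normal(size=n)
print("1d unadjusted vs adjusted:", p_value_x(y0, x), p_value_x(y0, x, cx))
```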

Figure 1: Variance components of adjusted variables.

(a–d) Illustrations of the components of the variance of outcome Y before and after adjusting for other variables. The predictor of interest X is displayed in red. In a, the adjusting variables (U1 and U2) are true causal factors that have direct effects on Y; therefore, adjusting Y for U1 and U2 (thus yielding Yadj) decreases the variance of Y. In b, the true factors are not measured, but a variable C, influenced by U1 and U2, is measured. Adjusting Y for C decreases the residual variance of Y but also introduces a component of the variance specific to C. In c, the covariate shares factors with Y but is also influenced by X. When the effect of X on C is concordant with the effect of X on Y, a power loss may be induced. In d, Y is not associated with the predictor, and adjusting for C can induce a false-association signal by introducing the effect of X into the residual of Y.

When none of the covariates depend on the predictor (Fig. 1a,b), their inclusion in a regression can decrease the variance of the outcome without confounding, thus increasing statistical power while maintaining the correct null distribution. This gain in power can be easily described in terms of an equivalent sample-size increase. The noncentrality parameter (ncp) of the standard univariate chi-square test between X and Y is ncpXY = nσ²Yρ²XY/(σ²Y(1 − ρ²XY)), where n, σ²Y, and ρ²XY are the sample size, the total variance of the outcome Y, and the squared correlation between X and Y, respectively. When reducing σ²Y by a factor γ through covariate adjustment, and assuming that the effect of X on Y is small, so that σ²Y(1 − ρ²XY) ≈ σ²Y, ncpXY can be approximated by γnρ²XY. For example, when the covariates explain 30% of the variance of Y, the power of the adjusted test is equivalent to that when a sample size 1.4-fold larger (as compared with the unadjusted test) is analyzed. When covariates explain 80% of the phenotypic variance—a realistic proportion in some genetic data sets examined below—the power gain is equivalent to that resulting from a fivefold increase in sample size (Fig. 2a).
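As a back-of-the-envelope illustration of this equivalence (a sketch using the fractions of explained variance quoted above), removing a fraction f of the outcome variance leaves a residual variance of (1 − f)σ²Y, so the noncentrality parameter, and hence the equivalent sample size, scales by γ = 1/(1 − f):

```python
# Back-of-the-envelope check of the sample-size equivalence described above:
# removing a fraction f of the outcome variance rescales the noncentrality
# parameter by gamma = 1 / (1 - f), the same gain as a gamma-fold larger sample.
for f in (0.30, 0.50, 0.80):
    gamma = 1.0 / (1.0 - f)
    print(f"covariates explain {f:.0%} of var(Y) -> ~{gamma:.1f}-fold equivalent sample size")
# covariates explain 30% of var(Y) -> ~1.4-fold equivalent sample size
# covariates explain 80% of var(Y) -> ~5.0-fold equivalent sample size
```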

Figure 2: Examples of shared variance in real data and equivalent increases in sample size.

(a) Equivalent increase in sample size as a function of the variance of the outcome explained by covariates, assuming initial sample sizes ranging from 100 to 10,000. (b,c) Distribution of variance explained by other variables for 79 metabolites from the PanScan study (b) and a random subsample of expression abundance estimates from 79 genes in the gEUVADIS study (c). The size of the bar corresponds to the total variance of each outcome explained by other available covariates, and the relative contributions of these covariates to each outcome are illustrated with different sets of random colors for each bar.

Selecting covariates for each outcome–predictor pair

The central problem that must be solved is how to select a subset of the available covariates to optimize power while preventing induction of false-positive associations between the outcome and the predictor. To perform this selection, all covariates associated with the outcome should be included except those also associated with the predictor. A naive solution would consist of filtering out covariates on the basis of a P-value threshold from the association test between each covariate and the predictor (for example, removing covariates with a predictor–covariate association P <0.05). However, unless the sample size were to be infinitely large, type I covariates (covariates associated with the predictor) would be included. Furthermore, such filtering would also imply that some type II covariates (covariates not associated with the predictor) would be removed because they would incidentally pass the P-value threshold. Interestingly, removing type II covariates by using this approach not only results in a suboptimal test but also induces an inflated false-positive rate (Supplementary Fig. 1). In brief, when the outcome and the covariate are correlated, a low predictor–covariate P value implies a low predictor–outcome P value. As a result, the P-value distribution from the subset of predictor–outcome-unadjusted statistics (those for which the predictor–covariate P value is below the threshold) is enriched for low P values, while the complementary subset of predictor–outcome-adjusted statistics is expected to be uniform, thus resulting in an overall inflation of type I error for the approach (Supplementary Note and Supplementary Fig. 2).
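For reference, the naive filtering rule criticized above might be sketched as follows (hypothetical function name; the 0.05 threshold is the example threshold from the text). As explained above, this rule both retains some type I covariates and discards type II covariates that incidentally pass the threshold, which inflates the type I error of the resulting test:

```python
# Sketch of the naive covariate filter described above (hypothetical function name;
# the 0.05 threshold is the example from the text). This rule retains some type I
# covariates and discards type II covariates that incidentally pass the threshold.
import numpy as np
from scipy import stats

def naive_filter(x, C, alpha=0.05):
    """Return the column indices of C retained by marginal P-value filtering on X."""
    n, m = C.shape
    keep = []
    for l in range(m):
        r = np.corrcoef(x, C[:, l])[0, 1]
        t = r * np.sqrt((n - 2) / (1 - r**2))
        p = 2 * stats.t.sf(abs(t), n - 2)
        if p >= alpha:          # covariate looks unassociated with the predictor
            keep.append(l)
    return keep
```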

In this work, we developed CMS, a computationally efficient heuristic to improve the selection of type II covariates while removing type I covariates. We present an overview of the approach, and complete details of the algorithm are provided in the Online Methods and the Supplementary Note.

Let δXC and βXY be the estimated marginal regression coefficients between X and C, and between X and Y (not adjusted for C), respectively, and let ρYC be the estimated correlation between Y and C. Naive P-value-based filtering, i.e., unconditional filtering on δXC, assumes that under the null (δ = 0), δXC is normally distributed with mean 0 and variance 1/n, where n is the sample size. The central advance of CMS is to additionally use the expected mean and variance of δXC conditional on βXY under a complete null model (δ = β = 0). We show that these can be approximated as ρYC × βXY and (1 − ρ²YC)/n, respectively (Supplementary Note and Supplementary Fig. 3).
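These approximations can be checked by simulation under the complete null; the sketch below is not part of CMS, and n, ρ, and the number of replicates are arbitrary. With standardized variables, it empirically recovers a conditional mean close to ρYC × βXY and a conditional variance close to (1 − ρ²YC)/n:

```python
# Illustrative check (not part of CMS) of the conditional moments given above,
# under the complete null (X associated with neither Y nor C) and with all
# variables standardized; n, rho, and the number of replicates are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, reps, rho = 1000, 20000, 0.6
beta_hat = np.empty(reps)
delta_hat = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    c = rho * y + np.sqrt(1 - rho**2) * rng.normal(size=n)   # corr(Y, C) ~ rho
    beta_hat[i] = x @ y / n       # marginal effect of X on Y
    delta_hat[i] = x @ c / n      # marginal effect of X on C

slope = np.cov(delta_hat, beta_hat)[0, 1] / np.var(beta_hat)
resid_var = np.var(delta_hat - slope * beta_hat)
print(slope, "~", rho)                          # conditional mean slope ~ rho_YC
print(resid_var, "~", (1 - rho**2) / n)         # conditional variance ~ (1 - rho_YC^2)/n
```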

The bias observed from naive univariate P-value filtering (Supplementary Fig. 1) is induced by the misspecification of the expected mean and variance of the estimate of the predictor–covariate effect when the predictor is associated with neither the outcome nor the covariates. The inclusion area for a P-value threshold of 5%—i.e., if δXC falls outside the inclusion area, the covariate C is filtered out—based on the unconditional distribution is illustrated in Figure 3a. Using only the distribution of δXC conditional on βXY to select covariates is also a poor solution, resulting in a deflated test statistic for the predictor–outcome association, owing to an overestimation of the standard error of the adjusted predictor–outcome effect when adjusting for the selected covariates (Supplementary Table 1 and Supplementary Fig. 4, which describe the simple case of a single covariate). The improvement from CMS is derived from defining the inclusion area as a combination of the unconditional and conditional distributions of δXC (Fig. 3b,c). This procedure solves the inflation observed in Supplementary Figure 1 and leads to a valid test under the complete null model with a variable number of available covariates (Supplementary Fig. 3 and Supplementary Table 1).

Figure 3: Conditional and unconditional distribution.

Example of inclusion area based on the distribution of δXC, the estimated effect between the predictor X and the covariate C, under the null hypothesis of no association between X and C (δ = 0) and no association between X and the outcome Y (β = 0). (a) Standard 95% confidence interval (green area) corresponding to P <0.05, unconditional on βXY. (b,c) Unconditional (blue curve) and conditional (pink curve) distributions of δXC. CMS combines the two, setting an inclusion area (blue and pink shaded) while weighting both intervals by a factor depending on the correlation between Y and C, which equals 0.5 in b and 0.8 in c. Plots were drawn on the basis of the assumption that all variables are standardized, with a sample size of 10,000, an overall explained variance of Y of 0.7, and a multivariate test of association between all covariates and Y with a P value (PMUL) of 0.3.

Finally, to decrease the risk of false positives, the algorithm scales the inclusion areas on the basis of the total amount of the outcome's variance explained by the selected covariates Cl∈L. To further improve the performance of the covariate filtering, we also considered the omnibus association test between Cl∈L and Y, which can be more effective when multiple covariates have small to moderate effects (Supplementary Note).

Simulated data analysis and method comparisons

We first assessed the performance of the proposed method through a simulation study in which we generated series of multiphenotype data sets over an extensive range of parameter settings (Online Methods and Supplementary Note). Each data set included n individuals genotyped at a SNP with the minor allele frequency (MAF) drawn uniformly from [0.05, 0.5], a normally distributed phenotype Y, and m = [10, 40, 80] correlated covariates C = (C1, C2,...Cm). Under the null, the SNP did not contribute to the phenotype, and under the alternative, the SNP contributed to the phenotype under an additive model. In some data sets, the SNP also contributed to a fraction π = [0%, 15%, 35%] of the covariates. These were the covariates that we sought to identify and filter out of the regression. We considered sample sizes (n) of 300, 2,000, and 6,000, and we varied the proportion of the variance of Y explained by C from 25% to 75%. We varied the effect of the predictor on Y and C, when relevant, from almost undetectable (median χ2 = 3) to relatively large (median χ2 = 20). For each choice of parameters, we generated 10,000 replicates and performed four association tests: (unadjusted) linear regression (LR), LR with covariates included on the basis of P-value filtering at an α threshold of 0.1 (FT), CMS, and an oracle method including only the covariates not associated with the SNP (OPT), which was the optimal test with respect to our goal. We considered a total of 432 scenarios, and the type I error rate of CMS was well calibrated across parameter ranges (Fig. 4 and Supplementary Tables 2, 3, 4). Notably, we did not consider strategies including all Cl = 1...m variables as covariates, or 'reverse regression' (MultiPhen)5, because these approaches substantially inflate the type I error rate (Supplementary Fig. 5).

Figure 4: Power and robustness quantile–quantile plots under the null and alternate distributions of P values from a series of simulations.

(a–c) Four statistical tests are compared: a standard marginal univariate test (LR); the optimally adjusted test (OPT), which includes as covariates only the outcomes not associated with the predictor; CMS; and a univariate test that includes as covariates all outcomes with a P value for association with the predictor above 0.1 (FT). Gray boxes show the genomic inflation factor λGC for the null models (top) and the estimated power at an α threshold of 5 × 10−7 (to correct for 100,000 tests) for the alternative model (bottom). Null models also include the 95% confidence interval of the −log10(P values), displayed as a gray cone around the diagonal. Simulations were taken from 100,000 data sets including 10 (a), 40 (b), and 80 (c) outcomes (Nphe) under a null model (top), in which a predictor of interest is not associated with a primary outcome but is associated with 0%, 15%, or 35% of the other outcomes with probability 0.75, 0.2, or 0.05, respectively, and under the alternative (bottom), in which the predictor is associated with the primary outcome only. The variance of the primary outcome that could be explained by the other outcomes was randomly chosen from [25%, 50%, 75%] with equal probability.

We compared the performance of CMS with that of other recently proposed multiphenotype approaches, including mvBIMBAM. The CMS approach was more than 100-fold faster than mvBIMBAM, and the two methods showed similar accuracy when they were compared with receiver-operating-characteristic curves (Supplementary Fig. 6). We also considered data-reduction techniques aimed at modeling hidden structure. For each data set, we tested the association between the primary outcome and the genotype while adding PCs or PEER factors. We observed increasing type I error rates when increasing the number of PCs or PEER factors in the model (Supplementary Fig. 7). Furthermore, at a fixed false-positive rate, when we applied CMS in addition to PEER factors, we found that CMS substantially increased the power above that gained from PEER (Supplementary Fig. 8 and Supplementary Note).

Real-data analysis

We first analyzed a set of 79 metabolites measured in 1,192 individuals genotyped at 668 candidate SNPs. We derived the correlation structure between these metabolites3 (Fig. 2b and Supplementary Fig. 9) and estimated the maximum gain in power that could be achieved by our approach in these data. The proportion of variance of each metabolite explained by the other metabolites varied between 1% and 91% (Fig. 2b). This proportion was higher than 50% for two-thirds of the metabolites, equivalent to the gain from a twofold increase in sample size. For 10% of the metabolites, other variables explained more than 80% of the variance, corresponding to a fivefold increase in sample size. In such cases, predictors explaining less than 1% of a metabolite's variation can change from undetectable (power <1%) to fully detectable (power >80%) when CMS is applied.

We performed a systematic screening for the association between each SNP and each metabolite, using both a standard univariate linear regression adjusting for potential confounding factors and CMS to identify additional covariates. Overall, both tests showed correct P-value distributions (λGC ≈ 1; Supplementary Fig. 10a). We focused on associations significant after Bonferroni correction (P < 9.5 × 10−7, corresponding to the 52,772 tests performed). The standard unadjusted approach (LR) detected five significant associations. In comparison, the CMS approach identified ten associations (Table 1), including four of the five associations identified by LR. In most cases, the P value from CMS was dramatically lower (1,000-fold smaller for rs780094 (alanine)). Comparing these results with those of four independent GWAS metabolite scans of larger sample size (study total n = 8,330 for the Finnish cohort15; 7,824 and 2,820 for Kooperative Gesundheitsforschung in der Region Augsburg (KORA) plus TwinsUK16,17; and 2,076 for the Framingham Heart Study (FHS)18), we found that all metabolite–gene associations identified by only CMS replicated (Supplementary Table 5).

Table 1 Identified signals from the association test between 79 metabolites and 668 candidate SNPs

This analysis confirmed the power of CMS, highlighting its ability to identify variants with much smaller sample sizes than those required in the standard unadjusted approach. Interestingly, the only association identified by the unadjusted analysis (lactose and GC, P = 6.1 × 10−7) and not confirmed by CMS (P = 6.3 × 10−6) was also the only one that did not replicate in the larger studies. Notably, in our analysis (Table 1), we followed an approach identical to that of the previous studies and did not adjust for either PCs or PEER factors9. However, adjusting did not qualitatively change the results. For example, we considered adjusting for 5, 10, and 20 PCs and obtained 11, 15, and 17 hits for CMS and 9, 11, and 5 hits for LR with PC covariates (Supplementary Table 6). The overall replication rate was lower when PCs were included, in agreement with a potential higher false-positive rate, as observed in our simulations.

We then considered genome-wide mapping of cis-expression quantitative trait loci (cis-eQTL) in RNA-seq data from the Genetic European Variation in Health and Disease (gEUVADIS) study. Gene expression is a particularly compelling benchmark, because the gold-standard analyses already use an adjustment strategy to account for hidden factors in eQTL GWAS9,19. We used the PEER approach9 to derive hidden factors, because this method was applied in the original analysis20. After stringent quality control, the data included 375 individuals of European ancestry with expression estimated on 13,484 genes, of which 11,675 had at least one SNP with a MAF ≥5% within 50 kb of the start and end sites.

We observed that expression levels between genes were highly correlated (Fig. 2c), an ideal scenario for CMS. We first performed a standard cis-eQTL screening using LR, testing each SNP within 50 kb of each available gene for association with the overall normalized RNA level while adjusting for ten PEER factors, for a total of 3.5 million tests. Then we applied CMS to identify, for each test, which other genes' RNA levels could be used as covariates in addition to the PEER factors. Both LR and CMS showed large numbers of highly significant associations (Supplementary Fig. 10b). For comparison purposes, we plotted the most significant SNP per gene obtained with the standard approach against those obtained with CMS (Fig. 5) and found that 2,725 genes had at least one SNP significant with both methods, whereas 56 genes were identified by only the standard approach. In contrast, 657 genes were found with only CMS, corresponding to a 22% increase in detection of cis-eQTL loci. This result indicates that, by being gene/SNP specific, CMS can recover substantial additional variance, thus allowing for increased power (Table 2 and Supplementary Table 7).

Figure 5: Analysis of the gEUVADIS data.

Plot of −log10(P values) of the most significant SNP per gene obtained by CMS (y axis) and LR (x axis) from genome-wide cis-eQTL mapping of 11,675 genes in 375 individuals from the gEUVADIS study. For illustration purposes, we truncated the plots at −log10(P value) = 30. Both CMS and LR were adjusted for ten PEER factors, and the CMS analysis also included 0–50 additional covariates per SNP–gene pair tested. We considered a stringent significance threshold of 1.4 × 10−8 to account for the approximately 3.5 million tests and derived the number of genes showing at least one cis-eQTL with LR only (blue), CMS only (red), both approaches (turquoise), or neither approach (gray).

Table 2 Replication of associations from the cis-eQTL screening in gEUVADIS

To assess the validity of our results, we performed an in silico replication analysis, using two databases of known eQTLs21,22. We found that 35% of the SNP–gene associations found by both LR and CMS replicated. For the subset of associations found by only CMS, the replication rate was 20%, similar to the 22% replication rate for associations found by only LR. The replication rate was 6% for genes without a CMS or LR association. The replications were primarily in a lymphoblastoid cell line (LCL; Table 2), and the replication rate for our study was within the same range as that of previous LCL studies (Supplementary Table 8), thus confirming that a substantial number of the additional associations identified by CMS probably corresponded to real signal (Online Methods). Additional GC correction of the P values by using inflation factors from a quasi-null experiment (λLR = 1.01 and λCMS = 1.05; Supplementary Fig. 11) did not qualitatively change the results.

Discussion

Growing collections of high-dimensional data across myriad fields, driven in part by the 'big-data revolution' and the Precision Medicine Initiative, offer the potential to gain new insights and solve open problems. However, when mining for associations between collected variables, identifying signals within the noise remains challenging. Although univariate analysis offers precision, it fails to leverage the correlation structure between variables. In contrast, joint analyses of multiple phenotypes increase power at the cost of decreased precision. Using both simulated and real data, we demonstrated that the proposed method, CMS, maintains the precision of univariate analysis but can still exploit global data structures to increase power. Indeed, in the data sets examined in this study, we observed up to a threefold increase in effective sample size in both the gene-expression and metabolite data as a result of the inclusion of relevant covariates (Supplementary Fig. 12).

CMS can be applied generally, but it is particularly well suited to the analysis of genetic data for several reasons. First, the genetic architectures of many complex phenotypes are consistent with a polygenic model with many genetic variants of small effect size that are difficult to detect with standard approaches23. Second, many correlated phenotypes share genetic and environmental variance without complete genetic overlap24. Third, the underlying structure of the genomic data is relatively well understood, and there is extensive literature describing the causal pathway from genotypes to phenotypes through direct and indirect effects on RNA, protein, and metabolites (Supplementary Fig. 13 and Supplementary Note). Finally, when the predictors of interest are genetic variants, there is less concern regarding potential confounding factors. The only well-established confounder of genetic data is population structure, and this confounding can be easily addressed through standard approaches12. For other types of data, when the underlying structure of the data is unknown, the risk of introducing bias is high.

Several other groups have considered the problem of association testing in high-dimensional data while maintaining precision. In genetics, multivariate linear mixed models (mvLMMs) have demonstrated both precision and increases in power when correlated phenotypes are tested jointly. However, mvLMMs exploit only the genetic similarity of phenotypes and are not computationally efficient enough to handle dozens of phenotypes jointly4, whereas CMS leverages both genetic and environmental correlations and can be easily adapted to hundreds or thousands of phenotypes, as demonstrated here. We therefore compared CMS with other, more closely related approaches, including the Bayesian approach mvBIMBAM and adjustment for hidden factors inferred from either principal component analysis or PEER. We found that mvBIMBAM and CMS had very similar accuracy, as measured by the area under the curve, whereas mvBIMBAM was approximately 100-fold slower and was applicable to only a small number of phenotypes (fewer than ten). As for strategies that reconstruct hidden variables, we have found that they can induce false positives25, and they are suboptimal in comparison to CMS: the gEUVADIS analysis showed a 22% increase in the detection of eQTLs when CMS was applied in addition to PEER-factor adjustment.

There are several caveats to our approach. First, the proposed heuristic is conservative by design to avoid false-association signals, and so it does not achieve all of the available power gain. Second, although all performed simulations showed strong robustness, the method remains a heuristic, like other methods9,19. Ultimately, we recommend external replication to validate results and effect sizes, as is standard in genetic studies. Third, CMS is more computationally intensive than methods such as principal component analysis or PEER. Fourth, CMS assumes that the variables are measured and available on all samples. The current implementation includes a naive missing-data imputation, and simple-case-scenario simulations showed that this strategy has a minimal effect on the robustness of CMS (Supplementary Fig. 14); however, more advanced approaches have been developed26. Fifth, although the principles that we leveraged are probably applicable to categorical and binary outcomes (logistic regression in ref. 27), our algorithm is currently applicable to only continuous outcomes. Sixth, for monogenic disorders, or phenotypes without measured intermediate endophenotypes, CMS is unlikely to result in power gains.

We focused on association screening and aimed at optimizing power and robustness. However, the selection of covariates performed by CMS might carry information about which covariates operate through specific SNPs. Future work will explore whether output from CMS can generate hypotheses about the underlying causal model. Other improvements that are not specific to CMS are also worth exploring. In particular, when multiple phenotypes are considered as outcomes, a multiple-testing-correction penalty must be selected to account for all tests across all phenotypes. In this study, we applied a Bonferroni correction, not accounting for the correlation between outcomes; this is a conservative correction, and more powerful approaches are possible28.

Large-scale genomic data have the potential to answer important biological questions and improve public health. However, these data come with methodological challenges. Many questions, such as improving risk prediction or inferring causal relationships, rely on the ability to identify associations between variables. In this study, we provide a comprehensive overview of how shared variance between variables can be leveraged toward this goal. Building on this principle, we developed the CMS algorithm, an innovative approach that can dramatically increase statistical power to detect weak associations.

Methods

The CMS algorithm.

We developed an algorithm to select relevant covariates when testing for association between a predictor X and an outcome Y. For a set of candidate covariates C = (C1, C2,...Cm), the filtering is applied on δl and Pl, the estimated marginal effect of the predictor X on Cl and its associated P value, respectively. It uses four major features: (i) the total amount of variance of Y explained by C; (ii) γl, the estimated effect of each Cl on Y, from both univariate and joint models; (iii) β, the estimated effect of X on Y from the marginal model Y ~ α + βX; and (iv) PMUL, the P value for the multivariate test of all Cl = 1...m and X, which is estimated with a standard multivariate approach (MANOVA).

Filtering is applied in two steps, using the aforementioned features and additional parameters described below. Step 1 is an iterative procedure focusing on PMUL. It consists of removing potential covariates until PMUL reaches tMUL, a P-value threshold set to 0.05 by default. This step is effective at removing combinations of covariates with strong to moderate effects but may leave weakly associated covariates.
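A sketch of step 1 is given below. It is not the released implementation: PMUL is approximated by the F-test from regressing X on the remaining candidate covariates (equivalent to the Wilks-lambda MANOVA test for a single quantitative predictor), and dropping the covariate most strongly correlated with X at each iteration is an assumption made for illustration.

```python
# Sketch of step 1, not the released implementation. P_MUL is approximated by the
# F-test from regressing X on the remaining candidate covariates (equivalent to the
# Wilks-lambda MANOVA test for a single quantitative predictor). Dropping the
# covariate most strongly correlated with X at each iteration is an assumption.
import numpy as np
from scipy import stats

def multivariate_p(x, C):
    """P value for the joint association between X and the covariate set C."""
    n, m = C.shape
    design = np.column_stack([np.ones(n), C])
    fitted = design @ np.linalg.lstsq(design, x, rcond=None)[0]
    r2 = 1 - np.sum((x - fitted) ** 2) / np.sum((x - x.mean()) ** 2)
    f = (r2 / m) / ((1 - r2) / (n - m - 1))
    return stats.f.sf(f, m, n - m - 1)

def step1(x, C, t_mul=0.05):
    """Iteratively drop covariates until the joint X-C test is no longer significant."""
    keep = list(range(C.shape[1]))
    while keep and multivariate_p(x, C[:, keep]) < t_mul:
        marg = [abs(np.corrcoef(x, C[:, l])[0, 1]) for l in keep]
        keep.pop(int(np.argmax(marg)))   # remove the covariate closest to X
    return keep
```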

Step 2 is also iterative and uses the covariates preselected at step 1. It consists of deriving two confidence intervals, Δl.cond and Δl.un, for the expected distribution of δl conditional on β under a complete null model (δl = 0 and β = 0) and for the unconditional distribution of δl, respectively. The unconditional distribution of δl can be approximated as N(0, 1/n), and the conditional distribution as N(ρYCl × β, (1 − ρ²YCl)/n), where ρYCl is the estimated correlation between Y and Cl (Supplementary Note). The inclusion area for each δl is defined as the union of Δl.cond and Δl.un, which are determined from the means and standard deviations of the conditional and unconditional distributions and from distribution-specific weights wu and wc, which we introduced to improve power and robustness. Specifically, Δl.un = [μun − wu × σun, μun + wu × σun] and Δl.cond = [μcond − wc × σcond, μcond + wc × σcond], where (μun, σun) and (μcond, σcond) are the unconditional and conditional means and s.d., respectively.
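The inclusion decision for a single candidate covariate can then be sketched as below, assuming standardized variables and intervals of the form mean ± weight × s.d. The weights wu and wc are treated here as given inputs; their derivation is described in the following paragraphs and the Supplementary Note.

```python
# Sketch of the step-2 inclusion rule for one candidate covariate, assuming
# standardized variables and intervals of the form mean +/- weight * s.d.
# The weights w_u and w_c are taken as given inputs here; their derivation
# (stringency parameter and transition point) is described below.
import numpy as np

def keep_covariate(delta_hat, beta_hat, rho_yc, n, w_u, w_c):
    """Keep C_l if delta_hat falls within the union of the two intervals."""
    mu_un, sd_un = 0.0, np.sqrt(1.0 / n)                 # unconditional N(0, 1/n)
    mu_cd = rho_yc * beta_hat                            # conditional mean
    sd_cd = np.sqrt((1.0 - rho_yc**2) / n)               # conditional s.d.
    in_unconditional = abs(delta_hat - mu_un) <= w_u * sd_un
    in_conditional = abs(delta_hat - mu_cd) <= w_c * sd_cd
    return in_unconditional or in_conditional
```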

The weights wu and wc are always less than two and shrink the size of the inclusion area. To obtain (wu, wc), we first set an ad hoc stringency parameter that decreases, thereby making the inclusion area smaller, as the covariate Cl being considered explains more of the variance of Y. The purpose of this parameter is to decrease the risk of false positives, because bias is enhanced when the residual variance of the outcome is decreased14. This phenomenon is illustrated in Figure 3, in which the unconditional inclusion area from CMS is smaller than that for the standard approach.

As β increases, the likelihood that the true β is null decreases, and we want wc, and hence the conditional interval Δl.cond, to shrink to zero. We use a simple linear function for wc, with a transition that corresponds to the point at which the 95% confidence intervals of the observed effect estimates stop overlapping; assuming that all variables are standardized, the resulting transition point and its chi-square expression are derived in the Supplementary Note. We set wu and wc to vary between 0 and 2 and defined them to scale linearly with respect to this transition point (Supplementary Note).

Altering the transition point or scaling the inclusion interval can increase the risk of false positives or decrease power (Supplementary Figs. 15, 16, 17). We chose the CMS parameters conservatively to prevent false positives; however, alternative approaches such as cross-validation may identify parameters that increase the power of CMS while maintaining a calibrated null distribution. Interestingly, the omnibus association test between Cl∈L and Y had very little effect on the overall performance (Supplementary Fig. 17) with the parameters used here.

Finally, because of multicollinearity, the estimated γl can vary substantially depending on which other covariates Ck≠l are already included in the model. As a result, γl cannot be estimated from a marginal model such as Y ~ γlCl. To address this issue, we implemented the selection of covariates as an iterative loop in which the γl terms are re-estimated from a joint model each time a candidate covariate is excluded. The complete CMS algorithm is provided in the Supplementary Note.
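The overall selection loop might be organized as in the simplified sketch below (not the released implementation). The include() argument stands in for an inclusion-area rule such as the one sketched above, and passing it the re-estimated joint coefficient γl is only meant to reflect that the filtering stringency depends on the variance explained.

```python
# Simplified sketch of the iterative selection loop described above; not the released
# implementation. 'include' stands for an inclusion-area rule such as keep_covariate()
# above, and passing the re-estimated joint coefficient gamma_l to it is only meant
# to reflect that the filtering stringency depends on the variance explained.
import numpy as np

def joint_effects(Y, C_sub):
    """Re-estimate the gamma_l coefficients of Y on the retained covariates jointly."""
    design = np.column_stack([np.ones(len(Y)), C_sub])
    return np.linalg.lstsq(design, Y, rcond=None)[0][1:]

def cms_like_selection(Y, x, C, include, max_iter=1000):
    n, m = C.shape
    keep = list(range(m))
    beta_hat = x @ Y / n                               # marginal X-Y effect
    for _ in range(max_iter):
        gamma = joint_effects(Y, C[:, keep])           # joint re-estimation
        dropped = False
        for pos, l in enumerate(keep):
            delta_hat = x @ C[:, l] / n                # marginal X-C_l effect
            rho = np.corrcoef(Y, C[:, l])[0, 1]
            if not include(delta_hat, beta_hat, rho, n, gamma[pos]):
                keep.pop(pos)                          # exclude, then re-estimate
                dropped = True
                break
        if not dropped:
            break
    return keep
```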

Simulations.

We simulated series of genetic and phenotypic data sets under a variety of genetic models to interrogate the properties of the proposed test. Each data set included n individuals genotyped at a SNP, a normally distributed phenotype Y, and m = [10, 40, 80] correlated covariates C = (C1, C2,...Cm). Genotypes g for each of the individuals were generated by summing two samples from a binomial distribution with probability uniformly drawn in [0.05, 0.5] and then normalized to have mean 0 and variance 1. Under the null, the SNP does not contribute to the phenotype, and under the alternative, the SNP contributes to the phenotype under an additive model. In some data sets, the SNP also contributes to a fraction π = [0%, 15%, 35%] of the covariates. Those were the covariates that we sought to identify and filter out of the regression. The remaining variance for each phenotype, which represents the remaining genetic and environmental variance, was drawn from an (m + 1)-dimensional multivariate normal distribution with mean 0 and covariance matrix σC. In instances in which this matrix was not positive definite, we used the Higham algorithm29 to find the closest positive definite matrix. The diagonal of the covariance matrix was specified as 1 minus the effect of g (if relevant), such that the total variance of each phenotype had an expected value of 1.
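The data-generating scheme can be sketched as follows. This is illustrative only: the exchangeable correlation structure, the effect sizes, and the 1 − effect² diagonal scaling are simplifying assumptions standing in for the exact specifications given above and in the Supplementary Note.

```python
# Sketch of the data-generating scheme described above. Illustrative only: the
# exchangeable correlation, the effect sizes, and the 1 - effect^2 diagonal scaling
# are simplifying assumptions standing in for the exact specifications.
import numpy as np

rng = np.random.default_rng(2)
n, m = 2000, 40                                   # sample size, number of covariates
maf = rng.uniform(0.05, 0.5)
g = rng.binomial(1, maf, size=(n, 2)).sum(axis=1).astype(float)
g = (g - g.mean()) / g.std()                      # mean 0, variance 1

pi = 0.15                                         # fraction of covariates hit by the SNP
beta_y = 0.05                                     # SNP effect on Y (alternative model)
beta_c = np.where(rng.random(m) < pi, 0.05, 0.0)  # SNP effects on the covariates

rho = 0.4                                         # exchangeable residual correlation
sigma = np.full((m + 1, m + 1), rho)
np.fill_diagonal(sigma, 1.0)
effects = np.concatenate([[beta_y], beta_c])
scale = np.sqrt(1.0 - effects**2)                 # keep total variance of each trait ~1
resid = rng.multivariate_normal(np.zeros(m + 1), sigma, size=n) * scale
Y = beta_y * g + resid[:, 0]
C = beta_c * g[:, None] + resid[:, 1:]
```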

We considered sample sizes (n) of 300, 2,000, and 6,000, and we varied the proportion of the variance of Y explained by C from 25% to 75%. We varied the effect of the predictor on Y and C, when relevant, from almost undetectable (median χ2 = 3) to relatively large (median χ2 = 20). For each choice of parameters, we generated 10,000 replicates and performed four association tests: (unadjusted) LR, LR with covariates included on the basis of P-value filtering at an α threshold of 0.1 (FT), CMS, and an oracle method including only the covariates not associated with the SNP (OPT), the optimal test with respect to our goal. For each null model, we derived the genomic inflation factor30 λGC, whereas for the alternative model, we estimated power at an α threshold of 5 × 10−7 to account for the 100,000 tests performed. All tests were two sided. Results for each of the 432 scenarios considered are presented in Supplementary Figures 18–44.

To comprehensively summarize the performance of the different tests across these scenarios, we randomly sampled subsets of the simulations to mimic real data sets while focusing on a sample size of 2,000 individuals and a total of 100,000 SNPs tested. For null models, we assumed that two-thirds (66%) of the genotypes were under the complete null (not associated with any covariate, π = 0), whereas 27% were associated with a small proportion of the covariates (π = 0.15), and the remaining 7% were highly pleiotropic (π = 0.35).

We compared the performance of CMS with that of other recently proposed multiphenotype approaches, including mvBIMBAM, a Bayesian approach to classifying the outcome as directly associated, indirectly associated, or unassociated with the predictor. The main advantage of the mvBIMBAM approach is that it proposes a formal theoretical framework that, similarly to structural equation modeling, explores a wide range of underlying causal models. However, there is a large computational cost, and the approach is currently limited to the analysis of a relatively small number of traits (fewer than ten). We therefore performed our comparison by using small-scale simulated data (ten phenotypes).

Other potential alternatives to CMS are data-reduction techniques for modeling hidden structure. These methods have been widely used for the analysis of molecular phenotypic data, with a primary goal of removing confounding effects8,9,19. We examined principal component analysis, because it remains one of the most commonly used approaches8, and a more complex factor-analysis-inspired method (PEER), which has outperformed similar methods9. We simulated series of large multivariate data sets under a null model, in which a genotype is associated with multiple variables but not the primary outcome of interest (i.e., in the presence of type I covariates). For each data set, we tested the association between the primary outcome and the genotype while adding PCs or PEER factors (Supplementary Fig. 7) and found increasing type I error rates as the number of PCs or PEER factors in the model increased.

Previous studies have also shown that including fixed effects can improve power over dimensionality-reduction approaches that incorporate these same variables31, probably as a result of the shrinkage that is applied when these methods jointly fit effect sizes of multiple correlated variables. To investigate the power gains available to CMS when PCs or PEER factors are used, and assuming that type I error is controlled, we simulated data under an alternative model of true association but in the absence of type I covariates, to avoid the aforementioned issue. We applied CMS in addition to a variable number of PEER factors and found that CMS can substantially increase the power above that gained from PEER (Supplementary Fig. 8).

Metabolite data.

Circulating metabolites were profiled by liquid chromatography–tandem mass spectrometry (LC–MS) in prediagnostic plasma from 453 prospectively identified pancreatic cancer cases and 898 controls. The subjects were drawn from four US cohort studies: the Nurses' Health Study (NHS), Health Professionals Follow-up Study (HPFS), Physicians' Health Study (PHS), and Women's Health Initiative (WHI). Two controls were matched to each case on the basis of year of birth, cohort, smoking status, fasting status at the time of blood collection, and month/year of blood collection. Metabolites were measured in the laboratory of C. Clish at the Broad Institute by using the methods described in Wang et al.32 and Townsend et al.33. A total of 133 known metabolites were measured; 50 were excluded from analysis because of poor reproducibility in samples with delayed processing (n = 32), CV >25% (n = 13), or undetectable levels for >10% of subjects (n = 5). The remaining 83 metabolites showed good reproducibility in technical replicates or after delayed processing33. Among those, 79 had no missing data and were considered further for analysis. Additional details of these data can be found in ref. 34. Genotypic data were also available for some of these participants. A subset of 645 individuals from NHS, HPFS, and PHS had genome-wide genotype data as part of the PanScan study35. Among the remaining participants, 547 had been genotyped for 668 SNPs chosen to tag genes in inflammation, vitamin D, and immunological pathways. To maximize sample size, we focused our analysis on these 668 SNPs, which were therefore available in a total of 1,192 individuals. The in-sample MAFs of these variants ranged from 1.1% to 50%. The metabolite levels were approximately Gaussian after adjustment for the confounding factors and were therefore not transformed further (Supplementary Fig. 45). We first applied standard linear regression testing of each SNP for association with each metabolite while adjusting for five potential confounding factors: pancreatic cancer case–control status, age at blood draw, fasting status, self-reported race, and sex. We then applied CMS while also including the five confounding factors as covariates. All tests were two sided.

gEUVADIS data.

The gEUVADIS data20 consist of RNA-seq data for 464 LCL samples from five populations in the 1000 Genomes Project. Of these, 375 are of European ancestry (CEU, FIN, GBR, and TSI), and 89 are of African ancestry (YRI). In these analyses, we considered only the European-ancestry samples. Raw RNA-sequencing reads obtained from the European Nucleotide Archive were aligned to the transcriptome by using UCSC annotations matching hg19 coordinates. RNA-seq by expectation-maximization (RSEM)36 was used to estimate the abundance of each annotated isoform, and total gene abundance was calculated as the sum of all isoform abundance values normalized to one million total counts or transcripts per million (TPM). For each population, TPM values were log2 transformed and median normalized to account for differences in sequencing depth in each sample. A total of 29,763 genes were initially available. We removed those that appeared to be duplicates or that had low expression (defined as log2(TPM) <2 in all samples). After filtering, 13,484 genes remained. The genotype data were obtained from the 1000 Genomes Project Phase 1 data set. We restricted the analysis to the SNPs with a MAF ≥5% that were within ±50 kb of the gene tested for cis effects. A total of 11,175 genes had at least one SNP that matched those criteria. We performed standard cis-eQTL screening, first applying standard linear regression while adjusting for ten PEER factors. We then applied CMS while including the same PEER factors as covariates. All tests were two sided.

When running CMS, we performed prefiltering of the candidate covariates. More specifically, for each gene analyzed—referred to as the target gene—we restricted the number of candidate covariates (genes other than the target) to be evaluated. First, we aimed at avoiding genes whose expression was more likely to be associated with some of the SNPs tested because of a cis effect, because such genes were more likely to induce false signal. Thus, all genes in physical proximity to the target gene (≤1 Mb) were excluded. Second, we aimed at decreasing the number of candidate covariates (13,484 minus 1, a priori), because most of them were likely to be uninformative and because our simulations showed that, for small sample sizes, CMS has low robustness when the number of candidate covariates is too large. To do so, we performed an initial screening for association between the target and all other genes and used the top 50 showing the strongest squared correlation with the target.
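The prefiltering step can be sketched as follows. The data structures and the function name are hypothetical placeholders; the 1-Mb exclusion window and the top-50 rule are those described above.

```python
# Sketch of the covariate prefiltering described above: for a target gene, exclude
# genes within 1 Mb and keep the 50 remaining genes most strongly correlated with it.
# The data structures and the function name are hypothetical placeholders.
import numpy as np

def candidate_covariates(expr, genes, target, chrom, start, end, k=50, window=1_000_000):
    """expr: samples x genes matrix; genes: list of (name, chrom, start, end) tuples."""
    t_idx = [g[0] for g in genes].index(target)
    scored = []
    for j, (_name, c, s, e) in enumerate(genes):
        if j == t_idx:
            continue
        near = (c == chrom) and not (e < start - window or s > end + window)
        if near:
            continue                               # avoid genes with potential cis effects
        r2 = np.corrcoef(expr[:, t_idx], expr[:, j])[0, 1] ** 2
        scored.append((r2, j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]              # top-k squared correlations
```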

We performed an in silico replication analysis by using two databases of known eQTLs. The first database included results from 15 publicly available studies (excluding the European gEUVADIS) from multiple tissues21, and the second included eQTLs in whole-blood samples from a joint analysis of seven studies22. Summary statistics were not available for every SNP; instead, these databases listed all SNPs found at an FDR of 5% in each study. Therefore, we were not able to perform a standard replication study and instead compared the replication rates of CMS and LR in these databases. Notably, we expected a smaller replication rate for associations identified by LR only or CMS only than for those identified by both approaches, because the latter group includes variants with the largest effects, whereas the first two correspond to associations of smaller magnitude. Finally, we performed a quasi-null experiment in which we tested for trans effects by using random SNPs from the genome, assuming that most of those would be under the null.

Variance explained in multiple regressions.

We plotted the variance of a set of outcomes Y = (Y1,...YK) that could be explained by covariates in the data, i.e., how much of the variance of Yi could be explained by Yj≠i (Fig. 2b,c). For illustration purposes, we also approximated the individual contribution of each Yj≠i covariate. In brief, we standardized all variables and estimated r²ij, the proportion of variance of the outcome explained by each Yj≠i, from the marginal models Yi ~ Yj, and R²i, the total variance of Yi explained by all Yj≠i jointly, from the joint model of Yi on all Yj≠i. Then, we derived vij, an approximation of the relative contribution of each Yj≠i to the variance of Yi, as follows: vij = R²i × r²ij / Σk≠i r²ik.

Notably, this is an arbitrary rescaling of the real contribution of the Yj≠i variables. Indeed, the correlation between all Yj≠i induces multicollinearity in the regression, and it follows that the marginal contributions r²ij do not sum to R²i.
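A sketch of this decomposition, as used to draw the stacked bars in Figure 2b,c, is shown below. It is an illustrative implementation of the rescaling described above, i.e., the marginal r²ij values are rescaled so that they sum to the joint R²i.

```python
# Sketch of the decomposition plotted in Fig. 2b,c: the marginal r2_ij values are
# rescaled so that they sum to the joint R2_i (illustrative implementation).
import numpy as np

def variance_contributions(Y):
    """Y: samples x K matrix of standardized outcomes; returns the K x K matrix v."""
    n, K = Y.shape
    v = np.zeros((K, K))
    for i in range(K):
        others = [j for j in range(K) if j != i]
        design = np.column_stack([np.ones(n), Y[:, others]])
        fitted = design @ np.linalg.lstsq(design, Y[:, i], rcond=None)[0]
        R2 = 1 - np.sum((Y[:, i] - fitted) ** 2) / np.sum((Y[:, i] - Y[:, i].mean()) ** 2)
        r2 = np.array([np.corrcoef(Y[:, i], Y[:, j])[0, 1] ** 2 for j in others])
        v[i, others] = R2 * r2 / r2.sum()          # rescale marginals to sum to R2_i
    return v
```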

Missing data.

The current version of the algorithm includes a naive imputation strategy for missing data that consists of replacing missing values of candidate covariates with their mean values, thereby avoiding the sharp decrease in sample size that might otherwise arise when the proportion of missing values is large. Notably, this imputation was performed per predictor–outcome pair and only for the covariates; we did not infer missing values for the outcome or the predictor tested. The imputation did not strongly affect the robustness of the test (Supplementary Fig. 14), although large-scale (i.e., ≥50% missing values) random missingness appeared to slightly deflate the test statistics from CMS.

Code availability.

An implementation of the approach is freely available at https://github.com/haschard/CMS/.

Data availability.

The gEUVADIS RNA-sequencing data, genotype data, variant annotations, splice scores, quantifications, and QTL results are freely and openly available with no restrictions at http://www.geuvadis.org/. The metabolite data that support the findings of this study are available from the corresponding author upon reasonable request. A Life Sciences Reporting Summary for this paper is available.
