Replying to M. Bouhaddou et al. Nature 540, 10.1038/nature20580 (2016).

In the accompanying Comment1, the authors make two main claims: (1) that viability metrics computed over the drug concentration range shared between datasets yield higher consistency than the same metrics computed over the full, often only partially overlapping, drug concentration range; and (2) that binary drug sensitivity classification (insensitive versus sensitive) as determined by manual curators increases the consistency of pharmacological profiles between the Cancer Genome Project (CGP)2 and Cancer Cell Line Encyclopedia (CCLE)3. We appreciate the innovative approach followed by the authors, and our reanalysis confirms the marginal, but statistically significant, increase in consistency achieved through the use of viability metrics computed across a reduced but common range. However, our results indicate that manual classification of drug dose–response curves does not significantly increase the agreement between drug sensitivity calls compared to computational approaches. Notably, it is unclear whether the authors’ manual approach will improve reproducibility of the biomarker discovery process, as collapsing complex curves into discrete categories may result in substantial information loss. Here we provide specific responses to the main results reported by Bouhaddou et al.1

Similar to Pozdeyev et al.4 and the Comment by Mpindi et al.5, the authors investigated whether sensitivity metrics computed from the drug concentration range shared between CGP and CCLE yield higher consistency1. Using the authors’ code1, we were able to implement their slope (ms) and area under the curve (AUCs) metrics (in which subscript ‘s’ denotes shared dose range) in our PharmacoGx platform6, with a minor improvement of the ms metric to prevent highly sensitive cell lines with flat drug dose–response curves from being classified as insensitive (see Supplementary Methods). We compared the AUC and m metrics computed from the full and shared drug concentration ranges for the pooled set of drug sensitivities. Our implementation of the drug dose–response curve fitting and sensitivity computation further improved the authors’ results: the initial correlations for ms (ρ = 0.52) and AUCs (ρ = 0.61) both increased to 0.67. We then tested whether the common viability metrics constituted a significant improvement over computations on the full drug concentration range. We observed a small but statistically significant improvement for both the ms and AUCs metrics (test of difference in correlations, P < 0.01; Supplementary Fig. 1). Stratifying our analysis per drug, we observed that the improvement in consistency, although significant, was marginal, with the exception of nilotinib (Supplementary Fig. 2). However, most of the drugs still yielded poor consistency (ρ < 0.5), which is in line with both our initial report7 and our more recent reanalysis8.
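To make the shared-range idea concrete, the sketch below computes an AUC-type sensitivity value from a fitted three-parameter Hill curve over the full versus the shared (intersected) dose range. It is a minimal illustration in Python with hypothetical curve parameters and concentration ranges; it is not the authors’ code nor our PharmacoGx implementation.

```python
# Illustrative sketch only (hypothetical parameters; not the authors' code or
# the PharmacoGx implementation): AUC-type sensitivity over full vs shared range.
import numpy as np

def hill_viability(conc, einf, ec50, hs):
    """Three-parameter Hill model: predicted viability at a given concentration."""
    return einf + (1.0 - einf) / (1.0 + (conc / ec50) ** hs)

def auc_sensitivity(einf, ec50, hs, conc_range, n_points=1000):
    """Average of (1 - viability) over log-spaced concentrations in conc_range."""
    conc = np.logspace(np.log10(conc_range[0]), np.log10(conc_range[1]), n_points)
    log_conc = np.log10(conc)
    return np.trapz(1.0 - hill_viability(conc, einf, ec50, hs), log_conc) / (
        log_conc[-1] - log_conc[0])

# Hypothetical fitted parameters for the same cell line and drug in two datasets,
# and hypothetical tested concentration ranges (in micromolar).
params_cgp, params_ccle = (0.2, 0.5, 1.0), (0.3, 0.8, 1.2)
range_cgp, range_ccle = (0.008, 10.0), (0.0025, 8.0)
range_shared = (max(range_cgp[0], range_ccle[0]), min(range_cgp[1], range_ccle[1]))

print("full-range AUC:  ",
      auc_sensitivity(*params_cgp, range_cgp), auc_sensitivity(*params_ccle, range_ccle))
print("shared-range AUC:",
      auc_sensitivity(*params_cgp, range_shared), auc_sensitivity(*params_ccle, range_shared))
```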

The authors investigated discretization of their continuous metrics to test whether binary classification would yield higher consistency, as estimated by the overall percentage agreement in drug sensitivity calls. However, such a statistic does not take into account the agreement that would be expected purely by chance owing to the large proportion of cell lines being insensitive to the tested drugs. The Matthews correlation coefficient (MCC)9 addresses this issue. It is a balanced measure that can be used when the classes are of different sizes, and its significance can be computed using the χ2 statistic for binary classes (Supplementary Methods). We illustrate the case of four drugs with different patterns of consistency in Supplementary Fig. 3. Although all four drugs yield an overall agreement of greater than 92%, they exhibit a wide range of MCC values. Nilotinib is a good example of a consistent drug phenotype across cell lines (MCC = 0.86; Supplementary Fig. 3a). PLX4720 yields moderate consistency (MCC = 0.68; Supplementary Fig. 3b). AZD0530 and erlotinib show only poor consistency (MCC = 0.42 and −0.05, respectively; Supplementary Fig. 3c, d). These examples support MCC as an appropriate statistic to discriminate between highly consistent drug sensitivity calls and those with poor concordance. We therefore used the MCC to compare different classification schemes, including those proposed by the authors.
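For completeness, the sketch below illustrates how the MCC and its χ²-based P value can be computed for two vectors of binary sensitivity calls, using the identity that, for a 2 × 2 table, χ² = n × MCC² with one degree of freedom. The calls shown are hypothetical and the helper functions are illustrative rather than part of PharmacoGx.

```python
# Illustrative sketch: MCC for binary sensitivity calls and its chi-squared
# significance (hypothetical calls; helper functions are not part of PharmacoGx).
import numpy as np
from scipy.stats import chi2

def matthews_cc(calls_a, calls_b):
    """Matthews correlation coefficient between two vectors of binary calls."""
    a, b = np.asarray(calls_a, dtype=bool), np.asarray(calls_b, dtype=bool)
    tp = np.sum(a & b); tn = np.sum(~a & ~b)
    fp = np.sum(~a & b); fn = np.sum(a & ~b)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (float(tp) * tn - float(fp) * fn) / denom if denom > 0 else 0.0

def mcc_pvalue(mcc, n):
    """For a 2x2 table, chi-squared = n * MCC^2 with one degree of freedom."""
    return chi2.sf(n * mcc ** 2, df=1)

# Hypothetical sensitivity calls for the same cell lines in CGP and CCLE.
cgp  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
ccle = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
mcc = matthews_cc(cgp, ccle)
print(mcc, mcc_pvalue(mcc, len(cgp)))
```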

Recognizing the difficulty of summarizing drug dose–response curves computationally, Bouhaddou et al.1 used an unconventional approach to increase the consistency of drug sensitivity calls: they gathered a team of eight curators and asked them to classify each drug dose–response curve as either sensitive or insensitive. The authors report a Cohen’s kappa (κ) value of 0.53, which is in line with our estimated MCC value of 0.53 (Supplementary Fig. 4). The authors described their manual classification as showing high and statistically significant consistency. We disagree with the authors’ claim1 that their results provide evidence for high consistency. We refer them to the standards for strength of agreement for κ defined previously10, which would classify the observed consistency as only moderate. More importantly, when classifications are stratified by drug, we do not observe a significant improvement of manual curation over the computational classifications based on AUCs and ms values (P > 0.12, Wilcoxon signed-rank test; Supplementary Fig. 5). Consistent with our previous report, 10 out of 15 drugs (66.7%) yielded poor consistency (MCC < 0.5).
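The per-drug comparison amounts to a paired test on MCC values across the 15 drugs shared between the two studies. Below is a minimal sketch of such a Wilcoxon signed-rank comparison; the per-drug MCC values are placeholders and do not correspond to the numbers reported in this Reply.

```python
# Illustrative sketch of the per-drug comparison between classification schemes:
# a Wilcoxon signed-rank test on paired per-drug MCC values (placeholder numbers).
from scipy.stats import wilcoxon

# Hypothetical per-drug MCCs for manual curation and for AUCs-based calls
# (one pair per drug shared between CGP and CCLE).
mcc_manual = [0.86, 0.68, 0.42, -0.05, 0.35, 0.51, 0.30, 0.44,
              0.22, 0.48, 0.55, 0.19, 0.33, 0.41, 0.28]
mcc_aucs   = [0.84, 0.70, 0.40, -0.02, 0.31, 0.49, 0.28, 0.47,
              0.20, 0.45, 0.52, 0.21, 0.30, 0.43, 0.27]

stat, pval = wilcoxon(mcc_manual, mcc_aucs)
print(f"Wilcoxon signed-rank P = {pval:.3f}")  # a large P indicates no significant difference
```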

By pooling drug sensitivity data across drugs, the authors noticed a good quantitative agreement between the two studies, with estimated Pearson correlation coefficients (ρ) of 0.52 and 0.61 for the ms and AUCs values, respectively1; our improved implementation of their method increased both correlations to 0.67. Nevertheless, we disagree that this level of correlation constitutes evidence for good agreement, and define it as only moderate consistency based on the interpretation scale of our initial study7. More importantly, the common viability metrics only marginally improved consistency at the level of individual drugs (except for nilotinib), with most of the drugs yielding inconsistent drug sensitivity values (ρ < 0.5; Supplementary Fig. 2). The authors made a similar observation1 in their figure 1f, undermining their claim of drug response consistency between CGP and CCLE. Moreover, the main goal of CGP and CCLE consisted of finding new associations between molecular features and sensitivity to specific drugs2,3. Since biomarkers are to be found for each drug separately, it is vital that pharmacological profiles are highly consistent at the level of individual drugs and not merely when averaged across a larger dataset.

In conclusion, our re-analysis of the new AUCs and ms metrics described by Bouhaddou et al.1 showed that they represent a statistically significant improvement over the published drug sensitivity values, but the increase in consistency is only marginal for the vast majority of the drugs tested in both CCLE and CGP. Furthermore, manual classification of the drug dose–response curves does not appear to substantially improve the consistency of binary sensitivity calls over computational approaches and is not a scalable method. However, the authors1 showed that manually classified drug dose–response curves could be used as a benchmark to train nonlinear computational predictors that could take into account the peculiar features of each individual dataset. Although there is no evidence that the authors’ approaches1 improve reproducibility of biomarker discovery for individual drugs, their work may open a new avenue of research in pharmacogenomics. Manual curation and further exploration of new large datasets such as CTRPv2 (ref. 11) and GDSC1000 (ref. 12)—containing approximately 395,000 and 225,000 individual curves, respectively—will present major challenges, but the investment in these large pharmacogenomic datasets warrants such efforts.

Author A. C. Jin was a student in A.H.B.’s laboratory and left shortly after publication of the initial study, and did not participate in the writing of this Reply. Authors Z.S., P.S. and M.F. developed the PharmacoGx software package, which enabled the analyses presented here; A.G. helped with the comparison of the different drug sensitivity metrics, and participated in the interpretation of the results and the writing of this Reply.

Methods

The methods are described in detail in the Supplementary Information. The code and associated files required to reproduce this analysis are publicly available on the cdrug-rebuttals GitHub repository (https://github.com/bhklab/cdrug-rebuttals). The procedure to set up the software environment and run our analysis pipeline is provided in the Supplementary Information. This work complies with the guidelines proposed previously13 in terms of code availability and replicability of results.