To the Editor: Guo et al.1 analyzed the methylation state of successive CpG sites on DNA fragments from several human tissues, including tumor samples as well as blood plasma, from multiple individuals with and without lung or colorectal cancer. They reported 147,888 loci at which the methylation state of successive CpGs was highly correlated, referring to these as methylation haplotype blocks, analogous to the haplotype blocks that are characteristic of genotype data. As a measure of how CpG methylation extends across a haplotype block, the authors introduced the methylation haplotype load (MHL) metric. They argued that MHL signatures exist for different tissue types and cancers, and that these signatures can be used to detect the presence of cancer and the affected organ, using cell-free DNA (cfDNA) from blood plasma. Unfortunately, the report contains several serious shortcomings that lead us to question the validity of the findings.
The extent to which MHL values across methylation haplotype blocks cluster samples by tissue of origin, developmental stage (stem versus adult), and disease (cancer versus normal) is shown in Fig. 3 of Guo et al.1. The authors interpret the figure as showing that “unsupervised clustering with the 15% most variable [methylation haplotype blocks] showed that, regardless of the data sources, samples of the same tissue origin clustered together.” This is incorrect: aside from the cancer and H1 cells, the remaining samples show very limited clustering by tissue of origin. This contrasts with the very clear clustering achieved when the clustering is based on selected methylation haplotype blocks that had high MHL values only in specific tissues. However, here and in multiple other places, it is not clear that the authors maintained the required separation of training and test data. The authors appear to have performed feature selection and assessed the performance of their clustering method using at least some of the same data. For example, only one thymus sample appears in Supplementary Table 13d. Presumably, it was used for feature selection, but it is nevertheless included in Fig. 3c. Similarly, all six heart samples listed in Supplementary Table 13d are included in Fig. 3c. At least one of these samples must have been used for feature selection. A careful comparison of the figure with the table shows that this issue arises for every tissue type.
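The general concern about reusing samples for both feature selection and evaluation can be illustrated with a minimal sketch on synthetic data. It is not a reanalysis of the data of Guo et al., and it does not reproduce their clustering procedure. The feature matrix is pure random noise, the two class labels are arbitrary, and the function names (`top_features`, `nearest_centroid_predict`, `loo_accuracy`) as well as the nearest-centroid classifier are assumptions chosen only for illustration. Because no real signal exists, any apparent separation of the classes is an artifact of selecting features on samples that are later used for evaluation.

```python
import numpy as np


def top_features(X, y, k):
    """Return indices of the k features with the largest absolute
    difference between the two class means (illustrative selector)."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closer centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy. If select_inside_fold is False, features
    are chosen once using ALL samples, including each held-out one."""
    n = len(y)
    feats_all = None if select_inside_fold else top_features(X, y, k)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i
        if select_inside_fold:
            feats = top_features(X[train], y[train], k)
        else:
            feats = feats_all
        pred = nearest_centroid_predict(
            X[train][:, feats], y[train], X[i:i + 1, feats]
        )
        correct += int(pred[0] == y[i])
    return correct / n


# Synthetic data: 20 samples, 5,000 features, no true class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5000))
y = np.repeat([0, 1], 10)

leaky = loo_accuracy(X, y, k=20, select_inside_fold=False)
honest = loo_accuracy(X, y, k=20, select_inside_fold=True)
print("selection on all samples:", leaky)
print("selection inside each fold:", honest)
```

In this sketch the only difference between the two runs is whether the held-out sample contributes to feature selection. When it does, the selected features separate the classes far better than chance even though the data are noise, which is why a sample that appears both in the feature-selection set and in the evaluated figure cannot be taken as independent evidence of tissue specificity.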