Cardelino: computational integration of somatic clonal substructure and single-cell transcriptomes

McCarthy, Davis J.; Rostom, Raghd; Huang, Yuanhua; Kunz, Daniel J.; Danecek, Petr; Bonder, Marc Jan; Hagai, Tzachi; Lyu, Ruqian; Wang, Wenyi; Gaffney, Daniel J.; Simons, Benjamin D.; Stegle, Oliver; Teichmann, Sarah A.

doi:10.1038/s41592-020-0766-3

Article
Published: 16 March 2020

Nature Methods volume 17, pages 414–421 (2020)

Abstract

Bulk and single-cell DNA sequencing has enabled reconstructing clonal substructures of somatic tissues from frequency and cooccurrence patterns of somatic variants. However, approaches to characterize phenotypic variations between clones are not established. Here we present cardelino (https://github.com/single-cell-genetics/cardelino), a computational method for inferring the clonal tree configuration and the clone of origin of individual cells assayed using single-cell RNA-seq (scRNA-seq). Cardelino flexibly integrates information from imperfect clonal trees inferred based on bulk exome-seq data, and sparse variant alleles expressed in scRNA-seq data. We apply cardelino to a published cancer dataset and to newly generated matched scRNA-seq and exome-seq data from 32 human dermal fibroblast lines, identifying hundreds of differentially expressed genes between cells from different somatic clones. These genes are frequently enriched for cell cycle and proliferation pathways, indicating a role for cell division genes in somatic evolution in healthy skin.

**Fig. 1: Overview and validation of the cardelino model.**

**Fig. 2: Parallel deep-exome sequencing and scRNA-seq profiling of 32 human dermal fibroblast lines.**

**Fig. 3: Clone-specific transcriptome profiles reveal gene expression differences for joxm, one example line.**

**Fig. 4: Signatures of transcriptomic clone-to-clone variation across 31 lines.**

Data availability

scRNA-seq data have been deposited in the ArrayExpress database at EMBL-EBI (www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-7167. WES data are available through the HipSci portal (www.hipsci.org). The lines used in this study have the identifiers: euts, fawm, feec, fikt, garx, gesg, heja, hipn, ieki, joxm, kuco, laey, lexy, naju, nusw, oaaz, oilg, pipw, puie, qayj, qolg, qonc, rozh, sehl, ualf, vass, vils, vuna, wahn, wetu, xugn, zoxy. Metadata, processed data and large results files are available at https://doi.org/10.5281/zenodo.1403510.

Code availability

The cardelino methods are implemented in an open-source, publicly available R package (github.com/single-cell-genetics/cardelino). The code used to process and analyse the data is available (github.com/davismcc/fibroblast-clonality), with a reproducible workflow implemented in Snakemake⁶⁴. Descriptions of how to reproduce the data processing and analysis workflows, with html output showing code and figures presented in this paper, are available at davismcc.github.io/fibroblast-clonality. Docker images providing the computing environment and software used for data processing (hub.docker.com/r/davismcc/fibroblast-clonality/) and data analyses in R (hub.docker.com/r/davismcc/r-singlecell-img/) are publicly available.

References

Burnet, F. M. Intrinsic mutagenesis: a genetic basis of ageing. Pathology 6, 1–11 (1974).
Martincorena, I. & Campbell, P. J. Somatic mutation in cancer and normal cells. Science 349, 1483–1489 (2015).
Stransky, N. et al. The mutational landscape of head and neck squamous cell carcinoma. Science 333, 1157–1160 (2011).
Hodis, E. et al. A landscape of driver mutations in melanoma. Cell 150, 251–263 (2012).
Huang, K.-L. et al. Pathogenic germline variants in 10,389 adult cancers. Cell 173, 355–370.e14 (2018).
Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell 149, 979–993 (2012).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).
Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385.e18 (2018).
Ding, L. et al. Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173, 305–320.e10 (2018).
Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11, 396 (2014).
Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35 (2015).
Jiang, Y., Qiu, Y., Minn, A. J. & Zhang, N. R. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc. Natl Acad. Sci. USA 113, E5528–E5537 (2016).
Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).
Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160 (2014).
Navin, N. E. The first five years of single-cell cancer genomics and beyond. Genome Res. 25, 1499–1507 (2015).
Kim, K. I. & Simon, R. Using single cell sequencing data to model the evolutionary history of a tumor. BMC Bioinf. 15, 27 (2014).
Navin, N. E. & Chen, K. Genotyping tumor clones from single-cell data. Nat. Methods 13, 555–556 (2016).
Jahn, K., Kuipers, J. & Beerenwinkel, N. Tree inference for single-cell data. Genome Biol. 17, 86 (2016).
Kuipers, J., Jahn, K., Raphael, B. J. & Beerenwinkel, N. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 27, 1885–1894 (2017).
Roth, A. et al. Clonal genotype and population structure inference from single-cell tumor sequencing. Nat. Methods 13, 573–576 (2016).
Salehi, S. et al. ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data. Genome Biol. 18, 44 (2017).
Malikic, S. et al. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat. Commun. 10, 2750 (2019).
Müller, S. et al. Single‐cell sequencing maps gene expression to mutational phylogenies in PDGF‐ and EGF‐driven gliomas. Mol. Syst. Biol. 12, 889 (2016).
Tirosh, I. et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539, 309–313 (2016).
Fan, J. et al. Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. Genome Res. 28, 1217–1227 (2018).
Campbell, K. R. et al. clonealign: statistical integration of independent single-cell RNA and DNA sequencing data from human cancers. Genome Biol. 20, 54 (2019).
Giustacchini, A. et al. Single-cell transcriptomics uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia. Nat. Med. 23, 692–702 (2017).
Cheow, L. F. et al. Single-cell multimodal profiling reveals cellular epigenetic heterogeneity. Nat. Methods 13, 833–836 (2016).
Saikia, M. et al. Simultaneous multiplexed amplicon sequencing and transcriptome profiling in single cells. Nat. Methods 16, 59–62 (2019).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Kilpinen, H. et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature 546, 370–375 (2017).
Williams, M. J. et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nat. Genet. 50, 895–903 (2018).
Martincorena, I. et al. Universal patterns of selection in cancer and somatic tissues. Cell 173, 1823 (2018).
Simons, B. D. Deep sequencing as a probe of normal stem cell fate and preneoplasia in human epidermis. Proc. Natl Acad. Sci. USA 113, 128–133 (2016).
Williams, M. J., Werner, B., Barnes, C. P., Graham, T. A. & Sottoriva, A. Identification of neutral tumor evolution across cancer types. Nat. Genet. 48, 238 (2016).
Ramaker, R. C. et al. RNA sequencing-based cell proliferation analysis across 19 cancers identifies a subset of proliferation-informative cancers with a common survival signature. Oncotarget. 8, 38668–38681 (2017).
Kowalczyk, M. S. et al. Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 25, 1860–1872 (2015).
Tsang, J. C. H. et al. Single-cell transcriptomic reconstruction reveals cell cycle and multi-lineage differentiation defects in Bcl11a-deficient hematopoietic stem cells. Genome Biol. 16, 178 (2015).
Kolodziejczyk, A. A. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485 (2015).
Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Guo, H. et al. Single-cell methylome landscapes of mouse embryonic stem cells and early embryos analyzed using reduced representation bisulfite sequencing. Genome Res. 23, 2126–2135 (2013).
Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820 (2014).
Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
Streeter, I. et al. The human-induced pluripotent stem cell initiative—data resources for cellular genetics. Nucleic Acids Res. 45, 691–697 (2016).
Church, D. M. et al. Modernizing reference genome assemblies. PLoS Biol. 9, e1001091 (2011).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv [q-bio.GN] (2013).
Li, H. et al. The sequence alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Karczewski, K. J. et al. The ExAC browser: displaying reference data information from over 60 000 exomes. Nucleic Acids Res. 45, D840–D845 (2017).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Fisher, R. A. On the interpretation of χ² from contingency tables, and the calculation of P. J. R. Stat. Soc. 85, 87–94 (1922).
Gori, K. & Baez-Ortega, A. sigfit: flexible Bayesian inference of mutational signatures. Preprint at bioRxiv https://doi.org/10.1101/372896 (2018).
Flicek, P. et al. Ensembl 2014. Nucleic Acids Res. 42, D749–D755 (2014).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
Hoffman, G. E. & Schadt, E. E. variancePartition: interpreting drivers of variation in complex gene expression studies. BMC Bioinf. 17, 483 (2016).
Lund, S. P., Nettleton, D., McCarthy, D. J. & Smyth, G. K. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Stat. Appl. Genet. Mol. Biol. 11, https://doi.org/10.1515/1544-6115.1826 (2012).
Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).
Wu, D. & Smyth, G. K. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 40, e133 (2012).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
Ignatiadis, N., Klaus, B., Zaugg, J. B. & Huber, W. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat. Methods 13, 577–580 (2016).
Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
Smyth, G. K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, 1–25 (2004).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

Acknowledgements

We thank D. Jörg for highly constructive discussions and P. Qiao for valuable comments on the manuscript. We acknowledge the Wellcome Sanger Institute Cellular Genetics and Phenotyping teams (in particular, A. Alderton, C. Gomez, R. Boyd, S. Patel and S. Barnett) and DNA pipelines for their invaluable assistance in generating the data for this study. We thank G. Kildisiute for assisting in CNV analysis of the fibroblast lines. This project was supported by Wellcome Sanger core funding (WT206194) and the Human Induced Pluripotent Stem Cell Initiative. Research in the Stegle laboratory is supported by the BMBF, the Volkswagen Foundation, the Chan Zuckerberg Initiative and the EU (ERC project DECODE, grant agreement 732546). D.J.M. is supported by the National Health and Medical Research Council of Australia (grants APP1112681 and APP1162829), seed funding from the Baker Foundation and the Holyoake Research Fellowship at St Vincent's Institute of Medical Research and the University of Melbourne. R.R. is supported by the BBSRC Doctoral Training Programme. Y.H. is supported by the University of Cambridge and EMBL-EBI through an EBPOD postdoctoral fellowship. D.J.K. is supported by the Wellcome Trust under grants 203828/Z/16/A and 203828/Z/16/Z. T.H. is supported by a Human Frontier Science Program Fellowship, an EMBO Long-term Fellowship and an EMBO Advanced Fellowship.

Author information

These authors contributed equally: Davis J. McCarthy, Raghd Rostom, Yuanhua Huang.

Authors and Affiliations

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
Davis J. McCarthy, Raghd Rostom, Yuanhua Huang, Marc Jan Bonder, Tzachi Hagai, Francesco Paolo Casale, Anna Cuomo, Adam Faulconbridge, Peter W. Harrison, Bogdan Mirauta, Daniel Seaton, Ian Streeter, Laura Clarke, Ewan Birney, Oliver Stegle & Sarah A. Teichmann
St Vincent’s Institute of Medical Research, Fitzroy, Victoria, Australia
Davis J. McCarthy & Ruqian Lyu
Melbourne Integrative Genomics, School of Mathematics and Statistics/School of Biosciences, University of Melbourne, Parkville, Victoria, Australia
Davis J. McCarthy & Ruqian Lyu
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
Raghd Rostom, Daniel J. Kunz, Petr Danecek, Tzachi Hagai, Daniel J. Gaffney, Oliver Stegle & Sarah A. Teichmann
Department of Clinical Neurosciences, University of Cambridge, Cambridge, UK
Yuanhua Huang
Department of Physics, Cavendish Laboratory, Cambridge, UK
Daniel J. Kunz, Benjamin D. Simons & Sarah A. Teichmann
The Wellcome Trust/Cancer Research UK Gurdon Institute, University of Cambridge, Cambridge, UK
Daniel J. Kunz & Benjamin D. Simons
School of Molecular Cell Biology and Biotechnology, George S Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
Tzachi Hagai
Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
Wenyi Wang
The Wellcome Trust/Medical Research Council Stem Cell Institute, University of Cambridge, Cambridge, UK
Benjamin D. Simons
European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
Oliver Stegle
Division of Computational Genomics and Systems Genetics, German Cancer Research Center, Heidelberg, Germany
Oliver Stegle
Wellcome Genome Campus, Wellcome Trust Sanger Institute, Hinxton, UK
Helena Kilpinen, Angela Goncalves, Andreas Leha, Kaur Alasoo, Sendu Bala, Petr Danecek, Shane A. McCarthy, Yasin Memari, Alice Mann, Chukwuma A. Agu, Alex Alderton, Rachel Nelson, Sarah Harper, Minal Patel, Alistair White, Sharad R. Patel, Reena Halai, Christopher M. Kirton, Anja KolbKokocinski, Willem H. Ouwehand, Ludovic Vallier, Richard Durbin & Daniel J. Gaffney
UCL Great Ormond Street Institute of Child Health, University College London, London, UK
Helena Kilpinen & Philip Beales
Department of Medical Statistics, University Medical Center Göttingen, Humboldtallee, Germany
Andreas Leha
Centre for Gene Regulation & Expression, School of Life Sciences, University of Dundee, Dundee, UK
Vackar Afzal, Dalila Bensaddek & Angus I. Lamond
Department of Haematology, University of Cambridge, Cambridge, UK
Sofie Ashford & Willem H. Ouwehand
Centre for Stem Cells & Regenerative Medicine, King’s College London, Tower Wing, Guy’s Hospital, Great Maze Pond, London, UK
Oliver J. Culley, Annie Kathuria, Ruta Meleckyte, Nathalie Moens, Davide Danovi & Fiona M. Watt
Wellcome Trust and MRC Cambridge Stem Cell Institute and Biomedical, Research Centre, Anne McLaren Laboratory, University of Cambridge, Cambridge, UK
Filipa Soares & Ludovic Vallier
NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK
Willem H. Ouwehand
Department of Genetics, University of Cambridge, Cambridge, UK
Richard Durbin

Authors

Davis J. McCarthy
Raghd Rostom
Yuanhua Huang
Daniel J. Kunz
Petr Danecek
Marc Jan Bonder
Tzachi Hagai
Ruqian Lyu
Wenyi Wang
Daniel J. Gaffney
Benjamin D. Simons
Oliver Stegle
Sarah A. Teichmann

Consortia

HipSci Consortium

Helena Kilpinen, Angela Goncalves, Andreas Leha, Vackar Afzal, Kaur Alasoo, Sofie Ashford, Sendu Bala, Dalila Bensaddek, Marc Jan Bonder, Francesco Paolo Casale, Oliver J. Culley, Anna Cuomo, Petr Danecek, Adam Faulconbridge, Peter W. Harrison, Annie Kathuria, Davis J. McCarthy, Shane A. McCarthy, Ruta Meleckyte, Yasin Memari, Bogdan Mirauta, Nathalie Moens, Filipa Soares, Alice Mann, Daniel Seaton, Ian Streeter, Chukwuma A. Agu, Alex Alderton, Rachel Nelson, Sarah Harper, Minal Patel, Alistair White, Sharad R. Patel, Laura Clarke, Reena Halai, Christopher M. Kirton, Anja KolbKokocinski, Philip Beales, Ewan Birney, Davide Danovi, Angus I. Lamond, Willem H. Ouwehand, Ludovic Vallier, Fiona M. Watt, Richard Durbin, Oliver Stegle & Daniel J. Gaffney

Contributions

R.R., T.H. and S.A.T. conceived and planned the experiments. R.R. and T.H. carried out the experiments. Y.H., D.J.M. and O.S. developed the computational methods. Y.H. developed the statistical model and the implementation. Y.H. and D.J.M. wrote the software. Y.H. carried out all simulation experiments and benchmarked alternative methods. The HipSci Consortium provided the cell lines and exome sequencing data. P.D. conducted somatic variant calling from exome sequencing data. D.J.G. advised on somatic variant calling approaches and the mutational signatures analysis carried out by R.R. D.J.M. and M.J.B. developed data processing workflows and D.J.M. processed the fibroblast scRNA-sequencing data. R.L. and D.J.M. processed the melanoma scRNA-sequencing data. D.J.K. conducted the selection analyses, supervised by B.D.S. D.J.M. and Y.H. carried out clonal inference and cell-assignment analyses. D.J.M. conducted differential gene and pathway expression analyses and integrated the computational analyses into a reproducible workflow. D.J.M. and R.R. took the lead in writing the manuscript. D.J.M., R.R. and Y.H. drafted the manuscript and designed the figures. W.W. suggested improvements to somatic variant calling and DE analyses. S.A.T. and O.S. conceived of the study, planned and supervised the work. All authors contributed to the interpretation of results and commented on and approved the final manuscript. The HipSci Consortium generated and provided early access to the fibroblast lines used in this work (see Supplementary Note for a full list of consortium members).

Corresponding authors

Correspondence to Oliver Stegle or Sarah A. Teichmann.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Graphical representation of the cardelino model.

The clonal tree configuration matrix C is a random variable and follows a Bernoulli distribution encoded by an input tree configuration Ω that is provided to the model (for example, estimated from bulk or single-cell DNA-seq data using existing methods such as Canopy) as well as an error rate ξ, which follows a beta prior distribution with hyperparameters 𝜅. The indicator matrix I defines the assignment of cells to clones; it is another unknown variable and is assumed to follow a multinomial prior with fixed parameter 𝜋 for each cell. The clone configuration C and cell identity I together encode the genotype c_{i,I_j} of each variant i in each cell j. If c_{i,I_j} is 1, the alternative allelic read count follows a binomial distribution with gene-specific parameter 𝜃_i; otherwise it follows a binomial distribution with error-related parameter 𝜃₀. Both 𝜃_i and 𝜃₀ have a beta prior distribution, but with different parameters. Shaded nodes represent observed variables; unshaded nodes represent unknown variables; yellow circled nodes represent fixed hyperparameters.
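
As a rough illustration of the read-count likelihood described in this legend, the sketch below computes the posterior over clones for a single cell when the configuration matrix C and the parameters 𝜃_i and 𝜃₀ are held fixed. This is a deliberate simplification: the model in the figure treats C, 𝜃_i and 𝜃₀ as unknown and places priors on them. The function name, the uniform default prior and the toy numbers are assumptions made for illustration and are not the cardelino package API.

```python
import math

def clone_posterior(alt, depth, config, theta, theta0, prior=None):
    """Posterior probability of each clone for one cell (simplified sketch).

    alt[i], depth[i]: alternative-allele and total read counts at variant i.
    config[i][k]: 1 if variant i is present in clone k, else 0 (held fixed here).
    theta[i]: binomial parameter used when the variant is present.
    theta0: error-related binomial parameter used when the variant is absent.
    prior: prior over clones; uniform if omitted (an assumption of this sketch).
    """
    n_clones = len(config[0])
    if prior is None:
        prior = [1.0 / n_clones] * n_clones
    log_post = []
    for k in range(n_clones):
        ll = math.log(prior[k])
        for i, (a, d) in enumerate(zip(alt, depth)):
            if d == 0:
                continue  # variants without read coverage carry no information
            p = theta[i] if config[i][k] == 1 else theta0
            ll += math.log(math.comb(d, a)) + a * math.log(p) + (d - a) * math.log(1 - p)
        log_post.append(ll)
    # Normalise in log space for numerical stability.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example (hypothetical numbers): 2 variants, 3 clones, clone 0 is the base clone.
example_config = [[0, 1, 1],
                  [0, 0, 1]]
example_post = clone_posterior(alt=[2, 0], depth=[5, 6],
                               config=example_config,
                               theta=[0.4, 0.4], theta0=0.01)
```

In this toy case the cell shows reads for the first variant but none for the second, so most of the posterior mass falls on the clone that carries only the first variant.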

Supplementary Figure 2 Distribution of key data characteristics from experimental scRNA-seq data from 32 fibroblast cell lines, used as the basis of parameter settings in simulations.

(a) Number of clones inferred from bulk exome-seq data. (b) The median number of variants per clonal branch; (c) The overall coverage of variants, namely the fraction of variants with at least one read. (d) Scatter plot between the mean number of reads per variant per cell and the overall coverage of variants in the same line. The default parameters used in simulations are highlighted with the red line.

Supplementary Figure 3 Simulation results evaluating the inferred relax (error) rate in the configuration of variants in the guide clonal tree.

(a) The estimated relax rate as a function of the simulated error rates. Errors are simulated by uniformly swapping the mutation states in the configuration matrix of the guide clonal tree, which means that a clone may contain false mutations in the guide clonal tree provided to cardelino (except in the case of the base clone which has no mutations under any simulation conditions). (b) The estimated relax rate across different fractions of variants that have wrong branch configuration. Errors are added by swapping branches for variants.
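
One way to read the error simulation in panel (a) is as independent flips of entries in the variants-by-clones configuration matrix, leaving the base clone untouched. The sketch below follows that reading; the function name, the choice of column 0 as the base clone and the fixed seed are assumptions for illustration, not the authors' simulation code.

```python
import random

def flip_config(config, error_rate, seed=0):
    """Return a copy of a 0/1 variants-by-clones matrix with random flips.

    Each entry outside the base clone (assumed to be column 0) is flipped
    independently with probability error_rate.
    """
    rng = random.Random(seed)
    noisy = []
    for row in config:
        new_row = [row[0]]  # the base clone never receives mutations
        for state in row[1:]:
            flipped = rng.random() < error_rate
            new_row.append(1 - state if flipped else state)
        noisy.append(new_row)
    return noisy

# Hypothetical guide tree: 3 variants, 3 clones.
guide = [[0, 1, 1],
         [0, 0, 1],
         [0, 1, 0]]
noisy_guide = flip_config(guide, error_rate=0.1)
```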

Supplementary Figure 4 Additional results from assessing cardelino and alternative methods using simulated data.

Assessment of cell assignment to clones across a variety of simulation settings, considering SingleCellGenotyper (SCG), Demuxlet, cardelino and its two versions: cardelino-free without any informative clone configuration prior and cardelino-fixed assuming that the clone configuration prior is correct (Methods; Supplementary Note). All methods were applied to simulated data with known ground truth, varying (a) the number of informative variants per clonal branch, (b) the fraction of informative variants covered (that is, nonzero scRNA-seq read coverage), (c) the total number of clones, (d) the precision (i.e., inverse variance) of allelic ratio across genes; lower precision means more genes with high allelic imbalance, (e) the rate of general errors of mutation states in the clone configuration matrix, (f) the fraction of wrongly clustered variants in the input clonal tree branch. Default parameter values are marked with an asterisk and are retained when varying other parameters. (g) The effects of the tree topology on the cell assignment accuracy. In the simulations there are 50 repeats for each parameter, where one of the tree topology candidates is randomly selected in each repeat. For the four-clone configurations, there are four different tree topologies (upper panel), and the performance (area under the precision-recall curve) of the five methods is shown separately for each topology (bottom panel).

Supplementary Figure 5 Estimated mutational signature exposures based upon the tri-nucleotide context of somatic SNVs called from whole-exome sequencing (WES) data for n=32 HipSci human fibroblast lines.

The x-axis shows 30 COSMIC mutational signatures, in order, and the y-axis shows estimated exposures (mutation fraction) using the sigfit package (Methods), with significant signatures highlighted in blue. Across lines, the only significant signatures are Signature 7 (UV mutagenic process) and Signature 11.

Supplementary Figure 6 Variant allele frequency (VAF) distributions for somatic variants called from whole exome sequencing data for the 32 fibroblast lines.

The grey lines indicate the minimum allele-frequency threshold (0.05) used for variants for this analysis (Methods). The blue lines indicate the model (neutral/selected) inferred by SubClonalSelection (shading 95% confidence interval). Donors with a selection probability below 0.3 are classified as ‘neutral’, above 0.7 as ‘selected’. Donors which are neither ‘selected’ nor ‘neutral’ remain ‘undetermined’. High confidence ‘selected’ lines (selection probability >0.7 and >100 somatic variants) are: joxm, wahn, garx, vass, ualf, euts, pipw, oilg, feec, fikt, qolg, and puie.
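
The classification rule stated in this legend can be written down directly. The sketch below encodes the thresholds given above (0.3, 0.7 and more than 100 somatic variants); the function name and return format are assumptions for illustration.

```python
def classify_line(selection_prob, n_variants):
    """Label a line from its selection probability, as described in the legend.

    Returns (label, high_confidence), where high_confidence is True only for
    'selected' lines with more than 100 somatic variants.
    """
    if selection_prob < 0.3:
        label = "neutral"
    elif selection_prob > 0.7:
        label = "selected"
    else:
        label = "undetermined"
    high_confidence = label == "selected" and n_variants > 100
    return label, high_confidence
```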

Supplementary Figure 7 Comparison of five methods on simulated data matching 32 fibroblast cell lines and estimated error rate and cell assignability with cardelino from experimental data for 32 fibroblast lines.

(a) Assessment of cell assignment to clones across a variety of simulation settings, considering SingleCellGenotyper (SCG), Demuxlet, cardelino, cardelino-free and cardelino-fixed (Methods; Supplementary Note). Considered are simulated data based on empirical characteristics observed in 32 fibroblast lines. For each line, the sequence coverage, clone configuration (i.e., number of clones, variants on each branch), and allelic imbalance parameters were obtained to derive simulation parameters. For each line, 200 cells are synthesised, together with a guide clonal tree containing 10% errors in the allocation of variants to clones. (b) Estimated error rate in the clonal tree configuration derived from bulk exome-seq data (based on cardelino) for each of 32 lines versus fraction of confidently assigned cells (>90% of cells assigned for 23 lines; at cardelino posterior probability P>0.5 for most-probable clone).
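
The "fraction of confidently assigned cells" in panel (b) can be computed from a cells-by-clones matrix of posterior probabilities. The sketch below assumes such a matrix is available as a list of rows; the function name and input layout are illustrative assumptions rather than part of the cardelino interface.

```python
def assigned_fraction(posteriors, threshold=0.5):
    """Fraction of cells whose most-probable clone has posterior > threshold.

    posteriors: list of per-cell lists of clone posterior probabilities.
    """
    if not posteriors:
        return 0.0
    n_assigned = sum(1 for row in posteriors if max(row) > threshold)
    return n_assigned / len(posteriors)

# Hypothetical posteriors for four cells and two clones.
toy_posteriors = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
```

Raising the threshold can only keep or reduce the assigned fraction, which is why stricter thresholds give fewer assignable cells.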

Supplementary Figure 8 Comparison of cell assignment between five methods on experimental data across 32 fibroblast lines.

(a) The fraction of assignable cells (i.e., highest P > threshold) when varying the threshold from 0.5 to 0.95. Shown are box plots depicting the median and the first and third quartiles of the 32 lines. (b) The adjusted Rand index of cell assignment to clones between the five considered methods. The values are averaged across 32 fibroblast lines. (c) Scatter plot between the uncertainty of the inferred tree from cardelino-free (x-axis) and the mean absolute difference of the assignment probability between cardelino-free and cardelino (y-axis). The output posterior clonal configuration matrix from cardelino-free consists of the probability of each variant being present in each clone. A completely uninformative clonal tree would have all entries equal to 0.5. Thus, we measure the uncertainty of the output tree from cardelino-free by taking 0.5 minus the mean absolute difference between the posterior probability configuration matrix and the uninformative configuration probability matrix with all entries equal to 0.5. With this measure, a value of 0.5 indicates a posterior configuration indistinguishable from the uninformative configuration and a value of 0 indicates very high confidence from the model in the posterior configuration. (d) The comparison of cell assignment for one representative line (feec) when using different guide clonal trees sampled from Canopy’s posterior distribution as input. Each violin plot shows the adjusted Rand index of cell assignment between each of 435 tree pairs combining the 30 most probable trees from bulk exome-seq for the feec line. (e) Cell assignment similarity for each of the 32 lines when using different guide clonal trees, quantified with adjusted Rand index values between different pairs of guide clonal trees. For each line, we take the 30 most probable posterior trees from Canopy, and then each dot in the box plot denotes the average adjusted Rand index value for one line, calculated from 435 of these pairwise comparisons.
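
The uncertainty measure in panel (c) and the pair count in panels (d) and (e) can both be checked with a few lines. The sketch below implements "0.5 minus the mean absolute difference from 0.5" for a posterior configuration matrix and counts the unordered pairs among 30 trees; the function and variable names are assumptions for illustration.

```python
from itertools import combinations

def tree_uncertainty(prob_matrix):
    """0.5 minus the mean absolute difference between each entry and 0.5.

    prob_matrix: variants-by-clones matrix of posterior probabilities that a
    variant is present in a clone.
    """
    entries = [p for row in prob_matrix for p in row]
    mean_abs_diff = sum(abs(p - 0.5) for p in entries) / len(entries)
    return 0.5 - mean_abs_diff

# Number of unordered pairs among the 30 most probable trees.
n_tree_pairs = len(list(combinations(range(30), 2)))  # 435
```

A matrix of all 0.5 entries gives 0.5 (uninformative), while a matrix of only 0s and 1s gives 0 (fully confident), matching the interpretation in the legend.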

Supplementary Figure 9 Cell-clone assignment rates from cardelino.

(a) Scatter plot of the fraction of cells assigned in each cell line using cardelino (at posterior probability > 0.5) as a function of the minimum number of clone-specific variants for the corresponding line (minimum Hamming distance between clones for a given donor), for 32 fibroblast lines. The total number of cells considered for this analysis (QC passed) per line is indicated by colour. (b) Scatter plot of recall (assignment rate) versus precision (assignment accuracy) when assigning cells using cardelino (at posterior probability > 0.5). Shown are data from 32 simulated lines, using parameters that match the observed data characteristics in the set of 32 real fibroblast lines (Methods). The average number of variants per clonal branch (i.e., #variant/(#clone - 1)) is shown by point colour (slightly different from Supplementary Fig. 4, which uses the minimum number of variants distinguishing between pairs of clones, as shown in Fig. 3a). Lines with fewer informative variants per branch tend to have lower assignment rates, but the precision remains high.
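The quantities named in this caption can be sketched in Python as follows. This is an illustrative reading of the caption, assuming a binary variants-by-clones configuration matrix and a per-cell highest posterior probability; all function names and toy values are hypothetical and do not come from the cardelino code base.

```python
from itertools import combinations


def min_hamming_between_clones(config):
    """Minimum number of variants distinguishing any pair of clones.

    config is a binary variants-by-clones matrix given as a list of rows.
    """
    n_clones = len(config[0])
    return min(
        sum(row[a] != row[b] for row in config)
        for a, b in combinations(range(n_clones), 2)
    )


def variants_per_branch(config):
    """Average number of variants per clonal branch: #variant / (#clone - 1)."""
    return len(config) / (len(config[0]) - 1)


def recall_and_precision(max_probs, predicted, truth, threshold=0.5):
    """Recall = fraction of cells assigned; precision = accuracy among assigned."""
    assigned = [i for i, p in enumerate(max_probs) if p > threshold]
    recall = len(assigned) / len(max_probs)
    if not assigned:
        return recall, float("nan")
    correct = sum(predicted[i] == truth[i] for i in assigned)
    return recall, correct / len(assigned)


# Toy example: 5 variants, 3 clones (illustrative values only).
config = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]
print(min_hamming_between_clones(config))  # 2
print(variants_per_branch(config))         # 2.5

recall, precision = recall_and_precision(
    max_probs=[0.9, 0.6, 0.4, 0.8],
    predicted=[0, 1, 1, 2],
    truth=[0, 1, 2, 1],
)
print(recall, precision)
```

In the toy example three of four cells pass the 0.5 threshold and two of those three are assigned to the correct clone, which mirrors the distinction the caption draws between assignment rate and assignment accuracy.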

Supplementary Figure 10 Clone prevalence estimates from WES data (x-axis; using Canopy) versus the fraction of single-cell transcriptomes assigned to the clone (y-axis; using cardelino), for each clone across lines.

Points are coloured by the overall fraction of single-cell transcriptomes assigned for a given line (i.e. cells with posterior P>0.5 for assignment).

Supplementary Figure 11 Direct effects of somatic variants on genes overlapping the variant.

Volcano plot showing negative log P values versus log2-fold change from testing differential expression for genes with a somatic mutation between cells with the mutation and cells without the mutation, faceted by VEP annotation category (Methods). Each point represents a gene, and box plots show the overall log2-fold change distribution for each annotation category. DE tests (two-sided QL F test in edgeR) are conducted within each line (donor) separately, and the results shown here are aggregated across n=32 lines. Genes are categorised by simplified functional annotations from VEP of the somatic mutation, and genes significantly DE at an FDR threshold of 20% are shown in red.
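As a minimal sketch of how genes might be flagged at the 20% FDR threshold, the Python snippet below applies a Benjamini-Hochberg adjustment to a list of P values. The choice of Benjamini-Hochberg is an assumption, since the caption states only the threshold and not the adjustment method; the function name `bh_adjust` and the P values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy P values for four genes (illustrative only, not results from the paper).
pvalues = [0.01, 0.04, 0.03, 0.5]
qvalues = bh_adjust(pvalues)

# Assumed threshold from the caption: FDR < 20%.
significant = [q < 0.2 for q in qvalues]
print(significant)  # [True, True, True, False]
```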

Supplementary Figure 12 Gene set enrichment results for fibroblast data from n=32 lines.

(a) Heatmap showing Spearman correlation between gene set enrichment results for the 16 most frequently enriched MSigDB Hallmark gene sets across 31 lines. Colour indicates the correlation between pairs of gene sets and is only shown if the correlation is significant (P < 0.05). (b) Heatmap showing proportion of overlap in genes between pairs of gene sets (matching those in left panel). (c) Heatmap showing the direction (first listed clone relative to second listed clone; in colour) and strength of enrichment (-log10(P) as degree of shading) for Hallmark gene sets tested with camera (Methods) for all pairwise comparisons between clones across n=31 lines. Gene sets that are significantly enriched at an FDR threshold of 5% are indicated with dots. Gene sets are shown if significant in at least one line and are ordered by number of lines in which they are significant.

Supplementary Figure 13 Results from five human melanoma samples.

(a) Number of cells assigned by cardelino to each inferred clone for five melanoma patients, stratified by cell type identified using gene expression of marker genes as in the original publication (ref. 37). (b) Gene set enrichment analysis results when comparing gene expression in clone1 cells to cells in other clones, within each patient, including cells from all cell types. Given that immune cells and cancer-associated fibroblast (CAF) cells are almost all assigned to clone1, this comparison effectively reflects expression differences between melanoma and immune cells. (c) Gene set enrichment analysis results when considering all pairwise comparisons between clones consisting of melanoma cells only. The heatmaps in (b) and (c) depict signed P-values of gene set enrichment (n=31 cell lines; two-sided test using camera) for Hallmark gene sets found to be significantly enriched (FDR<0.05) in at least one comparison. Dots denote significant enrichments. For details on the cell assignment and gene set enrichment analyses see Supplementary Note. (d) Heatmap showing correlations between gene set enrichment results when using all cells (across melanoma, immune and cancer-associated fibroblast cell types) assigned to clones across five melanoma patients and comparing expression of cells assigned to clone1 to those assigned to other clones. (e) Heatmap showing correlations between gene set enrichment results when using all melanoma cells assigned to clones across five melanoma patients and comparing expression of cells between all pairs of clones (for which the clones have sufficiently many cells assigned). For both (d) and (e), the heatmap shows Spearman correlation between gene set enrichment results for the 16 most frequently enriched MSigDB Hallmark gene sets across n=5 patients. Colour indicates the correlation between pairs of gene sets and is only shown if the correlation is significant (P < 0.05).

Supplementary information

Supplementary Information

Supplementary Figs. 1–13, Tables 1 and 2 and Note.

Reporting Summary

Source data

Source Data Fig. 1

Source Data Fig. 2

Source Data Fig. 3

Source Data Fig. 4

Source Data clone assignment

Cardelino cell-clone assignment results for 32 fibroblast lines

About this article

Cite this article

McCarthy, D.J., Rostom, R., Huang, Y. et al. Cardelino: computational integration of somatic clonal substructure and single-cell transcriptomes. Nat Methods 17, 414–421 (2020). https://doi.org/10.1038/s41592-020-0766-3

Received: 26 November 2018
Accepted: 31 January 2020
Published: 16 March 2020
Issue Date: April 2020
DOI: https://doi.org/10.1038/s41592-020-0766-3

This article is cited by

Accurate integration of single-cell DNA and RNA for analyzing intratumor heterogeneity using MaCroDNA
- Mohammadamin Edrisi
- Xiru Huang
- Luay Nakhleh
Nature Communications (2023)

Reconstructing clonal tree for phylo-phenotypic characterization of cancer using single-cell transcriptomics
- Seong-Hwan Jun
- Hosein Toosi
- Jens Lagergren
Nature Communications (2023)

Phylogenetic inference from single-cell RNA-seq data
- Xuan Liu
- Jason I. Griffiths
- Jeffrey T. Chang
Scientific Reports (2023)

De novo detection of somatic mutations in high-throughput single-cell profiling data sets
- Francesc Muyas
- Carolin M. Sauer
- Isidro Cortés-Ciriano
Nature Biotechnology (2023)

MQuad enables clonal substructure discovery using single cell mitochondrial variants
- Aaron Wing Cheung Kwok
- Chen Qiao
- Yuanhua Huang
Nature Communications (2022)