This Comment describes common pitfalls encountered when deriving and validating predictive statistical models from high-dimensional data. It offers a fresh perspective on key statistical issues and provides guidelines to help readers avoid these pitfalls and to help those unfamiliar with the field better assess the reliability and significance of their results.
About this article
Cite this article
Teschendorff, A.E. Avoiding common pitfalls in machine learning omic data science. Nat. Mater. 18, 422–427 (2019). https://doi.org/10.1038/s41563-018-0241-z
DOI: https://doi.org/10.1038/s41563-018-0241-z