
Detecting potential outliers in longitudinal data with time-dependent covariates

Abstract

Background

Outliers can influence regression model parameters and change the direction of the estimated effect, leading to over- or under-estimation of the strength of the association between a response variable and an exposure of interest. Identifying visit-level outliers in longitudinal data with continuous time-dependent covariates is important when the distribution of such variables is highly skewed.

Objectives

The primary objective was to identify potential outliers at follow-up visits using the interquartile range (IQR) statistic and to assess their influence on estimated Cox regression parameters.
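
For orientation, a conventional IQR-based rule can be written as below. This is the generic textbook form rather than the authors' exact algorithm: the quartile symbols and the multiplier k are notation assumed here, and the abstract does not state which value of k was used.

```latex
\mathrm{IQR} = Q_3 - Q_1, \qquad
x \text{ is flagged if } \; x < Q_1 - k\,\mathrm{IQR} \;\text{ or }\; x > Q_3 + k\,\mathrm{IQR}
```

Tukey's conventional choice is k = 1.5; larger multipliers flag only the more extreme observations. For visit-level detection, Q1 and Q3 would be computed separately from the observations recorded at each visit.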

Methods

The study was motivated by a large longitudinal dietary and time-to-event dataset from The Environmental Determinants of Diabetes in the Young (TEDDY) study, with continuous time-varying vitamin B12 intake as the exposure of interest and development of islet autoimmunity (IA) as the response variable. An IQR algorithm was applied to the TEDDY dataset to detect potential outliers at each visit. To assess the impact of detected outliers, data were analyzed using the extended time-dependent Cox model with a robust sandwich estimator. Partial residual diagnostic plots were examined for highly influential outliers.
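
A minimal sketch of visit-level IQR screening is given below, using only the Python standard library. It is an illustration under stated assumptions, not the authors' code: the function name `flag_visit_outliers`, the `(subject_id, visit, value)` record layout, the inclusive quartile method, and the fence multiplier `k` are all hypothetical choices, and the intake values are made up for the example.

```python
from collections import defaultdict
from statistics import quantiles


def flag_visit_outliers(records, k=1.5):
    """Return records whose value lies outside the IQR fences of their visit.

    records: iterable of (subject_id, visit, value) tuples (hypothetical layout).
    k: fence multiplier (an assumption; the abstract does not state it).
    """
    records = list(records)  # the data are traversed twice

    # Group the observed values by visit.
    by_visit = defaultdict(list)
    for _subject_id, visit, value in records:
        by_visit[visit].append(value)

    # Compute lower and upper fences separately for each visit.
    fences = {}
    for visit, values in by_visit.items():
        if len(values) < 4:
            continue  # too few observations to estimate quartiles sensibly
        q1, _median, q3 = quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        fences[visit] = (q1 - k * iqr, q3 + k * iqr)

    # Flag observations that fall outside the fences of their own visit.
    flagged = []
    for subject_id, visit, value in records:
        if visit in fences:
            low, high = fences[visit]
            if value < low or value > high:
                flagged.append((subject_id, visit, value))
    return flagged


# Made-up intake values (not TEDDY data) for two visits.
visit_1 = [2.0, 2.5, 3.0, 3.5, 4.0, 40.0]
visit_2 = [2.0, 2.5, 3.0, 3.5, 4.0, 6.0]
example_records = (
    [(f"s{i + 1}", 1, v) for i, v in enumerate(visit_1)]
    + [(f"s{i + 1}", 2, v) for i, v in enumerate(visit_2)]
)

print(flag_visit_outliers(example_records, k=1.5))
print(flag_visit_outliers(example_records, k=5))
```

With these made-up values, k = 1.5 flags both the 40.0 at visit 1 and the 6.0 at visit 2, whereas k = 5 flags only the 40.0. Flagged observations are candidates for further investigation rather than automatic exclusions.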

Results

Extreme vitamin B12 observations from participants who developed IA (cases) had a stronger influence on the Cox regression model than those from non-cases. Identified outliers changed the direction of hazard ratios, the standard errors, or the strength of association with the risk of developing IA.
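
To make the notion of a change in "direction" concrete, the time-dependent Cox model can be written in a simplified single-covariate form. This is a generic textbook sketch, not the authors' fitted model: the symbols h_0, β, and x_i(t) are notation assumed here, and the abstract does not state which adjustment covariates were included.

```latex
h_i(t) = h_0(t)\,\exp\{\beta\, x_i(t)\}, \qquad \mathrm{HR} = \exp(\beta)
```

Here x_i(t) would be the vitamin B12 intake of participant i at time t, and HR is the hazard ratio per one-unit increase in intake. An HR above 1 (β > 0) indicates higher risk with higher intake and an HR below 1 (β < 0) indicates lower risk, so a change of direction corresponds to the sign of the estimated β differing between the full data and the data with flagged outliers removed.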

Conclusion

At the exploratory data analysis stage, the IQR algorithm can be used as a data quality control tool to identify potential outliers at the visit level, which can be further investigated.


Fig. 1: Mean intake by country and visit.
Fig. 2: Residual diagnostic plots from models fitted to the full and IQR-5 TEDDY data, with residuals plotted against average vitamin B12 intake (μg/day).
Fig. 3: Partial DFBETA residuals from full TEDDY dietary data showing vitamin B12 (μg/day) values that have larger or unusual partial DFBETA residuals compared to other data points.
Fig. 4: Partial DFBETA residuals from IQR-5 TEDDY data with labeled vitamin B12 (μg/day) values.

Data availability

Data from The Environmental Determinants of Diabetes in the Young (TEDDY) study (https://doi.org/10.58020/y3jk-x087) reported here will be made available on request through the NIDDK Central Repository (NIDDK-CR) website, Resources for Research (R4R), https://repository.niddk.nih.gov/.

Code availability

Statistical analysis code can be provided by the corresponding author upon reasonable request.


Acknowledgements

The authors would like to thank Sarah Austin-Gonzalez of the University of South Florida (USF)-Health Informatics Institute for editing and providing study information and support. We would also like to thank the reviewers for helping us to improve the manuscript.

Funding

The TEDDY Study is funded by U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, U01 DK124166, U01 DK128847, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and JDRF. This work is supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR002535). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Author information

Contributions

Conceptualization: LKM and JY. Methodology: LKM, KFL, and XL. Software: LKM. Formal analysis: LKM. Resources: JPK. Data curation: LKM, JY, and UMU. Writing—original draft preparation: LKM. Writing—review and editing: LKM, XL, KFL, JY, CAA, SH, JMN, SMV, LH, UMU, and JPK. Supervision: KFL, XL, and JPK. Project administration: UMU and JMN. Funding acquisition: JMN, UMU, and JPK. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Lazarus K. Mramba.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

The TEDDY study was conducted in accordance with the Declaration of Helsinki, and approved by local US Institutional Review Boards and European Ethics Committee Boards, including the Colorado Multiple Institutional Review Board, Medical College of Georgia Human Assurance Committee (2004–2010), Georgia Health Sciences University Human Assurance Committee (2011–2012), Georgia Regents University Institutional Review Board (2013–2015), Augusta University Institutional Review Board (2015–present), University of Florida Health Center Institutional Review Board, Washington State Institutional Review Board (2004–2012), Western Institutional Review Board (2013–present), Ethics Committee of the Hospital District of Southwest Finland, Bayerischen Landesärztekammer (Bavarian Medical Association) Ethics Committee, Regional Ethics Board in Lund, Section 2 (2004–2012), and Lund University Committee for Continuing Ethical Review (2013–present).

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mramba, L.K., Liu, X., Lynch, K.F. et al. Detecting potential outliers in longitudinal data with time-dependent covariates. Eur J Clin Nutr 78, 344–350 (2024). https://doi.org/10.1038/s41430-023-01393-6

  • DOI: https://doi.org/10.1038/s41430-023-01393-6
