Benchmarking machine learning missing data imputation methods in large-scale mental health survey databases

Preethi Prakash1, Kelly Street2, Shrikanth Narayanan3, Bridget A. Fernandez4,5, Yufeng Shen6, Chang Shu7*
1 Department of Computer Science, Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, United States of America
2 Department of Population and Public Health Sciences, Division of Biostatistics, Keck School of Medicine, University of Southern California, Los Angeles, CA, United States of America
3 Viterbi School of Engineering, University of Southern California, Los Angeles, CA, United States of America
4 Department of Pediatrics, Division of Medical Genetics, Children’s Hospital Los Angeles and The Saban Research Institute, Los Angeles, CA, United States of America
5 Department of Pediatrics, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, United States of America
6 Departments of Systems Biology and Biomedical Informatics, and JP Sulzberger Columbia Genome Center, Columbia University Irving Medical Center, New York, NY, United States of America
7 Department of Population and Public Health Sciences, Center for Genetic Epidemiology, Division of Epidemiology and Genetics, Keck School of Medicine, University of Southern California, Los Angeles, CA, United States of America
AIH 2025, 2(1), 81–92
Submitted: 1 August 2024 | Revised: 17 September 2024 | Accepted: 14 October 2024 | Published: 7 November 2024
Databases built from mental and behavioral health surveys suffer from missing data when participants skip an entire survey, which reduces both data quality and sample size. We investigated these missing data patterns and evaluated imputation performance in Simons Foundation Powering Autism Research for Knowledge (SPARK), a large-scale autism cohort of over 117,000 participants. Four common methods were assessed: multiple imputation by chained equations (MICE), K-nearest neighbors (KNN), MissForest, and multiple imputation with denoising autoencoders (MIDAS). In a complete subset of 15,196 autism participants, three types of missingness patterns were simulated. MIDAS and KNN performed best as the random missingness rate increased and when blockwise missingness was simulated. The average computational times were 10 min each for MIDAS and KNN, 35 min for MissForest, and 290 min for MICE. MIDAS and KNN therefore both offer promising imputation performance for mental and behavioral health survey data that exhibit blockwise missingness patterns.
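
To make the simulation design concrete, the following is a minimal sketch of how random and blockwise missingness could be imposed on a complete data matrix and how an imputation could then be scored against the held-out true values. It is not the authors' pipeline. The toy data, the function names, the missingness rates, the choice of block columns, and the RMSE metric are all assumptions made for illustration. Column-mean imputation is used only as a stand-in so the example needs nothing beyond NumPy; it is not one of the four benchmarked methods, and any imputer (for example, a KNN-based one) could be substituted at that step. Only two of the three simulated patterns are sketched, because the abstract names only these two.

```python
import numpy as np


def mask_random(X, rate, rng):
    """Hide each cell independently with probability `rate` (random missingness)."""
    mask = rng.random(X.shape) < rate
    X_miss = X.astype(float).copy()
    X_miss[mask] = np.nan
    return X_miss, mask


def mask_blockwise(X, row_rate, block_cols, rng):
    """Hide a whole block of columns (one 'survey') for a random subset of rows."""
    rows = np.flatnonzero(rng.random(X.shape[0]) < row_rate)
    mask = np.zeros(X.shape, dtype=bool)
    mask[np.ix_(rows, block_cols)] = True
    X_miss = X.astype(float).copy()
    X_miss[mask] = np.nan
    return X_miss, mask


def impute_column_mean(X_miss):
    """Stand-in imputer (assumption): fill each gap with its column's observed mean."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)


def rmse_on_masked(X_true, X_imputed, mask):
    """Score the imputation only on the cells that were artificially hidden."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
# Toy "complete subset": 200 participants, 12 items scored 0-4 (hypothetical data).
X = rng.integers(0, 5, size=(200, 12)).astype(float)

X_rand, m_rand = mask_random(X, rate=0.2, rng=rng)
# Treat the last four items as one survey that some participants skipped entirely.
X_block, m_block = mask_blockwise(X, row_rate=0.3, block_cols=[8, 9, 10, 11], rng=rng)

for name, X_miss, mask in [("random", X_rand, m_rand), ("blockwise", X_block, m_block)]:
    X_hat = impute_column_mean(X_miss)
    print(name, "RMSE:", rmse_on_masked(X, X_hat, mask))
```

Because the masks are applied to a subset with no real missing values, the hidden entries are known exactly, which is what allows different imputation methods to be compared under each pattern.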

Missing data
Mental health survey
Imputation methods
Machine learning
This work is supported by a Southern California Environmental Health Sciences Center pilot grant from NIH/NIEHS, grant number P30ES007048 (Rob McConnell); the Tobacco-Related Disease Research Program, grant number T32IR5216 (Xuejuan Jiang); and NIH/NIA, grant number 1RF1AG076124-01A1 (Hussein Yassine).
Conflict of interest
The authors declare that they have no conflicts of interest.
