Machine Learning in Health Research
Ensemble methods, mixture models, and semiparametric estimation for clinical and neuroimaging data
2026-06-07 18:16 PDT
Overview
Machine learning methods can offer predictive capacity that classical regression models often cannot match when covariate spaces are high-dimensional, outcome distributions are complex, or the functional form of the association is unknown. In clinical research, however, the standards for model transparency, stability across sites, and interpretability are demanding: a model that predicts well on training data but cannot be explained to a clinician, replicated in a new cohort, or used to inform a treatment decision has limited scientific value.
This program applies and evaluates machine learning methods in clinical settings where those standards can be met – principally, ensemble methods for binary clinical outcomes with large multi-cohort validation, mixture models for MRI-derived biomarkers, and semiparametric density estimation for neuroimaging data analysis.
Current work
Medications and MCI-to-AD progression: ensemble ML. Gradient boosting, random forests, and LASSO applied to time-varying medication features in ADCS, ADNI, and NACC cohorts, examining whether concomitant medication exposure predicts MCI-to-AD conversion. Four manuscripts (pooled plus three cohort-specific) near submission.
ADNI machine learning. Application of machine learning methods to ADNI neuroimaging and clinical data. Early-stage project establishing the analytic pipeline.
Mixture-normal models for MRI data. Normal mixture models applied to MRI-derived distributions, accommodating the multimodal structure that arises in MRI-based biomarker analysis.
Empirical characteristic function and Gaussian mixture estimation. Fourier-based semiparametric density estimation via the empirical characteristic function (ECF) and Gaussian mixture models (GMM), with application to neuroimaging data from ADNI.
MissForest for clinical data imputation. Evaluation of the missForest random-forest imputation algorithm for the high-dimensional, mixed-type missingness patterns common in multi-site AD cohort studies.
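As a rough illustration of the ECF and Gaussian mixture project described above, the sketch below computes the empirical characteristic function of a sample and compares it with the closed-form characteristic function of a Gaussian mixture over a grid of frequencies. It is written in Python with the standard library only, not with the R packages this program uses. The function names, the simulated bimodal data, the frequency grid, and the squared-distance criterion are all illustrative assumptions and do not represent the estimator applied to the ADNI data.

```python
import cmath
import random


def ecf(samples, t):
    """Empirical characteristic function: sample mean of exp(i * t * x)."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


def gmm_cf(t, weights, means, sds):
    """Characteristic function of a Gaussian mixture:
    sum_k w_k * exp(i * t * mu_k - 0.5 * sd_k^2 * t^2)."""
    return sum(
        w * cmath.exp(1j * t * m - 0.5 * (s * t) ** 2)
        for w, m, s in zip(weights, means, sds)
    )


def ecf_distance(samples, weights, means, sds, grid):
    """Average squared modulus of (ECF - mixture CF) over a grid of t values."""
    total = sum(
        abs(ecf(samples, t) - gmm_cf(t, weights, means, sds)) ** 2 for t in grid
    )
    return total / len(grid)


# Simulated bimodal sample (hypothetical values, not MRI data).
rng = random.Random(0)
weights, means, sds = [0.4, 0.6], [-2.0, 1.5], [0.7, 1.0]
labels = rng.choices([0, 1], weights=weights, k=2000)
samples = [rng.gauss(means[k], sds[k]) for k in labels]

# Frequency grid on [-3, 3]; the range and spacing are arbitrary choices.
grid = [0.1 * j for j in range(-30, 31)]

d_mixture = ecf_distance(samples, weights, means, sds, grid)
d_single = ecf_distance(samples, [1.0], [0.0], [1.0], grid)
print(f"distance to generating mixture: {d_mixture:.5f}")
print(f"distance to a single N(0, 1):   {d_single:.5f}")
```

A distance of this kind could serve as a fitting criterion by minimizing it over the mixture parameters. The choice of grid or weighting function materially affects such an estimator and is left unspecified here.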
Methods
Gradient boosting (xgboost, gbm); random forests (randomForest, ranger); LASSO via glmnet; normal mixture models via mclust; empirical characteristic function estimation; Fourier-based semiparametric density estimation; imputation via mice (multiple imputation by chained equations) and missForest (single random-forest imputation); cross-validated performance evaluation (AUROC, Brier score) with site-stratified folds.
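The evaluation metrics named above can be sketched in a few lines. The example below is in standard-library Python rather than the R packages listed, and computes AUROC through the Mann-Whitney pairwise comparison, the Brier score as mean squared error of predicted probabilities, and a fold assignment. It assumes that "site-stratified" means each site's observations are spread evenly across folds, which is one possible reading. The function names and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting tied scores as half a correct pair (Mann-Whitney form)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def site_stratified_folds(sites, n_folds, seed=0):
    """Return one fold index per observation, dealing each site's shuffled
    observations round-robin across folds (assumed meaning of the term)."""
    rng = random.Random(seed)
    by_site = defaultdict(list)
    for idx, site in enumerate(sites):
        by_site[site].append(idx)
    folds = [None] * len(sites)
    for site in sorted(by_site):
        idxs = by_site[site]
        rng.shuffle(idxs)
        for position, idx in enumerate(idxs):
            folds[idx] = position % n_folds
    return folds


# Toy outcomes and predicted probabilities (hypothetical).
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, p))        # 3 of the 4 positive/negative pairs are ordered correctly
print(brier_score(y, p))

# Cohort names reused as site labels purely for illustration.
sites = ["ADCS"] * 6 + ["ADNI"] * 6 + ["NACC"] * 6
folds = site_stratified_folds(sites, n_folds=3)
```

If the intended design is instead to hold out whole sites, the fold index would simply be a function of the site label, and the metrics would be computed per held-out site.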
Software
- zzlongplot – Visualization of longitudinal prediction trajectories and model performance across cohorts.
- zztable1 – Descriptive tables for multi-site machine learning study populations.
Publications
The full publications list, filtered by machine-learning or predictive-modeling, gives the relevant publication record, which spans early neural-network and classification work from the 1990s through the current ensemble ML program.