Machine Learning in Health Research

Ensemble methods, mixture models, and semiparametric estimation for clinical and neuroimaging data

2026-06-07 18:16 PDT

Overview

Machine learning methods offer predictive capacity that classical regression models often cannot match when covariate spaces are high-dimensional, outcome distributions are complex, or the functional form of the association is unknown. In clinical research, however, the standards for model transparency, stability across sites, and interpretability are demanding: a model that predicts well on training data but cannot be explained to a clinician, replicated in a new cohort, or used to inform a treatment decision has limited scientific value.

This program applies and evaluates machine learning methods in clinical settings where those standards can be met – principally, ensemble methods for binary clinical outcomes with large multi-cohort validation, mixture models for MRI-derived biomarkers, and semiparametric density estimation for neuroimaging data.

Current work

  • Medications and MCI-to-AD progression: ensemble ML. Gradient boosting, random forests, and LASSO applied to time-varying medication features in ADCS, ADNI, and NACC cohorts, examining whether concomitant medication exposure predicts MCI-to-AD conversion. Four manuscripts (pooled plus three cohort-specific) near submission.

  • ADNI machine learning. Application of machine learning methods to ADNI neuroimaging and clinical data. Early-stage project establishing the analytic pipeline.

  • Mixture-normal models for MRI data. Normal mixture models applied to MRI-derived distributions, accommodating the multimodal structure that arises in MRI-based biomarker analysis.

  • Empirical characteristic function and Gaussian mixture estimation. Fourier-based semiparametric density estimation via the empirical characteristic function (ECF) and Gaussian mixture models (GMM), with application to neuroimaging data from ADNI.

  • MissForest for clinical data imputation. Evaluation of the missForest random-forest imputation algorithm on the high-dimensional, mixed-type data with missing values common in multi-site AD cohort studies.
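
To make the ECF item above concrete: for a sample x_1, ..., x_n the empirical characteristic function is the average of exp(i t x_j), and a univariate Gaussian mixture with weights w_k, means mu_k, and standard deviations s_k has characteristic function sum_k w_k exp(i t mu_k - s_k^2 t^2 / 2). One common way to fit or check a mixture is to compare the two over a grid of t values. The Python sketch below illustrates only that comparison, on simulated data. The function names, the grid, and the simulated parameters are all hypothetical and are not taken from the program's actual pipeline.

```python
import cmath
import random


def ecf(sample, t):
    """Empirical characteristic function: mean of exp(i*t*x) over the sample."""
    return sum(cmath.exp(1j * t * x) for x in sample) / len(sample)


def gmm_cf(t, weights, means, sds):
    """Characteristic function of a univariate Gaussian mixture."""
    return sum(
        w * cmath.exp(1j * t * m - 0.5 * (s * t) ** 2)
        for w, m, s in zip(weights, means, sds)
    )


def cf_distance(sample, weights, means, sds, grid):
    """Average squared modulus of (ECF - mixture CF) over a grid of t values."""
    total = sum(
        abs(ecf(sample, t) - gmm_cf(t, weights, means, sds)) ** 2 for t in grid
    )
    return total / len(grid)


# Simulated bimodal data (illustrative parameters, not real imaging values).
rng = random.Random(0)
true_w, true_mu, true_sd = [0.4, 0.6], [-2.0, 1.5], [0.7, 1.0]
labels = rng.choices([0, 1], weights=true_w, k=2000)
sample = [rng.gauss(true_mu[k], true_sd[k]) for k in labels]

# Hypothetical evaluation grid: t from -3 to 3 in steps of 0.1.
grid = [i / 10 for i in range(-30, 31)]

# Distance to the generating mixture versus a misspecified single normal.
d_true = cf_distance(sample, true_w, true_mu, true_sd, grid)
d_wrong = cf_distance(sample, [1.0], [0.0], [1.0], grid)
```

In this toy setup the distance to the generating mixture should be much smaller than the distance to the single standard normal. An estimator built on this idea would minimize such a distance over the mixture parameters, usually with a weight function on t; that choice is left open here.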

Methods

Gradient boosting (xgboost, gbm); random forests (randomForest, ranger); LASSO via glmnet; normal mixture models via mclust; empirical characteristic function estimation; Fourier-based nonparametric density estimation; multiple imputation via mice and single imputation via missForest; cross-validated performance evaluation (AUROC, Brier score) with site-stratified folds.
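
The evaluation step above can be sketched in a few lines of Python. This is a minimal illustration, not the program's code: it assumes that "site-stratified" means each fold keeps roughly the overall site mix (leave-one-site-out would be a different design), and the function names, site labels, and toy predictions are hypothetical.

```python
import random
from collections import defaultdict


def auroc(y_true, y_score):
    """AUROC as the Mann-Whitney statistic; tied scores count one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def site_stratified_folds(sites, k, seed=0):
    """Assign each record a fold in 0..k-1, dealing records round-robin within site.

    Assumption: stratification is by site only. Every site starts dealing at
    fold 0, so low-numbered folds are slightly larger when site sizes are not
    multiples of k.
    """
    rng = random.Random(seed)
    by_site = defaultdict(list)
    for idx, site in enumerate(sites):
        by_site[site].append(idx)
    fold_of = [None] * len(sites)
    for site in sorted(by_site):
        members = by_site[site]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            fold_of[idx] = position % k
    return fold_of


# Toy outcomes and predicted probabilities (made up for illustration).
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(y, p)
toy_brier = brier_score(y, p)

# Toy site labels: three hypothetical sites of unequal size.
sites = ["A"] * 6 + ["B"] * 9 + ["C"] * 12
folds = site_stratified_folds(sites, k=3)
```

AUROC measures discrimination only, while the Brier score also penalizes poorly calibrated probabilities, which is one reason to report both. In practice each fold's held-out predictions would feed these two functions, and the per-fold values would then be summarized.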

Software

  • zzlongplot – Visualization of longitudinal prediction trajectories and model performance across cohorts.
  • zztable1 – Descriptive tables for multi-site machine learning study populations.

Publications

The full publications list, filtered by machine-learning or predictive-modeling, gives the relevant publication record. It spans early neural-network and classification work from the 1990s through the current ensemble ML program.