Comparative Performance of Machine Learning Models for Diabetes Prediction Among US Adults Using NHANES Data


  •  Keshab Raj Dahal    
  •  Nirajan Budhathoki    
  •  Anuja Dahal    
  •  Shreya Dhital    

Abstract

This study compared the performance of six popular machine learning (ML) models for predicting diabetes mellitus among US adults using data from the 2017-March 2020 cycle of the National Health and Nutrition Examination Survey (NHANES). Data from 6,170 NHANES participants aged 20 years and older were analyzed. The target variable was self-reported diabetes status. A comprehensive set of predictors spanning demographics, health behaviors, and body measurements was considered. Descriptive statistics, accounting for NHANES’s complex sampling design, were presented to characterize the study population. Before model training, the data were preprocessed using min-max normalization, and the Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance. Six ML models, namely logistic regression (LR), k-nearest neighbors (kNN), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), and neural networks (NN), were trained and optimized using grid search with 10-fold cross-validation. Model performance was evaluated using sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUC). Statistical differences in performance metrics were tested using Cochran’s Q and pairwise McNemar’s tests. The RF model achieved the highest AUC (0.7559) and sensitivity (0.5370), while XGBoost showed the highest specificity (0.8654) and accuracy (0.7909). The NN model had consistent performance (AUC: 0.7482 ± 0.0054; sensitivity: 0.4828 ± 0.0493). Cochran’s Q test indicated significant differences in all metrics (sensitivity, specificity, accuracy, and AUC) across models (p-value < 0.001). Pairwise McNemar’s tests further confirmed multiple statistically significant differences between some pairs of models.
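
As a rough illustration of the evaluation steps named in the abstract, the sketch below computes sensitivity, specificity, and accuracy from binary predictions, then the Cochran's Q statistic and a McNemar test on per-participant correctness indicators. It is not the authors' code: the function names are hypothetical, the McNemar variant shown (chi-square with continuity correction) is an assumption because the abstract does not say which form was used, and `cochrans_q` returns only the statistic and its degrees of freedom, leaving the p-value lookup to a statistics library.

```python
import math


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = diabetes)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def cochrans_q(correct):
    """Cochran's Q statistic for a matrix of 0/1 outcomes.

    `correct[i][j]` is 1 if model j classified participant i correctly.
    Returns (Q, degrees_of_freedom); Q is compared against chi-square(k - 1).
    """
    k = len(correct[0])
    col_totals = [sum(row[j] for row in correct) for j in range(k)]
    row_totals = [sum(row) for row in correct]
    total = sum(row_totals)
    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        # Every participant was handled identically by all models.
        return 0.0, k - 1
    q = (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2) / denom
    return q, k - 1


def mcnemar(correct_a, correct_b):
    """McNemar chi-square test with continuity correction (assumed variant).

    Returns (statistic, p_value) using only the discordant pairs.
    """
    b = sum(1 for a, c in zip(correct_a, correct_b) if a == 1 and c == 0)
    c = sum(1 for a, c in zip(correct_a, correct_b) if a == 0 and c == 1)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

Both tests operate on correct/incorrect indicators for the same participants, which is why predictions from every model must be made on an identical test set before they can be compared this way.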



This work is licensed under a Creative Commons Attribution 4.0 License.