Optimal Algorithm for Metabolomics Classification and Feature Selection varies by Dataset

  •  Charles Jr    


Metabolomics, the systematic identification and quantification of all metabolites in a biological system, is increasingly applied to the identification of biomarkers for disease diagnosis, prognosis, and risk prediction. Applications of metabolomics extend across the health spectrum, including Alzheimer's disease, cancer, diabetes, and trauma. Numerous techniques exist for analyzing metabolomics datasets with the intent of classifying group membership (e.g., Control or Treated). These include Partial Least Squares Discriminant Analysis, Support Vector Machines, Random Forest, Regularized Generalized Linear Models, and Prediction Analysis for Microarrays. Each classification algorithm depends on different assumptions and can therefore lead to different conclusions. This project conducts an in-depth comparison of algorithm performance on both simulated and real datasets to determine which algorithms perform best for a given dataset structure. Three simulated datasets were generated to validate algorithm performance and mimic 'real' metabolomics data: (1) an independent null dataset (no correlation, no discriminatory variables), (2) a correlated null dataset (no discriminatory variables), and (3) a correlated discriminatory dataset. The comparison is also applied to three open-access datasets: two Nuclear Magnetic Resonance (NMR) datasets and one Mass Spectrometry (MS) dataset. Performance was evaluated with the Robustness-Performance Trade-off (RPT), which balances model classification accuracy against feature selection stability. We also provide a free, open-source R/Bioconductor package (OmicsMarkeR) that conducts the analyses described herein.
The proposed work provides an important advancement in metabolomics analysis and helps alleviate the confusion caused by potentially paradoxical analyses, thereby leading to improved exploration of disease states and identification of clinically important biomarkers.
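
The abstract names the RPT but does not state its formula. As a minimal sketch, assuming the form commonly used in the feature selection stability literature (a weighted harmonic mean of stability and performance, with a weight beta that is an assumption here, as are the function name and the example values), the score could be computed as follows. This is illustrative Python, not the OmicsMarkeR implementation.

```python
def rpt(stability: float, performance: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of feature selection stability and
    classification performance (assumed form of the RPT).

    beta = 1 weights both terms equally; the weighting convention for
    other beta values is an assumption of this sketch.
    """
    if not (0.0 <= stability <= 1.0 and 0.0 <= performance <= 1.0):
        raise ValueError("stability and performance must lie in [0, 1]")
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = beta ** 2 * stability + performance
    if denominator == 0.0:
        # Both inputs are zero, so the trade-off score is zero as well.
        return 0.0
    return (beta ** 2 + 1) * stability * performance / denominator


# Hypothetical (stability, accuracy) pairs for two models; not results
# from the study.
candidates = {"model_a": (0.60, 0.90), "model_b": (0.80, 0.80)}
scores = {name: rpt(s, p) for name, (s, p) in candidates.items()}
best = max(scores, key=scores.get)
```

With these made-up inputs, the more accurate but less stable model scores lower than the balanced one, which is the kind of trade-off the RPT is meant to expose when comparing algorithms.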

This work is licensed under a Creative Commons Attribution 4.0 License.