An Empirical Analysis of Imbalanced Data Classification

DOI: 10.5539/cis.v8n1p151

Shu Zhang; Samira Sadaoui; Malek Mouhoub

Abstract

SVM has been given top consideration for addressing the challenging problem of imbalanced data learning. Here, we conduct an empirical classification analysis of new UCI datasets that have different imbalance ratios, sizes and complexities. The experimentation consists of comparing the classification results of SVM with those of two other popular classifiers, Naive Bayes and the decision tree C4.5, to explore their pros and cons. To make the comparative experiments more comprehensive and to get a better idea of the learning performance of each classifier, we employ four performance metrics in total: Sensitivity, Specificity, G-means and time-based efficiency. For each benchmark dataset, we perform an empirical search of the learning model by training the three classifiers numerous times under different parameter settings and performance measurements. This paper reports the most significant results, i.e., the highest performance achieved by each classifier for each dataset. In summary, SVM outperforms the other two classifiers in terms of Sensitivity (or Specificity) for all the datasets, and is more accurate in terms of G-means when classifying large datasets.
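
As an illustration of the three accuracy-oriented metrics named above, the following minimal sketch computes them from binary labels. It assumes the commonly used definitions (Sensitivity as the true positive rate, Specificity as the true negative rate, and G-means as the geometric mean of the two); the abstract itself does not spell out the formulas, and the function name `imbalance_metrics` and the toy labels are hypothetical.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, g_mean) for binary labels.

    Assumed definitions (not taken from the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      g_mean      = sqrt(sensitivity * specificity)
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    return sensitivity, specificity, g_mean


# Toy imbalanced sample: 2 minority (positive) and 8 majority instances.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
sens, spec, g = imbalance_metrics(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} g_mean={g:.3f}")
```

On this toy sample, plain accuracy would be 80% even though half of the minority class is missed, which is why metrics of this kind are preferred for imbalanced data.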