Ranking Normalization Methods for Improving the Accuracy of SVM Algorithm by DEA Method

Data mining techniques, which extract patterns from large databases, have become widespread in all aspects of life. One of the most important data mining tasks is classification, a widely studied topic in many disciplines, including statistics, artificial intelligence, operations research, computer science, and data mining and knowledge discovery. Before a classification algorithm is applied, the data should be preprocessed, which can improve the accuracy of the algorithm. Preprocessing comprises various methods, one of which is normalization. In this paper, we selected five applicable normalization methods, normalized the selected data sets, and calculated the accuracy of the classification algorithm before and after normalization. The SVM algorithm was used for classification because it operates in an n-dimensional space, so normalizing the data sets is expected to improve its results. Finally, Data Envelopment Analysis (DEA) is used to rank the normalization methods. We used four data sets to rank the normalization methods by the increase in accuracy they yield, and then used DEA and the AP-model to rank these methods.


Introduction
Data mining and knowledge discovery (DMKD) has made significant progress during the past two decades (Peng et al., 2008). It utilizes methods, algorithms, and techniques from many disciplines, including statistics, databases, machine learning, pattern recognition, artificial intelligence, data visualization, and optimization (Fayyad, 1996).
In recent years, the field of data mining has seen an explosion of interest from both academia and industry (e.g., Olafson, Li, & Wu, 2008). The growing volume of data, the increasing awareness that the human brain is inadequate for processing such data, and the increasing affordability of machine learning are reasons for the growing popularity of data mining (e.g., Marakas, 2004). Data mining (DM) is the process of automatically discovering high-level knowledge by obtaining information from real data.
One of the major tasks in DMKD is classification. Researchers in a variety of fields have created a large number of classification algorithms, such as decision trees, neural networks, Bayesian networks, linear logistic regression, Naive Bayes, and K-nearest-neighbor.
Learning algorithms are now used in many domains, and different performance metrics suit different domains. For example, precision/recall measures are used in information retrieval, medicine prefers ROC area, and lift is appropriate for some marketing tasks.
Different performance metrics measure different trade-offs in the predictions made by a classifier, and a learning method may perform well on one metric but be suboptimal on others. It is therefore important to evaluate algorithms on a broad set of performance metrics (Caruana & Niculescu-Mizil, 2006).
Classification is an important and widely studied topic in many disciplines, including statistics, artificial intelligence, operations research, computer science, and data mining and knowledge discovery (Chen, Xu, & Chi, 1999).
Based on the number of predefined groups, classification can be divided into binary and multiclass classification. Binary classification assigns data objects to one of two groups, whereas multiclass classification involves three or more groups. Compared with binary classification, the multiclass classification problem is more complex.
With increased use in real-world applications, such as disease diagnosis, text categorization, credit analysis, software risk management, and network intrusion detection, a variety of methods and algorithms have been developed for multiclass classification in recent years.
Chen et al. scrutinized the reason why the mixture of experts (ME) performs poorly in multiclass classification and proposed an approximation for the Newton-Raphson algorithm to improve the performance of the ME architecture in multiclass classification. Platt et al. (1999) presented the decision directed acyclic graph architecture and a learning algorithm for multiclass classification. Allwein et al. (2000) proposed a unifying framework for multiclass classification using a margin-based binary learning algorithm. Crammer and Singer (2001) described the algorithmic implementation of multiclass kernel-based vector machines and conducted experiments to compare the presented approach to previously studied kernel-based methods. Loucopoulos proposed a mixed-integer programming model for the minimization of misclassification costs in the three-group problem.
Thorsten Joachims (1997) published results on a set of binary text classification experiments using the Support Vector Machine. The SVM yielded lower error than many other classification techniques. Yang and Liu (1999) followed two years later with experiments of their own on the same data set. They used improved versions of Naive Bayes (NB) and k-nearest neighbors (KNN) but still found that the SVM performed at least as well as all other classifiers they tried.
Nowadays normalization methods are widely applied to Multiple Criteria Decision Making (MCDM) problems. There are different methods for normalization, and they are used to obtain concise answers. The normalization methods used in the experimental study are taken from (Hwang & Yoon, 1981; Milani, Shanian, & Madoliat, 2005; Yoon & Hwang, 1995).
Ranking normalization methods normally requires examining several criteria, such as accuracy, computational time, and misclassification rate. Therefore, algorithm selection can be modeled as a multiple criteria decision making (MCDM) problem (Peng et al., 2009; Rokach & Ensemble, 2009).
As mentioned above, algorithm ranking is a useful strategy for selecting the appropriate classifier, and the preferences of users are important in algorithm ranking (Berrer, Paterson, & Keller, 2000). Some existing MCDM methods are able to rank classifiers based on multiple performance measures and to take the preferences of users into account in the ranking process.
DEA is a non-parametric, linear-programming-based technique for measuring the relative efficiency of a set of similar units, usually referred to as decision making units (DMUs). Because of its successful application, DEA has gained considerable attention and is widely used by business and academic researchers.
The rest of this paper is organized as follows. Section II describes normalization, SVM, and DEA. The design of the experimental study is provided in Section III. Section IV presents and analyzes the experimental results. Section V summarizes the findings and discusses future research directions.

Normalization
Normalization methods and distance measures are also taken into consideration. Several of them are widely used, and from the literature we selected five popular methods, such as linear normalization (I). In their definitions, $x_j^*$ is the most favorable value and $\sigma_j$ is the standard deviation of alternative ratings with respect to the $j$th attribute (Hwang & Yoon, 1981; Milani, Shanian, & Madoliat, 2005; Yoon & Hwang, 1995).
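As an illustration only, the following minimal sketch implements four normalization schemes that are commonly described in the MCDM literature: vector normalization, linear normalization by the most favorable value, min-max normalization, and z-score normalization. The function names and the sample column are hypothetical, and it is an assumption that these schemes correspond to the methods ranked in this paper.

```python
import math

# Each function normalizes one attribute column (a list of floats).
# Assumption: the column is not constant and, for the ratio-based
# schemes, contains positive values with "larger is better".

def vector_norm(col):
    """Divide each value by the Euclidean length of the column."""
    length = math.sqrt(sum(v * v for v in col))
    return [v / length for v in col]

def linear_max_norm(col):
    """Divide each value by the most favorable (here: largest) value."""
    best = max(col)
    return [v / best for v in col]

def min_max_norm(col):
    """Map the smallest value to 0 and the largest value to 1."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def z_score_norm(col):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(col) / len(col)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / sigma for v in col]

column = [2.0, 4.0, 6.0, 8.0]  # hypothetical attribute values
for name, fn in [("vector", vector_norm), ("linear max", linear_max_norm),
                 ("min-max", min_max_norm), ("z-score", z_score_norm)]:
    print(name, [round(v, 3) for v in fn(column)])
```

Each scheme is applied per attribute, so a full data set would be normalized column by column before it is passed to the classifier.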

Support Vector Machine (SVM)
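The Support Vector Machine is a classifier, originally proposed by Vapnik, that finds a maximal-margin separating hyperplane between two classes of data (e.g., Christopher, 1998). An SVM is trained via the following optimization problem:
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i$$
with constraints:
$$y_i (x_i \cdot w + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad \text{for all } i,$$
where $x_i$ is the $i$th training example, $y_i \in \{-1, +1\}$ is its label, the $\xi_i$ are slack variables, and $C$ is the regularization parameter. For more information, see Burges' tutorial and Cristianini and Shawe-Taylor's book (e.g., Nello & John, 2000; Yang & Liu, 1999). There are non-linear extensions to the SVM, but Yang and Liu found the linear kernel to outperform non-linear kernels in text classification (e.g., Ryan, 2000).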
In informal experiments, we also found that the linear kernel performs at least as well as non-linear kernels. Hence, we only present linear SVM results. We use the SMART ltc transform; the SvmFu package is used for running experiments (e.g., Charnes, Cooper, & Rhodes, 1978).
The SVM must read in the training set and then perform a quadratic optimization. This can be done quickly when the number of training examples is small (e.g., <10000 documents), but can be a bottleneck on larger training sets. We realize speed improvements with chunking and by caching kernel values between the training of binary classifiers.
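To make the training objective of a linear SVM concrete, the sketch below minimizes the soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i (w \cdot x_i + b))$ with plain batch subgradient descent. This is an illustrative stand-in and not the quadratic-programming solver, chunking, or kernel caching mentioned above; the function names, learning rate, epoch count, and toy data are all assumptions.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum of hinge losses."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)   # gradient of the regularizer 0.5*||w||^2 is w
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this example
                for j in range(dim):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable one-dimensional toy data.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

Because the hinge term depends on the scale of each attribute, rescaling the columns of `X` changes which examples violate the margin, which is one reason normalization can affect SVM accuracy.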

Data Envelopment Analysis (DEA)
Charnes, Cooper, and Rhodes developed data envelopment analysis (DEA) to evaluate the efficiency of decision making units (DMUs) by identifying the efficiency frontier and comparing each DMU with the frontier (Koksalan & Tuncer, 2009). Since DEA is able to estimate efficiency with minimal prior assumptions (Li & Ma, 2008; Cherchye & Post, 2003), it has a comparative advantage over approaches that require a priori assumptions, such as standard forms of statistical regression analysis (Cooper, 2004).
During the past thirty years, various DEA extensions and models have been developed and have established themselves as powerful analytical tools (Banker, Charnes, & Cooper, 1984).
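The original DEA model presented by Charnes, Cooper, and Rhodes (Koksalan & Tuncer, 2009) is called the "CCR ratio model", which uses the ratio of outputs to inputs to measure the efficiency of DMUs. Assume that there are $n$ DMUs, each using $m$ inputs to produce $s$ outputs. Let $x_{ij}$ and $y_{rj}$ represent the amount of input $i$ and output $r$ for DMU $j$ ($j = 1, 2, \ldots, n$), respectively. Then the ratio form of DEA can be represented as:
$$\max \; \frac{\sum_{r=1}^{s} u_r y_{ro}}{\sum_{i=1}^{m} v_i x_{io}}$$
$$\text{subject to} \quad \frac{\sum_{r=1}^{s} u_r y_{rj}}{\sum_{i=1}^{m} v_i x_{ij}} \le 1, \quad j = 1, 2, \ldots, n; \qquad u_r, v_i \ge 0 \text{ for all } i \text{ and } r,$$
where the $u_r$'s and the $v_i$'s are the variables and the $y_{ro}$'s and $x_{io}$'s are the observed output and input values of the DMU to be evaluated (i.e., DMU $o$), respectively. The equivalent linear programming problem using the Charnes-Cooper transformation is (Banker, Charnes, & Cooper, 1984):
$$\max \; \sum_{r=1}^{s} u_r y_{ro}$$
$$\text{subject to} \quad \sum_{i=1}^{m} v_i x_{io} = 1; \qquad \sum_{r=1}^{s} u_r y_{rj} - \sum_{i=1}^{m} v_i x_{ij} \le 0, \quad j = 1, 2, \ldots, n; \qquad u_r, v_i \ge 0 \text{ for all } i \text{ and } r.$$
Banker, Charnes, and Cooper introduced the BCC model by adding the constraint $\sum_{j=1}^{n} \lambda_j = 1$ to the CCR model, where the $\lambda_j$ are the DMU weights of its envelopment (dual) form (Nakhaeizadeh & Schnabl, 1997). These models can be solved with the simplex method for each DMU. DMUs with a value of 1 are efficient and the others are inefficient. Nakhaeizadeh and Schnabl proposed using the DEA approach for the selection of data mining algorithms (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).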
They argued that, for an objective evaluation of data mining algorithms, all the available positive and negative properties of the algorithms are important, and that DEA models are able to take both aspects into consideration. Positive and negative properties of data mining algorithms can be considered as output and input components in DEA, respectively.
For example, the overall accuracy rate of a classification algorithm is an output component and the computation time of an algorithm is an input component. Using existing DEA models, it is possible to give a comprehensive evaluation of data mining algorithms.
In the empirical study, every DMU has a single input component fixed at 1. The output components are the accuracies of the SVM algorithm obtained with each normalization method, and the DMUs (decision making units) are the normalization methods.
After solving this LP and determining the weights, the algorithms with an efficiency value of 1 (100%) are efficient algorithms and form the efficiency frontier or envelope.
The other algorithms do not belong to the efficiency frontier and remain outside of it. As already mentioned, the definition of efficiency is more general than interestingness as suggested by Fayyad et al. (1993).
One can also connect both concepts by saying that more efficient algorithms are more interesting. For ranking the algorithms, one can use the approach suggested by Andersen and Petersen (2008), the AP-model, which uses a criterion that we call the AP-value.
In input-oriented models, the AP-value measures how much an efficient algorithm can radially enlarge its input levels while still remaining efficient (the output-oriented case is analogous). For example, for an input-oriented method, an AP-value of 1.5 means that the algorithm still remains efficient when its input values are all enlarged by 50%. If the algorithm is inefficient, then the AP-value is equal to the efficiency value. In this paper, the CCR and BCC models are utilized to rank the normalization methods.
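The AP-value can be illustrated in the special case of one input and one output, where the input-oriented CCR efficiency reduces to a ratio of productivities and no LP solver is needed. The sketch below is a simplification under that assumption; the general multi-input, multi-output case requires solving a linear program per DMU. The function names and the data are hypothetical.

```python
def ccr_efficiency(inputs, outputs, o):
    """CCR efficiency of DMU o when every DMU has one input and one output."""
    ratios = [y / x for x, y in zip(inputs, outputs)]
    return ratios[o] / max(ratios)

def ap_value(inputs, outputs, o):
    """AP-value of DMU o: it is compared with all *other* DMUs only."""
    ratios = [y / x for x, y in zip(inputs, outputs)]
    others = ratios[:o] + ratios[o + 1:]
    return ratios[o] / max(others)

# Hypothetical DMUs: a constant input of 1 and a single accuracy-like output.
inputs = [1.0, 1.0, 1.0]
outputs = [0.90, 0.60, 0.45]

for o in range(len(inputs)):
    print(o, round(ccr_efficiency(inputs, outputs, o), 3),
          round(ap_value(inputs, outputs, o), 3))
```

In this toy example the first DMU is the only efficient one and its AP-value is about 1.5: its input could grow by 50% before the second DMU catches up. For the two inefficient DMUs the AP-value coincides with the CCR efficiency, as stated above.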

Experimental Study
The experiment is designed to rank normalization methods with the SVM classification algorithm using the DEA model described in the previous section. The following subsections describe the performance measures, data sources, and experimental design.

Performance Measures
There is an extensive number of performance measures for classification. Commonly used performance measures in software defect classification are accuracy, precision, recall, F-measure, the area under the receiver operating characteristic curve (AUC), and mean absolute error (Elish & Elish, 2008; Lessmann et al., 2008; Mair et al., 2000; Han & Kamber, 2006).
Besides these popular measures, this work includes seven other classification measures. The following paragraphs briefly describe these measures.
Overall accuracy: Accuracy is the percentage of correctly classified modules (Baeza-Yates & Ribeiro-Neto, 1999). It is one of the most widely used classification performance metrics.

$$\text{Overall accuracy} = \frac{TN + TP}{TP + FP + FN + TN}$$
True positive (TP): TP is the number of correctly classified fault-prone modules. The TP rate measures how well a classifier can recognize fault-prone modules. It is also called the sensitivity measure.
F-measure: It is the harmonic mean of precision and recall. The F-measure has been widely used in information retrieval (Ferri, Hernandezorallo, & Modroiu, 2009).

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
AUC: ROC stands for Receiver Operating Characteristic, which shows the tradeoff between TP rate and FP rate (Ferri, Hernandezorallo, & Modroiu, 2009). AUC represents the accuracy of a classifier. The larger the area, the better the classifier.
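The overall accuracy and the F-measure can be computed directly from the four confusion-matrix counts. The following is a minimal sketch for the binary case; the function names and the counts are hypothetical, and precision and recall use their usual definitions, TP / (TP + FP) and TP / (TP + FN).

```python
def overall_accuracy(tp, fp, fn, tn):
    """Share of all modules that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of true positives that were found (TP rate, sensitivity)."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30
print(overall_accuracy(tp, fp, fn, tn))
print(precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn))
```

With these counts the accuracy is 0.7 while the F-measure is roughly 0.73, which shows that the two measures need not agree on the same predictions.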
Kappa statistic (KapS): This is a classifier performance measure that estimates the similarity between the members of an ensemble in multi-classifier systems (Witten & Frank, 2005).
$$KapS = \frac{P(A) - P(E)}{1 - P(E)}$$
P(A) is the accuracy of the classifier and P(E) is the probability that agreement among classifiers is due to chance:
$$P(E) = \frac{1}{m^2} \sum_{k=1}^{c} \left( \sum_{i=1}^{m} f(i, k) \right) \left( \sum_{i=1}^{m} C_\theta(i, k) \right)$$
Here $m$ is the number of modules and $c$ is the number of classes. $f(i, j)$ is the actual probability of module $i$ to be of class $j$, so $\sum_{i=1}^{m} f(i, j)$ is the number of modules of class $j$. Given a threshold $\theta$, $C_\theta(i, j)$ is 1 if $j$ is the predicted class for $i$ obtained from $P(i, j)$; otherwise it is 0 (Triantaphyllou & Baig, 2005).
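For a single classifier with hard predictions, the kappa statistic can be computed from label counts alone. The sketch below assumes that P(A) is the observed accuracy and that P(E) is the usual chance agreement, that is, the sum over classes of (actual count × predicted count) divided by $m^2$. The function name and the labels are hypothetical.

```python
from collections import Counter

def kappa_statistic(actual, predicted):
    """(P(A) - P(E)) / (1 - P(E)) for two equally long label sequences."""
    m = len(actual)
    # P(A): observed agreement between actual and predicted labels.
    p_a = sum(a == p for a, p in zip(actual, predicted)) / m
    # P(E): agreement expected by chance from the class frequencies.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    p_e = sum(actual_counts[k] * predicted_counts[k]
              for k in actual_counts) / (m * m)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical labels for four modules.
actual = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
print(kappa_statistic(actual, predicted))
```

Here three of four labels agree, but half of that agreement is expected by chance, so the statistic is lower than the raw accuracy of 0.75.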
Mean absolute error (MAE): This measures how much the predictions deviate from the true probability:
$$MAE = \frac{1}{m \cdot c} \sum_{j=1}^{c} \sum_{i=1}^{m} \left| f(i, j) - P(i, j) \right|$$
$P(i, j)$ is the estimated probability of module $i$ to be of class $j$, taking values in [0, 1] (Triantaphyllou & Baig, 2005).
As the name suggests, the mean absolute error is an average of the absolute errors $|e_i| = |f_i - y_i|$, where $f_i$ is the prediction and $y_i$ the true value. Note that alternative formulations may include relative frequencies as weight factors. The mean absolute error is one of a number of ways of comparing forecasts with their eventual outcomes.
Well-established alternatives are the mean absolute scaled error (MASE) and the mean squared error. These all summarize performance in ways that disregard the direction of over- or under-prediction; a measure that does place emphasis on this is the mean signed difference.
Where a prediction model is to be fitted using a selected performance measure, in the sense that the least squares approach is related to the mean squared error, the equivalent for mean absolute error is least absolute deviations.
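The difference between the mean absolute error and the mean signed difference can be seen in a short sketch. The function names and the sample values are hypothetical; both functions average over paired predictions $f_i$ and true values $y_i$.

```python
def mean_absolute_error(predictions, targets):
    """Average of |f_i - y_i|; ignores the direction of the error."""
    return sum(abs(f - y) for f, y in zip(predictions, targets)) / len(targets)

def mean_signed_difference(predictions, targets):
    """Average of (f_i - y_i); over- and under-predictions can cancel."""
    return sum(f - y for f, y in zip(predictions, targets)) / len(targets)

# Hypothetical predicted probabilities and true values.
predictions = [0.9, 0.2, 0.6, 0.4]
targets = [1.0, 0.0, 1.0, 0.0]
print(mean_absolute_error(predictions, targets))
print(mean_signed_difference(predictions, targets))
```

In this example the absolute errors average to about 0.275, whereas the signed differences largely cancel and average to about 0.025.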


Training time: the time needed to train a classification algorithm or ensemble method.


Test time: the time needed to test a classification algorithm or ensemble method.

Data Sources
The data used in this study are 6 public-domain data sets from four application domains, including Iris, Glass, Breast Cancer, and Breast Cancer Wisconsin, which are provided by the UCI machine learning repository (http://archive.ics.uci.edu/ml/).
The Iris is a three-class (Iris Setosa, Iris Versicolour, Iris Virginica) data set that has 150 instances with 4 continuous predictor variables and 1 class variable. The predictor variables describe the sepal length, sepal width, petal length, and petal width of iris plants, and the class variable indicates the type of iris plant.
The Breast Cancer is a two-class (no-recurrence-events, recurrence-events) data set that has 699 instances with 9 continuous predictor variables and 1 class variable. The predictor variables describe the Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, and Mitoses.
The Breast Cancer Wisconsin is a two-class (R = recur, N = nonrecur) data set that has 198 instances with 10 continuous predictor variables and 1 class variable. The predictor variables describe the radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter^2 / area - 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry, and fractal dimension ("coastline approximation" - 1).
The Cardiotocography is a three-class (N = normal; S = suspect; P = pathologic) data set that has 15888 instances.
The Wall-Following Robot Navigation is a four-class (Move-Forward, Slight-Right-Turn, Sharp-Right-Turn, Slight-Left-Turn) data set that has 5456 instances with 24 continuous predictor variables and 1 class variable.

Experimental Design
The experiment was carried out according to the following process:
Input: the four selected data sets.
Output: ranking of normalization techniques.
Step 1: Prepare the target data sets: select and transform relevant features; data cleaning; data integration.
Step 3: Normalize the data sets with the five methods.
Step 5: Rank the normalization methods with the DEA method, based on the accuracies obtained from the SVM classification algorithm.
Step 6: If two or more normalization methods obtain an efficiency of 1 from the DEA model, rank the normalization methods with the A-P model of DEA.

END
The normalization method selection problem involves benefit criteria. A criterion is called a benefit criterion because the higher a normalization method scores in terms of that criterion, the better the method is (Abbas, Morteza, & Karimi, 2011).

Discussion of Results
Table 1 shows the classification results of the 4 data sets using the SVM classification algorithm before the normalization methods are applied.

Conclusions and Future Work
In this paper we implemented different normalization methods as a preprocessing step to improve the SVM algorithm.
After testing the accuracy of the SVM method on various data sets, we concluded that these simple methods require little processing and also increase the accuracy of the SVM algorithm. For ranking these methods we used DEA, which is one of the most practical decision-making techniques.
We suggest that future researchers apply normalization methods to more data sets and also use other decision-making techniques, such as TOPSIS and VIKOR (Peng et al., 1999), for ranking them. In addition, we suggest testing these methods on other classification algorithms, especially those based on an n-dimensional space such as SMO, to see how their accuracy improves. Future studies should also consider time complexity.
The Cardiotocography data set has 22 continuous predictor variables and 1 class variable. The predictor variables describe: LB (FHR baseline, beats per minute); AC (number of accelerations per second); FM (number of fetal movements per second); UC (number of uterine contractions per second); DL (number of light decelerations per second); DS (number of severe decelerations per second); DP (number of prolonged decelerations per second); ASTV (percentage of time with abnormal short term variability); MSTV (mean value of short term variability); ALTV (percentage of time with abnormal long term variability); MLTV (mean value of long term variability); Width (width of FHR histogram); Min (minimum of FHR histogram); Max (maximum of FHR histogram); Nmax (number of histogram peaks); Nzeros (number of histogram zeros); Mode (histogram mode); Mean (histogram mean); Median (histogram median); Variance (histogram variance); Tendency (histogram tendency); and CLASS (FHR pattern class code, 1 to 10).

Table 1. Classification results before normalization methods

Table 2. Classification results after normalization methods

Table 3 presents the ranking results generated by DEA. Because the efficiency of all normalization methods is 1, all normalization methods are efficient, and they must therefore be ranked with the A-P model. Table 4 presents the ranking results generated by the A-P model and their final evaluation.

Table 4. Ranking results generated by the A-P model