A Wrapper-Based Combined Recursive Orthogonal Array and Support Vector Machine for Classification and Feature Selection

In data mining, classification problems are among the most frequently discussed issues. Feature selection is a very important pre-processing function in the vast majority of classification cases. Its aim is to delete irrelevant or redundant features in order to reduce the feature dimension and computing complexity and increase the accuracy of classification. Current feature selection methods can be roughly divided into the filter method and the wrapper method. The former chooses the feature subset before classifying, whereas the latter chooses the feature subset during the classification procedure. In general, wrapper methods result in better performance than filter methods, but they are time-consuming. This paper therefore proposes a wrapper method called OA-SVM that uses an orthogonal array (OA) to make systemic rules of feature selection and uses support vector machine (SVM) as the classifier. The proposed OA-SVM is employed to test eight UCI databases for the classification problem. The results of these experiments verify that the proposed OA-SVM for feature selection can effectively delete irrelevant or redundant features, thereby increasing classification accuracy.


Introduction
With the rapid progress of technology development, access to huge databases and their management is an issue that many enterprises are likely to face. Data mining techniques have consequently become some of the most important applications in recent years for solving this issue. The main purpose of data mining is to discover and analyze the useful information from large databases, to provide a reference for managers or decision makers. In general, data mining's more commonly used capabilities are classification, clustering, affinity grouping, and prediction. Among those, classification problems are widely encountered in many fields. Classification, which is a type of supervised learning, uses a known training set to establish a prediction model for the categorization of data of an unknown class.
In practical applications, data is usually pre-processed before establishing a prediction model, and this process is often referred to as feature selection. Data usually contains a large number of features, but not every feature is useful for predicting the classification target. Removing irrelevant or redundant features, while ensuring that the accuracy of the target concept and the desired information are not affected, may significantly simplify a complex operation and increase efficiency (John, Kohavi, & Pfleger, 1994). Thus, feature selection is our focus in this paper.
In order to increase accuracy and reduce the computing time, feature selection methods and data classification technology constitute the two major steps for classification problems. Many scholars have proposed different feature selection algorithms to improve the accuracy of classification, but the use of different methods on the same problem might produce different degrees of accuracy and efficiency. Thus, the choice of method is an important issue when determining how to address a particular problem. This study proposes a wrapper method that uses an orthogonal array (OA, a statistical method) as a feature selection technique and support vector machine (SVM) for classification. The proposed method establishes a systematic rule for the selection of the feature subset to significantly reduce the computing time and increase classification accuracy. This paper is organized as follows. Section 2 introduces the concept of feature selection and briefly reviews some feature selection methods. In Section 3, the basic concepts of SVM and the OA are presented. The ROA-SVM is proposed to solve the feature selection problem for classification in Section 4. In Section 5, the wine (recognition) dataset adapted from UCI is used to show how to implement the proposed ROA-SVM. Comparisons based on benchmark data listed in the UCI demonstrate the effectiveness of the proposed ROA-SVM in Section 6. Finally, the conclusion and suggestions for future research are presented in Section 7.

Feature Selection Methods
The main purpose of feature selection is to delete irrelevant or redundant variables and reduce space dimensions. Although an exhaustive search method is able to find the best feature subset, it is usually unrealistic and costly. Many heuristic or random methods, called feature selection methods, have been proposed by scholars to solve the above issues. Dash and Liu (1997) summarized a typical feature selection method in four steps, as shown in Figure 1.


Generation procedure: A procedure generates the feature subset which is evaluated in the next step.


Evaluation function: Evaluate the feature subset and generate a goodness (such as accuracy) to determine the candidate feature.


Stopping criterion: A criterion is used to decide when to stop the process to prevent an exhaustive search from taking place.


Validation process: The stopping criterion is usually the last step of a feature selection process; however, a validation procedure is necessary to compare the result with those of other feature selection methods and show that the proposed method is valid.
Figure 1. Feature selection process with validation (Dash & Liu, 1997)
In general, there are two kinds of feature selection methods: filter and wrapper methods (Blum & Langley, 1997). Filter methods select the feature subsets by analyzing distance, information, and other intrinsic measures of the data. Because filter methods do not rely on classification technology, the advantage of these methods is that calculation is simple and fast. The main disadvantage is that the interaction between the selected feature subset and the classifier is ignored. Rokach et al. (2007) divided the filter method into the ranker method and the non-ranker method. The ranker method evaluates the features by a given measure and ranks them, whereas the non-ranker method only generates the feature subset without ranks. The filter method is illustrated in Figure 2.
Figure 2. Filter method flow chart (Mladenić, 2006)
The wrapper method uses the classifier directly to select features. This method therefore combines the feature selection method and classification technology. The pros and cons of the wrapper methods are opposite to those of the filter methods. Wrapper methods are usually computationally expensive and costly, but they demonstrate better performance than the filter methods (Zhu, Ong, & Dash, 2007). The wrapper method is illustrated in Figure 3.
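To make the wrapper idea concrete, the following is a minimal sketch of a generic wrapper loop built from the generation, evaluation, and stopping steps described above. It uses a simple greedy forward search, which is not the method proposed in this paper. The names `wrapper_forward_select` and `toy_accuracy` are hypothetical, and `toy_accuracy` merely stands in for a classifier-based accuracy estimate.

```python
from typing import Callable, FrozenSet, Sequence, Tuple


def wrapper_forward_select(
    features: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy forward wrapper: grow the subset while the score improves."""
    selected: FrozenSet[str] = frozenset()
    best = float("-inf")
    while True:
        # Generation: every subset that adds one unused feature.
        candidates = [selected | {f} for f in features if f not in selected]
        # Evaluation: score each candidate with the classifier-based function.
        round_best, round_subset = best, None
        for subset in candidates:
            score = evaluate(subset)
            if score > round_best:
                round_best, round_subset = score, subset
        # Stopping criterion: no candidate improves on the current subset.
        if round_subset is None:
            return selected, best
        selected, best = round_subset, round_best


def toy_accuracy(subset: FrozenSet[str]) -> float:
    """Hypothetical score: rewards features 'a' and 'c', penalizes the rest."""
    relevant = {"a", "c"}
    return len(subset & relevant) - 0.1 * len(subset - relevant)


subset, score = wrapper_forward_select(["a", "b", "c", "d"], toy_accuracy)
```

In a real wrapper, `evaluate` would train and test the classifier on the candidate subset, which is why wrapper methods are more expensive than filter methods.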

Introduction of SVM and OA
The proposed ROA-SVM is based on the OA and SVM. Section 3.1 introduces the SVM for the classification method by illustrating the basic idea behind SVMs based on the linear model. The concept of OA will be introduced in Section 3.2.

SVM
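SVMs (Vapnik, 1995, 1998) have been proven to give excellent performance in binary classification cases. Let $X_i=(x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$ be the ith training data, and $y_i \in \{-1, 1\}$ denote its class label for $i=1,2,\ldots,n$. A hyper-plane can be written in the following form:
$$W^T X + b = 0 \tag{1}$$
such that (as shown in Figure 4)
$$W^T X_i + b \ge 1 \quad \text{for } y_i = 1 \tag{2}$$
$$W^T X_i + b \le -1 \quad \text{for } y_i = -1 \tag{3}$$
where $W$ is normal to the hyper-plane, $|b|/\|W\|$ is the perpendicular distance from the hyper-plane to the origin, and $\|W\|$ is the Euclidean norm of $W$.
The above two equations can be combined and rewritten as
$$y_i (W^T X_i + b) - 1 \ge 0, \quad i=1,2,\ldots,n \tag{4}$$
The purpose of SVM is to find $W$ and $b$ in Equation (1) to maximize the margin between two support hyper-planes
$$H_1: W^T X_i + b = 1 \tag{5}$$
$$H_2: W^T X_i + b = -1 \tag{6}$$
to separate two classes of data. Notice that the margin equals $2d$, where $d$ is the distance between the hyper-plane and any one of the support hyper-planes and is defined as
$$d = \frac{1}{\|W\|} \tag{7}$$
By Equations (4) and (7), the SVM problem can be summarized as a quadratic programming problem:
$$\text{Minimize } \frac{1}{2}\|W\|^2 \tag{8}$$
such that Equation (4) holds. The above quadratic programming problem is also a convex optimization problem, which can be solved using the Lagrange multiplier method. After translating the quadratic programming problem using the Lagrange multipliers $\alpha_i \ge 0$, we have
$$L(W, b, \alpha) = \frac{1}{2}\|W\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (W^T X_i + b) - 1 \right] \tag{9}$$
To find the extreme point that minimizes Equation (9), the partial derivatives of Equation (9) with respect to $W$ and $b$ are taken and set to zero:
$$\frac{\partial L}{\partial W} = W - \sum_{i=1}^{n} \alpha_i y_i X_i = 0 \tag{10}$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0 \tag{11}$$
The above two equations can be rewritten as follows:
$$W = \sum_{i=1}^{n} \alpha_i y_i X_i \tag{12}$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0 \tag{13}$$
Substituting Equations (12) and (13) into Equation (9), we have
$$\text{Maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j X_i^T X_j \tag{14}$$
$$\text{such that } \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0 \tag{15}$$
For a convex problem, the KKT conditions are necessary and sufficient to solve for $W$, $b$ and $\alpha_i$. Therefore, solving the SVM problem is equivalent to solving the KKT conditions. The related KKT conditions include Equations (4), (12), and (13), and the rest are listed below:
$$\alpha_i \ge 0 \tag{16}$$
$$\alpha_i \left[ y_i (W^T X_i + b) - 1 \right] = 0 \tag{17}$$
Notice that $\alpha_i$ can be obtained by solving the quadratic programming problem listed in Equations (14) and (15). Next, Equation (12) is used to obtain $W$. Finally, $b$ can be solved using Equation (17).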
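The last step of this derivation (recovering W from the multipliers and b from a support vector) can be checked numerically on a tiny example. The sketch below assumes a hypothetical two-point training set and multipliers that were worked out by hand for it. The helper names `recover_w_b` and `kkt_holds` are illustrative and do not come from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recover_w_b(X, y, alpha):
    """W = sum_i alpha_i * y_i * X_i; b from a support vector (alpha_i > 0)."""
    d = len(X[0])
    W = tuple(sum(a * t * x[k] for a, t, x in zip(alpha, y, X)) for k in range(d))
    s = next(i for i, a in enumerate(alpha) if a > 0)
    # Complementary slackness gives y_s * (W.X_s + b) = 1, and 1 / y_s = y_s.
    b = y[s] - dot(W, X[s])
    return W, b


def kkt_holds(X, y, alpha, W, b, tol=1e-9):
    """Check the KKT conditions of the hard-margin problem."""
    balance = abs(sum(a * t for a, t in zip(alpha, y))) <= tol
    margins = [t * (dot(W, x) + b) - 1 for t, x in zip(y, X)]
    primal = all(m >= -tol for m in margins)
    dual = all(a >= 0 for a in alpha)
    slack = all(abs(a * m) <= tol for a, m in zip(alpha, margins))
    return balance and primal and dual and slack


# Hypothetical toy data: one point per class, both are support vectors.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]  # assumed multipliers, derived by hand for this toy set

W, b = recover_w_b(X, y, alpha)
margin = 2 / math.sqrt(dot(W, W))  # distance between the support hyper-planes
```

For this toy set the margin equals the distance between the two points, as expected when both points lie on the support hyper-planes.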

OA
An OA is an array of integers (called levels) arranged in rows (denoted experiments) and columns (denoted factors). The ith column denotes the ith feature; a 0 in any combination selects the feature, and a 1 waives it. For example, only feature A is selected in Experiment 2 since A=0 and B=C=1 in Table 1. All columns exhibit the following properties of statistical independence in any OA:
Self-balanced: The number of each level is the same in each column. For example, Table 1 is a 2-level 3-factor OA and level 0 appears the same number of times as level 1, i.e., twice in each column (factor).
Table 1. Two-level, three-factor OA (columns: experiment number; column (factor))
The number of any level is the same in each column. For example, level 1 appears the same number of times, i.e., twice in any column of Table 1. The above two properties are called the orthogonality. Algorithms for constructing OAs with various levels can be found in Rokach, Chizi, and Maimon (2007). The details of OA are as follows. Let $L_n(s^m)$ be an OA for n experiments, m factors and s levels per factor, where L denotes a Latin square. Eighteen standard basic OAs are listed in Table 2. Note that an additional experiment will be tested by the factor weighted analysis (FWA) based on the self-balanced property. The FWA can evaluate the effects of the respective factors (hereafter called 'features' in this research) and determine whether a feature is needed after the result of each experiment (hereafter called the 'accuracy of classification' in this research) is given. Let $w_i$ denote the accuracy of experiment i and $x_{ij} \in \{0,1\}$ denote the level of feature j in experiment i. The effect of feature j is computed from $w_i$ and $x_{ij}$, and feature j is selected in the additional experiment when this effect favors selection. For convenience in interpreting the FWA, some assumed values are added to Table 3. For example, feature A is obtained in experiments 3 and 4, but it is neglected in experiments 1 and 2.
The effect of feature A (feature 1) is computed from these values, and feature A is therefore selected in the additional experiment. Whether features B and C are selected is determined likewise. We set A=0, B=0, and C=1 in the fifth experiment, which means that features A and B are obtained in the fifth experiment. Finally, SVM is used to compute the accuracy of the classification. The best feature subset is selected by ranking the accuracy of each experiment and choosing the highest one. The OA is a special statistical design of experiments that studies the effects of several factors simultaneously, using the least number of experiments to explore the maximum number of factors and to estimate the interaction between factors efficiently, rather than exploring all the possible combinations of assignments. Therefore, OA has the advantage of significantly reducing the number of experiments and simplifying the data analysis.
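The orthogonality properties and the additional experiment can be illustrated with a short sketch. Several things here are assumptions: the row layout of `L4` is the usual textbook form of a two-level, three-factor array, the pairwise check is the standard strength-2 orthogonality condition, and the effect used in `fwa_extra_row` (mean accuracy with the feature selected minus mean accuracy with it waived) is only one plausible reading of the FWA, not necessarily the paper's definition. The accuracies are made up.

```python
from collections import Counter
from itertools import combinations

# Assumed layout of a two-level, three-factor OA (0 = select, 1 = waive).
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def is_self_balanced(oa):
    """Every level appears equally often in every column."""
    levels = {v for row in oa for v in row}
    for col in zip(*oa):
        counts = Counter(col)
        if len(counts) != len(levels) or len(set(counts.values())) != 1:
            return False
    return True


def is_pairwise_balanced(oa):
    """Every pair of levels appears equally often in every pair of columns."""
    levels = {v for row in oa for v in row}
    for a, b in combinations(list(zip(*oa)), 2):
        counts = Counter(zip(a, b))
        if len(counts) != len(levels) ** 2 or len(set(counts.values())) != 1:
            return False
    return True


def fwa_extra_row(oa, accuracies):
    """ASSUMED effect: mean accuracy when selected minus mean when waived."""
    row = []
    for j in range(len(oa[0])):
        kept = [w for r, w in zip(oa, accuracies) if r[j] == 0]
        dropped = [w for r, w in zip(oa, accuracies) if r[j] == 1]
        effect = sum(kept) / len(kept) - sum(dropped) / len(dropped)
        row.append(0 if effect > 0 else 1)  # 0 selects the feature
    return tuple(row)


# Hypothetical accuracies for the four experiments.
extra = fwa_extra_row(L4, [0.90, 0.85, 0.80, 0.60])
```

With these made-up accuracies the additional experiment keeps A and B and waives C, which has the same form as the fifth experiment described in the text.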

The Proposed ROA-SVM
This section discusses the details of how the proposed ROA-SVM combines recursive OA and SVM to conduct feature selection for classification problems. The proposed ROA-SVM is mainly based on the standard OA for two levels, such as $L_4(2^3)$, $L_8(2^7)$, $L_{12}(2^{11})$, $L_{16}(2^{15})$, $L_{32}(2^{31})$, and $L_{64}(2^{63})$. Let $Z_k$ = 0, 3, 7, 11, 15, 31, and 63, where k=0,1,2,…,6. When the number of features is m and $Z_k < m \le Z_{k+1}$, the OA denoted by $L_{Z_{k+1}+1}(2^{Z_{k+1}})$ is used. The procedure is recursive until no better accuracy can be found in each experiment.
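The rule for choosing the array size can be written as a small lookup. This is a sketch of that rule only. The function name `choose_oa` is hypothetical, and it returns the pair (number of experiments, number of columns) of the selected two-level OA.

```python
# Column counts of the standard two-level OAs, preceded by 0 as in the text.
Z = [0, 3, 7, 11, 15, 31, 63]


def choose_oa(m):
    """Return (experiments, columns) of the smallest standard OA with m <= columns."""
    for k in range(len(Z) - 1):
        if Z[k] < m <= Z[k + 1]:
            columns = Z[k + 1]
            return columns + 1, columns
    raise ValueError("number of features must be between 1 and 63")


wine_first_run = choose_oa(13)   # 13 features
wine_second_run = choose_oa(4)   # 4 features kept after the first run
```

When m is smaller than the number of columns, the unused columns are simply left unassigned.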
The proposed ROA-SVM is essentially the same for any number of features and experiments, but we will describe it in detail only for $L_4(2^3)$ shown in Table 1. SVM is used as the evaluation tool and classification method. We use 10-fold cross-validation in SVMs. In 10-fold cross-validation, the input data is randomly partitioned into 10 equal parts and a single part of the 10 parts is retained as the testing data for the model. The other 9 parts are used as training data. The cross-validation process is repeated 10 times, with each of the 10 parts being used exactly once as the testing data.
Finally, the 10 results can be averaged to produce a single accuracy. In experiment 1, the data with all features A, B and C are selected for SVM with 10-fold cross-validation to compute the accuracy of classification. In experiment 2, only feature B is obtained to compute the classification accuracy. Experiments 3 and 4 are evaluated likewise. In this way, we obtain the respective accuracy of each experiment. Note that, as mentioned in Section 3.2, an additional experiment will be conducted with the FWA, in addition to the original experiments. The features in the experiment that has the best accuracy will be selected and the remainder will be discarded in the next run. This procedure is repeated until there is no further improvement in accuracy. Figure 5 illustrates the flow chart of ROA-SVM.
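The recursive procedure can be sketched as follows, under several stated assumptions. The array is built by a parity (XOR) construction, which yields only sizes that are powers of two, so the 12-experiment array is skipped. The FWA effect is assumed to be the difference of mean accuracies. The `evaluate` callable stands in for SVM with 10-fold cross-validation. All function names are hypothetical.

```python
def two_level_oa(m):
    """First m columns of a two-level OA with 2**k rows (parity construction)."""
    n = 2
    while n - 1 < m:
        n *= 2
    return [tuple(bin(r & c).count("1") % 2 for c in range(1, m + 1))
            for r in range(n)]


def fwa_keeps(oa, accs, j):
    """ASSUMED FWA rule: keep feature j if selecting it gives the higher mean."""
    kept = [a for row, a in zip(oa, accs) if row[j] == 0]
    dropped = [a for row, a in zip(oa, accs) if row[j] == 1]
    return sum(kept) / len(kept) > sum(dropped) / len(dropped)


def roa_select(features, evaluate):
    """Repeat OA runs on the best subset until accuracy stops improving."""
    current, best_acc = list(features), None
    while True:
        oa = two_level_oa(len(current))
        # Level 0 selects a feature; an empty subset is scored as 0.0.
        subsets = [[f for f, lv in zip(current, row) if lv == 0] for row in oa]
        accs = [evaluate(s) if s else 0.0 for s in subsets]
        # Additional experiment suggested by the (assumed) FWA rule.
        extra = [f for j, f in enumerate(current) if fwa_keeps(oa, accs, j)]
        if extra:
            subsets.append(extra)
            accs.append(evaluate(extra))
        candidates = [(a, s) for a, s in zip(accs, subsets) if s]
        acc, subset = max(candidates, key=lambda t: t[0])
        if best_acc is not None and acc <= best_acc:
            return current, best_acc  # no further improvement
        current, best_acc = subset, acc


def toy_accuracy(subset):
    """Hypothetical score: A and B help, any other feature hurts."""
    relevant = {"A", "B"}
    chosen = set(subset)
    return len(chosen & relevant) - 0.5 * len(chosen - relevant)


result = roa_select(["A", "B", "C"], toy_accuracy)
```

In this toy run the first pass keeps A and B through the additional experiment, and the second pass finds nothing better, so the recursion stops.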

A Numerical Example: Wine Recognition Dataset
In this section, the wine (recognition) dataset adapted from UCI is used to show the procedure of ROA-SVM.
The wine dataset has 178 data patterns and 13 features. For a complete test and comparison, 10-fold cross-validation is used; therefore, about 90% of the data is always in the training set and about 10% is used as the testing data. Because $12 \le 13 \le 16$, the $L_{16}(2^{15})$ OA is used. The result of the first run in feature selection using the proposed ROA-SVM is shown in Table 4.
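The 10-fold split of 178 patterns can be sketched without any classifier. Since 178 is not divisible by 10, the folds cannot be exactly equal: eight folds hold 18 patterns and two hold 17. The helper names and the `fit_and_score` callback, which stands in for training and testing an SVM on one split, are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(n, fit_and_score, k=10, seed=0):
    """Average the score over k runs, each fold serving once as test data."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test in enumerate(folds):
        train = [j for t, fold in enumerate(folds) if t != i for j in fold]
        scores.append(fit_and_score(train, test))
    return sum(scores) / k


folds = k_fold_indices(178)
fold_sizes = sorted(len(f) for f in folds)
# Placeholder callback: reports the training share instead of an accuracy.
mean_train_share = cross_val_accuracy(178, lambda train, test: len(train) / 178)
```

Averaged over the ten runs, the training share is 90%, which matches the description above.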
Table 5. The second-run results of the wine dataset (columns: experiment; features 1, 6, 10, 12; three null columns; accuracy)
The results of the second run are presented in Table 5. The best accuracy is still 93.82%, as found in the first run. Therefore, features 1, 6, 10 and 12 form the best feature subset in our proposed ROA-SVM, with the best accuracy being 93.82%.

Computational Experiments
To evaluate its quality and performance for data mining, the proposed ROA-SVM is applied to and compared with the original SVM on eight widely referenced real-world datasets (including the wine dataset discussed in Section 5), which are adopted from the UCI Machine Learning Repository (Asuncion & Newman, 2007).
These eight benchmark datasets are the Balance scale weight & distance dataset (Balance), Iris plants dataset (Iris), General description of thyroid disease dataset (Thyroid), Pima Indians diabetes dataset (Diabetes), Breast cancer dataset, Glass identification dataset (Glass), Wine recognition dataset (Wine), and Australian credit approval dataset (Credit). The number of instances, classes, and features of these datasets are shown in Table 6. To demonstrate the performance of the proposed ROA-SVM, two tests are used (Test1 and Test2). In Test1, the computational result provides a comparison between the proposed ROA-SVM and the conventional SVM. The datasets in Test1 for which the proposed ROA-SVM has failed to reduce the number of features are tested further in Test2. In Test2, the exhaustive method is used to evaluate all possible combinations of features, to show that all features are significant and none are removable in those datasets that could not be reduced in Test1.

Test1
Two SVM-based classifiers, ROA-SVM and traditional SVM, are implemented. The accuracy and the number of selected features on the eight UCI datasets based on SVM and ROA-SVM are summarized in Table 8. The results of the above experiments (with the exception of the first two datasets) show that the proposed ROA-SVM is superior to the conventional SVM in terms of both prediction accuracy and number of features.

Test2 Based on the Exhaustive Method
Excluding the balance and iris datasets, the accuracy of the other six datasets is increased with fewer selected features in the classification. To further test whether there are irrelevant or redundant features in the balance and iris datasets, the exhaustive method is used to test them. Since both datasets include only four features, a $15 \times 4$ array is used that has only two possible values, 0 and 1; a 0 in any combination selects the feature, and a 1 waives it, as described in Section 3. All the possible combinations of feature subsets and the accuracies estimated by SVM for the balance dataset and iris dataset are listed in Tables 9 and 10, respectively.
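The exhaustive test enumerates every non-empty feature subset, which gives 2^4 - 1 = 15 rows for four features. A minimal sketch is shown below. The function names and the `evaluate` callable, which stands in for the SVM accuracy estimate, are hypothetical, and the toy score is chosen so that every feature matters.

```python
from itertools import product


def exhaustive_rows(m):
    """All 0/1 rows of length m that select at least one feature (0 = select)."""
    return [row for row in product((0, 1), repeat=m) if 0 in row]


def exhaustive_best(features, evaluate):
    """Score every non-empty subset and return the best one with its score."""
    best_subset, best_acc = None, float("-inf")
    for row in exhaustive_rows(len(features)):
        subset = [f for f, level in zip(features, row) if level == 0]
        acc = evaluate(subset)
        if acc > best_acc:
            best_subset, best_acc = subset, acc
    return best_subset, best_acc


rows = exhaustive_rows(4)
# Hypothetical score in which each of the four features contributes equally.
best = exhaustive_best(["f1", "f2", "f3", "f4"], lambda s: len(s) / 4)
```

If the full feature set comes out on top of such an enumeration, no feature can be removed without losing accuracy, which is the point of Test2.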

Conclusions and Future Research
Classification is an important task in data mining. Feature selection is always an important issue in classification. This work describes a new classifier design method called ROA-SVM to provide a systematic method for the effective deletion of irrelevant or redundant features. According to the results in Table 8, the classification results of the proposed ROA-SVM (fifth column) are at least as good as those of SVM (fourth column) on the eight UCI datasets: Balance, Iris, Thyroid, Diabetes, Breast Cancer, Glass, Wine, and Credit.
The comparisons based on eight common UCI benchmark datasets demonstrate the effectiveness of the proposed ROA-SVM method in deleting the irrelevant or redundant features and reducing the number of experiments, thereby increasing the accuracy of classification and reducing the computation time.
Our experimental results show good performance with the default SVM parameter settings. However, the parameter settings have a strong impact on classification performance, so how to adjust the parameters to achieve better performance is still worth researching.

Figure 5. The flow chart of recursive orthogonal array

Table 2. The standard OA

Table 3. The additional experiment in Table 1

Table 6. Summary of eight adapted UCI datasets

Table 8. The result of feature selection on UCI data