Using J48 Tree Partitioning for Scalable SVM in Spam Detection

Support Vector Machines (SVM) are a powerful, state-of-the-art machine learning algorithm with strong regularization properties. Regularization refers to how well the model generalizes to new data. SVM can therefore be very effective for spam detection. Although experimental results indicate that SVM usually outperforms other algorithms, its efficiency decreases as the number of spam features increases. In this paper, a scalable SVM for spam detection is proposed that uses the J48 tree. In the proposed method, the dataset is first partitioned with the J48 tree, and then feature selection is applied to each partition in parallel. The selected features are subsequently used in the training phase of SVM. The proposed method is evaluated on several benchmark datasets, and the results are compared with other algorithms such as SVM and GA-SVM. The experimental results show that the proposed method scales as the number of features increases and achieves higher accuracy than SVM and GA-SVM.


Introduction
Spammers continuously pioneer new methods to bypass email anti-spam filters, forcing companies to invest in spam filtering that can keep up with the evolving methods. Various methods have been presented so far to fight and detect spam (Jindal & Liu, 2007). For this reason, hybrid multi-layer architectures are needed for spam detection problems. Most hybrid multi-layer filters combine several methods, such as keywords, rule-based filters, black and white lists, and other data mining techniques, to detect the more important spam (Cook et al., 2006; Nakulas et al., 2009). ... dataset, it is useful to select features on each partition and finally run the SVM algorithm, instead of decreasing the number of support vectors in the final result of the SVM algorithm. We evaluate our model against other algorithms, namely SVM and GA-SVM. This paper is organized as follows. Section 2 covers related work. Section 3, the main part of this paper, presents our proposed method. Section 4 reports the test results, and Section 5 concludes and discusses future work.

Related Work
The weakness of the SVM algorithm is its computational complexity, which makes it unsuitable for large datasets. The weight ratio is not constant; it is also necessary to choose a good kernel function and a good value for the parameter C. SVM is well suited to problems with limited training data features (Auria & Moro, 2008). In the studies of Chapelle et al. (1999) and Kim et al. (2002), SVM is considerably more efficient than other non-parametric classifiers, for example K-NN (Note 1) and NN (Note 2): it reaches the best classification accuracy and needs less computational time when its parameters are set suitably. The results of Ye et al. (2008) showed that SVM consumes a large amount of time and memory on large data. One way to reduce and manage the size of the dataset is feature selection. Another way to deal with this problem is partitioning (Kerdprasop & Kerdprasop, 2003). Partitioning the dataset into proper subsets helps incremental mining achieve high accuracy. An improved approach for C4.5 was offered by Polat & Güneş (2009). Experiments were run on three datasets from UCI, namely Lymphography, Image-segmentation, and Dermatology. The experiments found good accuracy in comparison with other algorithms, but the authors did not report performance and time on other types of datasets. Chang, Guo et al. (2010) proposed a decision tree to decompose the dataset and train SVM. The method shows that the decision tree has several advantages for large-scale SVM training. First, the tree itself can classify data points, which reduces the cost of SVM training on the data points. Second, it helps raise the accuracy. Third, the tree decomposition approach can decrease the error rate. For datasets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the approach proposed in (Chang, Guo et al., 2010) can speed up training and give good test accuracy. A number of studies have tried to hybridize SVM and decision trees. Some of them improved the classification accuracy (Bennett & Blue, 1998; Bennett, Cristianini et al., 2000; Tibshirani & Hastie, 2007). Others sped up the SVM (Platt et al., 1999; Sahbi & Geman, 2006; Gong et al., 2007).
In our previous work (Zahra S. Torabi, Mohammad et al., 2015), we considered the evaluation criteria of SVM for spam detection and filtering.
In this article we use the J48 tree to partition the original large dataset into data subsets that are manageable and effective for learning; then we apply feature selection on each partition and use the selected features in the training phase of SVM.
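The partition-then-select flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `partition_by_feature` is a hypothetical stand-in for a J48 split (a single binary test), `select_features` is a hypothetical stand-in for the GA-based selection (it merely keeps features that vary inside a partition), and the toy rows are made up. The SVM training step that would consume each selected feature subset is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_by_feature(rows, feature, threshold):
    """Hypothetical stand-in for a J48 split: one binary test on one feature."""
    left = [r for r in rows if r[0][feature] <= threshold]
    right = [r for r in rows if r[0][feature] > threshold]
    return [p for p in (left, right) if p]

def select_features(partition):
    """Hypothetical stand-in for GA selection: keep features that vary here."""
    n_features = len(partition[0][0])
    return [j for j in range(n_features)
            if len({x[j] for x, _ in partition}) > 1]

def select_per_partition(partitions):
    # Partitions are independent, so selection can be dispatched concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(select_features, partitions))

# Toy data (assumed): each row is (feature vector, label).
rows = [((0.1, 5.0, 1.0), "spam"), ((0.2, 5.0, 0.0), "spam"),
        ((0.9, 3.0, 1.0), "ham"), ((0.8, 4.0, 1.0), "ham")]
partitions = partition_by_feature(rows, feature=0, threshold=0.5)
selected = select_per_partition(partitions)
print(selected)  # one list of kept feature indices per partition
```

Because each partition is handled independently, the executor could be swapped for a process pool when the selection step is CPU-bound.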

Methodology
Partitioning the given dataset into subsets of a suitable size helps incremental mining achieve high accuracy. The last research version of C4.5 is implemented in Java in Weka as J4.8 (J48). J48 produces a decision tree for the input dataset by recursively partitioning the dataset with a depth-first strategy. Let the classes be denoted by {C_1, C_2, …, C_k}, and let T be a set of samples that belong to a mixture of classes. The idea is to refine T into subsets of samples that each tend toward a single class. A test based on a single feature is selected, which has one or more mutually exclusive outcomes {O_1, O_2, …, O_n}. T is divided into subsets T_1, T_2, …, T_n, where T_i contains all the samples in T that have outcome O_i of the selected test. Entropy is a measure of the average uncertainty or ambiguity of a collection of data. It represents the average amount of information we expect to receive from the outcome of a data source (Mazid, Ali et al. 2010). If S is a collection of samples, then freq(C_i, S) denotes the number of samples in S that belong to class C_i, and |S| denotes the number of samples in S. Equation 1 shows the entropy of the set S:

$$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|} \qquad (1)$$

When the collection T has been divided according to the n outcomes of one feature test X, the expected information is given by equation 2:

$$\mathrm{Info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|}\,\mathrm{Info}(T_i) \qquad (2)$$

The information gain created by the split is the difference between the amount of information required to classify a sample before and after making the split. Equation 3 shows the new gain rate:

$$\mathrm{Gain}(X) = F \cdot \left(\mathrm{Info}(T) - \mathrm{Info}_X(T)\right) \qquad (3)$$

J48 adds a multiplier F in front of the gain calculation. To handle missing values, F equals the number of samples in the dataset with a known value for the given feature divided by the total number of samples. The gain alone is not sufficient to build a tree, because the gain measure favors splits with many outcomes. Equation 4 defines the gain ratio to solve this problem:

$$\mathrm{GainRatio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{SplitInfo}(X)} \qquad (4)$$

The gain ratio divides the gain by the evaluated split information, given in equation 5, which penalizes splits with many outcomes:

$$\mathrm{SplitInfo}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} \qquad (5)$$
Split information is the weighted average of the information, using the proportion of instances that are sent to each child. The pseudocode in Figure 1 shows the steps of partitioning, where S represents the set of training instances. To get the optimal split while the tree is growing (see the pseudocode in Figure 1), the gain ratio must be calculated. We find the best split with the consistency measure. Consistency measures are characteristically different from other evaluation measures because of their heavy reliance on the training dataset and their use of the Min-Features bias in choosing a subset of features. The consistency measure is employed in (Almuallim and Dietterich 1994). When the partitioning of the dataset is finished, we apply the GA algorithm to each partition simultaneously for feature selection, and then we use the selected features in the training phase of SVM.
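To make the role of the split information concrete, the following minimal Python sketch computes entropy, gain, and gain ratio for a discrete feature. It assumes the standard C4.5 definitions and no missing values (so the multiplier F is 1); the function names and the toy data are illustrative and not taken from the authors' implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of a discrete feature (assumes no missing values, F = 1)."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Expected information after the split, weighted by subset size.
    info_x = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - info_x
    # Split information grows with the number of outcomes of the test.
    split_info = -sum(len(s) / total * log2(len(s) / total)
                      for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy data (assumed): both features yield pure subsets, so both have gain 1.
labels = ["spam", "spam", "ham", "ham"]
has_link = ["yes", "yes", "no", "no"]  # two outcomes
msg_id = ["a", "b", "c", "d"]          # one outcome per sample

print(gain_ratio(has_link, labels), gain_ratio(msg_id, labels))
```

Both toy features separate the classes perfectly, so plain gain cannot tell them apart; the gain ratio ranks the two-outcome feature above the feature with a distinct value per sample.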

Results and Discussion
To validate our method, we conducted a thorough experimental evaluation on email datasets. The results are reported on four datasets: Spambase (Arthur Asuncion 2007) and SpamCorpus, SpamData, and mail_corpus (Katakis 2008). We compare our method with SVM and GA-SVM. The algorithms are implemented in Java in NetBeans, and we added some DLLs and JAR files from WEKA to our program. All experiments were run on a machine with a Core i5 CPU and 4 GB of RAM. We used the accuracy rate and the error rate with 10-fold cross validation to evaluate the method. Accuracy and error rate are calculated with equations 6 and 7:

$$\mathrm{Accuracy} = \frac{n_{L \to L} + n_{S \to S}}{n_{L \to L} + n_{L \to S} + n_{S \to L} + n_{S \to S}} \qquad (6)$$

$$\mathrm{Error\ rate} = \frac{n_{L \to S} + n_{S \to L}}{n_{L \to L} + n_{L \to S} + n_{S \to L} + n_{S \to S}} \qquad (7)$$

In these formulas, $n_{L \to S}$ and $n_{S \to L}$ denote the numbers of legitimate emails and spam emails that have not been classified correctly, and $n_{L \to L}$ and $n_{S \to S}$ represent the numbers of legitimate emails and spam emails that are correctly classified. The sign "-" in the tables indicates that the algorithm could not run at that point. As can be seen in Table 2, the execution time increases for both SVM and GA-SVM as the size of the dataset increases. The performance of GA-SVM in terms of execution time is better than that of SVM, but when the size of the dataset increases, both algorithms fail to run because they run out of memory. In contrast, as the size of the dataset and the number of features increase, our proposed method has a lower execution time than the others and does not run out of memory, because it partitions the datasets with J48 and eliminates redundant and irrelevant features; it is also capable of handling some noise. As can be seen in Table 3, as the number of attributes increases, the error rate of the proposed approach decreases, which represents an increase in the precision of this method. Also, the error rate of SVM is lower than that of GA-SVM. The error rate of the proposed method is lower than that of the other methods because noise and missing values are ignored by the J48 tree in the partitioning phase.
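The two evaluation measures follow directly from the four classification counts. The short Python sketch below shows the arithmetic; the parameter names mirror the subscripts used in the text, and the counts are made-up illustrative values, not results from the experiments.

```python
def accuracy(n_ll, n_ls, n_sl, n_ss):
    """Share of emails classified correctly (legitimate and spam)."""
    return (n_ll + n_ss) / (n_ll + n_ls + n_sl + n_ss)

def error_rate(n_ll, n_ls, n_sl, n_ss):
    """Share of emails classified incorrectly (legitimate and spam)."""
    return (n_ls + n_sl) / (n_ll + n_ls + n_sl + n_ss)

# Illustrative counts (assumed): 50 legitimate kept, 5 legitimate flagged as
# spam, 3 spam let through, 42 spam caught.
counts = dict(n_ll=50, n_ls=5, n_sl=3, n_ss=42)
acc = accuracy(**counts)
err = error_rate(**counts)
print(acc, err)
```

Since every email falls into exactly one of the four counts, the two measures always sum to 1, so reporting one of them determines the other.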
Table 4 shows the accuracy of the proposed method, SVM, and GA-SVM on the datasets. As the results show, as the number of features increases, the accuracy of GA-SVM is better than that of SVM due to feature selection with GA, and the accuracy of our proposed method is higher than that of both algorithms.

Conclusion and Future Work
Today, spam has become a problem for Internet users, IT companies, and organizations. In some studies SVM outperforms other classifiers, but its performance decreases on high-dimensional data because of computational complexity (Tsang, Kwok et al. 2005; Feng and Zhou 2013): because of the complex mathematical calculations in the kernel function, SVM runs short of memory and run time. SVM is therefore suitable, and performs best among the classifiers, when the number of features of the dataset is small. In this paper, a scalable SVM for spam detection is proposed that uses the J48 tree. In the proposed method, the dataset is first partitioned with the J48 tree, and then feature selection is applied to each partition in parallel. The selected features are subsequently used in the training phase of SVM. We show that as the number of features increases, the accuracy of the proposed method increases, whereas other algorithms such as SVM and GA-SVM run out of memory when the dataset becomes large, so their time and accuracy could not be calculated. Therefore, the performance of the proposed algorithm is better than that of SVM and GA-SVM. For future work, other trees can be used instead of the J48 algorithm, and other effective feature selection methods can be used.
An evaluation function measures the quality of a subset produced by the generation function and compares it with the previous subset (Zhao and Zhang 2008). A certain evaluation function is required to detect the best subset. Let U be the inconsistency rate for an input dataset, and let a pattern be the part of a sample without its class label. S is a feature subset with n_{f_1}, n_{f_2}, …, n_{f_|S|} values for the features f_1, f_2, …, f_|S| of the patterns. The consistency measure is related to the concept of the inconsistency rate (Lin, Lee et al. 2008). For each discrete feature, one test is considered with as many outcomes as the number of distinct values of the feature. For each continuous feature, on the other hand, binary tests are required for every distinct value of the feature (Zhao and Zhang 2008).

Figure 1 shows the proposed partition algorithm:
• For each feature a, detect the normalized information gain ratio from splitting on a
• Let a_best be the feature with the maximum normalized information gain
• Create a decision node that splits on a_best
• Select the best feature based on the consistency measure
• Make the sublists obtained by splitting on a_best, and add those nodes as children of the node -> Ssub
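The inconsistency rate behind the consistency measure can be illustrated with a small Python sketch. The text does not spell out the formula, so this assumes the common formulation: for each pattern, count the samples that share it minus the samples of its majority class, sum these counts over all patterns, and divide by the number of samples. The function name and the toy data are illustrative.

```python
from collections import Counter, defaultdict

def inconsistency_rate(samples, labels, subset):
    """Inconsistency rate of a feature subset (indices into each sample)."""
    by_pattern = defaultdict(Counter)
    for x, y in zip(samples, labels):
        # A pattern is the sample restricted to the subset, without its label.
        pattern = tuple(x[j] for j in subset)
        by_pattern[pattern][y] += 1
    # Samples outside the majority class of their pattern are inconsistent.
    inconsistent = sum(sum(c.values()) - max(c.values())
                       for c in by_pattern.values())
    return inconsistent / len(samples)

# Toy data (assumed): two binary features, four samples.
samples = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["spam", "ham", "ham", "ham"]

print(inconsistency_rate(samples, labels, subset=(0,)))    # first feature only
print(inconsistency_rate(samples, labels, subset=(0, 1)))  # both features
```

Under the Min-Features bias, the preferred subset is the smallest one whose inconsistency rate stays within the accepted threshold.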

Table 1 shows the datasets used in the experiments:

Table 2 shows the comparison of execution time between the proposed method and the other algorithms:

Table 2. Comparison of the execution time of the proposed method with SVM and GA-SVM

Table 3. Comparison of the error rate between the proposed method, SVM and GA-SVM

Table 4. Comparison of the accuracy of GA-SVM, SVM and the proposed method