Applications of Support Vector Machine Based on Boolean Kernel to Spam Filtering

Spam has spread so widely that it harms the everyday use of E-mail. Among the main spam filtering technologies, the support vector machine (SVM) is widely applied because it is efficient and achieves high classification accuracy. The main problem in applying an SVM algorithm is how to choose the kernel function. To address this problem, this paper proposes a spam filtering algorithm that uses an SVM based on a Boolean kernel. The algorithm uses attribute-based filtering, with attributes such as IP address, subject words, keywords in the content and attachment information. These attributes compose the feature vectors, which are classified by SVM-MDNF, an SVM based on the Boolean kernel. The experimental results show that the algorithm achieves high classification accuracy, recall ratio and precision ratio. The algorithm has both theoretical and practical value.


Introduction
E-mail is one of the main means by which people exchange information on the Internet. As the Internet has become so widely used, sending and receiving E-mail has become part of daily life for a considerable number of people. However, along with its convenience, the Internet has also brought the wide spread of spam, which causes people a great deal of trouble. People's work efficiency and mood clearly suffer if they must spend time and effort identifying E-mail every day. Automatically distinguishing spam therefore has significant practical value (Shawe-Taylor J, Cristianini N. 2005). Spam refers to promotional E-mails, such as advertisements and electronic publications, that the receivers have not requested or agreed to accept in advance.
Spam filtering technologies can be classified in two ways. According to where the filter is executed, there are two kinds: server-side spam filtering and client-side spam filtering. According to the filtering method, there are three kinds: spam filtering based on blacklists/whitelists, spam filtering based on rules, and spam filtering based on content.

1) Spam Filtering Based on Blacklist/Whitelist
Any E-mail sent by a sender on the whitelist is considered legitimate, while any E-mail sent by a sender on the blacklist is treated as spam. This method is widely used in spam filtering today. It usually maintains a blacklist and a whitelist, whose entries can be E-mail addresses, DNS names of E-mail servers, or IP addresses. The lists help receivers check senders in real time.

2) Spam Filtering Based on Rules
This method requires people to define a set of rules; an E-mail that matches one or more of the rules is treated as spam. The rules typically cover header analysis, detection of mass mailing, exact keyword matching and other features of the E-mail.

3) Spam Filtering Based on Content
In practice, spam senders change continuously, so blacklists/whitelists have great limitations. Rule-based spam filtering also has disadvantages: the rules are written by people, and users who lack experience reduce their validity and accuracy. Therefore, many experts have proposed analyzing the content of an E-mail first and then deciding whether it is spam. This method combines spam filtering with other technologies, such as text classification and information filtering, and requires their algorithms to be introduced into spam filtering.
To solve the spam problem, a great number of measures have been adopted, such as extensions of E-mail protocols, authentication of E-mail servers, spam filtering and legislation. Among these measures, spam filtering is the most practical. Many text classification algorithms have now been applied to content-based spam filtering, including Bayes, Decision Tree, k-Nearest Neighbor and Support Vector Machines (Wang bin, Pan wenfeng. 2005). Among them, SVM has been particularly successful in spam filtering.
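The list-based and rule-based (principle-based) filters described above can be illustrated with a minimal Python sketch. The function names, the keyword list and the `max_recipients` limit are hypothetical choices made for illustration only; real filters use far richer lists and rules.

```python
def classify_by_lists(sender, whitelist, blacklist):
    """Return 'legal', 'spam' or 'unknown' for a sender address.

    Each list may contain full addresses or bare domain names
    (an assumption made for this sketch).
    """
    sender = sender.lower()
    domain = sender.rsplit("@", 1)[-1]
    if sender in whitelist or domain in whitelist:
        return "legal"
    if sender in blacklist or domain in blacklist:
        return "spam"
    return "unknown"


def classify_by_rules(subject, recipients, keywords, max_recipients=50):
    """Flag a message as spam when it meets at least one hand-written rule."""
    subject_lower = subject.lower()
    has_keyword = any(word in subject_lower for word in keywords)  # keyword match
    is_mass_mail = len(recipients) > max_recipients                # mass mailing
    return has_keyword or is_mass_mail
```

The sketch also shows the limitation discussed in the text: a sender that appears on neither list is left as "unknown", and the quality of the rule-based result depends entirely on the keywords and limits a person chooses.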

Evaluation Standards for Spam Filtering Systems
The performance of spam filtering is usually evaluated with indexes borrowed from text classification. Whether a text classification method is mature is judged by its mapping accuracy and mapping speed. The mapping speed is determined by the complexity of the mapping algorithm, and the mapping accuracy is evaluated with measures from information retrieval. The following are the definitions of two common information retrieval indexes, Recall Ratio and Precision Ratio, as applied to spam filtering (C.J. van Rijsbergen. 1979).
Def 1: Recall Ratio is the ratio of the amount of spam that has been filtered to the amount of E-mails that should be filtered. The computing formula of Recall Ratio is:

$$\text{Recall} = \frac{\text{amount of spam that has been filtered}}{\text{amount of E-mails that should be filtered}} \quad (1)$$

Def 2: Precision Ratio is the ratio of the amount of spam that has been filtered to the amount of E-mails that have been filtered. The computing formula of Precision Ratio is:

$$\text{Precision} = \frac{\text{amount of spam that has been filtered}}{\text{amount of E-mails that have been filtered}} \quad (2)$$

SVM is developed from the optimal separating plane for linear classification. Its basic idea is margin maximization. "Optimal" means that the separating plane is required not only to separate the two classes of text correctly, but also to maximize the margin between them.
In fact, maximizing the margin controls the generalization ability. A linear support vector machine separates the "yes" and "no" examples by constructing the optimal hyperplane $\langle W, X \rangle + b = 0$ in the input space. Here $\langle \cdot, \cdot \rangle$ denotes the inner product. It can be proved that the optimal separating plane is the one that minimizes

$$\frac{1}{2}\|W\|^2$$

in the input space, subject to the constraints

$$y_i(\langle W, X_i \rangle + b) - 1 \ge 0, \quad i = 1, \dots, n \quad (5)$$

where $y_i \in \{+1, -1\}$ is the class label of example $X_i$ and $n$ is the number of training examples. To solve this problem we transform it into its dual form by Lagrange optimization. Under the constraints

$$\sum_{i=1}^{n} y_i \alpha_i = 0, \quad \alpha_i \ge 0, \quad i = 1, \dots, n$$

we solve for the maximum of

$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle X_i, X_j \rangle$$

where $\alpha_i$ is the Lagrange multiplier corresponding to constraint (5) in the primal problem. This is a quadratic optimization problem under inequality constraints, and it has a unique solution. It is easy to prove that only some (usually a small part) of the $\alpha_i$ are nonzero, and the corresponding examples are the support vectors. Solving the above problem gives the optimal separating function:

$$f(X) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i^{*} y_i \langle X_i, X \rangle + b^{*}\right)$$

In this function, the summation effectively runs only over the support vectors. $b^{*}$ is the separating threshold. It can be computed from any support vector (which satisfies (5) with equality) or as the median of any pair of support vectors from the two classes.
Here, sgn() is the sign function.
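As a concrete illustration of the separating function $f(X) = \mathrm{sgn}(\sum_i \alpha_i y_i \langle X_i, X \rangle + b)$, the following minimal Python sketch evaluates it on a toy problem with two support vectors. The function names and the toy data are illustrative assumptions, not part of the experiment, and the multipliers were worked out by hand for this toy problem rather than by a quadratic programming solver.

```python
def inner(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sgn(value):
    """Sign function; exactly 0 is mapped to +1 (a convention assumed here)."""
    return 1 if value >= 0 else -1


def threshold_from_support_vector(sv_index, support_vectors, labels, alphas):
    """Recover b from one support vector, using y_s * (<W, X_s> + b) = 1."""
    x_s, y_s = support_vectors[sv_index], labels[sv_index]
    w_dot_x = sum(a * y * inner(x_i, x_s)
                  for x_i, y, a in zip(support_vectors, labels, alphas))
    return y_s - w_dot_x  # valid because y_s is +1 or -1


def decide(x, support_vectors, labels, alphas, b):
    """Evaluate sgn(sum_i alpha_i * y_i * <X_i, x> + b)."""
    total = sum(a * y * inner(x_i, x)
                for x_i, y, a in zip(support_vectors, labels, alphas))
    return sgn(total + b)


# Toy problem: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]  # satisfies sum_i y_i * alpha_i = 0
b = threshold_from_support_vector(0, support_vectors, labels, alphas)
```

Only the support vectors enter the sum, which is why the classifier stays cheap to evaluate even when the training set is large.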
With a non-linear mapping $\varphi$, vectors in the input space can be transformed into vectors in a higher-dimensional space, which is called the feature space. A non-linear SVM makes use of this mapping, so $X_i$ and $X$ in the above equation are replaced by $\varphi(X_i)$ and $\varphi(X)$ respectively. We then get:

$$f(X) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i^{*} y_i K(X_i, X) + b^{*}\right)$$

In this function, $K(X_i, X) = \langle \varphi(X_i), \varphi(X) \rangle$. A function of the form $K(x, y) = \langle \varphi(x), \varphi(y) \rangle$ is called a kernel function. Some common kernel functions include:
1) Gaussian radial basis function: $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
2) Polynomial: $K(x, y) = (\langle x, y \rangle + 1)^d$
3) Hyperbolic tangent: $K(x, y) = \tanh(\kappa \langle x, y \rangle + c)$
Choosing different kernel functions gives different non-linear support vector machines.
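The three common kernels can be written down in a few lines of Python. This is a minimal sketch in their usual textbook forms; the parameter names `sigma`, `degree`, `kappa` and `c` and their default values are assumptions chosen for illustration. The helper `phi_poly2` is a hypothetical explicit feature map for two-dimensional inputs, included only to show that a kernel value equals an inner product in the feature space.

```python
import math


def inner(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian radial basis function: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def polynomial_kernel(x, y, degree=2):
    """Polynomial kernel: (<x, y> + 1)^degree."""
    return (inner(x, y) + 1) ** degree


def tanh_kernel(x, y, kappa=1.0, c=0.0):
    """Hyperbolic tangent kernel: tanh(kappa * <x, y> + c)."""
    return math.tanh(kappa * inner(x, y) + c)


def phi_poly2(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D input."""
    x1, x2 = x
    r = math.sqrt(2)
    return (1.0, r * x1, r * x2, x1 ** 2, x2 ** 2, r * x1 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = polynomial_kernel(x, y, degree=2)
via_mapping = inner(phi_poly2(x), phi_poly2(y))
```

Here `via_kernel` and `via_mapping` agree up to floating-point rounding, which is the point of a kernel function: the inner product in the six-dimensional feature space is obtained without ever constructing the mapped vectors.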
If x and y in the kernel functions above are Boolean vectors, that is, $x \in \{0,1\}^n$ and $y \in \{0,1\}^n$, a kernel can be defined directly on them. We call $K_{MDNF}$ the Monotone Disjunctive Normal Form (MDNF) kernel function. The MDNF kernel is the kernel function used by the SVM algorithm in this paper.
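A minimal sketch of the MDNF kernel follows, assuming the form commonly given in the Boolean-kernel literature, $K_{MDNF}(x, y) = 2^{\langle x, y \rangle} - 1$; this specific form, and the names `mdnf_kernel`, `count_shared_conjunctions` and `gram_matrix`, are assumptions of the sketch. Under this form the kernel counts the non-empty conjunctions of positive attributes that are true for both E-mails, which the brute-force helper checks directly.

```python
from itertools import combinations


def mdnf_kernel(x, y):
    """Assumed MDNF kernel: 2**<x, y> - 1 for Boolean vectors x and y."""
    shared = sum(a & b for a, b in zip(x, y))  # attributes set to 1 in both
    return 2 ** shared - 1


def count_shared_conjunctions(x, y):
    """Brute-force count of non-empty monotone conjunctions true for x and y."""
    n = len(x)
    count = 0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if all(x[i] == 1 and y[i] == 1 for i in subset):
                count += 1
    return count


def gram_matrix(vectors, kernel):
    """Kernel matrix K[i][j] = kernel(vectors[i], vectors[j])."""
    return [[kernel(u, v) for v in vectors] for u in vectors]


# Three toy Boolean feature vectors (hypothetical E-mails).
emails = [(1, 1, 0, 1), (1, 0, 0, 1), (0, 0, 1, 0)]
K = gram_matrix(emails, mdnf_kernel)
```

A matrix such as `K` is what an SVM solver that accepts a precomputed kernel would consume; the brute-force count is exponential in the number of attributes and is only practical for checking small examples.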

The Strategy of SVM Spam Filtering Based On Boolean Kernel
This experiment uses the Enron-spam E-mail dataset. The dataset has two parts: "preprocessed", the set of E-mails that have already been preprocessed, and "raw", the original E-mails that are processed as needed to obtain "preprocessed". Our experiment draws part of "preprocessed" as the training set and part as the test set. We select 2000 E-mails, of which 1100 are spam and 900 are normal E-mails.
The specific procedure of SVM spam filtering based on the Boolean kernel is as follows:
1) First, we apply standard preprocessing to the dataset: remove noise words (such as misspellings) and filter words according to their text frequency (between 2 and 8000). We set different weights for the subject and the text content of every E-mail, with the subject given a higher weight so as to emphasize the words appearing in it. Taking the subject, text content and other features of the E-mail into consideration, we obtain the feature vector of every E-mail.
2) Binarize the features in the feature vector, that is, give every feature the value "0" or "1". Since we use the MDNF Boolean kernel, the feature vector must be transformed into a Boolean feature vector.
3) Filter spam with the SVM based on the MDNF Boolean kernel. To verify the validity of the algorithm, we use k-fold cross-validation. The E-mails are divided into k parts; k-1 parts are used for training and the remaining part for testing. The procedure loops k times, so that every part is tested once. Finally, the average of the test results is used as the result for evaluation. Here k = 10.
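Steps 2) and 3) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: `binarize`, `k_fold_indices` and `cross_validate` are hypothetical helper names, the threshold of 0 used for binarization is an assumption (the text only says each feature becomes 0 or 1), and the folds are taken in order without shuffling, which the text does not specify. `train_fn` and `score_fn` stand in for the SVM training and evaluation routines.

```python
def binarize(vector, threshold=0.0):
    """Map every feature to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in vector]


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, train_fn, score_fn, k=10):
    """Train on k-1 parts, test on the remaining part, and average k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [samples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)


# With 2000 E-mails and k = 10, each round trains on 1800 and tests on 200.
folds = list(k_fold_indices(2000, k=10))
```

In practice the E-mails would normally be shuffled (or stratified by class) before splitting, so that each fold contains both spam and normal E-mails.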

Experiment Result and Analysis
In this experiment, we compare the classification accuracy of the spam filtering algorithm based on the Boolean kernel SVM with that of several other algorithms: Naïve Bayes, linear SVM and non-linear SVM based on radial basis functions. The result is shown in Table 1. The comparison shows that the SVM based on the MDNF Boolean kernel has the highest classification accuracy, the non-linear SVM based on radial basis functions comes second, and Naïve Bayes is the lowest.
Comparing classification accuracy alone cannot fully evaluate the efficiency of an E-mail classification algorithm. We therefore evaluate the algorithms further using the precision ratio, recall ratio and $F_1$ given in Section 2.
Table 2 compares the recall ratio, precision ratio and $F_1$. These indexes allow a more comprehensive evaluation of the validity of the algorithm. The experimental results show that the SVM based on the MDNF Boolean kernel has the best spam filtering effect compared with the other three algorithms.
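The three indexes can be computed from the true and predicted labels of a test fold as in the following minimal Python sketch. The function name `spam_metrics` and the label encoding (1 for spam, 0 for normal) are assumptions, and $F_1$ is taken here in its usual form as the harmonic mean of precision and recall, $F_1 = 2PR/(P + R)$. The labels in the example are made up for illustration and are not experimental results.

```python
def spam_metrics(true_labels, predicted_labels, spam=1):
    """Return (recall, precision, f1) for the spam class."""
    pairs = list(zip(true_labels, predicted_labels))
    filtered_spam = sum(1 for t, p in pairs if t == spam and p == spam)
    should_be_filtered = sum(1 for t, _ in pairs if t == spam)
    have_been_filtered = sum(1 for _, p in pairs if p == spam)

    # Guard against empty denominators (no spam present or nothing filtered).
    recall = filtered_spam / should_be_filtered if should_be_filtered else 0.0
    precision = filtered_spam / have_been_filtered if have_been_filtered else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Made-up example: 4 spam and 2 normal E-mails.
true_labels = [1, 1, 1, 1, 0, 0]
predicted_labels = [1, 1, 1, 0, 1, 0]
recall, precision, f1 = spam_metrics(true_labels, predicted_labels)
```

Reporting all three matters because a filter can reach high recall simply by filtering aggressively, at the cost of precision, that is, of normal E-mails wrongly treated as spam.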

Conclusion
After analyzing the characteristics of spam, we propose a spam filtering algorithm that uses an SVM based on the MDNF Boolean kernel, with feature vectors built from the E-mail subject, text content and other features. The experiment shows that this algorithm has higher classification accuracy and a better spam filtering effect in terms of recall ratio and precision ratio than Naïve Bayes, linear SVM and SVM based on radial basis functions. In future experiments, we will apply SVMs with more Boolean kernels to spam filtering and look forward to better results.

Table 2. Comparison of recall, precision and $F_1$