Combination of Naïve Bayes Classifier and K-Nearest Neighbor (cNK) in the Classification Based Predictive Models

In this study, we present a new classifier that combines the distance-based algorithm K-Nearest Neighbor and the statistics-based Naïve Bayes Classifier. It is equipped with the strengths of both while avoiding their weaknesses. The accuracy of the proposed algorithm is evaluated on some standard datasets from the machine-learning repository of the University of California and compared with some state-of-the-art algorithms. The experiments show that in most cases the proposed algorithm outperforms the others to some extent. Finally, we apply the algorithm to predict the profitability positions of some financial institutions of Bangladesh using data provided by the central bank.


Introduction
Classification is one of the most important multivariate techniques used in statistics. It is closely related to prediction, and the classification problem is sometimes called the prediction problem, particularly in data mining. In statistics, classification is a procedure in which individual objects are placed into groups based on quantitative information on one or more characteristics inherent in the objects (referred to as traits, variables, characters, etc.) and based on a training set of previously labeled objects. The problem can be stated as follows: given training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, produce a classifier that maps a new object $x$ to its class label $y$.
There are several approaches that deal with the classification problem. The statistics-based Naïve Bayes Classifier and the distance-based K-Nearest Neighbor algorithm are frequently used in prediction problems. In the case of the Naïve Bayes algorithm, one of the main issues is how to deal with numerical attributes, because the algorithm must determine the conditional probability for each possible value of every attribute. To resolve this issue, numerical attributes have to be discretized into several classes by adopting one of the many discretization techniques available. The technique used for discretization therefore plays an important role in the accuracy of the method. Several attempts have been made to increase the accuracy of the Naïve Bayes algorithm by adopting a new discretization scheme (Yang & Webb, 2002).
In the case of the K-Nearest Neighbor algorithm, the situation is the opposite: the issue concerns categorical attributes. As the algorithm selects a segment of the training data based on distance, a distance measurement scheme for categorical data must be obtained. Generally, this is done with different similarity measurement techniques. Several investigations have also been carried out to find a proper distance measurement scheme.
In this paper, we propose a new algorithm that combines the two classifiers mentioned above in a way that resolves both issues. That is, in the new algorithm we do not have to discretize the continuous variables and, at the same time, do not have to measure distances between categorical attributes. The combination is done in a logical way, which is expected to improve performance over the individual algorithms and to be consistent in nature. The implementation of the proposed algorithm on some standard datasets reflects this.
We also apply the new algorithm to determine the profitability position of some financial institutions of Bangladesh. Typically, this profitability position is evaluated by the central bank. Our proposed model predicts the profitability position with an accuracy of about 90%.
In the next section we review Naïve Bayes and K-Nearest Neighbor. In section 3 we present the proposed algorithm, and in section 4 we discuss the experimentation in detail. In section 5 we include the analysis of a dataset taken from Bangladesh. Finally, concluding remarks and some directions for future research are given in section 6.

Naïve Bayes Classifier and K-Nearest Neighbor
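Abstractly, the probability model for a classifier is a conditional model

$$p(C \mid x_1, \ldots, x_n)$$

over a dependent class variable $C$ with a small number of outcomes or classes, conditional on several variables $x_1$ through $x_n$. Using Bayes' theorem, we may write:

$$p(C \mid x_1, \ldots, x_n) = \frac{p(C)\, p(x_1, \ldots, x_n \mid C)}{p(x_1, \ldots, x_n)}$$

The numerator is equivalent to the joint probability model $p(C, x_1, \ldots, x_n)$, which can be rewritten as follows, using repeated applications of the definition of conditional probability:

$$p(C, x_1, \ldots, x_n) = p(C)\, p(x_1 \mid C)\, p(x_2 \mid C, x_1)\, p(x_3 \mid C, x_1, x_2) \cdots p(x_n \mid C, x_1, \ldots, x_{n-1})$$

Now the naive conditional independence assumptions come into play: assume that each attribute $x_i$ is conditionally independent of every other attribute $x_j$ for $i \neq j$. This means that

$$p(x_i \mid C, x_j) = p(x_i \mid C)$$

and so the joint model can be expressed as

$$p(C, x_1, \ldots, x_n) = p(C) \prod_{i=1}^{n} p(x_i \mid C)$$

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ can be expressed as

$$p(C \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(x_i \mid C)$$

Here, $Z$ is a scaling factor dependent only on $x_1, x_2, \ldots, x_n$, i.e., a constant if the values of the feature variables are known.
The corresponding classifier is the function classify defined as follows:

$$\mathrm{classify}(x_1, \ldots, x_n) = \arg\max_{c}\; p(C = c) \prod_{i=1}^{n} p(x_i \mid C = c)$$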
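The classify function described above can be illustrated with a minimal Python sketch for categorical attributes. It is not the authors' code (the paper reports using R). The names `train_nb` and `classify`, the toy data, and the Laplace smoothing parameter `alpha` are assumptions introduced here; smoothing is added only so that an unseen attribute value does not produce a zero probability.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Count class and attribute-value frequencies from categorical rows.

    alpha is an assumed Laplace smoothing constant (must be > 0 here).
    """
    class_counts = Counter(labels)
    # value_counts[(i, c)][v] = number of class-c rows whose attribute i equals v
    value_counts = defaultdict(Counter)
    domains = [set() for _ in rows[0]]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains, alpha

def classify(model, x):
    """Return argmax_c p(C=c) * prod_i p(x_i | C=c), computed in log space."""
    class_counts, value_counts, domains, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for i, v in enumerate(x):
            num = value_counts[(i, c)][v] + alpha
            den = n_c + alpha * len(domains[i])
            score += math.log(num / den)  # smoothed log likelihood
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data (hypothetical): two categorical attributes, two classes.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(classify(model, ("sunny", "hot")))
```

The scaling factor Z is omitted because it is the same for every class and does not change the arg max.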

The Proposed Algorithm
Combining classifiers to improve accuracy is common practice nowadays. Being simple yet powerful algorithms, both Naïve Bayes and KNN are ideal candidates for combination to achieve higher accuracy. Hsiao et al. (2008) proposed a combination for predicting the subcellular location of eukaryotic proteins. Jiang et al. (2009) proposed another way of combining NB and KNN and showed its performance on some UCI datasets. More recently, the performance of combinations of these two methods has been investigated in the field of image classification (Timofte et al., 2012; McCann & Lowe, 2012). Previously, Xie et al. (2002) proposed an improved algorithm called Selective Neighborhood Naïve Bayes (SNNB). Lazkano and Sierra (2003) proposed a different combination that joins the nearest neighbor with a Bayesian network. Etzold (2003) introduced the combination of these two classifiers for spam filtering. Jiang et al. (2006) proposed a new algorithm called Dynamic K-Nearest Neighbor Naive Bayes, in which the attributes are weighted based on the mutual information between each attribute and the class attribute.
The idea of the proposed algorithm is very simple. To classify a new object, we first use the KNN algorithm to find the K nearest neighbors in the training dataset. While applying KNN, we do not include the categorical attributes; the distance is measured only on the numerical attributes. After selecting the K nearest objects, we build a model using the Naïve Bayes algorithm, this time using only the categorical attributes. From this model, we classify the new object.
This is therefore a two-step process. In the first step, only the numerical attributes are used to select the data closest to the new object. This makes sense, as numerically close objects are expected to share the same characteristics. In the second step, instead of taking a simple vote among the K nearest objects as in KNN, we look more deeply into the characteristics of the categorical data and their relation to the class, using Naïve Bayes for this purpose. In this way, both the numerical and the categorical attributes are used to classify a new object without any alteration of the data. Hence, the proposed method keeps the data intact: no discretization or complex similarity measurement is required. The proposed method is thus a 'combination of Naïve Bayes and K-Nearest Neighbor', which we abbreviate as cNK.
Formally, we can describe the cNK algorithm (see also Figure 1) as follows:
Step 1. Obtain the K nearest neighbors of a new observation based on the numerical attributes.
Step 2. Use the set of K observations found in Step 1 as training data to build a model with the Naïve Bayes algorithm, based only on the categorical attributes.
Step 3. Use the model built in Step 2 to classify the new observation.
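The three steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' R implementation. The function name `cnk_classify`, the toy data, the range-based normalisation of the Euclidean distance, and the Laplace smoothing constant `alpha` are all assumptions made here.

```python
import math
from collections import Counter

def cnk_classify(train_num, train_cat, train_y, new_num, new_cat, k, alpha=1.0):
    """Classify one observation with the two-step cNK idea (sketch)."""
    # Step 1: K nearest neighbours, using the numerical attributes only.
    # Assumption: each attribute is scaled by its range in the training data.
    n_num = len(new_num)
    span = []
    for j in range(n_num):
        col = [row[j] for row in train_num]
        span.append((max(col) - min(col)) or 1.0)  # avoid division by zero

    def dist(a, b):
        return math.sqrt(sum(((a[j] - b[j]) / span[j]) ** 2 for j in range(n_num)))

    order = sorted(range(len(train_num)), key=lambda i: dist(train_num[i], new_num))
    neigh = order[:k]

    # Step 2: Naive Bayes on the K neighbours, categorical attributes only.
    class_counts = Counter(train_y[i] for i in neigh)

    # Step 3: pick the class with the highest (log) posterior score.
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / len(neigh))
        for j, v in enumerate(new_cat):
            matches = sum(1 for i in neigh
                          if train_y[i] == c and train_cat[i][j] == v)
            n_values = len({train_cat[i][j] for i in neigh} | {v})
            # alpha is an assumed Laplace smoothing constant
            score += math.log((matches + alpha) / (n_c + alpha * n_values))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data (hypothetical): two numerical attributes, one categorical attribute.
train_num = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (5.0, 50.0), (5.2, 52.0), (4.8, 49.0)]
train_cat = [("a",), ("a",), ("b",), ("b",), ("b",), ("a",)]
train_y = ["low", "low", "high", "high", "high", "high"]

print(cnk_classify(train_num, train_cat, train_y, (1.0, 10.0), ("b",), k=3))
```

In this toy example the three nearest neighbours carry the labels low, low and high, so a plain majority vote would return "low". The Naïve Bayes step on the categorical attribute returns "high" instead, which is the kind of difference between cNK and plain KNN that the text describes.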

Datasets
To give a better understanding of how the newly proposed cNK algorithm works, we use a numerical illustration on six datasets obtained from the Machine Learning Repository of the University of California (http://archive.ics.uci.edu/ml/datasets.html) (Table 1). In doing so, we use accuracy as the measure of performance of the algorithm.
Hepatitis Dataset: The hepatitis data were used by Diaconis and Efron (1983). Gail Gong reported 80 percent classification accuracy. The data were also used by Cestnik et al. (1987), who showed 83 percent classification accuracy.
Crx Dataset: This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. There are no missing values in this dataset.
Australian Dataset: This dataset originated from Quinlan (1997). The title of the dataset is "Australian Credit Approval". It was used in "Simplifying decision trees" in 1987 and "C4.5: Programs for Machine Learning" in 1992. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. This dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.
Post-Operative Dataset: This dataset was created by Sharon Summers of the University of Kansas. The title of the dataset is "Postoperative Patient Data." It was donated by Jerzy W. Grzymala-Busse in 1993. It was used for the program LERS_LB as a tool for knowledge acquisition in nursing by Budihardjo et al. (1991). It was also used for the machine learning program LERS_LB 2.5 in knowledge acquisition for expert system development in nursing. The classification task of this database is to determine where patients in a postoperative recovery area should be sent next. The accuracy of LERS is 48 percent.
Heart Dataset: This database contains no missing values. Here the presence or absence of heart disease is predicted on the basis of age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak, etc.

German Dataset: The title of this dataset is "German Credit Data". It classifies people described by a set of attributes as good or bad credit risks. The dataset was provided by Prof. Dr. Hans Hofman and was used by Eggermont et al. (2004) and Wang et al. (2003).

Experimental Design
The datasets must contain both numerical and categorical attributes, as the newly proposed cNK algorithm treats the two kinds of attributes differently. The Naïve Bayes and cNK algorithms are implemented using the "class" and "e1071" libraries of the R programming language. The 10-fold cross-validation method is used to find the accuracy of the algorithm for different values of K. See Figure 2 for a stepwise description of the whole experimental design.
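As an illustration of the 10-fold cross-validation step, the following Python sketch estimates accuracy for any classifier passed in as a function. It is a generic sketch rather than the authors' R procedure. The names `k_fold_accuracy` and `majority_predict`, the fixed shuffling seed, and the toy data are assumptions introduced here.

```python
import random
from collections import Counter

def k_fold_accuracy(records, labels, predict, n_folds=10, seed=0):
    """Estimate accuracy by n-fold cross-validation.

    predict(train_records, train_labels, record) must return a label.
    """
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)  # assumed: random fold assignment
    folds = [idx[f::n_folds] for f in range(n_folds)]  # disjoint test sets
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        tr_x = [records[i] for i in train_idx]
        tr_y = [labels[i] for i in train_idx]
        for i in fold:
            if predict(tr_x, tr_y, records[i]) == labels[i]:
                correct += 1
    return correct / len(records)

def majority_predict(train_records, train_labels, record):
    """Baseline used only for this example: always predict the majority class."""
    return Counter(train_labels).most_common(1)[0][0]

# Toy data (hypothetical): 15 observations of class "a" and 5 of class "b".
records = list(range(20))
labels = ["a"] * 15 + ["b"] * 5
print(k_fold_accuracy(records, labels, majority_predict))
```

To evaluate a classifier such as cNK for a given K, one would pass a `predict` function that fixes that K and call `k_fold_accuracy` once per candidate value.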
For a wider comparison, we have chosen four algorithms, viz. discretized simple Bayes (Dougherty et al., 1995), selective simple Bayes with forward selection (Langley & Sage, 1994), tree augmented Naïve Bayes (Friedman et al., 1997) and the lazy Bayesian rule-learning algorithm (Zheng & Webb, 2000). Moreover, three more sophisticated machine-learning algorithms are used: the Back Propagation (BP) algorithm (Mitchell, 1997), which estimates the values of the weights of a neural network, the SMO algorithm, and the 3-Nearest Neighbor (3NN) algorithm, as representatives of the Artificial Neural Network (ANN), the Support Vector Machine (Platt, 1999) and KNN (Aha, 1997), respectively. Subsequently, we also compare against BoostFSNB and against bagging and boosting of decision trees, which have proved successful in many machine-learning problems (Quinlan, 1997).
Figure 2. Steps of the experimentation procedure (final step: run the algorithm for different values of K)

Experimental Results
We run the cNK algorithm over each dataset for different values of K (the number of neighbors used to build the Naïve Bayes model). For experimentation purposes we gradually increase the value of K and run the same test. As the distance metric we always use the normalized Euclidean distance. If we plot accuracy against the values of K, we get a peak point. Thus, from 10-fold cross-validation over the training data we find the value of K that gives the highest accuracy. We then use it on the test data.
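The search for the peak of the accuracy curve can be sketched in Python as below. The helper name `select_k`, the rule of preferring the smaller K when accuracies tie, and the toy accuracy curve are assumptions made for illustration; in practice `accuracy_for_k` would be the cross-validated accuracy of cNK for a given K.

```python
def select_k(candidate_ks, accuracy_for_k):
    """Return the K with the highest accuracy and the full score table.

    accuracy_for_k(k) is any function returning an accuracy in [0, 1].
    Assumption: ties are broken in favour of the smaller K.
    """
    scores = {k: accuracy_for_k(k) for k in candidate_ks}
    best_k = max(scores, key=lambda k: (scores[k], -k))
    return best_k, scores

# Toy accuracy curve (hypothetical) that peaks at K = 30.
def toy_accuracy(k):
    return 0.8 - abs(k - 30) / 100

best_k, scores = select_k(range(10, 51, 10), toy_accuracy)
print(best_k, scores[best_k])
```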
For Hepatitis, Post-operative and CRX, the cNK algorithm (86%, 71% and 87%) performs better than the Naïve Bayes classifier (83%, 64% and 85%) for K equal to 70, 50 and 400, respectively (see Table 2). For the Australian data, the cNK algorithm (87%) also shows some improvement over the accuracy of the Naïve Bayes classifier (86%) for K = 350. It is also clear (Table 2) that for all datasets the cNK algorithm has a lower error rate than the Naïve Bayes classifier. These results confirm the initial hypothesis of the study. The cNK algorithm outperforms the original Naïve Bayes classifier in every domain, with accuracy gains of up to 3 percent. The cNK algorithm also outperforms DiscNB, WrapperNB, TAN, LBR, BP, SMO and 3-NN in every domain, with accuracy gains of up to 2, 7, 3, 2, 3, 2 and 4 percent, respectively (see Table 3). The cNK algorithm thus outperforms the algorithms that tried to improve the classification accuracy of the simple Bayes algorithm, and its overall performance is quite impressive. The accuracy of the cNK algorithm is also compared with that of the simple Bayes algorithm under the bagging and boosting procedures of Bauer and Kohavi (1999). The cNK algorithm gives greater accuracy than BoostFSNB, single boosting simple Bayes (AdaBoostNB), AdaBoost C4.5, Bagging C4.5 and Bagging NB for all datasets (Table 4).

Predicting Profitability Position of Financial Institutions
The cNK algorithm is applied to find the profitability position of some financial institutions of a country. The central bank of the country evaluates this profitability position and determines whether it is satisfactory or not. The position is determined by analyzing a large number of attributes. The evaluation is performed by regulatory organizations and involves complex calculations and procedures. However, using the available attributes and the class assigned to them in previous evaluations, we may build a model to predict the likely position of any institution given the values of its attributes (Table 5). Institute type refers to the type of the financial institution, namely nationalized commercial banks, government-owned development financial institutions, private commercial banks and foreign commercial banks. Total 'interest income' refers to the income from interest, which is the biggest source of income for a financial institution. Total 'interest expense' is the amount of money spent on borrowing money. It usually refers to the amount of money paid to depositors as interest by the financial institution. Besides earning interest income, nowadays most financial institutions diversify their income sources into fee-earning activities, such as corporate cash management, check collection, consumer annual fees on credit cards, and monthly service charges on deposit accounts. These make up the noninterest income, which may also include many new activities, such as fees from participation in mutual fund commissions, investment advisor fees in merger and acquisition activities, and securities underwriting fees. This dataset has never been used before for classification and is used here for the first time to measure the accuracy of the Naïve Bayes and cNK algorithms. There are 48 observations and 20 attributes, of which 15 are numerical, 4 are categorical and 1 is the class attribute (Table 5). We apply some of the other algorithms as well as the proposed cNK algorithm to the dataset to predict the profitability position. The best result for cNK (89.58%) is obtained for K = 21 (see Table 6 and Figure 3). The detailed results are shown in Table 7. They show that the proposed algorithm gives the better result.

Conclusion
This paper presents a new classifier, cNK, which combines Naïve Bayes and K-Nearest Neighbor. We implement the Naïve Bayes classifier and the cNK algorithm on some standard datasets using R code. The experimental results show that the performance of the proposed algorithm is better than that of Naïve Bayes in many cases. On some of the datasets, cNK is also compared with other attempts to improve the accuracy of the simple Bayes algorithm and with some other algorithms, such as BoostFSNB, DiscNB, WrapperNB, TAN, LBR, BP, SMO, AdaBoost C4.5, Bagging C4.5, BaggingNB and AdaBoostNB. In most cases, the cNK algorithm gives the better result. It also gives a better result on the central bank dataset used for the profitability analysis. The experiments are done in a limited context over a small number of datasets, so cNK cannot be declared the best approach; nevertheless, they show that the proposed algorithm can outperform other algorithms in several cases. Further investigation can evaluate the performance of the algorithm through extensive experimentation on real and simulated datasets. Based on the evidence, it may be concluded that a simple approach has been proposed that may improve the performance of the simple Bayes classifier by combining two classifiers when the data contain both continuous and categorical attributes.
K-Nearest Neighbor: given a training set $D$ and a test object $z$, the algorithm determines a nearest-neighbor list $D_z$. Here $x$ is the data of a training object, while $y$ is its class. Likewise, $x'$ is the data of the test object and $y'$ is its class, and $D_z$ is the set of $k$ closest training objects to $z$.

Figure 1. Graphical representation of the proposed algorithm

Table 1. Description of datasets used
The hepatitis domain was donated by G. Gong of Carnegie-Mellon University in November 1988.

• Produce disjoint training and test sets (step of Figure 2).

Table 2. Accuracy of the algorithm using 10-fold cross validation

Table 4. Comparing the cNK with other attempts to improve the Naïve Bayes

Table 5. Central bank dataset attributes
the income reported after deducting the expenses but before deducting taxes and interest. Interest-earning assets and interest-bearing liabilities are the total values of the institution's earning assets and of its liabilities that cost interest. Total loan is the amount given as loans by the institution. Classified loans are those that were given according to the rules and regulations of the institution but were later considered suspect by the bank examiners.

Table 6. Accuracy of the proposed algorithm on central bank dataset (2 classes)
Inter-bank deposit is the deposit held by the institution for other banks, usually a correspondent. Usually a due-to account is held. Deposits excluding this are shown in another column. Urban and rural represent the numbers of urban and rural branches of the institution.

Table 7. Comparing the performance of the algorithms over the central bank dataset
The fields number of ATMs, number of employees, number of deposit accounts and number of loan accounts are self-explanatory. Off-balance sheet items are obligations that are contingent liabilities of a bank and thus do not appear on its balance sheet. In general, off-balance sheet items include the following: direct credit substitutes in which a bank substitutes its own credit for that of a third party, including standby letters of credit; irrevocable letters of credit that guarantee repayment of commercial paper or tax-exempt securities; risk participations in bankers' acceptances; sale and repurchase agreements; asset sales with recourse against the seller; interest rate swaps; interest rate options and currency options; and so on.
Figure 3. Accuracy of the proposed algorithm without PCA of central bank dataset (2 classes)