3N-Q: Natural Nearest Neighbor with Quality

In this paper, a novel algorithm for enhancing the performance of classification is proposed. This new method also provides rich information for clustering and outlier detection. We call it Natural Nearest Neighbor with Quality (3N-Q). Compared to K-nearest neighbor and ε-nearest neighbor, 3N-Q employs a completely different concept: it finds the nearest neighbors passively, which allows the K value to be obtained adaptively and automatically. This value, together with the distribution of neighbors and the frequency with which each sample is a neighbor of others, offers a precious foundation not only for classification but also for clustering and outlier detection. Subsequently, we propose a fitness function that reflects the quality of each training sample, retaining the good samples while eliminating the bad ones according to a quality threshold. The experimental results reported in this paper show that 3N-Q is efficient and accurate for solving data mining problems.


Introduction
K-Nearest Neighbor (KNN) (Han & Kamber, 2006) is one of the most popular algorithms in knowledge discovery. It is a supervised learning method in which a new instance is classified according to the majority category among its K nearest neighbors. The purpose of the algorithm is to classify the new object based on its attributes and the training samples. However, due to its limitations on large data sets, such as a low recognition rate, high computational complexity and no evaluation of training samples, an increasing number of modifications have been proposed in recent years (Shi, Li, Liu, He, Zhang, & Song, 2011) to improve the efficiency of KNN, including the well-known weighted-KNN (Liu & Chawla, 2011). However, several major problems remain unsolved, including the arbitrary choice of the K value, outlier elimination, excessive computing time and data size reduction. Thus the need for a new method that addresses these deficiencies is obvious, and our 3N-Q serves this purpose very well.
Our proposed method is divided into two parts. The first is the proposed concept of 3N and its implementation, which is inspired by the idea of obtaining K dynamically (Gong & Liu, 2011). Our algorithm obtains the value of K automatically for different problems: the search stops once every point in the space is a neighbor of at least one other point. The K value reflects the distribution of the data set: the smaller K is, the sparser the data; the bigger K is, the denser the data. We conducted several experiments to evaluate the stability and accuracy of the proposed algorithm. The first experiment shows the stability of K. Two other variables record the first to Kth neighbors of each training sample and the frequency with which each sample is a neighbor of others. They provide density and outlier information, which also serves as evidence for clustering and outlier detection. The second experiment uses the fitness function to show the quality of the training samples. It is deemed that a point sharing the same class as most of its neighbors possesses a high energy, and vice versa. We report that outliers and boundary points receive low quality values here and are therefore eliminated.
This paper is organized as follows. Section 2 introduces the background of the regular KNN algorithm. Section 3 discusses the work related to this study. Section 4 describes the proposed method in detail. Three experiments conducted to show the stability of the new algorithm, its clustering ability and its classification accuracy are reported in Section 5. Finally, conclusion and future work are presented in Section 6.

Background
In this section, we introduce the background as well as the principle of the regular KNN. KNN has gained much popularity due to its non-parametric and easy-to-implement properties. Basically, it classifies an unknown instance by its closest neighbors in the training set. K represents the number of closest neighbors we select. Figure 1 from (Commons, 2007) illustrates the core idea of the algorithm. The green circle is the unclassified sample, and our task is to determine whether it belongs to the blue squares class or the red triangles class. The first step is to calculate the distance between the unclassified instance and each training sample. Obviously, if we take its three nearest neighbors into account, the unclassified instance should belong to the class of red triangles: among these three neighbors, two are red triangles, which yields a majority vote. On the contrary, it would be classified into the blue squares class if we take more neighbors into consideration, such as 5; in this latter situation, the three squares hold the majority among the total of 5. Hence, we can see that the classification accuracy is highly and directly dependent on the number K of neighbors we consider.
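As a concrete illustration of this sensitivity to K, the following Python sketch classifies the same query point differently for k = 3 and k = 5. The point layout and names are ours, loosely echoing Figure 1; this is not code from the paper.

```python
from collections import Counter

def knn_classify(train, labels, query, k):
    """Classify `query` by majority vote among its k closest training samples.
    `train` is a list of 2-D points, `labels` the class of each point."""
    # Sort training indices by squared Euclidean distance to the query.
    order = sorted(range(len(train)),
                   key=lambda i: (train[i][0] - query[0]) ** 2 +
                                 (train[i][1] - query[1]) ** 2)
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy layout: two triangles close to the query, three squares farther away.
train = [(1.0, 0.0), (0.0, 1.1), (2.0, 2.0), (2.2, 1.8), (1.9, 2.3)]
labels = ["triangle", "triangle", "square", "square", "square"]
query = (0.5, 0.5)

print(knn_classify(train, labels, query, k=3))  # "triangle"
print(knn_classify(train, labels, query, k=5))  # "square"
```

With k = 3 the two nearby triangles outvote one square; with k = 5 the three squares win, exactly the flip described above.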

Related Work
Data mining aims at discovering valuable information behind large data sets in a wide variety of areas including artificial intelligence, machine learning, statistics, and database systems (Fayyad, Piatetsky-Shapiro, & Smyth, 1996; Hastie, Tibshirani, & Friedman, 2009). Recently, data mining has gained much popularity, resulting from the enormous market and research demand in science and engineering areas such as bioinformatics, genetics, medicine, education and electrical power engineering (Witten, Frank, & Hall, 2011). Classification, clustering and outlier detection are three major tasks in data mining because of their various applications, including detecting credit card fraud, analyzing protein structures, and categorizing news as finance, politics, education or sports. The K-Nearest Neighbor algorithm (Mitchell, 1997) is among the simplest of all machine learning algorithms: an object is voted on by its K neighbors and classified by a majority vote of these K neighbors. The combination of artificial intelligence algorithms and the basic KNN has shown remarkable success in many areas. In 1991, Kelly and Davis (1991) proposed a hybrid method employing a genetic algorithm and KNN to evaluate the weight of individual attributes in the data set. In 2001, Li and Weinberg (2001) proposed a GA/KNN method, which utilizes a subset of predictive genes jointly for the classification of expression data. In 2005, Peterson and Doom (2005) provided a GA-facilitated KNN classifier that assigns a weight to each attribute and applies different measurements (Euclidean distance, cosine and Pearson similarity), taking the k neighbors directly instead of considering all the training samples. In 2010, Suguna and Thanushkodi (2010) also put forward an improved KNN classification using genetic algorithms. In 2012, Yang (2012) presented an integrated genetic algorithm and conditional probability approach to enhance the performance of the standard KNN. A drawback of the basic "majority voting" classification occurs when the class distribution is irregular and complex (Mitchell, 1997). One way to overcome this issue is to weight the samples, taking into account the distance between the test point and each of its k nearest neighbors; we call this weighted-KNN. In 2011, Liu and Chawla (2011) came up with a class-confidence weighted-KNN to tackle imbalanced data sets, which exhibit one of the most complex distributions; they showed that the inherent bias toward the majority class in existing KNN algorithms cannot be corrected by distance measurements alone. In 2012, Lican and Liu (2012) proposed a parallel GPU-based KNN algorithm to reduce the computational cost in practical applications. In 2013, Amato and Falchi (2013) addressed the image recognition problem by combining KNN and local feature search with similarity functions.

Proposed Method
This section introduces the proposed method in detail. The first purpose of this method is to solve some puzzling problems in classification, such as the uncertainty of the number of neighbors, the interference of noise points, and the irregular distribution of data sets. Providing precious information for clustering and outlier detection is another purpose. In the following, we first introduce our proposed algorithm and provide the details of the quality function used to pre-process the training data. We then present the threshold setting used to eliminate low-quality data and describe the classification procedure we have used.

3N Algorithm
The main idea of the algorithm is as follows. Our purpose is to determine the number of neighbors that should be considered in the classification step. We find the first to rth neighbors of each training sample until the terminal condition is reached, that is, when every training sample has been a neighbor of another sample at least once. The algorithm can be written as the following mathematical expression:

K = min{ r : ∀x ∈ X, ∃y ∈ X, y ≠ x, x ∈ N_r(y) }   (1)
Here, K = r, and N_r(y) is the set of the r nearest neighbors of y; r is the number of neighbors considered, and x and y are training samples from the set X. N_r(y) records the first to rth neighbors of y. The value of r begins at 1 and increases until the condition A(r, x, y): x ∈ N_r(y) holds for every x. It is apparent that K can thus be obtained adaptively instead of subjectively or randomly. An example follows. We randomly choose 9 samples from the famous iris data set from UCI (Blake & Merz, 1998), three from each class. Each sample has four attributes, and there are three categories (classes) of plants. From Table 3, we can see that after finding each sample's first two neighbors, sample 6 is still not a neighbor of any other sample. Hence, we continue to find everyone's third neighbor until sample 6 becomes a neighbor of another sample.
Using formula (1) we propose our 3N algorithm as shown in Algorithm 1.
Algorithm 1: NaturalNearestNeighborAlgorithm (3N)
Input: X, the data set of n training samples (DataSize = n × m)
Output: r, nn(n, n), nb(1, n)
% nb(i) records the frequency with which i is a neighbor of others
% nn(i, r) records the 1st to rth neighbors of i
r = 0; flag = 0;
∀i ∈ X: nb(i) = 0; nn = zeros(n, n);
% the termination condition is that every sample i is a neighbor of others at least once
while flag == 0 do
    r = r + 1;
    for ∀i ∈ X do
        compute the rth nearest neighbor of sample i, record it in nn(i, r), and increment nb(nn(i, r));
    end
    if all nb(i) > 0 then
        flag = 1;
    end
end
K = r; % because of other possible uses of r, we assign the value of r to K

In this algorithm we start by looking for each sample's first neighbor. If the condition in the if statement is not true, we increment r by one and continue to find each sample's second neighbor. The process is repeated until the condition in the if statement holds. The worst situation occurs when one point deviates from all the other data points and is not among the first n − 2 neighbors of any of them; in that case r reaches its peak value n − 1 (since every point has n − 1 potential neighbors), and the time complexity is O(n × r). Additionally, if r turns out to be very large, we may consider dealing with the outliers or noise points in the training samples first.
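To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 under our reading (brute-force distances; the function name and data representation are ours, not the paper's Matlab code):

```python
def natural_nearest_neighbor(X):
    """Sketch of Algorithm 1 (3N). X is a list of numeric feature vectors.
    Returns K, the neighbor table nn, and the counts nb."""
    n = len(X)
    # For each sample, rank every other sample by squared Euclidean distance.
    order = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(X[i], X[j])))
        order.append(ranked)
    nb = [0] * n                 # nb[i]: how often i is a neighbor of others
    nn = [[] for _ in range(n)]  # nn[i]: the 1st..rth neighbors of i
    r = 0
    # Terminate once every sample is someone's neighbor at least once.
    while not all(c > 0 for c in nb):
        for i in range(n):
            j = order[i][r]      # the (r+1)-th neighbor of i (0-based index r)
            nn[i].append(j)
            nb[j] += 1
        r += 1
    return r, nn, nb

# One-dimensional toy data: the point at 10 is far from the cluster {0, 1, 2},
# so the search must reach r = 3 before it becomes anyone's neighbor.
K, nn, nb = natural_nearest_neighbor([[0.0], [1.0], [2.0], [10.0]])
print(K)  # 3
```

This mirrors the worst-case discussion above: the isolated point forces r to grow until it finally appears in someone's neighbor list.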

Quality Evaluation
Inspired by the neighbor concept (Parvin, Alizadeh, & Minaei-Bidgoli, 2008), we suppose that the quality of a sample is determined by its neighbors. In the 3N-Q algorithm, every training sample must first be qualified according to its r neighbors. Hence, we provide the following formula to compute the quality:

Q(x) = (1/K) Σ_{i=1..K} S(C(x), N_i(x))   (2)
C(x): the class value of x.
N_i(x): the class value of x's ith nearest neighbor. The function S is defined as follows:

S(a, b) = 1 if a = b, and S(a, b) = 0 otherwise.

This means that if the two class values a and b are the same, S = 1; otherwise S = 0.
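Under this reading, the quality of a sample is simply the fraction of its natural neighbors that share its class. A minimal Python sketch (the function name and toy data are ours):

```python
def quality(labels, nn, x):
    """Quality of training sample x: the fraction of x's natural neighbors
    whose class value matches C(x). `nn[x]` holds the indices of x's
    1st..Kth neighbors."""
    # Each term is S(C(x), N_i(x)): 1 when classes agree, 0 otherwise.
    same = sum(1 for j in nn[x] if labels[j] == labels[x])
    return same / len(nn[x])

labels = ["A", "A", "A", "B"]
nn = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [2, 1, 0]}
print(quality(labels, nn, 0))  # 2/3: two of three neighbors share class "A"
print(quality(labels, nn, 3))  # 0.0: sample 3 agrees with none of its neighbors
```

Sample 3 behaves like the outliers discussed below: surrounded by the other class, its quality falls to 0 and it would not survive the threshold.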

Threshold Setting and Classification
Here we use the majority vote to set the threshold. If more than half of a point's neighbors belong to the same class as the point, we consider it a positive energy in classification. In contrast, points with low energy will most likely have a negative impact on the final result. This is similar to "the fittest survives natural selection" (Oliveira, 2012): a point with low energy is considered isolated from its peers and cannot adapt well to the environment. Consequently, it is eliminated. In general, we keep the samples whose quality is greater than 0.5 and eliminate the samples whose quality is equal to or lower than 0.5. After this data reduction, the new training set consists of high-quality samples.
The class of a new point is determined by its r retained neighbors. There is no weighting here, as we assume that the retained points are of high quality and that each of them has the ability to vote correctly. If several classes receive the same number of votes, the class is determined by the first neighbor. Just as in real life, the one who knows us best is our best friend.
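The reduction and voting steps described above can be sketched as follows. This is an illustrative Python reading; the function names, tie handling and toy data are ours:

```python
from collections import Counter

def filter_by_quality(samples, labels, qualities, threshold=0.5):
    """Keep only training samples whose quality exceeds the threshold,
    as in the 'fittest survives' reduction step."""
    kept = [i for i, q in enumerate(qualities) if q > threshold]
    return [samples[i] for i in kept], [labels[i] for i in kept]

def classify(neighbor_labels):
    """Majority vote among the retained neighbors, listed closest-first.
    Ties are broken by the first (closest) neighbor."""
    votes = Counter(neighbor_labels)
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:  # tie between the top classes
        return neighbor_labels[0]                # defer to the "best friend"
    return top[0][0]

# Samples with quality <= 0.5 are dropped before any voting happens.
kept_s, kept_l = filter_by_quality([10, 20, 30], ["A", "A", "B"],
                                   [0.9, 0.4, 0.6])
print(kept_s)                           # [10, 30]
print(classify(["B", "A", "B", "A"]))   # 2-2 tie, first neighbor wins: "B"
```

Note that because every retained neighbor votes with weight 1, the only asymmetry left is the tie-break by the closest neighbor.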
Compared to the weighting method, Q is likely to predict without being influenced by noise and outliers, since these receive a lower Q value according to the formula above. As a result, they are removed by the threshold setting. For example, if there is a red triangle among the red circles in Figure 2, this noise point will be removed by the threshold, since the vast majority of its neighbors are from the other class. Likewise, the red circle at the bottom of the figure is an outlier; deviating from the other instances, it will also be removed because of its low quality value. In addition, Q speeds up classification, as removing instances from the training set decreases the scale of the data.

Stability of r
Data sets are produced by the Matlab function rand(n, 2), which generates two-dimensional data (where n ranges from 500 to 4500). The purpose is to demonstrate the stability of the 3N algorithm for different data sizes. Five tests are run for each group, and the average r is obtained for each sample size. Table 4, reporting the results, indicates that r is very stable across the different groups (500, 1000, . . ., 4500), mostly ranging from 5 to 7. The average r grows slightly with the number of data points. Algorithm 1 shows that obtaining the 3N is a natural process, as there is a termination condition corresponding to the convergence of the algorithm. We used Matlab 2012a to randomly produce 500 and 1000 points respectively. When connecting the r neighbors of every point, we obtain an adaptive neighbor graph, as shown in Figure 3; in this experiment r equals 6. The effectiveness of the proposed algorithm is evaluated on both artificial (Shi, 2013) and real data sets. First, the artificial data set demonstrates the potential clustering ability of 3N. The original data set (left picture in Figure 4) shows one cluster outside and one inside, in addition to two clusters on the right side, for a total of four clusters. The corresponding 3N graph (right picture in Figure 4) shows exactly the same distribution as the original data set. In other words, we can obtain information about the data set, including the number of clusters, the distribution, etc. This is important in the pre-processing step, as it allows us to choose the appropriate type of method for the next step.
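This experiment is straightforward to replicate. The following Python sketch substitutes Python's random module for Matlab's rand and uses a brute-force 3N search to compute r for uniformly random 2-D points (the resulting r varies with the seed, so no exact value is claimed here):

```python
import random

def natural_k(points):
    """Run the 3N search on 2-D points and return the resulting r.
    Brute-force distance ranking; fine for a few thousand points."""
    n = len(points)
    order = []
    for i, p in enumerate(points):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: (points[j][0] - p[0]) ** 2 +
                                      (points[j][1] - p[1]) ** 2)
        order.append(ranked)
    nb = [0] * n
    r = 0
    # Stop once every point has been someone's neighbor at least once.
    while not all(c > 0 for c in nb):
        for i in range(n):
            nb[order[i][r]] += 1
        r += 1
    return r

random.seed(0)
for n in (500, 1000):
    pts = [(random.random(), random.random()) for _ in range(n)]
    print(n, natural_k(pts))
```

For uniform data of this size, the r values produced stay in the small, stable range the experiment reports.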
In the weighted variant 3NW, each neighbor's vote is weighted as a function of d, where d is the distance between the training sample and the test sample.
The proposed 3N-Q algorithm is evaluated on three data sets (Blake & Merz, 1998) from the UCI repository, namely Iris, Seeds and Breast. The results, reported respectively in Tables 5, 6 and 7, show that both weighting and quality evaluation improve the accuracy, since both 3NW and 3NQ achieve a higher accuracy than 3N. Moreover, Q outperforms W, since 3NQ achieves a higher accuracy than 3NW.

Conclusion and Future Work
In this paper, a novel algorithm, 3N-Q, is proposed for enhancing the performance of classification. Experiments show its stability, feasibility, potential and efficiency in the knowledge discovery areas of classification, clustering and anomaly detection. In the near future, using the fitness function of a genetic algorithm (Suguna & Thanushkodi, 2010), it may be possible to add a constraint before finding r, thus finding a more accurate r that makes the data exactly saturated instead of under- or over-saturated. The 3N graph can then reflect the distribution of the data set more realistically and provide more accurate evidence for clustering. Also, according to the 3N graph, we can determine which similarity measure is better suited for classification.

Figure 2. Example of removing noisy points and outliers

Figure 3. 3N graph respectively for 500 and 1000 random points

Table 1 lists these 9 samples. We simply use the N-dimensional Euclidean distance (M. Deza & E. Deza, 2013) to calculate the distance between the training samples. The results are shown in Table 2. According to Table 2, we can find each sample's r neighbors; neighbors are determined here by distance. For instance, samples 2 and 3 are the first and second neighbors of sample 1, respectively.

Table 4. Value of r for different sample sizes