A Novel Center Point Initialization Technique for K-Means Clustering Algorithm

Clustering is a major data analysis tool used in numerous domains. The basic K-means method has been widely discussed and applied in many applications, but it often fails to give good clustering results because the initial center points are chosen randomly. In this article, we present a new method of center point initialization and prove that the distance used by the new method follows a Chi-square distribution. The new method overcomes this drawback of the basic K-means. Experimental analysis shows that the new method performs well on an infectious diseases dataset when compared with the basic K-means clustering method, and a histogram is used to measure the quality of the new method.


Introduction
The massive quantity of information gathered and stored in databases creates the need for efficient exploration techniques that can make use of the information implicitly contained there. Clustering is among the first data exploration tasks; it enables one to understand the patterns and natural groupings within datasets. Hence, improving clustering techniques continues to attract a lot of interest. The aim is to cluster the items in a database into a set of meaningful subclasses (Ankerst et al., 1999).
Data are generally preprocessed by means of data selection, data integration, data transformation and data cleaning before the exploration process. The exploration may be carried out on different databases and data repositories, and the patterns to be found are specified by the exploration task, such as concept/class description, classification, association, prediction, correlation analysis, cluster analysis and so on.
Cluster analysis is a method of grouping a set of patterns into different groups such that patterns within the same group are similar, while patterns in different groups are dissimilar. Cluster analysis is a widely studied problem in application areas such as knowledge discovery and data mining (Fayyad et al., 1996).
Among the clustering methods that are based on minimizing an objective function, the most widely used and studied is the k-means method. Given n data items in real d-dimensional space, $\mathbb{R}^d$, and an integer k, the problem is to determine a set of k points in $\mathbb{R}^d$, known as centres, that minimizes the mean squared distance from every data point to its closest centre.
The basic procedure of k-means cluster analysis is straightforward. First we decide the number of groups k and assume the centres of these groups. Any random items can be taken as the initial centres, or the first k items in the sequence can serve as the initial centroids. After that, the k-means technique repeats the following three steps until convergence, that is, until the assignment is stable and no item changes group:
i. Determine the centre coordinates.
ii. Determine the distance of each item to the centres.
iii. Group the items according to the minimal distance.
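The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments; the function name `basic_kmeans`, the use of NumPy, the Euclidean distance and the `max_iter` and `seed` parameters are all assumptions made for the example.

```python
import numpy as np

def basic_kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means with randomly chosen initial centres (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Take k distinct data items at random as the first centres.
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step ii: Euclidean distance from every item to every centre, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        # Step iii: group each item with its nearest centre.
        new_labels = dists.argmin(axis=1)
        # Stop once no item changes group.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step i: recompute each centre as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous centre
                centres[j] = members.mean(axis=0)
    return centres, labels
```

Because the first centres are drawn at random, different values of `seed` can lead to different final clusters, which is the sensitivity to initialization discussed in this article.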
Although the basic k-means approach has some advantages over alternative data clustering approaches, it also has shortcomings: it usually converges to a local optimum (Anderberg, 1973), and the final result depends on the initial centroids.
The two simplest methods of cluster centre initialization are to choose the initial values at random or to select the first k samples of the data items. Alternatively, several sets of initial values can be selected from the data items and the set that comes closest to the optimum retained. However, examining different initial sets is considered impracticable, especially for a large number of clusters (Rajashree et al., 2010). Moreover, the computational complexity of the basic k-means technique can be very high, particularly for large datasets. In addition, the number of distance computations rises sharply as the dimensionality of the dataset increases. As dimensionality increases, usually only particular dimensions are relevant to a specific cluster, while data in the irrelevant dimensions may produce a lot of noise and conceal the real groups to be found. Furthermore, as dimensionality increases, data usually become increasingly sparse, so that data items located in different dimensions can be regarded as all equally distant and the distance measure, which is fundamental to clustering, becomes meaningless. For this reason, feature reduction or dimension reduction is a very important data-preprocessing step for clustering datasets that have a large number of attributes/features (Rajashree et al., 2010).
Principal component analysis (Valarmathie et al., 2009) is an unsupervised feature reduction technique that projects a higher-dimensional dataset onto a lower-dimensional dataset that captures most of the variance in the data with minimal reconstruction error. Dimensionality reduction (Rajashree et al., 2010) is the transformation of a higher-dimensional dataset into a lower-dimensional one corresponding to the intrinsic dimensionality of the dataset. It is categorized into two classes: feature extraction (also referred to here as feature reduction) and feature selection.
Feature selection aims at finding a subset of the most representative features according to some objective function in a discrete space. These methods are normally greedy and therefore generally cannot find the optimal solution in the discrete space. Feature extraction aims at extracting features by projecting the original higher-dimensional dataset onto a lower-dimensional space by means of an algebraic transformation. This reaches an optimal solution of the problem in a continuous space, but its computational complexity is higher than that of feature selection. Numerous feature reduction techniques have been proposed. Principal component analysis is a frequently employed feature reduction technique that minimizes the reconstruction error.
A number of efforts have been made by researchers to improve the performance and efficiency of the basic k-means technique. Yuan (2004) presented a systematic way of selecting the initial center points, but this approach does not improve the time complexity of the k-means technique. Belal and Daoud (2005) presented a new technique for initializing cluster centers by considering a set of medians obtained from the dimension with maximum variance. Zoubi (2008) presented a technique to improve k-means cluster analysis by avoiding unnecessary distance computations using partial distance logic. Fahim (2009) presented an approach for selecting a good initial solution by dividing the dataset into blocks and applying k-means to each block, but its time complexity is somewhat higher. Although the techniques above can obtain good initial centres to some extent, they tend to be quite complicated, and several of them use the k-means technique within their own procedure and therefore still rely on randomly selected center points. Deelers and Auwatanamongkol (2007) presented a method to improve the k-means clustering technique based on a data partitioning technique used for color quantization. This technique partitions the data along the data axis with the maximum variance. Nazeer and Sebastian (2009) presented an improved k-means technique that includes a systematic way of obtaining the initial center points and a new efficient way of assigning data items to their clusters. This approach completes the whole clustering procedure in O(n^2) time without compromising the correctness of the clusters. Furthermore, Xu et al. (2009) specified a new initialization scheme that chooses the initial cluster centres using reverse nearest neighbor search. Yet all of the techniques above fail to work effectively with high-dimensional datasets.
Yeung and Ruzzo (2000) presented an empirical study of principal component analysis for clustering gene expression datasets, but the initial center points were again selected at random. Chao and Chen (2005) also presented an approach to dimension reduction for microarray data analysis using Locally Linear Embedding. Karthikeyani and Thangavel (2009) extended the k-means clustering technique by applying global normalization before carrying out cluster analysis on distributed datasets, without necessarily transferring all of the data to a single site. The performance of the proposed normalization-based distributed k-means clustering technique was compared with that of the distributed k-means clustering technique and the normalization-based directed k-means clustering technique. The clustering quality was also evaluated with three normalization methods (z-score, decimal scaling and min-max) under the suggested distributed clustering technique. A comparative test revealed that the distributed clustering results depend on the kind of normalization method. Alshalabi et al. (2006) designed an experiment to evaluate the impact of various normalization procedures on consistency and precision. The experimental results suggested choosing z-score normalization as the method that gives better accuracy.

Materials and Methods
Standardization of the original dataset: The original dataset is scaled to have mean 0 and variance 1 using z-score standardization. The location and scale information of the original variables is lost (Jain and Dubes, 1988). An important limitation of z-score standardization is that it is applied as a global standardization rather than a within-cluster standardization (Milligan & Cooper, 1988). The second method applied is principal component analysis, for outlier detection and removal.
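As an illustration of the standardization step, a z-score transform can be sketched as below. The function name `zscore_standardize` and the use of the sample standard deviation (`ddof=1`) are assumptions for this example; the text does not state which variance estimator was used.

```python
import numpy as np

def zscore_standardize(X):
    """Scale every column of X to mean 0 and (sample) variance 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    std[std == 0] = 1.0          # leave constant columns at zero instead of dividing by 0
    return (X - mean) / std
```

The standardization is global: the mean and standard deviation are computed over all data objects, not separately within each cluster.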
Computing principal components of the standardized dataset: The number of principal components obtained is identical to the number of original variables. To remove the weaker components from the set of principal components, we computed the corresponding variances, percentages of variance and cumulative percentages of variance, shown in Table 2. We then retained the principal components with variances greater than the mean variance and disregarded the others. The reduced principal components are shown in Table 3.
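A possible sketch of this step is given below. It assumes that the principal components are obtained from an eigen-decomposition of the sample covariance matrix and that a component is retained when its variance exceeds the mean variance, which is the criterion stated for Table 3 in the experimental section. The function name `pca_reduce` and its return values are hypothetical.

```python
import numpy as np

def pca_reduce(Z):
    """Project Z onto the principal components whose variance exceeds the mean variance."""
    Z = np.asarray(Z, dtype=float)
    cov = np.cov(Z, rowvar=False)
    # eigh is used because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pct = 100.0 * eigvals / eigvals.sum()      # percentage of variance per component
    cum_pct = np.cumsum(pct)                   # cumulative percentage of variance
    keep = eigvals > eigvals.mean()            # assumption: keep variance above the mean
    W = eigvecs[:, keep]                       # transformation matrix of retained components
    Y = (Z - Z.mean(axis=0)) @ W               # reduced dataset
    return Y, eigvals, pct, cum_pct
```

The arrays `eigvals`, `pct` and `cum_pct` correspond to the kind of quantities reported per component, and `Y` plays the role of the reduced dataset used in the later steps.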
Acquiring the reduced dataset utilizing reduced principal components: The transformation matrix is formed from the reduced principal components and can be used for further data analysis. The reduced dataset Y, shown in Table 4, is used for further analysis.

Initialization of the center points
2) Compute the distance between each pair of data points in the set Y.
3) Select the two data points $y_i$ and $y_j$ such that distance($y_i$, $y_j$) is maximal.
7) For each data point in Y, find the nearest cluster center in the list Cen and assign the data point to the corresponding cluster.
8) Update the center of each cluster using the mean of the data points assigned to that cluster.
Repeat steps 7 and 8 until there is little or no further change in the centroids.
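Steps 2 and 3 can be sketched as follows, assuming Euclidean distance and a dataset small enough for the full pairwise distance matrix to fit in memory. The function name `farthest_pair` is hypothetical.

```python
import numpy as np

def farthest_pair(Y):
    """Return the indices of the two points of Y that are farthest apart, and their distance."""
    Y = np.asarray(Y, dtype=float)
    # Step 2: distance between every pair of data points, shape (n, n).
    dists = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    # Step 3: locate the largest entry of the distance matrix.
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    return int(i), int(j), float(dists[i, j])
```

The pairwise matrix costs O(n^2) time and memory, which is acceptable for a dataset of 20 objects but would need a different strategy for large n.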

Results and Discussions
In this section, we show that the new method is normal and follows a Chi-square distribution. We analyse and compare the results of the basic and new methods. We also evaluate the accuracy of the two approaches. Accuracy is measured by the error sum of squares of the intra-cluster distances, that is, the distances between the data vectors in a group and the centroid of that cluster; the smaller this sum is, the better the accuracy of the clustering.
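The accuracy measure described above can be computed as in the following sketch, which assumes that the error sum of squares is the sum of squared Euclidean distances between each data vector and the centroid of its assigned cluster. The function name `error_sum_of_squares` is hypothetical.

```python
import numpy as np

def error_sum_of_squares(X, labels, centres):
    """Sum of squared distances between each point and the centroid of its cluster."""
    X = np.asarray(X, dtype=float)
    centres = np.asarray(centres, dtype=float)
    labels = np.asarray(labels)
    # centres[labels] pairs every point with the centroid of its own cluster.
    diffs = X - centres[labels]
    return float(np.sum(diffs ** 2))
```

For example, four points at 0, 2, 10 and 12 on a line, grouped in pairs around the centroids 1 and 11, each lie at distance 1 from their centroid, so the error sum of squares is 4.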

The New Distance Follows a Chi-square Distribution
As the original k-means distance follows a Chi-square distribution, the new method also follows a Chi-square distribution. Consider Figure 1 below, with two groups, cluster 1 and cluster 2, having $y_1, y_2$ as the random ... (2)

Collecting the like terms we have: ... Then, adding (4) into Equation 5, we have: ... Under the independence assumption, the covariance between $y_1$ and $y_2$ equals zero, that is: ... Adding Equations 4 and 7 we have ... Hence Equation 8 also follows a Chi-square distribution.
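For reference, the standard argument that links squared Euclidean distances to the Chi-square distribution can be sketched as follows. It assumes that $y_1$ and $y_2$ are independent and that each has d independent standard normal coordinates, written $y_{1l}$ and $y_{2l}$; this is an assumption introduced here for illustration, which standardization alone does not guarantee, and the sketch is not necessarily the derivation intended above.

```latex
% Assumption: y_1, y_2 independent, each with d independent N(0,1) coordinates.
\begin{align*}
\operatorname{Var}(y_{1l} - y_{2l})
  &= \operatorname{Var}(y_{1l}) + \operatorname{Var}(y_{2l})
     - 2\operatorname{Cov}(y_{1l}, y_{2l}) = 1 + 1 - 0 = 2, \\
y_{1l} - y_{2l} &\sim N(0,\, 2), \qquad l = 1, \dots, d, \\
\tfrac{1}{2}\,\lVert y_1 - y_2 \rVert^2
  &= \sum_{l=1}^{d} \left( \frac{y_{1l} - y_{2l}}{\sqrt{2}} \right)^{2}
  \sim \chi^2_d .
\end{align*}
```

Under these assumptions the covariance term vanishes by independence, and the scaled squared distance is a sum of d squared independent standard normal variables, which is the definition of a Chi-square variable with d degrees of freedom.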

Experimental Analysis
To test our algorithm we used an infectious diseases dataset. We compare the results of the k-means algorithm under two different initialization techniques: the random initialization technique and the new technique. The experimental results of the cluster analysis show that the new initialization approach outperforms the basic clustering approach. Table 3 presents the reduced principal components that have variances greater than the mean variance. Because the number of principal components found is the same as the number of variables in the original dataset, we present here only the eighty percent (applying the Pareto law) to be considered for further analysis. Table 5 presents the error sum of squares and the time taken for both the basic k-means clustering algorithm and the proposed technique. The results also show that the new technique gives a lower error sum of squares and a reduced execution time.
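One possible reading of the eighty percent rule is that components are kept, in order of decreasing variance, until they account for at least 80% of the total variance. The sketch below implements that reading only; the function name `components_for_share` and the `share` parameter are hypothetical, and the text does not confirm that this is how the rule was applied.

```python
import numpy as np

def components_for_share(variances, share=0.80):
    """Smallest number of leading components whose cumulative variance reaches `share`."""
    variances = np.sort(np.asarray(variances, dtype=float))[::-1]
    cum = np.cumsum(variances) / variances.sum()
    # searchsorted returns the first index whose cumulative share is >= share.
    return int(np.searchsorted(cum, share) + 1)
```

With component variances 4, 3, 2 and 1, the cumulative shares are 0.4, 0.7, 0.9 and 1.0, so three components are needed to reach 80%.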

Figure 2. Basic k-means algorithm
Figure 2 presents the result of the basic k-means algorithm using the original dataset, which has 20 data objects and 8 attributes, as shown in Table 1. Three points attached to clusters 1 and 2 lie outside the cluster formation, indicating the presence of outliers. The intra-cluster distance is very high while the inter-cluster distance is very small, with an error sum of squares of 175.00. With the new method, applied to the reduced dataset shown in Table 4, the intra-cluster distance is very small and the inter-cluster distance is very high, with an error sum of squares of 74.01.

Conclusion
Many applications rely on clustering techniques. One of the most widely used clustering approaches is k-means clustering. In this article a new method of center point initialization is proposed to produce clusters of optimum quality, and we prove that the new distance method follows a Chi-square distribution. Comprehensive experiments on infectious diseases datasets have been conducted in such a way that the sum of the total clustering errors is reduced as much as possible while the distances between clusters are kept as large as possible for better performance. The experimental results of the cluster analysis show that the new initialization approach outperforms the basic clustering approach.
Center points: The steps of the k-means clustering center point initialization are highlighted ... $y_i$, $y_j$ from the set Y. 6) If (m <= k): for i = 1 to m - 1, obtain the distance of each object in Y to Cen[i]. Obtain the average of the distances to the centroids for each object in Y. Pick the data object $y_o$ that has the highest average distance from the earlier centroids. Remove the object $y_o$ from Y. Stage 2: K-means clustering using the initial centroids given in Cen[ ].
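Step 6 can be sketched as follows. The sketch assumes that some centres have already been chosen and stored in `cen`, that Y holds the remaining candidate points, and that the step is repeated until k centres are available; the function name `extend_centres` and this looping interpretation of the condition m <= k are assumptions.

```python
import numpy as np

def extend_centres(Y, cen, k):
    """Add centres one at a time, each the point with the highest average distance to earlier centres."""
    Y = np.asarray(Y, dtype=float)
    cen = [np.asarray(c, dtype=float) for c in cen]
    while len(cen) < k and len(Y) > 0:
        centres = np.vstack(cen)
        # Distance of each remaining object in Y to each existing centre, shape (n, m).
        dists = np.linalg.norm(Y[:, None, :] - centres[None, :, :], axis=2)
        # Average distance of each object to the earlier centres.
        avg = dists.mean(axis=1)
        o = int(np.argmax(avg))
        cen.append(Y[o].copy())        # y_o becomes the next centre
        Y = np.delete(Y, o, axis=0)    # remove y_o from Y
    return np.vstack(cen), Y
```

The returned centres can then be passed as the initial centroids to the k-means iterations of Stage 2.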

Table 1
Table 2 presents the variances, the percentages of variance and the cumulative percentages that correspond to the principal components of the original dataset.

Table 4
Table 4 presents the transformed dataset, which has 20 data objects and 5 attributes and is generated using the reduced principal components and the original dataset, shown in Tables 3 and 1 respectively.

Table 5.
Summary of error sum of squares and time taken