A New Singular Value Decomposition Based Robust Graphical Clustering Technique and Its Application in Climatic Data

An attempt is made to study mathematical properties of singular value decomposition (SVD) and its data exploring capacity and to apply them to make exploratory type clustering for 10 climatic variables and thirty weather stations in Bangladesh using a newly developed graphical technique. Findings in SVD and Robust singular value decomposition (RSVD) based graphs are compared with that of classical K-means cluster analysis, its robust version, partition by medoids (PAM) and classical factor analysis, and the comparison clearly demonstrates the advantage of SVD over its competitors. Lastly the method is tested on well known Hawkins-Bradu-Kass (1984) data.


Introduction
Singular value decomposition, specially its low rank approximation property is an elegant part of modern matrix theory.After its inception (1936), its two ways fascinating data reduction capacity remained unnoticed till the last quarter of last century.Since then statisticians have been showing increasing interest to SVD for principal component analysis (PCA), canonical correlation analysis (CCA), cluster analysis and multivariate outliers detection.Principal component analysis (PCA), often performed by singular value decomposition (SVD), is a popular analysis method that has recently been explored as a method for analyzing large-scale expression data (Raychaudhuri et al., 2000;Holter et al., 2000;Alter et al., 2000).To analyze the effect of the oceans and atmosphere on land climate, Earth Scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth's oceans and atmosphere.In the past, Earth scientists used observations directly where as , more recently, they are using eigenvalue analysis techniques, such as principal components analysis (PCA) and singular value decomposition (SVD), to discover climate indices (Steinbach, M. , Pang-ning Tan, 2003, Maeng-Ki Kim, Yeon-Hee Kim, 2009).For statistical climate forecasting SVD is also used (Campbell, E. 2006).The Singular Value Decomposition can be applied to the analysis of climate data to identify patterns and maximize covariation (P.S. Lucio, F. C. Conde and A. M. Ramos 2007).A quasidecadal mode is isolated using singular value decomposition (SVD) (Mark R. Jury, 2009 ) applied to monthly smoothed and detrended rainfall, sea surface temperature (SST) (Wallace, J. M., Smith, C. and Bretherton, C. S. (1992)), sea level pressure (SLP).SVD can reduce data in both ways-columns and rows, and is more numerically stable.As PCA can be undertaken as a by product of SVD, in modern research it is being used more frequently in place of classical PCA for data compression ( Diamantaras and Kung, 1996;Wall et al., 2003 ), clustering (Sai Jayram andMurty, 2002;Murtagh, 2002;Chipman et al., 2003) and multivariate outliers detection (Penny and Jolliffe, 2001).
The conventional approach to the SVD requires that the matrix X be complete.If it has any missing elements, the calculation as sketched cannot be performed.Also there may be outliers.To solve this problem, Douglas M. Hawkins, Li Liu, S. Stanley Young (2001) have been developed Robust Singular Value Decomposition (RSVD) using Alternating L 1 regression approach.In this article we mainly develop a graphical technique on the basis of SVD and RSVD, and apply to a data set of Bangladesh that contain 10 climatic variables of 30 principal weather stations with a view to clustering them, and compare its results with that of classical K-means cluster analysis, PAM method and classical factor analysis.We see that our exploratory technique serves the purpose of both the techniques-K-means cluster analysis, and classical factor analysis have shown that this same graph with a slight change could be effectively used for outliers detection both for supervised and unsupervised learning.
The article consists of ten sections.Section 1 describes importance of SVD, and objectives and layout of the article.Section 2 delineates nature and source of data, section 3 describes the partitioning around medoid (PAM) method, section 4 illustrate the data reduction capacity of SVD and section 5, robust singular value decomposition.Sections 6, 7, 8 and 9 uphold our exploratory method for clustering, results for clustering of weather stations and climatic variables, and evaluated and advantages of our proposed method respectively.The paper concludes with some general comments about our method in section 10.

Nature and source of data
Source.The source of this data is the report entitled "Land Resources Appraisal of Bangladesh for Agricultural Development", BGD/81/035, Technical Report 3, volume I, @ FAO 1988.This report was prepared for the Government of the People's Republic of Bangladesh, based on the work of H. Brammer (Agricultural Development Adviser), J. Antoine (Data Base Management Expert The principal stations are coded by the serial number and given in Table 1.

Partitioning around medoids method
Partitioning Around Medoids (PAM) method is the one of the robust method of clustering.This method is developed by Kaufman and Rousseeuw (1987).K-means algorithm, attempts to minimize the average squared Euclidean distance which is very sensitive to outliers and not applicable for scales other than interval scales.We have decided in favor of the k-medoid method because it is more robust with respect to outliers and it does not only deal with interval-scaled measurements but also with general dissimilarity coefficients.Dissimilarity coefficients between objects may be obtained from the computation of distances, such as the Euclidean distance, Manhattan distance etc.The objective of this method is to find clusters, the objects of which show a high degree of similarity, while objects belonging to the different clusters are as dissimilar as possible.The first step of this method is to find the number of cluster K.After finding K, the K clusters are constructed by assigning each object of the data set to the nearest representative object.The best K is selected on the basis of average silhouette width.The silhouette width of an object is obtained by using the formula, (A-B)/max(A,B), Where B is the average distance of the object to all other objects within its cluster and A is the average distance of the object to all objects in its nearest neighboring cluster.The silhouette width lies between -1 to +1.Choose the value of K that maximizes the average silhouette width (Kaufman and Rousseeuw, 1990).

Data reduction capacity of SVD
C. Eckart and G. Young proved low rank approximation of SVD (1936).Singular value decomposition (SVD) has a wonderful data reduction capacity (both "R" and "Q" modes) with minimum recovery error.We can use SVD to perform PCA.By using SVD we can reduce both variables and observations of a data matrix.
Suppose X is m×n matrix of rank k ≤ min (m,n).Then by singular value decomposition we can write, Where U is the column orthonormal matrix whose columns are the eigen vectors of T XX , Λ is the diagonal matrix contain the singular values of X and V is the orthogonal matrix whose columns are the eigen vectors of X X T .Suppose we approximate X by X ~whose rank is l< k ≤ min (m,n).
By the singular value decomposition we can write we have Here the matrix l l U  contains the principal component scores, its first column represents the first PC, and second column represents the second PC and so on.Hence we see that X is a m×n matrix but l V X ~ is a m×l.If n represents no. of variables, it then reduces data by minimizing no. of variables.On the other hand reduces data by minimizing no. of observations.Both ways data reduction capacity is fully utilized in the biplot (Gabriel, 1971).

Robust singular value decomposition
The conventional approach to the SVD requires that the matrix X be complete.The calculation of SVD cannot be performed, if X has any missing elements.Gabriel and Zamir (1979) addressed this problem.To solve this problem and handle missing element, Hawkins D. M. , Liu L. and Young S. S ( 2001) proposed robust singular value decomposition (RSVD) on the basis of alternating L1 regression approach.The flowchart of the algorithm of alternating L 1 regression approach for calculating robust singular value decomposition are given in Figure 1.

Proposed method for clustering data
Cluster analysis identifies and classifies objects individuals or variables on the basis of the similarity of the characteristics they possess.It seeks to minimize within-group variance and maximize between-group variance.The result of cluster analysis is a number of heterogeneous groups with homogeneous contents.There are substantial differences between the groups, but the individuals within a single group are similar.Data may be thought of as points in a space where the axes correspond to the variables.Cluster analysis divides the space into regions characteristic of groups that it finds in the data.The objectives of cluster analysis are discovering types and reducing the number of cases by enabling consideration of several types instead of numerous records.Clusters may present in the data.After getting any data set we need to see whether any clusters exists or not in the data.For clustering data we have proposed an exploratory method by the help of SVD.SVD is also used for outliers detection.Since we know that the first few principal components (PC's) account most of the variation of data.So for graphical purpose we have used the first three PC's.The proposed methods for clustering data using first three PC's are given below First we construct the scatter plots of first two PC's, and first PC and third PC.We also make six lines in X-axis also in Y-axis by the following way median (1 st PC)  k × mad (1 st PC) in the X-axis and median (2 nd PC/3 rd PC)  k × mad (2 nd PC/3 rd PC) in the Y-axis.Where mad = median absolute deviation.The value of k = 1, 2, 3.
The position of a point with respect to the other points in the graph and the boxes made by six vertical (horizontal) lines offer us to a rough view of possible clusters.Note that median and mad are the robust location and scatter measure respectively.Our method will be more appropriate and meaningful if variables or observations are correlated.

Weather stations
In our standardized climate data if we apply our method by using SVD then it shows Figure 2. From Figure 2 we see that Cox's Bazar, Chittagong, Maijdi Court and Hatiya make a cluster, so we can say that the climatic behavior of these four stations is similar.The weather of Sylhet is totally different from other stations.The climatic nature of Patuakhali, Barisal, Bhola, Ishurdi and Kaptai is same because they make a cluster.Srimangal, Dinajpur, Rangpur, Sirajganj and Pabna make another cluster and the rest of the stations make a single cluster so their climatic nature is similar.
In our standardized climate data if we apply our method by using RSVD then it shows Figure 3. From the Figure 3 we see that Cox's Bazar, Chittagong, Maijdi Court and Hatiya make a cluster, so we can say that the climatic behavior of these four stations are similar.The weather of sylhet is totally different from other stations.The climatic nature of Patuakhali, Barisal, Bhola, Satkhira and Kaptai is same because they make a cluster.Dinajpur, Rangpur, Sirajganj, Ishurdi and Pabna make another cluster also the climatic behavior of Srimangal is similar to this cluster.The rest of the stations make a single cluster so their climatic nature is similar K-MEANS procedure.By the K-MEANS procedure for K = 5 we get, Cluster-1: Rajshahi, Narayangang, Brahmanbaria,Comilla, Chandpur, Jessore, Khulna, Feni and Rangamati.

PAM method.
In partitioning around medoids (PAM) method the first step is the selection of K.Where K is the no. of cluster.The average silhouette width is used for K selection.Choosing that K for which the average silhouette width is maximum.In our standardized climate data, The maximum average silhouette width is 0.30 for K = 5.The result of PAM method for K = 5 is, Cluster-1: Rajshahi, Narayanganj, Jessore, Khulna, Satkhira.

Cluster-5:Maijdi Court, Hatiya, Chittagong, Coxs Bazar
By the help of these methods we can say that the weather of south east area of Bangladesh is similar.The climatic behavior of north east area i.e., Sylhet is totally different from other stations.The climatic nature of the hem of south Bengal is similar.The stations of north Bengal shows the similar weather.The weather of rest of the stations is similar.

Climatic variables
In our Standardized climate data if we apply our method by using SVD then it shows Figure 4. Our climatic variables rf,vp,tmean,tmax,tmin,tday,tnight,ws,mps and sr are indicated by 1,2,3,4,5,6,7,8,9 and 10.From Figure 4 we see that the variable rainfall (rf) differ from all other variables and the rest of the variables form three clusters minimum temperature (tmin) form a cluster; Hours of bright sunshine as percentage of maximum possible sunshine hours (mps) and Solar radiation (sr) make another cluster and the rest of the variables construct another group.K-MEANS procedure is not applicable for variables.
In our Standardized climate data if we apply our method by using RSVD then the Figure 5. From Figure 5 we see that the variable rainfall (rf) differ from all other variables and the rest of the variables form three clusters, minimum temperature (tmin) form a cluster; Hours of bright sunshine as percentage of maximum possible sunshine hours (mps) and Solar radiation (sr) make another cluster and the rest of the variables construct another group.

Factor analysis based.
If the variables are uncorrelated in the population then the Factor Analysis (FA) does not useful.So we first measure the sample adequacy for our climate data using Bartlett test of sphericity and Kaiser-Meyer-Olkin (KMO) statistic.Table 2 represents the SPSS results of Bartlett test of sphericity and KMO statistic.
From Table 2 we see that the value of Bartlett test statistic is 483.963 and p-value is 0.00.Hence the null hypothesis that the variables are uncorrelated in the population is rejected i.e., the variables are not uncorrelated in the population.Again the value of KMO statistics is 0.635 which is greater than 0.50, which indicate FA is appropriate.Therefore both the Bartlett test of sphericity and KMO statistic supports to conduct FA in climate data.

Normality test.
Here we test the univariate normality by the Kolmogorov-Smirnov (KS) test statistic.Then the KS values and p-values are given in the Table 3. From Table 3 we see that the normality assumption of mps, sr and rf are rejected even at 1% level of significance i.e., mps, sr and rf are not univariate normal.Also the normality assumption of mps, tday, ws, sr and rf are rejected at 5% level of significance.Since all the variables are not univariate normal so our climate data is not multivariate normal.we know that if any data set is multivariate normal then all the variables must be univariate normal but converse is not true.Hence we can say that our climate data is nonnormal.Thus in this data set principal factor solution method of estimation is more appropriate than Maximum likelihood (MLE) method.Principal factor solution method is also denoted by PFA (principal factor analysis) method.PFA method is stable and less affected by the outliers than MLE method.
PFA method.Scree plot helps us to select the number of factors to be extracted.Figure 6 shows the scree plot of the climate data.From Figure 6 we can select four factors for our analysis of climate data.Here first four components explain about 94% of the total variation in the data.By using PFA method we get the Table 4 which describes the rotated component matrix.Also Figure 7 shows the component plot in the rotated space.From Table 4 and Figure 7 we can see that the variables vp, tmean, tmin, tday and tnight with high positive loadings forms a group for the first factor.The variables tmax and tday with high positive loadings but rf with high negative loadings for the second component.Also the variables mps and sr with high negative loadings for the third factor and ws with high positive loadings for the fourth factor.
By the help of these methods we can say that the characteristics of climatic variable rf is totally different from other climatic variables.The climatic variables tmin make a group.Also mps and sr make another group and rest of the climatic variables make a different cluster.

Evaluate our proposed method by Hawkins-Bradu-Kass (1984) data
Hawkins, Bradu and kass (Hawkins et al., 1984) constructed an artificial three-predictor data set containing 75 observations with 14 influential observations.Among them there are ten high leverage outliers (cases 1-10) and four high leverage points (cases 11-14) (Imon 2005).The data set is mainly well known for checking efficacy of regression diagnostic as well as robust measures.If we apply our method in this data then it shows the Figure 8. From Figure 8 we see that first two PC's show the observations 1-14 are outside our box, it also shows that there are three clusters present in the data.Observations 1-10 make 1st cluster, observations 11-14 make second cluster and the rest observations make third cluster.

Advantages our method
The advantages of our method from other existing methods are given below  It is easy to understand without hard mathematics.
 It can be applied to data for both supervised and unsupervised learning.
 It is directly applied to seperate different clusters.
 It can be applied in extremely complicated data sets without any extraneous assumption.
 It can be applied to locate outliers if they present in data.

Conclusion
Bangladesh is located in subtropical monsoon region.In context of Bangladesh, BANGLAPEDIA (National Encyclopedia of Bangladesh) states that Bangladesh has been divided into the following seven distinct climatic zones on the basis of mainly three climatic variables rainfall, temperature and winter dew.Actually it's an adhoc method that does not consider all the climatic variables and their correlations.On the other hand we have proposed a more scientific method and compared it with some existing clustering technique.The result of our proposed method is very much similar with the all of these methods.From our above discussion we can say that SVD is an important exploratory tool for clustering for both variables and cases.If one wishes, one may go forward for more rigorous SVD based / K-means/other methods of clustering after this exploratory graph By help of our proposed method we can clustering any data if clusters present.In climate data we see that there are five clusters present in the stations and also the four clusters present in the variables.In this article we have used S-PLUS, R, SPSS and LaTex software and its packages.For the second and subsequent of the SVD, we replaced X by a deflated matrix obtained by subtracting the most recently found them in the SVD ) and A. H. Kassam & H. T. van Velthuizen (Land resources and Agricultural Consultants) of UNDP (United Nations Development Programme).Nature.This data set contains the values of 10 climatic variables for 30 principal stations.For 30 principal stations, data sets of average values of annually means of the following parameters were collated from the unpublished and published records of the Bangladesh Meteorological Department (BMD).The climatic variables are-time temperature (T-DAY)0C 6. Night-time temperature (T-NIGHT)0C 7. Daily mean water vapor pressure (VP) MBAR 8. Daily mean wind speed (WS) m/sec 9. Hours of bright sunshine as percentage of maximum possible sunshine hours (MPS)% 10.Solar radiation (SR) cal/cm2/day (i) South -eastern zone (Chittagong sub-region) (ii) North-eastern zone (east and south Sylhet) (iii) Northern part of the northern region (Panchagarh, Lalmonirhat etc.) (iv) North-western zone (Rangpur , Dinajpur etc.) (v) Western zone (greater Rajshahi ) (vi) South-western zone (Jessore , Khulna etc.) (vii) South-central zone (Dhaka, Cumilla, Mymensingh etc. )

Figure 1 .
Figure 1.Flowchart of the algorithm of alternating L 1 regression approach for Robust singular value decomposition

Table 2 .
Results of Bartlett Test of Sphericity and KMO statistic

Table 3 .
Table for KS-value and p-value

Table 4 .
Table for rotated component matrix Calculate the resulting estimate of the left eigenvector u i =d/ ║d║ Iterate this process untill it converge.