Improved SOM-Based High-Dimensional Data Visualization Algorithm

In this paper, a new high-dimensional data visualization algorithm based on the Self-Organizing Map (SOM) is proposed. It is named TDSOM (three-dimensional self-organizing map) to reflect its defining characteristic. TDSOM trains an SOM network on the high-dimensional data and projects each datum onto a set of points in a three-dimensional coordinate system. In this coordinate system, the x axis represents the attributes of the original data set, the y axis represents the weight of each attribute, and the z axis represents the categories of the mapping result. Most importantly, researchers can rotate the three-dimensional model, view it from different viewpoints and discover interesting patterns. The experiment shows that TDSOM is more accurate and more analytical than the traditional methods in displaying high-dimensional data. The main innovation of the TDSOM algorithm is the presentation of large data sets in a three-dimensional coordinate system, which provides a much wider view than a two-dimensional one. Moreover, users can discover patterns relevant to their own research areas through the model. The algorithm can be applied in areas such as data mining and pattern recognition.


Introduction
Data visualization, which is used to display multi-dimensional data graphically, has been widely applied in pattern recognition, image processing and other fields. Many algorithms are used in data visualization, such as scatter plot matrices (Yu, Zhou, & Zhang, 2007), which display the different combinations of data items in many sub-plots; parallel coordinates (Wegman, 1990), which rearrange the order of data items in two-dimensional space; stacked techniques (Feiner & Beshers, 1990), which embed one coordinate system in another; cone trees, which divide the n-dimensional data set into subspaces; and Chernoff faces (Soete & Corte, 1985) and star charts (Willis, 1992), which convert the multi-dimensional data set into specific graphics. These methods have contributed greatly to visualizing multi-dimensional data sets, but it becomes hard to find the characteristics of the data through them as the number of dimensions grows. To solve this problem, researchers put forward dimension reduction methods, which visualize data sets after projecting the high-dimensional data to a low-dimensional space. These methods effectively extract the distribution characteristics of a high-dimensional data set and reduce the complexity of the final visualization result. However, they place higher demands on the performance of the mapping algorithm. Representative ones include principal component analysis (Kambhatla & Leen, 1997), multidimensional scaling (Friedman & Tukey, 1974) and the self-organizing map (Kohonen, 1990; Vesanto, 1999). Researchers continue to improve these methods.
In 1982, Professor Kohonen from Finland proposed the neural network model named SOM. SOM is an unsupervised neural network that projects high-dimensional data to a low-dimensional space through competitive learning. Unsupervised means that no target output is specified while the network is being trained, and competitive learning is a selection rule of "survival of the fittest". SOM produces a two-dimensional graph by projecting the input space into the output space. In the graph, each input record is projected to the neural node with which it has the maximum similarity. The algorithm is also computationally efficient. However, traditional SOM algorithms usually visualize data on a fixed, regular grid, so the size of the network has to be designed in advance, which largely limits the presentation capability of the output graph. Moreover, SOM does not preserve the distance information between network nodes. Although this shortcoming can be compensated for by a color map based on the U-matrix, the structure and distribution of the data are usually presented in a distorted form. The dynamic self-organizing map (Alahakoon, Halgamuga, & Srinivasan, 2000), an improved SOM algorithm, generates a two-dimensional graph with a grid that grows dynamically, but its irregular network shape cannot visualize the final result appropriately. The visualization-induced self-organizing map (Yin, 2002) preserves the distance information of the data as well as the topological structure, but it cannot present the characteristics of different attributes. The polar self-organizing map (PolSOM) (L. Xu, Y. Xu, & Chow, 2010) and its improved version, the probabilistic polar self-organizing map (PPoSOM) (L. Xu, Y. Xu, & Chow, 2011), both based on SOM, were put forward recently. They visualize data in two-dimensional polar coordinates and map the radius and angle of the coordinates to the distance and characteristics of the data. PolSOM and PPoSOM present the distance differences between the network nodes and preserve the topology. They also use the properties of polar coordinates to present the value weights and feature characteristics effectively. However, as the number of records and the dimensionality of the data grow, these polar-coordinate methods may produce cluttered and confusing visualization results.
This paper proposes a new SOM-based algorithm for visualizing high-dimensional data, named the three-dimensional SOM, or TDSOM for short. TDSOM abandons the traditional model that projects data to two-dimensional coordinates and instead projects data to three-dimensional coordinates to enhance the effect of the visualization. TDSOM maps the three coordinate variables of the three-dimensional coordinate system to the attributes, values and categories of the data set.

SOM Algorithm
The early SOM algorithm mainly projects multi-dimensional data to a two-dimensional space. The two-dimensional plane consists of many neural nodes. Each node has a weight vector with the same dimension as the input data, and the nodes are distributed on a grid made up of rectangles or hexagons. In SOM, an input datum $x_i$ is represented as a d-dimensional feature vector $x_i = (x_{i1}, x_{i2}, \dots, x_{id})^T$, and each neural node j is represented by the corresponding weight vector $w_j = (w_{j1}, w_{j2}, \dots, w_{jd})^T$. At each training step, the algorithm selects one datum x randomly and computes the similarity between x and every node j. The similarity is measured by the Euclidean distance, and the node with the minimum distance is the winning one:
$$c = \arg\min_{j} \lVert x - w_j \rVert, \quad j = 1, 2, \dots, N \tag{1}$$
where N is the number of neurons.
After finding the best-matching node, the algorithm updates the weights of its neighbor nodes. A neighborhood function must be defined before updating them, and a Gaussian function is usually chosen:
$$h_{jc}(t) = \exp\left(-\frac{\lVert p_j - p_c \rVert^2}{2\,\delta(t)^2}\right), \quad j \in N_c \tag{2}$$
where $p_j$ and $p_c$ are the coordinates of neurons j and c, respectively, $N_c$ is the neighboring set of winning neuron c, $\lVert p_j - p_c \rVert$ is the distance between j and c, and $\delta(t)$ is the neighboring radius, which decreases monotonically with time.
The weight updating formula is:
$$w_j(t+1) = w_j(t) + \zeta(t)\, h_{jc}(t)\, [x(t) - w_j(t)] \tag{3}$$
where $\zeta(t)$ is the learning rate, which decreases monotonically with time.
After training, similar input data are projected onto adjacent neural nodes in the grid of the output space. Thus, SOM is able to preserve the topological relationships of the input space. However, because more than one datum may be projected onto the same neural node, the inter-point distance is not preserved. Moreover, because the layout of the output grid is fixed and regular, SOM is not able to display the inter-neuron distance.
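The SOM training step described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`find_winner`, `gaussian_neighborhood`, `som_step`) are hypothetical, and the Gaussian factor is applied to every node instead of being truncated to the neighboring set of the winner, which is a common simplification.

```python
import math

def find_winner(x, weights):
    """Return the index of the node whose weight vector is closest to x
    in Euclidean distance (the winning node)."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))

def gaussian_neighborhood(p_j, p_c, radius):
    """Gaussian of the squared grid distance between node j and winner c."""
    d2 = sum((a - b) ** 2 for a, b in zip(p_j, p_c))
    return math.exp(-d2 / (2.0 * radius ** 2))

def som_step(x, weights, positions, rate, radius):
    """One training step: find the winner, then pull each node toward x,
    scaled by the learning rate and the neighborhood value.
    Assumption: all nodes are updated; distant nodes get a factor near 0."""
    c = find_winner(x, weights)
    for j, w in enumerate(weights):
        h = gaussian_neighborhood(positions[j], positions[c], radius)
        weights[j] = [wi + rate * h * (xi - wi) for wi, xi in zip(w, x)]
    return c
```

In use, `weights` is a list of d-dimensional lists, `positions` holds the grid coordinates of each node, and both `rate` and `radius` would be decreased by the caller between steps.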

TDSOM Algorithm Principle
TDSOM is the algorithm newly proposed in this article to exhibit the characteristics of a high-dimensional input space. Unlike the traditional algorithms, its visualization model is constructed in a three-dimensional coordinate system, which offers a much wider view than two-dimensional space. After being processed by the algorithm, the data records are projected to specific coordinates in the three-dimensional space. Along the X axis, the three-dimensional model is divided horizontally into d bar areas (where d is the number of dimensions of the input data), and each x value represents a different attribute of the data set. Along the Y axis, the vertical value of a point represents the weight of the corresponding attribute. Along the Z axis, different values denote different classes of points. Each node of the neural network is represented by a vector of the same dimension as the input data, $w = (w_1, w_2, \dots, w_d)^T$. After selecting an input datum $x = (x_1, x_2, \dots, x_d)^T$ from the data set, the algorithm finds the winning neural node c according to the following formula:
$$c = \arg\min_{j} \lVert x - w_j \rVert, \quad j = 1, 2, \dots, N \tag{4}$$
TDSOM then updates the winning node c and its neighbor nodes with Equation (5), where $\eta_1$ and $\eta_2$ are the learning rates of the t-th step and $\eta_1 > \eta_2$. $\eta_1$ lies in the range [0, 0.3] and is the rate at which the winning node learns from the current training sample, while $\eta_2$ lies in the range [0, 0.1] and is the rate at which the neighborhood nodes learn from the current training sample. The learning rates affect convergence: as $\eta_1$ and $\eta_2$ increase, the training process speeds up.
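One plausible reading of the two-rate update described above is a piecewise rule in which the winning node moves toward the sample at rate $\eta_1$ and its neighbors at rate $\eta_2$. The form below is an assumption based only on the verbal description; the original formula may differ, for example by also including a neighborhood function. $N_c$ denotes the neighboring set of the winning node c.

```latex
w_j(t+1) =
\begin{cases}
w_j(t) + \eta_1(t)\,[x - w_j(t)], & j = c \\
w_j(t) + \eta_2(t)\,[x - w_j(t)], & j \in N_c,\ j \neq c \\
w_j(t), & \text{otherwise}
\end{cases}
```

Under this reading, the condition $\eta_1 > \eta_2$ ensures that the winning node always moves further toward the sample than any of its neighbors.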
Set the training period and train the network accordingly. During training, TDSOM records the number of times $W_j$ (j = 1, …, N) that each node is chosen as the winning one, and during the last training period the winning node $J_c$ of every input datum is recorded. After training, the U-matrix, which stores the distance information between the neural nodes in the network, is computed. The neural network is then visualized as a grayscale map generated from the U-matrix. In the result, similar neural nodes are closer to each other, so their colors are similar: the closer the nodes, the darker the color. Finally, some dark areas emerge in the plane, and each of them represents one cluster.
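The U-matrix and grayscale steps can be illustrated with the following sketch. It assumes a rectangular grid with 4-connected neighbors and takes each U-matrix entry to be the mean Euclidean distance from a node's weight vector to those of its grid neighbors; the paper does not specify these details, and the function names `u_matrix` and `to_gray` are hypothetical.

```python
import math

def u_matrix(weights):
    """weights[r][c] is the weight vector of the node at grid cell (r, c).
    Each entry of the result is the mean distance to the 4-connected
    neighbors of that node (assumed definition)."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u

def to_gray(u):
    """Scale U-matrix values to [0, 1]; 0 is darkest, so nodes that are
    close to their neighbors appear dark."""
    flat = [v for row in u for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in u]
    return [[(v - lo) / (hi - lo) for v in row] for row in u]
```

Dark regions of the resulting map correspond to groups of mutually similar nodes, which is how the clusters are read off in the text.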
From the information about which clustering area each input datum is projected to, the input data subset corresponding to each cluster can be recorded. The assignment follows Equation (6), where $J_c$ is the winning node of datum x, m is the number of clusters identified from the U-matrix and $C_k$ is the k-th cluster.
After the above processing, TDSOM projects each datum onto d points of the coordinate system, denoted $P_l$ (l = 1, 2, …, d), where d is the dimension of the datum x. The coordinates are given by Equations (7) to (9), where $P_{l,x}$, $P_{l,y}$ and $P_{l,z}$ are the x, y and z coordinates of point $P_l$.
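The projection of one datum onto d points can be sketched as follows. Since the coordinate formulas are not reproduced here, this sketch guesses them from the axis descriptions: the x coordinate is the attribute index l, the y coordinate is the (normalized) value of that attribute, and the z coordinate is the index of the cluster that contains the datum's winning node. The names `cluster_of` and `project_datum`, and the dictionary that maps node indices to cluster labels, are hypothetical.

```python
def cluster_of(winner, node_clusters):
    """Look up the cluster label of a datum from its winning node.
    node_clusters maps node index -> cluster label (assumed to come from
    reading the U-matrix)."""
    return node_clusters[winner]

def project_datum(x, cluster):
    """Return d points (attribute index, attribute value, cluster label),
    one per attribute of the datum x. Coordinates are an assumed reading
    of the axis descriptions, not the paper's exact equations."""
    return [(l, value, cluster) for l, value in enumerate(x, start=1)]
```

For example, a three-attribute datum `[0.2, 0.7, 0.5]` assigned to cluster 2 yields the points `(1, 0.2, 2)`, `(2, 0.7, 2)` and `(3, 0.5, 2)`.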
The overall steps of the TDSOM algorithm are as follows:
Step (1): Normalize the input data. Initialize the weights of all neural nodes randomly, and initialize their coordinates in the output space. Set the training period T.
Step (2): Randomly select an input datum and find its winning neural node by Equation (4).
Step (3): Update the weights of the winning node and its neighborhood set by Equation (5). Record the winning count $W_j$.
Step (4): If the training period reaches T, stop and go to Step (5). Otherwise, reduce the learning rates $\eta_1$, $\eta_2$ and the neighborhood radius $\delta(t)$, and go back to Step (2).
Step (5): Draw the contour map according to $W_j$, the number of records projected to the corresponding neural node, and draw the grayscale map according to the U-matrix. Obtain the clusters according to Equation (6).
Step (6): Compute the coordinates by Equation (7), Equation (8) and Equation (9) according to the clustering result and the winning neuron $x_w$, and draw the three-dimensional model.
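Steps (1) to (4) can be sketched as a short training loop. Several details are assumptions, since the text does not fix them: min-max normalization, a linear decay schedule for the learning rates and the radius, a rectangular grid, and a neighborhood defined as all nodes within the current radius of the winner. The names `normalize` and `train_tdsom` and their parameters are hypothetical.

```python
import math
import random

def normalize(data):
    """Step (1): min-max scale each attribute to [0, 1] (assumed scheme)."""
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in data]

def train_tdsom(data, rows, cols, periods, eta1=0.3, eta2=0.1, seed=0):
    """Steps (1)-(4): returns (weights, positions, wins), where wins[j]
    counts how often node j was the winning node."""
    rng = random.Random(seed)
    d = len(data[0])
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(d)] for _ in positions]
    wins = [0] * len(positions)
    radius0 = max(rows, cols) / 2.0
    for t in range(periods):
        decay = 1.0 - t / periods              # linear decay (assumption)
        e1, e2 = eta1 * decay, eta2 * decay    # keeps e1 > e2 at every step
        rad = max(radius0 * decay, 1.0)        # never below adjacent nodes
        x = rng.choice(data)                   # Step (2): random datum
        c = min(range(len(weights)),
                key=lambda j: math.dist(x, weights[j]))
        wins[c] += 1                           # Step (3): winning count
        for j, p in enumerate(positions):
            if j == c:
                rate = e1                      # winner learns faster
            elif math.dist(p, positions[c]) <= rad:
                rate = e2                      # neighbors learn slower
            else:
                continue                       # other nodes unchanged
            weights[j] = [w + rate * (xi - w)
                          for w, xi in zip(weights[j], x)]
    return weights, positions, wins
```

The returned `wins` list supplies the counts needed for the contour map in Step (5); computing the U-matrix and the three-dimensional coordinates would follow as separate steps.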
Through the training above, every input datum has a point set in the three-dimensional coordinate system. Data with similar attributes gather in the same group, and researchers can compare the distribution of every cluster's attributes and inspect their weights from different viewpoints. Because the final model is made up of point sets rather than neural nodes, the distances between points are preserved.

Experimental Results
The wine recognition data set, released in the UC Irvine Machine Learning Repository, is one of the most widely used data sets. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each type of wine. It is hard to represent the characteristics of a data set with so many attributes clearly through traditional data visualization methods, so the effectiveness of TDSOM in visualizing high-dimensional data can readily be examined with this data set. After training with TDSOM, the results are as follows. In Figure 1, most data are projected to three areas whose colors are darker than the edges, and they gather as three clusters. Figure 2 is the SOM U-matrix grayscale map, which is the traditional method of visualization. Although the edges of the clusters are not clear enough, this result also contains three clusters.
Figure 3 and Figure 4 are screenshots of the three-dimensional model taken from horizontal and vertical viewpoints.
In the two figures, the numbers 1 to 13 on the x axis represent the thirteen attributes of the wine recognition data set; the value of the y coordinate represents the weight of each attribute; and 1, 2 and 3 on the z axis represent three different areas.
In Figure 3, researchers can distinguish three different clusters projected from the original data set, and each layer contains the point set of its corresponding cluster. By rotating the three-dimensional model and observing Figure 4, researchers can easily find the distribution of the front attributes. For example, observing attribute 3 across the three clusters, researchers can easily see that, although the distribution of Cluster 1 is dispersed, its values are concentrated in the lower range, those of Cluster 2 in the middle range and those of Cluster 3 in the higher range.
According to the above illustration, researchers can learn the distribution of all the attributes of the different clusters by rotating the model and inspecting it from different viewpoints.
The experimental results show that TDSOM makes an obvious improvement in visualizing a high-dimensional data set and that the distribution of each attribute of the different clusters is represented effectively.

Conclusion
In this paper, a new improved mapping method, TDSOM, is proposed for the visualization and projection of high-dimensional data. The categories, attributes and weights of the data set are represented clearly, and the distribution differences between clusters are also obvious. Moreover, researchers can find the characteristics of each cluster. However, TDSOM can still be improved. During the experiment, one accurate model usually contains thousands of millions of points, which requires a high-performance computer system. With the development of computer technology, this problem may be solved in the near future. Finally, given its outstanding visualization results, TDSOM is worth applying widely in relevant areas.