The Design of Pre-Processing Multidimensional Data Based on Component Analysis



Introduction
Applications that rely on multidimensional data continue to grow, and techniques that support more efficient querying have become an important research issue. Such techniques are needed to access multimedia content, to explore data in areas such as health, population studies, and decision-making in education, and to analyse time series. Processes such as data pre-processing, data cleaning, data integration and transformation, and dimension reduction can be applied to improve the quality of the results.
Real data are often incomplete (Magnani et al., 2004), lacking attributes, noisy, containing outliers, and inconsistent, and therefore require pre-processing. Pre-processing of data aims to improve the algorithm (Orfanidis et al., 2008), accuracy, completeness, consistency, timeliness, added value, interpretability, and accessibility.
Pre-processing is the process of transforming data into a form that is simpler, more effective, and in accordance with user needs. More accurate results and shorter computation time can be used as indicators of its success. The data also become smaller without changing the information they contain. Some pre-processing methods select a subset from a large population sample of data, which is referred to as denoising. This is followed by normalization and feature extraction.
Dimension reduction is a fundamental problem in most data mining processes. It benefits not only computational efficiency but can also improve the accuracy of the analysis (Cunningham et al., 2007). Dimension reduction techniques are often used as part of pre-processing to overcome "the curse of dimensionality", in addition to simplifying the data model.
Dimension reduction techniques can be grouped into feature selection and feature extraction. Feature selection is the process of finding a subset of the original variables, with the aim of reducing the dimension and eliminating noise. It can improve the performance of data mining, including its speed and accuracy. In some cases, regression or classification analysis can be used to reduce the dimension, which yields a more accurate set of dimensions. Several algorithms have been proposed, such as ReliefF (Sikonja et al., 2003), Focus, Support Vector Machine Recursive Feature Elimination (SVM RFE), and Feature Subset Selection using Expectation Maximization (FSSEM).
Feature extraction is a technique for transforming high-dimensional data into a lower-dimensional space. Several supervised learning algorithms have been proposed, namely Linear Discriminant Analysis (LDA), Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), Latent Semantic Indexing (LSI), and Singular Value Decomposition (SVD). For unsupervised learning, algorithms such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), and FastICA (an extension of ICA) can be used as the basis of component analysis.
This paper is organized as follows. Section 2 presents related work. Section 3 presents the material and methods, followed by results and discussion in Section 4 and concluding remarks in Section 5.

Pre-processing Data
Research on pre-processing has been carried out and has produced commercial products. However, as the number of attributes and the volume of multidimensional data continue to grow, this research area still has potential to grow. The DB-H algorithm is one example of research on data pre-processing: it uses discretization to eliminate numerical attributes and generalization to eliminate symbolic attributes (Hu et al., 2003). Pre-processing and data transformation are often required before applying data mining to clinical data, namely compiling the data using information from the data itself (Lin et al., 2009). This process is used to ensure the reliability of the data used in data mining (Wahab et al., 2008).

Dimension Reduction
Dimension reduction methods are associated with regression, additive models, neural network models, and Hessian methods (Fodor et al., 2003). Local Dimension Reduction (LDR) looks for relationships in the dataset and reduces the dimensions of each one individually using a multidimensional index structure (Chakrabarti et al., 2000). Nonlinear algorithms give better performance than PCA for sound and image data (Kambhatla et al., 1994). A dimension reduction and texture classification scheme based on Principal Component Analysis (PCA) can be applied to a manifold statistical framework (Sang et al., 2007). The semantics of linear algebra is significantly simpler than that of Probabilistic Latent Semantic Analysis (PLSA) and LDA, while PLSA is much simpler than LDA (Chua et al., 2009).
In most applications, dimension reduction is performed as a pre-processing step (Ding et al., 2007), using traditional statistical methods that must process an increasing number of observations (Fodor et al., 2003). Dimension reduction creates a more effective domain characterization (Bi). Sufficient Dimension Reduction (SDR) is a generalization of nonlinear regression problems, in which the extraction of features is as important as the matrix factorization (Globerson et al., 2003), while Semi-Supervised Dimension Reduction (SSDR) is used to maintain the original structure of high-dimensional data (Zhang et al., 2008).

Normalization
An important aspect of data pre-processing is normalization (Shalabi et al., 2007). There are many methods of data normalization, e.g. min-max normalization, z-score normalization, and normalization by decimal scaling (Han et al., 1998). Min-max normalization performs a linear transformation on the original data. Suppose $\mathit{min}_A$ and $\mathit{max}_A$ are the minimum and maximum values of an attribute A. Min-max normalization maps a value $v$ of A to $v'$ in the range $[\mathit{new\_min}_A, \mathit{new\_max}_A]$ using the formula $v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$. In z-score normalization, the values of attribute A are normalized by the mean and standard deviation of A: $v' = \frac{v - \bar{A}}{\sigma_A}$, where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A. This method is useful if the minimum and maximum values of attribute A are unknown, or when there are outliers that dominate min-max normalization. Normalization by decimal scaling moves the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value $v$ of A is normalized to $v'$ by $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$.
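
The three normalization methods can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in this study: the function names are hypothetical, the z-score uses the population standard deviation, and decimal scaling restricts j to non-negative integers.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min_A, max_A] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for a constant attribute")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization is undefined for a constant attribute")
    return [(v - mean) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Illustrative attribute values ranging from -986 to 917: j = 3 is needed.
print(decimal_scaling([-986, 917]))  # [-0.986, 0.917]
```

Min-max normalization depends entirely on the two extreme values, which is why a single outlier can compress all other values into a narrow band; the z-score is less sensitive to this.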

Outlier
An outlier, often also interpreted as an anomaly, is a set of data considered to have different properties from the rest of the data. Outlier analysis is also known as anomaly analysis, anomaly detection, or deviation detection (the attribute values of an object differ significantly from those of the others). In the density-based approach, an outlier is a point located in an area of low density. To find outliers, we can use the formula $\mathrm{density}(x, k) = \left( \frac{\sum_{y \in N(x,k)} \mathrm{distance}(x, y)}{|N(x,k)|} \right)^{-1}$, where N(x, k) is the set containing the k nearest neighbours of x, y is a nearest neighbour of x, and |N(x, k)| is the number of members of the set N(x, k). The LOF (local outlier factor), meanwhile, is calculated by comparing the density of x with the densities of its k nearest neighbours.
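
The density-based idea can be sketched in Python as follows. This is an illustration under assumptions: density is taken to be the inverse of the average distance to the k nearest neighbours, and each point is scored by its density relative to the average density of its neighbours, a simplified relative of LOF rather than the exact LOF definition. The function names and sample points are hypothetical, and the points are assumed to be distinct.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbours."""
    neighbours = k_nearest(points, i, k)
    average = sum(math.dist(points[i], points[j]) for j in neighbours) / len(neighbours)
    return 1.0 / average


def relative_density(points, i, k):
    """Density of points[i] divided by the mean density of its k nearest neighbours."""
    neighbours = k_nearest(points, i, k)
    neighbour_mean = sum(density(points, j, k) for j in neighbours) / len(neighbours)
    return density(points, i, k) / neighbour_mean


# Four points on a unit square plus one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [relative_density(points, i, k=2) for i in range(len(points))]
# The lowest relative density marks the most likely outlier.
outlier_index = min(range(len(points)), key=lambda i: scores[i])
```

A relative density close to 1 means the point sits in a region about as dense as its neighbourhood; values well below 1 indicate a point in a sparse region.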

Material and Methods
The proposed design consists of five stages: data cleansing, data denoising, data extraction, data clustering and cluster modelling, and data visualization (Figure 1).

Data Cleansing
Most real data cannot be used directly because attribute values are missing or because the data contain only aggregates. Data can also be noisy because they contain errors or outliers, or inconsistent because of differences in coding or naming conventions. This can be solved by cleaning the data. Cleaning starts with centring, which reduces the data by the average of each attribute, using the formula $\hat{X} = X - \bar{X}$, where $\hat{X}$ is the result after centring, $X$ is a column vector, and $\bar{X}$ is the average of the corresponding column. Centring is applied to all columns in order; if a null value is found, it is replaced by the average value of that column. The result of centring is used to find the spread (scatter) of the data, and the scatter in turn is used to find the covariance. Centring is followed by a Gaussian function, hereinafter referred to as normalization.
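
The centring step with mean replacement of null values can be sketched as follows. The scatter and covariance formulas are not recoverable from the text, so the sketch assumes the usual definitions: scatter $S = \hat{X}^T \hat{X}$ with rows as observations, and covariance $S/(n-1)$. Function names are hypothetical and null values are represented by `None`.

```python
def column_means(rows):
    """Mean of each column, ignoring missing (None) entries."""
    means = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        means.append(sum(present) / len(present))
    return means


def centre(rows):
    """Replace None by the column mean, then subtract the column mean."""
    means = column_means(rows)
    return [[(m if v is None else v) - m for v, m in zip(row, means)] for row in rows]


def scatter(centred):
    """Assumed definition: S = X_hat^T X_hat, with rows as observations."""
    d = len(centred[0])
    return [[sum(r[a] * r[b] for r in centred) for b in range(d)] for a in range(d)]


def covariance(centred):
    """Assumed definition: sample covariance, S / (n - 1)."""
    n = len(centred)
    return [[s / (n - 1) for s in row] for row in scatter(centred)]


# Three observations of two attributes, one value missing (illustrative data).
rows = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
centred = centre(rows)
print(centred)              # [[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]]
print(covariance(centred))  # [[4.0, 4.0], [4.0, 4.0]]
```

Replacing a missing value by the column mean leaves that mean unchanged, so every centred column still averages to zero; the imputed entries simply become zeros.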

Data Denoising
After cleaning, the data are denoised using the binning method. This study uses the $D^k_n$ outlier detection method (Ramaswamy et al., 2000), where D = distance, k = nearest neighbour, and n = top n points. Outliers can be removed through a search by equal-size distance (Knorr et al., 1997). Under this method, the objects with the greatest distance to their k-th nearest neighbour are called outliers of the data set.
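
The distance-based outlier rule can be sketched in Python as follows. The sketch ranks every point by the distance to its k-th nearest neighbour and reports the top n points as outliers. It is a brute-force illustration with hypothetical function names and sample data, not the partition-based algorithm of the cited work.

```python
import math


def kth_neighbour_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (D^k)."""
    distances = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return distances[k - 1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest neighbour distance."""
    ranked = sorted(
        range(len(points)),
        key=lambda i: kth_neighbour_distance(points, i, k),
        reverse=True,
    )
    return ranked[:n]


def denoise(points, k, n):
    """Drop the top n outliers and keep the remaining points in order."""
    removed = set(top_n_outliers(points, k, n))
    return [p for i, p in enumerate(points) if i not in removed]


# Four points on a unit square plus one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(top_n_outliers(points, k=2, n=1))  # [4]
```

The brute-force ranking costs O(N^2) distance computations, which is acceptable for small datasets but motivates index or partition-based methods on larger ones.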

Data Extraction
Independent Component Analysis (ICA) was introduced by Jeanny Hérault and Christian Jutten in 1986 and later developed by Pierre Comon in 1994. FastICA is an extension of ICA based on a fixed-point iteration scheme for finding maximum nongaussianity (Hyvaerinen et al., 2000). It can also be derived as an approximate Newton iteration, using the following formula: $W^{+} = W + \mathrm{diag}(\alpha_i)\,[\mathrm{diag}(\beta_i) + E\{g(y)\,y^{T}\}]\,W$, where $y = Wx$, $\beta_i = -E\{y_i\,g(y_i)\}$, and $\alpha_i = -1/(\beta_i + E\{g'(y_i)\})$. The matrix W needs to be orthogonalized after each step.
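
A fixed-point FastICA iteration can be sketched with NumPy as follows. This is a minimal sketch under assumptions, not the RapidMiner operator used in this study: it uses the basic fixed-point update $w^{+} = E\{z\,g(w^{T}z)\} - E\{g'(w^{T}z)\}\,w$ on whitened data $z$, takes $g = \tanh$ as the nonlinearity, and orthogonalizes symmetrically after every step. All function names and parameter defaults are hypothetical.

```python
import numpy as np


def whiten(X):
    """Centre X (samples x features) and transform it to identity covariance."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvec @ np.diag(eigval ** -0.5) @ eigvec.T


def sym_orthogonalize(W):
    """W <- (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    eigval, eigvec = np.linalg.eigh(W @ W.T)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ W


def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """Return (estimated sources, unmixing matrix for the whitened data)."""
    Z = whiten(X)
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_orthogonalize(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        Y = Z @ W.T                # current estimates y = W z, one row per sample
        G = np.tanh(Y)             # g(y)
        G_prime = 1.0 - G ** 2     # g'(y)
        # Row i of the update: E{z g(w_i^T z)} - E{g'(w_i^T z)} w_i
        W_new = (G.T @ Z) / n - np.diag(G_prime.mean(axis=0)) @ W
        W_new = sym_orthogonalize(W_new)
        # Converged when every new row is parallel or antiparallel to the old one.
        change = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if change < tol:
            break
    return Z @ W.T, W


# Illustrative data: two uniform sources mixed by an arbitrary matrix.
rng = np.random.default_rng(1)
sources = rng.uniform(-1.0, 1.0, size=(2000, 2))
mixed = sources @ np.array([[1.0, 0.5], [0.3, 1.0]]).T
estimated, W = fast_ica(mixed)
```

Because the data are whitened first, the unmixing matrix only needs to be a rotation, so orthogonalizing after each step keeps the estimated components uncorrelated with unit variance. The components are recovered only up to order and sign.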

Data Clustering
This research implements Kernel K-Means clustering with k = 2 and a maximum of 100 optimization steps. A radial kernel is used, with kernel gamma = 1.0. For the cluster model, Expectation Maximization (EM) is used, with maximum runs = 5, maximum optimization steps = 100, quality = 1.0E-10, and the initial distribution obtained from k-means runs.
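
Kernel k-means with a radial kernel can be sketched in Python as follows, using the settings stated above (k = 2, gamma = 1.0, at most 100 optimization steps). The sketch is an illustration, not the RapidMiner operator: function names are hypothetical, and the round-robin initial assignment is an assumption made to keep the example deterministic.

```python
import math


def rbf(a, b, gamma=1.0):
    """Radial kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * math.dist(a, b) ** 2)


def kernel_k_means(points, k=2, gamma=1.0, max_steps=100):
    """Cluster points in kernel feature space; returns one label per point."""
    n = len(points)
    K = [[rbf(p, q, gamma) for q in points] for p in points]
    labels = [i % k for i in range(n)]  # assumed round-robin initialization
    for _ in range(max_steps):
        clusters = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Mean kernel value inside each cluster (squared norm of the centroid).
        within = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in clusters
        ]
        new_labels = []
        for i in range(n):
            distances = []
            for c, members in enumerate(clusters):
                if not members:
                    distances.append(math.inf)  # never assign to an empty cluster
                    continue
                cross = sum(K[i][j] for j in members) / len(members)
                # Squared feature-space distance from point i to centroid c.
                distances.append(K[i][i] - 2.0 * cross + within[c])
            new_labels.append(distances.index(min(distances)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Two small, well-separated groups (illustrative data).
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(kernel_k_means(points))  # [0, 0, 0, 1, 1, 1]
```

The centroids are never computed explicitly: distances to them are expressed entirely through kernel values, which is what lets the method separate clusters that are not linearly separable in the original space.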

Data Visualization
Most data mining research concerns predictive modelling (Kohavi et al., 2000), whose primary tasks are defining goals and measuring predictive performance on an independent test set. Data visualization is another important factor, so that users understand the result of applying the model. The visualization is presented as a 3-dimensional image.

Result and Discussion
This paper proposes a pre-processing model using component analysis, as shown in Figure 1. The model is then tested against three medical datasets (cancer datasets): the Wisconsin breast cancer, lung cancer, and prostate cancer datasets.
To compare the model results, we conducted two classification tests: one applying pre-processing, FastICA, and clustering, as shown in Figure 2, and one without pre-processing, as shown in Figure 3. The overall results of the experiments on the two models can be seen in Table 3. All experiments were completed except for the prostate cancer dataset without pre-processing, which had yet to produce a result after more than seven hours of testing.
Analysis of the performance vector for the clusters generated shows better values with pre-processing for all three datasets: from 0.990 to 0.993 for the Wisconsin breast cancer dataset, from 0.867 to 0.882 for the lung cancer dataset, and 0.991 for the prostate cancer dataset. In terms of processing time, the model with pre-processing also shows positive results for the three datasets: from 63 to 61 seconds for the Wisconsin breast cancer dataset, from 7 to 5 seconds for the lung cancer dataset, and 46 seconds for the prostate cancer dataset.
Cluster modelling with EM has also been carried out. The results with pre-processing are shown in Figure 4 for the Wisconsin breast cancer dataset, Figure 6 for the lung cancer dataset, and Figure 8 for the prostate cancer dataset. The results without pre-processing are shown in Figure 5 for the Wisconsin breast cancer dataset and Figure 7 for the lung cancer dataset, while for the prostate cancer dataset no result was found; the cluster models otherwise show no significant difference.

Conclusion
This paper aims to improve the quality of pre-processed data. We proposed a model for the design of pre-processing multidimensional data based on component analysis. RapidMiner is used for data pre-processing with the FastICA algorithm. Kernel K-Means is used to cluster the pre-processed data, and Expectation Maximization (EM) is used to model the clusters. The model was tested using the Wisconsin breast cancer, lung cancer, and prostate cancer datasets. The results show that the cluster performance vector value is higher and the processing time is shorter when pre-processing is applied.

Machine Learning, 53, p. 23-69, [doi> 10.1111/j.1467-8640.1990.tb00298.x]
Wahab, Mohd Helmy Abd, Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, Mohamad Farhan. (2008). Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm [doi>10.1.1.140.5102

Table 3. Experiment results