Prediction of Stock Market Index Movement by Ten Data Mining Techniques

This work is supported by Shanghai Leading Academic Discipline Project, Project Number: S30504.

Abstract
The ability to predict the direction of stock/index prices accurately is crucial for market dealers or investors to maximize their profits. Data mining techniques have been shown to generate high forecasting accuracy of stock price movement. Nowadays, instead of a single method, traders need to use various forecasting techniques to gain multiple signals and more information about the future of the markets. In this paper, ten different data mining techniques are discussed and applied to predict the price movement of the Hang Seng index of the Hong Kong stock market. The approaches include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), K-nearest neighbor classification, naïve Bayes based on kernel estimation, the logit model, tree-based classification, neural network, Bayesian classification with Gaussian process, support vector machine (SVM) and least squares support vector machine (LS-SVM). Experimental results show that the SVM and LS-SVM generate superior predictive performance compared with the other models. Specifically, SVM is better than LS-SVM for in-sample prediction, but LS-SVM is in turn better than SVM for out-of-sample forecasts in terms of the hit rate and error rate criteria.


Introduction
The financial market is a complex, nonstationary, noisy, chaotic, nonlinear and dynamic system, but it does not follow a random walk process (Lo & MacKinlay, 1988; Deng, 2006). Many factors may cause the fluctuation of financial market movement; the main ones include economic conditions, the political situation, traders' expectations, catastrophes and other unexpected events. Therefore, predicting the stock market price and its direction is quite difficult. In response to this difficulty, data mining (or machine learning) techniques have been introduced and applied to financial prediction. Most studies have focused on accurate forecasting of the value of the stock price. However, different investors adopt different trading strategies, so a forecasting model based on minimizing the error between the actual values and the forecasts may not suit all of them. Instead, accurate prediction of the movement direction of a stock index is crucial for making effective market trading strategies. Some recent studies have suggested that trading strategies guided by forecasts of the direction of stock price change may be more effective and generate higher profit. Specifically, investors could effectively hedge against potential market risk, and speculators as well as arbitrageurs could profit by trading the stock index whenever they can obtain an accurate prediction of the stock price direction.

That is why a number of studies have looked at the direction or trend of movement of various kinds of financial instruments (such as Wu & Zhang, 1997; O'Connor et al., 1997), but these studies do not use data mining based classification techniques. Data mining techniques have been applied to the prediction of the movement sign of a stock market index since the results of Leung et al. (2000) and Chen et al. (2001), where LDA, logit and probit models and neural networks were proposed and compared with a parametric model, the GMM-Kalman filter. Kim (2003) applied new and powerful data mining techniques, SVM and neural networks, to forecast the direction of a stock index based on economic indicators. To obtain more profit from the stock market, more and more "best" forecasting techniques are being used by different traders. Instead of a single method, traders need to use various forecasting techniques to gain multiple signals and more information about the future of the markets. Kumar & Thenmozhi (2006) compared five different approaches, namely SVM, random forest, neural network, logit and LDA, to predict Indian stock index movement based on economic variable indicators. In that comparison, the SVM outperformed the others in forecasting the S&P CNX NIFTY index direction, as the model does not require any a priori assumptions on the data properties and its algorithm yields a unique global optimal solution. Huang et al. (2005) also forecasted the movement direction of the Japanese stock market (NIKKEI 225 index) with various techniques, including SVM, LDA, QDA, neural network and an all-in-one combined approach. The SVM gave better predictive capability than the LDA, QDA and neural network models, second only to the combined model. In that study, the movement of the NIKKEI 225 index was defined in terms of two main factors: the American stock market (the S&P 500 index), which has the most influence on world stock markets including the Japanese market, and the currency exchange rate between the Japanese yen and the US dollar.
In our study, ten different data mining techniques are discussed and applied to predict the price movement of the Hang Seng index (HSI) of the Hong Kong stock market. The approaches include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), K-nearest neighbor classification, naïve Bayes based on kernel estimation, the logit model, tree-based classification, neural network, Bayesian classification with Gaussian process, support vector machine (SVM) and least squares support vector machine (LS-SVM). The main goal is to explore the predictive ability of the ten data mining techniques in forecasting the movement direction of the Hang Seng index based on five factors: its open price, high price, low price, the S&P 500 index, and the currency exchange rate between the HK dollar and the US dollar. The general model of stock price movement is defined as

$$D_t = f(O_t, H_t, L_t, SP_{t-1}, FX_{t-1}),$$

where $D_t$ is the direction of HSI movement at time $t$, defined as the categorical value "1" if the closing price at time $t$ is greater than the closing price at time $t-1$ and as "0" otherwise. The function $f(\cdot)$ can be linear or nonlinear and is estimated by the ten data mining algorithms. $O_t$, $H_t$ and $L_t$ are the open, high and low prices of the HSI, $SP_{t-1}$ is the closing price of the S&P 500 index at time $t-1$, and $FX_{t-1}$ is the currency exchange rate between the HK dollar and the US dollar at time $t-1$. All the inputs are transformed into log returns to remove any trend pattern.
The remainder of the paper is organized as follows. The next section describes the data and the prediction evaluation criteria, Section 3 briefly discusses the ten different algorithms, and the final section concludes.

Data and prediction evaluation
We examine the daily change of the closing prices of the Hang Seng index based on five predictors: open price, high price, low price, the S&P 500 index price, and the exchange rate of USD against HKD. The stock prices are downloaded from Yahoo Finance, and the foreign exchange rate is taken from the website of the Federal Reserve Bank of St. Louis. The sample period is from Jan 03, 2000 to Dec 29, 2006, so the whole sample consists of 1732 trading days. The data are divided into two sub-samples: the in-sample or training data span from Jan 03, 2000 to Dec 30, 2005, with 1481 trading days, while the whole year 2006, from Jan 1, 2006 to Dec 29, 2006, of size 250 trading days, is reserved for the out-of-sample or test data.

To measure the predictive performance of the different models, the hit rate and error rate are employed, defined as

$$\text{Hit rate} = \frac{1}{m}\sum_{i=1}^{m} I(A_i = P_i), \qquad \text{Error rate} = \frac{1}{m}\sum_{i=1}^{m} I(A_i \neq P_i),$$

where $A_i$ is the actual output for the $i$th trading day, $P_i$ is the predicted value for the $i$th trading day obtained from each model, and $m$ is the number of out-of-sample observations. R software with related packages is used to conduct the whole experiment; for example, Karatzoglou et al. (2004, 2006) illustrate the R commands for SVM, LS-SVM, and Bayesian classification with Gaussian processes.
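As a minimal sketch of how these criteria can be computed in R (our illustration; `actual` and `pred` are hypothetical vectors of observed and predicted 0/1 directions, reused in the sketches below):

# Hit rate and error rate over m out-of-sample days.
hit.rate   <- function(actual, pred) mean(actual == pred)
error.rate <- function(actual, pred) mean(actual != pred)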

Data-mining methods
The algorithms include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naïve Bayes based on kernel estimation, and the logit model. The other data mining techniques considered are K-nearest neighbor classification, tree-based classification, neural network, Bayesian classification with Gaussian process, support vector machine (SVM) and least squares support vector machine (LS-SVM). The last three models gain additional power from kernel based methods. For several of the classifiers below, the decision boundary between the two classes is a hyperplane in the feature space. A hyperplane in the $p$-dimensional input space is the set

$$H = \{x : \beta_0 + \beta^T x = 0\}. \qquad (1)$$

The two regions separated by the hyperplane are $\{x : \beta_0 + \beta^T x > 0\}$ and $\{x : \beta_0 + \beta^T x < 0\}$.

Linear Discriminant Analysis
The goal here is to obtain the class posteriors $\Pr(Y = k \mid X = x)$ for optimal classification. Suppose $f_k(x)$ is the class-conditional density of $X$ in class $Y = k$, and let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^{K}\pi_k = 1$. By Bayes' theorem,

$$\Pr(Y = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l},$$

and the classes are assumed to be Gaussian with a common covariance matrix $\Sigma_k = \Sigma$ for all $k$. Considering the log-ratio of the posteriors of two classes $k$ and $l$, which is a linear equation in $x$ defining a hyperplane of the form (1), the linear discriminant functions are obtained:

$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k. \qquad (2)$$

From the training data, we estimate the Gaussian distribution parameters as $\hat\pi_k = N_k/N$, $\hat\mu_k = \sum_{y_i = k} x_i / N_k$ and $\hat\Sigma = \sum_{k=1}^{K}\sum_{y_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T/(N - K)$, where $N_k$ is the number of class-$k$ observations; an observation is assigned to the class with the largest discriminant value.

From the experimental results, with input dimension $p = 5$, the five coefficients of the linear discriminant in (2) are: V1 = -0.79761825, V2 = 0.91216594, V3 = 0.74655315, V4 = -0.02241239, V5 = -1.85401191. The prior probabilities of groups 0 and 1 are 0.5037137 and 0.4962863, respectively. For the in-sample data, the hit rate is 0.8393 and the error rate is 0.1607, while for the out-of-sample data, the hit rate and error rate are 0.8440 and 0.1560, respectively.
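A sketch of the LDA step with MASS::lda is shown below; `train` and `test` are assumed data frames holding the five log-return inputs V1-V5 and the direction D as a factor with levels "0" and "1" (these names are our convention, not the paper's). The printed lda object reports group priors and discriminant coefficients in the same format as the values quoted above.

library(MASS)

fit.lda <- lda(D ~ V1 + V2 + V3 + V4 + V5, data = train)
fit.lda                                     # prints priors and coefficients
pred.in  <- predict(fit.lda)$class          # in-sample predicted classes
pred.out <- predict(fit.lda, newdata = test)$class
hit.rate(test$D, pred.out)                  # out-of-sample hit rate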

Quadratic discriminant analysis
For QDA, the covariance matrices $\Sigma_k$ are not assumed to be equal, and the quadratic discriminant functions are

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k.$$

The decision boundary between two classes $k$ and $l$ is the quadratic surface $\{x : \delta_k(x) = \delta_l(x)\}$. The estimates for QDA are similar to those for LDA, except that a separate covariance matrix must be estimated for each class, so QDA requires substantially more parameters. See McLachlan (1992) and Duda et al. (2000) for a comprehensive discussion of discriminant analysis. From the prediction results, the hit rate and error rate for the training data by QDA are 0.8305 and 0.1695, respectively. For the test data, the hit rate and error rate are 0.8480 and 0.1520, respectively.
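Since only the covariance assumption changes, the QDA fit mirrors the LDA call; a sketch with MASS::qda, on the same assumed train/test frames:

library(MASS)

fit.qda  <- qda(D ~ V1 + V2 + V3 + V4 + V5, data = train)  # class-specific covariances
pred.out <- predict(fit.qda, newdata = test)$class
hit.rate(test$D, pred.out)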

K-nearest neighborhood method
The K-nearest neighbor method is one of the simplest machine learning algorithms for classifying objects based on the closest training examples in the feature space. An object is classified by majority vote, being assigned to the class most common amongst its $k$ nearest neighbors. Formally, the K-nearest neighbor approach uses the training set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ and the observations closest in input space to $x$ to form $\hat Y$. Specifically, the K-nearest neighbor fit is

$$\hat Y(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i,$$

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample. That is, we find the $k$ observations with $x_i$ closest to $x$ in input space and average their responses; $k$ is chosen by cross-validation. The algorithm starts by determining the optimal $k$ based on the RMSE obtained by cross-validation, then calculates the distance between the query instance and all the training samples. After sorting the distances and determining the nearest neighbors based on the $k$-th minimum distance, it gathers the responses $Y$ of the nearest neighbors. Finally, the simple majority of the categories $Y$ of the nearest neighbors is used as the prediction for the query instance. Noticeably, the K-nearest neighbor approach does not rely on prior probabilities as LDA and QDA do.
Results: Table 1 displays the process of choosing $k$ by cross-validation in the experiment. We consider values up to $k = 30$ as the initial range and then select the optimal $k$ within it. The best $k = 10$ is obtained, corresponding to the smallest error (0.1708). [Insert Table 1 around here.] The performance results are as follows: for the training data, the hit rate is 0.8312 and the error rate is 0.1688, and for the test data, the hit rate is 0.7960 and the error rate is 0.2040.
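A k-NN sketch with class::knn on the assumed train/test frames; knn() takes the raw input matrices and training labels directly, with no fitted model object, and knn.cv() gives leave-one-out predictions on the training set for tuning k. Here k = 10 follows the cross-validation choice in Table 1.

library(class)

X.tr <- as.matrix(train[, c("V1", "V2", "V3", "V4", "V5")])
X.te <- as.matrix(test[,  c("V1", "V2", "V3", "V4", "V5")])
pred.out <- knn(X.tr, X.te, cl = train$D, k = 10)  # majority vote of 10 neighbors
hit.rate(test$D, pred.out)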

Naïve Bayes classification method
This is a well established Bayesian method primarily formulated for classification tasks. Given its simplicity, i.e., the assumption that the input variables are statistically independent, naïve Bayes models are effective classification tools that are easy to use and interpret. Naïve Bayes is particularly appropriate when the dimensionality of the feature space (i.e., the number of input variables) is high (a problem known as the curse of dimensionality). Mathematically, the naïve Bayes model requires the assumption that, given a class $Y = j$, the features $X_k$ are independent, so that

$$f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k).$$

The estimates of $f_{jk}(\cdot)$ are obtained from the training data via kernel smoothing. The naïve Bayes classification is

$$\hat Y = \arg\max_j\ \hat\pi_j \prod_{k=1}^{p}\hat f_{jk}(X_k),$$

where $\pi_j$ is estimated by the sample proportions. See Mitchell (2005) for a precise explanation. From the results obtained in the experiment, the hit rate and error rate for the in-sample data are 0.8386 and 0.1614, respectively. For the out-of-sample data, the hit rate is 0.8280 and the error rate is 0.1720.
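A sketch of kernel-based naïve Bayes via klaR::NaiveBayes (one of the packages covered by Karatzoglou et al., 2006); setting usekernel = TRUE replaces the default Gaussian class-conditional densities $f_{jk}$ with kernel density estimates, matching the description above. Frames as in the earlier sketches.

library(klaR)

fit.nb   <- NaiveBayes(D ~ V1 + V2 + V3 + V4 + V5, data = train,
                       usekernel = TRUE)    # kernel-smoothed densities per class
pred.out <- predict(fit.nb, newdata = test)$class
hit.rate(test$D, pred.out)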

Logit model
Logistic regression describes the relationship between a categorical response variable and a set of predictor variables. It can be used to predict a dependent variable on the basis of the independents and to determine the percent of variance in the dependent variable explained by them. Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable; in this way, it estimates the probability of a certain event occurring. Note that logistic regression calculates changes in the log odds of the dependent variable, not changes in the dependent variable itself. Following Friedman et al. (2008), the model for logistic regression with two output classes $Y$ is

$$\pi(x) = \Pr(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)},$$

where the $\beta$'s are obtained by the maximum likelihood approach. The logit is given by

$$\text{logit}(\pi(x)) = \log\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta^T x.$$

The curve of $\pi(x)$ is called a sigmoid because it is S-shaped and therefore nonlinear. Statisticians have chosen the logistic distribution to model binary data because of its flexibility and interpretability. The performance results show that the hit rate and error rate for the in-sample data are 0.8474 and 0.1526, respectively, while the hit rate and error rate for the out-of-sample data are 0.8560 and 0.1440, respectively.
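A logit sketch with base-R glm; classifying at a 0.5 probability cut-off is our assumption, since the paper does not state the threshold it used.

fit.logit <- glm(D ~ V1 + V2 + V3 + V4 + V5, data = train, family = binomial)
prob.out  <- predict(fit.logit, newdata = test, type = "response")  # P(Y = 1 | x)
pred.out  <- factor(ifelse(prob.out > 0.5, "1", "0"))
hit.rate(test$D, pred.out)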

Tree based classification
The classification tree is one of the main techniques used in data mining. Classification trees predict the membership of objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the available techniques have much in common with those used in the more traditional methods of discriminant analysis, cluster analysis, nonparametric statistics, and nonlinear estimation. The flexibility of classification trees makes them a very attractive analysis option, as they do not require distributional assumptions the way traditional statistical methods do.

Technically, tree-based methods partition the feature space into a set of rectangles and then fit a simple model in each one. Starting from the root of the tree, the feature space containing all examples is split recursively into subsets, usually two at a time. Each split depends on the value of only a single input variable; if the variable is categorical, the split is of the form $x_j \in A$, where $A$ is a subset of the possible categories. The goodness of a split is measured by an impurity function defined for each node; the basic idea is to choose a split such that the child nodes are purer than their parent node. Let $p(j \mid t)$ denote the estimated probability of class $j$ within node $t$, and define the impurity measure of node $t$ as $i(t) = \phi\big(p(1 \mid t), \ldots, p(J \mid t)\big)$ for an impurity function $\phi$; possible choices include the misclassification error, the Gini index and the cross-entropy. The goodness of a split $s$ for node $t$ is the decrease in impurity

$$\Delta i(s, t) = i(t) - p_R\, i(t_R) - p_L\, i(t_L),$$

where $p_R$ and $p_L$ are the proportions of the samples in node $t$ that go to the right node $t_R$ and the left node $t_L$, respectively. Splitting continues until the terminal subsets (leaf nodes) are 'pure', that is, until one class dominates, or until no split improves the impurity by more than a chosen threshold. It is not trivial to choose this threshold, as a poor choice leads to overfitting or underfitting problems for a new data prediction. To solve this problem, we adopt a pruning approach, the idea being to obtain a subtree of the initial large tree; one of the most popular techniques is cost-complexity pruning, as sketched in the code below. Let $T_{\max}$ be the initial large tree and let $T \subseteq T_{\max}$ be a pruned subtree. Then the cost-complexity measure is defined as

$$C_\alpha(T) = R(T) + \alpha|\tilde T|,$$

where $|\tilde T|$ denotes the number of leaf nodes, $R(T)$ is the error measure of the tree, and $\alpha \ge 0$ represents the trade-off between the cost of a tree and its complexity. The goal of cost-complexity pruning is, for each $\alpha$, to choose the subtree of $T_{\max}$ for which $C_\alpha(T)$ is minimized. The estimation of $\alpha$ is achieved by cross-validation: we choose the $\hat\alpha$ that minimizes the cross-validated sum of squares, and the final tree is $T_{\hat\alpha}$. The tree-based method is considered one of the top ten algorithms of data mining (Wu et al., 2008). We refer to Quinlan (1986) and Friedman et al. (2008) for a detailed discussion of tree-based classification, and to Pearson (2004) for an illustrative application of the tree method to predicting stock price behavior.
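A classification-tree sketch via rpart, which implements CART-style cost-complexity pruning: printcp() prints the complexity-parameter table (CP, number of splits, cross-validated xerror, as in Table 2) and prune() cuts the tree back at the chosen CP. We use the five V1-V5 inputs for simplicity; the paper's variable list also mentions V6.

library(rpart)

fit.tree <- rpart(D ~ V1 + V2 + V3 + V4 + V5, data = train, method = "class")
printcp(fit.tree)                             # CP table with xerror column
fit.pruned <- prune(fit.tree, cp = 0.0036281) # CP at the smallest xerror
pred.out   <- predict(fit.pruned, newdata = test, type = "class")
hit.rate(test$D, pred.out)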

Experimental results:
Denote the variables actually used in tree construction by V1 = Open price, V2 = Low price, V3 = High price, V4 = S&P 500, V5 = FX, V6 = Close price. The class "1" is when the next price is larger than the previous price, while the class "0" refers to when the next price is smaller than the previous price. Table 2 shows the process of pruning the tree.
From Table 2, we choose CP = 0.0036281, corresponding to 17 splits, based on the smallest value of the cross-validated error (x-error = 0.40272). The initial large tree is not displayed to save space, but the pruned tree is illustrated in the appendix. The performance results show that the hit rate is 0.8717 and the error rate is 0.1283 for the training data, while for the test data the hit rate is 0.8000 and the error rate is 0.2000.

Neural network for classification
Haykin (1994) defines a neural network as a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: (1) knowledge is acquired by the network through a learning process, and (2) interneuron connection strengths, known as synaptic weights, are used to store the knowledge. The literature on neural networks is enormous, and their applications spread over many scientific areas; see Bishop (1995) and Ripley (1996) for details. Recently, neural networks have become well known for their good capability in forecasting stock market movement. Let $X$ be the input vector and $Y$ the categorical output. Following the notation in Hastie (1996), the neural network model with one hidden layer can be represented as

$$z_m = \sigma(\alpha_{0m} + \alpha_m^T x), \quad m = 1, \ldots, M, \qquad f_k(x) = g_k(\beta_{0k} + \beta_k^T z),$$

where $\sigma(v) = 1/(1 + e^{-v})$ is the sigmoid activation function and the parameters $\alpha$ and $\beta$ are known as weights. To learn the neural network, back-propagation is used. Suppose we use least squares on a sample of training data to learn the weights, with error function

$$R(\theta) = \sum_{i=1}^{N}\sum_{k=1}^{K}\big(y_k^i - f_k(x^i)\big)^2,$$

where the $i$th training observation is denoted by the superscript $i$.

The gradient update at the $(r+1)$st iteration is

$$\theta^{(r+1)} = \theta^{(r)} - \gamma_r \left.\frac{\partial R}{\partial \theta}\right|_{\theta^{(r)}},$$

where $\gamma_r$ is the learning rate.
In the experimental analysis, the prediction performance gives a hit rate of 0.8481 and an error rate of 0.1519 for the in-sample data. For the out-of-sample data, the hit rate and error rate are 0.8520 and 0.1480, respectively.
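A single-hidden-layer network sketch via nnet (Venables & Ripley); the size and decay values here are illustrative assumptions, since the paper does not report its architecture or regularization settings.

library(nnet)

set.seed(1)                                  # weights are randomly initialized
fit.nn   <- nnet(D ~ V1 + V2 + V3 + V4 + V5, data = train,
                 size = 5,                   # M = 5 hidden units (assumed)
                 decay = 0.01, maxit = 500)  # weight decay, iteration cap (assumed)
pred.out <- predict(fit.nn, newdata = test, type = "class")
hit.rate(test$D, pred.out)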

Bayesian classification for Gaussian process
Gaussian processes are based on the prior assumption that adjacent observations should convey information about each other. In particular, it is assumed that the observed variables are normally distributed and that the coupling between them is captured by the covariance matrix of a normal distribution. Using the kernel matrix as the covariance matrix is a convenient way of extending Bayesian modeling of linear estimators to nonlinear situations.

In the regression problem, the goal is to predict a real-valued output based on a set of input variables, and it is possible to carry out nonparametric regression using a Gaussian process. With a Gaussian prior and a Gaussian noise model, the solution of the regression problem can be obtained via a kernel function placed on each training point, with the coefficients determined by solving a linear system. If the parameters indexing the Gaussian process are unknown, Bayesian inference can be carried out for them. Gaussian processes can be extended to classification problems by defining a Gaussian process over $y(x)$, the input to the sigmoid function. The idea is to place the Gaussian prior on $y(x)$ and combine it with the training data to obtain predictions for new points $x$. A Bayesian treatment is imposed by integrating over the uncertainty in $y$ and in the parameters that control the Gaussian prior. The distribution for $y_*$ at a new point is then found by marginalizing the product of the prior and the noise model; the integral terms are estimated by Laplace's approximation. Williams et al. (1999) and Rasmussen et al. (2006) provide a comprehensive and detailed discussion of Bayesian classification with Gaussian processes.
The training summary reported by the software is: problem type: classification; Gaussian radial basis kernel hyperparameter sigma = 0.41497; number of training instances learned: 1481; training error: 0.23876; cross-validation error: 0.16476. For the in-sample data, the hit rate is 0.8595 and the error rate is 0.1405; for the out-of-sample data, the hit rate is 0.8520 and the error rate is 0.1480.
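A GP-classification sketch via kernlab::gausspr, the function illustrated in Karatzoglou et al. for Bayesian classification with Gaussian processes; the RBF kernel width can be left to the built-in automatic estimate (the training summary above reports the sigma it found).

library(kernlab)

fit.gp   <- gausspr(D ~ V1 + V2 + V3 + V4 + V5, data = train,
                    kernel = "rbfdot", cross = 10)  # RBF kernel, 10-fold CV error
fit.gp                                   # prints sigma, training and CV errors
pred.out <- predict(fit.gp, newdata = test)
hit.rate(test$D, pred.out)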

Support vector machine for classification
A popular machine learning algorithm related to neural networks is the support vector machine (SVM), developed by Vapnik (1995). The SVM is a kernel based learning approach, like the Gaussian processes above; however, the SVM does not require assumptions on the data properties the way Gaussian processes do. The SVM has been successfully applied in various areas of prediction, for instance in financial time series forecasting (Mukherjee et al., 1997; Tay and Cao, 2001), marketing (Ben-David and Lindenbaum, 1997), estimating manufacturing yields (Stoneking, 1999), text categorization (Joachims, 2002), face detection in images (Osuna et al., 1997), handwritten digit recognition (Burges and Schölkopf, 1997; Cortes and Vapnik, 1995) and medical diagnosis (Tarassenko et al., 1995).
The SVM formulation can be stated as follows. Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$ with input data $x_i \in \mathbb{R}^p$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM algorithm seeks the separating hyperplane with the largest margin. The problem can be formulated as:

$$\min_{w, b}\ \tfrac{1}{2}\|w\|^2 \qquad (3)$$
$$\text{subject to } y_i(w^T x_i + b) \ge 1, \quad i = 1, \ldots, N. \qquad (4)$$

The standard method to solve problem (3)-(4) is convex programming, where the Lagrange method is applied to pass from the primal to the dual optimization problem. Specifically, we first construct the Lagrangian

$$L_P = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i\big[y_i(w^T x_i + b) - 1\big], \qquad (5)$$

where the $\alpha_i$ are nonnegative Lagrange multipliers corresponding to (4). The solution is achieved at a saddle point of the Lagrangian, which has to be minimized with respect to $w$ and $b$ and maximized with respect to the $\alpha_i$. Differentiating (5) and setting the results equal to zero,

$$\frac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0, \qquad (6)$$
$$\frac{\partial L_P}{\partial b} = \sum_i \alpha_i y_i = 0. \qquad (7)$$

The optimal solution is obtained from (6) as

$$w^* = \sum_i \alpha_i^* y_i x_i, \qquad (8)$$

where $*$ denotes optimal values. Now substituting (8) and (7) into (5) gives the dual objective.

The dual problem is posed as a quadratic program:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j x_i^T x_j \qquad (9)$$
$$\text{subject to } \sum_{i=1}^{N}\alpha_i y_i = 0, \quad \alpha_i \ge 0. \qquad (10)$$

The optimality conditions imply that $\alpha_i > 0$ only when constraint (4) is active; the vectors for which $\alpha_i > 0$ are called support vectors.
From (10), we obtain $b^* = y_i - w^{*T} x_i$ for any support vector $x_i$. By linearity of the inner product and (8), the decision function for the linearly separable case is

$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i^* y_i\, x_i^T x + b^*\Big). \qquad (11)$$

For the linearly non-separable case, we introduce a new set of slack variables $\xi_i \ge 0$ that measure the amount of violation of the constraints. Thus (3) and (4) are modified as

$$\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i^k \qquad (12)$$
$$\text{subject to } y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad (13)$$

where $C$ and $k$ are predetermined parameters defining the cost of constraint violation.
The Lagrangian is constructed as

$$L_P = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i^k - \sum_i \alpha_i\big[y_i(w^T x_i + b) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i, \qquad (14)$$

where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers associated with constraints (12) and (13), respectively. The solution to this problem is determined by minimizing $L_P$ with respect to $w$, $\xi$ and $b$ and maximizing it with respect to $\alpha_i$ and $\mu_i$. Differentiating (14), setting the derivatives equal to zero (15)-(18) and substituting back into (14) leads to the dual problem

$$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\, x_i^T x_j \quad \text{subject to } \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C.$$

Hence the classifier is $f(x) = \mathrm{sign}\big(\sum_i \alpha_i^* y_i\, x_i^T x + b^*\big)$. To avoid the explicit calculation of scalar products in a high-dimensional feature space, we introduce the kernel function $K(x_i, x_j) = \varphi(x_i)^T\varphi(x_j)$. In this work, the Gaussian kernel, or RBF (radial basis function), is used, as it tends to give good performance under general smoothness assumptions; it is defined as

$$K(x, x') = \exp\big(-\|x - x'\|^2 / (2\sigma^2)\big),$$

and the hyperparameters $(C, \sigma)$ are tuned to avoid the overfitting problem. By applying the above SVM algorithm to the data under study, the training results are obtained. Table 3 illustrates the cross-validation error corresponding to the tuning parameters $(C, \sigma)$. From Table 3, we obtain the best hyperparameters $C = 2^4$ and $\sigma = 2^4$ (the Gaussian kernel function parameter), with the smallest ten-fold cross-validation error of 0.000006. Considering the in-sample data, the hit rate is 1.000 and the error rate is 0.000, but for the out-of-sample data, the hit rate is 0.860 and the error rate is 0.140.
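A C-SVC sketch via kernlab::ksvm; C = 2^4 follows our reading of Table 3, and since kernlab parameterizes the RBF kernel as exp(-sigma * |x - x'|^2), the kpar value below is an assumed translation of the paper's kernel parameter rather than a confirmed setting.

library(kernlab)

fit.svm  <- ksvm(D ~ V1 + V2 + V3 + V4 + V5, data = train,
                 type = "C-svc", kernel = "rbfdot",
                 kpar = list(sigma = 2^-4),  # assumed mapping of the kernel width
                 C = 2^4, cross = 10)
cross(fit.svm)                               # ten-fold cross-validation error
pred.out <- predict(fit.svm, newdata = test)
hit.rate(test$D, pred.out)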

Least square support vector machine for classification
LS-SVM is a modified version of the SVM introduced by Suykens et al. (1999). LS-SVM uses a least squares loss function to obtain a set of linear equations in the dual space, so that learning is faster and the complexity of the convex programming in the SVM is relaxed. In addition, the LS-SVM avoids the drawback faced by the SVM of selecting the trade-off parameters $(C, k, \sigma^2)$; instead, the LS-SVM requires only two hyperparameters $(\gamma, \sigma^2)$ to train the model. According to Suykens et al. (2001), the equality constraints of the LS-SVM allow it to act as a recurrent neural network and to be applied in nonlinear optimal control. Due to these nice properties, the LS-SVM has been successfully applied to classification and regression problems, including time series forecasting; further applications can be found in Van Gestel et al. (2004). The LS-SVM classifier is formulated as

$$\min_{w, b, e}\ \tfrac{1}{2}w^T w + \tfrac{\gamma}{2}\sum_{i=1}^{N} e_i^2 \quad \text{subject to } y_i\big(w^T\varphi(x_i) + b\big) = 1 - e_i, \quad i = 1, \ldots, N.$$

This formulation consists of equality instead of inequality constraints and takes into account a squared error with a regularization term, similar to ridge regression.
The solution is obtained after constructing the Lagrangian

$$L = \tfrac{1}{2}w^T w + \tfrac{\gamma}{2}\sum_i e_i^2 - \sum_i \alpha_i\big[y_i\big(w^T\varphi(x_i) + b\big) - 1 + e_i\big],$$

where the $\alpha_i$ are Lagrange multipliers that can be positive or negative in the LS-SVM formulation. From the conditions for optimality, one obtains the Karush-Kuhn-Tucker (KKT) system, which is linear and can be solved directly for $\alpha$ and $b$. The experimental results show that the obtained hyperparameters are gamma = 0.50691, chosen from the range [0.04978707, 148.4132], and sigma squared = 8.7544, selected from the range [0.082085, 12.1825]. The cost of ten-fold cross-validation is 0.033291. From the forecasting performance, the in-sample hit rate is 0.8528 and the error rate is 0.1472; for the out-of-sample data, the hit rate is 0.8640 and the error rate is 0.1360.
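An LS-SVM sketch via kernlab::lssvm; note that lssvm's regularization argument is tau rather than gamma, so the mapping from the paper's (gamma, sigma^2) to tau and kernlab's inverse-width sigma below is our assumption, not a documented equivalence.

library(kernlab)

fit.ls   <- lssvm(D ~ V1 + V2 + V3 + V4 + V5, data = train,
                  kernel = "rbfdot",
                  kpar = list(sigma = 1 / (2 * 8.7544)),  # assumed width mapping
                  tau = 1 / 0.50691)                      # assumed gamma mapping
pred.out <- predict(fit.ls, newdata = test)
hit.rate(test$D, pred.out)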
To sum up, Table 4 below summarizes the prediction performances of the ten different approaches. From Table 4, we can see that almost all the algorithms generate high hit rates (more than 80%) and low error rates (less than 20%). By comparison, the LS-SVM ranks first, as it outperforms the other models out-of-sample even though it is not better than the SVM for in-sample prediction. The superior performance of the SVM models in this study also supports the results in the literature. Bayesian classification with Gaussian process produces predictions as good as the neural network, both following the SVM and LS-SVM. The K-nearest neighbor approach gives the worst predictive ability, with tree classification the next weakest.

Conclusion
In this paper, we apply ten different data mining techniques to forecast the movement direction of the Hang Seng index of the Hong Kong stock market. All algorithms produce good predictions, with hit rates of more than 80%. The LS-SVM and SVM outperform the other models: theoretically, they do not require any a priori assumptions on the data properties, and their algorithms are guaranteed to obtain efficiently a global optimal solution that is unique. The other models may still be reliable for other markets, especially when the data satisfy their respective assumptions. As can be seen in Figures 1-3 in the appendix, different stock prices behave differently. Therefore, all the approaches are recommended to forecasters of stock index movement, with the better models, SVM and LS-SVM, being preferred.

Appendices
Figure 1 displays the actual movement of HSI closing prices for the whole sample. Figure 2 plots the S&P 500 price and its log return. Figure 3 plots the price and log return of the exchange rate of HKD against USD.


Table 1. Selection of k by cross-validation technique. The table displays the mean square error and the corresponding value of k; the optimal k is 10, with the smallest MSE of 0.1708.

Table 3. Hyperparameter selection based on ten-fold cross-validation for the SVM.

Table 4. Summary of the prediction performances of the ten different approaches.