Cascade-Correlation Algorithm with Trainable Activation Functions

Some base functions have the characteristic that their higher-order derivatives can be expressed in terms of the primitive function and its lower-order derivatives. Based on this characteristic, this paper proposes a cascade-correlation algorithm with tunable activation functions. The base function and its higher-order derivatives are used to construct the tunable activation functions in the cascade-correlation algorithm. Parallel and series schemes for constructing the activation functions are introduced. The model can simplify the neural network architecture, speed up convergence, and improve generalization. Its efficiency is demonstrated on the two-spiral classification problem and the Mackey-Glass time series prediction problem.


Introduction
Generalization is a critical capacity of feedforward neural networks. It is influenced by many factors (Simon Haykin, 1999; WEI Hai-Kun, 2001; Leonardo Franco, 2005). In recent years, many researchers have worked to improve the generalization ability of neural networks. Two popular techniques for avoiding overfitting are regularization (Burden F, & Winkler D, 2008; R. Felix, 2011) and early stopping (Xing-xing Wu, 2009). To obtain the smallest system that fits the data, pruning algorithms (Reed R. 1993; Md. Shahjahan, 2003) are widely used. Pravin Chandra and Yogesh Singh experimentally verified conjectures implying that the learning rate of FFANNs depends on the activation functions used at the hidden units (P. Chandra, 2004). Gao Daqi and Yang Genxing found that the learning abilities of multilayer feedforward neural networks depend on the types of activation functions (Gao Daqi, & Yang Genxing, 2003). Shuxing Xu and Ming Zhang showed that feedforward neural networks with their proposed neuron-adaptive activation function have several advantages over traditional feedforward networks with fixed neurons (ShuXing Xu, 2000). Researchers have therefore proposed many new activation functions in recent years (M. Solazzi, 2004; Chine-Cheng Yu, 2002). Wu and Zhao introduced a new kind of neural model with a trainable activation function (TAF) (WU You-Shou, & ZHAO Ming-Sheng, 2001). They incorporated certain degrees of freedom into the activation function and gave a special feasible domain for the TAF, but the TAF model still needs an effective and fast learning algorithm. Shen and Wang then presented a new multi-output neural model with tunable activation functions (YANJUN SHEN, & BINGWEN WANG, 2004). The new model simplifies the neural network architecture, improves its accuracy, and speeds up convergence. Simulation results show that it has better capability and performance than both the traditional multilayer feedforward neural network and the feedforward neural network with tunable activation functions.
Slow convergence, as in Back-Propagation (BP) for example, is another factor that leads to poor generalization of ANNs. Md. Asaduzzaman and Md. Shahjahan proposed "Fusion of Activation Functions" (FAF), in which different conventional activation functions (AFs) are combined to compute the final activation and make learning faster (MD.ASADUZZAMAN, & MD. SHAHJAHAN, 2009). In 1989, Scott E. Fahlman and Christian Lebiere explained why Back-Propagation learns so slowly and put forward a new architecture and supervised learning algorithm for neural networks called the Cascade-Correlation architecture (CC) (S.E. Fahlman, & C. Lebiere, 1989). In this architecture, hidden units are added to the network one at a time and do not change after they have been added. It learns very quickly and determines its own size and topology, but it generalizes poorly for function approximation. Qun Xu therefore combined the Cascade-Correlation algorithm with regularization theory (Qun XU, & Kenji NAKAYAMA, 1997) and obtained better generalization, especially for function approximation.
These ideas inspired the present work. The higher-order derivatives of the sigmoid activation function can be derived quickly and accurately from the primary function without further analysis or computation. Based on this characteristic, we constructed a new TAF model. Combining the traditional CC algorithm with the new TAF model, we obtained a new multilayer feedforward neural network, called the Cascade-Correlation algorithm with trainable activation function (CCTAF). The model simplifies the neural network architecture, speeds up convergence, and improves generalization, particularly for function approximation. Parallel and series schemes for constructing the activation functions are introduced in this paper.
This paper is organized as follows. Section 2 briefly discusses the CC algorithm. Section 3 gives a brief introduction to the TAF model. Section 4 elaborates the theory of CCTAF and the algorithm that realizes it. Section 5 presents two simulations of CCTAF. Section 6 concludes the paper.

Theory of CC algorithm
The Cascade-Correlation algorithm was proposed by Fahlman and Lebiere. The architecture begins with all inputs and one or more output units, and no hidden units. Every input is connected to every output unit by a connection with an adjustable weight. These connections are trained with the Widrow-Hoff ("delta") rule or a similar method to reduce the training error, for example the error $E$ defined in (2.1):
$$E = \frac{1}{2} \sum_{p} \sum_{o} \left( y_{op} - t_{op} \right)^2 \qquad (2.1)$$
where $y_{op}$ is the network output and $t_{op}$ is the expected output at output unit $o$ for pattern $p$. When the error stops changing and still does not meet the requirement, a new unit called a candidate node is added to the network. It receives a connection from each of the network's original inputs and from every pre-existing hidden unit. The weights connected to the new hidden unit are updated by gradient ascent in order to maximize the correlation magnitude $S$ in (2.2) between the output $V$ of the candidate node and the network output error:
$$S = \sum_{o} \left| \sum_{p} \left( V_p - \bar{V} \right) \left( E_{po} - \bar{E}_o \right) \right| \qquad (2.2)$$
where $o$ is the network output at which the error is measured and $p$ is the training pattern. The quantities $\bar{V}$ and $\bar{E}_o$ are the values of $V$ and $E_o$ averaged over all patterns. The hidden unit's input weights are frozen once the increase in the correlation falls below the expected value. The output of the added node, together with the earlier inputs, is then connected to the output units, and only the output connections are trained again.
The cascade architecture with sigmoid functions is illustrated in Figure 1. Since only one layer is trained at each training stage, CC requires no back-propagation of error signals and therefore learns very quickly. The network is also generated automatically, without its topology being defined beforehand. However, CC tends to construct a complex network containing some invalid nodes, and it generalizes poorly for function approximation.
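The correlation magnitude that a candidate unit maximizes, $S = \sum_o |\sum_p (V_p - \bar{V})(E_{po} - \bar{E}_o)|$ in the formulation of Fahlman and Lebiere, can be sketched in a few lines of NumPy. This is only an illustration: the function name `correlation_score` and the array layout (patterns along the first axis) are assumptions made here, not part of the original paper.

```python
import numpy as np

def correlation_score(V, E):
    """Correlation magnitude S between a candidate's output and the residual errors.

    V: shape (P,), candidate output for each of the P training patterns.
    E: shape (P, O), residual error at each of the O output units per pattern.
    (Hypothetical helper; array layout is an assumption of this sketch.)
    """
    V = np.asarray(V, dtype=float)
    E = np.asarray(E, dtype=float)
    v_centered = V - V.mean()            # V_p minus the mean of V over patterns
    e_centered = E - E.mean(axis=0)      # E_po minus the mean of E_o over patterns
    per_output = v_centered @ e_centered # sum over patterns, one value per output
    return float(np.abs(per_output).sum())

# Toy data: the candidate output tracks the error at both outputs.
V = np.array([0.0, 1.0, 0.0, 1.0])
E = np.array([[0.2, -0.1], [0.8, -0.7], [0.1, 0.0], [0.9, -0.8]])
S = correlation_score(V, E)
```

Because of the absolute value, a candidate that is strongly anti-correlated with an output error scores as well as one that is positively correlated; either can be exploited by the output weights.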

TAF model
The function $f(\cdot)$, with its tunable parameters, transforms the inner stimulation nonlinearly into the output of the TAF neuron. It is these parameters that enable the soma function to adapt to various problems. In the paper by Wu and Zhao (WU You-Shou, & ZHAO Ming-Sheng, 2001), the function $g(X, W)$ takes three forms, and three rules were given for selecting a feasible $f(\cdot)$. An approach for selecting the TAF was also proposed, in which the parameters are all tunable constants and $\varphi_m(X, W)$ is the $m$th basis function. Seven such forms were formulated.
Compared with the traditional multilayer feedforward neural network, the TAF model can handle many difficult problems easily, and it can simplify the network architecture. However, the lack of effective and fast learning algorithms limits its generalization.

Cascade-Correlation algorithm with trainable activation functions
Some functions, such as the sigmoid, Gaussian, and exponential functions, have the characteristic that their higher-order derivatives can be expressed in terms of the functions themselves. We call this property hereditability. The sigmoid function is used below to illustrate it. For the logistic sigmoid $f(x) = 1/(1 + e^{-x})$, for example, $f'(x) = f(x)\,(1 - f(x))$, so every higher-order derivative is again a polynomial in $f$. The derivatives can be normalized to $[0, 1]$ for easier observation, as shown in Figure 3 (b).
Two models are formulated: parallel connection and series connection. In the parallel connection, the sigmoid function and its normalized derivatives operate in parallel as candidate activation functions, and we install the candidate whose correlation score is the best. That is, the hidden unit takes as its activation function the $\varphi_i$ with the largest $S_i$, where $\varphi_i$ denotes the sigmoid function or one of its normalized derivatives and $S_i$ is the correlation between $E$ and $\varphi_i$. Parallel connection reduces the chance that useless units are installed permanently, and it accelerates training.
The series connection computes the weighted sum of the sigmoid function and its normalized derivatives, with tunable weight coefficients:
$$f(x) = \sum_{i} a_i \, \varphi_i(x)$$
where $a_i$ is the tunable coefficient. To maximize the correlation, we need to calculate the gradient of $S$ with respect to the candidate's parameters, for example
$$\frac{\partial S}{\partial w_i} = \sum_{p,o} \sigma_o \left( E_{po} - \bar{E}_o \right) f'_p \, I_{ip}$$
where $\sigma_o$ is the sign of the correlation term for output $o$ inside the magnitude in $S$, $f'_p$ is the derivative of the candidate's activation function with respect to its net input for pattern $p$, and $I_{ip}$ is the $i$th input of the candidate unit when pattern $p$ is presented. $W$ and $a$ are adjusted by Quick-Propagation.
The CCTAF procedure is as follows:
Step 1. Train the initial net without hidden units until the mean square error reaches a minimum or stops changing. If the performance is unsatisfactory, go to Step 2.
Step 2. Install a hidden candidate node with TAF. Connect it to the initial inputs and to the outputs of all existing hidden units.
Step 3. Update all weights connected to the candidate and adjust the tunable coefficients. When the correlation is maximized, add the hidden candidate unit to the network and freeze its weights.
Step 4. Connect the output of the added node, together with the earlier inputs, to the network output units.
Step 5. Train the output weights until E falls below the expected value or stops changing. Stop the whole process if the performance is satisfactory; otherwise return to Step 2.
In this way, hidden units are added one by one until the performance reaches the stopping criterion.
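The building blocks of the parallel and series schemes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the logistic sigmoid, uses its first three derivatives written in terms of the sigmoid itself, and rescales each derivative to $[0, 1]$ with its analytic minimum and maximum (the paper does not state how the normalization is done). The names `sigmoid_family`, `series_activation`, and `parallel_choice` are hypothetical.

```python
import numpy as np

def sigmoid_family(x):
    """Sigmoid and its first three derivatives, each rescaled to [0, 1].

    All derivatives are computed from f alone (the "hereditability" property).
    Normalization by the analytic range of each derivative is an assumption.
    """
    f = 1.0 / (1.0 + np.exp(-x))
    u = f * (1.0 - f)                 # f'   in [0, 1/4]
    d2 = u * (1.0 - 2.0 * f)          # f''  in [-sqrt(3)/18, sqrt(3)/18]
    d3 = u * (1.0 - 6.0 * u)          # f''' in [-1/8, 1/24]
    c2 = np.sqrt(3.0) / 18.0
    return np.stack([f, 4.0 * u, (d2 + c2) / (2.0 * c2), 6.0 * d3 + 0.75])

def series_activation(x, a):
    """Series connection: weighted sum of the family with tunable coefficients a."""
    return np.tensordot(np.asarray(a, dtype=float), sigmoid_family(x), axes=1)

def parallel_choice(scores):
    """Parallel connection: index of the candidate function with the best score."""
    return int(np.argmax(scores))

x = np.linspace(-10.0, 10.0, 2001)
family = sigmoid_family(x)
# Series unit mixing the sigmoid with its first and third normalized derivatives.
y = series_activation(x, [0.5, 0.3, 0.0, 0.2])
best = parallel_choice([0.41, 0.77, 0.52, 0.10])
```

In a full CCTAF implementation the coefficients `a` would be trained together with the candidate's input weights, and the scores passed to `parallel_choice` would be the correlation values of candidates that differ only in their activation function.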

Simulation results
Two simulations were conducted in this section to evaluate the generalization of CCTAF. The two-spiral problem tests the pattern classification ability of CCTAF. The Mackey-Glass time series prediction problem was used to illustrate its function approximation ability.

The Two-Spiral Problem
In this experiment, we adopt the two-spiral problem because it is extremely challenging. It has become a common benchmark for neural networks since it was first proposed by Alexis Wieland. The problem is to separate two classes of patterns lying on two intertwined spirals, which cannot be linearly separated. Wu reported obtaining a solution with a 2-10-1 TAF model after 5794 epochs (WU You-Shou, & ZHAO Ming-Sheng, 2001). Fahlman solved the problem with the Cascade-Correlation algorithm using a sigmoidal activation function for both the output and hidden units and a pool of 8 candidate units (S.E. Fahlman, & C. Lebiere, 1989). The number of hidden units varied from 12 to 19, with an average of 15.2.
The training set consists of 194 X-Y values, half of which are to produce a +1 output and half a 0 output. A further set of points composes the test samples. We use CCTAF with the sigmoidal activation function and its normalized derivatives (series connection) and a pool of 8 candidate units. We ran the problem 100 times, and every run succeeded. The number of hidden units varied from 11 to 15, with an average of 12.9. We also obtained better generalization, as shown in Figure 4 (b). It follows that CCTAF has strong learning and generalization ability for pattern classification.
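For readers who want to reproduce the data, the following sketch builds a 194-point two-spiral training set using the construction commonly distributed with this benchmark (97 points per spiral over three turns, the second spiral being the point reflection of the first). That the paper used exactly this construction is an assumption; the function name `two_spirals` is hypothetical.

```python
import numpy as np

def two_spirals(n_per_class=97):
    """Return (points, labels) for the two-spiral benchmark.

    points: shape (2 * n_per_class, 2); labels: 1 for the first spiral, 0 for the second.
    Radius and angle constants follow the commonly used benchmark generator
    (an assumption about the data used in the paper).
    """
    i = np.arange(n_per_class)
    angle = i * np.pi / 16.0
    radius = 6.5 * (104 - i) / 104.0
    x = radius * np.sin(angle)
    y = radius * np.cos(angle)
    first = np.column_stack([x, y])
    second = -first                      # second spiral mirrors the first through the origin
    points = np.concatenate([first, second])
    labels = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return points, labels

points, labels = two_spirals()
```

A denser test set can be produced in the same way by sampling intermediate angles, which is one way to visualize how well the learned decision regions follow the spirals.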

Mackey-Glass Chaotic Time Series Prediction
The Mackey-Glass chaotic time series is a complex nonlinear dynamical time series, first investigated by Mackey and Glass, that is difficult for many algorithms to approximate. It is recognized as a benchmark problem that has been used and reported by a number of researchers for comparing the learning and generalization ability of different models (Daijin Kim, & Chulhyun Kim, 1997; B. Samanta, 2011; Yusuf Oysal, 2005). The Mackey-Glass chaotic time series is generated by the following chaotic delay differential equation:
$$\frac{dx(t)}{dt} = \frac{a \, x(t - \tau)}{1 + x^{n}(t - \tau)} - b \, x(t)$$
where $a$, $b$, $\tau$, and $n$ are real numbers.
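A series of this kind can be generated numerically from the standard Mackey-Glass equation $dx/dt = a\,x(t-\tau)/(1 + x^{n}(t-\tau)) - b\,x(t)$. The sketch below uses simple Euler integration with a constant initial history. The parameter values ($a = 0.2$, $b = 0.1$, $\tau = 17$, $n = 10$, $x_0 = 1.2$), the step size, and the function name `mackey_glass` are assumptions chosen because they are common in the literature; they are not taken from this paper.

```python
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, tau=17.0, n=10, dt=0.1, x0=1.2):
    """Generate n_samples of the Mackey-Glass series, sampled once per time unit.

    Uses forward Euler with step dt and a constant history x(t) = x0 for t <= 0.
    All default values are assumptions, not settings reported in the paper.
    """
    delay = int(round(tau / dt))                # delay expressed in Euler steps
    steps_per_sample = int(round(1.0 / dt))     # Euler steps per output sample
    total = n_samples * steps_per_sample
    x = np.empty(total + delay + 1)
    x[: delay + 1] = x0                         # constant initial history
    for k in range(delay, total + delay):
        x_tau = x[k - delay]                    # delayed value x(t - tau)
        x[k + 1] = x[k] + dt * (a * x_tau / (1.0 + x_tau ** n) - b * x[k])
    return x[delay::steps_per_sample][:n_samples]

series = mackey_glass(500)
```

For prediction experiments, input vectors are typically formed from several past samples of `series` and the target is a sample some steps ahead; the specific lags used in the paper are not recoverable from the text.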

Conclusion
In this paper, CCTAF was presented, based on the traditional CC algorithm and the TAF model, in both parallel and series connection forms. The sigmoid function has the characteristic that its higher-order derivatives can be expressed in terms of the function itself. Accordingly, we constructed the TAF from the sigmoid and its higher-order derivatives for faster learning. The TAF was then applied as the activation function within the Cascade-Correlation architecture. The simulations show that the CCTAF model can solve problems with fewer hidden units and faster learning. The generalization of CCTAF on the two-spiral problem and on Mackey-Glass chaotic time series prediction is better than that of some other algorithms.

Figure 1. The cascade architecture with two outputs and two added hidden units

Figure 2. The architecture of the general TAF neuron model
The TAF model was put forward in 2001 (WU You-Shou, & ZHAO Ming-Sheng, 2001). It differs from other neural networks in its tunable activation functions. Its general form is given by (3.1). In (3.1), X is the input signal vector, W is a weight vector, a is a tunable parameter vector used to control the function of the TAF model, and the result is the output of the TAF neuron.
In the Mackey-Glass experiment, the testing samples lie farther from the training samples, so prediction becomes more difficult. Neural networks with CCTAF, in both parallel and series connection forms, were used in the simulations. In the parallel connection model, we used a pool of candidates consisting of the sigmoid function and its normalized first and second derivatives. In the series connection model, we used the sigmoid function and its normalized first and third derivatives. The generalization of CCTAF is better than that of some other models, as shown in Table 1. According to the table, the CCTAF model generalizes better for function approximation.