Assessment of Risk of Rainfall Events with a Hybrid of ARFIMA-GARCH

Hazardous situations related to rainfall events can be due to very intense rainfall, or to the persistence of rainfall over a long period of time. Such events may result to an exceedence of the capacity of drainage systems resulting in the heap of basements which may lead to landslides and flooding. This study assesses the persistence dependence of rainfall time series of Chui Chak, a station in Peninsular Malaysia that observed the highest rainfall event for the period 01/01/1975-31/12/2008. The persistence dependence of the rainfall time series was modelled via fractional ARIMA model augmented with the GARCH model. The Ljung-Box test for testing autocorrelation proves that the combined ARFIMA-GARCH model captures the temporal persistence behaviour in the Chui Chak rainfall time series data with persistence measure 0.839. This measure represents a relatively lasting persistence, that is, the process variability should return to the historical average after a relatively long period of time which may have a risk of extreme event.


Introduction
Conventionally, rainfall has been considered as a random process with several peculiarities, mostly associated to intermittency and non-Gaussian behaviour (Papalexiou et al., 2011).Hazardous situations related to rainfall events can be due to very intense rainfall, or to the persistence of rainfall over a long period of time.Such events may give rise to an exceedence of the capacity of sewer systems resulting in the inundation of roads and basements which may lead to landslides or flooding.To prevent such hazards, an analysis of hydrological and climatological records may aid in modelling by giving enhanced understanding of the processes that might derived the extremes.
Persistence dependence is generally recognized in hydrological, climatological and financial time series over a varied range of time scales.Rainfall analysis have a substantial role in the successful planning, development and implementation of water resource management to evaluate engineering works and environmental problems such as hydropower generation, reservoir operation, flood control and water quality control.Henceforth, an efficient study of rainfall temporal behaviour is the critical mission in hydrology (Modarres & Ouarda, 2013).Most of the traditional methods for measuring the risk related with behaviour of the data set are done through study of the variance or volatility.Nevertheless, the study of the variance or volatility does not give explanation that there might be a predictable component in the data.
Stochastic models are designed to replicate the imperative behaviour or patterns evident in the data set grounded on the current knowledge of the process.The application of stochastic models to hydrology and water resources management is mainly for two important explanations.Firstly, stochastic models provide a basis for decisions making.Secondly, stochastic models also increase the understanding of physical phenomena.This is the crucial goal of the physical sciences.A worthy modelling framework will extract previously unknown insight from the data so as to plan for the future.
The dependence structure of rainfall data is typically very complex both in space and time both in time.Daily rainfall data sets from dense networks of rain-gauge stations in peninsular Malaysia are commonly analysed in this paper using ordinary Kriging technique to identify the location that realized highest rainfall for the period under study.Very high rainfall events have a substantial effect on society and may lead to loss of life and property.To conduct an in-depth study on the temporal dependence the daily rainfall series of the identified location with highest rainfall event would be very vital.
Simple linear or nonlinear stochastic models may not always be appropriate in modelling the temporal dependence in rainfall time series.Certainly, rainfall time series data usually have both linear and nonlinear patterns.To adequately fit these chaotic time series, other more complicated models that have the ability to capture the dynamics of the series more precisely have to be taken into account (Yusof & Kane, 2013).There are a number of studies on time series modelling and forecasting that argue that performance improves in hybrid models (Zhang, 2003).Numerous authors published a lot of papers combining different models e.g., ARIMA with GARCH, ARFIMA with GARCH, HMM with GARCH, ANN with SVM etc. (Chau & Wu, 2010) investigates the joint effect of the hybrid of ANN-SVR model and the singular spectrum analysis on improving the accuracy of daily rainfall forecasting.Their results indicated that the hybrid of support vector regression model performed the best.Sallehuddin et al. (2007) proposed a model that integrates nonlinear Grey Relational Artificial Neural Network (GRANN) and linear ARIMA model.The forecasting ability of the model was compared with several other models, which include: ARIMA, Multiple Regression, GRANN, etc.The result have shown that the proposed hybrid of GRANN_ARIMA model has outperformed other models with 99.5% forecasting accuracy for small-scale data and 99.84% on large-scale data.Yin (2008) proposed a variant of an ARFIMA process by adding an explanatory variable into the ARFIMA long memory model.Comparing the forecast ability of the proposed model with other models such as; ARFIMA, the Random Walk and the ARMA process of the order one, the result show that the proposed model outperform other models in terms of out sample forecasting.Suhartono and Muhammad (2011) proposed a hybrid model to study trends and seasonal fuzzy time series.The proposed hybrid model is based on a Winter's model and weighted fuzzy time series.The result revealed that more accurate forecast is obtained from the proposed hybrid method.It can be observed that most hybrid models performed better than the individual models in terms of forecast ability and accuracy.
Rainfall time series is often characterized by long memory behavior (Yusof et al., 2013) and are highly skewed with volatility clustering (Villarini et al., 2010).Long memory is an asymptotic property of a stochastic process, characterizing the decay of the correlation between observations separated by increasingly large time lags.The fractionally integrated process is used in explicitly modeling of the long term memory structure.The possible way of representing the correlation structures is by the use of fractional differencing order d.It has been demonstrated that a stationary process will possess long memory behavior if 0 < d <0.5.For -0.5 < d < 0, the process is considered by a slow decay of autocorrelation but does not retain the long memory property.Therefore, in this case, the process is said to be anti-persistent.This behavior is often related to an over differentiation problem by the first difference filter.For 0.5 < d < l, the case is called non-stationary mean-reverting (Claude & Vivien, 2002).The testing of hydrological data with Long Memory models was done by number of authors (Koutsoyiannis, 2006;Koscielny-Bunde et al., 2006;Mudelsee, 2007;Wang et al., 2007;Gil-Alana, 2012;Yusof et al., 2013) and found it to be a common characteristic in hydrological data.
The volatility clustering indicates that the residual of the series exhibits time-varying hetereoskedastic effect, that is, unconditional standard deviations are not constant through time.These characteristics can well be captured using the GARCH family models.The motivation for using both models comes from the notion that rainfall time series is characterised by both linear and non-linear pattern, therefore neither one of the models can identify the true data generating process DGP (Terui & Dijk, 2002).Modarres and Ouarda (2013) investigated the advantages of non-linear GARCH model over linear ARIMA model using stream flow data of the Matapedia River, Quebec, Canada.The result shows that the linear ARIMA model became inadequate for modelling stream flow time series due to the existence of Heteroskedasticity in the residuals of the ARIMA model as shown by Engle test.Therefore, combining an ARIMA (13, 1, 4) and GARCH (3, 1) models fitted the data very well.Therefore, recommended that "the application of a GARCH model is strongly suggested for hydrological time series modelling as the conditional variance of the residuals of the linear models can be removed and the efficiency of the model will be improved".Wang et al. (2005) verified the presence of conditional heteroskedasticity in the residuals from linear models fitted to the daily and monthly stream flow processes of the upper Yellow River using the McLeod-Li and Engle's Lagrange Multiplier tests.In a related study, HongRui et al. ( 2012) established a GARCH model to the daily runoff data for the period 1949-2001 of Yichang hydrological station.Szolgayova (2011) investigates whether hetereoskedastic effect can be detected in selected time series of daily discharges in Slovakia.The results show that heteroskedasticity was present in all of the tested time series.Other application of hetereoskedastic models to hydrological data can be seen in (Chen et al., 2008;Laux et al., 2011;Yusof & Kane, 2013).
In this study, we first analyse the spatial variability of network of rainfall stations to identify the location with the highest rainfall event in peninsular Malaysia.The temporal pattern of the rainfall data of the identified location is then modelled via long memory model augmented with the Generalised Autoregressive Conditional Hetereoskedastic (GARCH) model taking in mind the observed features of the hydrological time series.The combined model would be good in capturing the persistence dependence in the rainfall data.The remaining of the paper is organized as follows: The methodology employed in the spatial analysis and the temporal modelling that captures both the linear and nonlinear behaviour of the daily rainfall series is presented in section 2. Section 3 presents and discusses the results of the findings.Section 4 gives the conclusions and recommendations.

Rainfall Spatial Variability
Kriging uses semivariance to measure the spatially correlated component, a component that is also called spatial dependence or spatial autocorrelation which expresses the spatial dependence between neighboring observations.The semivariogram quantifies the relationship between the semivariance and the distance between sampling pairs.A semivariogram plots the average semivariance against the average distance, this function may be used alone as a measure of spatial autocorrelation (Kang, 2012).
The semivariance is computed by:  is the average semivariance between known points s i separated by distance h, the number of pairs of sample points arranged by direction in the bin is denoted by n and z is the attribute value.Semivariogram is modeled by fitting a theoretical function such as: Spherical, Exponential, Linear and Gaussian, to ensure that the solution is unbiased and has minimum variance (Kang, 2012).After having a suitable semivariogram model, the kriging technique can then be applied in estimating the value of z(s 0 ) in each grid point s 0 where no value is available.The Kriging estimate is of the linear form: Where ( ) i z s is the measured value in the ith location, i  is an unknown weight for the measured value at the ith location, 0 s is the prediction location and N is the number of measured values (Journel & Huijbregts, 1978).The issue is to find a set of weighted coefficients in such a way that the kriging estimator satisfied the conditions of unbiasedness and the minimum estimation variance respectively, that is: These two constraints make the kriging technique a good interpolator against other interpolation methods.

Testing and Estimation of Fractional Degree of Integration (d)
In testing and estimation of the fractional order of integration, Geweke and Porter-Hudak (1983) proposed the log-periodogram regression estimate and has been the most widely used (Shimotsu, 2002).Consider a fractional integrated process {Y t }, its spectral density is given by: 2 ( ) [2sin( / 2)] ( ) where  is the Fourier frequency, ( ) u f  is the spectral density corresponding to t u and t u is a stationary short memory noise with 0 mean.Consider the set of harmonic frequencies 2 / , where n is the sample size.Taking the logarithm of Equation (2) we have: Equation ( 3) can be re-written in an alternative form following (Wang et al., 2007) as: 2 ( ) ln ( ) ln ( 0) ln[4sin ( / 2)] ln (0) The fractional order of differencing d can be estimated using the regression equation in Equation ( 4).Using the periodogram estimate of ( ) j f  , if the number of frequencies m used in Equation ( 4) is a function g(n) (a positive integer) of the sample size n, where m = ( ) g n n   with 0 <  < 1, it can be demonstrated that the least squares estimate d  using the above regression is asymptotically normally distributed in large samples 2 ( ) 2 1 ~,6 ( ) and U is the sample mean of j U , j = 1 … g(n).Under the null hypothesis of no long memory (d = 0), the t-statistic The value of the power factor  is the main determinant of the ordinates included in the regression, for more details see (Yusof et al., 2013).

ARFIMA (p, d, q) Model
Fractionally integrated processes are an extension of the class of Box and Jenkins methodology known as ARIMA processes.Autoregressive fractionally integrated moving average models (ARFIMA) grant more freedom and flexibility in modelling processes with long memory behaviour.Modelling data with fractionally integrated processes allows capturing and modelling slower rates of decay of autocorrelation than with other common time series methods such as AR, MA or ARIMA models (Baillie, 1996).The general ARFIMA (p, d, q) model was first introduced by Granger and Joyeux (1980) and Hosking (1981) and is formulated as: where t  is a white noise process with mean zero and variance 2  , B is the lag operator, ( ) and ( ) B  are polynomials of orders p and q denoted 1 (1 ... ) respectively, with roots outside of the unit circle.The standard differencing operator (1 − B) of an ARIMA processes is replaced with a fractional operator (1 where d is the fractional differencing parameter.The aim is to find the order of the AR and MA processes.The order of differencing can be decided using Equation (4).There exist several techniques for selecting the order ARMA model.One of them is based on the information criteria.The idea is to minimize the risk of under fitting that is, selecting an order smaller than the true order, and over fitting that is, selecting an order larger than the true order.One of the mathematical formulation of the parsimony criterion of model building is proposed by Akaike (1974) named Akaike Information Criterion (AIC).This method was formulated for the purpose of selecting an optimal model fits to a given time series data.Mathematically the AIC is formulated as: where p and q are the order of autoregressive and moving average parameters to estimate.The model with minimum AIC can be selected as the parsimonious model (Yurekli & Kurunc, 2007).

Method of Estimating ARFIMA (p, d, q) Model
For a tentative formulation of a model, identification of the model parameters, then it is needed to obtain the co-efficient estimates of the parameters.After the parameters have been estimated, the fitted model is then subjected to diagnostic checks.Since the order of differencing has been estimated already, now we consider an ARMA model of the form: To estimate the value of 2 1 1 ( , , , , , , , ) on the basis of observations on Y, The principle will be based on maximum likelihood estimation.This method makes it possible to estimate the parameters of the process.Let 2 / 1 1 ( , , , , , , , ) denote the vector of population parameters.Suppose we have observed a sample of size T 1 2 ( , ,..., ) T y y y , the approach is to calculate the joint probability density: 1 1 , ,..., 1 1 ( , ,..., ; ) which implies the probability of having observed this particular sample.The maximum likelihood estimate (MLE) of µ is the value for which this sample is most likely to have been observed; that is, it is the value of µ that maximizes (7).

Test for Heteroskedasticity
The identification of conditional heteroskedasticity is often based on testing whether squared residuals from the ARFIMA model are autocorrelated (Bollerslev & Mikkelsen, 1996;Andersen & Bollerslev, 1998) among many others.Testing for the lack of correlation of particular residuals, can be carried out using different methods, among the methods the Mcleod-Li Test would be implemented in this paper.
This test for heteroskedasticity was proposed by McLeod and Li (1983).It looks at the autocorrelation function of the squares of the residuals from the fitted model and tests whether the correlation between ( 2 t x , 2 t j x  ) is not 0 for some j.The autocorrelation at the j lag for the squared residuals { 2 t x } is estimated using: For the null hypothesis that t x is an independently identically distributed (i.i.d) process, McLeod and Li (1983)  demonstrated that, for fixed L, is asymptotically a multivariate unit normal.Consequently, if L is sufficiently large, the usual Ljung-Box (LB) statistic: If the probability of Q is higher than  = 0.05, then is a strong evidence that the residuals are uncorrelated and the model is suitable for the data set.If the probability is less than  = 0.05, then the residuals are correlated and the model is inadequate, therefore need to repeat the model building process to achieve an adequate model (Modarres & Ouarda, 2012).

GARCH Modeling
The ARFIMA model is adequate for modelling linear dependence of rainfall data; however, the model cannot capture the hetereoskedasticy in the time series typically observed in the form of existence of serial dependence in squares of the model residuals.Therefore, the residual t  from the fitted model is modelled using GARCH family models to reduce or remove the heteroskedasticity.
Introduced by Engle (1982), the Autoregressive Conditional Hetereoskedastic (ARCH) model, and generalized by Bollerslev (1986), named Generalized Autoregressive Conditional Hetereoskedastic model (GARCH).The term "conditional" implies the level of association on the past sequence of observations and the "autoregressive" describes the feedback mechanism that incorporates past observations into the present (Laux et al., 2011).
The variance equation of the GARCH (p, q) model can be expressed as: where t  is the residuals from the fitted ARFIMA model and   (0, 1) denote the pdf (probability density function) of the residuals with 0 mean and variance 1.  is the distributional parameter that define the the shape of the distribution.The GARCH (p, q) process is called a stationary process if the following conditions hold: The stationarity assumption in statistical analysis of a time series data is necessary.In GARCH modelling this considers the constraints on the estimated parameters in the maximum likelihood-estimation.It follows the theorem that states the restrictions on the estimated parameters in the GARCH (p, q) model for stationarity in the GARCH process.The required and sufficient condition for the existence of a second order stationary solution is that: 1 The simplest GARCH model is GARCH (1, 1) model (Hu et al., 2010).

Choice and Validation of Time Series Model
When a good model for the data generating process of a time series has been built, it is common practice to test for the model acceptability.Test for residuals autocorrelation is the prominent tool for this task.To judge whether ARCH effects and autocorrelation have been completely removed or not, a well-known example is a Portmanteau test for residual autocorrelation and ARCH-LM test.The former is implemented in this paper.

Portmanteau Test for Residual Autocorrelation
A portmanteau test is a test used for investigating the presence of autocorrelation in time series.The portmanteau test checks the pair of hypotheses: 0 , , : ... 0 e. all lags correlations are zero.
1 , : 0 u i H   for at least one i = 1, …, h, is tested, at least one lag with non-zero correlations.
where , ( , )   are the standardized estimation residuals.The test statistic has an approximate 2 ( ) h p q    -distribution if the null hypothesis holds.An advance version with potentially better small sample properties was proposed by Ljung and Box (1978) given as: The null hypothesis is rejected if probability value is less than the significance level of  = 0.05.

Data Used
The daily average rainfall data of 75 rain-gauge stations across peninsular Malaysia for the period January 1975 to November 2008 obtained from the Malaysian meteorological department were used in this study.

Results and Discussion
The rainfall data sets from 75 gauged stations were first analyzed using OK technique.The reason behind this is to identify the location that realized highest rainfall for the period under study.Figure 1 gives the result of the OK displaying the spatial variability of rainfall amount in peninsular Malaysia with the location with the highest average rainfall amount for the period under study.It can be observed in (Figure 1) that some other stations falls within the range of 7.6 and above signifying areas high average rainfall amount, but the interest in this study is particularly on single station with the highest amount (the station was selected using a defined threshold in the arcGIS software) so as to modelled the temporal behaviour of the rainfall in the identified station.In modelling the temporal behaviour of the rainfall time series of the identified station, the daily rainfall time series for the period January 1975 to November 2008 of the identified station (Chui Chak) is used.The autocorrelation function (ACF) that provides a measure of temporal correlation between rainfall data points with different time lags is given in Figure 3.For a purely random event, all autocorrelation coefficients are zero, apart from lag 0, which is equal to 1 (Yusof et al., 2013).The autocorrelation function (Figure 3) decays slowly which indicate that the time series are strongly correlated which is a main feature of long memory appearance.Fractionally integrated processes possess genuine long memory in the sense that the present state of a system is temporally dependent on all past states (Rea et al., 2007).The approach here follows fitting a suitable ARFIMA model to the daily rainfall series of the station identified by first transforming the data via fractional integration.The Geweke-Porter-Hudak method (GPH) was employed to estimate the fractional order of differencing d using bandwidth of 0.8.The value of d was found to be 0.2070681 which falls within 0 < d < 0.5 indicating a long memory process.A stationary and invertible process can be represented either in a moving average form or in an autoregressive form.The problem of either of these representations, though, is that it may contain too many parameters even for a finite-order moving average and a finite order autoregressive model.In general, a large number of parameters reduce efficiency of the estimates.Hence, in building a good model, it may perhaps be essential to include both AR and MA terms in the model; this will leads to the useful mixture of ARMA process (Wei, 2006).The AR and MA orders p and q respectively of the ARFIMA model were identified and estimated using Box-Jenkins methodology.The results are given in Table 1.The autocorrelation function (Figure 4) of the residual from the fitted ARFIMA model shows that the residuals are uncorrelated; therefore, the ARFIMA model seems to be adequate.However, the autocorrelation function (Figure 5) of squares of residual of the fitted ARFIMA model shows autocorrelation, this indicates the presence of hetereoskedastic effect and confirmed by Mcleod-Li test given in (Figure 6) which can be modelled using GARCH model.As mentioned earlier, the linear models e.g.(ARMA, ARFIMA etc) can only be sufficient in modelling the information confined in the autocorrelation of the differentiated time series, which is a low requirement and is not capable of modelling Heteroskedasticity.Due to this problem, the maximum likelihood method to estimate the ARFIMA parameters used to be biased when the error variance is not constant (De Montera et al., 2008).The remedy to this consists in modelling Heteroskedasticity as a nonlinear relationship.Therefore the residuals from ARFIMA model is augmented by a GARCH model, the results are given in (Table 1).Table 2 gives the Ljung-Box test results for the ARFIMA-GARCH error model, it can be seen that both residual and squares of residual are uncorrelated meaning that no other hetereoskedastic effect left and this is the necessary condition to conclude that the model is appropriate.Therefore the addition of the GARCH (1,1) specification for the ARFIMA error helps to capture the serial correlations in the squared residuals.In the ARFIMA-GARCH (1, 1) model, the sum of  and  measures the extent to which the variance of current volatility remains significance for long periods into the future (i.e.volatility persistence).As this sum approaches unity, the persistence of variability to the volatility becomes greater.Moreover, when the sum of  and  equals to 1, then any variance to volatility is permanent and the unconditional variance is infinite.In this case, the process is denoted an IGARCH and implies that volatility persistence is permanent; hence, past volatility is significant in predicting future volatility over all finite horizons (McMillan & Thupayagale, 2011).
The volatility is said to be explosive if the sum of  and  is greater than 1, then, the variance to the volatility in one period will result in even greater volatility in the next period (Chou, 1988), as such the higher the volatility the riskier the security.
It could be noticed that the sum parameters value for the ARFIMA-GARCH model is relatively high i.e.
   = 0.839, therefore represent a relatively lasting persistence, that is the process variability should converge to the historical average after a relatively long period of time.

Conclusions
The spatial variability of rainfall time series is modelled using Ordinary Kriging technique in identifying the location with highest rainfall event in peninsular Malaysia.Chui Chak station realised to be the identified station.Moreover, the temporal dependence of the rainfall time series of the identified station is modelled via fractional ARIMA model augmented with the GARCH model.The Ljung-Box test for testing autocorrelation proves that the combined ARFIMA-GARCH model captures the temporal behaviours in the Chui Chak rainfall time series data.Thus, modelling the dependence in level and variability enhances our understanding for the properties of the rainfall series.Therefore the model could be a very good model for modelling rainfall for water resource management.

Figure 1 .
Figure 1.Spatial distribution of Peninsular Malaysia daily precipitations in millimetre averaged over the time period January 1975 to November 2008

Figure 2 .
Figure 2. Time series plot for the daily rainfall of Chui Chak station

Figure 3 .
Figure 3. ACF plot for the daily rainfall of Chui Chak station

Figure 4 .Figure 6 .
Figure 4. Autocorrelation function of residuals of the fitted ARFIMA model

Table 2 .
Ljung-Box test results for the ARFIMA-GARCH model