Consistency of an Estimator for Change Point in Volatility of Financial Returns

A non-parametric Auto-Regressive Conditional Heteroscedastic model for financial returns series is considered in which the conditional mean and volatility functions are estimated non-parametrically with Nadaraya-Watson kernel estimators. A test statistic for an unknown abrupt change point in volatility is considered which takes into account conditional heteroscedasticity, dependence, heterogeneity and the fourth moment of financial returns, since kurtosis is a function of the fourth moment. The test is based on the L2 norm of the conditional variance functions of the squared residuals. A non-parametric change point estimator for the volatility of financial returns is then obtained, and its consistency is shown both theoretically and through simulation. The estimator is applied to change point estimation in the volatility of the United States Dollar/Kenya Shilling exchange rate returns data set. Through a binary segmentation procedure, three change points in the volatility of the exchange rate returns are estimated and accounted for.


Introduction
A change point is a point in time after which a data set is no longer characterized by the same underlying process. Instead, the observations form two or more distinct segments, each with its own underlying process; the segments may, for example, have different means or different variances. Change points result from an observed event or an unobserved combination of factors. These include, among others, financial liberalization of emerging markets and integration of world equity markets, changes from a fixed to a floating exchange rate regime, and the introduction of a single currency such as the Euro in Europe. The change point problem is also referred to as the disorder problem or as probabilistic diagnostics.
In financial instruments, volatility is rarely constant; instead it behaves like a jump process, varying over time and creating temporal clusters, a phenomenon known as volatility clustering: large or small price movements tend to be followed by similarly large or small price movements on consecutive hours, days, weeks or other time durations. Volatility clustering gives rise to structural breaks, or change points, which the standard GARCH model fails to accommodate, leading to overestimation of the degree of volatility persistence.
Unlike previous research that considers change points in independent and identically distributed variables or change points in the mean, e.g. Csorgo and Horváth (1997) and Shao and Zhang (2010), the concern here is with dependent heterogeneous processes with finite fourth moment, since kurtosis is a function of the fourth moment. The change is assumed to be abrupt (occurring completely between two observations), thus creating piecewise stationary segments, rather than smooth or gradual, which would create locally stationary segments. The change point location is not known a priori, as is common in most previous studies. Because inferences based on non-parametric approaches are robust against model misspecification, non-parametric models avoid the misspecification problem commonly encountered in parametric approaches. Hence, kernel estimators are employed for both the conditional mean function and the conditional variance function of financial returns, and change point estimation is then carried out on the conditional variance function of the returns series.
The first published article on change point analysis was by Page (1954), who considered testing for a potential single change point in the mean of independent and identically distributed normal random variables, motivated by a quality control setting in manufacturing. Since then, change point analysis has developed rapidly, covering multiple change point detection and estimation, different types of data, parametric test statistics based on the likelihood ratio with estimation by maximum likelihood, Csorgo and Horváth (1997), as well as non-parametric test statistics under various assumptions. An extensive literature on change point detection and estimation in independent and identically distributed random variables is given by Csorgo and Horváth (1997), where mostly a change in mean is considered; in martingale difference sequences by Bai (1994); and for a change in variance under the assumption that financial returns are independent and identically distributed by Inclan and Tiao (1994), an assumption that does not hold for real financial data. Change point analysis has also been carried out for time series, e.g. Chen et al. (2005), Ross (2013) and Aminikhanghahi and Cook (2017), under different assumptions on dependence and heterogeneity, among others Gerstenberger (2018).
Change point detection and estimation methods can be classified into offline (fixed sample or retrospective) and online (sequential) approaches, depending on how the sample is acquired, Brodsky and Darkhovsky (2013), Csorgo and Horváth (1997), Chen and Gupta (2011). In the offline approach, the data set is first observed in full; detection and estimation of the change point are then done by looking back in time to recognize where the change occurred. Examples include Lee et al. (2003), who consider a change point in variance in a non-parametric time series regression model with a strong mixing error process using the cumulative sum of squares approach introduced by Inclan and Tiao (1994), and Sansó et al. (2020), who consider change point detection in the unconditional variance of financial time series. In sequential approaches, new data continually arrive and are analyzed adaptively; the goal of online change point detection is to detect changes as quickly as possible while keeping the number of false alarms as low as possible. Online approaches are mostly used in areas such as quality control, financial risk management and asset allocation or portfolio selection, while offline approaches are used in areas such as genome analysis, linguistics and audiology. Sequential examples include Berkes et al. (2004) for change point detection in GARCH(p,q) models and Koubková (2006) for a change point in the mean. Change point estimation in volatility is applicable in areas such as option pricing and the calculation of value-at-risk. The volatility change point estimator developed here is applied to estimate change points in the volatility of United States Dollar/Kenya Shilling exchange rate returns, and the change points are accounted for. A significant improvement in describing a time series is expected if the points in time where volatility changes are identified.

Non Parametric Auto-Regressive Conditional Heteroscedastic (NP-ARCH) Model
Let $S_t$ denote the price of some financial instrument observed at equally spaced discrete time points $t = 1, 2, \ldots$ and let $X_t = \log S_t - \log S_{t-1}$ be the continuously compounded single-period return on the asset from time $t-1$ to $t$. The volatility of the instrument is the standard deviation of these returns. One assumes the existence of a non-parametric, non-linear relationship between $X_t$ and $X_{t-i}$, $i = 1, 2, \ldots, d$, modeled by a non-linear Auto-Regressive process of the form

$$X_t = m(X_{t-1}, X_{t-2}, \ldots, X_{t-d}) + u_t, \quad t = 1, 2, \ldots, n, \qquad (1)$$

where $\{u_t\}$ are innovations independent of the past $\{X_t\}$ and $m(\cdot)$ is the conditional mean function of the return in period $t$ given the previous periods $X_{t-1}, X_{t-2}, \ldots$. In non-parametric approaches, $m(\cdot)$ is allowed to come from some flexible class of functions and is approximated in such a way that the precision of the approximation increases with the sample size. Since the interest is in future volatility, representing the innovations as $u_t = \sigma(X_{t-1}, X_{t-2}, \ldots, X_{t-d}) z_t$ extends Equation (1) to the Non Parametric Auto-Regressive Conditional Heteroscedastic (NP-ARCH) model

$$X_t = m(X_{t-1}, \ldots, X_{t-d}) + \sigma(X_{t-1}, \ldots, X_{t-d}) z_t, \qquad (2)$$

where $\sigma^2(\cdot)$ is the conditional variance function of the return in period $t$ given the past periods. $\{z_t\}$ are independent and identically distributed random errors, time invariant, with unspecified continuous and positive probability density function $f_z$, $E(z_t \mid \mathcal{F}_{t-1}) = 0$ and $\mathrm{Var}(z_t \mid \mathcal{F}_{t-1}) = 1$ where $\mathcal{F}_{t-1} = \sigma(X_{t-1}, \ldots, X_{t-d})$, independent of $X_{t-1}, \ldots, X_{t-d}$, with $E z_t = 0$, $E z_t^2 = 1$ and $E z_t^4 < \infty$. Model (2) is a flexible non-parametric time series model because it imposes no particular parametric form on the conditional mean and conditional variance functions.
However, in higher dimensions performance deteriorates, the so-called curse of dimensionality: for $d > 2$ the estimation of the functions in Equation (2) is complicated unless one has a very large sample (for a given bandwidth, the higher the dimension, the less data there is in a neighborhood of a point $x \in \mathbb{R}^d$ with bandwidth $b_n$). For simplicity, let $d = 1$ so that Equation (2) becomes

$$X_t = m(X_{t-1}) + \sigma(X_{t-1}) z_t. \qquad (3)$$

Equation (3) generates heavy tailed distributions. To demonstrate this, suppose a simple model $X_t = \sigma(X_{t-1}) z_t$ with $z_t$ assumed i.i.d. standard normal, so that $E(z_t^4) = 3$. By Jensen's inequality,

$$E X_t^4 = E \sigma^4(X_{t-1}) \, E z_t^4 = 3\, E \sigma^4(X_{t-1}) \geq 3 \left( E \sigma^2(X_{t-1}) \right)^2 = 3 \left( E X_t^2 \right)^2,$$

so the kurtosis of $X_t$ is at least that of a normal distribution. This heavy tailedness implied by Equation (3) makes it a suitable model for data with heavy tails, such as financial returns data. $X_t$ is assumed to be strictly stationary and strong mixing, a common assumption satisfied by most financial time series, Fan and Yao (2008); hence the theorem below applies.
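The heavy-tail argument above can be checked numerically. The sketch below uses assumed illustrative choices $m(x) = 0.2x$ and $\sigma^2(x) = 0.1 + 0.3x^2$ (the text leaves both functions unspecified), simulates Equation (3) with standard normal $z_t$, and computes the sample kurtosis of the returns:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (assumed) mean and volatility functions for the NP-ARCH
# model X_t = m(X_{t-1}) + sigma(X_{t-1}) z_t with d = 1.
def m(x):
    return 0.2 * x

def sigma(x):
    return np.sqrt(0.1 + 0.3 * x**2)   # ARCH(1)-type volatility function

n = 100_000
x = np.zeros(n)
z = rng.standard_normal(n)
for t in range(1, n):
    x[t] = m(x[t - 1]) + sigma(x[t - 1]) * z[t]

# Sample kurtosis: Jensen's inequality implies it exceeds the Gaussian
# value of 3 even though z_t itself is standard normal.
kurt = np.mean((x - x.mean())**4) / np.var(x)**2
print(f"sample kurtosis: {kurt:.2f}")
```

With these parameters the stationary fourth moment exists ($3 \times 0.3^2 < 1$), so the sample kurtosis settles noticeably above 3, illustrating the heavy tails generated by the model.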
Theorem 1 (Strong mixing condition) Suppose the existence of a probability space $(\Omega, \mathcal{F}, P)$. Let the dependence measure between any two $\sigma$-fields $\mathcal{A}, \mathcal{B} \subset \mathcal{F}$, as introduced by Rosenblatt (1956), be defined by

$$\alpha(\mathcal{A}, \mathcal{B}) = \sup_{A \in \mathcal{A},\, B \in \mathcal{B}} |P(A \cap B) - P(A)P(B)|.$$

Now suppose $\{X_t, t \in \mathbb{Z}\}$ is a two-sided sequence of variables on a given probability space $(\Omega, \mathcal{A}, P)$. For $-\infty \leq j < l \leq \infty$, let $\mathcal{F}_j^l = \sigma(X_t, j \leq t \leq l)$ denote the $\sigma$-field of events generated by the family $\{X_t, j \leq t \leq l\ (t \in \mathbb{Z})\}$ ($\mathcal{F}_j^l$ assembles all the information collected between times $j$ and $l$). For each $n \in \mathbb{N}$, define the coefficient of dependence (mixing) $\alpha(n)$ by

$$\alpha(n) = \sup_{j \in \mathbb{Z}} \alpha\left(\mathcal{F}_{-\infty}^{j}, \mathcal{F}_{j+n}^{\infty}\right),$$

where $\mathcal{F}_{-\infty}^{j}$ is the $\sigma$-field of events in the past of the sequence $\{X_t\}$ up to time $j$ and $\mathcal{F}_{j+n}^{\infty}$ is the $\sigma$-field of events in the future of the sequence from time $j+n$ onwards. The sequence $(\alpha(n), n \in \mathbb{N})$ is non-increasing in $n$ and non-negative. The sequence $\{X_t, t \in \mathbb{Z}\}$ is said to be strong mixing or $\alpha$-mixing if $\lim_{n \to \infty} \alpha(n) = 0$, in which case the sequence is asymptotically independent between the past and the future.
For a one-sided sequence $\{X_t, t \geq 1\}$, one defines $\alpha(n)$ by

$$\alpha(n) = \sup_{j \geq 1} \alpha\left(\mathcal{F}_{1}^{j}, \mathcal{F}_{j+n}^{\infty}\right),$$

and if $\lim_{n \to \infty} \alpha(n) = 0$ the sequence is, in particular, ergodic.
Let $\{(X_t, X_{t-1}) \in \mathbb{R} \times \mathbb{R} : t = 1, 2, \ldots, n\}$ be a sample of size $n \in \mathbb{N}$ and let $K(\cdot) : \mathbb{R} \to \mathbb{R}$ be a kernel function, i.e. a bounded continuous function on $\mathbb{R}$ satisfying: normalization, $\int K(u)\,du = 1$, which ensures that kernel density estimation yields a probability density function; symmetry about zero, $K(u) = K(-u)$ for all $u$, implying that all odd moments vanish; non-negativity, $K(u) \geq 0$ for all $u$, so that $K(u)$ is itself a probability density function; $\int u K(u)\,du = 0$; and $\mu_2 = \int u^2 K(u)\,du < \infty$. Let $b_n$ be a positive real number, called the bandwidth or smoothing parameter. The kernel density estimator is defined by

$$\hat{f}(x) = \frac{1}{n b_n} \sum_{t=1}^{n} K\left(\frac{x - X_{t-1}}{b_n}\right).$$

$m(x)$ and $\sigma(x)$ are estimated by non-parametric techniques under the assumption that they are smooth functions and that $X_{t-1}$ has a density function $f(x)$, $x \in [-1, 1]$, which is the support of the second-order Epanechnikov kernel. This kernel is employed when estimating the regression function since it is the most efficient in minimizing the Mean Integrated Squared Error (MISE) and is therefore optimal, bearing in mind that the choice of the kernel is less important than the choice of the bandwidth. The non-parametric estimator of the regression function $m(x) = E(X_t \mid X_{t-1} = x)$ is

$$\hat{m}(x) = \frac{\sum_{t=1}^{n} K\left(\frac{x - X_{t-1}}{b_n}\right) X_t}{\sum_{t=1}^{n} K\left(\frac{x - X_{t-1}}{b_n}\right)},$$

where $K(\cdot)$ is a kernel function and $b_n$ is the bandwidth. $\hat{m}(x)$ is called a kernel estimator, or the Nadaraya-Watson kernel estimator, developed independently by Nadaraya (1964) and Watson (1964).
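A minimal sketch of the Nadaraya-Watson estimator with the second-order Epanechnikov kernel follows; the function names and the noiseless toy data are assumptions for illustration only.

```python
import numpy as np

def epanechnikov(u):
    """Second-order Epanechnikov kernel, K(u) = 0.75(1 - u^2) on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw_estimate(x0, X_lag, X, b_n, kernel=epanechnikov):
    """Nadaraya-Watson estimate of m(x0) = E(X_t | X_{t-1} = x0):
    a locally weighted average of the responses X_t."""
    w = kernel((x0 - X_lag) / b_n)
    s = w.sum()
    return float(np.dot(w, X) / s) if s > 0 else float("nan")

# Example on a noiseless linear relation X_t = 2 X_{t-1}: the local
# average at x0 = 0.5 recovers 1.0 because the kernel weights are
# symmetric around that point on this regular grid.
X_lag = np.linspace(0.0, 1.0, 1001)
X = 2.0 * X_lag
print(nw_estimate(0.5, X_lag, X, b_n=0.2))  # ≈ 1.0
```

The estimator is a ratio of kernel-weighted sums, so observations far from $x_0$ (outside the bandwidth window) receive zero weight under the compactly supported Epanechnikov kernel.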
The estimator of the conditional variance function $\sigma^2(x) = \mathrm{Var}(X_t \mid X_{t-1} = x)$ is obtained using the estimated residuals $\hat{u}_t = X_t - \hat{m}(X_{t-1})$ and is again a local constant smoother (kernel estimator),

$$\hat{\sigma}^2(x) = \frac{\sum_{t=1}^{n} G\left(\frac{x - X_{t-1}}{g_n}\right) \hat{u}_t^2}{\sum_{t=1}^{n} G\left(\frac{x - X_{t-1}}{g_n}\right)},$$

where $G(\cdot)$ is a kernel function and $g_n$ is the bandwidth. A residual-based estimator of $\sigma^2(x)$ overcomes the bias problem of the method of Härdle and Tsybakov (1997), who used the direct estimator $\hat{\sigma}^2(x) = \hat{m}_2(x) - (\hat{m}(x))^2$, which is also likely to produce negative estimates of the volatility function, especially when different bandwidth parameters are used to estimate $m_2(x)$ and $m(x)$. A residual-based estimator also reduces the variance of difference-based estimators. The second-order Gaussian kernel is employed when estimating the conditional variance function so as to cater for the asymmetric behavior of volatility; $g_n$ is a bandwidth different from $b_n$. Fan and Yao (1998) showed that $\hat{m}(x)$ is a consistent estimator of $m(x)$.
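The two-stage, residual-based smoother can be sketched as follows, with the Gaussian kernel for the second stage as in the text; the function names and the constant toy data are assumptions.

```python
import numpy as np

def gaussian(u):
    """Second-order Gaussian kernel."""
    return np.exp(-0.5 * np.asarray(u, dtype=float) ** 2) / np.sqrt(2.0 * np.pi)

def cond_var_estimate(x0, X_lag, resid_sq, g_n):
    """Residual-based kernel estimate of sigma^2(x0) = E(u_t^2 | X_{t-1} = x0),
    where resid_sq holds the squared first-stage residuals
    (X_t - m_hat(X_{t-1}))^2 from a Nadaraya-Watson mean fit."""
    w = gaussian((x0 - X_lag) / g_n)
    return float(np.dot(w, resid_sq) / w.sum())

# Sanity check: if the squared residuals are constant, the local average
# returns that constant at any evaluation point.
X_lag = np.linspace(-1.0, 1.0, 501)
resid_sq = np.full_like(X_lag, 2.0)
print(cond_var_estimate(0.3, X_lag, resid_sq, g_n=0.25))  # ≈ 2.0
```

Because the smoother averages non-negative squared residuals with non-negative weights, the estimate is non-negative by construction, which is the advantage over the direct estimator noted above.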
The bandwidth $b_n$ is fixed and does not vary with $x$. It determines the size of the neighborhood and is chosen in dependence on $n \in \mathbb{N}$ in such a way that for a larger sample it is chosen smaller, Härdle (1990). Hence, asymptotically, the bandwidth is a sequence of positive real numbers $(b_n)_{n \in \mathbb{N}}$ with $\lim_{n \to \infty} b_n = 0$. Due to the dependence in the returns data set, the bandwidth is estimated by minimizing the leave-one-out cross-validation function, Härdle (1990).
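The leave-one-out criterion can be sketched with a simple grid search; all names, the candidate grid and the simulated data are assumptions, and the paper's exact cross-validation function may weight terms differently.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def loocv_score(b_n, X_lag, X, kernel=epanechnikov):
    """Average squared error when each observation is predicted from a
    Nadaraya-Watson fit that leaves that observation out."""
    n = len(X)
    err, used = 0.0, 0
    for i in range(n):
        w = kernel((X_lag[i] - X_lag) / b_n)
        w[i] = 0.0                       # leave observation i out
        s = w.sum()
        if s > 0:
            err += (X[i] - np.dot(w, X) / s) ** 2
            used += 1
    return err / used if used else np.inf

def select_bandwidth(X_lag, X, grid):
    """Bandwidth on the candidate grid minimizing the leave-one-out score."""
    scores = [loocv_score(b, X_lag, X) for b in grid]
    return float(grid[int(np.argmin(scores))])

rng = np.random.default_rng(3)
X_lag = rng.uniform(-1.0, 1.0, 300)
X = np.sin(np.pi * X_lag) + 0.1 * rng.standard_normal(300)
grid = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
print(select_bandwidth(X_lag, X, grid))
```

Too small a bandwidth undersmooths (high variance), too large a bandwidth oversmooths (high bias); the cross-validated score trades the two off against each other.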

Non-Parametric Auto-Regressive Conditional Heteroscedastic Model Under Structural Change
Under the null hypothesis $H_0$ of no change in volatility, Equation (3) is written as

$$X_t = m(X_{t-1}) + \sigma_{(1)}(X_{t-1}) z_t, \quad t = 1, 2, \ldots, n,$$

where $\sigma_t^2(X_{t-1})$ is denoted by $\sigma_{(1)}^2(X_{t-1})$ for $t = 1, 2, \ldots, n$. If the data structure changes at a certain point in time, using one regression model to study the data will obviously leave the data unfitted or poorly explained by the regression model: the model valid near $t = 1$ is not valid near $t = n$ due to the presence of an unknown change point. Hence, in the presence of an unknown change point in volatility $\tau \in [2, n-1]$, Equation (3) becomes

$$X_t = \begin{cases} m(X_{t-1}) + \sigma_{(1)}(X_{t-1}) z_t, & t = 1, \ldots, \tau, \\ m(X_{t-1}) + \sigma_{(2)}(X_{t-1}) z_t, & t = \tau + 1, \ldots, n, \end{cases}$$

and the alternative hypothesis $H_A$ becomes $\sigma_{(1)}^2(\cdot) \neq \sigma_{(2)}^2(\cdot)$. A model with instability in the conditional variance function is alternatively referred to as a model with nonstationary variances.

Volatility Change Point Test Statistic and Estimator
Define the residuals obtained from the non-parametric estimation of the conditional mean function, standardized by the conditional standard deviations obtained from the conditional variance function, as

$$\hat{\varepsilon}_t = \frac{X_t - \hat{m}(X_{t-1})}{\hat{\sigma}(X_{t-1})},$$

where $\hat{m}(\cdot)$ is the Nadaraya-Watson estimator of the unknown regression function $m(\cdot)$.
The partial sums of squared residuals over the sample segments are $\sum_{t=1}^{\tau} \hat{\varepsilon}_t^2$ and $\sum_{t=\tau+1}^{n} \hat{\varepsilon}_t^2$. Define $\bar{\varepsilon}_{1,\tau}$ as the mean of the first $\tau$ squared residuals and $\bar{\varepsilon}_{\tau+1,n}$ as the mean of the last $n - \tau$ squared residuals,

$$\bar{\varepsilon}_{1,\tau} = \frac{1}{\tau} \sum_{t=1}^{\tau} \hat{\varepsilon}_t^2, \qquad \bar{\varepsilon}_{\tau+1,n} = \frac{1}{n-\tau} \sum_{t=\tau+1}^{n} \hat{\varepsilon}_t^2.$$

Once the jump point $\tau$ has been estimated, $\bar{\varepsilon}_{1,\tau}$ estimates the variance of the first $\tau$ observations and $\bar{\varepsilon}_{\tau+1,n}$ that of the last $n - \tau$ observations. The change point estimator $\hat{\tau}$ of the unknown change point position $\tau$ is obtained by minimizing the residual sum of squares over all possible segmentations of the sample,

$$\hat{\tau} = \arg\min_{\tau} V_n^2(\tau), \qquad V_n^2(\tau) = \sum_{t=1}^{\tau} \left( \hat{\varepsilon}_t^2 - \bar{\varepsilon}_{1,\tau} \right)^2 + \sum_{t=\tau+1}^{n} \left( \hat{\varepsilon}_t^2 - \bar{\varepsilon}_{\tau+1,n} \right)^2.$$

Further, let $\bar{\varepsilon}_{1,n} = \frac{1}{n} \sum_{t=1}^{n} \hat{\varepsilon}_t^2$ be the overall mean of the squared residuals. Bai (1994) shows that for each $\tau \in \{2, 3, \ldots, n-1\}$ the relation

$$V_n^2(\tau) = \sum_{t=1}^{n} \left( \hat{\varepsilon}_t^2 - \bar{\varepsilon}_{1,n} \right)^2 - D_n(\tau)$$

holds, where, considering a weighted $L_2$ norm,

$$D_n(\tau) = \frac{\tau (n - \tau)}{n} \left( \bar{\varepsilon}_{1,\tau} - \bar{\varepsilon}_{\tau+1,n} \right)^2.$$

This implies that the minimum of $V_n^2(\tau)$ occurs where $D_n(\tau)$ is maximal; $\max_\tau D_n(\tau)$ is thus a non-parametric statistic for change point detection in volatility. The statistic does not use any a priori information about the data, and it is suitable for processes which are not independent and identically distributed, as opposed to the Inclan and Tiao (1994) test.
A good choice of estimator for the change point $\tau$ is the point where the test statistic attains its global maximum, since the maximum usually occurs in the area of the "true" change point. This means that all possible time points are taken into account, hence the use of maximum statistics. Thus the estimator of the unknown change point in the volatility of financial returns is

$$\hat{\tau} = \arg\max_{2 \leq \tau \leq n-1} D_n(\tau).$$
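The statistic and estimator can be sketched directly. The weighting below follows the between-segment form $D_n(\tau) = \frac{\tau(n-\tau)}{n}(\bar{\varepsilon}_{1,\tau} - \bar{\varepsilon}_{\tau+1,n})^2$; the paper's exact weighting may differ, so treat this as an assumed variant.

```python
import numpy as np

def change_point_estimate(resid_sq):
    """Single change point estimate for the level of squared residuals:
    tau_hat = argmax over tau of (tau*(n-tau)/n)*(pre-mean - post-mean)^2,
    which is equivalent to minimizing the two-segment residual sum of
    squares. Cumulative sums give all segment means in O(n)."""
    e = np.asarray(resid_sq, dtype=float)
    n = len(e)
    cum = np.cumsum(e)
    taus = np.arange(2, n - 1)                 # candidate change points
    mean_pre = cum[taus - 1] / taus
    mean_post = (cum[-1] - cum[taus - 1]) / (n - taus)
    D = taus * (n - taus) / n * (mean_pre - mean_post) ** 2
    j = int(np.argmax(D))
    return int(taus[j]), float(D[j])

# A level shift from 1 to 4 at t = 100 is located exactly.
e = np.r_[np.ones(100), 4.0 * np.ones(100)]
tau_hat, D_max = change_point_estimate(e)
print(tau_hat)  # 100
```

On this noiseless example the statistic peaks at the true break, with $D_n(100) = \frac{100 \cdot 100}{200}(1-4)^2 = 450$.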

Consistency of the Volatility Change Point Estimator
To prove consistency of the volatility change point estimator, one needs to show that the change point fraction $\hat{k} = \hat{\tau}/n$ is consistent for the "true" change point fraction $k^* = \tau/n$ under $H_A$, not that the indexes themselves are consistent, Truong et al. (2018). The distance $|\hat{\tau} - \tau|$ between a true break point index and its estimated counterpart never converges to zero, hence the need to consider the break point fraction instead of the indexes themselves. One therefore shows that the relative error $|\hat{\tau} - \tau|/n$ decreases with the sample size $n$, meaning that $\hat{k} \xrightarrow{p} k^*$ as the sample size diverges. The results on consistency of the volatility change point estimator are summarized in Theorems 2 and 3 below.
Theorem 2 Let $\tau$ be the "true" location (position) of the change point under the alternative hypothesis and let $\hat{\tau}$ be its estimate, $\hat{\tau} = \arg\max_{\tau} D_n(\tau)$. Then $\hat{\tau}$ is consistent for the "true" break point $\tau$, Chen and Gupta (2001).
To prove the consistency of an estimator obtained by maximizing an objective function, one needs to show that the objective function $D_n(\tau)$, with the integer-valued variable $\tau = 2, \ldots, n-1$ as its argument, is uniformly close to its mean function (its expected value) and that this mean function has a unique maximum.
Theorem 3 below, on the convergence in probability of the change point fraction $\hat{k}$, is the main result in proving the consistency of the estimated change point fraction.
Theorem 3 Consider a sample of squared residuals $\hat{\varepsilon}_1^2, \hat{\varepsilon}_2^2, \ldots, \hat{\varepsilon}_n^2$ satisfying the alternative change point hypothesis, and the change point estimator $\hat{\tau}$ given in (18). If the sequences $\{\varepsilon_{1,t}^2, t \in \mathbb{Z}\}$ and $\{\varepsilon_{2,t}^2, t \in \mathbb{Z}\}$ satisfy $E(\varepsilon_{2,t}^2) - E(\varepsilon_{1,t}^2) = \Delta$, where $\Delta < \infty$ denotes the finite magnitude of the jump in the conditional variance function, then for $\hat{k} = \hat{\tau}/n$,

$$P\left( |\hat{k} - k^*| > \frac{B^2}{\Delta^2}\, n^{-\frac{1}{2}} \right) \longrightarrow 0,$$

where $0 < B < \infty$ is a positive constant and $k^* = \tau/n$. To prove Theorem 3, suppose that $\{\varepsilon_{1,t}^2, t \in \mathbb{Z}\}$ and $\{\varepsilon_{2,t}^2, t \in \mathbb{Z}\}$ are two conditionally heteroscedastic processes of regression residuals describing the pre-break and post-break sub-samples respectively. Further, suppose one obtains a sample $\hat{\varepsilon}_1^2, \hat{\varepsilon}_2^2, \ldots, \hat{\varepsilon}_n^2$ such that $\hat{\varepsilon}_t^2 = \varepsilon_{1,t}^2$ for $1 \leq t \leq \tau$ and $\hat{\varepsilon}_t^2 = \varepsilon_{2,t}^2$ for $\tau < t \leq n$. Also, assume that the two sequences of residuals $\{\varepsilon_{1,t}^2, t = 1, 2, \ldots, \tau\}$ and $\{\varepsilon_{2,t}^2, t = \tau+1, \ldots, n\}$ have different conditional variance functions, $\sigma_{(1)}^2(X_{t-1})$ for $1 \leq t \leq \tau$ and $\sigma_{(2)}^2(X_{t-1})$ for $\tau < t \leq n$, where $\tau = k^* n$ with $0 < k^* < 1$. The conditional variance function is assumed to change at time $t = \tau$. One assumes that $\varepsilon_{1,t}^2$ is a process independent of $\varepsilon_{2,t}^2$, resulting in a discontinuity in the conditional variance function of the residuals at $\tau$. Since the returns are dependent, being obtained from a single instrument, the change in the conditional variance function results in observations whose starting values do not come out of stationary models; one therefore assumes that the observations after the change are based on a time series which is stationary.
The disordered sequence $\{\hat{\varepsilon}_t^2\}$, which contains a change point, is no longer a stationary sequence; $\tau$ is referred to as the change point and $\hat{\tau}$ is its estimator. $|E D_n(t)|$ achieves its maximum at $t = \tau$, $E(D_n(t))$ has a unique maximum at $\tau$, and $D_n(t) - E(D_n(t))$ is uniformly small in $\tau$ for large $n$. The first partial sum is weighted by $\frac{\tau}{n}$, converging to $k^*$, the percentage of chronologically ordered observations before the change point, while the second sum is weighted by $1 - \frac{\tau}{n}$, converging to $1 - k^*$, the percentage of chronologically ordered observations after the change. The change point $1 < \tau < n$ is parametrized as $\tau = n k^*$ with $k^* \in (0, 1)$.
Both cases, $t \leq \tau$ and $t > \tau$, are considered even though the argument is symmetric.
From Equations (23) and (24) one obtains an expression for $E D_n(t)$ as a function of $k = t/n$. If $k \leq k^*$, then by the mean value theorem the difference $E D_n(\tau) - E D_n(t)$ is bounded below by a constant multiple of $\Delta^2 |k - k^*|$. If $k > k^*$, then $1 - k < 1 - k^*$ and the analogous bound holds. Combining the two cases and using the triangle inequality in the form $|a| = |a - b + b| \leq |a - b| + |b|$, and moving $E D_n(\tau) - E D_n(t)$ to the left-hand side of the resulting inequality, one obtains, after replacing $k$ by $\hat{k}$ and noting that $D_n(\tau) \leq D_n(\hat{\tau})$,

$$C \Delta^2 |\hat{k} - k^*| \leq 2 \max_{t} \left| D_n(t) - E D_n(t) \right|$$

for some constant $C > 0$ depending on $k^*$. Considering $D_n(t)$ as given in Equation (19), it remains to bound $\max_{t} |D_n(t) - E D_n(t)|$. Writing $D_n(t)$ in terms of the partial sums of the squared residuals, opening the brackets, distributing $\frac{1}{n}$ and introducing the magnitude of the jump, the maximum reduces, on adding the two parts, to the maximum of weighted partial sums of the centered squared residuals. A general Hájek-Rényi type inequality for dependent processes for the maximum of weighted sums, Theorem 4 below, is then applied, Kokoszka et al. (2000).
Theorem 4 Let $Y_1, Y_2, \ldots, Y_n$ be any random variables with finite second moments and let $b_1, b_2, \ldots, b_n$ be a decreasing positive sequence of non-negative constants. Then for any $\epsilon > 0$,

$$P\left( \max_{1 \leq k \leq n} b_k \left| \sum_{j=1}^{k} Y_j \right| > \epsilon \right) \leq \frac{C}{\epsilon^2} \sum_{k=1}^{n} b_k^2 \, E Y_k^2,$$

where $C$ is a constant depending on the dependence structure of the $Y_k$. Applying this inequality to the weighted partial sums above implies that $\max_t |D_n(t) - E D_n(t)| = O_P(n^{-1/2})$, and substituting this result into Equation (33) yields

$$P\left( |\hat{k} - k^*| > \frac{B^2}{\Delta^2}\, n^{-\frac{1}{2}} \right) \longrightarrow 0,$$

where $\tilde{k} = \frac{1}{2}(k^*)^{-\frac{1}{2}} (1 - k^*)^{-\frac{1}{2}} \min\{k^*, 1 - k^*\}$ and $B$ is some positive constant, which proves Theorem 3 as initially stated. From Equation (43) it can be seen that as $n \to \infty$, $\frac{B^2}{\Delta^2} n^{-\frac{1}{2}} \to 0$, which completes the proof of consistency of the change point estimator. One can observe that the consistency of the change point estimator depends on the magnitude of change, since $\Delta$ is kept on the right-hand side.

A Simulation Study on Consistency of the Change Point Estimator
The performance of the change point estimator with respect to sample size, change point location and magnitude of change is investigated by simulating an ARMA(1,1)-ARCH(1) model with a single change point in the volatility function, where $\Delta \in \{0.3, 0.5, 0.7\}$, the sample sizes are $n \in \{500, 1000, 2000, 4000\}$ and the true change point locations are $\tau \in \left\{ \frac{1}{3}n, \frac{1}{2}n, \frac{2}{3}n \right\}$. A table is created after 1000 bootstrap samples for each sample size. In each simulation, the estimates of the change point depended strongly on the location of the change point, the magnitude of change and the sample size, in accordance with Equation (40). As the sample size grew unbounded, the estimation error decreased towards zero. From the sequence of estimates $\hat{\tau}$, the estimator was observed to work better in larger samples than in smaller ones, and in samples with larger magnitudes of shift than in those with smaller magnitudes, as evident from Table 1. Also, the bias of estimation, $\frac{|\hat{\tau} - \tau|}{n}$, decreases as the sample size increases and as the magnitude of the shift $\Delta$ grows, which further verifies that the estimator $\hat{\tau}$ is consistent. The accuracy of change point detection is seen to increase with the sample size.
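A reduced version of this experiment can be sketched as follows. The ARMA(1,1)-ARCH(1) coefficients are illustrative assumptions (the paper's Table 1 settings are not reproduced), and for brevity the statistic is applied to squared returns rather than to non-parametric residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, tau, delta):
    """ARMA(1,1)-ARCH(1) returns with an abrupt shift of size delta in the
    ARCH intercept at time tau (illustrative, assumed parameter values)."""
    x, u = np.zeros(n), np.zeros(n)
    z = rng.standard_normal(n)
    for t in range(1, n):
        omega = 0.1 + (delta if t > tau else 0.0)
        u[t] = np.sqrt(omega + 0.3 * u[t - 1] ** 2) * z[t]
        x[t] = 0.2 * x[t - 1] + u[t] + 0.1 * u[t - 1]
    return x

def estimate_tau(x):
    """CUSUM-type weighted between-segment statistic on squared returns."""
    e = x.astype(float) ** 2
    n = len(e)
    cum = np.cumsum(e)
    taus = np.arange(2, n - 1)
    diff = cum[taus - 1] / taus - (cum[-1] - cum[taus - 1]) / (n - taus)
    return int(taus[np.argmax(taus * (n - taus) / n * diff**2)])

n, true_tau, delta = 2000, 1000, 0.7
est = estimate_tau(simulate(n, true_tau, delta))
print(abs(est - true_tau) / n)   # relative error; small for this large shift
```

Repeating this over the grid of $(n, \tau, \Delta)$ values and averaging $|\hat{\tau} - \tau|/n$ over replications reproduces the qualitative pattern described in the text: the relative error shrinks as $n$ and $\Delta$ grow.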

Sampling Distribution of the Change Point Estimator
Besides being a function, the change point estimator $\hat{\tau}$ is a random variable. As a random variable, an estimator itself follows some probability distribution, called the sampling distribution of the estimator; the distribution of all estimated change points is this sampling distribution. Histograms of the estimated change points when the "true" change point is located at $\frac{1}{3}n$, $\frac{1}{2}n$ and $\frac{2}{3}n$, for a sample size of $n = 4000$ after 500 bootstrap samples, where $z_t$ is a sequence of i.i.d. random variables symmetric around zero from a normal distribution and $\Delta = 0.3$, are shown in Figures 1, 2 and 3 respectively. Figures 1, 2 and 3 show that the sampling distribution of $\hat{\tau}$ is skewed to the left and that the degree of skewness depends on the position of the change point: it increases as the change point moves to the right-hand side of the sample. The distribution is seen to have a long, slowly decaying tail (it is leptokurtic).
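The shape of this sampling distribution can be explored with a reduced Monte Carlo sketch; a plain variance-shift model stands in for the NP-ARCH data-generating process, with fewer replications and a smaller sample than in the text, so all settings here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def estimate_tau(e):
    """argmax of the weighted between-segment statistic."""
    n = len(e)
    cum = np.cumsum(e)
    taus = np.arange(2, n - 1)
    diff = cum[taus - 1] / taus - (cum[-1] - cum[taus - 1]) / (n - taus)
    return int(taus[np.argmax(taus * (n - taus) / n * diff**2)])

# Monte Carlo sampling distribution of tau_hat for a variance shift of
# delta = 0.3 located at 2n/3, echoing the setting behind Figure 3.
n, true_tau, delta = 1200, 800, 0.3
estimates = []
for _ in range(300):
    z = rng.standard_normal(n)
    sig2 = np.where(np.arange(n) < true_tau, 1.0, 1.0 + delta)
    estimates.append(estimate_tau(sig2 * z**2))
estimates = np.array(estimates)
print(np.median(estimates))   # concentrates near the true location 800
```

Plotting a histogram of `estimates` (e.g. with matplotlib) gives a picture analogous to Figures 1 to 3: the mass piles up around the true location, with a longer tail on one side when the break sits away from the sample center.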

Application to Foreign Exchange Rate Data Set
The existence of a change point in the conditional variance (volatility) of USD/KSH daily returns from 1st January 2010 to 27th November 2020 is investigated, where $n = 2839$ historical exchange rates yield 2838 continuously compounded returns and, after lagging the returns by one, 2837 continuously compounded returns at lag one. The exchange rates are plotted in Figure 4a. Since the conditional variances are obtained via the conditional expectation of the squared residuals, the squared residuals are plotted in Figure 4b.

Conclusion and Recommendations
In this research, a non-parametric procedure for estimating a change point in the volatility of financial returns is considered. The procedure allows for change point estimation in sequences with conditionally heteroscedastic variances. The consistency of the change point estimator is proven theoretically and shown through simulations. The estimator is observed to perform best when the change point is located around the middle of the sample, when the magnitude of change is large and as the sample size grows. The sampling distribution of the change point estimator, shown through histograms, is found to be negatively skewed, with the degree of skewness increasing as the change point moves to the right.
The change point estimation approach can be applied to multidimensional non-parametric models of the form $X_t = m(X_{t-1}, \ldots, X_{t-d}) + \sigma(X_{t-1}, \ldots, X_{t-d}) z_t$ in which the volatility function changes over time. One can also consider the case where both the conditional mean function and the conditional variance function change. In such cases, the functions $m(\cdot)$ and $\sigma(\cdot)$ should be estimated using multivariate kernel methods, with care taken due to the curse of dimensionality. Situations with change points in both the conditional mean function and the conditional volatility function would also be of much interest. The authors aim to obtain the limit distribution of the change point test statistic in a subsequent paper.
Theorem 5 [Taylor's theorem ]

$$g(x) = g(a) + g'(a)(x - a) + \frac{g''(a)}{2!}(x - a)^2 + \cdots + \frac{g^{(n)}(a)}{n!}(x - a)^n + R_n(x),$$

where $R_n(x) = \frac{g^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}$ for some number $c$ between $a$ and $x$. The remainder term $R_n(x)$ gives the error in the approximation. If $R_n(x) \to 0$ as $n \to \infty$, one obtains a sequence of better and better approximations of $g$. When $n = 0$, Taylor's theorem reduces to the Mean Value Theorem, which is itself a consequence of Rolle's theorem, Sahoo and Riedel (1998).
Theorem 6 [Rolle's theorem ] If a function $g$ is continuous on the closed interval $[a, b]$ and differentiable on the open interval $(a, b)$ with $g(a) = g(b)$, then $g'(c) = 0$ for some $c$ with $a \leq c \leq b$, Sahoo and Riedel (1998).
Inequality 1 Jensen's inequality: This can be demonstrated by recalling that the variance of every random variable $X$ is non-negative, i.e. $\mathrm{var}(X) = E(X^2) - (EX)^2 \geq 0$, thus $E(X^2) \geq (EX)^2$.
Suppose $g(\cdot) : \mathbb{R} \to \mathbb{R}$ is a convex function, e.g. $g(x) = x^2$, and assume that the expectations of $X$ and $g(X)$ exist (are finite). Then Jensen's inequality states that for any convex function $g$,

$$E(g(X)) \geq g(E(X)).$$

For the proof, see Pishro-Nik (2016).
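As a quick numerical illustration of the inequality for $g(x) = x^2$ (the simulated data are an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# E(X^2) >= (E X)^2, with the gap equal to var(X) >= 0. The same identity
# holds exactly for sample moments, which is checked here.
x = rng.standard_normal(10_000) + 0.5
lhs, rhs = np.mean(x**2), np.mean(x) ** 2
print(lhs >= rhs, np.isclose(lhs - rhs, np.var(x)))  # prints: True True
```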
Inequality 2 Regular triangle inequality: This inequality states that for any triangle, the sum of the lengths of any two sides is greater than or equal to the length of the remaining side. For vectors $x$ and $y$ in an $L^p$ space,

$$\|x + y\| \leq \|x\| + \|y\|.$$

For the proof, see Pedoe (2013), or consider the Cauchy-Schwarz inequality.
Inequality 3 Cauchy-Schwarz inequality: Let $x$ and $y$ be vectors in an $L^p$ space. Then

$$|\langle x, y \rangle| \leq \|x\| \, \|y\|.$$

If one of the two vectors is zero, both sides are zero, so one may assume that both $x$ and $y$ are non-zero in order to prove the inequality.