Detection of Multiple Change Points by the Filtered Derivative and False Discovery Rate

Let X = (X_1, X_2, . . ., X_n) be a time series, that is, a sequence of random variables indexed by the time t = 1, 2, . . ., n. We assume the existence of a segmentation τ = (τ_1, τ_2, . . ., τ_K) such that the X_i are independent identically distributed (i.i.d.) random variables for i ∈ (τ_k, τ_{k+1}] and k = 0, . . ., K, where by convention τ_0 = 0 and τ_{K+1} = n. In the literature, there exist two main kinds of change point detection: on-line and off-line. In this work, we consider only off-line change point analysis, where the number of change points is unknown. The proposed method is based on the Filtered Derivative, with a second step based on the False Discovery Rate. We compare this new method numerically with the Filtered Derivative with p-Value.


Introduction
Change-point detection is an important problem in many applications, and it has been well studied for a long time; see e.g. the textbooks (Basseville & Nikiforov, 1993; Brodsky & Darkhovsky, 1993; Csörgő & Horváth, 1997), or (Hušková & Meintanis, 2006b; Gombay & Serban, 2009) for an updated overview. Depending on the method of data acquisition, there exist two different kinds of change detection: a posteriori or off-line change-point detection arises when the series of observations is complete at the time we process the data, whereas in sequential analysis the detection is performed on-line. In this work, we only consider the a posteriori problem. In this century, the state-of-the-art method has been the Penalized Least Squares criterion (PLS): when the number of change points is known, PLS minimizes a contrast function (Bai & Perron, 1998; Lavielle & Moulines, 2000); when the number of change points is unknown, many authors use a penalized version of the contrast function (Lavielle & Teyssière, 2006; Lebarbier, 2005). From a computational point of view, the PLS method uses dynamic programming and needs to compute a matrix. Therefore, the time and memory complexity of the PLS algorithm is of order O(n²), where n denotes the size of the dataset. Due to the data deluge, datasets are larger and larger, and the computational complexity of statistical methods has become a challenge. Cumulative sums can be computed iteratively and therefore lead to algorithms with both time and memory complexity of order O(n). Among these methods, the Filtered Derivative was introduced by Benveniste and Basseville (1984) and Basseville and Nikiforov (1993). The advantage of the Filtered Derivative method is its time and memory complexity, both of order O(n). On the other hand, the Filtered Derivative method leads to many false discoveries of change points. Recently, Bertrand, Fhima, and Guillin (2011) have introduced a method called Filtered Derivative with p-value (FDpV) (see below for more
details). Change detection by the FDpV method has been successfully applied to real-life large datasets (n = 120,000 or n = 40,000) of heartbeat series (Bertrand, Fhima, & Guillin, 2011; Ayache & Bertrand, 2011; Khalfa, Bertrand, Boudet, Chamoux, & Billat, 2012). However, Step 2 of the FDpV algorithm uses single hypothesis tests, and therefore does not allow controlling the rate of false discoveries. In this work, we propose to replace the family of single hypothesis tests of Step 2 of the FDpV method by the use of the False Discovery Rate (FDR), which was introduced for multiple tests by Benjamini and Hochberg (1995). Moreover, we investigate the effect of adding a Step 3, in order to take advantage of the enlargement of the windows when the number of potential change points decreases.
The rest of this paper is structured as follows: Section 1 describes the problem and the comparison criteria.
Section 2 recalls the methods (FDpV and PLS) used for off-line change detection. Section 3 describes the new method proposed in this work (FDqV), and Section 4 contains the numerical comparison. Finally, the choice of the extra-parameters for the FDpV and FDqV methods is discussed in Section 5.

Description of the Problem
In this section, we describe the problem of change point analysis and give some comparison criteria. For the sake of simplicity, we restrict ourselves to a toy model, since the efficiency of the FDpV method has already been checked on real-life datasets; see Bertrand, Fhima, and Guillin (2011), Ayache and Bertrand (2011), and Khalfa, Bertrand, Boudet, Chamoux, and Billat (2012).

Change Point Analysis: A Toy Model
Let X = (X_1, X_2, . . ., X_N) be a series indexed by the time t = 1, 2, . . ., N. We assume that there exists a segmentation τ = (τ_1, . . ., τ_K) such that the X_t are independent identically distributed (i.i.d.) random variables for t ∈ (τ_k, τ_{k+1}] and k = 0, . . ., K, where by convention τ_0 = 0 and τ_{K+1} = N. The simplest model is X_t a sequence of independent Gaussian variables with X_t ∼ N(μ(t), σ), where N(μ, σ) denotes the Gaussian law with mean μ and standard deviation σ = 1, and t → μ(t) is a piecewise constant map, that is, μ(t) = μ_k for all times t ∈ (τ_k, τ_{k+1}]. We will use this model in the remainder of this work.
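For concreteness, the toy model can be simulated in a few lines (a minimal sketch; the function name and the seed are our own choices, and the configuration shown is the one used in Section 4):

```python
import numpy as np

def simulate_toy_model(n, taus, mus, sigma=1.0, seed=0):
    """Toy model: piecewise-constant mean mu(t) plus i.i.d. N(0, sigma) noise.
    `taus` are the interior change points tau_1 < ... < tau_K; `mus` gives
    the K + 1 segment means, with tau_0 = 0 and tau_{K+1} = n by convention."""
    rng = np.random.default_rng(seed)
    bounds = [0] + list(taus) + [n]
    mu = np.empty(n)
    for k in range(len(mus)):
        mu[bounds[k]:bounds[k + 1]] = mus[k]   # mu(t) = mu_k on (tau_k, tau_{k+1}]
    return mu + sigma * rng.standard_normal(n), mu

# Configuration of Section 4: four change points, five segment means.
X, mu = simulate_toy_model(5000, [1000, 2000, 3500, 4500], [2.5, 3, 4.5, 3, 3.5])
```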

Criterion
1) The quality of the estimation for one sample can be measured by two criteria: i) the error on the number of detected change points, and ii) the integrated square error (ISE). Actually, we can reformulate the problem as the estimation of a noisy signal. The signal is s(t) = μ_k for t ∈ (τ_k, τ_{k+1}], where we have set by convention τ_0 = 0 and τ_{K+1} = N. The estimated signal is then ŝ(t) = μ̂_k for t ∈ (τ̂_k, τ̂_{k+1}], and the integrated square error (ISE) is ISE = (1/N) Σ_{t=1}^{N} (ŝ(t) − s(t))².
2) However, a result on just one simulation is hazardous. So we have to perform M simulations, with e.g. M = 1,000, and calculate the mean integrated square error (MISE).
3) The second family of criteria is the time complexity and the memory complexity, that is, the mean CPU time for estimating s and the quantity of memory used.
Remark 1. Other criteria are possible, like the BIC (Schwarz) criterion based on the likelihood; see e.g. Ciuperca (2011a).
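With these conventions, the ISE and MISE criteria can be sketched as follows (a minimal sketch; function names are ours):

```python
import numpy as np

def ise(s_hat, s):
    """Integrated square error (1/N) * sum_t (s_hat(t) - s(t))^2 between
    the estimated signal s_hat and the true signal s, both of length N."""
    s_hat, s = np.asarray(s_hat), np.asarray(s)
    return float(np.mean((s_hat - s) ** 2))

def mise(estimates, s):
    """Mean ISE over M independent replications (rows of `estimates`)."""
    return float(np.mean([ise(s_hat, s) for s_hat in estimates]))
```

In the Monte Carlo experiments of Section 4, `estimates` would hold the M = 1,000 reconstructed signals.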

Some Methods for Change Point Analysis
In this section, we recall some methods for change point analysis: The Penalized Least Square Error (PLS) and the Filtered Derivative with p-value (FDpV).

Penalized Least Square Method
Set S_K = {τ such that length(τ) = K, that is τ = (τ_1, . . ., τ_K)}, the set of all possible configurations of changes of length K. Firstly, when the number of change points K is known, for each configuration of changes τ ∈ S_K, we can define the square error

Q(τ) = Σ_{k=0}^{K} Σ_{t ∈ (τ_k, τ_{k+1}]} (X_t − mean(X, (τ_k, τ_{k+1}]))²,

where mean(X, Box) denotes the mean of the family X_t for the indices t ∈ Box. Next, we search for the configuration of changes in S_K which minimizes the square error Q(τ), and we denote it by τ̂_K. Secondly, we consider the case where the number of change points K is unknown. We remark that the map K → Q(τ̂_K) is decreasing, so minimizing the function Q(τ) with an unknown number of changes would lead to considering as optimal the trivial configuration of changes τ = (1, 2, . . ., N). To avoid this drawback, we add a penalty term proportional to the length of the change point configuration; eventually we want to minimize

Q(τ) + β × length(τ).

Different choices of the penalty coefficient β are possible: one choice is proposed in Lavielle and Moulines (2000), another in Lebarbier (2005) and Birgé and Massart (2007), where the penalty depends on the variance σ², assumed to be constant and known, and on the size n of the series. In Figure 1 below, we have plotted the contrast function and the penalized contrast function (Lavielle & Moulines, 2000). We clearly see that the penalized contrast is almost horizontal, thus the location of the minimum fluctuates strongly with respect to the choice of the parameter β. Let us stress that both time and memory complexity of the PLS method are O(n²); see e.g. Lavielle and Moulines (2000), Lavielle and Teyssière (2006), and Lebarbier (2005).
Remark 2. For the sake of completeness, let us remark that penalized least squares as in Lavielle and Moulines (2000), Lavielle and Teyssière (2006), and Lebarbier (2005) can be replaced by penalized least absolute deviations; see e.g. Ciuperca (2011b). However, both time and memory complexity will still be of order at least O(n²).
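As a point of comparison, a penalized least squares segmentation by dynamic programming might be sketched as follows (a minimal sketch under our own naming; production implementations typically also bound the number of segments or the minimal segment length, which we omit). The double loop makes the O(n²) cost of the method visible:

```python
import numpy as np

def pls_segmentation(X, beta):
    """Minimize Q(tau) + beta * length(tau) by dynamic programming.
    best[j] is the minimal penalized cost of segmenting X[:j]; prev[j]
    records the start of the last segment.  O(n^2) time and memory."""
    n = len(X)
    S = np.concatenate([[0.0], np.cumsum(X)])        # prefix sums
    S2 = np.concatenate([[0.0], np.cumsum(X ** 2)])  # prefix sums of squares

    def seg_cost(i, j):                              # SSE of segment X[i:j]
        s = S[j] - S[i]
        return S2[j] - S2[i] - s * s / (j - i)

    best = np.full(n + 1, np.inf)
    prev = np.zeros(n + 1, dtype=int)
    best[0] = -beta                                  # the first segment pays one penalty
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + seg_cost(i, j) + beta      # one more segment, one more penalty
            if c < best[j]:
                best[j], prev[j] = c, i
    cps, j = [], n                                   # backtrack the change points
    while j > 0:
        j = prev[j]
        if j > 0:
            cps.append(j)
    return sorted(cps)
```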

Description of the Filtered Derivative with p-Value Method (FDpV) (Bertrand, Fhima, & Guillin, 2011)
This method is a two-step procedure for change detection: Step 1 is based on the Filtered Derivative and selects a set of potential change points, whereas Step 2 calculates the p-value associated to each potential change point, in order to disentangle right change points from false alarms. More precisely, the method is defined as follows:

1) Step 1: Computation of the filtered derivative function. The filtered derivative function (Bertrand, 2000; Benveniste & Basseville, 1984; Basseville & Nikiforov, 1993) is defined by

FD(t, A) = μ̂(t + 1, t + A) − μ̂(t − A + 1, t),

where μ̂(t + 1, t + A) denotes the empirical mean of the variables X_j on (t + 1, t + A) and A is the window size.
Next, remark that the quantities A × FD(t, A) can be iteratively calculated by using

A × FD(t + 1, A) = A × FD(t, A) + X_{t+A+1} − 2 X_{t+1} + X_{t−A+1}.

Thus, the computation of the whole function t → FD(t, A) for t ∈ [A, n − A] requires O(n) operations and the storage of n real numbers.
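The O(n) computation above can equivalently be written with prefix sums (a minimal sketch; the function name is ours):

```python
import numpy as np

def filtered_derivative(X, A):
    """FD(t, A) = mean(X_{t+1..t+A}) - mean(X_{t-A+1..t}) for t = A, ..., n - A,
    computed with prefix sums in O(n) time and memory."""
    n = len(X)
    S = np.concatenate([[0.0], np.cumsum(X)])   # S[t] = X_1 + ... + X_t
    t = np.arange(A, n - A + 1)
    right = S[t + A] - S[t]                     # sum over (t, t + A]
    left = S[t] - S[t - A]                      # sum over (t - A, t]
    return (right - left) / A
```

A change of size δ in the mean produces a hat of height |δ| in |FD| around the change point, which is what Step 1 detects.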
2) The determination of the potential change points. Let us point out that the absolute value of the filtered derivative |FD| presents hats in the vicinity of the change points; see Figure 2 below. In Bertrand (2000) and Bertrand, Fhima, and Guillin (2011), we have given the asymptotic distribution of the maximum of |FD| under the null hypothesis. Therefore, we can fix the type I error at level p_1, and then deduce the threshold C_1 corresponding to Pr(max_{t ∈ [A, n−A]} |FD(t, A)| > C_1) = p_1. We can remark the existence of many local maxima in the vicinity of each right change point (see Figure 4, and Bertrand, 2000; Bertrand, Fhima, & Guillin, 2011 for a theoretical explanation). On the other hand, if there is no noise, that is when σ = 0, we get hats of width 2A and height μ_{k+1} − μ_k at each change point τ_k; see Figure 3 and Bertrand, Fhima, and Guillin (2011). When there is noise (e.g. σ = 1), we get the landscape of Figure 4.

3) Step 2: Elimination of false alarms by p-value. A potential change point τ*_k can be an estimator of a right change point or a false alarm. In the first case, there is an estimation error on the location of the change, so we have to discard a small vicinity of size ε_k around each point τ*_k; see Bertrand (2000) and Bertrand, Fhima, and Guillin (2011). Then, for each segment, we calculate an estimation of the mean,

μ̂_k = mean(X, (τ*_{k−1} + ε_{k−1}, τ*_k − ε_k]).    (5)

Next, we have to eliminate false detections in order to keep (as far as possible) only the right change points. In Bertrand, Fhima, and Guillin (2011), we use single hypothesis tests as Step 2: for each potential change point τ*_k, we test whether the mean is the same on both adjacent segments or not. More formally, for all 1 ≤ k ≤ K*, we apply the hypothesis test

(H_{0,k}): μ_k = μ_{k+1} versus (H_{1,k}): μ_k ≠ μ_{k+1},

where the μ_k's are defined by (5). Using these single hypothesis tests, we calculate the p-values p*_1, . . ., p*_{K*} associated to the potential change points τ*_1, . . ., τ*_{K*}.

4) Calculation of the p-values. We choose Student's T statistic. Indeed, under the null hypothesis, t*_k has a Student distribution whose number of degrees of freedom depends on the segment sizes; thus for A > 30 the distribution of t*_k is approximately Gaussian and we can set

p*_k = 2 (1 − Φ(|t*_k|)),

where Φ is the cumulative distribution function of the standard Gaussian law. Let us point out a slight difficulty: since τ*_k maximizes the criterion |FD(t, A)|, τ*_k is also a random variable. We avoid this drawback by discarding a small vicinity of size ε_k around each selected change point; see Formula (5) and Bertrand, Fhima, and Guillin (2011, Rem. 2.1, pp. 178-179) for details.
In Bertrand, Fhima, and Guillin (2011), we only keep the change points corresponding to a p-value smaller than a fixed threshold p_2. Consequently, Step 2 is much more selective and allows us to deduce an estimator of the piecewise constant map t → μ(t); see Figure 5 below. We propose a new method derived from the FDpV one: we replace Step 2 of FDpV by the False Discovery Rate method (FDR), and we call this method FDqV.
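Both FDpV and FDqV rely on these p-values. The computation and the fixed-threshold selection of FDpV Step 2 might be sketched as follows (function names, and the use of a known σ in place of the exact Student statistic, are our own simplifications):

```python
import numpy as np
from math import erf, sqrt

def gaussian_pvalue(x_left, x_right, sigma=1.0):
    """Two-sided p-value for H0: equal means, with Gaussian noise of known
    sigma (for windows larger than 30 the Student statistic is approximately
    Gaussian, as noted in the text)."""
    n1, n2 = len(x_left), len(x_right)
    t = (np.mean(x_right) - np.mean(x_left)) / (sigma * sqrt(1 / n1 + 1 / n2))
    Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard Gaussian CDF
    return 2 * (1 - Phi(abs(t)))

def fdpv_step2(X, potential_cps, p2=0.05, eps=10, sigma=1.0):
    """Step 2 of FDpV (sketch): for each potential change point, compare the
    means of the two adjacent segments, discarding a vicinity of size eps
    around each point, and keep the points whose p-value is below p2."""
    bounds = [0] + sorted(potential_cps) + [len(X)]
    kept = []
    for k, tau in enumerate(sorted(potential_cps), start=1):
        left = X[bounds[k - 1] + eps : tau - eps]
        right = X[tau + eps : bounds[k + 1] - eps]
        if gaussian_pvalue(left, right, sigma) < p2:
            kept.append(tau)
    return kept
```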
Step 1. The first step is the same as in FDpV: we compute the filtered derivative function t → FD(t, A) and then select the potential change points as the local maxima of the function t → |FD(t, A)| exceeding the threshold C_1.
Step 2. The novelty of this work is the use of the False Discovery Rate thresholding procedure (Benjamini & Hochberg, 1995). The computation of the p-values p*_k is the same as in Step 2 of the FDpV method. However, we then use the Benjamini-Hochberg multiple testing procedure: (i) we sort the p-values in increasing order, p*_(1) ≤ . . . ≤ p*_(K*); (ii) we choose a threshold q corresponding to the rate of false alarms, or FDR; (iii) we keep the change points whose ordered p-values satisfy p*_(k) ≤ k q / K*. Thus, Step 2 selects the change points corresponding to a p-value smaller than a fixed threshold (FDpV) or a linear threshold (FDqV). In both cases, the p-values are computed at the potential change points selected in Step 1. We recall that these p-values are therefore calculated with windows larger than A; in other words, we have more information on the mean at Step 2 than at Step 1. This remark has suggested adding a third step, which is the same as Step 2 but with larger windows. More formally, Step 2 of FDqV can be seen as a map FDR: (X, τ*, q) → τ̃, where τ̃ = (τ̃_1, . . ., τ̃_K̃) with K̃ ≤ K*. Thus, at Step 3, we plug τ̃ instead of τ* as input of Step 2, that is, FDR: (X, τ̃, q) → τ̂, where τ̂ = (τ̂_1, . . ., τ̂_K̂) with K̂ ≤ K̃ ≤ K*.
To sum up, in Bertrand, Fhima, and Guillin (2011), we made simple statistical tests and compared the means pairwise, choosing a threshold of critical probability to eliminate the false alarms. The novelty of this work is the use of a multiple test with the FDR fixed at level q. Both time and memory complexity of FDqV remain of order O(n).
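The FDR thresholding used in Step 2 of FDqV is the Benjamini-Hochberg step-up procedure cited above; a minimal sketch (the function name is ours):

```python
import numpy as np

def benjamini_hochberg(pvalues, q):
    """Benjamini-Hochberg step-up procedure: sort the K* p-values in
    increasing order and reject all hypotheses up to the largest k with
    p_(k) <= k * q / K*.  Returns the indices of rejected hypotheses,
    i.e. the change points that are kept."""
    p = np.asarray(pvalues)
    K = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, K + 1) / K        # the linear threshold k q / K*
    below = p[order] <= thresh
    if not below.any():
        return []
    kmax = np.max(np.nonzero(below)[0])         # largest k with p_(k) <= k q / K*
    return sorted(order[: kmax + 1].tolist())
```

Note the step-up rule: every hypothesis ordered before the largest crossing point is rejected, even if its own p-value sits above the line.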

Numerical Comparisons
In this section, we compare numerically the FDpV method and the newly proposed FDqV method, using Monte Carlo simulation and the MISE criterion.

Simulations Based on one Realization
Firstly, we present the simulation for one realization, which corresponds to Figure 2 and Figures 4-8 above. For n = 5,000, we have simulated one replication of a sequence of Gaussian random variables (X_1, . . ., X_n) with variance σ² = 1 and mean μ_t = f(t), where f is a piecewise constant function with four change points at times τ = (1000, 2000, 3500, 4500) and means μ = (2.5, 3, 4.5, 3, 3.5). Both the FDpV and FDqV methods depend on extra-parameters, namely the window size A, the threshold C_1 of Step 1, the maximum number of changes K_max for Step 1, the threshold p_2 of Step 2 of FDpV, the uncertainties ε_k on the location of the changes, and the threshold q of the False Discovery Rate for Step 2 and Step 3 of FDqV. A brief discussion of the choice of the extra-parameters is postponed to Section 5. We have made the following choices:

Monte-Carlo Simulation
In this subsection, we made M = 1,000 independent simulations of sequences of Gaussian random variables (X^j_1, . . ., X^j_N) with variance σ² = 1 and mean μ_t = f(t), for j = 1, . . ., M and t = 1, . . ., N. On each sample, we apply the FDpV method and the FDqV method with the extra-parameters given above. The mean value of (K̂ − K) is 3.38 with standard deviation (std) 1.64 for FDpV, against a mean of 2.84 with std = 1.59 for FDqV at Step 2, and a mean K̂ − K = 0.65 with std = 1.98 for FDqV at Step 3. Thus we can see that the number of false discoveries is smaller for FDqV. Note that at Step 1, we have mean(K̂ − K) = 12.

Numerical Conclusion
We clearly see that the overestimation of the number of change points is smaller for the FDqV method than for the FDpV one. Moreover, for the MISE criterion, FDqV is better than FDpV, which is itself better than the plain Filtered Derivative.

How to Choose the Extra-Parameters?
In this section, we address the question of the choice of the extra-parameters for the FDpV method. Natural criteria are the errors of type I and type II, the so-called probability of false alarm (PFA), denoted α, and probability of non-detection (PND), denoted β.

Errors of Type I and Type II at Step 1
We stress that the error of type II (PND) is more important than the error of type I (PFA), at least for the ISE criterion: indeed, just one missed change point strongly increases the error. On the other hand, as pointed out in Bertrand (2000), when there is more than one change, the notion of probability of non-detection should be made more precise: for each right change point τ_k, we define the local PND as the probability of the event B_k that the change at τ_k is not detected, and the global PND as PND_global = P(∪_{k=1}^{K} B_k); see Bertrand (2000, Prop. 3.2, p. 222). Next, denoting by δ_k the size of the change in the mean at the change point τ_k, remarking that the right-hand side of (8) is a decreasing function of δ_k, and setting δ = inf_{k=1,...,K} δ_k, we can deduce from (8) and (9) an upper bound β*(C_1, A) for the global PND. The right number K is unknown, but fixed. Thus, we will control the quantity β*(C_1, A); for instance, we choose to set β*(C_1, A) = 10^{−4}. This equation can be numerically solved, since the map C_1 → β*(C_1, A) is decreasing, and we find an implicit function A → C_1(A). After having controlled the error of type II (PND), we can control the error of type I (PFA). We know from Bertrand (2000, Prop. 3.1, p. 221) that for all ε > 0 there exists a constant M_ε such that the PFA is bounded as in (10). For instance, we can set ε = 0.1, plug the implicit relationship between A and C_1 into (10), and obtain a function A → α*(C_1(A), A). The first idea is to vary the parameter A in order to find the optimal value corresponding to a minimum of the map A → α*(C_1(A), A). Unfortunately, this map is decreasing and reaches no minimum value.

Choice of the Window A
From the preceding subsection, we get the feeling that the larger the window size A, the smaller the type I and type II errors. This reasoning holds true as long as

2A < L_0, where L_0 = min_{k=0,...,K} (τ_{k+1} − τ_k)    (11)

is the minimal distance between two successive change points. Thus, we have to choose a parameter A < L_0/2, even if we do not know the quantity L_0 exactly. Figures 8-10 below illustrate the necessary condition (11). We consider the case without noise, that is σ = 0, with three change points at τ = (2000, 2200, 2600), thus L_0 = 200, and we vary the window size: A = 100, A = 200, and A = 600.
In Figure 8, we detect the three right change points; in Figures 9 and 10, we only detect two. This plainly confirms the necessity of condition (11).

We can calculate t*_k under both the null and the alternative assumption:

1) Under the null assumption (H_0): μ_k = μ_{k+1}, t*_k approximately follows a Gaussian law N(0, 1).

2) Under the alternative assumption (H_1) with μ_{k+1} − μ_k = δ, t*_k approximately follows the Gaussian law N(δ/(σ √(1/N_k + 1/N_{k+1})), 1), where N_k and N_{k+1} denote the sizes of the two adjacent segments.

Actually, we want to select all the right change points with as few false alarms as possible. So, we want to control the probability of non-detection PND. For a single change point, let us fix a critical level t_c; then

β_k = Pr(t*_k < t_c | H_1) = Φ(t_c − δ/(σ √(1/N_k + 1/N_{k+1}))).

On the other hand, the probability of one false alarm is

α = Pr(|t*_k| > t_c | H_0) = 2 (1 − Φ(t_c)).

For example, when δ = 0.5, t_c = 1.5, and N_k = N_{k+1} = 100, then β_k = 0.0207 and α = 0.1336.
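These two error probabilities are easy to evaluate numerically; the following sketch (names are ours) reproduces the figures quoted above:

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard Gaussian CDF

def pnd_and_pfa(delta, t_c, N_k, N_k1, sigma=1.0):
    """Probability of non-detection beta_k = Phi(t_c - delta / (sigma *
    sqrt(1/N_k + 1/N_{k+1}))) and probability of one false alarm
    alpha = 2 * (1 - Phi(t_c)) for the single-change Gaussian test."""
    shift = delta / (sigma * sqrt(1 / N_k + 1 / N_k1))
    return Phi(t_c - shift), 2 * (1 - Phi(t_c))

beta_k, alpha = pnd_and_pfa(delta=0.5, t_c=1.5, N_k=100, N_k1=100)
# beta_k is about 0.021 and alpha about 0.134, consistent with the
# values 0.0207 and 0.1336 quoted in the text.
```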

Summary and Conclusions
In this work, the FDqV method clearly appears more powerful than the FDpV method. However, the FDqV method has so far only been assessed by simulations; in the future, we will validate it on real data. Questions not yet developed in the literature are: 1) the FDpV and FDqV methods for weakly or strongly dependent random variables;
2) the choice of the parameters, such as the window size and the threshold, on which both methods depend.
All these questions are difficult and need a more detailed study; we will address them in forthcoming work.

Figure 1 .
Figure 1. Blue: Q(K) calculated with the dynamic programming method; red: the penalized contrast function; green: the optimal contrast function for K change points

Figure 2 .
Figure 2. The right signal (red), the noisy signal (blue), and the filtered derivative function (green)

Figure 7 .
Figure 7. Signal reconstruction after Step 3 by the FDqV method

Figure 8 .
Figure 8. The filtered derivative without noise and A = 100