An Alternative Hotelling T 2 Control Chart Based on Minimum Vector Variance ( MVV )

The performance of traditional Hotelling T control chart using classical estimators in Phase I suffers from masking and swamping effects. To alleviate the problem, robust location and scale estimators are recommended. This paper proposed a robust Hotelling T control chart for individual observations based on minimum vector variance (MVV) estimators as an alternative to the traditional multivariate T control chart for Phase II data. MVV is a new robust estimator which possesses the good properties as in minimum covariance determinant (MCD) with better computational efficiency. Through simulation study, we evaluate the performance of the proposed chart in terms of probability of detection and false alarm rates, then compared with the performance of the traditional charts and the chart issued from MCD estimators. The results showed that MVV control chart has competitive performance relative to MCD and traditional control charts even under certain location parameter shifts in Phase I data.


Introduction
In many industrial settings it is frequently required to monitor more than one interrelated variables.The Hotelling's T 2 control chart is one of the multivariate statistical tools which is widely used to detect the presence of special-causes of variation by monitoring a nominal mean vector  .This chart is popular as it possesses almost all the desirable characteristics for a multivariate control chart such as ease of application, flexibility, sensitivity to small process changes, and the availability of software for application (Mason & Young, 2002).Like any other control charts for monitoring the variability in a process, its construction consists of Phase I and Phase II (Alt, 1985) which are also referred to as retrospective and prospective analysis respectively (Woodall & Montgomery, 1999).Phase I focuses on analyzing historical data to determine whether the process is in control by estimating the in-control parameters of the process and the control limits.While in Phase II, the centre of attention is on monitoring on-line data to quickly detect shifts in the process from the in-control parameter values estimated in Phase I. Unusual observations in Phase I can lead to the inflation of control limits and reduction of power to detect process changes in Phase II.Therefore a successful Phase II analysis depends on a successful Phase I analysis in estimating in-control mean, variance, and covariance parameters.The preliminary data set collected in retrospective analysis involves either initial subgroups or individual observations.In many situations, data are collected according to the rational subgroups concept.Nevertheless, sometimes data come in the form of individual observations especially when the production rate is too slow to conveniently collect subgroup size greater than one.For individual multivariate observations, the parameter estimates for the mean vector and covariance matrix in Phase 1 are based on pooling all the observations (Jackson, 1985;Tracy et al., 1992;Wierda, 1994;Lowry & Montgomery,1995).However, this approach will cause the variance estimates to inflate if the special-cause of variation present in Phase I, as demonstrated by Sullivan and Woodall (1996;1998), Vargas (2003) and Williams (2004).To overcome the problem, alternative estimation methods have been proposed in the literature.One of the approaches is to calculate the T 2 statistic based on successive differences variance-covariance matrix estimator (Holmes & Mergen, 1993;Sullivan & Woodall,1996;1998;and Williams et al., 2006).Though this approach is effective in detecting shifts in the mean vector, it fails to detect outliers as shown in Vargas (2003).Another approach is to use robust estimators of the process parameters.A notable use of robust estimator is in identifying the deviation of data, or outliers.Compared to the classical statistics, the use of robust statistics will give a clearer variability description between an outlier and 'good data', whereas the classical statistics will vaguely show the difference (Hampel et al.,1986).There are two approaches to deal with outliers.The first approach is to identify and remove outliers before using the remaining good data points to calculate the classical estimators.The other approach is to use the robust estimators in place of classical estimators (Beckman & Cook, 1983).A wide range of robust estimators of multivariate location and scatter is available, see Maronna and Zamar (2002) and Maronna et al. (2006) for a review.Nonetheless, the minimum volume ellipsoid (MVE) and minimum covariance determinant (MCD) estimators introduced by Rousseeuw (1985) have received a considerable attention by scientific community and widely used in practice.The advantage of using MVE estimators is that, they have high breakdown point of approximately 50% and also affine equivariant (Lopuhaa & Rousseeuw, 1991. pg 236).However the computation of MVE estimator is very expensive and it may not even be computationally feasible (Hadi, 1992. pg 762).In addition, there is no fast algorithm known to compute the estimator.This is due to the fact that MVE has poor rate of convergence (Lopuhaa & Rousseeuw, 1991. pg 237) and fails to cope with large sample of more than 30 (Rousseeuw & van Driessen, 1999. pg 213).To alleviate the complexity of MVE, Rousseeuw (1985) also introduced the minimum covariance determinant (MCD) method.MVE and MCD estimators have the same characteristics with respect to affine equivariance, high breakdown value and bounded influence function properties (Rousseeuw & Leroy, 1987).The only difference is in the criteria used where MVE minimizes the volume of the ellipsoid on ( 1 )/2 h n p    data, while MCD minimizes the determinant of the covariance matrix based on the h data.
The MCD estimator is more attractive than MVE because it has a better convergence rate of (Butler et al.,1993;Croux & Haesbroeck, 1999) and MCD gives the exact solution (Hadi, 1992;Hubert et al., 2005).Lopuhaa and Rousseeuw (1991) realized that the efficiency of high breakdown methods can be quite low, and proposed reweighted MCD (RMCD) estimator to alleviate the problem.Croux and Haesbroeck (1999) employed the reweighted version and noticed that this approach maintains the breakdown point of the initial MCD estimators, while attaining a better efficiency.To compute the initial MCD estimator and its reweighted, various algorithms have been suggested.Most of the algorithms attempt to increase the computational efficiency because to obtain approximate values of these estimators is not only expensive, but could be impossible for large sample sizes with large number of quality characteristics (dimensions).Nevertheless, the main contribution in this domain is the Fast MCD algorithm proposed by Rousseeuw and van Driessen (1999) and improved by Hubert et al. , (2005) which is available in many computer packages such as Matlab, R, SAS, and S-Plus.However, Fast MCD is not without limitation.For example, the use of minimum covariance determinant as the objective function in data concentration process can be computationally laborious especially when the data set is of high dimension.On the other hand, as Angiulli and Pizzuti (2005) have pointed out, the computational efficiency is as important as effectiveness.Furthermore, as noted by Fauconnier and Haesbroeck (2009), Fast MCD algorithm may return different results when used repeatedly in the same or in different statistical packages and could be more critical when n/p are small.To overcome the weaknesses of Fast MCD algorithm, Herwindiati (2006) proposed minimum vector variance (MVV) as an alternative measure of multivariate data concentration.Herwindiati et al. (2007) revealed that MVV was successfully used as an objective function in Fast MCD algorithm to substitute the MCD criterion.The findings showed that MVV is computationally more efficient than Fast MCD and as effective as Fast MCD in labelling outliers.Therefore MVV is one of the methods where the algorithm itself is referred to as the estimator as discussed in Werner (2003).A detail explanation about this method can be referred in Section 2.
The study on the significant role of MVE, MCD and RMCD estimators in scientific application can be easily found in the literature specifically in the construction of robust Hotelling T 2 chart.Vargas (2003) and Jensen et al. (2007) introduced robust control charts based on MVE and MCD estimators for multivariate individual observations.They identified and removed the outliers in Phase I analysis and then calculate the classical estimators using the remaining good data points for Phase II analysis.Through this approach, the computability and breakdown point of the estimators become more important, but statistical efficiency is not as crucial because the highly robust estimators will eventually be replaced by classical estimators in Phase II analysis (Jensen et al., 2007).They noticed some drawbacks when MVE and MCD are used in Phase I.The T 2 issued from MVE did not perform under large sample size.Conversely, T 2 issued from MCD needs a larger sample size if large number of outliers is suspected to ensure that MCD estimator does not breakdown and lose its ability especially when monitoring with more variables (p).
To abate the problems, Chenouri et al. (2009) proposed robust Hotelling T 2 chart based on RMCD estimator.Besides possessing the nice properties of MCD estimator, this estimator is not unduly influenced by outliers and is more efficient than MCD.However, their approach is different from Vargas (2003) and Jensen et al. (2007) whereby they used RMCD estimators in place of classical estimators in constructing Hotelling T 2 chart for Phase II data.Using the same approach as Chenouri et al. (2009), Alfaro & Ortega (2009) make comparison study for the performance of Hotelling T 2 control chart based on robust estimators of MCD, MVE, RMCD, and trimmed estimator.They concluded their work by recommending the use of T 2 based on trimmed estimator and RMCD when there are few outliers in the production process because of their ability in controlling false alarm rates.However, in the manufacturing of products which emphasizes more on outliers detection than the false alarms generated, the T 2 based on MCD can be considered as better alternatives.This is due to the fact that the Hotelling T 2 control charts based on MCD performed well in terms of probability of outliers detection.Theoretically, if the percentage of outliers' detection increases, the chart should also able to control the overall false alarm rate, α (Jensen et al., 2007).However the finding in Alfaro and Ortega (2009) shows a conflict between the percentage of outliers detection and the ability of robust control chart in controlling the overall false alarm rate when using robust charts under certain conditions.
In this study, we proposed a robust Hotelling T 2 control chart using a new robust estimator known as minimum vector variance (MVV).The ability of MVV estimator when used in Hotelling T 2 control chart for Phase I has never been tested before.Djauhari et al. (2008) had suggested to use MVV estimator in Hotelling's statistic as a new criterion because MVV estimators have the same breakdown point as MCD estimators and also posses affine equivariant properties.Therefore we are inspired to greater efforts by their suggestion.This study integrates the MVV estimators in the Hotelling T 2 control chart for Phase II data using the same approach as Chenouri et al. (2009) and Alfaro and Ortega (2009) for monitoring the multivariate observations.Even though RMCD was observed to be better than MCD in controlling the false alarm rate, in this study, comparisons are made based on the initial MCD since the algorithm for the proposed method follows the algorithm of the initial MCD.We want to compare the algorithm in its original state and diagnosing problems that might arise using the proposed method (MVV) in constructing Hotelling T 2 control chart.If there is a need to improve the method, this will be continued in the next study.The performance of our proposed control chart was evaluated in the case where there are no changes in the covariance structure.Performance evaluation measured the effectiveness in terms of the probability of outlier(s) detection and false alarm rate (type I error) on Phase II data.It is worthwhile to investigate on the performance using both of them because these measures are closely related (Ramaker et al., 2004).When the data comes from an in-control process the false alarm rate should be close to a nominal value, α.
In this study, α is set to be equals to 0.05.When data comes from an out-of-control process then the probability of detection should be large to ensure that the chart is able to monitor on-line data and quickly detect shifts in the process of Phase II.Organization of the remaining part of the paper is as follows.Section 2 discusses about formal definition and the properties of the MVV estimator.In Section 3, we formally introduce a robust control chart based on the MVV estimators.We also demonstrate the Monte Carlo method to estimate the distribution of Hotelling T 2 statistic needed for the computation of control limit and discuss how the performance evaluation is done.The results of the analysis are presented in Section 4. Finally, discussion and conclusion are given in the last section.

Minimum Vector Variance (MVV) Estimator
MVV and MCD estimators have the same characteristics with respect to breakdown point, affine equivariant properties, and that their algorithms also display the same structures.For details see, Herwindiati (2006), Herwindiati et al. (2007) and Djauhari et al. (2008).The only difference between the two is the way in which the concentration step is generated; MCD uses covariance determinant (CD), while MVV uses vector variance (VV).
By definition, vector variance (VV) is the sum of square of all elements on the diagonal of covariance matrix.If X is a random vector of p dimension with  as its covariance matrix, then VV of X is 2 ( ) r T  .Its value indicates the degree of how multivariate distribution is scattered.The larger the value of VV the more scattered the distribution around its mean vector in a subspace of dimension q p  .It is equal to zero if and only if the distribution degenerates at the mean vector.The use of VV instead of CD as multivariate data concentration measures have several advantages (Djauhari, 2007).First, its computation is very efficient even for covariance matrix of large size because VV is a quadratic form while CD is a multilinear form.Thus, the number of operations of VV is smaller than CD since VV is of order O(p 2 ) and CD is of order O(p 3 ).Second, vector variance does not depend on non-singularity of the covariance matrix like covariance determinant.The singularity problem usually arises when the number of variable p is larger than number of sample size n.Another advantage of VV was illustrated by Djauhari (2007) via comparing the power of vector variance-based test with covariance determinant-based test.In general, both tests have similar performance when p is small such as 2 p  .However, the power of VV is greater than CD to larger shift of covariance structure when p and n are large.The main statistical method used in the estimation of MVV is Mahalanobis squared distances (MSD) which is defined as T S among all possible sets of h data.The location and scatter estimators are given by Equation ( 1) and ( 2) respectively as follows, and covariance matrix  .If  and  are unknown then we need to estimate them using an in-control data set.The process of identifying the in-control data set from i x is referred to as Phase I operation.
However, this standard approach is only effective in eliminating very extreme outliers and detecting large shift in the mean vector in small sample sizes, but it fails to detect more moderate outliers especially when number of variables increased (Vargas, 2003;Jensen et al., 2007;Chenouri et al., 2009).To alleviate the problem of this procedure, we proposed using MVV estimator in Phase I data, x i .Since the estimator is known to be free from outliers due to its estimation process, they could be readily used as in-control estimators in Phase II.Let

 
1 2 , ,... ).The performance of the robust Hotelling T 2 charts is judged based on the false alarm rate and probability of detection in the process behavior from Phase II data.Thus, for Phase II observations, we simulate 1000 new datasets of different sample sizes (n) and dimensions (p) in Table 1.
To determine the false alarm rate and probability of detection, we randomly generate a Phase II observation with in-control and out of control parameters respectively from Phase 1, and calculate the proposed robust Hotelling T 2 statistics.The false alarm rate or probability of detection is estimated as the proportion of statistic values that are above the control limits of 1000 replications.
For Phase I, we simulate data of various conditions created by manipulating the number of observations, dimensions and levels of contamination.To examine the effect of contamination on the charts' performance, we have considered a contaminated model by using a mixture of normal suggested by Alfaro & Ortega (2009) where ε is the proportion of outliers, 0  and 0  are the in-control parameters while 1  and 1  are the out-of-control parameters.In this study we assume contamination with shift in the mean but no changes in covariance structure, therefore, the covariance matrix 0  and 1  in Equation ( 9) represent the identity matrix of p dimensions (I p ).To check on these conditions, we consider ε to be 0, 0.1 or 0.2.While for the probability of detecting a change which depends on the shift in the mean vector, we set 1  to be a vector of size with value of 0 (when there is no change), 3 or 5 (which is a good leverage point).Manipulation on the mean shifts and percentage of outliers generate 5 different types of contaminated distributions as listed below which were categorized as ideal, mildly contaminated, moderately contaminated and extremely contaminated as follows, Each of these model was paired with different combinations of sample sizes, n, and number of dimensions, p (refer to


, where 1  is the shift in the mean vector with values similarly assigned to Phase I (i.e.0, 3, and 5).The performance of the proposed chart is then compared with robust Hotelling T 2 chart using MCD ( 2 MCD T ) and the traditional Hotelling T 2 control charts.For the traditional chart we employed two approaches; first approach denoted as 2 0 T is without cleaning the outliers as being adopted by Alfaro and Ortega (2009) and the second approach, which is known as the standard approach, cleans the outliers once ( 2 S T ).Each of these charts was tested on 5 types of contaminations on 23 combinations of n and p which totalled up to 115 conditions.For each condition, the false alarm rates and probability of detection were determined.The programs and simulations were run using MATLAB 7.8.0 (R2009a).The algorithm of MVV was executed using the MATLAB 7.8.0 (R2009a), while Fast MCD algorithm using mcdcov.m in the LIBRA package under MATLAB 7.8.0 (R2009a).

Simulation Results
In this section, we compared the performance of 2 MVV T , 2 MCD T , 2 0 T and 2 S T control chart in terms of probability of detection and false alarm rate.The results of the investigation are presented in figures and tables for the probability of detection and false alarm rates respectively.

Probability of detection of outliers
The graphs illustrating the performance of the four charts in terms of probability of detection are exhibited in Figure 1-5.Each figure represents different dimension (p).For each condition, the performance of the control chart is regarded as better in detecting changes when the value of the probability is closer to 1.Under bivariate case (p = 2) as presented in Figure 1, initially 2 S T showed better detection than other charts at mild and moderate contamination.However, the good performance of 2 S T only sustain at n = 10, 25.Once the value of n and p increased, which can be clearly observed in Figure 2 ).This pattern repeats even within the same dimension for p > 5. Result on the 2 S T chart reveals that the chart performs so well when the number of outliers is small (small p and low percentage of contamination), but underperforms when the number of outliers gets larger (large p and high percentage of contamination).This weakness can be mitigated by the use of robust Hotelling T 2 chart.

False alarm rates
The performance of a chart is not purely judged by its ability in detecting outliers, but also in controlling the false alarm rate.False alarm rate is the probability of out-of-control signal when a process is in control.The value becomes large if the process is unstable due to increase in variability.Inflated false alarm rate can lead to unnecessary process adjustments and loss of confidence in the control chart as a monitoring tool (Chang & Bai, 2004).Hence, a method which can control the false alarm rate to the desired level is necessary.and 2

MVV
T .The control chart is considered to be in control of its false alarm if the empirical value is close to the nominal value, α.For the bivariate (p = 2) case presented in Table 2, the overall results on false alarm rates show that 2 MVV T outperforms the other control charts in especially when the sample size is very small (n = 10).Even though the results for and under most conditions are well controlled, however under ideal condition (no contamination) the chart failed to control the false alarm, causing the rate to inflate to 0.1000.The 2

T
and 2 0 T control charts are badly affected when the sample size is very small, which are verified by the rates of false alarm which are far below the nominal value except for ideal condition.When the percentage of outliers increased to 20% we observed that the rates for 2 T chart performs so well in controlling false alarm rates except when the percentage of outliers is large.
When the dimension increased to p = 5, the rates of false alarm for both traditional charts ( 2 0 T and 2 S T ) improved.Refer to Table 3.Nevertheless, the rates for chart under ideal condition are still high (very far above the nominal value).We also notice improvements in the robust 2 MVV T charts especially when the percentage of outliers is large.In contrast, the false alarm rates for 2 MCD T chart worsen with values as small as 0.0020.Table 4 displays the false alarm rates for the case of p = 10.While there are noticeable improvements in most of the conditions for 2 chart is still unable to control its false alarm rates under the latter condition.
As we scrutinized the false alarm rates for p = 20 in Table 6, we discover sporadic improvements under different conditions.There is no obvious improvement pattern could be observed.However, we can clearly observe that 2 MCD T chart performs badly in controlling false alarm rate in all conditions.

Real Data Analysis
To illustrate the Hotelling T 2 , a company which involved in the production of advanced for the aircraft industry has provided us with data on spoilers.Spoilers are vital devices in an airplane.Their function is to increase lifts when the airplane is flying.The products are used in civilian, defence, and space applications, which cannot compromise any mistakes, albeit a minor one.Thus, careful monitoring is required to ensure that no variation occur in the process.Any slight mistake could jeopardize a human life.
For the purpose of this study, a sample of 47 products (n = 47) which consists of several features such as trim edge (X 1 ), trim edge spar (X 2 ), and drill hole (X 3 ) was furnished to us by the company.Out of the total, 21 products were collected from 2009, while the rest were from 2010.Hence, we decided to use the 2009 products as Phase I historical data, and considered the products from 2010 as future data in this study.The details of the historical and future data are displayed in Table 7 and 9 respectively.The products consist of 3 quality variables (dimensions) namely trim edge, trim edge spar, and drill hole.Estimates for the location vector (( X ) and scatter matrix (S) are presented in Table 8.The calculation of the upper control limits (UCLs) based on the estimates are presented in the last column of the table.The values of the T 2 based on the above estimators appear in the last four columns of Table 9.The graphical presentation of the corresponding control charts are put on view in Figure 6.
When comparing the values of the T 2 in Table 9 with the corresponding control limits in Table 8, we observe that the three statistics T is as expected since the analysis on the probability of detection using simulated data showed that T performed well in detecting outliers under low dimension (not more than 5) only, but underperformed when the dimension increased to above 5.

Conclusion
Hotelling T 2 chart is well accepted as a reliable method to monitor production; however, under conditions of non normality, this chart is known to be underperformed.Alternative on the Hotelling T 2 statistic particularly on the location and scatter measures are recommended in order to produce a reliable chart regardless of the conditions.This study proposed an alternative to the the Hotelling T 2 chart by using a robust estimator known as minimum variance vector (MVV) for its location and scatter measures.MVV not only has all the properties of the well known minimum covariance determinant (MCD) such as high breakdown point and affine equivariant, but also has better computational efficiency.The performance of our proposed robust Hotelling T 2 chart using MVV in terms of false alarm rate and probability of detection were compared with the robust Hotelling T T and 2 MCD T by Alfaro and Ortega (2009) showed a conflicting result between the percentage of outliers detection and the overall false alarm rate such that when the probability of detection increased, the false alarm rates inflate away from the nominal value.However, our proposed chart, 2 MVV T perform so well in terms of detecting outliers and also in controlling false alarm rates.Even though the traditional Hotelling 2 S T chart performs so well in terms of controlling false alarm rates, but this chart fails to achieve good probability of detection especially when the number of quality characteristics is large.On contrary, the Hotelling 2 MCD T chart performs wonderfully in detecting outliers, however the chart fails terribly in controlling false alarm rates.With its good performance in terms of detecting outliers and controlling false alarm rates, plus the good properties of its statistics, Hotelling 2 MVV T chart is indeed a good alternative to the multivariate control chart.
2)To compute the MVV estimators, we propose the new MVV algorithm by combining the C-step fromHerwindiati et al. (2007).The C-step is similar to the one in the Fast MCD algorithm, except that the computation of covariance determinant is replaced by the vector variance.The complete algorithm is described as below: Stage 1: Creating Initial Subsets.This step must be repeated 500 times Stage 2: Concentration Steps (C-step) p-variate random sample of n observations of preliminary data set in Phase I. Assume that are independent and follow a multivariate normal distribution with mean vector  mean vector and covariance matrix estimators, respectively.We define a robust Hotelling's 2T for Phase II data, of control limits for the proposed Hotelling T 2 control chart 3.2 Performance evaluation.In order to assess the performance of 2 ( ) MUV g T control charts, various conditions were created by manipulating number of observations (n), number of dimensions (p), and level of contamination by using different proportion of outliers (ε) and several mean shifts values ( 1  -5, the line representing 2 MVV T consistently at the highest location in the graphs with the probability value of approximately 1, and overlapping with 2 MCD T line under most of the conditions.There are instances when the 2 MCD T line started with lower values creating gapsbetween the two lines but merged later on when the n values increased.This situation occurs when the sample size is small with 20% outliers and mean shift 3always at the lowest and second lowest respectively, creating a very wide gap between the other two lines (

T
charts dwindle as the sample size increased, but the 2 S T chart is still in control of its false alarm.The performance of the robust 2

STS
seem to be deviating away from the nominal value.Even under the influence of extreme contamination, the rates of 2 MVV T chart are no less than 0.022 value.However, the rates for 2MCDT chart are still far below the nominal level despite the improvement.Under the case of p = 15, as can be clearly observed in Table5, all the charts show better results than the previous case.Great improvement could be detected in 2

T
signal observations 20, 22 and 25 as out-of-control but only signals 20 and 25 as out-of-control observations and fails to signal observation 22.The result for 2 0

T
was not as effective as the other charts in detecting outliers.Chart (a), (b), (c) and (d) in Figure 6 represent the control chart the outcome could be due to the small number of quality characteristics (dimension) of the product.As revealed in the simulation study, 2 S variate observations and let H X  .The optimal number of the data involved in the computation of MVV estimator (i.e.

Table 1 )
to create various conditions which are capable of highlighting the strengths and weaknesses of the charts.Next, in Phase II, we simulate data from multivariate normal distribution 1 ( , )p p MCN I

Table 2 -
6 which recorded the false alarm rates for each condition are arranged based on the ascending number of dimensions (variables) namely 2,5,10,15 p  and 20 with α = 0.05.The first column in each table displays the number of sample sizes, followed by the percentage of outliers and non centrality values respectively in the second and third column.The last four columns record the false alarm rates of the control charts investigated in this study; namely 2

Table 1 .
The values of n and p

Table 2 .
False alarm rates of the corresponding control charts with dimension, p = 2

Table 3 .
False alarm rates for independent case with dimension, p = 5

Table 4 .
False alarm rates for dimension, p = 10

Table 5 .
False Alarm rate for dimension, p = 15

Table 6 .
False Alarm rate for dimension, p = 20

Table 7 .
Historical data set (Phase I data)

Table 8 .
Estimates of location vector, covariance matrix, and UCL.

Table 9 .
The Hotelling T 2 values for the future (Phase II ) data