A New Algorithm for Detecting Outliers in Linear Regression



Introduction
Suppose the model is

$$ y = X\beta + \varepsilon, \qquad (1) $$

where y is an n-vector of the dependent variable, X is an n × p matrix of independent variables, β is a p-vector of unknown parameters, ε is an n-vector of stochastic error terms, p is the number of parameters, and n is the number of observations. The ordinary least squares (OLS) estimator β̂ estimates β consistently and with minimum variance among linear unbiased estimators when the classical assumptions hold. Autocorrelated or heteroscedastic error terms, as well as the inclusion of irrelevant variables in the regression equation, decrease the efficiency of β̂, whereas omitting a relevant variable or measurement errors in the independent variables yield biased and inconsistent estimates. Since most statistical software packages include tests for the classical OLS assumptions, researchers can test these hypotheses quickly. However, the problem of outliers is generally treated as a mere symptom of heteroscedasticity in econometrics textbooks, and outlier detection routines are missing from some software packages. In fact, outliers can do more damage than simply inflating the conditional variances.
A single outlier can be detected by analysing OLS residuals or diagnostic measures derived from OLS estimates. Practitioners tend to analyse these statistics; however, the number of outliers is not known in advance. When the data set contains more than one outlier, the OLS estimator β̂ is usually affected and does not estimate β correctly. As a result, residuals, fitted values, and other properties of the regression are also affected. Regression diagnostics should therefore be calculated for subsets of observations rather than for each single observation. However, this operation is computationally inefficient.
Because of these technical difficulties, several authors have developed more efficient algorithms for detecting regression outliers. Kianifard and Swallow (1989), Marasinghe (1985), Atkinson (1986), Hadi and Simonoff (1993), Peña and Yohai (1995), and Sebert et al. (1998) developed OLS-based outlier detection algorithms that do not require calculating statistics for all subsets of potential outliers. The methods reported in Billor et al. (2000) and Billor et al. (2005) are modern revisions that can be categorized as robust methods. The success of these methods depends on the number of observations, the number of parameters, and the fraction and direction of contamination (Wisnowski, 1999).
Robust regression is another branch of outlier detection and has a vast literature. Huber (1973) introduced the M-estimator. Rousseeuw and Leroy (1987) investigated the properties of the Least Median of Squares (LMS) and Least Trimmed Squares (LTS) estimators. LMS and LTS remain resistant when up to 50% of the data are outliers. However, robust regression estimators are generally based on optimizing non-smooth or discrete functions, which consumes considerable computation time. There is an ongoing effort to speed up these methods. Salibian-Barrera and Yohai (2006) suggested a new algorithm for calculating S-regression estimates. Rousseeuw and Van Driessen (2006) suggested a new algorithm for calculating the LTS estimator for large data sets. Satman (2012) proposed a genetic algorithm (GA) based modification of the method given in Rousseeuw and Van Driessen (2006) and showed that the GA-based search obtains smaller objective values in reasonable CPU time. Torti et al. (2012) performed a simulation study to compare the power of fast robust regression estimators, including the forward search (Atkinson, 2010). In short, as the level of technology increases we can collect more data, and as we collect more data the need for such technology increases.
In this paper we devise a new algorithm for detecting regression outliers. In Section 2, we give a brief description of previous work and the problem of outlier detection. In Section 3, we present the devised algorithm. In Section 4, we perform a Monte Carlo simulation to assess the success of our algorithm. In these simulations we report the MSEs (mean square errors) of our estimator as well as the masking and swamping ratios. Finally, in Section 5, we conclude.

Preliminaries
In regression analysis, an observation is an outlier if the model does not fit it well. However, the model cannot be estimated correctly by OLS when the data contain outliers. As a result, real outliers may appear to be fitted well by the regression equation. In the outlier detection literature this is known as masking, analogous to a Type II error: failing to reject the outlyingness of observations that really are outliers (Lawrence, 1995). Similarly, clean observations may be misfitted by the regression equation, that is, we flag observations as outliers when they are in fact clean. This is the problem of swamping (Barnett & Lewis, 1978). The success of an outlier detection method is generally measured by its masking and swamping ratios.
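To make these measures concrete, denote by $n_o$ the number of true outliers in a sample of size n; a common formulation (our notation, not taken verbatim from the cited sources) is

$$ \text{masking ratio} = \frac{\#\{\text{true outliers not flagged}\}}{n_o}, \qquad \text{swamping ratio} = \frac{\#\{\text{clean observations flagged}\}}{n - n_o}. $$

A good detection method keeps both ratios close to zero.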
The LMS estimator remains resistant when the level of contamination is up to 50% (Rousseeuw, 1984). The objective of the LMS estimator is to minimize the median of the squared residuals. Since the median of squared residuals is not a smooth function of the parameters, gradient-based optimization techniques are not applicable. Therefore, several algorithms have been developed for the LMS, in Winker et al. (2011), Nunkesser and Morell (2010), and Karr et al. (1995), among others.
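In symbols, writing $r_i(\beta) = y_i - x_i'\beta$ for the ith residual, the LMS objective described above is

$$ \hat{\beta}_{\mathrm{LMS}} = \operatorname*{arg\,min}_{\beta} \; \operatorname*{med}_{1 \le i \le n} \; r_i^2(\beta). $$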
Another robust regression estimator, LTS, has the same breakdown point as the LMS, that is, it remains resistant when up to 50% of the data are contaminated. The objective function of the LTS estimator can be written as

$$ \hat{\beta}_{\mathrm{LTS}} = \operatorname*{arg\,min}_{\beta} \sum_{i=1}^{h} r^2_{(i)}, \qquad (2) $$

where $r^2_{(i)}$ is the ith ordered squared residual and h is a custom integer approximately equal to n/2. Since the objective function given in (2) minimizes the sum of the h smallest squared residuals, it can be rewritten as

$$ \min_{w,\beta} \sum_{i=1}^{n} w_i e_i^2 \quad \text{subject to} \quad \sum_{i=1}^{n} w_i = h, \; w_i \in \{0, 1\}, \qquad (3) $$

where $e_i^2$ is the ith squared residual. Note that (3) is a constrained optimization problem with a non-linear objective function and binary variables. Several algorithms for optimizing the LTS objective function can be found in Atkinson and Cheng (1999), Agulló (2001), Giloni and Padberg (2002), Bai (2003), and Hofmann et al. (2010), among others. Rousseeuw and Van Driessen (2006) developed the Fast LTS algorithm, which is based on performing C-Steps (Concentration Steps) on randomly selected subsets. In this algorithm, the key point is to find the right set of p observations and to perform C-Steps that enlarge this initial subset to the h observations minimizing the objective function. In our algorithm, we replace the random subset selection part of Fast LTS with a non-iterative procedure.
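The C-Step mentioned above is simple to state in code. The following is a minimal Python sketch (function and variable names are ours, not from the cited papers): fit OLS on the current subset, compute residuals for all n observations, and keep the indices of the h smallest squared residuals.

```python
import numpy as np

def c_step(X, y, subset, h):
    """One Concentration Step: fit OLS on `subset`, then return the indices
    of the h observations with the smallest squared residuals."""
    # OLS fit on the current subset; X is assumed to include a constant column
    beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    # Residuals are computed for all n observations, not only the subset
    resid = y - X @ beta
    # Keep the h observations that the current fit explains best
    return np.argsort(resid ** 2)[:h]
```

Rousseeuw and Van Driessen (2006) show that each such step cannot increase the LTS objective, so a handful of C-Steps started from a good subset usually suffices.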
Outlier detection in multivariate data and in linear regression are not independent subjects. In many algorithms, the X-space is first treated as multivariate data and an outlier detection procedure is performed; regression outliers are then detected using the observations labeled as clean in the first stage. Hadi (1992) and Hadi (1994) developed and modified a method for detecting outliers in multivariate data by estimating a robust covariance matrix. Rousseeuw and Van Zomeren (1990) introduced the robust covariance estimator MVE (Minimum Volume Ellipsoid) and showed that plotting MVE-based Mahalanobis distances against LMS residuals gives a snapshot of the outliers and their directions. MCD (Minimum Covariance Determinant) is another robust covariance estimator, with an objective function similar to that of the LTS: one searches for the h observations whose covariance matrix has minimum determinant (Rousseeuw & Van Driessen, 1999). Note that all of the covariance estimators mentioned here require many iterations, and there is an ongoing effort to speed up these algorithms.
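In our notation, the MCD objective can be written as

$$ H_{\mathrm{MCD}} = \operatorname*{arg\,min}_{H \subset \{1, \dots, n\},\ |H| = h} \det\!\big(\widehat{\Sigma}_H\big), $$

where $\widehat{\Sigma}_H$ is the sample covariance matrix of the observations indexed by H; the combinatorial nature of this search is what makes the iterations expensive.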
OLS estimators are directly related to the covariance structure of the data. Suppose the model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, a special form of (1) with a single independent variable. Then the slope estimator is

$$ \hat{\beta}_1 = \frac{\widehat{\mathrm{cov}}(x, y)}{\widehat{\mathrm{var}}(x)}, \qquad (4) $$

and it is very sensitive to outliers because both of the operators given in (4) are functions of sample sums. Huo et al. (2012) suggested a new measure of covariance, namely the comediance. The comediance of two variates x and y is defined as

$$ \mathrm{comed}(x, y) = \operatorname*{med}_{i} \big[ (x_i - \mathrm{med}(x)) (y_i - \mathrm{med}(y)) \big], $$

where med(x) and med(y) are sample medians and i = 1, 2, ..., n. In their Monte Carlo simulations, Huo et al. (2012) showed that the performance of the comediance is comparable with other robust covariance estimators including Campbell, Sign, and Rank. Note that this method requires computing three medians and does not include any iterative procedure (Note 1).
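A minimal Python sketch of the comediance (the function name is ours) makes its computational appeal clear: three median computations and no iterations.

```python
import numpy as np

def comediance(x, y):
    """Comediance of two samples: med((x - med(x)) * (y - med(y)))."""
    return np.median((x - np.median(x)) * (y - np.median(y)))
```

Note that comediance(x, x) reduces to the squared median absolute deviation of x, so a comediance-based covariance matrix carries robust scale estimates on its diagonal.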

Proposed Method
In our algorithm, we combine the comediance measure introduced in Huo et al. (2012) with the C-Steps suggested in Rousseeuw and Van Driessen (2006). Briefly, the algorithm constructs a robust covariance matrix that does not require much computation time. Mahalanobis distances of the independent variables are then calculated using this robust covariance matrix for dispersion and the sample medians of the variables for location. Then, the h observations with minimum Mahalanobis distances are used to calculate OLS estimates. Finally, C-Steps are performed, starting from the p observations with the smallest absolute residuals, to enlarge the basic subset to h observations. The whole algorithm is given below.

Main Algorithm
Step 0. Let p be the number of regression parameters, p − 1 the number of independent variables, and n the number of observations. Let w be an n-vector of zeros and set $h = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor$, where $\lfloor k \rfloor$ denotes the integer part of k.

Step 1. Construct the covariance matrix Σ of the independent variables with elements $\sigma_{ij} = \mathrm{comed}(x_i, x_j)$, where i, j = 1, 2, ..., p − 1 and $\sigma_{ij}$ is the element of Σ at row i and column j (Note 2).

Step 2. Calculate the Mahalanobis distances $D_i = \sqrt{(x_i - \mu)' \Sigma^{-1} (x_i - \mu)}$, where μ is the vector of medians of the independent variables. Sort the values of D in ascending order. Let k be the vector of indices of the first h observations in this ordering and set $w[k_1] = w[k_2] = \dots = w[k_h] = 1$.
Step 3. Perform a weighted least squares estimation of model (1) using the weights w and calculate the absolute residuals for all n observations.
Step 4. Perform a C-Step using the p observations with the smallest absolute residuals to enlarge this basic subset to h observations. Then perform a number of C-Steps on the enlarged subset.
Step 5. Standardize the residuals obtained from the final C-Step using the formula $r_i = \dfrac{e_i - \mathrm{med}(e)}{\mathrm{med}(|e_i - \mathrm{med}(e)|)}$. Report observation i as an outlier if $|r_i| > 2.5$.
The first iteration of C-Steps takes p observations and returns the h observations with the smallest absolute residuals. In the remaining iterations, C-Steps are performed many times using h observations, and the final h observations are returned. In these steps, it is expected that some outliers leave the basic subset while clean observations enter it. The number of C-Steps is set to 10 in our simulations, based on the graph shown in Figure 1. A sketch of the whole procedure is given below.
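The following Python sketch puts Steps 0-5 together (our illustrative implementation under the assumptions stated above, not production code):

```python
import numpy as np

def detect_outliers(X, y, n_csteps=10, cutoff=2.5):
    """Sketch of the proposed algorithm. X is the n x (p-1) matrix of
    independent variables (without a constant column), y is the n-vector
    of the dependent variable."""
    n, k = X.shape                       # k = p - 1 independent variables
    p = k + 1                            # number of regression parameters
    h = n // 2 + (p + 1) // 2            # Step 0: subset size

    # Step 1: comediance-based covariance matrix (non-iterative)
    D = X - np.median(X, axis=0)         # deviations from the column medians
    S = np.array([[np.median(D[:, i] * D[:, j]) for j in range(k)]
                  for i in range(k)])

    # Step 2: Mahalanobis-type distances with medians as location
    # (squared distances; only the ordering matters here)
    Sinv = np.linalg.pinv(S)             # pseudo-inverse in case S is singular
    dist = np.einsum('ij,jk,ik->i', D, Sinv, D)
    basic = np.argsort(dist)[:h]         # h least-distant observations

    # Step 3: 0/1-weighted least squares, i.e. OLS on the basic subset,
    # then absolute residuals for all observations
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc[basic], y[basic], rcond=None)
    abs_res = np.abs(y - Xc @ beta)

    # Step 4: start from the p best-fitted observations, enlarge to h,
    # then iterate C-Steps on subsets of size h
    subset = np.argsort(abs_res)[:p]
    for _ in range(n_csteps):
        beta, *_ = np.linalg.lstsq(Xc[subset], y[subset], rcond=None)
        subset = np.argsort((y - Xc @ beta) ** 2)[:h]

    # Step 5: standardize the final residuals and flag outliers
    e = y - Xc @ beta
    r = (e - np.median(e)) / np.median(np.abs(e - np.median(e)))
    return np.where(np.abs(r) > cutoff)[0]
```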

Monte Carlo Simulations
We perform a simulation study to assess the success of our algorithm. We report the MSEs of the estimated regression parameters as well as the masking and swamping ratios. Data are generated for n = 10^3, 10^4, and 10^5. The number of parameters is set to p = 5 and p = 10. Any data set generated with these n, p combinations can be considered large (Note 3). Data are generated using the linear model (1) with ε ∼ N(0, 1), X_i ∼ N(0, 100), and β = [5, 5, ..., 5]', where N(μ, σ²) denotes a normal distribution with mean μ and variance σ², and i = 1, 2, ..., p. The level of contamination is variable and set to c_m = 20%, 30%, 40%, and n − h for Y-space outliers, and c_m = n − h for X-space outliers, where $h = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor$ and $\lfloor k \rfloor$ denotes the integer part of k. Note that n − h is the maximum level of contamination; at contamination levels higher than n − h, the outliers effectively become the clean observations and vice versa. The independent variables are contaminated by adding random variates that follow a N(100, 100) distribution. The dependent variable is contaminated using the formula $y_i := \max(y_{1:h}) + C_R$, where $\max(y_{1:h})$ is the maximum of the first h values of y and $C_R$ is a random variate that follows a N(10, 100) distribution.
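As an illustration, one replication of this data-generating process could look as follows in Python (our reading of the design; the choice of which observations to contaminate is an assumption):

```python
import numpy as np

def generate_data(n, p, c_m, rng, x_space=False):
    """Generate one contaminated data set; c_m is the number of outliers."""
    k = p - 1
    X = rng.normal(0.0, 10.0, size=(n, k))       # X_i ~ N(0, 100)
    beta = np.full(p, 5.0)                       # beta = [5, 5, ..., 5]'
    y = beta[0] + X @ beta[1:] + rng.normal(0.0, 1.0, size=n)  # eps ~ N(0, 1)
    h = n // 2 + (p + 1) // 2
    if x_space:
        # X-space outliers: add N(100, 100) variates to the last c_m rows
        X[-c_m:] += rng.normal(100.0, 10.0, size=(c_m, k))
    else:
        # Y-space outliers: y_i := max(y_{1:h}) + C_R with C_R ~ N(10, 100)
        y[-c_m:] = y[:h].max() + rng.normal(10.0, 10.0, size=c_m)
    return X, y

rng = np.random.default_rng(1)
X, y = generate_data(n=10**3, p=5, c_m=200, rng=rng)   # 20% contamination
```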
Choosing the right number of C-Steps is important. Generally, iterating more C-Steps yields more accurate estimates. However, C-Steps consume time when the sample size is large. We set the number of C-Steps to 10 in our simulations. In Figure 1, the LTS criterion is plotted against the number of C-Steps performed. It is clear that performing more than 10 C-Steps does not gain much. When computation time is not critical, more C-Steps can be performed.

Results and Discussion
It is important to be aware of outliers in regression analysis. When the data are contaminated, the parameter estimates, residuals, and fitted values are all affected. Because of this, single-observation regression diagnostics are unreliable, and diagnostics should instead be calculated for subsets of observations. However, this operation consumes too much computation time. Although robust procedures are successful in detecting outliers even when the contamination level is up to 50%, they also consume much time, and there is an ongoing effort in the outlier detection literature to speed them up.
In this paper, we proposed a new algorithm based on a non-iterative covariance matrix and the C-Steps used in LTS estimation. Although this covariance matrix does not have all of the desired statistical properties, it is useful for finding a clean basic subset, which is then used in a robust fit. Standardized absolute residuals larger than a predefined cut-off can be labelled as outliers.
The proposed algorithm has low masking and swamping ratios in the case of X-space outliers when the level of contamination, the sample size, and the number of parameters are up to 50%, n = 10^4, and p = 10, respectively.
In the case of Y-space outliers, the performance of the algorithm decreases, but it still remains resistant when the contamination level, the sample size, and the number of parameters are up to 30%, n = 10^5, and p = 10, respectively.
Our algorithm is suitable for detecting multiple outliers in regression analysis when the data set is large and the contamination level is high. Its computational cost is low, and it is applicable even in interpreted statistical software packages. The regression estimators have small biases, variances, and MSEs, except for the intercept parameter.
Reducing the bias and variance of the intercept estimator is not a trivial task, and it can be examined in future work.

Figure 1. Number of C-Steps and the LTS criterion

Table 2 shows a similar pattern, including higher bias, variance, and MSE statistics for the intercept estimator. The best results are obtained when n = 10^5 and c_m = 20%.