The Analysis of Bootstrap Method in Linear Regression Effect

This paper combines the least squares estimate, the least absolute deviation estimate, and the least median of squares estimate with the Bootstrap method. When the overall error distribution is unknown or non-normal, we estimate the regression coefficient and its confidence interval, and show through data simulation that the Bootstrap method can improve the stability of the regression coefficient and reduce the length of the confidence interval.


Bootstrap Regression
This paper focuses on linear regression. The traditional regression analysis method assumes the regression equation y_i = β_0 + β_1 x_i + ε_i, where the random error ε_i ∼ N(0, σ²). Under the assumption of error normality, the coefficients β can be estimated, and in the significance test of the regression, the corresponding distribution of the test statistic is known.
However, when the error ε is not normal or its distribution is unknown, how do we estimate the regression coefficients, how do we estimate their confidence intervals, and how do we test the significance of the regression equation? The Bootstrap method is used below to solve these problems. According to the regression relationship, Bootstrap re-sampling methods can be divided into two types.

Model-based Bootstrap regression
The independent variable x in a model-based regression is a controllable (non-random) variable and only y is a random variable. The random sampling errors are ε_i (i = 1, 2, …, n), and the ε_i in the regression equation satisfy the Gauss-Markov assumptions: E(ε_i) = 0 (1), Var(ε_i) = σ² (2), Cov(ε_i, ε_j) = 0 for i ≠ j (3). However, ε_i is not necessarily normally distributed. Note that σ² is not the variance of the residual e_i = y_i − ŷ_i. Normalizing the residuals yields the modified residuals r_i = e_i / √(1 − h_i), where h_i is the leverage of the i-th observation. So that the empirical distribution better models the actual distribution of the errors, the modified residuals are then centered: r_i′ = r_i − r̄, where r̄ = (1/n) Σ_{i=1}^n r_i.
Model-based re-sampling in linear regression re-samples the regression residuals. First, establish the regression model with all samples and estimate the regression coefficients β̂_0, β̂_1. Then re-sample the centered modified residuals to obtain r*_i and calculate the dependent variables, that is, y*_i = β̂_0 + β̂_1 x_i + r*_i.
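The model-based (residual) re-sampling scheme above can be sketched in Python with NumPy; the data, seed, and variable names are illustrative assumptions, not the paper's actual simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data (assumed, for illustration only).
n = 14
x = rng.uniform(0, 10, n)
y = 37.0 * x + rng.normal(0, 1, n)

def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) by ordinary least squares."""
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    return y.mean() - b1 * x.mean(), b1

# Step 1: fit the model on the full sample.
b0, b1 = ols_fit(x, y)
resid = y - (b0 + b1 * x)

# Modified residuals r_i = e_i / sqrt(1 - h_i), then centered.
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverages
r = resid / np.sqrt(1.0 - h)
r = r - r.mean()

# Step 2: re-sample the residuals, rebuild y*, and refit.
B = 2000
boot_b1 = np.empty(B)
for b in range(B):
    r_star = rng.choice(r, size=n, replace=True)
    y_star = b0 + b1 * x + r_star
    boot_b1[b] = ols_fit(x, y_star)[1]
```

The bootstrap distribution of `boot_b1` is then centered near the full-sample slope estimate.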

Cases Bootstrap Regression
The independent variable x and the dependent variable y in the correlation model are both random variables following a joint distribution F(x, y).

E(y_i | x_i) = β_0 + β_1 x_i,
where β_0, β_1 are fixed constants independent of X and n is the sample size. Assume y_i = β_0 + β_1 x_i + ε_i, where ε_i satisfies the Gauss-Markov assumptions, i.e., equations (1), (2) and (3).
Cases re-sampling in linear regression: randomly select pairs (x*_i, y*_i) from the original sample with replacement, i.e., a cases Bootstrap sample.
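A minimal sketch of cases (pairs) re-sampling, again using assumed illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bivariate sample (assumed, for illustration only).
n = 14
x = rng.uniform(0, 10, n)
y = 37.0 * x + rng.normal(0, 1, n)

def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) by ordinary least squares."""
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    return y.mean() - b1 * x.mean(), b1

B = 2000
boot_b1 = np.empty(B)
for b in range(B):
    # Re-sample whole (x_i, y_i) pairs with replacement.
    idx = rng.integers(0, n, size=n)
    boot_b1[b] = ols_fit(x[idx], y[idx])[1]

# A simple percentile interval for beta_1 from the bootstrap distribution.
lo, hi = np.quantile(boot_b1, [0.025, 0.975])
```

Unlike the residual scheme, the pairs scheme keeps each (x_i, y_i) together, so it does not assume a fixed design or homoscedastic errors.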

Confidence Interval for Bootstrap Regression Coefficient β
Since the distributions of the error ε_i and of β̂_0 and β̂_1 cannot be determined, it is difficult to find a statistic θ(x) whose distribution is known. The Bootstrap-t method avoids this difficulty well. The following gives its specific steps, taking β_1 as an example.
1. Re-sample a group of Bootstrap samples {x*_1, x*_2, …, x*_n} ≡ X* from the original sample.
2. Use the Bootstrap sample to calculate the statistic θ̂*.
3. Repeat step 1 B times; the resulting statistic values form a data set {θ̂*_1, θ̂*_2, …, θ̂*_B} (Hiroshi Kon, 2003).
Thus, even if the overall distribution is unknown, we can still estimate statistics and their confidence intervals, solving parameter interval estimation and hypothesis testing problems that are difficult for conventional methods.
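The Bootstrap-t steps above can be sketched as follows for β_1; the studentized-slope construction is the standard one, and the data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data (assumed, not from the paper).
n = 14
x = rng.uniform(0, 10, n)
y = 37.0 * x + rng.normal(0, 1, n)

def fit_with_se(x, y):
    """OLS slope and its standard error for simple linear regression."""
    m = x.size
    xc = x - x.mean()
    b1 = np.dot(xc, y) / np.dot(xc, xc)
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    s2 = np.dot(resid, resid) / (m - 2)
    return b1, np.sqrt(s2 / np.dot(xc, xc))

b1, se = fit_with_se(x, y)

B = 2000
t_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # step 1: re-sample
    b1_s, se_s = fit_with_se(x[idx], y[idx])  # step 2: studentize
    t_star[b] = (b1_s - b1) / se_s

# Step 3: quantiles of t* replace the unknown t distribution.
q_lo, q_hi = np.quantile(t_star, [0.025, 0.975])
ci = (b1 - q_hi * se, b1 - q_lo * se)
```

The resulting `ci` is the Bootstrap-t confidence interval for β_1 at level 1 − α = 0.95.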

Estimation Method of Regression Coefficient β
The following describes three common estimation methods of regression coefficient β, and combines them with Bootstrap method.

Least Squares Method Estimate (OLS)
According to the basic principle of the least squares method, the best-fit line should minimize the distance between the sample points and the line, which can be expressed as minimizing the sum of squared errors Σ_{i=1}^n (y_i − β̂_0 − β̂_1 x_i)². In the model-based least squares method it is assumed that ε_i ∼ N(0, σ²), from which the estimates are obtained.
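As a worked check of the minimization on a tiny assumed sample, the normal-equation solution can be verified directly (the data values are illustrative):

```python
import numpy as np

# Tiny illustrative sample (assumed values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Normal-equation solution for min sum (y_i - b0 - b1*x_i)^2.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def sse(a, b):
    """Sum of squared errors for the line y = a + b*x."""
    return np.sum((y - a - b * x) ** 2)

# Perturbing (b0, b1) in either direction cannot decrease the SSE.
assert sse(b0, b1) <= sse(b0 + 0.1, b1)
assert sse(b0, b1) <= sse(b0, b1 + 0.1)
```

For this sample the solution is b1 = 1.99, b0 = 0.05.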
Although the coefficients of (6) and (4), obtained by the least squares method, are the same, the model-based least squares method is derived under the assumption that ε ∼ N(0, σ²), whereas the cases least squares method makes no such assumption. When the samples are consistent with the Gauss-Markov assumption (1), classic OLS gives satisfactory results, but if outliers exist or the error distribution is heavy-tailed (the Gauss-Markov assumptions do not hold), the results obtained by OLS are difficult to accept.
The following are two robust estimation methods for the regression coefficient β.

Least Absolute Deviation Regression (LAD)
The first robust estimation method for the regression coefficient β was least absolute deviation regression, presented by Edgeworth in 1887, whose principle is to minimize the sum of absolute deviations of the regression equation, that is, min Σ_{i=1}^n |y_i − β_0 − β_1 x_i|. This paper uses the simplest LAD regression algorithm.
1. Take any two points a(x_i, y_i), b(x_j, y_j) (1 ≤ i < j ≤ n) from the n sample points; the coefficients of the straight line through a and b are β̂_1 = (y_j − y_i)/(x_j − x_i), β̂_0 = y_i − β̂_1 x_i. Let d_{i,j} = Σ_{k=1}^n |y_k − β̂_0 − β̂_1 x_k|.
2. Take d = min_{i,j} {d_{i,j}}; the line attaining the minimum is the LAD estimate.
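The pairwise LAD search of steps 1-2 can be sketched as follows; the six-point sample with one outlier is an assumption for illustration:

```python
import numpy as np
from itertools import combinations

# Illustrative sample with one outlier in the last point (assumed data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 30.0])

best = (np.inf, 0.0, 0.0)
# Step 1: enumerate every pair of sample points and the line through them.
for i, j in combinations(range(len(x)), 2):
    if x[i] == x[j]:
        continue  # vertical line, skip
    b1 = (y[j] - y[i]) / (x[j] - x[i])
    b0 = y[i] - b1 * x[i]
    d = np.sum(np.abs(y - b0 - b1 * x))  # sum of absolute deviations
    # Step 2: keep the line with the smallest total absolute deviation.
    if d < best[0]:
        best = (d, b0, b1)

d, b0, b1 = best
```

Here the search recovers the line y = 1 + 2x through the five clean points, ignoring the outlier, which is exactly the robustness LAD is meant to provide.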
(Yadolah Dodge, 2008) Under Bootstrap re-sampling, LAD regression behaves like the sample median in any case, so the distribution of the difference between the Bootstrap re-sampled coefficients and the ordinary LAD coefficients does not match the distribution of the difference between the LAD estimates and the true coefficients; they are similar only for large samples. Applying the smoothed Bootstrap can improve estimation accuracy. (A. C. Davison, 1997)

Least Median Squares Regression (LMS)
Least median squares regression minimizes the median of the squared residuals. For linear regression, median_i (y_i − (β_0 + β_1 x_i))² is minimized, using the original LMS algorithm. Its specific proof can be found in the book Algorithms and Complexity for Least Median of Squares Regression.
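The LMS criterion can be sketched with the same pairwise candidate search used for LAD above (an illustrative approximation on assumed data, not the exact algorithm from the cited book):

```python
import numpy as np
from itertools import combinations

# Same illustrative contaminated sample as for LAD (assumed data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 30.0])

best = (np.inf, 0.0, 0.0)
for i, j in combinations(range(len(x)), 2):
    if x[i] == x[j]:
        continue  # vertical line, skip
    b1 = (y[j] - y[i]) / (x[j] - x[i])
    b0 = y[i] - b1 * x[i]
    m = np.median((y - b0 - b1 * x) ** 2)  # LMS criterion
    if m < best[0]:
        best = (m, b0, b1)

m, b0, b1 = best
```

Because the median of the squared residuals ignores the largest residuals entirely, the outlier contributes nothing to the criterion and the clean line y = 1 + 2x attains a median of 0.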
(J. M. Steele, 1956) Table 1 briefly summarizes a few basic properties of the three regression coefficient estimation methods.

Data Simulation
In this section, the data simulation uses computer-generated pseudo-random numbers: data are drawn from known distributions and analyzed by Bootstrap regression, and the fit between the Bootstrap results and the assumed distribution is then tested according to the regression results.

Model-based Bootstrap Regression Analysis
According to the definition of the model-based regression model, we establish a model with a known distribution in which ε_i satisfies the Gauss-Markov assumptions. Without loss of generality, we assume β_00 = 0, β_10 = 37, Var(ε_i) = 1, x ∼ U(0, 10) and sample size n = 14 in this experiment.
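The simulation setup can be reproduced as follows; the text does not give the uniform error's parameters, so U(−√3, √3), which has unit variance, is assumed here, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Setup from the text: beta_00 = 0, beta_10 = 37, Var(eps) = 1,
# x ~ U(0, 10), n = 14.
n = 14
beta00, beta10 = 0.0, 37.0
x = rng.uniform(0.0, 10.0, n)

# Two error laws compared in the experiment: N(0, 1) and a uniform
# law; U(-sqrt(3), sqrt(3)) is assumed so that Var(eps) = 1 holds.
eps_normal = rng.normal(0.0, 1.0, n)
eps_uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)

y_normal = beta00 + beta10 * x + eps_normal
y_uniform = beta00 + beta10 * x + eps_uniform
```

Either (x, y_normal) or (x, y_uniform) can then be fed into the model-based or cases Bootstrap procedures described earlier.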
To examine the effect of the Bootstrap method both when the error is normally distributed and when it is not, two common error distributions are used: the normal distribution and the uniform distribution; that is, model-based Bootstrap regression analysis is performed under ε_i ∼ N(0, 1) and under a uniform error. For a more intuitive analysis of the Bootstrap estimates of the model-based regression model, the distribution of the regression coefficient β̂_1 and the 1 − α confidence intervals for the first 20 of the 2000 groups of Bootstrap samples under the OLS estimation method are given, respectively. Figure 2 is close to a normal distribution. Comparing the standard deviations of the upper and lower confidence limits for OLS, LAD and LMS, the OLS estimate is slightly larger than those of LAD and LMS, and the standard deviation of the LAD limits is slightly larger than that of LMS, but how close the mean and median of LAD and LMS are to β_10 shows no definite pattern; that is, when the error is not large, the robustness advantage of the LAD and LMS methods is not reflected. Since the mean and median of β̂_1 are close, the computationally simple OLS estimate should be used.
The uniform distribution is more dispersed than the normal distribution, so the standard deviations of all the statistics in Table 3 are larger than those in Table 2; the maximum and minimum of the upper and lower confidence limits correspondingly increase while the mean and median decrease; that is, the confidence interval fluctuates more and its length decreases. The mean of β̂_1 is close to 37 while the median deviation is relatively large. It can be seen that LAD and LMS obtain more accurate values.
Comparing the standard deviations of the upper and lower confidence limits of OLS, LAD and LMS, the OLS estimate is larger than that of LMS, which is larger than that of LAD; but the standard deviation of β̂_1 under LMS is smaller than under LAD, the peak of LMS is more pronounced, and its confidence interval is the smallest. Thus, when the error is re-sampled, combining the Bootstrap with the LMS method yields more stable and accurate values.
Since the confidence interval is almost symmetric about 0, only the confidence limit statistics are given here. Table 9 shows that the standard deviation of the absolute value of β̂_1 in cases Bootstrap regression is no larger than that of model-based Bootstrap regression, but the relative β̂_1 value is larger. The reason is that the x, y values drawn from N(μ_X, μ_Y, σ²_X, σ²_Y, ρ) are small, so rounding error in the calculation is relatively larger, making the regression results unsatisfactory; this is also why the LAD estimation results in Table 4 are less accurate than those of OLS. Each step of the LAD regression iteration introduces some rounding error, whereas OLS needs no iteration and should be little affected by rounding error.
Among the three methods, the β̂_1 value of the LMS estimate is the most accurate and its standard deviation is the smallest. Therefore, to obtain the most accurate estimate of the linear regression coefficient β_1, the combination of cases Bootstrap re-sampling and LMS estimation should be adopted for cases linear regression analysis.

Figure 1. Histogram of coefficient β_1 estimates by the OLS method for the model-based Bootstrap model (error is normally distributed)

Table 1.
Comparison between regression and least squares regression