Using Simulation to Test the Reliability of Regression Models

In many sciences, it is standard laboratory practice to use a statistical design of experiments and a regression model to study the influence of multiple parameters under a wide range of conditions. The current study investigates the reliability of regression models by examining recently published models. Of particular interest are the assumptions that are not robust to violation, such as the reliability of measurements, constant variation of residuals, and sample size. To test regression models, simulation is used to model potential measurement error and the importance of sample size on parameter estimation. The randomly perturbed designs are then used together with the mathematical models obtained from the original designs to simulate experiments and obtain new regression models. The original model is compared with the new model, and various statistical tests are performed to determine how accurately the original parameters were estimated when exposed to simulated measurement error.


Introduction
Scientists perform experiments in virtually all areas of study, often to determine a relationship between numerous input factors and one or more output factors. The area known as the Design of Experiments (DOE) is concerned with planning and conducting experiments, as well as analyzing the resulting data so that valid and objective conclusions are obtained (Montgomery, 2009). Factorial experiments are often used as an experimental strategy in which inputs are varied together. They are an important class of experiments because they may be used to accomplish a variety of goals, such as screening factors or determining optimum factor levels. In this study we focus on experiments in environmental sciences, where factorial experiments are primarily used to study the influence of multiple parameters (physical, chemical, or biological). After the experiments dictated by a factorial design have been performed, a multiple regression model is created to make predictions and inferences. This is a widely used method: more than 4,000 hits were obtained with the keywords "multiple regression analysis" in the Science Citation Index within the area of Environmental Sciences alone.
Typically, when evaluating a regression model, one uses the coefficient of determination ($R^2$) to confirm the goodness of fit of the model to the experimental data. The coefficient of each variable and its associated p-value are also used to help assess the influence of the variable on the process under study. But $R^2$ values can be made artificially large by including an excessive number of terms, and p-values only indicate whether a term is statistically significant; they do not assess the accuracy of parameter estimation. Statisticians have studied the reliability of regression models and have identified a "reliability matrix" (Gleser, 1992) to help assess the model. It is widely known that measurement errors influence regression models (Pagano & Anoke, 2013). However, a reliability matrix is rarely used in environmental science studies.
For a multiple regression model to be reliable, a necessary condition is that it does not violate the regression model assumptions (Kahane, 2008). However, a literature search reveals that in environmental sciences, multiple regression models are often not tested for their robustness to regression assumptions prior to interpretation and drawing conclusions. Two of the key model assumptions that are made are: 2. Residuals of the model are independent over time, normally distributed, have mean zero, and exhibit constant variation.
To further exacerbate the problem, scientists have long known that two types of errors are introduced during input and output analysis in experimental studies: precision errors and accuracy errors (Taylor, 1982). Precision errors are related to the random errors associated with an experiment (e.g., measurement error), whereas accuracy errors are related to the systematic differences observed between laboratories (e.g., different calibration of the instruments). A regression model developed from experimental data should also be robust to these experimental errors. When these errors are compounded with the limited data sets that practical considerations allow, a regression model could provide a false sense of parameter efficiency and output predictability.
The current study is aimed at developing a method to evaluate the reliability of regression models when the input and output parameters are subjected to small random perturbations created using simulation. It is our assumption that the coefficients of the variables and their p-values should not change significantly in reliable models, even when the variables are subjected to simulated random perturbations. Also addressed is how residual analysis can help to evaluate the reliability of the model predictions.

Methodology
Twenty studies within the field of environmental sciences were selected to determine whether the published models were robust enough to withstand simulated random perturbations of the input and output values, uniformly distributed between ±5%. The studies examined, along with the error measures (as described later in this section), are given in Table 1. The types of experimental designs considered are: full factorial, fractional factorial, mixture design, Box-Behnken, and central composite. The perturbations are used to assess how sensitive the model is to small changes in the input and output data. The changes could be a result of measurement error or of other types of process variation. To introduce a random perturbation of some value between ±5%, every coordinate of the design points, as well as every output value, is multiplied by a simulated random number between 0.95 and 1.05. Table 2a provides an example of design points used in a $2^3$ full factorial design; these were multiplied by simulated random numbers between 0.95 and 1.05 to obtain the design matrix given in Table 2b.
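The perturbation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `perturb`, the use of NumPy, and the factor levels in `design` are all assumptions made for the example (the values of Table 2a are not reproduced here).

```python
import numpy as np

def perturb(values, rel=0.05, rng=None):
    """Multiply every entry by an independent Uniform(1 - rel, 1 + rel) factor."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    factors = rng.uniform(1.0 - rel, 1.0 + rel, size=values.shape)
    return values * factors

# Hypothetical 2^3 full factorial design in natural units (8 runs, 3 factors).
# The low/high levels below are made up for illustration only.
levels = [(20.0, 40.0), (5.0, 9.0), (30.0, 60.0)]
design = np.array([[a, b, c]
                   for a in levels[0] for b in levels[1] for c in levels[2]])

perturbed = perturb(design, rng=np.random.default_rng(0))
print(perturbed.round(2))
```

The same function would be applied to the vector of output values. Because the perturbation is multiplicative, a coordinate that is exactly zero (for example a center point in coded units) is left unchanged, which is one reason the sketch uses natural units.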

Results and Discussion
We propose through this study that once a regression model has been developed using a DOE, a random perturbation of ±5% should be introduced into the design points and the output values to help assess the model performance. The perturbation is intended to represent the measurement and systematic errors introduced when performing experiments with limited measurement resolution.
We begin with an analysis of how the type of experimental design influences the robustness with respect to measurement errors. A summary of the 20 studies is given in Table 3. The data illustrate a significant potential for errors to influence the results with Box-Behnken designs, which produced an average MAPE of 180% and an average APE$_{90}$ of over 400%. In contrast, mixture designs seem to be more resilient to error, with the lowest percentage errors in both of the measures used. The percentages in Table 3 were obtained by averaging the MAPE and APE$_{90}$ of the four studies in each design group shown in Table 1. A second factor considered is the sample size, measured by the ratio of the number of design points to the number of predictor variables used in the final model. These ratios varied from a low of 1.25 to a maximum of 4.33. Surprisingly, an increase in this ratio does not necessarily decrease the MAPE. Sample size alone does not seem to be critical with respect to the robustness of the model. What is more important than sample size is the location of the design points within the design space. In particular, the design types with points located throughout the design space, especially at or near the center, are the designs with the lowest MAPE. To test this rigorously, we identified designs that include design points in the interior of the design space and compared them with those that contain points only on the boundary. The results indicate a significant reduction in the MAPE when interior points are included in the design.
The model (2) was obtained with a sample of 32 design points for 16 predictor variables, so the sample size is only twice the number of predictor variables. A potential source of error in the model given in equation (2) is the inclusion of 3-way and 4-way cross terms with relatively small coefficients. Terms such as these are very sensitive to small measurement errors. Unless there is extreme confidence in the measurement system, we believe that high-degree terms should not be included. Indeed, this is consistent with the "sparsity-of-effects principle", which states that a system is usually dominated by main effects and low-order interactions. The principle has been explained in depth by Wu & Hamada (2000). The residuals are essentially normally distributed with mean zero. However, the constant variation assumption is violated, with a large variation in the vertical spread, and the plot of residuals versus observation order shows a nonrandom pattern of decreasing residuals. Hence the residuals are related to time, which violates the assumption of time-independent residuals. After introducing random perturbations we obtained an MAPE of 135% and an APE$_{90}$ of 444%. Hence, the model is neither reliable nor robust to variations in the input data. The study (Prasad & Srivasta, 2009) exhibits similar problems: too many variables artificially inflate $R^2$, and the constant variation of residuals requirement is violated.
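The simulation behind the error measures can be sketched as follows. The definitions used here are assumptions, since this section does not restate them: the absolute percentage error (APE) is taken on each regression coefficient relative to the original fit, MAPE is the mean of those errors over all coefficients and replications, and APE$_{90}$ is their 90th percentile. The model form, the design, and the coefficient values are hypothetical and are not taken from any study in Table 1.

```python
import numpy as np

def model_matrix(design):
    """Intercept, main effects, and two-way interactions for three factors."""
    x1, x2, x3 = design.T
    return np.column_stack([np.ones(len(design)), x1, x2, x3,
                            x1 * x2, x1 * x3, x2 * x3])

def fit_ols(design, y):
    """Ordinary least-squares coefficients for the hypothetical model above."""
    coef, *_ = np.linalg.lstsq(model_matrix(design), y, rcond=None)
    return coef

def simulate(design, y, n_rep=1000, rel=0.05, seed=0):
    """Return (MAPE, APE90) in percent for the coefficients under perturbation."""
    rng = np.random.default_rng(seed)
    original = fit_ols(design, y)
    ape = np.empty((n_rep, original.size))
    for i in range(n_rep):
        # Perturb every design coordinate and every output by up to +/- rel
        d_new = design * rng.uniform(1 - rel, 1 + rel, size=design.shape)
        y_new = y * rng.uniform(1 - rel, 1 + rel, size=y.shape)
        ape[i] = 100.0 * np.abs((fit_ols(d_new, y_new) - original) / original)
    return ape.mean(), np.percentile(ape, 90)

# Hypothetical 2^3 design in natural units plus one center run (made-up levels)
levels = [(20.0, 40.0), (5.0, 9.0), (30.0, 60.0)]
design = np.array([[a, b, c]
                   for a in levels[0] for b in levels[1] for c in levels[2]]
                  + [[30.0, 7.0, 45.0]])
true_coef = np.array([10.0, 0.5, 2.0, 0.1, 0.02, 0.003, 0.01])  # made-up values
y = model_matrix(design) @ true_coef

mape, ape90 = simulate(design, y)
print(f"MAPE = {mape:.1f}%, APE90 = {ape90:.1f}%")
```

The default of 1,000 replications follows the count shown in Figure 1. Computing the errors on predicted responses instead of coefficients would only require changing the line that fills `ape`.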
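The sensitivity of high-order cross terms to measurement error can be illustrated with simple arithmetic. As a sketch, assume each of the $k$ factors in a cross term is nonzero and is independently multiplied by a number between 0.95 and 1.05, as in the perturbation scheme used here; the symbols $\tilde{x}_i$ for the perturbed values are introduced only for this illustration. The ratio of the perturbed product to the original product is then bounded as follows.

```latex
\[
0.95^{k} \;\le\; \frac{\tilde{x}_1 \tilde{x}_2 \cdots \tilde{x}_k}{x_1 x_2 \cdots x_k} \;\le\; 1.05^{k},
\qquad
\begin{array}{c|cc}
k & 0.95^{k} & 1.05^{k} \\ \hline
1 & 0.9500 & 1.0500 \\
2 & 0.9025 & 1.1025 \\
3 & 0.8574 & 1.1576 \\
4 & 0.8145 & 1.2155
\end{array}
\]
```

Under these assumptions a 4-way cross term can shift by roughly $-18.5\%$ to $+21.6\%$, about four times the range of a main effect, which is consistent with the recommendation to avoid high-degree terms when their coefficients are small.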
Mixture designs have recently been developed and are included as an option in most computer-aided experimental design software. For a comprehensive reference on mixture designs, see (Cornell, 1981). Unlike the previous two types of designs, mixture designs have more than half of their points in the interior of the design space, including one point in the center. In full factorial designs the design space is an n-dimensional hypercube, where n is the number of input variables. In a mixture design, however, each point gives the proportion of each input in the mixture. Hence, there is a constraint that requires the sum of the proportions to be one. This implies that the design space is an (n − 1)-dimensional simplex rather than a hypercube.
The study (Abdullah & Chin, 2010) utilizes a simplex-centroid design for optimizing the composting of kitchen waste. The response is the carbon to nitrogen ratio Y, and the model developed is shown in equation (3). Observe that the model contains 3 input variables and 3 interaction terms for a total of 6 predictor terms.

$$Y = 14.9x_1 + 8.2x_2 + 281.6x_3 + 17.7x_1x_2 - 273x_1x_3 - 509.3x_2x_3 \tag{3}$$

The experimental design consists of 13 design points, which yields a ratio of roughly 2 sample points per predictor variable. The model shown in (3) had a high goodness of fit, with an $R^2$ of 0.98. The residuals do not violate any of the residual assumptions. Perhaps a better assessment of accurate parameter estimation is the MAPE of only 2.7% and the APE$_{90}$ of 5.8% (Table 1). The other studies in Table 1 based on a mixture design all produced remarkably similar results with respect to the error measures MAPE and APE$_{90}$, even when the residual assumptions appeared to be violated. The mixture design is an example of a design that is very robust to small perturbations in the input data.
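To make the mixture constraint concrete, the sketch below builds a standard, unaugmented simplex-centroid design for three components and evaluates model (3) at each point. The function names are hypothetical, and the seven points generated here are the textbook simplex-centroid blends; they are not claimed to match the 13 design points used in the study, which are not listed in this section.

```python
from itertools import combinations
import numpy as np

def simplex_centroid(q):
    """All 2**q - 1 blends: each nonempty subset of components in equal parts."""
    points = []
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            p = np.zeros(q)
            p[list(subset)] = 1.0 / size
            points.append(p)
    return np.array(points)

def predict_eq3(points):
    """Evaluate the published model (3) at mixture points (rows sum to one)."""
    x1, x2, x3 = np.asarray(points, dtype=float).T
    return (14.9 * x1 + 8.2 * x2 + 281.6 * x3
            + 17.7 * x1 * x2 - 273.0 * x1 * x3 - 509.3 * x2 * x3)

design = simplex_centroid(3)
assert np.allclose(design.sum(axis=1), 1.0)  # proportions must sum to one

for point, y_hat in zip(design, predict_eq3(design)):
    print(point.round(3), round(float(y_hat), 2))
```

The three pure-component rows return the first three coefficients of (3) directly, which is a quick way to check that the model has been transcribed correctly.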
The next group considered is the Box-Behnken designs listed in Table 1, which are a special case of response surface designs. In (Gurkok et al., 2011) a three-level Box-Behnken design was used in an optimization study. The model developed is a second-order model with three independent variables and eight predictor terms. An inspection of the residual plots confirms the residual assumptions. When perturbations are introduced, the MAPE is 4.5% and the APE$_{90}$ is 8.3%. Clearly this design is robust with respect to the regression assumptions.
In the study (Anunziata & Cussa, 2008) a Box-Behnken design was also used to develop a model with 11 predictor terms using 27 data points. The model is given in equation (4).

$$Y = 21.6404 + 1.52833x_1 - 2.52083x_2 - 12.5292x_3 + 8.875x_4 - 2.17x_1x_2 - 0.005x_1x_3 - 9.475x_1x_4 - 0.0125x_2x_3 + 0.825x_2x_4 - 1.525x_3x_4 \tag{4}$$

Inspection of the residual plots does not reveal any violations. However, when applying perturbations to the design we obtained an MAPE of 674%, with an APE$_{90}$ of 1,679%. We believe that the model is hypersensitive to measurement error because of the low number of data points used in the study.
We observed an inconsistency among Box-Behnken designs with respect to the inclusion of interior points in the design. The design used in (Anunziata & Cussa, 2008) did not include any center points, whereas the designs used in (Gurkok et al., 2011), (Baskan & Pala, 2010), and (Dopar et al., 2011) included 6, 5, and 3 center points, respectively. The results obtained in (Baskan & Pala, 2010) and (Dopar et al., 2011) are much more reliable when subjected to random perturbations, with an MAPE of 33% and 8%, respectively.
The last group of regression models examined consists of those based on a central composite design listed in Table 1. Central composite designs are often used to obtain a quadratic model that will facilitate optimization. As illustrated in Table 3, this group would be ranked second best with respect to the error measures. The model that performed best when subjected to input perturbations was (Mohajeri et al., 2010), which involves 4 input variables and 30 design points: 16 of the points form a basic $2^4$ factorial design, 8 are axial points, and 6 are center points. The response Y for this model is the weathered crude oil removal percentage and is given in equation (5).

$$Y = 68.43 - 10.55x_1 + 3.66x_2 + 10.14x_3 + 5.57x_4 + 4.79x_2^2 - 14.89x_3^2 - 7.08x_4^2 - 2.28x_1x_2 + 8.63x_1x_3 + 1.81x_3x_4 \tag{5}$$

Observe that there are 11 predictor terms, which implies a run to predictor variable ratio of roughly 2.7. Moreover, there are no predictor terms with three factors. The axial points are identical to the center points except for one factor, which takes on values both below and above the median of the two factorial levels, typically both outside their range. Interestingly, the numbers of center points used in the central composite designs of Table 1 are 12, 3, 6, and 6, respectively, and the numbers of axial points used in these designs are 8, 0, 8, and 8, respectively. The design with 3 center points and 0 axial points was the only "non-robust" central composite design. As in the case of Box-Behnken designs, we see an inconsistency in the number of center and axial points used.
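The structure of such a design (factorial, axial, and center points) can be sketched in coded units as follows. The function name is hypothetical, and the axial distance `alpha=2.0` is an assumption: it is the common rotatable choice for four factors, $(2^4)^{1/4} = 2$, and is not stated in the text for (Mohajeri et al., 2010).

```python
from itertools import product
import numpy as np

def central_composite(k, alpha, n_center):
    """Coded central composite design: 2^k factorial + 2k axial + center runs."""
    factorial = np.array(list(product((-1.0, 1.0), repeat=k)))
    axial = np.zeros((2 * k, k))
    for j in range(k):
        axial[2 * j, j] = -alpha      # one factor below the factorial range
        axial[2 * j + 1, j] = alpha   # same factor above the factorial range
    center = np.zeros((n_center, k))
    return np.vstack([factorial, axial, center])

# 16 factorial + 8 axial + 6 center = 30 runs, matching the counts given above
ccd = central_composite(k=4, alpha=2.0, n_center=6)
print(ccd.shape)
```

Each axial row differs from the center point in exactly one coordinate, which matches the description of axial points given above.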

Conclusions
While the study was not intended to be comprehensive in the sense of testing a majority of designs, we present a simulation tool that investigators could use to test their designs and developed models for coefficient reliability. The study leads to the following conclusions:

• The reliability of a regression model is dependent on the type and parameters of the experimental design.
• Including numerous design points in the interior of the design space, such as center and axial points, increases the reliability and robustness of the model.
• Scientists should evaluate the robustness of the regression model assumptions before making inferences. Models with very good $R^2$ and p-values may not be very reliable when measurement errors are taken into consideration, leading to false inferences. Testing the regression model using simulation would provide a greater degree of confidence in the scientific inferences made. It may also provide new insight into the sensitivity of the scientific process being studied.
• High-order terms should be avoided as much as possible. Primary effect terms, two-way interactions, and squared terms are usually sufficient and lead to models that are much more stable than models with high-order terms.

Figure 1. The simulation process to test regression model reliability. Flowchart boxes: Model; Confirm Result; Use random numbers to simulate measurement error; Perform regression and obtain new polynomial model; Compare original model to new model; Calculate error, repeat 1,000 times; Compute summary statistics; Run hypothesis tests; Model Reliable?; Make Predictions and Inferences; Revise DOE and Model.

Table 3. A summary of the simulation results for the twenty studies examined