Measuring the Impact of Collinearity in Epidemiological Research

Collinearity amongst covariates in linear regression models has long been recognised as a potential source of bias. Various ‘solutions’ have been proposed, though one issue almost entirely omitted in the current literature is the importance of the relationship between the outcome and the correlated covariates. Using vector geometry, it can be shown that the impact of collinearity on the model, such as changes in regression coefficients, cannot be judged by the correlation structure of the covariates alone; their relationship with the outcome is crucial. Traditional diagnostics of collinearity are thus insufficient in evaluating adverse effects or model instability. Collinearity diagnostics should play an important role in assessing this impact, both adverse and beneficial, on model parameters. The objective of this study was to build a new index that measures the impact of collinearity in the model environment, rather than providing only a description of the feature. Vector geometry was used to design a measure, labelled the D-index, that accounts for the relationship between the outcome and the correlated covariates. The D-index was implemented as part of a regression study to develop a parsimonious model for body fat using easily obtainable body circumference measurements. The covariates were selected based on the degree of collinearity amongst the predictors in the model and the variance explained in the response. Such a model would potentially allow for a reduction in the number of body size measurements required, reducing study length and cost, whilst maintaining measurements that most accurately represent total body fat.


Introduction
In epidemiological and clinical research, it is not surprising to find that many covariates are correlated as they often share common physiological mechanisms, or measure different aspects of the same underlying mechanism. The question is not whether collinearity is an issue, but what the impact is on the modelling process. The least squares assumption that covariates are independent implies that all pair-wise covariate associations should be negligible, a most unlikely scenario for biological and epidemiological data. Small, but significant, departures from the assumption of independence can severely distort the interpretation of a model and the role of each covariate, causing increased inaccuracy as expressed through bias within regression coefficients and increased uncertainty as expressed through coefficient standard errors.
The variance inflation factor (VIF) (Marquardt, 1970; Stine, 1995) and condition index (CI) (Belsley, Kuh, & Welsch, 1980) are often labelled collinearity 'diagnostic' tools; however, this description is perhaps misguided. Collinearity itself is not a 'disease'. Symptoms such as a change of sign or an adverse change in the variance and point estimates may be considered 'problematic'. However, they are only problematic based on prior biological knowledge. In some circumstances, such as confounding, including a collinear variable in the model may improve the precision and accuracy of the assessment of a cause-effect relationship. These statistical measures are not 'diagnosing' a disease, but instead providing a description of a feature of the data. This description, along with external biological knowledge, should facilitate the process of deciding whether problematic collinearity exists in the data and whether any remedial action is necessary.
Collinearity indices such as the VIF and CI belong to a class of 'correlation based' diagnostics as the assessment rests entirely on the $X^{T}X$ matrix (i.e. the matrix of sums of squares and cross products of all predictors). The VIF is calculated as follows,
$$\mathrm{VIF}_j = \frac{1}{1 - R^2_{x_j}},$$
where $R^2_{x_j}$ is the explained variance of the variable $x_j$ regressed on the remaining predictors included in the model. Regardless of the chosen response entered into the model, the assessment of collinearity from a correlation based index such as the VIF will not change. The measure is providing a description of the collinearity present amongst the predictors only. This result may be of limited use in application. The researcher will hold an interest in understanding the potential impact of collinearity on the parameter estimates from the model and subsequently a potential impact on clinical and biological interpretation of the estimates. An arbitrary 'rule of thumb' will often be employed to indicate serious collinearity in a dataset. For instance, VIFs ranging from 4 to 30 have been previously used as an indication that severe collinearity is present in the data (O'Brien, 2007). This may encourage the use of remedial action (such as the removal of collinear variables, entering linear combinations as a single predictor or employing alternative, often more complex, methodology) to relieve or resolve the 'problems' of collinearity. If other factors had been accounted for in the initial assessment of the data, the need for such action may be much less than first thought.
The impact of collinearity on parameter estimates is governed by factors such as the response, the sample size and sampling variation. These are all features of the 'model environment'. In Figure 1 there is a rotation of the regression planes (labelled $P_{12}$ and $P_{34}$) that are spanned by the green and red pairs of predictors respectively (see Wickens (1995) for a description of the vector geometry and Draper and Smith (1998) for matrix approaches to regression analysis). This movement represents a change in the position of the response (e.g. a result of sampling variation) as the predictors are assumed to be measured without error (Freund & Wilson, 1998). When the response is closer to the regression plane in the green example (reflected by an increased coefficient of determination, $R^2_y$), a change in the slope of the plane will conceptually have less impact on the deviation of the coefficient estimates. Further to that, an increased correlation between the covariates (i.e. an increase in $r_{12}$, reflected by a reduced angle between vectors $x_1$ and $x_2$) would mean that small changes in the position of the response are amplified in the coefficient point estimates. This relationship of the response with the covariates mediates the impact of collinearity on the coefficients and standard errors. O'Brien (2007) demonstrates this on the variance of the estimates using a variance deflation factor (VDF).
Figure 2. Placing two models on common collinear axes
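The point that the VIF ignores the response can be illustrated with a minimal sketch. It assumes standardized variables, so that the VIFs are the diagonal of the inverse predictor correlation matrix and the multivariable coefficients solve the normal equations in correlation form. The correlation values and the helper names (`solve`, `vif`) are hypothetical and chosen only for illustration.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def vif(R_xx):
    """VIF_j = 1 / (1 - R^2_{x_j}): the j-th diagonal entry of inverse(R_xx)."""
    k = len(R_xx)
    out = []
    for j in range(k):
        e_j = [1.0 if i == j else 0.0 for i in range(k)]
        out.append(solve(R_xx, e_j)[j])
    return out


# Hypothetical predictor correlation matrix with r_12 = 0.75.
R_xx = [[1.0, 0.75],
        [0.75, 1.0]]
vifs = vif(R_xx)

# Two hypothetical responses paired with the same predictors.
r_xy_a = [0.60, 0.60]
r_xy_b = [0.60, 0.20]

# Change from univariable (r_yj) to multivariable (b_j) standardized coefficients.
change_a = [b - r for b, r in zip(solve(R_xx, r_xy_a), r_xy_a)]
change_b = [b - r for b, r in zip(solve(R_xx, r_xy_b), r_xy_b)]

print("VIF, identical for both responses:", vifs)
print("coefficient change, response a:", change_a)
print("coefficient change, response b:", change_b)
```

The VIF is computed without reference to either response, yet the change in coefficients differs between the two, and under the second response the coefficient on the second predictor changes sign.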
The VIF represents a multiplicative factor of the inflation in variance against a baseline of independence (i.e. $r_{12} = 0$, VIF = 1). If we were to consider a VIF · VDF measure, it would place the impact of collinearity on the variance of a coefficient in the context of the model environment. The VIF (and similarly VIF · VDF) is a measurement on each of the predictors in the model; however, such a measure is often difficult to interpret without a 'global' indicator of the collinearity present in a model. It will also not indicate which covariates are involved in linear dependencies (Belsley et al., 1980). In section 2.1 we develop a new index motivated by vector geometry that incorporates the covariance structure between the predictors and the response. In section 2.2 this concept is extended to the general case to provide a measure of the impact in regression models with $k > 2$ predictors. In section 2.3 we further develop the measure to identify the individual role of each predictor in contributing to the observed 'global' impact. Finally, in section 3 we provide an illustrative regression study with the interpretation of the results discussed and compared to existing correlation based indices.

The Development of a Covariance Based Collinearity Index
A researcher should not rely exclusively on study data to assess the validity of a model. External information should be incorporated into the analysis to tailor the assessment to a particular discipline or setting. If we consider the impact of collinearity on an estimate to be a 'problem', then to assess that 'problem' we need an idea of what the population structure is. For instance, suppose $x_1$ and $x_2$ are two uncorrelated predictors in a population, but the sample observations are correlated. If we believe the population values to be uncorrelated (i.e. a prior assumption), the expectation is that the multivariable regression coefficients on both $x_1$ and $x_2$ are unchanged compared to their univariable regression coefficients (i.e. the regression models with only $x_1$ or $x_2$ entered). The fact that the sample values are correlated causes the univariable and multivariable estimates to differ. One feature of a 'covariance based' index would be to indicate the 'magnitude' of this deviation in the sample from this (or any other) chosen baseline. Another potential use is in model selection by comparing regression models with different predictors entered. Similarly, this may involve comparing univariable coefficients (as a baseline) to different multivariable regression models. In whichever application the index is required, the motivation remains to measure the deviation of the point estimates between models to illustrate the impact of collinearity on an expectation or a sample estimate.
To measure the disparity between two sets of estimates, we need to put them in a comparable setting. In vector geometry, we could do so by considering both in a common space. The traditional vector geometry representation is to project the response $y$ orthogonally onto the regression space spanned by $X$ (i.e. the vectors $x_1$ and $x_2$ in the bivariable example). The fitted response (labelled $\hat{y}_{k=2}$) is then projected orthogonally onto the covariate vectors to find the univariable point estimates, and parallel to the complementary covariate vector to find the multivariable estimates (see Figure 2a). The distance between the projections would indicate the change in point estimates on each predictor moving from the univariable to the multivariable model (shown in blue). To gain a 'global' measure of this impact on the overall model we choose to represent the individual univariable estimates as a single multivariable model involving orthogonal predictors (see Figure 2b). This is achieved by identifying an alternative fitted response that, when projected parallel to the covariates (analogous to the construction of $\hat{y}_{k=2}$), would attain the univariable point estimates rather than the multivariable ones. The distance between $\hat{y}_{k=2}$ and this alternative fitted response is the product of the change in regression coefficients, relative to the collinearity in the model (shown by the green line in Figure 2b). Using the alternative fitted response removes the effect of a non-orthogonal projection and places the estimates from the univariable models on the collinear axes. This movement from $\hat{y}_{k=2}$ to the alternative fitted response represents a global measure of coefficient deviation, which we label $D_2$.
Figure 3. An illustration of the computation of components $\alpha_2$ and $\beta_2$
For the bivariable example, the index $D_2$ (with the subscript denoting the two covariates entered into the model) is found to be $R_y \cdot r_{12}$ (see Appendix for a derivation of this result). The proof divides the index into two components. The first, labelled $\alpha_2$, is the deviation parallel to $x_1$ and the second, labelled $\beta_2$, is the deviation orthogonal to $x_1$ (see Figure 3). To explain these components further, first consider a single predictor model including only $x_1$ (i.e. a simple regression), which naturally assumes a zero impact of collinearity (i.e. $D_1 = 0$). A second predictor $x_2$ is then added to this model to generate an impact demonstrated by a non-zero $D_2$ (unless $x_2$ is uncorrelated with $x_1$ or neither predictor explains any variance in the response). The unadjusted variance explained by $x_2$ is $r_{y2}$ (this quantity is demonstrated as a distance from the origin along the vector $x_2$). This variance on $x_2$ can be divided into a portion that is 'overlapped' with $x_1$ (i.e. $\alpha_2$) and a portion of the variance explained by $x_1$ confounded with $x_2$ (i.e. $\beta_2$). The component $\alpha_2$ is demonstrated by a simple regression of the geometrical point $r_{y2}$ (on the vector $x_2$) onto $x_1$ (i.e. $r_{y2} \cdot r_{12}$). The second component $\beta_2$ is the residual variance of $r_{y2}$ from this regression, subtracting the semi-partial correlation of $x_2$ with $y$ (i.e. the $r_{y2}$ variance attributed to $x_2$ only). This is found to be $r_{12} \cdot sr_{y1}$. Therefore, we are projecting two components of the fitted response $\hat{y}_{k=2}$ (i.e. $r_{y2}$ and $sr_{y1}$, where $sr_{y1}$ denotes the semi-partial correlation of $y$ with $x_1$) onto vectors with which they would have zero correlation at baseline. Any deviation of these components away from zero will represent an impact of collinearity demonstrated by a deviation in the point estimates.
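The way the two components combine into the stated result can be written out as a short worked equation. This is a sketch only, not the Appendix proof. It assumes that $sr_{y1}$ is the semi-partial correlation of $y$ with $x_1$ after $x_2$ has been removed from $x_1$, that $\alpha_2$ and $\beta_2$ are orthogonal so their squares add, and it uses the standard identity $R^2_y = r^2_{y2} + sr^2_{y1}$ for a two predictor model.

```latex
\begin{align*}
\alpha_2 &= r_{y2}\, r_{12}, \qquad \beta_2 = r_{12}\, sr_{y1}, \\
D_2^2 &= \alpha_2^2 + \beta_2^2
       = r_{12}^2 \left( r_{y2}^2 + sr_{y1}^2 \right)
       = r_{12}^2\, R_y^2, \\
D_2 &= R_y \cdot \lvert r_{12} \rvert .
\end{align*}
```

Under these assumptions $D_2$ vanishes exactly when $r_{12} = 0$ or $R_y = 0$, matching the two exceptions noted in the text.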

The bivariable index $D_2^2 = (R_y \cdot r_{12})^2$ (squared to make the magnitude comparable to variance based diagnostics such as the VIF) represents the impact on the coefficient point estimates associated with the collinearity amongst the predictors and also the covariates' relationship with the response. The composite direction vector formed by the two univariable regression coefficients (the alternative fitted response) is in the covariance maximizing direction on a single dimension. This is equivalent to a one component partial least squares regression (PLS) (Phatak & Dejong, 1997; Wold, Sjostrom, & Eriksson, 2001). The $\hat{y}_{k=2}$ vector represents the OLS estimation. By definition, this is the covariance maximizing direction in the bivariable model, thus equivalent to a PLS regression with a full complement of components retained. The D-index can therefore be understood as measuring the distance between an uncorrelated composite single dimension vector and the collinear predictors of the bivariable model. From the vector geometry, the squared length of the composite vector is equal to the summation of the two univariable $r^2_y$ estimates, relative to the collinearity present (i.e. $r^2_{y1} + r^2_{y2} + 2 r_{y1} r_{y2} r_{12}$ by the cosine rule). The length of $\hat{y}_{k=2}$ is equal to the $R_y$ found in the sample. The D-index represents the impact of collinearity on the point estimates of moving from an uncorrelated prior to a correlated sample estimate, or similarly the impact of collinearity in adding a second predictor to a 'simple' regression model.
'Correlation based' indices would be unable to distinguish between the examples in Figure 4 as the correlation amongst the covariates is identical. As illustrated by the geometry in Figure 4b, the movement in the point estimates is far greater than in Figure 4a and a change of sign has occurred on $x_2$. The change of sign may not be of particular interest statistically, but it could represent a potential change in the clinical interpretation. The D-index does not directly indicate a change of sign; rather, the greater propensity for a change of sign is reflected by an increased D-statistic (i.e. a greater movement). Under sampling variation, these deviations can become inflated or dampened with a potential impact on the conclusions of the study.
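The bivariable index can be checked numerically with a small sketch. It assumes standardized variables, so that the covariate vectors have unit length with cosine $r_{12}$ between them, and it computes the distance between the OLS fitted response and the composite of the univariable coefficients directly before comparing it with $R_y \cdot r_{12}$. The two sets of correlations are hypothetical and only loosely mimic the contrast drawn in Figure 4; the function names are illustrative.

```python
import math


def multivariable_coefficients(r_y1, r_y2, r_12):
    """Standardized two-predictor OLS coefficients from the correlations."""
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom


def d2_geometric(r_y1, r_y2, r_12):
    """Length of the vector joining the two fitted responses on the collinear axes."""
    b1, b2 = multivariable_coefficients(r_y1, r_y2, r_12)
    d1, d2 = b1 - r_y1, b2 - r_y2  # coefficient change on x1 and x2
    # Cosine rule for d1 * x1 + d2 * x2 with unit vectors at cosine r_12.
    return math.sqrt(d1 ** 2 + d2 ** 2 + 2.0 * d1 * d2 * r_12)


def d2_closed_form(r_y1, r_y2, r_12):
    """D_2 = R_y * |r_12|, with R_y the multiple correlation."""
    b1, b2 = multivariable_coefficients(r_y1, r_y2, r_12)
    r_y = math.sqrt(b1 * r_y1 + b2 * r_y2)
    return r_y * abs(r_12)


r_12 = 0.75               # identical predictor correlation in both examples
example_a = (0.30, 0.30)  # hypothetical (r_y1, r_y2)
example_b = (0.80, 0.40)  # hypothetical (r_y1, r_y2)

for name, (r_y1, r_y2) in (("a", example_a), ("b", example_b)):
    print(name,
          "coefficients:", multivariable_coefficients(r_y1, r_y2, r_12),
          "D_2 geometric:", d2_geometric(r_y1, r_y2, r_12),
          "D_2 closed form:", d2_closed_form(r_y1, r_y2, r_12))
```

Both examples share the same VIF, but the second produces the larger $D_2$ and a negative coefficient on $x_2$.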

Extension of the Bivariable Case to a General Index
For the index to be of use in application it is important that it can be extended to models with $k > 2$ predictors. We consider two options for extending this measure. First we look for the additional impact on the existing bivariable model (including $x_1$ and $x_2$) of adding a third predictor $x_3$ (labelled $\dot{D}_3$), and second the impact of collinearity on a baseline model that assumes orthogonality amongst all of the predictors (labelled $D_3$). Figure 5 illustrates the vector geometry for the three predictor regression model. The fitted response of the trivariable model $\hat{y}_{k=3}$ is first projected orthogonally onto the covariate vectors $x_j$ to obtain the individual $r_{yj}$ (i.e. regression coefficients from each univariable model). The $r_{yj}$ are then projected along the plane formed by the remaining two predictors to identify the alternative fitted response for $k = 3$, representing our baseline model of orthogonality amongst the covariates. The distance between $\hat{y}_{k=3}$ and this baseline fitted response forms the new $D_3$ (analogous to the $D_2$ computation). The fitted response in the three predictor model $\hat{y}_{k=3}$ is an extension of $\hat{y}_{k=2}$ in the direction orthogonal to the plane spanned by $x_1$ and $x_2$. The orthogonality with the plane demonstrates that this extension represents a partial correlation between $y$ and $x_3$, whilst holding $x_1$ and $x_2$ constant; we label this correlation $pr_{y3 \cdot 12}$. The baseline fitted response for $k = 3$ is an extension of its $k = 2$ counterpart in the direction of $x_3$ with length (i.e. variance) equal to $r_{y3}$.
We first consider the calculation of the additional impact of adding $x_3$ to the already assumed $D_2$ impact from the bivariable case. First, we project $D_3$ onto the 2-dimensional plane spanned by the vectors $x_1$ and $x_2$. The projected $D_3$ demonstrates the overlap between $x_3$ with variance $r_{y3}$ and the existing predictors in the model (i.e. $x_1$ and $x_2$), analogous to $\alpha_2$ in the bivariable index. This projection is labelled $\alpha_3$, which is composed of $D_2$ and a further component $\gamma_3$ (see Figure 5). Following the previous construction of $D_2$ we compute $\gamma_3$ as two components. The first is parallel to $x_1$, found by an orthogonal projection of $x_3$ with variance $r_{y3}$ onto $x_1$, which gives $r_{y3} \cdot r_{13}$ (Equation 3). This represents the overlap of $x_3$ (of length $r_{y3}$) and $x_1$. The second is in the direction orthogonal to $x_1$ in the plane spanned by $x_1$ and $x_2$. This demonstrates that $x_1$ is held constant, thus defining a semi-partial correlation between $x_3$ and $x_2$, holding $x_1$ constant (labelled $sr_{23}$).
Therefore, $\gamma_3$ is calculated as the squared sum of these two orthogonal components (Equation 5). From Equation 5 we have an extension to $D_2$ in the plane spanned by $x_1$ and $x_2$ after adding $x_3$ to the model. Finally, there is an additional deviation that would represent the new $\beta$ component (labelled $\beta_3$). $\beta_3$ represents a deviation of the coefficients in a dimension orthogonal to the computation of $D_2$. The vector geometry illustrates that this is a projection of the remaining explained variance of $\hat{y}$ (i.e. the component of $y$ orthogonal to $x_3$) onto an arbitrary axis orthogonal to the plane spanned by $x_1$ and $x_2$. There is a residual from $x_3$ (of length $r_{y3}$) after regressing on $x_1$ and $x_2$. This residual is composed of $pr_{y3 \cdot 12}$ and $\beta_3$ (analogous to our proof for $D_2$ with the residual composed of $sr_{y2 \cdot 1}$ and $\beta_2$). Therefore, $\beta_3$ is an impact of collinearity representing the explained variance of the original model confounded with $x_3$.
The index $\dot{D}_3$ can be calculated as the squared sum of the components $\gamma_3$ and $\beta_3$, that is, $\dot{D}_3^2 = \gamma_3^2 + \beta_3^2$. Returning to the vector geometry, we can summarise the computation of $\dot{D}_3$. The response $\hat{y}$ has been split into two components ($r_{y3}$ and $sr_{y2} + pr_{y1 \cdot 23}$). We project $r_{y3}$ onto the surface spanned by $x_1$ and $x_2$ (with which it would have zero correlation if $x_3$ were uncorrelated with the baseline model) and project the second component ($sr_{y2} + pr_{y1 \cdot 23}$) onto $x_3$ (which would similarly be uncorrelated at baseline). However, if a correlation is present it will generate a deviation of the point estimates represented by a non-zero $\dot{D}_3$. The advantage of using this measure (i.e. the bivariable model as baseline) is that the interpretation is much the same as the example for $D_2$. We again have two components of the response to project, only in this example one component represents a baseline model with the explained variance of two predictors rather than one.
The second index $D_3$ is an impact of collinearity in moving from uncorrelated covariates at baseline to the three predictor model. To see the difference, if $x_3$ had zero correlation with both $x_1$ and $x_2$, the $\dot{D}_3$ would always be zero. However, $x_1$ and $x_2$ could still be correlated and so an impact on the point estimates from baseline orthogonality would still be seen, but it would be represented solely in the $D_2$. Now we look for an overall impact of collinearity to give $D_2$ and $D_3$ a common baseline for comparison. In this computation we place an emphasis on $x_3$ by considering the first component of $\hat{y}$ to be $r_{y3}$ (followed by $sr_{y1 \cdot 3}$ and $pr_{y2 \cdot 13}$); however, any construction of $\hat{y}$ would produce the same 'global' result. In the $D_3$ measure we once again split $\alpha_3$ into two components. The first deviation component is parallel to $x_1$, which is the summation of $\alpha_2$ and the component of $\gamma_3$ parallel to $x_1$. This represents the portion of explained variance from $x_2$ and $x_3$ overlapped with $x_1$. There is a second component of this impact in the plane spanned by $x_1$ and $x_2$. This consists of the shared variance of $r_{y3}$ with $x_2$, whilst holding $x_1$ constant. This is represented by the addition of the component of $\gamma_3$ orthogonal to $x_1$ and $\beta_2$. The final component of $D_3$ is the deviation orthogonal to the plane spanned by $x_1$ and $x_2$; this is the component $\beta_3$, identical to that computed for $\dot{D}_3$. The $D^2_3$ demonstrating the overall impact from the baseline of orthogonality can now be expressed as the squared sum of these three components. In both measures, the components of the response are projected onto vectors which would have zero correlation at the specified baseline. This deviation contributes to the global measure. We can extend these measures generated for the trivariable model to the general case with $k$ predictors.
Figure 6. Computation of the angle between $D_2$ and $x_1$
We suggest that our measure in higher dimensions represents a generalized form of $R_x \cdot R_y$. The first index $\dot{D}^2_k$ measures the impact of adding a single predictor to a baseline model (assumed as the model including $k - 1$ predictors). The second index $D^2_k$ assumes the predictors to be uncorrelated at baseline and incorporates the previous $D^2_{k-1}$ impact as part of an overall measure.
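For readers who want to experiment with the global index for $k$ predictors, the following is a minimal sketch rather than the component-wise construction described above. It assumes, following the description of $D_3$ as the distance between the OLS fitted response and the baseline fitted response built from the univariable coefficients, that the squared global index can be computed as the squared length of the coefficient change on the collinear axes. The function names and the three predictor correlations are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def d_squared_global(R_xx, r_xy):
    """Assumed form: squared distance between the OLS fit and the univariable composite.

    R_xx is the predictor correlation matrix and r_xy the predictor-response
    correlations. Returns (D_k squared, R_y squared).
    """
    b = solve(R_xx, r_xy)                        # multivariable coefficients
    diff = [bj - rj for bj, rj in zip(b, r_xy)]  # change from univariable
    k = len(diff)
    d_sq = sum(diff[i] * R_xx[i][j] * diff[j] for i in range(k) for j in range(k))
    r2_y = sum(bj * rj for bj, rj in zip(b, r_xy))
    return d_sq, r2_y


# Hypothetical three predictor example.
R_xx = [[1.0, 0.5, 0.4],
        [0.5, 1.0, 0.6],
        [0.4, 0.6, 1.0]]
r_xy = [0.4, 0.6, 0.5]

d_sq_3, r2_y_3 = d_squared_global(R_xx, r_xy)
print("D_3 squared:", d_sq_3, "R_y squared:", r2_y_3)
```

With two predictors this quantity reduces to $(R_y \cdot r_{12})^2$, and with orthogonal predictors it is zero, which is the behaviour described for the baseline of orthogonality. The incremental index is not reproduced here because its component formulae are not fully stated in this section.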

Measurement of Impact on Individual Predictors
The global D-index only partly achieves our original goal in creating a regression tool for applied research. It is useful to highlight when there exists a high impact of collinearity on the point estimates; however, it will not indicate which covariate contributes a greater impact to the deviation (a similar limitation to the VIF). This is the strength of an index such as the CI on the correlation matrix of the covariates. A feature of the D-index that we have ignored to this point is the direction of the deviation. In coordinate free vector geometry, the direction is relative to the collinear axes of the covariates. Therefore, we choose to focus on the angle between each covariate and the deviation $D^2_k$ (which we now consider in vector form, $D_k$). The angles (that in turn provide correlations) can perform a similar role to variance decomposition proportions alongside the CI in identifying which predictors are involved in a near dependency (Belsley, 1991). The vector geometry in Figure 6 demonstrates that each correlation (i.e. the cosine of $\theta_{D1}$) can be calculated in the bivariable model as the ratio of $r_{yj}$ and $R_y$. For example, $\cos\theta_{D1} = \alpha_2 / D_2 = r_{y2} / R_y$. The component $\alpha_2$ is redefined for the target variable whose contribution to the impact of collinearity we wish to identify. The correlation with $D_2$ is computed by setting arbitrary axes parallel and orthogonal to the target covariate. Therefore, if we are adding $x_1$ to the simple regression model consisting of the predictor $x_2$, the arbitrary axis would be formed parallel to $x_2$ and represent the degree to which $r_{y1}$ is explained by $x_2$. Scaling by $D_2$ removes the inflation effect of collinearity, thus normalizing the quantity to place the estimate on a scale of 0 to 1. If the explained variance on each predictor in the univariable models is equal, then the correlations with $D_2$ (in the bivariable case) will be equally split. However, if the ratio is larger on one covariate, then the covariate with the weaker correlation to the response will have a greater association with $D_2$. This dictates the direction of global change (i.e. $D_2$) to be greater in the direction of the covariate with the weaker correlation to the response. Extending to the general case, the calculation remains similar with the correlation calculated as the ratio of $\alpha_k$ to $D_k$.
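The bivariable correlations with $D_2$ can be sketched in a few lines. The sketch assumes the ratio form described above, in which the correlation of a covariate with $D_2$ is the other covariate's univariable correlation divided by $R_y$, so that the weaker predictor carries the larger association. The input values are those quoted for the $x_2$, $x_4$ model in the Example section; the function name is hypothetical.

```python
import math


def bivariable_d_correlations(r_ya, r_yb, r_ab):
    """Return D_2 and the correlation of each covariate with D_2.

    r_ya, r_yb are the response correlations of covariates a and b, and
    r_ab is the correlation between the two covariates.
    """
    r2_y = (r_ya ** 2 + r_yb ** 2 - 2.0 * r_ya * r_yb * r_ab) / (1.0 - r_ab ** 2)
    r_y = math.sqrt(r2_y)
    d_2 = r_y * abs(r_ab)
    r_da = abs(r_yb) / r_y  # alpha / D_2 with covariate a as the target
    r_db = abs(r_ya) / r_y  # alpha / D_2 with covariate b as the target
    return d_2, r_da, r_db


# Correlations quoted in the Example section for the x2, x4 model.
d_2, r_d2, r_d4 = bivariable_d_correlations(0.49, 0.49, 0.73)
print("D_2:", d_2, "r_D2:", r_d2, "r_D4:", r_d4)
```

With equal univariable correlations the two associations are equal, and both are large even though $D_2$ itself is modest, which is the pattern the Example section describes for this pair.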

Example
The index was applied to data from a study by Penrose et al. (1985). The study recorded percentage body fat and several body circumference measures of 252 men. We used the data to explore the inter-relationship between body composition and external measurements of different body circumference variables, and how these highly correlated variables can then be used to create the optimal model to explain percentage body fat. The aim was to discover which subset of easily measurable body circumference measurements ($x_1$ = neck, $x_2$ = abdomen, $x_3$ = biceps, $x_4$ = ribs) could be used to represent body fat (see Table 1 for correlations between the predictors and the response). This would allow a reduction in the number of measurements required, reducing study length, cost and participant burden whilst maintaining the measurements that most accurately represent total body fat.
In the two predictor models (see Table 2), the greatest impact of collinearity is highlighted for the model involving $x_2$ and $x_3$. This coincides with the maximal correlation and hence the maximal VIF ($r_{23} = 0.75$, $\mathrm{VIF}_{23} = 2.29$). The $r_{D2}$ demonstrates that the covariate with the greater correlation to the response is $x_3$, highlighting that $x_2$ provides the greater contribution to the impact of collinearity in the model. Studying the correlations between covariates indicates that the model involving $x_2$ and $x_4$ has a similarly high correlation ($r_{24} = 0.73$). For this example the variance in $y$ explained by both predictors is low ($r_{y2} = 0.49$, $r_{y4} = 0.49$) and so the impact of collinearity on the model has been limited by the low model $R^2_y$. However, both $r_{Dj}$ are large, indicating that the collinearity is high. Therefore, the individual predictors perform an important role in indicating a potential 'problem' even when the global inflation indicated by $D_2$ is low.
In each bivariable model the covariate $x_1$ had the strongest correlation with $D_2$. This is demonstrated by the low correlation with $y$ (i.e. $r_{y1} = 0.35$). We also observe that $x_3$ consistently had the lowest correlation with $D_2$ (for any model), suggesting that it would be a useful predictor to include in the model due to its high explanatory power. A confidence interval for the two predictor models (shown in parentheses in Table 2) was generated using the standard error of $R^2_y$ (Cohen, 2003), whilst $r_{12}$ is fixed (due to predictors assumed to be measured without error). A confidence interval for the three and four predictor models was bootstrapped using a "leave one out" approach (Tukey, 1958). We notice that $R^2_y$ does not increase greatly beyond the two predictor model that included $x_1$ and $x_3$. The correlations $r_{Dj}$ indicate that $x_1$ was the main contributor to this impact of collinearity in the bivariable model. We observe a very moderate increase in $R^2_y$ after including $x_2$ in this model. However, with this inclusion our D-index has increased from 0.27 to 0.89. We can calculate the additional impact of adding the predictor $x_2$ to this model as $\dot{D}^2_3 = 0.49$. This would appear high when viewed alongside other bivariable measures to attain a small increase in $R^2_y$. We notice that when $x_3$ is entered into the model along with $x_1$ and $x_2$ the $r_{Dj}$ are equal for both $x_1$ and $x_2$. In comparison, when $x_4$ is added to the model with $x_1$ and $x_2$, $r_{D1}$ is greater than $r_{D2}$. This demonstrates how the role of each predictor changes dependent on others entered into the model. In the full four predictor model the $R^2_y$ reaches 0.71; however, the deviation peaks at 1.77. The collinearity structure of the four predictors and the variance explained suggests that $x_2$ is the greatest contributor to this impact of collinearity. This is a change from $x_1$ being consistently high in previous models. Observing model parsimony would seem to discount the four predictor model. Removing $x_2$ produces a model with a high $R^2_y$ (0.70) and moderately low $D^2_3$ (0.79). Excluding $x_2$ from the model would not seem obvious from only observing the three predictor models (due to the consistently high $r_{D1}$); however, noticing the impact in the full model has highlighted the statistical dependency of this predictor with others in the study.
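The "leave one out" idea can be sketched as a jackknife on the squared bivariable index. This is an illustration only: the data are simulated rather than taken from the body fat study (only the sample size of 252 is borrowed from it), the interval uses a normal approximation with the jackknife standard error, and the two predictor case is used purely for brevity even though the study applied the approach to the three and four predictor models. All function names are hypothetical.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def d2_squared(y, x1, x2):
    """Squared bivariable index, (R_y * r_12) squared, from raw data."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    r2_y = (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)
    return r2_y * r_12 ** 2


def jackknife_interval(y, x1, x2, z=1.96):
    """Leave-one-out estimates, jackknife standard error, normal-approximation interval."""
    n = len(y)
    estimate = d2_squared(y, x1, x2)
    loo = [d2_squared(y[:i] + y[i + 1:], x1[:i] + x1[i + 1:], x2[:i] + x2[i + 1:])
           for i in range(n)]
    mean_loo = sum(loo) / n
    se = math.sqrt((n - 1) / n * sum((v - mean_loo) ** 2 for v in loo))
    return estimate, estimate - z * se, estimate + z * se


# Simulated data (hypothetical), with correlated predictors.
random.seed(0)
n = 252
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [0.7 * a + random.gauss(0.0, 0.7) for a in x1]
y = [0.5 * a + 0.3 * b + random.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

estimate, lower, upper = jackknife_interval(y, x1, x2)
print("D_2 squared:", estimate, "interval:", (lower, upper))
```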

Discussion
From a model building perspective the bivariable model with $x_1$ and $x_3$ included as predictors would appear optimal. This model explained a high variance of the response and had a relatively low $D_2$. We can demonstrate that adding the predictor $x_2$ to this model, whilst moderately increasing the $R^2_y$, would generate a high deviation in the point estimates of the existing model. Also, when considering the full set of predictors, $x_2$ would seem to have the greatest impact. Therefore, if any predictor were to be added to the bivariable model, $x_4$ would seem the better option from a collinearity perspective. However, adding $x_4$ does not increase the explanatory power of the model and so this may not be wise. This example has been much simplified as we have not considered the nature of any causal relationships amongst the covariates. This would raise the complexity of the problem and our understanding of incorporating collinearity in the model. We are instead focussing on the purely statistical aspect of what our measure indicates.
The greatest change in impact from $D_2$ to $D_3$ is after adding $x_3$ to the model including $x_2$ and $x_4$. However, $x_3$ contributes the greatest explained variance individually ($r_{y3} = 0.81$) and so including this predictor would appear a sensible decision. It is labelled the most beneficial of the predictors by our correlations with $D$, suggesting $x_1$ and $x_2$ contribute greatly to the impact of collinearity in this model. The high change in global impact could be misleading if it were interpreted as a measure of some collinearity 'problem'. This is why it would seem beneficial that any change in D-index between models be interpreted alongside the $R^2_y$. If little explained variance is gained by including an additional predictor in the model, but the deviation is high, then this should perhaps be viewed as a potential warning (based on the conceptual model employed) of the impact of collinearity on the model estimates.
If the global inflation is small, but the correlations between each predictor and $D$ are large, this would suggest a high degree of collinearity that is being moderated by a low $R^2_y$.

Concluding Remarks
In this study we have demonstrated the important role that the response plays in mediating the impact of collinearity in an applied regression study. This has demonstrated the need for a collinearity index, not simply to describe the degree of collinearity amongst the covariates, but to identify the potential impact on the variance and point estimates in relation to the response entered into the model. We have developed a novel index based on vector geometry and regression theory that assesses a global deviation in the point estimates and analyses the role of each predictor in contributing to this effect. When interpreting the D-index for use in model building it may appear conceptually appealing to assume a greater $R^2_y$ to be beneficial to the estimation. This is how variance based inflation (such as the VIF · VDF) would be interpreted and to some would seem a more natural metric. However, it is important to stress that the D-index is not necessarily measuring a 'problem'. A high $R^2_y$ will inflate the point estimates under collinearity and this is subsequently reflected in our index. If we were comparing to a baseline prior, whether that be a zero correlation or some 'guesstimate' of a population correlation, then we would wish to know the deviation of the estimates away from our expectation. This is not representing a biased or 'wrong' estimate, but a population prior that is not reflected in the single sample case. Therefore, if a greater $R^2_y$ inflates the change in coefficients, then we would wish to know the degree of inflation in the sample data.
The D-index could be developed in the future to produce a more natural interpretation for model building. Replacing explained variance with some reciprocal estimate could deflate the impact. However, collinearity remains a complex feature in application and the development of a statistical index still requires a very careful conceptual understanding to be of benefit in application. The work in this paper should only be viewed as a starting point for future methodological development and simulation studies. An achievement in the development of this index is in the use of vector geometry to create the measure and to interpret it. One of the reasons for proposing the geometric alternative to the VIF · VDF is that it allows flexibility to incorporate different a priori assumptions. This would be achieved by varying angles of projection to reflect different correlations.

Figure 1. An illustration of the role of the response dictating the impact of collinearity

Figure 4. Vector geometry illustrating two examples with equal $r_{12}$

Table 1. Pearson correlations for the body fat study ($y$ = body fat, $x_1$ = neck, $x_2$ = abdomen, $x_3$ = biceps, $x_4$ = ribs)

Table 2. Results from the D-index for the four predictor body fat study