Designing a Pseudo R-Squared Goodness-of-Fit Measure in Generalized Linear Models

The coefficient of determination is a function of the residuals in general linear models. The deviance, logit, standardized and studentized residuals were examined in generalized linear models in order to determine the behaviour of residuals in this class of models and thereby design a new pseudo R-squared goodness-of-fit measure. The Newton-Raphson estimation procedure was adopted. It was observed that these residuals exhibit patterns that are unique to the subpopulations defined by the levels of the categorical predictors. The residuals form blocks according to sign: positive residuals correspond to success responses and negative residuals to failure responses. It was also observed that the deviance residual is a close approximation of the studentized residual, and that the logit residual is twice the size of the standardized residual. Borrowing from Nagelkerke's improvement of Cox and Snell's goodness-of-fit measure in generalized linear models and from its counterpart in the general linear model, the coefficient of determination, a new pseudo R-squared goodness-of-fit test that uses predicted probabilities and a monotonic link function is proposed here to serve both general linear and generalized linear models.


Introduction
A generalized linear model is one in which each component of the response variable $Y$ has a distribution in the exponential family, taking the form

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$

for some specific functions $a(\cdot)$, $b(\cdot)$ and $c(\cdot, \cdot)$ (McCullagh & Nelder, 1990). The functions $a$ and $c$ are such that $a(\phi) = \phi/w$ and $c = c(y, \phi/w)$, where $w$ is a known weight for each observation. The model can be stated in terms of the adjusted dependent variate $z_i$, the $(i, j)$th element $x_{ij}$ of the design matrix, the link function $h(\mu_i)$ and the residual error $e_i$. A further expression links $y_i$ to $z_i$, where $y_i$ is a binomial random response variable.
From (1), a residual in a generalized linear model can be defined as $e_i$; a residual so defined is called the Pearson residual.
Standard theory for this type of distribution expresses the mean and variance of the response $y$ as $E(y) = \mu$ and $\operatorname{Var}(y) = a(\phi)\,V(\mu) = \phi\,V(\mu)/w$, where $V$ is the variance function.
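As a worked illustration of the exponential family form, the Bernoulli case (a binary response with success probability $\pi$) can be written out explicitly. The symbols $\theta$, $b$, $a$ and $c$ follow the general form in McCullagh & Nelder; the particular identification below is a standard textbook derivation and is not taken from this paper.

```latex
\begin{align*}
f(y;\pi) &= \pi^{y}(1-\pi)^{1-y}, \qquad y \in \{0, 1\} \\
         &= \exp\left\{ y\log\frac{\pi}{1-\pi} + \log(1-\pi) \right\} \\
         &= \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\},
\end{align*}
with $\theta = \log\{\pi/(1-\pi)\}$, $b(\theta) = \log(1 + e^{\theta})$,
$a(\phi) = 1$ and $c(y,\phi) = 0$, so that
\begin{align*}
E(y) &= b'(\theta) = \frac{e^{\theta}}{1+e^{\theta}} = \pi, \\
\operatorname{Var}(y) &= b''(\theta)\,a(\phi) = \pi(1-\pi) = V(\mu).
\end{align*}
```

In this case the canonical parameter $\theta$ is the logit of $\pi$, which is why the logit is the natural link for binomial responses.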
The log-likelihood function, a goodness-of-fit measure, is defined for each of the exponential family models. Generally, the log-likelihood function is of the form $L(y, \mu, \phi) = \sum_i \log f(y_i, \mu_i, \phi)$, with an individual contribution for each distribution, including the binomial.

The Newton-Raphson Method
The Newton-Raphson estimation scheme is defined in terms of the Hessian matrix $H$ and the gradient vector $g$ of $l$, the log-likelihood for a binary response variable. The weight matrix $W$ is given as $W = \operatorname{diag}\{m_i (d\mu_i/d\eta_i)^2 / [\mu_i(1 - \mu_i)]\}$, where $m_i$ is the row subtotal in the cross-tabulation table and $\mu_i$ is the response or fitted probability. An alternative estimation procedure is the Iterative Weighted Least Squares method, which is often adopted in order to avoid the computational tedium associated with the Hessian matrix.
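As a rough illustration of the Newton-Raphson scheme, the following Python sketch fits a logistic model to ungrouped binary data. It assumes the logit link, for which $d\mu_i/d\eta_i = \mu_i(1 - \mu_i)$, so that with $m_i = 1$ the weight reduces to $\mu_i(1 - \mu_i)$. The function name `fit_logistic_newton`, the stopping rule and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for a binary logistic model (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                     # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))    # fitted probabilities (inverse logit)
        g = X.T @ (y - mu)                 # gradient of the log-likelihood
        w = mu * (1.0 - mu)                # diagonal of the weight matrix W
        H = -(X.T * w) @ X                 # Hessian of the log-likelihood
        step = np.linalg.solve(H, g)
        beta = beta - step                 # Newton update: beta - H^{-1} g
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical toy data: an intercept plus one numeric predictor.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])
beta_hat = fit_logistic_newton(X, y)
```

At convergence the gradient is numerically zero, which is the usual check that the maximum likelihood estimate has been reached.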

Residuals in Generalized Linear Models
The coefficient of determination, $R^2$, is a function of the residuals. It was originally developed for the normal-theory model. Cameron and Windmeijer (1996) designed an $R^2$ for Poisson and related count data models after observing that it was rarely used for count data. Nagelkerke (1991) generalized the definition of $R^2$ in what is called the generalized $R^2$. The generalized $R^2$ is consistent with the classical $R^2$ and is also maximized by the maximum likelihood estimation of a model. The generalized coefficient of determination is given as follows:

$$R^2 = 1 - \left\{\frac{L(0)}{L(\theta)}\right\}^{2/n}$$

where $L(0)$ is the likelihood of the model with only an intercept, $L(\theta)$ is the likelihood of the estimated model and $n$ is the sample size. Residuals in a logistic model can be defined as the difference between $y_i$ and the predicted probability $\theta$ for $y_i$. We define the predicted probability in cross-classified data as the probability that an object or a person selected from a subgroup is a success (Stroke et al., 1997).
The monotonic link function relates the predicted probability to the set of linear predictors. For logistic regression, where the underlying distribution is binomial, the link function is the logit. The deviance, Pearson $\chi^2$, standardized, logit and studentized residuals are the residuals normally associated with generalized linear models. The analysis of residuals made in this paper shows that the logit residual is approximately twice the size of the standardized residual, and that the standardized residual is approximately equal to the deviance residual. This can be seen in the appendix.
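To make two of these residuals concrete, the sketch below computes Pearson and deviance residuals for binomial data using their usual textbook definitions. The standardized, logit and studentized residuals are not coded here because their exact definitions are not given in the text. The function name `binomial_residuals` and the fitted probabilities are hypothetical.

```python
import numpy as np

def binomial_residuals(y, m, p_hat):
    """Pearson and deviance residuals for binomial counts y out of m trials."""
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    mu = m * p_hat                                   # fitted counts
    pearson = (y - mu) / np.sqrt(m * p_hat * (1.0 - p_hat))
    # Deviance contributions, using the convention 0 * log(0) = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(m - y > 0, (m - y) * np.log((m - y) / (m - mu)), 0.0)
    d = np.maximum(2.0 * (t1 + t2), 0.0)             # guard against rounding
    deviance = np.sign(y - mu) * np.sqrt(d)
    return pearson, deviance

# Hypothetical binary responses (m = 1) with assumed fitted probabilities.
y = np.array([1.0, 0.0, 1.0, 0.0])
m = np.ones(4)
p_hat = np.array([0.7, 0.7, 0.2, 0.2])
pearson, deviance = binomial_residuals(y, m, p_hat)
```

For binary data every success yields a positive residual and every failure a negative one, which is consistent with the sign pattern reported for the residuals in this paper.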

Goodness of Fit Measures in Generalized Linear Models
The deviance and the generalized Pearson $\chi^2$ statistic are two measures of goodness of fit in generalized linear models. Both have exact $\chi^2$ distributions for normal-theory linear models if the models are true (McCullagh & Nelder, 1990). The deviance uses the log of the ratio of likelihoods. The Cox and Snell R-squared, another measure of goodness of fit in generalized linear models, is a pseudo R-squared and a modification of the deviance that confines its values to the interval between 0 and 1 (excluding 1), such that a smaller ratio of likelihoods implies a greater improvement.
The deviance is given separately for each of the distributions in the class of generalized linear models: the normal, Poisson, binomial, gamma, inverse-Gaussian, multinomial and negative binomial distributions. The Cox and Snell $R^2$ is defined in terms of $L(m_{\text{int}})$, the conditional probability of the dependent variable for the intercept model.
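The likelihood-based measures discussed here can be sketched in Python from the log-likelihoods of the intercept-only and fitted models. The forms used are the commonly cited ones: Cox and Snell's $1 - \{L_0/L_1\}^{2/n}$, and Nagelkerke's rescaling of it by its maximum attainable value $1 - L_0^{2/n}$. The function names and the numerical inputs are assumptions for illustration, not values from the paper.

```python
import math

def cox_snell_r2(loglik_null, loglik_model, n):
    """Cox and Snell pseudo R-squared from two log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Cox and Snell R-squared rescaled so that its upper bound is 1."""
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell_r2(loglik_null, loglik_model, n) / max_r2

# Hypothetical log-likelihoods for n = 100 binary observations.
n = 100
ll_null = n * math.log(0.5)   # intercept-only model with equal class proportions
ll_model = -60.0              # assumed log-likelihood of the fitted model
cs = cox_snell_r2(ll_null, ll_model, n)
nk = nagelkerke_r2(ll_null, ll_model, n)
```

Working on the log scale avoids the numerical underflow that raw likelihoods of large samples would cause.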
In this paper, a new goodness-of-fit test that makes use of fitted probabilities, a monotonic link function and the Nagelkerke range of possible values is proposed. The test is designed to serve both general linear and generalized linear models.
The measure is denoted $R^2_{G\&G}$. Designed for generalized linear models, $R^2_{G\&G}$ can be adapted for use as a goodness-of-fit measure in the general linear model by replacing the fitted probabilities and the link function values with the fitted $y$ values and the mean of $y$, respectively. The value of $R^2_{G\&G}$ ranges from 0 to 1, with higher values implying better fits.

Illustrative Example
The hypothetical data below are used to illustrate residual analysis in generalized linear models. The quantity of interest is the probability that a person from the $i$th sex level and the $j$th location status is infected with a certain virus.

The model
Let $y_{ij}$ be a binomial random response variable corresponding to the $i$th sex status and the $j$th location, which assumes the value 0 or 1. The probability $\theta_{ij}$ that a person of the $i$th sex and $j$th location is infected by the virus is modeled with $i = 1, 2$ and $j = 1, 2$.
Stat Computing (2011) gave three interpretations of $R^2$ as follows:
(i) $R^2$ as explained variability: The denominator of the ratio indicates the total variation in the dependent variable, while the numerator is the variability in the dependent variable that is not predicted by the model. One minus this ratio is the proportion of the total variability explained by the model, which agrees with $R^2$ in ordinary linear models (Koutsoyiannis, 1983). Thus a higher $R^2$ implies a better model.
(ii) $R^2$ as improvement from null model to fitted model: A smaller ratio implies a greater improvement.
(iii) $R^2$ as the square of the correlation: the correlation between the predicted values and the actual values. A higher $R^2$ implies a greater improvement of fit.
It can be seen that the proposed $R^2$ goodness-of-fit measure compares favourably with the Nagelkerke/Cragg & Uhler $R^2$ (0.180 against 0.187).
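Interpretations (i) and (iii) of $R^2$ can be checked numerically for the ordinary linear model, where they coincide when the model includes an intercept. The sketch below uses made-up data and hypothetical function names; it illustrates the classical $R^2$ only and is not an implementation of the proposed measure.

```python
import numpy as np

def r2_explained(y, y_hat):
    """One minus the ratio of unexplained to total variability."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # variability not predicted by the model
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variability
    return 1.0 - ss_res / ss_tot

def r2_correlation(y, y_hat):
    """Squared correlation between actual and predicted values."""
    return np.corrcoef(y, y_hat)[0, 1] ** 2

# Hypothetical data fitted by ordinary least squares with an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

r2_a = r2_explained(y, y_hat)
r2_b = r2_correlation(y, y_hat)
```

The two values agree up to rounding error for least squares fits with an intercept; for generalized linear models they generally differ, which is one reason several pseudo R-squared measures exist.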