Score Tests for Semiparametric Zero-inflated Poisson Models

Count data sets often contain many zeros. A linear predictor may not adequately describe the effect of a continuous covariate of interest in zero-inflated count data. To relax this restriction, Li (2011) proposed a semiparametric zero-inflated Poisson (ZIP) regression model that uses fixed-knot cubic basis splines (B-splines) to model the covariate effect, and used the likelihood ratio test to assess the validity of a linear relationship between the natural logarithm of the Poisson mean and the covariate. Here, a score test is derived to assess whether the extra proportion of zeros in the semiparametric ZIP regression model is equal to zero.


Introduction
Count data with large numbers of zeros are common in many disciplines, e.g., biomedical studies, criminology, environmental economics, and traffic accident research. To handle count data with excess zeros, the so-called zero-inflated Poisson (ZIP) distribution is employed (Singh, 1963; Johnson, Kotz, & Kemp, 1992). The ZIP distribution is a mixture of a Poisson distribution and a degenerate distribution at zero:
\[
P(Y = y; \lambda, \pi) = \pi I_{\{y=0\}} + (1-\pi)\frac{e^{-\lambda}\lambda^{y}}{y!}
= \left[\pi + (1-\pi)e^{-\lambda}\right] I_{\{y=0\}} + (1-\pi)\frac{e^{-\lambda}\lambda^{y}}{y!} I_{\{y>0\}}, \quad y = 0, 1, 2, \ldots \tag{1}
\]
Here I{•} is the indicator function for an event. The parameter π ∈ [0, 1] is a mixing weight that accommodates extra zeros, and λ is the mean of the Poisson distribution. The ZIP distribution reduces to a Poisson distribution when π = 0.
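As a minimal sketch of the mixture just described, the ZIP probability mass function can be written in a few lines of Python. The function name `zip_pmf` and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson with Poisson mean lam and mixing weight pi."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    point_mass = 1.0 if y == 0 else 0.0   # degenerate distribution at zero
    return pi * point_mass + (1.0 - pi) * poisson

# Illustrative parameter values (assumed, not from the paper).
lam, pi = 2.0, 0.3
p_zero = zip_pmf(0, lam, pi)                          # pi + (1 - pi) * exp(-lam)
total = sum(zip_pmf(y, lam, pi) for y in range(60))   # probabilities sum to about 1
poisson_case = zip_pmf(3, lam, 0.0)                   # pi = 0 gives the plain Poisson pmf
```

Setting `pi = 0.0` recovers the ordinary Poisson probabilities, which is the reduction that the null hypothesis of the score test refers to.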
One can think of the ZIP distribution in (1) as describing a population that consists of two parts: a proportion π of subjects who are not at risk of an event of interest, and the remaining subjects, who are at risk of the event and may experience it several times during a specific time period (Dietz & Böhning, 1997). The zeros from the first part are generally referred to as structural zeros and those from the Poisson distribution are called sampling zeros. This mixture distribution has become the foundation of much methodological development in zero-inflated count data analysis. Some authors have made inferences on the existence of zero inflation in count data (e.g., El-Shaarawi, 1985; van den Broek, 1995; Deng & Paul, 2000; Ridout, Hinde, & Demétrio, 2001; Jansakul & Hinde, 2002; Thas & Rayner, 2005); others have constructed various ZIP regression models. In the seminal work on ZIP regression, Lambert (1992) modeled the extra proportion of zeros π and the Poisson mean λ simultaneously with linear predictors using appropriate link functions, and applied the parametric ZIP regression model to manufacturing data. Many authors adopted this basic modeling structure, and a number of important extensions have been made (e.g., Welsh, Cunningham, Donnelly, & Lindenmayer, 1996; Shankar, Milton, & Mannering, 1997; Böhning, Dietz, Schlattmann, Mendonca, & Kirchner, 1999; Yau & Lee, 2001; Cheung, 2002; Hall & Zhang, 2004; Lu, Lin, & Shih, 2004; Min & Agresti, 2005; Hall & Wang, 2005; Hu, Li, & Lee, 2011). For example, Hu et al. (2011) applied ZIP models to assess the casualty risk of railroad-grade crossing crashes in Taiwan.
Each variant of these ZIP regression models has unique features, but modeling the effect of a covariate via a linear predictor is a common characteristic. Although a linear predictor may be entirely suitable in some applications, it may not be appropriate in others. Therefore, Li (2011) proposed a flexible procedure that models the covariate effect as a linear combination of fixed-knot cubic basis-spline (B-spline) functions (Schoenberg, 1946; Curry & Schoenberg, 1966). Semiparametric analysis of (longitudinal) zero-inflated count data has also been proposed by, e.g., Lam, Xue, and Cheung (2006), Chiogna and Gaetan (2007), and Feng and Zhu (2011), but these authors did not conduct tests to assess the validity of a postulated parametric function for a covariate effect. For example, Chiogna and Gaetan (2007) proposed semiparametric zero-inflated Poisson models that use penalized regression splines to study the relationship between the number of indigo buntings and five land use predictors in an animal abundance study.
Section 2 first briefly introduces the semiparametric ZIP regression model proposed by Li (2011) and then presents the score test in detail. The practical use of the score test is illustrated with a real-life data set in Section 3. Some concluding remarks are given in Section 4.

A Semiparametric ZIP Regression Model
Let Y be the event count random variable. Let W be a binary latent variable that indicates a subject's risk state: W = 0 if the subject is not at risk of an event, and W = 1 if the subject is at risk of the event. Therefore, W = 1 if Y > 0, whereas W is unobserved if Y = 0. Let z = (x, u), where $x = (x_1, \ldots, x_p)$, with $x_1 = 1$, is a vector of p covariates and u is a continuous covariate of interest.
In the parametric ZIP regression model proposed by Lambert (1992), both the mixing weight π = P(W = 0; z) and the Poisson mean λ = E(Y | W = 1; z) are modeled as functions of z. In this work, however, z is assumed to affect only the Poisson mean λ and not the probability parameter π. Hence, one can write the ZIP model as follows:
\[
P(Y = y; z) = \left[\pi + (1-\pi)e^{-\lambda(z)}\right] I_{\{y=0\}} + (1-\pi)\frac{e^{-\lambda(z)}\lambda(z)^{y}}{y!} I_{\{y>0\}}, \quad y = 0, 1, 2, \ldots, \tag{2}
\]
where λ(z) = E(Y | W = 1; z).
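The latent-variable view can be made concrete with a small simulation. The sketch below draws the risk state W first and then the count Y, and compares the observed share of zeros with π + (1 − π)e^{−λ}, the zero probability of the ZIP distribution. The sampler names, the seed, and the parameter values are assumptions chosen for illustration; a single fixed λ stands in for λ(z).

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_zip(lam, pi, rng):
    """Draw W first (W = 1 with probability 1 - pi), then Y given W."""
    at_risk = rng.random() >= pi
    return sample_poisson(lam, rng) if at_risk else 0   # W = 0 yields a structural zero

rng = random.Random(2024)             # fixed seed so the sketch is reproducible
lam, pi, n = 2.0, 0.3, 20000          # illustrative values, not from the paper
draws = [sample_zip(lam, pi, rng) for _ in range(n)]
zero_fraction = sum(1 for d in draws if d == 0) / n
expected_zero = pi + (1.0 - pi) * math.exp(-lam)        # structural plus sampling zeros
```

The simulated zeros mix both kinds: those generated with W = 0 and those produced by the Poisson draw itself, which is why W cannot be recovered from an observed Y = 0.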
It can be derived easily from (2) that the first two moments of the ZIP distribution are
\[
E(Y; z) = (1-\pi)\lambda(z), \qquad \mathrm{Var}(Y; z) = (1-\pi)\lambda(z)\left[1 + \pi\lambda(z)\right].
\]
Let $(y_i, z_i)$, i = 1, . . ., n, be the data. The log-likelihood is then written as follows:
\[
\ell = \sum_{i=1}^{n}\left\{ I_{\{y_i=0\}}\ln\left[\pi + (1-\pi)e^{-\lambda(z_i)}\right] + I_{\{y_i>0\}}\left[\ln(1-\pi) - \lambda(z_i) + y_i\ln\lambda(z_i) - \ln(y_i!)\right]\right\}.
\]
Cubic splines provide the best compromise between smoothness and computational cost, and the B-spline basis produces better-conditioned systems of equations than the truncated power basis and is more likely to give a numerically stable representation of a spline function. Li (2011) therefore used the basis of cubic B-splines with q preselected knots to approximate the unspecified smooth function g, where the rth knot corresponds to the r/(q + 1)th sample quantile of the distinct values of the $u_i$.
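The moments and the log-likelihood can be checked numerically. The following sketch assumes the standard ZIP results implied by (2), namely mean (1 − π)λ, variance (1 − π)λ(1 + πλ), and a log-likelihood that treats zero and positive counts separately. The function names and the toy data are hypothetical.

```python
import math

def zip_loglik(y, lam, pi):
    """ZIP log-likelihood for counts y with observation-specific Poisson means lam."""
    total = 0.0
    for yi, li in zip(y, lam):
        if yi == 0:
            # A zero is either structural (prob. pi) or a sampling zero.
            total += math.log(pi + (1.0 - pi) * math.exp(-li))
        else:
            # A positive count can only come from the Poisson part.
            total += math.log(1.0 - pi) - li + yi * math.log(li) - math.lgamma(yi + 1)
    return total

def zip_mean_var(lam, pi):
    """First two moments of the ZIP distribution."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var

# Toy data and means (assumed for illustration only).
y = [0, 0, 3, 1, 0, 2]
lam = [1.2, 0.8, 2.5, 1.0, 1.7, 2.0]
ll = zip_loglik(y, lam, 0.25)
mean, var = zip_mean_var(2.0, 0.3)
```

Because the variance equals the mean multiplied by 1 + πλ, it exceeds the mean whenever π > 0, which is the overdispersion the ZIP model can absorb.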
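Let $B_1(u), \ldots, B_{q+4}(u)$ be the cubic B-spline basis for the space of cubic splines with q preselected knots. For details of computing B-splines and their mathematical properties, see de Boor (2001). The cubic B-spline space includes the constant function, and a constant is already present in the parametric component of the model (3), so one of the q + 4 B-spline basis functions must be dropped for the resulting parametrization to be of full rank. Any one of them can be dropped, but for convenience Li (2011) models g as a linear combination of the first K = q + 3 fixed-knot cubic B-spline basis functions as follows:
\[
g(u) = \sum_{k=1}^{K}\gamma_k B_k(u),
\]
with the spline coefficients denoted here by $\gamma = (\gamma_1, \ldots, \gamma_K)^T$. Writing $A_{z_i} = (x_i, B_1(u_i), \ldots, B_K(u_i))$ and $\theta = (\beta^T, \gamma^T)^T$, the log-likelihood becomes
\[
\ell(\theta, \pi) = \sum_{i=1}^{n}\left\{ I_{\{y_i=0\}}\ln\left[\pi + (1-\pi)e^{-\lambda(z_i)}\right] + I_{\{y_i>0\}}\left[\ln(1-\pi) - \lambda(z_i) + y_i\ln\lambda(z_i) - \ln(y_i!)\right]\right\}, \tag{7}
\]
where $\lambda(z_i) = \exp(A_{z_i}\theta)$.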
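The basis construction can be sketched with the Cox-de Boor recursion in plain Python. This is an illustration under stated assumptions rather than Li's (2011) implementation: the quantile rule (`statistics.quantiles` with `method="inclusive"`), the choice to drop the last basis function, and all function names are assumptions made here.

```python
import statistics

def quantile_knots(u, q):
    """q interior knots at quantiles of the distinct u values (quantile rule assumed)."""
    distinct = sorted(set(u))
    return statistics.quantiles(distinct, n=q + 1, method="inclusive")

def bspline_basis(u, interior_knots, lower, upper, degree=3):
    """Values at u of all q + degree + 1 B-spline basis functions (Cox-de Boor recursion)."""
    t = [lower] * (degree + 1) + sorted(interior_knots) + [upper] * (degree + 1)
    n_basis = len(t) - degree - 1
    # Degree-0 functions are indicators of the half-open knot spans.
    b = [1.0 if t[j] <= u < t[j + 1] else 0.0 for j in range(len(t) - 1)]
    if u == upper:                      # close the last non-empty span on the right
        b = [0.0] * len(b)
        b[n_basis - 1] = 1.0
    for d in range(1, degree + 1):
        nxt = []
        for j in range(len(t) - d - 1):
            left = right = 0.0          # terms with a zero-width span are taken as 0
            if t[j + d] > t[j]:
                left = (u - t[j]) / (t[j + d] - t[j]) * b[j]
            if t[j + d + 1] > t[j + 1]:
                right = (t[j + d + 1] - u) / (t[j + d + 1] - t[j + 1]) * b[j + 1]
            nxt.append(left + right)
        b = nxt
    return b

def design_row(x, u, knots, lower, upper):
    """Row (x, B_1(u), ..., B_K(u)) with K = q + 3; one basis function is dropped."""
    return list(x) + bspline_basis(u, knots, lower, upper)[:-1]

# Toy covariate values (assumed for illustration only).
u_values = [0.0, 0.1, 0.1, 0.25, 0.4, 0.4, 0.55, 0.7, 0.85, 1.0]
q = 3
knots = quantile_knots(u_values, q)
basis = bspline_basis(0.33, knots, min(u_values), max(u_values))
row = design_row([1.0, 0.5], 0.33, knots, min(u_values), max(u_values))
```

The q + 4 basis values sum to one at every point of the covariate range, so together with the intercept the full basis would be collinear; dropping one function removes that redundancy.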

A Score Test
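Possible tests for the null hypothesis $H_0$: π = 0 are the likelihood ratio test, the Wald test, and the score test.
Because the likelihood ratio and Wald tests require estimating the model parameters under the alternative hypothesis π > 0, we consider the score test for $H_0$: π = 0. Its advantage is that only a semiparametric Poisson regression model, which is the reduced model of the semiparametric ZIP regression model under $H_0$: π = 0, has to be fitted. Let τ = π/(1 − π). Testing $H_0$: π = 0 is then equivalent to testing $H_0$: τ = 0. With some algebra, and writing $\lambda_i = \lambda(z_i)$, one can write the log-likelihood ℓ(θ, π) in (7) as
\[
\ell(\theta, \tau) = \sum_{i=1}^{n}\left\{ I_{\{y_i=0\}}\ln\left(\tau + e^{-\lambda_i}\right) + I_{\{y_i>0\}}\left[y_i\ln\lambda_i - \lambda_i - \ln(y_i!)\right] - \ln(1+\tau)\right\}. \tag{8}
\]
Based on the log-likelihood ℓ(θ, τ) in (8), the score vector is $U^T(\theta, \tau) = \left(U_\theta^T(\theta, \tau), U_\tau(\theta, \tau)\right)$, where
\[
U_\theta(\theta, \tau) = \frac{\partial \ell}{\partial \theta} = \sum_{i=1}^{n}\left\{ -I_{\{y_i=0\}}\frac{\lambda_i e^{-\lambda_i}}{\tau + e^{-\lambda_i}} + I_{\{y_i>0\}}\left(y_i - \lambda_i\right)\right\} A_{z_i}^T \tag{9}
\]
and
\[
U_\tau(\theta, \tau) = \frac{\partial \ell}{\partial \tau} = \sum_{i=1}^{n}\left\{ \frac{I_{\{y_i=0\}}}{\tau + e^{-\lambda_i}} - \frac{1}{1+\tau}\right\}. \tag{10}
\]
Let $\hat\theta$ be the maximum likelihood estimate of θ under $H_0$: τ = 0 and $\hat\lambda(z_i) = \exp(A_{z_i}\hat\theta)$. Then, (9) becomes
\[
U_\theta(\hat\theta, 0) = \sum_{i=1}^{n}\left[y_i - \hat\lambda(z_i)\right] A_{z_i}^T = 0 \tag{11}
\]
and (10) is
\[
U_\tau(\hat\theta, 0) = \sum_{i=1}^{n}\left[I_{\{y_i=0\}} e^{\hat\lambda(z_i)} - 1\right].
\]
Thus, $U^T(\hat\theta, 0) = \left(0^T, U_\tau(\hat\theta, 0)\right)$. The second-order partial derivatives of ℓ(θ, τ) with respect to θ and τ are
\[
\frac{\partial^2 \ell}{\partial\theta\,\partial\theta^T} = -\sum_{i=1}^{n}\left\{ I_{\{y_i=0\}}\frac{\lambda_i e^{-\lambda_i}\left[(1-\lambda_i)\tau + e^{-\lambda_i}\right]}{\left(\tau + e^{-\lambda_i}\right)^2} + I_{\{y_i>0\}}\lambda_i\right\} A_{z_i}^T A_{z_i},
\]
\[
\frac{\partial^2 \ell}{\partial\theta\,\partial\tau} = \sum_{i=1}^{n} I_{\{y_i=0\}}\frac{\lambda_i e^{-\lambda_i}}{\left(\tau + e^{-\lambda_i}\right)^2} A_{z_i}^T, \qquad
\frac{\partial^2 \ell}{\partial\tau^2} = \sum_{i=1}^{n}\left\{ -\frac{I_{\{y_i=0\}}}{\left(\tau + e^{-\lambda_i}\right)^2} + \frac{1}{(1+\tau)^2}\right\}.
\]
Let A be the n × (p + K) matrix with rows $A_{z_i}$ and $\lambda = (\lambda_1, \ldots, \lambda_n)^T$. Because $E(I_{\{y_i=0\}}) = e^{-\lambda_i}$ under $H_0$: τ = 0, one can show that the expected Fisher information matrix I(θ, 0) has the following entries:
\[
I_{\theta\theta} = \sum_{i=1}^{n}\lambda_i A_{z_i}^T A_{z_i} = A^T\mathrm{diag}(\lambda)A, \qquad
I_{\theta\tau} = -\sum_{i=1}^{n}\lambda_i A_{z_i}^T = -A^T\lambda, \qquad
I_{\tau\tau} = \sum_{i=1}^{n}\left(e^{\lambda_i} - 1\right).
\]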
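The reparametrization by τ can be checked numerically. The sketch below assumes the ZIP log-likelihood implied by (2) and the substitution π = τ/(1 + τ); it confirms that both parametrizations give the same value and that a finite-difference derivative in τ at τ = 0 agrees with the sum of I{y_i = 0}e^{λ_i} − 1. All function names and data are hypothetical.

```python
import math

def loglik_pi(y, lam, pi):
    """ZIP log-likelihood in the original (pi) parametrization."""
    total = 0.0
    for yi, li in zip(y, lam):
        if yi == 0:
            total += math.log(pi + (1.0 - pi) * math.exp(-li))
        else:
            total += math.log(1.0 - pi) - li + yi * math.log(li) - math.lgamma(yi + 1)
    return total

def loglik_tau(y, lam, tau):
    """The same log-likelihood after substituting pi = tau / (1 + tau)."""
    total = 0.0
    for yi, li in zip(y, lam):
        if yi == 0:
            total += math.log(tau + math.exp(-li))
        else:
            total += -li + yi * math.log(li) - math.lgamma(yi + 1)
        total -= math.log(1.0 + tau)    # every observation contributes -ln(1 + tau)
    return total

def score_tau_at_null(y, lam):
    """Derivative of loglik_tau with respect to tau, evaluated at tau = 0."""
    return sum((math.exp(li) if yi == 0 else 0.0) - 1.0 for yi, li in zip(y, lam))

# Toy data and means (assumed for illustration only).
y = [0, 0, 3, 1, 0, 2]
lam = [1.2, 0.8, 2.5, 1.0, 1.7, 2.0]
tau = 0.4
gap = abs(loglik_pi(y, lam, tau / (1.0 + tau)) - loglik_tau(y, lam, tau))
h = 1e-6
numeric_score = (loglik_tau(y, lam, h) - loglik_tau(y, lam, -h)) / (2.0 * h)
```

The central difference evaluates the τ form slightly below zero, which is harmless numerically because τ + e^{−λ_i} stays positive for small |τ|.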
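By using the formula for the inverse of a partitioned matrix and the fact that $\hat\lambda = \mathrm{diag}(\hat\lambda)Ae_{p+K}$, where $e_{p+K}$ is a (p + K) × 1 vector that has 1 as its first element and zeros elsewhere, the (τ, τ) element of $I^{-1}(\hat\theta, 0)$ is
\[
\left[I^{-1}(\hat\theta, 0)\right]_{\tau\tau} = \left\{\sum_{i=1}^{n}\left[e^{\hat\lambda(z_i)} - 1\right] - \hat\lambda^T A\left[A^T\mathrm{diag}(\hat\lambda)A\right]^{-1}A^T\hat\lambda\right\}^{-1} = \left\{\sum_{i=1}^{n}\left[e^{\hat\lambda(z_i)} - 1\right] - n\bar y\right\}^{-1},
\]
where from (11) we have used $\sum_{i=1}^{n}[y_i - \hat\lambda(z_i)] = 0$, which is equivalent to $1^T\hat\lambda = \sum_{i=1}^{n} y_i = n\bar y$ for $\bar y = \sum_{i=1}^{n} y_i/n$. Therefore, the score statistic to test $H_0$: τ = 0 is
\[
S(\hat\theta, 0) = U^T(\hat\theta, 0)\,I^{-1}(\hat\theta, 0)\,U(\hat\theta, 0) = \frac{\left\{\sum_{i=1}^{n}\left[I_{\{y_i=0\}} e^{\hat\lambda(z_i)} - 1\right]\right\}^2}{\sum_{i=1}^{n}\left[e^{\hat\lambda(z_i)} - 1\right] - n\bar y},
\]
which has an asymptotic chi-squared distribution with 1 degree of freedom under $H_0$: τ = 0.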
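Once the fitted means under the null are available, the statistic is a one-line computation. The sketch below assumes the closed form S = {Σ[I{y_i = 0}e^{λ̂_i} − 1]}² / {Σ(e^{λ̂_i} − 1) − nȳ}, which relies on the model containing an intercept and on λ̂ being the Poisson maximum likelihood fit. In place of a semiparametric Poisson fit, the example uses an intercept-only model (λ̂_i = ȳ) as a stand-in; the function name and the counts are invented for illustration.

```python
import math

def zip_score_statistic(y, lam_hat):
    """Score statistic for H0: no zero inflation, plus its chi-squared(1) p-value.

    lam_hat holds the fitted Poisson means under H0; the "- sum(y)" term assumes
    an intercept in the model so that the fitted means sum to sum(y).
    """
    u_tau = sum((math.exp(li) if yi == 0 else 0.0) - 1.0 for yi, li in zip(y, lam_hat))
    info = sum(math.exp(li) - 1.0 for li in lam_hat) - sum(y)
    stat = u_tau ** 2 / info
    p_value = math.erfc(math.sqrt(stat / 2.0))   # upper tail of chi-squared, 1 df
    return stat, p_value

# Hypothetical counts with many zeros (not the attendance data).
y = [0] * 40 + [1] * 10 + [2] * 15 + [3] * 20 + [4] * 10 + [5] * 5
y_bar = sum(y) / len(y)
lam_hat = [y_bar] * len(y)            # intercept-only Poisson MLE as a stand-in fit
stat, p_value = zip_score_statistic(y, lam_hat)
```

With constant fitted means the expression reduces to the score statistic of van den Broek (1995) for the case without covariates, which gives a convenient consistency check.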

Example
To illustrate the practical use of the score test, we use a data set from a study of the attendance behavior of 316 high school juniors at two schools, which is available at http://www.ats.ucla.edu/stat/mplus/dae/poissonreg.dat. The response variable is the number of days of absence. The predictors include the gender of the student and standardized test scores in mathematics and language arts.

Recall that the ZIP distribution in (1) is
\[
P(Y = y; \lambda, \pi) = \pi I_{\{y=0\}} + (1-\pi)\frac{e^{-\lambda}\lambda^{y}}{y!}
= \left[\pi + (1-\pi)e^{-\lambda}\right] I_{\{y=0\}} + (1-\pi)\frac{e^{-\lambda}\lambda^{y}}{y!} I_{\{y>0\}}, \quad y = 0, 1, 2, \ldots
\]
The semiparametric ZIP regression model proposed by Li (2011) not only enhances fitting flexibility, but also can be used to assess the adequacy of a postulated linear relationship between the natural logarithm of the Poisson mean and the covariate. However, no test has been proposed for whether the extra proportion of zeros π in the semiparametric ZIP regression model is equal to 0. Motivated by this, we conduct a score test for π = 0. The score test has an advantage over the likelihood ratio and Wald tests because it requires the parameter estimates only under the null hypothesis π = 0, i.e., under the semiparametric Poisson regression model. Note that van den Broek (1995) proposed a score test for the extra proportion of zeros that compares a parametric ZIP regression model with a constant proportion of excess zeros to a parametric Poisson regression model.
It can be seen from the variance formula that the ZIP model can account for data variation beyond that accommodated by the Poisson model. To extend the parametric ZIP regression model, Li (2011) assumed that the functional form of the effect of u is smooth but unknown, while the effects of x remain linear. The Poisson mean λ(z) then can be written as
\[
\ln[\lambda(z)] = x\beta + g(u) \tag{3}
\]
through the canonical log link function. Here $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of p unknown regression coefficients for $x = (x_1, \ldots, x_p)$ with $x_1 = 1$, and g is an unspecified smooth function for the effect of u. The model in (3) is a semiparametric Poisson regression model, which can be considered a generalized partially linear model. Because the model in (3) contains both parametric and nonparametric components, Li (2011) referred to the model in (2) with the semiparametric Poisson regression model in (3) as a semiparametric ZIP regression model.