Testing Inference in Accelerated Failure Time Models

We address the issue of performing hypothesis testing in accelerated failure time models for non-censored and censored samples. The performances of the likelihood ratio test and a recently proposed test, the gradient test, are compared through simulation. The gradient test features the same asymptotic properties as the classical large sample tests, namely, the likelihood ratio, Wald and score tests. Additionally, it is as simple to compute as the likelihood ratio test. Unlike the score and Wald tests, the gradient test does not require the computation of the information matrix, either observed or expected. Our study suggests that the gradient test is more reliable than the other classical tests when the sample is of small or moderate size.


Introduction
An important class of regression models used to assess the relationship between the response variable and the covariates in survival analysis is parametric accelerated failure time (AFT) models. These models have a simple intuitive interpretation in real problems (Wei, 1992). Unlike proportional hazards models, AFT models describe the logarithm of the event times as a linear regression on the covariates, which act multiplicatively on the event times, accelerating or decelerating the time scale. Cox semi-parametric models (Cox, 1972) do not assume any particular distribution for the failure times and have been the most popular models in survival analysis. Nevertheless, some authors present reasonable arguments in favor of the parametric AFT models based on their asymptotic properties (Efron, 1977; Oakes, 1977), simulation results (Orbe, Ferreira, & Núñez-Antón, 2002) and applications to real data (Nardi & Schemper, 2003; Grover, Das, Swain, & Deka, 2013). Given the importance of such models, several authors have extended the AFT models, for example, adding a random effect to deal with correlated survival data (Lambert, Collett, Kimber, & Johnson, 2004; Pan, 2001), allowing measurement error in the covariates (Gimenez & Bolfarine, 1997; Valença & Bolfarine, 2006) or including a cure fraction to treat the presence of immune individuals in the sample (Yamaguchi, 1992; Peng, 1998; Ortega, 2009).
Typical distributions that have been used by various authors in connection with the parametric AFT models are the exponential, Weibull, log-normal, gamma and log-logistic distributions, in addition to more flexible distributions such as the extended generalized gamma and the generalized F distributions. Descriptions of the most commonly used distributions, as well as related inferential procedures, can be found in Lawless (2003), Kalbfleisch and Prentice (2002) and Cox and Oakes (1984). Moreover, several extensions of the usual survival distributions, particularly the Weibull distribution, have been proposed to provide a better fit to complex lifetime data; see, e.g., Marshall and Olkin (1997), Lai, Xie, and Murthy (2003), and Carrasco, Ortega, and Cordeiro (2008). A comprehensive discussion of general methods for constructing new distributions for lifetime data is presented in Lai (2013).
The most commonly used statistical test in AFT models is the likelihood ratio test, which performs well when the sample size is large. The other classical large sample tests, namely the Wald and score tests, can also be used. However, a disadvantage of these tests is that they require the computation of the information matrix. In AFT models, the expected information matrix is usually difficult or even impossible to obtain, particularly under censoring. An alternative is to replace the expected information matrix by its observed counterpart. Lemonte and Ferrari (2012) obtained the second-order local power of the gradient test and showed that none of the four competing tests (likelihood ratio, Wald, score and gradient) is uniformly more powerful than the others. In addition, Lemonte and Ferrari (2011a) compared the size and power properties of the four rival tests in a Birnbaum-Saunders regression model for complete samples. Their simulation results suggest that the score and the gradient tests outperform the likelihood ratio and the Wald tests in small and moderate-sized samples, and are consistent with the study of Ferrari and Pinheiro (2014). The only study concerning the performance of the four tests in censored samples is that of Lemonte and Ferrari (2011b). The authors considered independent and identically distributed observations of the Birnbaum-Saunders distribution, and of two generalized versions of this distribution, under type-II censoring. Because the Wald and the score tests involve the information matrix, which could not be obtained under censoring, they implemented these tests with the observed information matrix. They noticed that the Wald and score tests were markedly oversized and that the inverse of the observed information matrix frequently produced negative standard errors for censored samples in their simulations. Their overall conclusion is in favor of the gradient test.
Censored samples are often encountered in survival and reliability studies. Our goal is to evaluate the performance of the gradient test in comparison with the likelihood ratio test in accelerated failure time models under random censoring. This paper is organized as follows. In Section 2 we describe the accelerated failure time models and present the likelihood ratio and gradient tests in these models. In Section 3 we present the simulation results for both tests in different scenarios. In Section 4 we illustrate and compare the tests in two real data applications. Finally, Section 5 closes the paper with some concluding remarks.

Accelerated Failure Time Models
Let $T_i$ be the event time for individual $i$, and let $x_i = (1, x_{i1}, \ldots, x_{ip})^\top$ be a fixed covariate vector that allows a possibly non-null intercept. The AFT model can be represented by

$$\log T_i = x_i^\top \beta + \sigma \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where the $\varepsilon_i$ are independent and identically distributed random errors whose distribution has support on the whole real line and does not depend on $x_i$. The vector $\beta = (\beta_0, \ldots, \beta_p)^\top$ and $\sigma$ are unknown parameters. Hence, (1) describes a linear regression model for $\log T_i$.
Survival times may be subject to right censoring. Here, the censoring times are represented by the independent random variables $C_i$, for $i = 1, \ldots, n$, which are assumed to be independent of $T_1, \ldots, T_n$. The censoring mechanism is assumed to be non-informative, that is, the distribution of the $C_i$'s does not depend on unknown parameters. Let $\delta_i = 1$ if the observation for individual $i$ is a failure time, and $\delta_i = 0$ if it is a censoring time. The observations can be represented by the pairs of random variables $(Y_i, \delta_i)$, where $Y_i = \min(\log T_i, \log C_i)$, and the covariate vectors $x_i$, for $i = 1, \ldots, n$.
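The likelihood function for the unknown parameters is given by

$$L(\theta) = \prod_{i=1}^{n} \left[\frac{1}{\sigma} f(e_i)\right]^{\delta_i} \left[S(e_i)\right]^{1-\delta_i},$$

where $y_i$ is the observed value of $Y_i$, $f(\cdot)$ and $S(\cdot)$ denote the density and survival functions of $\varepsilon_i$, respectively, and $\theta = (\beta^\top, \sigma)^\top$ is the vector of unknown parameters; see, e.g., Kalbfleisch and Prentice (2002). Therefore, the log-likelihood function is

$$\ell(\theta) = -\log\sigma \sum_{i=1}^{n} \delta_i + \sum_{i=1}^{n} \delta_i \log f(e_i) + \sum_{i=1}^{n} (1-\delta_i) \log S(e_i),$$

where $e_i = (y_i - x_i^\top \beta)/\sigma$. The components of the score vector are given by

$$U_{\beta_j}(\theta) = \frac{\partial \ell(\theta)}{\partial \beta_j} = \frac{1}{\sigma} \sum_{i=1}^{n} a_i x_{ij}, \quad j = 0, 1, \ldots, p \quad (x_{i0} = 1),$$

and

$$U_{\sigma}(\theta) = \frac{\partial \ell(\theta)}{\partial \sigma} = \frac{1}{\sigma} \sum_{i=1}^{n} \left(e_i a_i - \delta_i\right),$$

where

$$a_i = -\delta_i \frac{d \log f(e_i)}{d e_i} - (1-\delta_i) \frac{d \log S(e_i)}{d e_i}. \tag{2}$$

In matrix form, the $\beta$-components of the score vector can be written as $U_\beta(\theta) = \sigma^{-1} X^\top a$, where $X = (x_1, x_2, \ldots, x_n)^\top$ is the $n \times (p+1)$ matrix of the covariates, and $\delta = (\delta_1, \delta_2, \ldots, \delta_n)^\top$, $e = (\exp\{e_1\}, \exp\{e_2\}, \ldots, \exp\{e_n\})^\top$ and $a = (a_1, a_2, \ldots, a_n)^\top$ are $n$-dimensional column vectors. Table 1 gives the expression for $a_i$ in (2) for AFT models frequently used in survival data applications. The expression for $a_i$ for the exponential distribution equals the corresponding $a_i$ for the Weibull distribution with $\sigma = 1$. Maximum likelihood estimates (MLEs) for $\beta$ and $\sigma$ are obtained by solving the system of equations $U(\theta) = 0$, which requires a numerical nonlinear optimization algorithm (e.g., Newton-Raphson, Fisher's scoring or BFGS). For further details on nonlinear optimization, see Press, Teukolsky, Vetterling, and Flannery (1992). The survreg function in the survival package in R (R Development Core Team, 2011) uses the Newton-Raphson algorithm; see Therneau and Lumley (2008).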
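As a rough illustration of the log-likelihood and score just described, the sketch below codes them for the Weibull case only, using standard extreme value errors, for which $\log f(e) = e - \exp(e)$, $\log S(e) = -\exp(e)$ and hence $a_i = \exp(e_i) - \delta_i$. The function names and the toy data are assumptions made for this example and are not taken from the paper. The analytic score is checked against central finite differences of the log-likelihood.

```python
import math

def weibull_aft_loglik(beta, sigma, y, delta, X):
    """Log-likelihood of a Weibull AFT model for log-times y and censoring indicators delta."""
    total = 0.0
    for yi, di, xi in zip(y, delta, X):
        e = (yi - sum(b * x for b, x in zip(beta, xi))) / sigma  # standardized residual e_i
        # Standard extreme value errors: log f(e) = e - exp(e), log S(e) = -exp(e)
        total += di * (e - math.log(sigma)) - math.exp(e)
    return total

def weibull_aft_score(beta, sigma, y, delta, X):
    """Analytic score (U_beta, U_sigma), with a_i = exp(e_i) - delta_i."""
    u_beta = [0.0] * len(beta)
    u_sigma = 0.0
    for yi, di, xi in zip(y, delta, X):
        e = (yi - sum(b * x for b, x in zip(beta, xi))) / sigma
        a = math.exp(e) - di
        for j, xij in enumerate(xi):
            u_beta[j] += a * xij / sigma
        u_sigma += (e * a - di) / sigma
    return u_beta, u_sigma

def numerical_score(beta, sigma, y, delta, X, h=1e-6):
    """Central finite differences of the log-likelihood, used only as a check."""
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        dn = list(beta)
        up[j] += h
        dn[j] -= h
        grad.append((weibull_aft_loglik(up, sigma, y, delta, X)
                     - weibull_aft_loglik(dn, sigma, y, delta, X)) / (2 * h))
    d_sigma = (weibull_aft_loglik(beta, sigma + h, y, delta, X)
               - weibull_aft_loglik(beta, sigma - h, y, delta, X)) / (2 * h)
    return grad, d_sigma

# Toy (made-up) data: four individuals, intercept plus one covariate.
y = [0.2, -0.5, 1.1, 0.7]          # observed log-times
delta = [1, 0, 1, 1]               # 1 = failure, 0 = censored
X = [[1.0, 0.3], [1.0, 1.2], [1.0, -0.4], [1.0, 0.8]]
beta, sigma = [0.1, -0.2], 0.8

print("analytic score:", weibull_aft_score(beta, sigma, y, delta, X))
print("numerical score:", numerical_score(beta, sigma, y, delta, X))
```

The two printed vectors should agree to several decimal places. A full fit would pass such a score to a Newton-type optimizer, which is what survreg does internally.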
Table 1. Expression for $a_i$ in (2) for some common models
[Table 1 columns: Model; Errors distribution; $a_i$. Note: $\Phi(\cdot)$ is the standard normal cumulative distribution function.]
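For reference, the following derivation sketches what $a_i$ would be for the three error distributions used later in the simulations. It assumes that $a_i = -\delta_i\, d\log f(e_i)/de_i - (1-\delta_i)\, d\log S(e_i)/de_i$ with $e_i = (y_i - x_i^\top\beta)/\sigma$, which is our reading of (2). The symbols $\phi$ and $\Phi$ (standard normal density and distribution function) follow the usual convention. These expressions are derived here and are not copied from the original table.

```latex
% Weibull model: standard extreme value errors
%   f(e) = \exp\{e - \exp(e)\}, \quad S(e) = \exp\{-\exp(e)\}
a_i = \exp(e_i) - \delta_i

% Log-normal model: standard normal errors
%   f(e) = \phi(e), \quad S(e) = 1 - \Phi(e)
a_i = \delta_i e_i + (1 - \delta_i)\,\frac{\phi(e_i)}{1 - \Phi(e_i)}

% Log-logistic model: standard logistic errors
%   f(e) = \frac{\exp(e)}{\{1 + \exp(e)\}^2}, \quad S(e) = \frac{1}{1 + \exp(e)}
a_i = (1 + \delta_i)\,\frac{\exp(e_i)}{1 + \exp(e_i)} - \delta_i
```

In each case an uncensored observation ($\delta_i = 1$) contributes $-d\log f(e_i)/de_i$ and a censored one ($\delta_i = 0$) contributes the hazard of the error distribution evaluated at $e_i$.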
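We now turn to hypothesis testing. Let $\theta = (\theta_1^\top, \theta_2^\top)^\top$, where $\theta_1$ and $\theta_2$ are column vectors of dimensions $m$ and $k - m$, respectively. Consider the null hypothesis $H_0: \theta_2 = \theta_{02}$ to be tested against $H_1: \theta_2 \neq \theta_{02}$, where $\theta_{02}$ is a fixed $(k - m)$-dimensional column vector. The partition of $\theta$ induces the corresponding partition of the score vector, $U(\theta) = (U_1(\theta)^\top, U_2(\theta)^\top)^\top$. Let $\hat\theta = (\hat\theta_1^\top, \hat\theta_2^\top)^\top$ and $\tilde\theta = (\tilde\theta_1^\top, \theta_{02}^\top)^\top$ be the unrestricted MLE and the restricted MLE of $\theta$ under $H_0$, respectively. The likelihood ratio statistic ($\xi_{LR}$) and the gradient statistic ($\xi_G$) for testing $H_0$ against $H_1$ are given by

$$\xi_{LR} = 2\{\ell(\hat\theta) - \ell(\tilde\theta)\} \quad \text{and} \quad \xi_G = U_2(\tilde\theta)^\top (\hat\theta_2 - \theta_{02}),$$

respectively, where $\ell(\theta)$ denotes the log-likelihood function. Under typical regularity conditions (e.g., Lawless, 2003, Ch. 6; Kalbfleisch & Prentice, 2002, Ch. 3), $\xi_{LR}$ and $\xi_G$ have an asymptotic $\chi^2$ distribution with $k - m$ degrees of freedom under $H_0$.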
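To make the two statistics concrete, here is a minimal sketch for the simplest special case we could think of: an exponential AFT model with intercept only, $\log T_i = \beta_0 + \varepsilon_i$ with $\sigma = 1$ known. This case is chosen only because the MLE has a closed form, $\hat\beta_0 = \log(\sum_i t_i / \sum_i \delta_i)$, where $t_i$ is the observed (possibly censored) time. It is not one of the regression settings studied in the paper, and the function names and data are hypothetical.

```python
import math

def exp_aft_tests(times, delta, beta0_null):
    """LR and gradient statistics for H0: beta_0 = beta0_null (sigma = 1, no covariates).

    times: observed times min(T_i, C_i), all positive; delta: 1 = failure, 0 = censored.
    """
    d = sum(delta)              # number of observed failures (must be positive)
    total = sum(times)          # total observed time

    def loglik(b):
        # sum_i [delta_i * (log t_i - b) - exp(log t_i - b)]
        return sum(di * (math.log(t) - b) for t, di in zip(times, delta)) - math.exp(-b) * total

    def score(b):
        # sum_i a_i with a_i = exp(log t_i - b) - delta_i
        return math.exp(-b) * total - d

    beta_hat = math.log(total / d)                       # unrestricted MLE
    xi_lr = 2.0 * (loglik(beta_hat) - loglik(beta0_null))
    xi_g = score(beta0_null) * (beta_hat - beta0_null)   # no information matrix needed
    return beta_hat, xi_lr, xi_g

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Made-up data: six observed times, two of them censored.
times = [1.2, 0.4, 2.5, 0.9, 3.1, 0.7]
delta = [1, 1, 0, 1, 1, 0]
beta_hat, xi_lr, xi_g = exp_aft_tests(times, delta, beta0_null=0.0)
print("beta_hat:", beta_hat)
print("LR statistic:", xi_lr, "p-value:", chi2_sf_1df(xi_lr))
print("gradient statistic:", xi_g, "p-value:", chi2_sf_1df(xi_g))
```

Both statistics use only the log-likelihood, the score and the two estimates, which is the computational point made in the text. With nuisance parameters, the restricted estimate would have to be obtained numerically.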

Simulation
We now present a simulation study for the Weibull, log-normal and log-logistic AFT models. Our goal is to evaluate and compare the performance of the likelihood ratio and gradient tests. The regression structure is as follows:

$$\log T_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \sigma \varepsilon_i, \quad i = 1, \ldots, n.$$

The values for $x_{i1}$, $x_{i2}$ and $x_{i3}$ were obtained as random draws from a normal distribution with mean 1 and variance 0.25, an exponential distribution with mean 1/4 and a Bernoulli distribution with parameter 0.5, respectively. The error terms $\varepsilon_i$ were generated as independent random variables with a standard extreme value distribution for the exponential and Weibull cases, a standard normal distribution for the log-normal case, and a standard logistic distribution for the log-logistic case. The censoring times $C_i$ were generated as independent random variables with a uniform distribution on the interval $[0, c]$, where $c$ was suitably chosen to produce different proportions of censored observations: 0% (no censoring), 30% and 50%. Four different sample sizes were considered: 30, 50, 100 and 200. Additionally, we consider $\beta = (0.15, 0.15, 0, 0)^\top$ and set the following values for $\sigma$: 0.5, 1 and 1.5. For the Weibull case, these three values of $\sigma$ imply increasing, constant (exponential case) and decreasing failure rates, respectively.
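The data-generating step of this design can be sketched as follows. This is our own reading of the description above, not the authors' code (the paper's simulations were run in R). The function names are hypothetical, and the value of `c` used in the example is arbitrary rather than tuned to a target censoring percentage.

```python
import math
import random

def open_uniform(rng):
    """Uniform draw on the open interval (0, 1), so that logarithms are always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def draw_error(dist, rng):
    """One error term for the chosen AFT model."""
    if dist == "weibull":                      # standard extreme value (log of a unit exponential)
        return math.log(-math.log(open_uniform(rng)))
    if dist == "lognormal":                    # standard normal
        return rng.gauss(0.0, 1.0)
    if dist == "loglogistic":                  # standard logistic
        u = open_uniform(rng)
        return math.log(u / (1.0 - u))
    raise ValueError("unknown distribution: " + dist)

def simulate_aft_sample(n, beta, sigma, dist="weibull", c=None, rng=None):
    """Return (y, delta, X): observed log-times, failure indicators and covariate rows.

    c is the upper limit of the uniform censoring distribution; c=None means no censoring.
    """
    rng = rng or random.Random()
    y, delta, X = [], [], []
    for _ in range(n):
        x1 = rng.gauss(1.0, 0.5)               # normal, mean 1, variance 0.25 (sd 0.5)
        x2 = rng.expovariate(4.0)              # exponential, mean 1/4 (rate 4)
        x3 = 1.0 if rng.random() < 0.5 else 0.0
        row = [1.0, x1, x2, x3]
        log_t = sum(b * x for b, x in zip(beta, row)) + sigma * draw_error(dist, rng)
        if c is None:
            y.append(log_t)
            delta.append(1)
        else:
            log_c = math.log(c * open_uniform(rng))   # censoring time uniform on (0, c)
            y.append(min(log_t, log_c))
            delta.append(1 if log_t <= log_c else 0)
        X.append(row)
    return y, delta, X

rng = random.Random(123)
y, delta, X = simulate_aft_sample(50, (0.15, 0.15, 0.0, 0.0), sigma=1.0, dist="weibull", c=5.0, rng=rng)
print("observed censoring proportion:", 1.0 - sum(delta) / len(delta))
```

In practice `c` would be adjusted by trial until the average censoring proportion is close to the target (30% or 50%).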
Table 2. Null percentage rejection rates of $H_0: \beta_2 = \beta_3 = 0$ for the likelihood ratio and gradient tests for different values of $\sigma$, sample size ($n$) and censoring percentage (c.p.): Weibull AFT model
Based on 10,000 simulation replicates, we estimated the null rejection rates of the likelihood ratio and the gradient tests of $H_0: \beta_2 = \beta_3 = 0$ for three nominal levels ($\alpha = 0.01$, 0.05 and 0.10). We also considered the null hypothesis $H_0: \beta_2 = 0$, but the results are similar and not presented here for brevity. The simulations were performed in R, with the optimization of the likelihood function obtained using the survreg function in the package survival with default initial values. Tables 2, 3 and 4 reveal the well-known liberal tendency of the likelihood ratio test when the sample is not large. In fact, for all of the cases considered with $n \le 100$, the null rejection rates of the likelihood ratio test exceed the corresponding nominal level regardless of whether censoring is present or not, and regardless of the error distribution. For instance, for the Weibull case (Table 2), when $n = 30$, $\sigma = 0.5$ and $\alpha = 5\%$ the null rejection rates are 7.6% (no censoring), 8.6% (30% censoring) and 9.4% (50% censoring). The gradient test exhibits some liberal tendency, but this tendency is much less pronounced than that of the likelihood ratio test.
In the aforementioned case the null rejection rates of the gradient test are 5.1%, 5.9% and 7.0%, respectively. It can be noticed that the performance of both tests deteriorates slightly as the censoring percentage increases. We conclude that the gradient test is less size distorted than the likelihood ratio test.
We now turn to the investigation of the finite-sample power properties of the two tests. As our size simulations show, the tests have different sizes. To perform power comparisons, we first ensure that the tests have the correct size under the null hypothesis. To this end, we used 100,000 Monte Carlo simulated samples, drawn under the null hypothesis, to estimate the exact critical value of each test for the chosen nominal level. For the power simulations we considered the Weibull AFT models and computed the rejection rates under the alternative hypotheses $H^{(\gamma)}: \beta_2 = \beta_3 = \gamma$, for $\gamma$ ranging from −1.5 to 1.5. The results for the Weibull AFT model are presented in Figure 1.
The plots for the log-normal and log-logistic AFT models are similar and are not presented here for brevity. It can be noticed that the powers of both tests are strongly influenced by the scale parameter, with larger values of $\sigma$ corresponding to lower power. As expected, increasing censoring percentage is accompanied by a loss of power for both tests. Additionally, the panels in Figure 1 suggest that the likelihood ratio and gradient tests have similar powers.
In short, the likelihood ratio and the gradient tests are equally powerful, but the likelihood ratio test is liberal when the sample size is not very large, while the gradient test is clearly much less size-distorted. Our overall conclusion is that the gradient test is more reliable than the likelihood ratio test when the sample size is small or moderate, and it should be preferred in practical applications.
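The Monte Carlo logic used in this section (null rejection rates against the asymptotic critical value, then size-corrected power using critical values estimated from null samples) can be sketched as follows. To keep the example self-contained, it uses a toy setting that is an assumption of this sketch and not the paper's design: an exponential AFT model with intercept only and $\sigma = 1$, testing $H_0: \beta_0 = 0$, with uniform censoring on $(0, c)$. Replication counts are much smaller than the 10,000 and 100,000 used in the paper, and all function names are hypothetical.

```python
import math
import random

def lr_and_gradient(times, delta, beta_null=0.0):
    """LR and gradient statistics for H0: beta_0 = beta_null in log T = beta_0 + eps (sigma = 1)."""
    d, total = sum(delta), sum(times)
    beta_hat = math.log(total / d)

    def loglik(b):
        # Additive constant sum(delta_i * log t_i) is dropped; it cancels in the LR statistic.
        return -d * b - math.exp(-b) * total

    score_null = math.exp(-beta_null) * total - d
    xi_lr = 2.0 * (loglik(beta_hat) - loglik(beta_null))
    xi_g = score_null * (beta_hat - beta_null)
    return xi_lr, xi_g

def simulate_stats(n, c, reps, rng, beta_true=0.0):
    """Simulate both statistics for testing beta_0 = 0 when the true intercept is beta_true."""
    lr_stats, g_stats = [], []
    while len(lr_stats) < reps:
        t = [math.exp(beta_true) * rng.expovariate(1.0) for _ in range(n)]
        cens = [rng.uniform(0.0, c) for _ in range(n)]
        obs = [min(a, b) for a, b in zip(t, cens)]
        delta = [1 if a <= b else 0 for a, b in zip(t, cens)]
        if sum(delta) == 0 or sum(obs) <= 0.0:   # MLE undefined; redraw the sample
            continue
        xi_lr, xi_g = lr_and_gradient(obs, delta)
        lr_stats.append(xi_lr)
        g_stats.append(xi_g)
    return lr_stats, g_stats

def rejection_rate(stats, critical_value):
    """Proportion of simulated statistics exceeding the critical value."""
    return sum(s > critical_value for s in stats) / len(stats)

def empirical_critical_value(null_stats, alpha):
    """Empirical (1 - alpha) quantile of statistics simulated under the null hypothesis."""
    ordered = sorted(null_stats)
    k = math.ceil((1.0 - alpha) * len(ordered)) - 1
    return ordered[min(max(k, 0), len(ordered) - 1)]

rng = random.Random(2014)
chi2_crit = 3.841459                      # 0.95 quantile of the chi-square with 1 df
null_lr, null_g = simulate_stats(n=30, c=3.0, reps=2000, rng=rng)
print("null rejection rate, LR:      ", rejection_rate(null_lr, chi2_crit))
print("null rejection rate, gradient:", rejection_rate(null_g, chi2_crit))

# Size-corrected power at a hypothetical alternative beta_0 = 0.5.
cv_lr = empirical_critical_value(null_lr, 0.05)
cv_g = empirical_critical_value(null_g, 0.05)
alt_lr, alt_g = simulate_stats(n=30, c=3.0, reps=2000, rng=rng, beta_true=0.5)
print("size-corrected power, LR:      ", rejection_rate(alt_lr, cv_lr))
print("size-corrected power, gradient:", rejection_rate(alt_g, cv_g))
```

Using critical values estimated under the null, rather than the chi-square quantile, puts both tests at (approximately) the same size before their powers are compared, which is why the power curves of a liberal and a non-liberal test can be compared fairly.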

Real Data Applications
To illustrate the use of the likelihood ratio and gradient tests we present two data analyses. The first concerns the failure of sub-surface equipment in a sample of oil wells obtained from an oil-drilling company. The second concerns the survival times of patients with advanced lung cancer.
For the oil wells data, it is apparent that the pumpjack (PJ) method provides a slightly longer operating time for the wells than the progressive cavity pump (PCP) method. Additionally, wells located at B seem to have higher survival times than those at A or C.
The regression model considered here is the Weibull accelerated failure time model in which

$$\log T_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \sigma \varepsilon_i, \tag{3}$$

for $i = 1, \ldots, 70$, where the $\varepsilon_i$'s are independent random errors with a standard extreme value distribution, $x_{i1}$ indicates the elevation method ($x_{i1} = 0$ for PCP and $x_{i1} = 1$ for PJ), and $x_{i2}$ and $x_{i3}$ indicate the administrative unit ($x_{i2} = 1$ for A and $x_{i3} = 1$ for C). Our goal is to test each factor in the presence of the other, i.e., the null hypotheses under test are $H_{10}: \beta_2 = \beta_3 = 0$ and $H_{20}: \beta_1 = 0$. The results for the likelihood ratio test and gradient test are given in Table 5. We note that both tests indicate that administrative unit is strongly significant. On the other hand, at the 5% significance level, elevation method is considered significant by the likelihood ratio test (p-value = 0.041) but not by the gradient test (p-value = 0.054). On the basis of our simulation results in a similar situation (here, $n = 70$, 16% censoring and $\hat\sigma = 0.951$), the gradient test is more reliable and hence it is the one to be taken into account.
For the lung cancer data, the regression model considered is given in (3) with $x_{i1}$ indicating sex ($x_{i1} = 0$ for male and $x_{i1} = 1$ for female), and $x_{i2}$ and $x_{i3}$ indicating ECOG score ($x_{i2} = 1$ for regular and $x_{i3} = 1$ for bad). The inferential results are given in Table 6. We note that sex is considered highly significant by both tests. In contrast, at the 5% significance level, there is a conflict between the tests in terms of ECOG score: the likelihood ratio test indicates that the ECOG score is significant (p-value = 0.036) while the gradient test concludes that the ECOG score is not significant (p-value = 0.051). The simulation results suggest that inferential decisions should be based on the gradient test.
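The p-values quoted in both applications come from referring each statistic to a chi-square distribution with as many degrees of freedom as parameters fixed by the null hypothesis: one for a two-level factor, two for a three-level factor. The short sketch below shows how such p-values can be computed in closed form for one and two degrees of freedom. The statistic values in the example are hypothetical and are not the ones behind Tables 5 and 6.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(chi-square_df > x) for df = 1 or df = 2 (closed forms)."""
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 2 are handled in this sketch")

def decide(statistic, df, alpha=0.05):
    """Return the p-value and whether H0 is rejected at level alpha."""
    p = chi2_sf(statistic, df)
    return p, p < alpha

# Hypothetical statistics on either side of the 5% critical value for 1 df (about 3.84).
for name, stat in [("test A", 4.2), ("test B", 3.7)]:
    p, reject = decide(stat, df=1)
    print(name, "p-value:", round(p, 4), "reject H0 at 5%:", reject)
```

When two tests give p-values on opposite sides of the nominal level, as happens in both data sets, the simulation evidence on which test holds its size better is what guides the decision.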

Concluding Remarks
We investigated and compared the performance of the likelihood ratio and gradient tests in parametric accelerated failure time models for survival data under random censoring. The Wald and score tests were not included in our simulation study because they require the computation of the Fisher information matrix, which cannot be obtained analytically for the models considered here. One could argue that the Fisher information should be replaced by the observed information matrix. However, the observed information produced negative standard errors for a non-negligible proportion of the simulated censored samples.
The simulation results suggest that the gradient and the likelihood ratio tests present similar powers. However, the gradient test clearly presents better size behavior than the likelihood ratio test, the latter being markedly liberal in small samples. Not surprisingly, the performance of both tests is sensitive to changes in the censoring percentage.
Our overall conclusion is that the gradient test should be preferred in practical applications when the sample is small or of moderate size.

Figure 1. Power of the likelihood ratio and gradient tests for different values of $\sigma$, sample size ($n$) and censoring percentage: Weibull AFT model

Figure 3. Kaplan-Meier estimate of the survival functions; lung cancer data

Table 3. Null percentage rejection rates of $H_0: \beta_2 = \beta_3 = 0$ for the likelihood ratio and gradient tests for different values of $\sigma$, sample size ($n$) and censoring percentage (c.p.): log-normal AFT model

Table 4. Null percentage rejection rates of $H_0: \beta_2 = \beta_3 = 0$ for the likelihood ratio and gradient tests for different values of $\sigma$, sample size ($n$) and censoring percentage (c.p.): log-logistic AFT model

Table 5. Summary of inference results: Oil wells data
We now consider the cancer data set available in the survival package in R. The data are survival times for patients with advanced lung cancer from the North Central Cancer Treatment Group. The original data set has 228 observations and 10 variables, and 28% of the observations are censored. To illustrate the performance of

Table 6. Summary of inference results: Lung cancer data