Flexible Bivariate Binary Models for Estimating the E ffi cacy of Phototherapy for Newborns with Jaundice

In this work we analyse the efficacy of phototherapy (treatment) on the probability of being hyperbilirubinemic (outcome) in infants. A realistic quantification of the relationship between treatment and outcome can be challenging for various reasons. First, the probability of interest might be too small. Second, confounding unmeasured variables may exist which can bias the efficacy of phototherapy at preventing significant hyperbilirubinemia. Third, relationships between covariates and the outcome variable may exhibit non-linear patterns that, if not accounted for, can bias the relationship of interest. One way of dealing with the second and third issues is to use a semiparametric recursive bivariate probit model. To address the first issue as well, we explore an extension of this model which accounts for the fact that being hyperbilirubinemic can be regarded as a rare event. The proposed approach combines the marginal distributions of treatment and outcome using copulae, and uses asymmetric link functions to deal with rare outcome events. The main features underpinning the use of asymmetric link functions within semiparametric bivariate binary models are discussed.


Introduction
Although jaundice in newborns is common and generally benign, very high total serum bilirubin (TSB) levels can injure the newborn's central nervous system (Maisels et al., 2001).This is known as hyperbilirubinemia.For this reason, TSB levels in jaundiced newborns are followed and sometimes treated with phototherapy if they are at risk of rising to or have already reached potentially dangerous levels.The American Academy of Pediatrics (AAP) has published guidelines that suggest TSB levels at which phototherapy is recommended for term and late preterm newborns (Maisels et al., 2004).However, no randomized trials have quantified the efficacy of this intervention at the TSB levels at which they are currently recommended.A randomized trial of phototherapy is difficult to do because relevant outcomes, such as a TSB level exceeding the AAP's threshold for exchange transfusion, are rare, and because there are ethical obstacles to randomizing newborns not to receive a therapy recommended by the AAP (Newman et al., 1999(Newman et al., , 2004)).
Another issue with quantifying the efficacy at preventing significant hyperbilirubinemia in infants is the presence of unmeasured confounding variables.For example, because continuing exclusive breastfeeding is a risk factor for subsequent hyperbilirubinemia, phototherapy is generally more indicated and therefore might be more commonly used in infants continuing to breastfeed exclusively (Kuzniewicz et al., 2008).Thus, it is expected that the effect of not having breastfeeding data might confound the relationship between phototherapy and the outcome, by causing the effect of phototherapy to be falsely high.
To account for the problem of unmeasured confounding, the recursive bivariate probit model (RBPM) may be used (Heckman, 1978).A recent application of RBPM to estimate the efficacy of phototherapy for newborns with jaundice has appeared in Newman et al. (2012).In brief, this approach assumes that the two observable binary variables (outcome and treatment) are manifestations of a corresponding pair of unobserved, correlated, bivariate

The Proposed Model and Its Main Features
Consider a pair of random variables (y 1i , y 2i ) (treatment and outcome, respectively) for i = 1, . . ., n, where y vi ∈ {0, 1}, v = 1, 2, and n is the sample size.The observed y vi can be viewed as determined by a latent continuous variable y * vi such that y vi = 1(y * vi > 0), where 1 is the indicator function.We assume that where and ϵ vi follows a standardised skew normal (SN) distribution with shape parameter ξ v ∈ R. Note that our development also allows ϵ vi to follow either a power normal (PN) or a reciprocal power normal (RPN), for instance; more details are given in the Appendix.Parameter ψ quantifies the effect of the treatment on the outcome on the scale of η 2i , u T 1i is defined as and corresponds to the i th row of U 1 = (u 11 , . . ., u 1n ) T , the n × P 1 model matrix containing P 1 parametric terms (e.g., intercept, dummy and categorical variables), α 1 is a coefficient vector, and the s 1k 1 are unknown smooth functions of the is the i th row vector of the n × P 2 model matrix U 2 = (u 21 , . . ., u 2n ) T , α 2 is a parameter vector, and the s 2k 2 are unknown smooth terms of the K 2 continuous regressors z 2k 2 i .The s vk v (z vk v i ) are subject to the centering (identifiability) constraint (Wood, 2006).η 1i and η 2i can also include smooth terms interacted with some predictor(s) and smooth functions of two or more covariates (e.g., Wood, 2006).

Smooth Function Representation
An effective way of representing smooth functions of continuous variables is the regression spline approach (e.g., Eilers & Marx, 1996).Specifically, where the b vk v j (z vk v i ) are known basis functions, the β vk v j are parameters, J vk v is the number of spline bases used to represent s vk v (•), B vk v (z vk v i ) T is the i th vector of dimension J vk v , defined as

}
, and β vk v is the corresponding parameter vector.The cases of smooth terms multiplied by some covariate(s) and of smooths of more than one variable follow a similar construction.Several choices for the basis functions are possible and include low rank thin plate regression splines, B-splines and cubic regression splines (e.g., Ruppert et al., 2003;Wood, 2006).Linear predictors (1) and (2) can, therefore, be written as

SN
The probability density function (pdf) and cdf of the standard SN distribution are given as where T (•, •) denotes the Owen's T function and ξ v ∈ R. A reflection of SN can be obtained by using the result (Azzalini, 1985).

Note that
, that is the standardised skew normal distribution reduces to a standardised normal distribution.Using the latent variable representation of the response, we have that whereas if ϵ vi ∼ SN(−ξ v ) then (8) becomes The distribution of SN is positively or negatively asymmetric in agreement with the sign of ξ v (Azzalini, 1985).
Figure 1 shows the shape of the cdfs and pdfs of SN for different values of ξ v .From the curves, it is evident that SN allows the probability of a binary variable to approach zero at different rates than it approaches one depending on the values of ξ and η.If ξ = 0 then SN yields a classic probit curve.

Copula representation and log-likelihood function
Given some marginal cdfs, it is possible to define the joint cdf of event (y 1i = 1, y 2i = 1) by using the copula representation (Sklar, 1959(Sklar, , 1973) where C is a two-place copula function and θ is an association parameter measuring the dependence between P(y 1i = 1) and P(y 2i = 1).A nice feature of the copula approach is that the marginal distributions may come from different families.Also, choosing the marginal distributions and modeling the dependence between them can be treated as two separate but related issues.The families considered are Clayton, Frank, Gaussian, Gumbel, Joe and Student-t, as well as rotated versions (90, 180 and 270) of the Clayton, Gumbel and Joe copulas; see the table and figures in Radice et al. (2013) for these copulas' definitions and illustrations.Nelsen (2006) provides a comprehensive overview of copulas and their properties.The log-likelihood function of the model can be written as where P(y 1i = 1, y 2i = 0) = P(y 1i = 1) − P(y 1i = 1, y 2i = 1), P(y 1i = 0, y 2i = 1) = P(y 2i = 1) − P(y 1i = 1, y 2i = 1) and P(y 1i = 0, ] .

Treatment Effect
In our case, the aim is to investigate how the treatment changes the expected outcome.In other words, the interest is in the effect of y 1i on P(y 2i = 1|y 1i ), which in practice can be calculated using the commonly used average treatment effect in the specific sample at hand (SATE; Imbens, 2004).Assuming that P( where η (y 1i =r) 2i represents the linear predictor evaluated at y 1i = r where r is equal to 1 or 0, δ contains all the model parameters and X = (x 1 | . . .|x n ) T with x i defined as (X T 1i , X T 2i ) T .SATE(δ, X) can be estimated using SATE( δ, X), whereas a confidence interval for it can be obtained using posterior simulation as explained in Section 4. If we use different marginal probability models then expressions (4) and ( 5) have to be changed accordingly.

Parameter Estimation
As pointed out by Azzalini & Arellano-Valle (2013), extra care must be taken when designing an algorithm to estimate the coefficients of a model that involves a link function which depends on a shape parameter.Here, the authors propose to penalize the shape parameter in estimation.We follow a similar idea and use a ridge penalty approach to penalize the shape parameters in our model.
Let us denote the copula log-likelihood as ℓ(δ), where δ T = (δ T 1 , δ T 2 , θ, ξ 1 , ξ 2 ).Because θ is bounded in most copula cases, we replace θ with θ * (see the table in Radice et al. (2013)).The use of δ T * = (δ T 1 , δ T 2 , θ * , ξ 1 , ξ 2 ) will ensure that in optimization δ * ∈ R p , where p is the total number of parameters.To avoid overfitting, which is likely to happen when a model includes smooth components, we employ in estimation the (second-order roughness) penalty term } 2 dz vk v , where the λ vk v are smoothing parameters controlling the trade-off between fit and smoothness (e.g., Ruppert et al., 2003;Wood, 2006).Because regression splines are linear in their model parameters, the overall penalty can be written as β T S λ β where λ vk v S vk v and the S vk v are positive semi-definite known square matrices.Expressions for the b vk v j (z vk v i ) needed to represent the smooth terms and the respective S vk v can be found in Ruppert et al. (2003) and Wood (2006).The ridge penalties for the shape parameters are simply given by ξ 2 1 and ξ 2 2 .The function to maximize is where Note that in (6) we need to assume that λ T = (λ 1k 1 , . . ., λ 1K 1 , λ 2k 2 , . . ., λ 2K 2 ) is known; joint estimation of δ and λ via maximization of (6) would clearly lead to λ = 0, which is not practically useful (e.g., Wood, 2006).Building on the approach described in Radice et al. (2013), estimation of δ and λ is achieved in two steps which are iterated until convergence: step 1 For a given parameter vector value δ [a]   * and holding the smoothing parameter vector fixed at λ [a] , find an estimate of δ * using the trust region Newton approach: where, at iteration a, g [a]  p = g [a] − Sλ [a] δ[a] * , H [a]  p = H [a] − Sλ [a] , g [a] is made up of g 1 and g [a]  5 = ∂ℓ(δ * )/∂ξ 2 | ξ 2 =ξ [a]   2 , and the Hessian matrix has a 5 × 5 matrix block structure with (r, h) th element H r ,δ h =δ [a]   h , r, h = 1, . . ., 5, where δ 3 = θ * , δ 4 = ξ 1 and δ 5 = ξ 2 .∥ • ∥ denotes the Euclidean norm and r [a] represents the radius of the trust region.At each iteration, lp (δ [a]   * ) is minimized subject to the constraint that the solution falls within a trust region with radius r [a] .The proposed solution is then accepted or rejected and the trust region expanded or shrunken based on the ratio between the improvement in the objective function when going from δ [a]   * to δ [a+1] * and that predicted by the quadratic approximation; see Nocedal & Wright (2006) for full details.Note that, near the solution, the trust region Newton algorithm typically behaves as a Newton algorithm.As pointed out by Nocedal & Wright (2006), this approach usually works better than classic optimization methods.step 2 For a given smoothing parameter vector value λ [a] and holding the main parameter vector value fixed at , find an estimate of λ: where ň = 4n, z a+1] is the square root of a block diagonal matrix made up of a+1]   is the hat matrix, Šλ = diag(0 ) represents the effective degrees of freedom (ed f ) of the penalized model (e.g., Wood, 2006), and γ is a tuning parameter which can be increased from its usual value of 1 to obtain smoother models (Kim & Gu, 2004).Function V u (λ) can be regarded as an approximate Akaike information criterion (AIC) (Wood, 2006).A similar approach has been employed in a related context by Radice et al. (2013) and Marra & Radice (2011) to which we refer the reader for more details.Note that quantities needed to construct the working linear model are efficiently set up by using sparse algebra.In addition, the working linear model quantities in (7) are constructed for a fixed value of θ * , hence keeping the dimensionality of the problem to the lowest possible.This is sensible because this parameter is not either penalized or specified as function of a linear predictor whose components may need to be penalized.

Model Selection and Confidence Intervals
Model selection can be achieved using the AIC which, in our case, is AIC = −2ℓ( δ * ) + 2ed f , where the loglikelihood is evaluated at the penalized parameter estimates and ed f = tr( Âλ ).
For the construction of confidence intervals, we use the Bayesian covariance matrix V δ * = −H −1 p which has been shown to produce intervals with close to nominal coverage probabilities as opposed to its frequentist counterpart (Marra & Wood, 2012).As explained, for instance, in Wood (2006) and Radice et al. (2013), another advantage of using a Bayesian result for interval construction is that intervals for non-linear functions of the model coefficients (e.g., SATE) can be conveniently obtained by simulation from the posterior distribution of δ * .This can be achieved as follows: step 1 Draw n sim random vectors from N( δ * , Vδ * ).step 2 Calculate n sim simulated realizations of the function of interest.Let us assume that a Gaussian copula is employed and that the function of interest is SATE(δ, X) where We then have n sim simulated realizations δ sim,1 , . . ., δ sim,n sim and hence SATEs.step 3 Using the n sim realizations of SATE calculate the lower, (ς/2), and upper, 1 − ς/2, quantiles.n sim can be set to 1000 as the 3 steps are relatively inexpensive to run, whereas the significance level is usually set to 0.05.Note that an approximate confidence interval for SATE could also be obtained using the delta method in Radice et al. (2013).

Data
The aim is to estimate the efficacy of phototherapy in newborns with jaundice.Following previous studies, we considered 20731 infants born in 12 Northern California Kaiser Permanente Medical Care Program hospitals from 1 January 1995 to 31 December 2004 whose birth weight was at least 2000 grams and whose gestational age at least 35 weeks, with TSB levels within 3 mg/dL of the AAP's phototherapy threshold.
The data includes the following variables: • Binary outcome variable (threshold) indicating the cross of the AAP exchange transfusion threshold within 48 hours of the qualifying TSB.
• Binary treatment variable (phototherapy) representing the receipt of hospital phototherapy within 8 hours of the qualifying TSB.-Gender (gender), binary variable (gender=1 if male, 0 otherwise).

Model Specification
The linear predictors for the phototherapy and threshold equations were specified as where s 11 and s 21 are the unknown smooth functions described in Section 2.1.The smooth components were represented using penalized thin plate regression splines with basis dimensions equal to 10 and penalties based on second order derivatives (Wood, 2006).The non-linear specification for weight arises from the fact that this covariate is likely to affect P(phototherapy = 1) and P(threshold = 1) non-linearly.gest was included as a parametric component because it did not have enough unique covariate values to justify the use of a smooth function.Regarding the use of link functions, we considered the probit and skew normal links for each copula model mentioned in Section 2.3.

Results
For each fitted model the AIC, estimated θ and estimated SATE are reported in Tables 1, 2 and 3.The results for the model cases that did not achieve convergence are not reported.Specifically, for the models with probit margins and 90 and 180 degrees rotated Clayton copulas and 270 degrees rotated Gumbel copula, the algorithm failed to find a solution.For the models with skew normal margins and Frank, Student-t, 180 degrees rotated Clayton copula, 270 degrees rotated Gumbel copula, not-rotated Joe and 270 degrees rotated Joe copula, the algorithm did not converge.Numerical problems sometimes arise when the model employed is not appropriate to fit the data at hand.
The model with lowest AIC is the 180 degrees rotated Gumbel copula with skew normal margins, although other copula models with very similar AIC values are good candidate as well.The findings are consistent with the interpretation that babies who are less likely to receive phototherapy are at lower risk of being hyperbilirubinemic.
The result is consistent with the interpretation that the probability of being hyperbilirubinemic decreases by 0.78% when a baby receives photherapy as compared to a baby who does not receive it.When comparing the estimated results among all the models two points are worth noting.First, with some exceptions (e.g., the 270 degrees rotated Clayton copula, 90 degrees rotated Gumbel and Joe copulas), the estimated SATE appears to be robust to different margins and copula specifications.Second, the estimated θ are similar for the models whose AIC values are reasonably close to each other, whereas for the models which are not supported by the data the estimates vary considerably.Description: These were obtained by using the rotated 180 degrees Gumbel copula with skew normal margins.
Results are on the scale of the respective linear predictors.Grey shades represent 95% confidence intervals and the 'rug plot', at the bottom of each graph, is used to show the covariate values.
Figure 2 shows the estimated smooth components for the phototherapy and threshold equations.The effect of weight is non-linear for the treatment equation suggesting that the propensity of receiving phototherapy first decreases and then increases as weight increases.As for the outcome equation the pointwise confidence intervals for the smooth function of weight suggest that weight does not have a linear or non-linear effect.For completeness, Table 4 reports the estimated parametric coefficients for Gumbel180 with skew normal margins.As the qualifying TSB level increases the propensity of receiving phototherapy and that of being hyperbilirubinemic increase.Male babies are more likely to receive phototherapy and to be at higher risk of being hyperbilirubinemic.By increasing gestational age, both the propensity of receiving phototherapy and that of being hyperbilirubinemic decrease.Finally as age increases babies are less likely to receive phototherapy and less likely to be hyperbilirubinemic.

Discussion
The use of asymmetric link functions in the context of recursive bivariate models estimating the efficacy of phototherapy for newborns with jaundice has been explored.The skew probit link approach by Azzalini (1985) has been employed.We also explored the use of power probit and reciprocal power probit links.The choice of these link functions has been driven by the fact that they include the probit link as special case and have desirable properties.Our approach can also be employed for modeling jointly two imbalanced binary responses with a focus on predictive performance, for instance.
The approach has been applied to a study where the aim was to quantify the efficacy at preventing significant hyperbilirubinemia in infants.Results showed that phototherapy significantly decreases the probability of being hyperbilirubinemic.Also, our empirical findings suggest that the estimated quantity of interest (SATE) is robust to relaxing the assumption of symmetric links.

Appendix
The pdf and cdf of the standard power normal (PN) distribution, denoted in our context by ϵ vi ∼ PN(ξ v ), are given as respectively, where ϕ(.) and Φ(.) are the pdf and cdf of the standard normal distribution, and ξ v > 0. A new random variable can be defined such that In this case ϵ vi ∼ RPN(ξ v ) denotes the standard reciprocal PN distribution.The name comes from that fact that F RPN (ϵ vi , ξ v ) = 1 − F PN (−ϵ vi , ξ v ).Therefore, the standard PN and standard RPN distributions are distinct although closely related because one is the reflection of the other.
Using the latent variable representation of the response, we have that P(y vi = 1) = P(y * vi > 0) = P(η vi + ϵ vi > 0) = P(ϵ vi > −η vi ) = 1 − P(ϵ vi ≤ −η vi ).( 8) If ϵ vi ∼ PN(ξ v ), then (8) becomes whereas if ϵ vi ∼ RPN(ξ v ) then ( 8) is As explained by Gupta & Gupta (2008), the PN density is unimodal and is skewed to the right if ξ v > 1 and to the left if 0 < ξ v < 1.On the other hand, by considering the reciprocal formulation, the RPN density is also unimodal and is skewed to the left if ξ v > 1 and to the right if 0 < ξ v < 1. Figure 3 shows the shape of the cdfs and pdfs of PN and RPN for different values of ξ v .From the curves, it is evident that PN and RPN allow the probability of a binary variable to approach zero at different rates than it approaches one depending on the values of ξ and η.When ξ = 1, PN and RPN yield a classic probit curve.

Figure 2 .
Figure 2.Estimated smooth components for the continuous variable weight in the phototherapy equation (eq.1) and threshold equation (eq.2)

Table 3 .
Estimated SATE (and confidence interval) obtained from the fitted copula models with probit and skew normal margins