Statistical Issues on Analysis of Censored Data Due to Detection Limit

Measures of substance concentration in urine, serum or other biological matrices often have an assay limit of detection. When concentration levels fall below the limit, the exact measures cannot be obtained, and thus are left censored. Common practice for addressing the censoring issue is to delete or ’fill-in’ the censored observations in data analysis, which often results in biased or non-efficient estimates. Assuming the concentration or transformed concentration follows a normal distribution, a Tobit regression model can be applied. When the study population is heterogeneous, for example due to the existence of a latent group of subjects who lack the substance, the problem becomes more challenging. In this paper, we conduct intensive simulation studies to investigate the statistical issues in analyzing censored data and compare different methods in which the data are treated either as a dependent variable or an independent variable. We also analyze triclosan data in the NHANES study and metabolites data in the Bogalusa Heart Study to illustrate the issues. Some guidelines for analyzing such censored data are provided.


Introduction
Measures of substance concentration in urine, serum or other biological matrices that fall below assay limit of detection are common in epidemiological and medical research (Nassan et al., 2017;Ferrero et al., 2017;Østergren et al., 2017;Kim et al., 2018;Zhao et al., 2018;Gomez et al., 2019;Maule et al., 2019). When the concentration levels are under the detection limit (DL), denoted as L, accurate measures cannot be obtained. Instead, their values are only partially known and left censored. For example, triclosan, collected in NHANES studies, is a broad-spectrum antimicrobial chemical and widely used in household and health care related products. Currently the DL for urine triclosan concentration is 2.3 ng/ml, meaning only triclosan concentrations greater than or equal to 2.3 ng/ml can be detected. For triclosan concentrations lower than 2.3 ng/ml, their values are censored. Instead of precise measures of the triclosan concentration levels, the values are only partially known, namely somewhere between 0 and 2.3. The censoring generates informative missing data that need to be properly handled in order to obtain valid inference and achieve efficiency.
A simple approach for handling such censored data is complete-case analysis (Little and Rubin, 2002). This approach discards all subjects who have censored data and fits regression models by using only subjects whose data are uncensored. This method is often invalid because it violates the missing completely at random assumption and is potentially highly inefficient even with moderate censoring.
Another common approach is the 'fill-in' method where the censored observations are replaced by a constant such as L, or 1 2 L (Olson, 1993;LaFleur et al., 2011;Slymen et al., 1994;Gleit, 1985;Newman et al., 1989). This approach is widely applied in epidemiological and medical research because of its simplicity of implementation, but often leads to biased estimates.
When the underlying measures follow a single normal distribution, the data then follow a censored normal distribution due to DL. Tobit regression models can be applied to model such censored dependent variables (Tobin, 1958). When the underlying measures are not from a normal distribution, data transformation such as log-transformation can be employed first, and a Tobit regression model can then be applied on the transformed data. The Tobit regression model has been widely applied in some fields such as economics (Rosen, 1976;Keeley et al., 1978;McDonald and Moffitt, 1980;Amemiya, 1984;Zhou et al., 2018;Al-Hanawi et al., 2018;Deng et al., 2019;Leng et al., 2019;Hezaveh and Cherry, 2019); however, its application is not well-recognized in epidemiological and medical research despite the universality of censored data due to DL.
We run into issues when censored observations are from heterogeneous populations, such as when exposure and nonexposure become factors. Subjects from the non-exposure population are not exposed at all, so their concentration level is 0 and thus censored, while subjects from the exposure population will have a concentration level greater than 0. If their levels are lower than L, they are censored as well. Only those subjects who are exposed and also have concentration levels greater than or equal to L can be observed. In the above triclosan example, subjects having triclosan in their urine, namely the exposed group, but with concentration levels lower than L are censored. However, when there is a subgroup of subjects who do not have triclosan at all, i.e, are not exposed to triclosan, their measures are of course under L and thus are censored as well. Because measures from the non-exposed group are always censored, they are considered latent. If the data from the exposure population follow a censored normal distribution, data from the whole sample follow a mixture of censored normal and degenerate distributions. In this case, if a Tobit model is applied, biased estimates can occur. To model data from such heterogeneous populations, mixture model techniques should be applied to achieve unbiased estimates (Moulton and Halsey, 1995;Taylor et al., 2001;Reisetter et al., 2017).
When censored variables are used as predictors, common practices such as deletion or 'fill-in' can be problematic (Zhao et al., 2018;Park et al., 2018;Javad et al., 2018) . To better facilitate information provided by the censored observations in predictors, a joint modeling approach is proposed to aid estimation. The joint modeling has been applied in Lynn (2001); Rigobon and Stoker (2003); Austin and Hoch (2004); Bernhardt et al. (2015), but its application in epidemiological and medical research is very rare. Furthermore, no method is available when censored predictors are from heterogeneous populations.
In this paper, we propose a mixture model for outcomes obtained from heterogeneous populations, and develop joint modeling for censored predictor, either from a single or heterogeneous populations. We conduct intensive simulation studies to investigate statistical issues for different methods and also use two real data examples to illustrate the methods.

As Dependent Variable
When censored data are used as outcome, Tobit models can be applied. Next, we give a brief review of the Tobit model for the outcome from a single population. Because the Tobit model is inappropriate for modeling outcomes from heterogeneous populations, a mixture model is also proposed for those cases (Moulton and Halsey, 1995;Taylor et al., 2001;Reisetter et al., 2017).

Tobit Model for Single Population
The Tobit model is a linear regression model for data from censored normal distribution. Consider an independent sample (x i , z i ), i = 1, ..., n, where n is the sample size, x i = (x i1 , x i2, ..., x ip ) is a p-dimensional covariate, and z i is the outcome for the i th subject. The z i is obtained based on an underlying outcome variable z * i , which is assumed to have a linear relationship with x i through Let L be the lower DL, the outcome z i is defined as Under (1), the z i follows a censored normal distribution with likelihood given by where Φ (·) is the cumulative distribution function of the standard normal and µ i = x i β is the conditional mean of the outcome. The outcome z i can be modeled by a Tobit regression model, denoted as Tobit(µ i , σ 2 , L), given by Let c i indicate whether z i is censored or not with c i = 1 for censored and c i = 0 for z i ≥ L. The likelihood function International Journal of Statistics and Probability Vol. 9, No. 4;2020 consisting censored and non-censored subjects is The parameters β and σ can be estimated based on maximum likelihood estimate (MLE) approach, i.e., maximizing the likelihood (5) for given (4).
The Tobit regression model is applied for outcomes from censored normal distribution. When the underlying measures are not normally distributed, data transformation can be applied first, and a Tobit regression model can then be applied on the transformed data.

Mixture Model for Heterogenous Populations
For data from heterogeneous populations, such as concerning exposure and non-exposure, Tobit models are not appropriate. Instead, models which can separate the two populations are needed.
Let ω be the prevalence of the non-exposure, so the whole population consists of 100ω% non-exposure and 100(1 − ω)% exposure. If data from the exposure population follow a censored normal distribution (3), the data from the whole population have a mixture of censored normal distribution with probability (1 − ω) and degenerate distributions with probability ω, given by if exposure and z i ≥ L, If the likelihood of non-exposure is predicted by some covariates, say u i , and modeled by a logit model, then the exposure outcome is modeled by a Tobit model (4), so the data from the heterogeneous populations can then be modeled by a mixture of Tobit and logit models, denoted as mTobit ω i , µ i , σ 2 , L , given by The mTobit model is composed of a logit model for the likelihood of non-exposure, and a Tobit model for the correlates of the exposure. The covariates u i and x i in (7) can be the same or different. If the likelihood of non-exposure doesn't depend on any covariates, no link function is needed. The likelihood function of the mTobit model is given by Again, the parameters β ω , β and σ can be estimated by an MLE approach based on the likelihood (8) and model (7) Under the Tobit model, for a given L, the proportion of the data under the DL is E Φ L−µ i σ , while under the mTobit model, the proportion of the data under the DL becomes ω for ω > 0. Therefore, ω is a parameter indicating excessive observations under the DL. Therefore, when the data exhibit more censored observations than what would be expected under Tobit model, the mTobit can be applied to address the heterogeneous issue.

As Independent Variable
Censored data are also often used as predictors to examine their relationship with some health-related outcomes. Because of the limitations of deletion and 'fill-in' methods, a joint modeling approach is developed to address the censoring issues. Similar to the censored data being treated as the outcome, we consider now censored predictors from either a single exposure or heterogeneous populations. To illustrate the issues of censored predictors, for simplicity, we consider the censored predictor as the only predictor for continuous outcomes as similar issues occur for other types of outcomes.

Predictor From Single Population
Suppose the underlying exposure Z * i ∼ N(µ, σ 2 z ), the continuous outcome Y i is associated with Z * i through Due to DL, z * i is observed only if z * i ≥ L, i.e., the predictor z i is obtained based on (2), and the distribution of z i is given by (3). Let F(·), f (·) and g(·) be the distribution of (Y, Z), Y|Z and Z, separately. In joint modeling, we have For uncensored subjects, based on (9) and (3), we have Thus the joint likelihood for the uncensored subjects is given by While for censored subjects with Z i < L, the joint likelihood So the likelihood for the whole sample can be given by By plugging (11) and (12) into (13) and applying MLE approach, the parameters (β 0 , β 1 , µ, σ 2 z , σ 2 z ) can be estimated, and inference can be made.

Predictor From Heterogenous Populations
Assume the predictor z is from heterogeneous populations composed of non-exposure with probability ω and exposure with probability (1 − ω). Suppose the underlying exposure z * i ∼ N(µ, σ 2 z ), and z * i is observed only if z * i ≥ L due to DL. The predictor z i is a mixture of censored exposure and non-exposure with distribution given in (6). As it is very likely that the relationships of exposure and non-exposure with outcome are different, we assume that relationships of the exposure and non-exposure with outcome y i is where ε i ∼ N(0, σ 2 ). In (14), the coefficient β 1 describes the relationship between the exposure and outcome, while β 2 captures the relationship between the non-exposure and outcome.
For uncensored subjects (c i = 0), based on (14) and (6), we have So the joint likelihood is given by Censored subjects can be from either the non-exposure or exposure population. Assume R i be the membership indicator http://ijsp.ccsenet.org International Journal of Statistics and Probability Vol. 9, No. 4;2020 with R i = 1(0) for non-exposure (exposure) and R i is known, based on (14) and (6), the joint likelihood for non-exposure is given by and the joint likelihood for censored exposure is The joint likelihood for the whole sample including non-exposure, censored exposure and non-censored exposure is given by Plugging (15), (16), (17) into (18), and apply MLE approach, the parameters (β 0 , β 1 , β 2 , µ, σ 2 z , σ 2 ) can be estimated and inference can be conducted.

Simulation Studies
Simulation studies are conducted to investigate statistical issues of different methods for handling censored data. For censored data treated as the outcome, the data are considered from single exposure and heterogeneous populations separately. Methods considered for handling the censoring issue include deletion, 'fill-in' by L or 1 2 L, and fitting the data with Tobit and mTobit models. For data treated as a predictor, in addition to deletion or 'fill-in' by L or 1 2 L, we also consider joint modeling for continuous outcome.
Different values for L are considered to reflect the varying amount of data censored. For data from heterogeneous populations, different prevalence (ω) of the non-exposure is also considered. In all the simulations, small (200), moderate (500) and large (1000) sample sizes are considered and a Mont Carlo (MC) sample size of 1000 is used.

As Dependent Variable
For Tobit models, we use the R function 'VGAM' to estimate parameters. As there is no R package available for mTobit models, we use a R function 'optim' to find the MLE of parameters based on (8).

From a Single Exposure Population
Let x, generated from N(0, 1), be a predictor of the underlying exposure z * through Let β 0 = 0.5 and β 1 = 1, and let the outcome z be obtained by censoring z * at L, where L is set to be −0.93, −0.44, −0.086, 0.22 and 0.50 in corresponding to 10%, 20%, 30%, 40% and 50% of data censored, respectively. The outcome z is fitted by 1) deleting all the censored observations; 2) filling-in the censored observations by L or 1 2 L; 3) Tobit model; 4) mTobit model to estimate β 1 , which is of our interest.
The mean of the estimated β 1 across 1000 realizations for different methods are provided in Figure 1. Based on the figure, all the methods except the Tobit model yield biased estimates with the deletion method performing the worst. As expected, the bias becomes larger as more data censored. The patterns are similar across different sample sizes. In all the cases, the Tobit model gives unbiased estimates regardless of the proportion of censored data and sample size. For outcome from a single population, the mTobit model doesn't converge for most (75% to 100%) of the cases because of the lack of a non-exposure population (ω = 0).

From Heterogeneous Populations
Outcomes from heterogeneous population are generated in two steps. Let ω be the prevalence of non-exposure. In the first step, a membership R i is generated to indicate whether subjects are exposed based on a Binomial (n, ω) distribution. In the second step, for subjects who are exposed, the underlying exposure z * i are generated as in (19). Specifically, the outcome z i is generated by Let ω range from 10% to 30%, and L be the same as in Section 3.1.1. The censored observations come from two sources, one from all non-exposed subjects, and the other from exposed subjects with values under L. The same methods as in Section 3.1.1 are considered.
The mean of the estimated β 1 for different L and ω across 1000 realizations are presented in Figure 2. All the methods except the mTobit model yield biased estimates. Unlike the data from a single exposure population, when the outcome is from heterogeneous populations, the Tobit model yields biased estimates with less bias for smaller non-exposure group. In all the cases, only the mTobit model gives unbiased estimates.

From a Single Exposure Population
Let z * be the underlying exposure and generated similarly as in (19). Specifically, z * is generated from N(0.5, 0.5), due to DL, with the predictor z censored at L; i.e., z = z * if z * ≥ L; otherwise z is censored.
Suppose the outcome y i is associated with z * i through y i = β 0 + β 1 z * i + ε i with ε i ∼ N(0, 0.5), β 0 = −1.0 and β 1 = 1.0. The outcome y i is observed for all subjects, but z i is censored for some subjects. In addition to deletion and 'fill-in' methods, joint modelling based on (13) is applied to estimate β 1 . Figure 3A summarizes the mean of the estimated β 1 across 1000 realizations. When the censored data are 'filled-in' by L or 1 2 L, the estimates are biased, and the bias becomes worse with more data censored. When the censored observations are deleted or the data is fitted by joint modeling, the estimates are unbiased. However, the joint modeling yields more efficient estimates, with the relative efficiency ranging from 70% to 30% of that of the joint model when censoring ranges from 10% to 50% (see Figure 3B).

From Heterogeneous Populations
The predictor z i from a heterogeneous population is generated similarly as in Section 3.1.2. Specifically, a binary indicator R i is first generated from Binomial (n, ω) to indicate if a subject is exposed or not. For subjects exposed, the underlying exposure z * is generated from N(0.5, 0.5), but censored at L, i.e., L is still set to be −0.93, −0.44, −0.086, 0.22 and 0.50 to correspond to 10%, 20%, 30%, 40% and 50% of data censored for the exposure population, respectively. We assume outcome y is associated with the exposure and non-exposure through International Journal of Statistics and Probability Vol. 9, No. 4;2020 to reflect different relationships of exposure and non-exposure with outcome, where β 0 = −1.0, β 1 = 1.0 and β 2 = −1.0.
We consider two scenarios. In the first scenario, the predictor is treated from a single population, i.e, all the censored observations are treated as the same and from the exposure population. In the second scenario, we assume that the membership is known, and we use the non-exposure (R i = 1) to estimate β 2 , but use the data from exposure population (R i = 1) to estimate β 1 based on joint modeling.
The estimated β 1 are summarized in Figure 4A. The 'filled-in' methods yield biased estimates, and results in greater bias for larger non-exposure populations and increased censored data from the exposure population. The joint modelling produces biased estimates if the censored data is treated as the same, i.e., the censored observations are not differentiated between exposure and non-exposure, while when the membership is known, i.e., if the censored observations from the exposure group can be differentiated from non-exposure, the joint modelling yields unbiased estimates. The deletion method also gives unbiased estimates as deleting the censored observations doesn't change the relationship between the exposure and the outcome. However, the joint modelling is more efficient than the deletion method. The efficiency of the deletion is 70% to 30% of that under joint modeling when the censored data for the exposure population ranges from 10% to 50% ( Figure 4B). Similar patterns are found for sample sizes 200 and 500, see Figure S1 and Figure 2 in the online supplementary material. Another advantage is that the joint modelling can estimate the relationship between the non-exposure and outcome, which is often important, both conceptually and clinically. Figure S1. The mean of estimated β 1 (β 1 =1) for different methods when outcome is continuous and the predictor is from heterogenous populations (A), and the relative efficiency of Deletion over Joint Model K (B) with sample size 200 Note: K: the membership is known.
Single: the predictor is treated as from a single population Figure S2. The mean of estimated β 1 (β 1 =1) for different methods when outcome is continuous and the predictor is from heterogenous populations (A), and the relative efficiency of Deletion over Joint Model K (B) with sample size 500 Note: K: the membership is known.
Single: the predictor is treated as from a single population.

Case Studies
Next we use two examples to illustrate the methods, the urine triclosan exposure and participants ' BMI in NHANES 2003-2010 Study and the serum metabolites and BMI in the Bogalusa Heart Study. The two examples used here are not intended to make formal inferences, but are for the purpose of illustration.

NHANES 2003-2010 Study
NHANES is a continuous program that examines a nationally representative sample of about 5000 persons each year to assess health and nutritional status of adults and children in the general population of the USA (http://www.cdc.gov/nchs/nhanes.htm). Demographic, socioeconomic, dietary, and health-related data were collected via interviews. Blood and urine samples were also collected for laboratory testing. In four surveys conducted between 2003 and 2010, urinary triclosan concentration was measured in a random sample of 3659 children (6-19 years old) and 6566 adults (20 years or older). Of these, 2898 children and 5066 adults had detectable levels of urinary triclosan, which means that there are about 22% participates with their triclosan concentration undetected, and thus censored.
Because of the skewed distribution of triclosan, a logarithm transformation is first applied. Table 1 summaries the point estimates and the associated p-value from different methods. The deletion and 'fill-in' methods detect significant association between age and triclosan, however no association is detected between BMI and triclosan for all the methods. Based on the simulation studies, the Tobit model is preferred to the deletion and 'fillin' methods. The mTobit model doesn't converge, which is likely due to lack of non-exposure population. This is not surprising as the study participants are 6 years or older, it's very likely everyone has been exposed to triclosan to some degree. This also coincide with the results provided in (He et al., ress) that no non-exposure population was detected in urine triclosan measurement. , 10-20%, 20-30%, 30-40% and 40-50% of below detection rate, respectively. We examine associations between metabolites and BMI with metabolites treated as outcome. For illustration purposes, we only consider 12 metabolites with known super-pathway and under-detection rate between 45% and 50%. The model considered is Metabolites ∼ β 0 + β 1 Age + β 2 Gender + β 3 Race + β 4 BMI.
The results are summarized in Table 2. Based on the results, the mTobit model does not converge except for two metabolites, which implies that the two metabolites likely have a non-exposure group. These findings are consistent with the results presented in (He et al., ress), where the tests indicate that there is a non-exposure group in metabolites. For the two metabolites, the estimated associations are different between Tobit and mTobit model, but the results from mTobit model should be preferred because of the existence of non-exposure.

Discussion
In this paper, we investigate statistical issues in analyzing censored data due to detection limits via intensive simulation studies. Based on the simulation studies, when the data is treated as a dependent variable, the deletion and 'fill-in' methods yield biased estimate and thus shouldn't be used. Only Tobit and mTobit models should be opted for. If the outcome is from a single exposure population, the Tobit model can be applied, while for outcomes from heterogeneous populations such as exposure and non-exposure, the mTobit model should be used.
When the censored data are treated as predictors, the 'fill-in' method always yield biased estimates and hence shouldn't be used. The deletion method usually produces unbiased estimates, but it is highly inefficient. For censored predictors from a single population, joint modeling yields unbiased estimates, and is much more efficient than the deletion method, and hence should be preferred. For predictors from heterogeneous populations, the joint modeling needs to differentiate the two populations in order to achieve unbiased estimates.
In the paper, the censored predictor is considered as the only predictor. When other predictors/covariates need to be included, and the censored predictor is correlated with other predictors/covariates, joint modeling is much more involved and needs to be developed to achieve both unbiasedness and efficiency. Joint modeling becomes even more challenging for censored predictors from heterogeneous populations with unknown membership, which is a universal case. However, regardless of if the censored data is treated as an outcome or a predictor, a foremost question is to test if the data is from a single exposure population or from heterogeneous populations in order to choose an appropriate method. Available tests for testing if there is a non-exposure population can be found in (He et al., ress), where three tests including Wald test, likelihood ratio test and score test are developed.