Comparison of Means of Two Log-Normal Distributions when Data is Multiply Censored

When measuring concentration of chemical compounds, we often have to deal with a situation when the resulting values are found below the detection limit of the determination method. In order to statistically evaluate such data, the newly developed method of maximum likelihood considering multiply left-censored samples is applied. This paper is motivated by the need to have valid inference concerning the equality of the means of two lognormal distributions that are frequently encountered in environmental and exposure data analysis. As a model distribution of measured environmental and/or biomedical data, log-normal distribution is considered. Moreover, using the asymptotic properties of maximum likelihood estimates, concentrations of chemicals can be compared. A test procedure for comparing the means of two independent log-normal populations in the presence of multiply censored data is also introduced and evaluated. Asymptotic chi-square test is used in the proposed test procedure. Worked example is given illustrating the use of the methods provided utilizing a computer program written in the R language. A simulation study was performed to examine the power and the size of the proposed test procedure introduced in this article.


Introduction
Many environmental data sets are characterized by a small number of high concentrations and a large number of low concentrations and are often right-skewed (Shumway et al., 1989).The log-normal distribution is positively skewed and hence can incorporate the few unusually high measurements of such environmental data in its long right-hand tail.For this reason the log-normal distribution is often applied to environmental data (Gilbert, 1987).While analyzing environmental and exposure data, a very common phenomenon is the occurrence of non-detects, i.e., observations below an analytical detection limit (DL), resulting in Type I singly left censored samples.Detection limit is the lowest concentration level that can be determined to be statistically different from a blank.The presence of observations below the DL significantly complicates the data analysis.Faced with such data, several strategies have been recommended for data analysis.One approach consists of replacing the below DL values with a constant such as DL 2 , and using methods available for complete samples.It is easy to demonstrate that the conclusions resulting from this routine practice can be seriously flawed; in fact, the conclusions may depend on the substitution value used for replacing sample values below the DL.In general, censoring means that observations at one or both tails are not available.Left-censored data commonly arise in environmental contexts.Left-censored data (data reported as less than detection limit) can occur when the substance or attribute being measured is either absent or exists at such low concentrations that the substance is not present above the DL level.Data sets containing left-censored observations are referred to as left-censored data.A sample is multiply censored if there are several detection limits.When more than two distinct detection limits DL 1 , DL 2 , ..., DL k (k ≥ 3) are reported, the data are said to be multiply-left-censored, (USPEA 1989b).It is common to have environmental data contains detection limits.Multiple censoring commonly occurs with environmental data because detection limits can change over time (e.g., because of analytical improvements), or detection limits can depend on the type of sample or the background matrix.Left censoring frequently arises in environmental studies due to: (1) sometimes nondetect is reported because the measurement lies below a threshold set by the client or laboratory, (2) sometimes the instrumentation registers a low signal, but the chemist decides that unpollutant environmental samples could give a similar signal and reports nondetect instead of the observed measurement, (3) sometimes the signal produced by the pollutant is too small for the instrumentation to discriminate from background noise, or (4) sometimes a signal is registered, but certain criteria that identify the compound are not met.In many environmental applications the distribution of variables such as chemical concentration, inhalation, digestion, and consumption rates are positive and skewed to the right.Hence, censored observations occur between zero and DL.In some instances a log transformation can provide a more natural scale to analyze such measurements.Samples to be considered in this paper are those that are Type I single-left-censored. Suppose that a sample of n data points is given of which m data points are non-censored (fully measured), and the remaining m c = n − m observations are left-censored with a single detection limit DL.In such Type I censored samples DL is fixed, whereas m and m c are random.
Nondetect values can cause an especially difficult problem when the goal is to compare two different populations.There has been a great deal of literature on the subject of the statistical inference of the parameters of normal and log-normal populations from both fully measured and censored data.Gupta and Li (2006) developed a score test for testing the equality of the means of two independent log-normal populations from fully measured data.Zhou et al (1997) considered two methods for comparing the means of two independent log-normal non-censored samples.Harris (1991) considered two parametric and two non-parametric methods for testing the equality of medians of two independent log-normal distributions when some data are left-censored.Paul and Gary (2007) compare the performance of several methods for statistically analyzing censored data sets when estimating the 95th percentile and the mean of right-skewed occupational exposure data.Krishnamoorthy et al (2014) proposed tests and confidence intervals for the ratio of the two means of two log-normal distributions, based on pivotal quantities involving the maximum likelihood estimators.Other suggested methods for comparing the means of two log-normal distributions are discussed in Krishnamoorthy et al (2014Krishnamoorthy et al ( , 2011Krishnamoorthy et al ( , 2007Krishnamoorthy et al ( , 2006Krishnamoorthy et al ( , 2003)).Some of these methods are based on the generalized p-value and generalized confidence intervals, and others are based on the generalized test variable.Aboueissa (2015) introduced a test procedure for comparing the means of two independent log-normal populations when data is singly censored.Abdollahnezhad et al (2012) introduced a new method of test for comparing the means of two log-normal populations through the generalized measure of evidence to have against the null hypothesis.Prentice (1978) developed linear rank tests with right censored data.Millard and Deverel (1988) adapted several existing right censored non-parametric procedure so that they can be used in environmental setting with left-censored data.Methods for the estimation of the log-normal parameters for one-sample cases where there may exist left-censored data are discussed by El-Shaarawi (1989).Stoline (1993) extended results first suggested by Harris (1991) and proposed a procedure for comparing medians of two independent log-normal distributions where some data may be left-censored.Stoline (1993)used the Expectation Maximization (EM) algorithm introduced by Dempster et al. (1977) to calculate the maximum likelihood estimates of population parameters µ and σ.Other suggested methods for estimating population parameters from censored samples are discussed in Marco (2005), Jin et al (2011), Gibbons (1994), Gleit (1985), El-Shaarawi and Esterby (1992), Elshaarawi and Dolan (1989), Gilbert (1987) and Schneider (1986).
The purpose of this paper is to develop a parametric procedure to test the hypothesis of equal means when data are sampled from two independent log-normal distributions utilizing multiply left-censored data sets.The EM algorithm will be used to obtain the maximum likelihood estimates of population parameters under different hypotheses.To facilitate the application of this procedure, a computer program is written in the R language which calculates the maximum likelihood estimates, and asymptotic chi-square test statistics and their p-values.A numerical example is given illustrating the use of this procedure utilizing a computer program written in the R language.A simulation study was performed to inspect the size and the power of the proposed test procedure.

Assumptions and Notations
Assume that there exists two random samples of n 1 and n 2 data values: y 11 , y 12 , ..., y 1m 1 , y 1m 1 +1 , ..., y 1n 1 and y 21 , y 22 , ..., y 2m 2 , y 2m 2 +1 , ..., y 2n 2 taken from two independent log-normal populations LN(µ 1 , σ 1 ) and LN(µ 2 , σ 2 ), respectively.Where LN(µ, σ) denotes a log-normally distributed variable y with the probability density function , for y > 0, where −∞ < µ < ∞ and σ > 0. For convenience, for each sample i let us assume that the first m i observations y i1 , y i2 , ..., y im i are non-censored (fully measured) and the remaining m c i = n i − m i observations are left-censored for i = 1, 2. For left censored observations, it is assumed that for each sample i it is only known that y i j < LDL i j for j = m i + 1, ..., n i (or j = 1, 2, ..., m c i ) and i = 1, 2. The parameters for the i th log-normal population can be expressed as functions of the parameters µ and σ as: mean: where γ i = e σ 2 i for i = 1, 2. Let where LDL i j are the detection limits in the i th log-normal sample and m i + m c i = n i for i = 1, 2 and j = m i + 1, m i + 2, ..., n i .
To simplify the presentation in this paper, the analysis is described and illustrated by reference to the analysis of normally distributed data, though this condition occurs infrequently in typical environmental data analysis.However, it is frequently necessary to transform real environmental data before analysis; typically the logarithmic transformation of x i j = log(y i j ) is used, although other transformations are possible.When the logarithmic or other transformation is used prior to censored data set analysis, it is necessary to transform the analysis results back to the original scale of measurement following parameter estimation.For each sample i let be the sample mean and sample variance of the m i non-censored observations x i1 , x i2 , ..., x im i , for i = 1, 2. Let the functions ϕ(.) and Φ(.) be the pd f and cd f of the standard unit normal.Define for i = 1, 2 and j = m i + 1, m i + 2, ..., n i .
We also define and z i j = x i j − µ i σ i for i = 1, 2 and j = 1, 2, ..., m i .
The likelihood function of the samples under consideration is given by: which can be written as where and The two log-normal population means are confirmed equal whenever no evidence was available to reject the null hypothesis H 0LN : µ y 1 = µ y 2 in favor of the alternative hypothesis H ALN : µ y 1 µ y 2 .Equivalently the two lognormal population means are confirmed equal whenever the null hypothesis is accepted in favor of one of the alternative hypotheses: and σ 1 σ 2 (mean homogeneity, variance heterogeneity).

Maximum Likelihood Estimates of Population Parameters
In this section the maximum likelihood estimates of population parameters µ i and σ i , for i = 1 and 2, are derived under each of the hypotheses H 0N , H A1N , H A2N and H A3N .The derivations of these estimates are now described.

Maximum Likelihood Estimates under H 0N
Under the hypothesis H 0N , x i j , for i = 1, 2 and j = 1, 2, ..., n i , are assumed to be normally distributed with mean µ and standard deviation σ.That is, it is assumed that there exists a random sample of n = n 1 + n 2 data values taken from a normal population with mean µ and standard deviation σ.For convenience, for each sample i let us assume that the first m i observations x i1 , x i2 , ..., x im i are non-censored (fully measured) and the remaining m c i = n i − m i observations are left-censored for i = 1, 2. For left censored observations, it is assumed that for each sample i it is only known that x i j < DL i j for j = m i + 1, ..., n i (or j = 1, 2, ..., m c i ) and i = 1, 2.
The likelihood function L H 0N (µ, σ) under H 0N is given by: Hence, the corresponding log-likelihood function ℓ ) is given by: and be the sample mean and sample variance of the m non-censored observations x 11 , x 12 , ..., The maximum likelihood estimates μ and σ of µ and σ are the solutions to equations (3.3) and (3.4), the partial derivatives for the log-likelihood equation with respect to µ and σ: Φ(ξ i j ) and ξ i j = DL i j −µ σ .The expectation maximization (EM) algorithm will be used iteratively to obtain the solutions μ and σ to the maximum likelihood equations (3.3) and (3.4).The EM algorithm was proposed by Dempster et. al. (1977) for calculating the maximum likelihood estimated from censored samples.The procedure consists of alternately estimating the censored observations from the current parameter estimates and estimating the parameters from the actual and estimated observations.The EM algorithm can be used to calculate the maximum likelihood estimates for the mean µ and standard deviation σ of a normal distribution from both singly-and multiply-censored samples.A brief description for the EM algorithm is given here.
At step 0 of the EM algorithm all non-censored observations are used to calculate the initial estimates of µ and σ as follows: Let μs and σs be the maximum likelihood estimates of µ and σ at step s of this procedure.At step s + 1, each censored observation x i j (where i = 1, 2; j = m i + 1, 2, ..., n i ) is replaced by an estimate of μs − σs W( DL i j − μs σs ).Let the values u i j be calculated at step s + 1 as follows: So the updated estimates μs+1 and σs+1 of µ and σ are given by where the function γ(t) is defined as: More details about the EM algorithm procedure can be found in Wolynetz (1979).Convergence is achieved if both | μs − μs+1 | < 0.00001 and | σs − σs+1 | < 0.00001 occur.When these convergence criteria are met, the maximum likelihood estimates for µ and σ are then given by μ = μs and σ = σs , respectively.

Maximum Likelihood Estimates under H A1N
Under the hypothesis H A1N x i j are assumed to be normally distributed with mean µ i and standard deviation σ i , for i = 1, 2 and j = 1, 2, ..., n i .Thus the likelihood function under H A1N is given by: Hence, the corresponding log-likelihood function ℓ H A1N (µ 1 , µ 2 , σ 1 , σ 2 ) of (3.5) which is defined as log(L H A1N (µ 1 , µ 2 , σ 1 , σ 2 )) is given by: The maximum likelihood estimates μi and σi of µ i and σ i are the solutions to equations (3.7) and (3.8) for i = 1, 2. where The single sample EM algorithm estimation method can be used to obtain the maximum likelihood estimates μi and σi of µ i and σ i for i = 1, 2 as follows.At step 0 of the EM algorithm all non-censored observations are used to calculate the initial estimates of µ i and σ i for i = 1, 2 as follows: For i = 1, 2 let μis and σis be the maximum likelihood estimates of µ i and σ i at step s of this procedure.At step s + 1, each censored observation x i j (where i = 1, 2; and j = m i + 1, ..., n i ) is replaced by an estimate of μis − σis W( DL i j − μis σis ).Let the values v i j be calculated at step s + 1 as follows:
The expectation maximization (EM) algorithm will be used iteratively to obtain the solutions μ1 , μ2 and σ to the maximum likelihood equations (3.11) and (3.13).At step 0 of the EM algorithm all non-censored observations are used to calculate the initial estimates of µ 1 , µ 2 and σ as follows: Let μ1s , μ2s and σs be the maximum likelihood estimates of µ 1 , µ 2 and σ at step s of this procedure.At step s + 1, each censored observation x i j (where i = 1, 2; j = m i + 1, 2, ..., n i ) is replaced by an estimate of μis − σs W( DL i j − μis σs ).

Maximum Likelihood Estimates under H A3N
The maximum likelihood estimators under the hypothesis H A3N are more complicated to derive than those obtained under H 0N , H A1N and H A2N .In this section only an outline for obtaining estimates of µ, σ 1 and σ 2 for the hypothesis under consideration will be given.Under the hypothesis H A3N x i j are assumed to be normally distributed with mean µ and standard deviation σ i , for i = 1, 2 and j = 1, 2, ..., n i .Thus the likelihood function under H A3N is given by: Hence, the corresponding log-likelihood function ℓ H A3N (µ, σ 1 , σ 2 ) of (3.14) which is defined as log(L H A3N (µ, σ 1 , σ 2 )) is given by: The maximum likelihood estimates μ, σ1 and σ2 of µ, σ 1 and σ 2 are the solutions to equations (3.16)-(3.18).
The expectation maximization (EM) algorithm will be used iteratively to obtain the solutions μ, σ1 and σ2 to the maximum likelihood equations (3.16)-(3.18).At step 0 of the EM algorithm all non-censored observations are used to calculate the initial estimates of µ, σ 1 and σ 2 as follows: Let μs , σ1s and σ2s be the maximum likelihood estimates of µ, σ 1 and σ 2 at step s of this procedure.At step s + 1, each censored observation x i j (where i = 1, 2; j = m i + 1, 2, ..., n i ) is replaced by an estimate of μs − σis W( DL i j − μs σis ).Let the values t i j be calculated at step s + 1 as follows: where W(t), γ(t), ξ 1 j and ξ 2 j were previously defined.
Also define be the sample mean of the updated values in the i th sample for i = 1, 2.

Asymptotic Chi-Square Test
The estimated log-likelihood functions lH 0N , lH A1N , lH A2N and lH A3N under the hypotheses H 0N (overall homogeneity), H A1N (overall heterogeneity), H A2N (mean heterogeneity, variance homogeneity) and H A3N (mean homogeneity, variance heterogeneity), respectively; are obtained by replacing population parameters by their maximum likelihood estimates.Therefore from (3.2), (3.6), (3.10) and (3.15) we get: and ) . (4.4) In general, the asymptotic α−level chi-square test used to test the null hypothesis H 0 : θ = 0 versus the alternative hypothesis H a : θ 0 is defined by where χ 2 0 has a chi-square distribution with degrees of freedom d f , which is defined by the number of free parameters under the alternative hypothesis H a minus the number of free parameters under the null hypothesis H 0 , and χ 2 (α,d f ) is the upper α-point value obtained from the chi-square table with degrees of freedom d f .The asymptotic α−level chi-square test used to test the equality of the means of two independent log-normal populations is now described.
Test 1: Overall Homogeneity versus Overall Heterogeneity The asymptotic α−level chi-square test to test the null hypothesis H 0N versus the alternative hypotheses H A1N is defined by: where χ 2 (α,2) is the upper α−point for a chi-square random variable with 2 degrees of freedom.The p-value of this test statistic is defined by: Thus in this case the null hypothesis that the means of two independent log-normal populations are equal will be rejected if χ 2 0A1 > χ 2 (α,2) or equivalently if p − value < α.Test 2: Overall Homogeneity versus Mean Heterogeneity and Variance Homogeneity The asymptotic α−level chi-square test to test the null hypothesis H 0N versus the alternative hypotheses H A12N is defined by: where χ 2 (α,1) is the upper α−point for a chi-square random variable with 1 degrees of freedom.The p-value of this test statistic is defined by: Thus in this case the null hypothesis that the means of two independent log-normal populations are equal will be rejected if χ 2 0A2 > χ 2 (α,1) or equivalently if p − value < α.Test 3: Overall Homogeneity versus Mean Homogeneity and Variance Heterogeneity The asymptotic α−level chi-square test to test the null hypothesis H 0N versus the alternative hypotheses H A3N is defined by: where χ 2 (α,1) is the upper α−point for a chi-square random variable with 1 degrees of freedom.The p-value of this test statistic is defined by: p − value = P(χ 2 (1) > χ 2 0A3 ).(4.10) Thus in this case the null hypothesis that the means of two independent log-normal populations are equal will be rejected if χ 2 0A3 > χ 2 (α,1) or equivalently if p − value < α.Computer Programs: To facilitate the application of the test procedure and parameter estimation method described in this article, a computer program called "Abou.T wo.Lognormal.Estimation.MultipleDL" is written in the R language to automate parameters estimation from multiply left-censored data sets that are normally or lognormally distributed and to obtain the estimated values of the log-likelihood functions under the null and the alternative hypotheses .In addition, this computer program will be used to obtain the asymptotic α−level chisquare test statistic and its p-value.A Copy of the source code is given in the Appendix section and is available upon request.
For the sake of simplicity, in the remaining part of this article Test 1 (H 0N versus H A1N ) will be considered.Test 2 and Test 3 can be easily programmed and computed.

Example:
Copper and Zinc Data Millard and Deverel (1988) compared the median copper and zinc trace element concentrations in groundwater sampled from two geological areas in the San Joaquin Valley, the Basin-Trough Zone and the Alluvial Fan Zone in California.Data from both sites are given in Table 1.These data sets contains seven distinct detection limits (LDL : Table 1.Millard and Deverel copper and zinc data: Groundwater concentrations of copper and zinc at two geological zones in the San Joaquin valley, California < 1 < 1 3 3 5 1 4 4 2 2 1 2 < 5 11 Alluvial Fan < 1 2 2 2 2 < 20 2 2 3 3 < 20 < 10  Millard and Deverel (1988) give three possible causes for multiple left-censoring when measuring the concentration of copper and/or zinc in shallow groundwater.First cause may be decreasing detection limits over time as measurement devices improve.Second, there may be more than one method available, and each method may be optimal in different ranges of zinc and/or copper concentration.A third cause involves the amount of dilution that a lab technician may use.Table 2 contains estimates of the normal and log-normal population parameters.It is noted that the highest reported concentration of zinc (620) in the Alluvial Fan Zone seems to be unusual data value since the second highest observed zinc concentration is (50) in this zone.Two different estimates of the normal and log-normal population parameters for the zinc data sets are reported.The first estimate includes all data and the second includes all data with the zinc data value 620 removed (ZincW620R).The corresponding estimates of the zinc means µ y i = e µ i + σ 2 i 2 and medians m y i = e µ i are also included in Table 2.The influence of the single large zinc data value 620 can be most clearly seen by comparing the estimates for σ 1 under the hypothesis H A1N .The estimate is σ1 = 0.8035 with the 620 value included and σ1 = 0.5799 with the 620 value removed.The estimates for µ 1 with the 620 value included and with the 620 value removed do not differ appreciably.These estimates for µ 1 are μ1 = 2.4854 and μ1 = 2.4665 for these two cases, respectively.The corresponding estimates of the log-normal median zinc concentration in the Alluvial Fan Zone are My 1 = 12.006 with the 620 value included and My 1 = 11.781 with the 620 value removed.The comparable estimates of the log-normal mean zinc concentration are μy 1 = 16.580 with the 620 value included and μy 1 = 13.922 with the 620 value removed.In addition, the estimates of the log-normal standard deviation for zinc concentration in the Alluvial Fan Zone are σy 1 = 15.792 with the 620 value included and σy 1 = 8.767 with the 620 value removed.The estimates of the log-normal median are similar, but the estimates of the log-normal mean and standard deviation are appreciably different, owing to the influence of the single large zinc data value 620.Table 2 contains the p-value results associated with the application of the recommended asymptotic chi-square test to the the Millard and Deverel (1988) copper and zinc data presented in Table 1.The Millard and Deverel (1988) p-value results using the normal scores permutation variance (NS 2P) procedure are also presented in Table 2.

Simulation Study
In this simulation study, type I error rates and power of the test procedure introduced in this article are investigated.
A computer program was written in the R language for this purpose.For each combination of the population parameters µ 1 , µ 2 , σ 1 and σ 2 described below, two sample size cases were considered: in case one, n 1 = n 1 = 25 and in the second case, n 1 = n 2 = 75.The first case will be referred to as the small sample size case and the second as the large sample size case.Censoring at two different detection limits was used in each simulated sample.
From Table 3, one can see that the estimated simulated Type I error rates are slightly higher than 0.06 (0.0692, 0.0657) for the small sample size case, and slightly higher than 0.05 (0.0586, 0.0507) for the large sample size case.The censoring levels do not seem to affect the value of Type I error rate, α. (0, 0.9, 1, 1.9) 0.9435 0.9168 1.0000 1.0000 (0, 1.0, 1, 2.0) 0.9703 0.9425 1.0000 1.0000 From Table 4, one can see that the estimated simulated power is higher for large sample size case than the small sample size case, and slightly higher for the lower level of censoring.Specifically, in the small sample size case with h 1 = 15%andh 2 = 30% (h 1 = 30%andh 2 = 50%) censoring level we reach a power of 0.9703 (0.9425) when the difference between µ 1 and µ 2 is 1.0, and the difference between σ 1 and σ 2 is 1.0.Alternatively, in the large sample size case with h 1 = 15%andh 2 = 30% (h 1 = 30%andh 2 = 50%) censoring we reach a power above 0.99 when the difference between µ 1 and µ 2 is 0.6, and the difference between σ 1 and σ 2 is 0.7; and a power of 1.0 when the difference between µ ′ s and σ ′ s is 0.8.
In summary, the test procedure introduced in this article maintains its stated significance level and has much power with larger sample size and a bit less power with greater censoring levels.In addition, the power decreases when the censoring levels moves from 0.15and0.30to 0.30and0.50.Also, the power increases greatly when the sample size moves from the order of 25 to the order of 75.

Conclusions and Remarks
It is well known that the log-normal distribution is widely used in modelling environmental and biomedical censored data.This article has dealt with the problem of comparing means of two independent log-normal populations in the presence of multiply left-censored data.The EM Algorithm is employed to obtain the maximum likelihood 3.20)-(3.22)yields a cubic equation in μs+1 which can be , 3 , 5 , 10 , 15 , 20)  with censoring level (percentage of non-detected observations) of 22%.
The solution of equations (3.16)-(3.18)for µ, σ 1 , and σ 2 is more complicated in this case than the cases H 0N , H A1N and H A2N since it involves solution of a cubic equation for µ.Thus at this point a procedure similar to the one used in cases H 0N , H A1N and H A2N is not valid.Let

Table 2 .
Estimates of normal and log-normal parameter values from the copper and zinc data given in Table1The p-value of the asymptotic chi-square test statistic of testing the null hypothesis H 0N versus H A1N or equivalently H 0LN versus H ALN is 0.2285.Therefore the hypothesis of equal means is accepted for copper at significance level of α = 0.05.The reported p-value for equality of medians of Millard and Deverel NS 2P is 0.320.The p-value of the asymptotic chi-square test statistic of testing the null hypothesis H 0N versus H A1N or equivalently H 0LN versus H ALN is 0.1079 with the 620 value included.Therefore with the 620 value included the hypothesis of equal means is accepted for zinc at significance level of α = 0.05.The p-vale of testing the null hypothesis H 0N versus H A1N or equivalently H 0LN versus H ALN is 0.0004 with the 620 value removed.Therefore with the 620 value removed the hypothesis of equal means is rejected for zinc at significance level of α = 0.05.The reported p-value for equality of medians of Millard and Deverel NS 2P is 0.020.
ZincW620R: Alluvial Fan Zone zinc data set with the data value 620 removed.Copper Case: