Comparison of Two Means of Two Log-Normal Distributions When Data is Singly Censored

It is common in environmental and biomedical data analysis to deal with censored data that are log-normally distributed. This paper is concerned with the statistical analysis for comparing the means of two independent lognormal distributions from censored data with a single detection limit. The method of maximum likelihood will be used to obtain closed form estimates for population parameters under different hypotheses. A test procedure for comparing the means of two independent log-normal populations in the presence of censored data is also introduced and evaluated. Asymptotic chi-square test is used in the proposed test procedure. Worked example is given illustrating the use of the methods provided utilizing a computer program written in the R language. A simulation study was performed to examine the power of the proposed test procedure introduced in this article.


Introduction
The processing of the analytical results of environmental data containing potentially hazardous chemicals is often complicated by the fact that some of these pollutants are present at trace levels which cannot be measured reliably and consequently are reported as results lying numerically below a detection limit, DL.In general, censoring means that observations at one or both tails are not available.Left-censored data commonly arise in environmental contexts.Left-censored data (data reported as less than detection limit) can occur when the substance or attribute being measured is either absent or exists at such low concentrations that the substance is not present above the DL level.Data sets containing left-censored observations are referred to as left-censored data.When more than two distinct detection limits DL 1 , DL 2 , ..., DL k (k ≥ 3) are reported, the data are said to be multiply-left-censored, (USPEA 1989b).It is common to have environmental data contains detection limits.Left censoring frequently arises in environmental studies due to: (1) sometimes nondetect is reported because the measurement lies below a threshold set by the client or laboratory, (2) sometimes the instrumentation registers a low signal, but the chemist decides that "unpollutant" environmental samples could give a similar signal and reports nondetect instead of the observed measurement, (3) sometimes the signal produced by the pollutant is too small for the instrumentation to discriminate from background noise, or (4) sometimes a signal is registered, but certain criteria that identify the compound are not met.A sample for which some observations are known only to fall above a known detection limit, while the remaining observations falling below the detection limit are fully measured and reported is called right censored.This type of data are so common in biomedical studies.In many environmental applications the distribution of variables such as chemical concentration, inhalation, digestion, and consumption rates are positive and skewed to the right.Hence, censored observations occur between zero and DL.In some instances a log transformation can provide a more natural scale to analyze such measurements.Samples to be considered in this paper are those that are Type I single-left-censored. Suppose that a sample of n data points is given of which m data points are non-censored (fully measured), and the remaining m c = n − m observations are left-censored with a single detection limit DL.In such Type I censored samples DL is fixed, whereas m and m c are random.
Nondetect values can cause an especially difficult problem when the goal is to compare two different populations.There has been a great deal of literature on the subject of the statistical inference of the parameters of normal and log-normal populations from both fully measured and censored data.Gupta and Li (2006) developed a score test for testing the equality of the means of two independent log-normal populations from fully measured data.Zhou et al (1997) considered two methods for comparing the means of two independent log-normal non-censored samples.Harris (1991) considered two parametric and two non-parametric methods for testing the equality of medians of two independent log-normal distributions when some data are left-censored.Prentice (1978) developed linear rank tests with right censored data.Millard and Deverel (1988) adapted several existing right censored non-parametric procedure so that they can be used in environmental setting with left-censored data.Methods for the estimation of the log-normal parameters for one-sample cases where there may exist left-censored data are discussed by El-Shaarawi (1989).Stoline (1993) extended results first suggested by Harris (1991) and proposed a procedure for comparing medians of two independent log-normal distributions where some data may be left-censored.Stoline (1993)used the Expectation Maximization (EM) algorithm introduced by Dempster et al. (1977) to calculate the maximum likelihood estimates of population parameters µ and σ.Other suggested methods for estimating population parameters from censored samples are discussed in Marco (2005), Jin et al (2010), Gibbons (1994), Gleit (1985), El-Shaarawi and Esterby (1992), Elshaarawi and Dolan (1989), Gilbert (1987) and Schneider (1986).
The purpose of this paper is to provide a parametric procedure for comparing means of two independent lognormally distributed populations utilizing left-censored data sets.The method of maximum likelihood will be used to obtain estimates of population parameters µ and σ.To facilitate the application of this procedure, a computer program is written in the R language which calculates the maximum likelihood estimates, asymptotic chi-square test statistics and their p-values.A numerical example is given illustrating the use of this procedure utilizing a computer program written in the R language.

Assumptions and Notations
Assume that there exists two random samples of n 1 and n 2 data values: y 11 , y 12 , ..., y 1m 1 , y 1m 1 +1 , ..., y 1n 1 and y 21 , y 22 , ..., y 2m 2 , y 2m 2 +1 , ..., y 2n 2 taken from two independent log-normal populations LN(µ 1 , σ 1 ) and LN(µ 2 , σ 2 ), respectively.Where LN(µ, σ) denotes a log-normally distributed variable y with the probability density function , for y > 0, where −∞ < µ < ∞ and σ > 0. For convenience, for each sample i let us assume that the first m i observations y i1 , y i2 , ..., y im i are non-censored (fully measured) and the remaining m c i = n i − m i observations are left-censored for i = 1, 2. It is assumed that each sample i has a detection limit LDL i , for i = 1, 2, where LDL i is the detection limit in lognormal sample i for i = 1, 2. For left censored observations, it is assumed that for each sample i it is only known that there are m c i observations less than LDL i , for i = 1, 2. That is for sample i, y i j < LDL i for i = 1, 2 and j = 1, 2, ..., m c i .The parameters for the i th log-normal population can be expressed as functions of the parameters µ and σ as: where Let where LDL i is the detection limit in the i th lognormal sample.
To simplify the presentation in this paper, the analysis is described and illustrated by reference to the analysis of normally distributed data, though this condition occurs infrequently in typical environmental data analysis.However, it is frequently necessary to transform real environmental data before analysis; typically the logarithmic transformation of x i j = ln(y i j ) is used, although other transformations are possible.When the logarithmic or other transformation is used prior to censored data set analysis, it is necessary to transform the analysis results back to the original scale of measurement following parameter estimation.For each sample i let be the sample mean and sample variance of the m i non-censored observations x i1 , x i2 , ..., x im i , for i = 1, 2. Let the functions ϕ(.) and Φ(.) be the pd f and cd f of the standard unit normal.Define and We also define The likelihood function of the samples under consideration is given by: The two log-normal population means are confirmed equal whenever the null hypothesis H 0LN : µ y 1 = µ y 2 is accepted in favor of the alternative hypothesis H ALN : µ y 1 µ y 2 or equivalently whenever the null hypothesis and σ 1 σ 2 (mean homogeneity, variance heterogeneity).For the sake of simplicity the test procedure for the null hypothesis H 0N versus the alternative hypothesis H A 1 N will be considered in this paper.The method of maximum likelihood will be used to obtain the maximum likelihood estimates of population parameters under the hypotheses H 0N and H A 1 N .

Maximum Likelihood Estimates of Population Parameters
In this section the maximum likelihood estimates of population parameters µ i and σ i , for i = 1 and 2, are derived under each of the hypotheses H 0N and H A 1 N .The derivation of these estimates is now described.

Maximum Likelihood Estimates under H 0 N
Under the hypothesis H 0N , x i j , for i = 1, 2 and j = 1, 2, ..., n i , are assumed to be normally distributed with mean µ and standard deviation σ.That is, it is assumed that there exists a random sample of n = n 1 + n 2 data values taken from a normal population with mean µ and standard deviation σ.For convenience, let us assume that the first m = m 1 + m 2 observations are non-censored (fully measured) and the remaining m c = m c 1 + m c 2 = n − m observations are left-censored.For left censored observations, it is only known that m c 1 observations are reported as less than DL 1 and m c 2 observations are reported as less than DL 2 .
The maximum likelihood estimates for μ and σ of µ and σ are the solutions to equations (3.3) and (3.4), the partial derivatives for the log-likelihood equation with respect to µ and σ: The expectation maximization (EM) algorithm will be used iteratively to obtain the solutions μ and σ to the maximum likelihood equations (3.3) and (3.4).The EM algorithm was proposed by Dempster et. al. (1977) for calculating the maximum likelihood estimated from censored samples.The procedure consists of alternately estimating the censored observations from the current parameter estimates and estimating the parameters from the actual and estimated observations.The EM algorithm can be used to calculate the maximum likelihood estimates for the mean µ and standard deviation σ of a normal distribution from both singly-and multiply-censored samples.
A brief description for the EM algorithm is given here.
At step 0 of the EM algorithm all non-censored observations are used to calculate the initial estimates of µ and σ as follows: Let μs and σs be the maximum likelihood estimates of µ and σ at step s of this procedure.At step s + 1, each censored observation x i j (where i=1,2; j=1,2,..., m c i ) is replaced by an estimate of μs − σs W( x i j − μs σs ).Let the values u i j be calculated at step s + 1 as follows: for non-censored data values μs − σs W( for censored data values So the updated estimates μs+1 and σs+1 of µ and σ are given by where the function γ(t) is defined as: More details about the EM algorithm procedure can be found in Wolynetz (1979).Convergence is achieved if both | μs − μs+1 | < 0.00001 and | σs − σs+1 | < 0.00001 occur.When these convergence criteria are met, the maximum likelihood estimates for µ and σ are then given by μ = μs and σ = σs , respectively.
Case 2: In this case it is assumed that DL 1 = DL 2 = DL.The likelihood function L H 0N (µ, σ) under H 0N determined by the pooled sample of n = n 1 + n 2 observations of which m c = m c 1 + m c 2 are left censored and share the same detection limit DL, is given by: Hence, the corresponding log-likelihood function of (3.10) is given by: The maximum likelihood estimates for μ and σ of µ and σ are the solutions to equations (3.5) and (3.6), the partial derivatives for the log-likelihood equation with respect to µ and σ: By solving equations (3.7) and (3.8) for µ and σ lead to the following maximum likelihood estimating equations: where To obtain the desired maximum likelihood estimates of µ and σ from (3.9)-(3.12), it is necessary to estimate the auxiliary function λ = λ(ξ, h).To obtain the estimate value λ of λ, it is necessary to solve the rather complex nonlinear estimating equation (3.11) for the estimate ξ of ξ.Because of the difficulty of solving this equation explicitly for ξ, an iterative method proposed by Aboueissa and Stoline (2004) will be used for obtaining the maximum likelihood estimates μ0 and σ0 of µ and σ.Let ξ be the estimate of ξ.Thus the maximum likelihood estimates μ0 and σ0 of µ and σ are given by: μ0 Alternatively, the EM algorithm estimation procedure presented in case 1 can be used to obtain the maximum likelihood estimates μ and σ for µ and σ using equations (3.7) and (3.8).

Maximum Likelihood Estimates under H A 1N
Under the hypothesis H A1N x i j are assumed to be normally distributed with mean µ i and standard deviation σ i , for i = 1, 2 and j = 1, 2, ..., n i .Thus the likelihood function under H A1N is given by: Hence, the corresponding log-likelihood function of (3.15) is given by: The maximum likelihood estimates for μi and σi of µ i and σ i are the solutions to equations (3.17) and (3.18) for i = 1, 2. ∂ℓ H 1N (µ i , σ i ) By solving equations (3.17) and (3.18) for µ i and σ i lead to the following maximum likelihood estimating equations: and where and for i = 1, 2, where h i = m c i n i .To obtain the desired maximum likelihood estimates of µ 1 and σ 1 from (3.19)-(3.22), it is necessary to estimate the auxiliary functions λ i (ξ i , h i ).To obtain the estimate value λi of λ i , it is necessary to solve the rather complex non-linear estimating equation (3.21) for estimates ξi of ξ i for i = 1, 2. Because of the difficulty of solving this equation explicitly for ξ i , an iterative method proposed by Aboueissa and Stoline (2004) will be used for obtaining the maximum likelihood estimates μi and σi of µ i and σ i , for i = 1, 2. Let ξi be the estimates of ξ i for i = 1, 2. Thus the maximum likelihood estimates μi and σi of µ i and σ i under the hypothesis H A1N are given by: μi where λi = λ i ( ξi , h i ) .
Alternatively, the EM algorithm estimation procedure presented above can be used to obtain the maximum likelihood estimates μ1 and σ1 for µ 1 and σ 1 using equations (3.17) and (3.18) for i = 1.Similarly, the maximum likelihood estimates μ2 and σ2 for µ 2 and σ 2 using equations (3.17) and (3.18) for i = 2.

Asymptotic Chi-Square Test
The estimated log-likelihood functions lH 0N and lH A1N under the hypotheses H 0N (Case 1 and Case 2)and H A 1 N , respectively; are obtained by replacing population parameters by their maximum likelihood estimates.Therefore from (3.2), (3.6) and (3.16) we get: Asymptotic α−level chi-square test will be used to test the equality of the means of two independent log-normal populations.Asymptotic α−level chi-square test to test the null hypothesis H 0LN : µ y 1 = µ y 2 versus the alternative H ALN : µ y 1 µ y 2 or equivalently to test the null hypothesis H 0N : µ 1 = µ 2 = µ and σ 1 = σ 2 = σ versus the alternative hypotheses H A 1 N : µ 1 µ 2 and σ 1 σ 2 is defined by: where χ 2 (α,2) is the upper α−point for a chi-square random variable with 2 degrees of freedom.The p-value of this test statistic is defined by: p − value = P(χ 2 (2) > χ 2 0 ).(4.5) Thus the null hypothesis that the means of two independent log-normal populations are equal will be rejected if χ 2 0 > χ 2 (α,2) or equivalently if p − value < α.Computer Programs: To facilitate the application of parameter estimation method described in this article, a computer program called "Abou.T wo.Lognormal.Estimates" is written in the R language to automate parameters estimation from left-censored data sets that are normally or log-normally distributed and to obtain the estimated values of the log-likelihood functions under the hypotheses H 0N and H A 1 N .In addition, this computer program will be used to obtain the asymptotic α−level chi-square test statistic and its p-value.Copy of source code is given in the Appendix section.Millard and Deverel (1988) compared the median copper and zinc trace element concentrations in groundwater sampled from two geological areas in the San Joaquin Valley, the Basin-Trough Zone and the Alluvial Fan Zone in California.Data from both sites are given in Table 1.
A third cause involves the amount of dilution that a lab technician may use.In this article it is assumed that both data sets are singly left-censored, thus to utilize the estimation methods described here, the left-censored observations within each data set will be set to equal to the average of detection limits.Table 2 contains estimates of the normal and log-normal population parameters.It is noted that the highest reported concentration of zinc (620) in the Alluvial Fan Zone seems to be unusual data value since the second highest observed zinc concentration is (50) in this zone.Two different estimates of the normal and log-normal population parameters for the zinc data sets are reported.The first estimate includes all data and the second includes all data with the zinc data value 620 removed (ZincW620R).The corresponding estimates of the zinc means µ y i = e µ i + σ 2 i 2 and medians m y i = e µ i are also included in Table 2.The influence of the single large zinc data value 620 can be most clearly seen by comparing the estimates for σ 1 under the hypothesis H A1N .The estimate is σ1 = 0.887 with the 620 value included and σ1 = 0.674 with the 620 value removed.The estimates for µ 1 with the 620 value included and with the 620 value removed do not differ appreciably.These estimates for µ 1 are μ1 = 2.405 and μ1 = 2.382 for these two cases, respectively.The corresponding estimates of the log-normal median zinc concentration in the Alluvial Fan Zone are my 1 = 11.078 with the 620 value included and my 1 = 10.827 with the 620 value removed.The comparable estimates of the log-normal mean zinc concentration are μy 1 = 16.418 with the 620 value included and μy 1 = 13.587 with the 620 value removed.In addition, the estimates of the log-normal standard deviation for zinc concentration in the Alluvial Fan Zone are σy 1 = 14.7580 with the 620 value included and σy 1 = 10.343 with the 620 value removed.The estimates of the log-normal median are similar, but the estimates of the log-normal mean and standard deviation are appreciably different, owing to the influence of the single large zinc data value 620.Table 2 contains the p-value results associated with the application of the recommended asymptotic chi-square test to the the Millard and Deverel (1988) copper and zinc data presented in Table 1.The Millard and Deverel (1988) p-value results using the normal scores permutation variance (NS 2P) procedure are also presented in Table 2.

Simulation Study
In this simulation study, type I error rates and power of the test procedure introduced in this article are investigated.
A computer program was written in the R language for this purpose.For each combination of the population parameters µ 1 , µ 2 , σ 1 and σ 2 described below, two sample size cases were considered: in case one, n 1 = n 1 = 25 and in the second case, n 1 = n 2 = 75.The first case will be referred to as the small sample size case and the second as the large sample size case.Censoring at two different detection limits was used for each case.The simulation study was performed with 10,000 repetitions (N = 10, 000) of sample normal distributions for each combinations of n, µ 1 , µ 2 , σ 1 , σ 2 , and censoring level(s).Censoring levels were set at the 15th and 30th percentiles of the parent distribution(s).In order to check the Type I error, the population parameters were specified as µ 1 = µ 2 = 0, and σ 1 = σ 2 = 1 as shown in Table 3.In order to check the power, the population parameters were specified as µ 1 = 0, µ 2 = 0.1(0.1)1.0,σ 1 = 1 , and σ 2 = 1.0(0.1)2.0 as shown in Table 4.
Table 3.The Estimated Simulated Type I Error Rates: The following observations and conclusions are made from an examination of the simulation results reported in Tables 3 and 4.
From Table 3, one can see that the estimated simulated Type I error rates are slightly higher than 0.06 (0.067, 0.063) for the small sample size case, and slightly higher than 0.05 (0.054, 0.056) for the large sample size case.The censoring levels do not seem to affect the value of Type I error rate, α. 1.0000 (0, 0.9, 1, 1.9) 0.9628 0.9624 1.0000 1.0000 (0, 1.0, 1, 2.0) 0.9893 0.9758 1.0000 1.0000 From Table 4, one can see that the estimated simulated power is higher for large sample size case than the small sample size case, and slightly higher for the lower level of censoring.Specifically, in the small sample size case with 15% (30%) censoring level we reach a power of 0.9893 (0.9758) when the difference between µ 1 and µ 2 is 1.0, and the difference between σ 1 and σ 2 is 1.0.Alternatively, in the large sample size case with 15% (30%) censoring we reach a power above 0.99 when the difference between µ 1 and µ 2 is 0.6, and the difference between σ 1 and σ 2 is 0.6; and a power of 1.0 when the difference between µ ′ s and σ ′ s is 0.7.
In summary, the test procedure introduced in this article maintains its stated significance level and has much power with larger sample size and a bit less power with greater censoring levels.In addition, the power decreases when the censoring level moves from 0.15 to 0.30.Also, the power increases greatly when the sample size moves from the order of 25 to the order of 75.

Conclusions and Remarks
It is well known that the log-normal distribution is widely used in modeling environmental and biomedical censored data.This article has dealt with the problem of comparing means of two independent log-normal populations in the presence of singly left-censored data.The EM Algorithm is employed to obtain the maximum likelihood estimates of population parameters under different hypotheses.A parametric test procedure for testing the equality of means of two independent log-normal in the presence of censored data with single detection limit is presented.The performance of the test procedure presented in this article is evaluated by means of simulation studies.A detailed case study example of the method is provided using copper and zinc data presented in Millard and Deverel (1998).It is seen in the analysis of the Millard and Deverel (1998) data as shown in the study case example that large (unusual) data values do influence the estimate of the mean, but do not influence the estimate of the median in log-normal parametric model analysis.The nonparametric median comparison methods are not as sensitive to these unusual data values.I hope that my paper would be useful to researchers using the log-normal distribution in their analysis of the censored data.

Table 2 .
Estimates of normal and log-normal parameter values from the copper and zinc data given in Table1ZincW620R: Alluvial Fan Zone zinc data set with the data value 620 removed.The p-value of the asymptotic chi-square test statistic of testing the null hypothesis H 0N versus H A1N or equivalently H 0LN versus H ALN is 0.2204.Therefore the hypothesis of equal means is accepted for copper at significance level of α = 0.05.The reported p-value for equality of medians of Millard and Deverel NS 2P is 0.320.The p-value of the asymptotic chi-square test statistic of testing the null hypothesis H 0N versus H A1N or equivalently H 0LN versus H ALN is 0.0983 with the 620 value included.Therefore with the 620 value included the hypothesis of equal means is accepted for zinc at significance level of α = 0.05.The p-vale of testing the null hypothesis H 0N versus H A1N or equivalently H 0LN versus H ALN is 0.0029 with the 620 value removed.Therefore with the 620 value removed the hypothesis of equal means is rejected for zinc at significance level of α = 0.05.The reported p-value for equality of medians of Millard and Deverel NS 2P is 0.020.

Table 4 .
The Estimated Simulated Power Rates