A Simulation Study of a Parametric Mixture Model of Three Different Distributions to Analyze Heterogeneous Survival Data

In this paper a simulation study of a parametric mixture model of three different distributions is considered to model heterogeneous survival data. Some properties of the proposed parametric mixture of Exponential, Gamma and Weibull are investigated. The Expectation Maximization Algorithm (EM) is implemented to estimate the maximum likelihood estimators of three different postulated parametric mixture model parameters. The simulations are performed by simulating data sampled from a population of three component parametric mixture of three different distributions, and the simulations are repeated 10, 30, 50, 100 and 500 times to investigate the consistency and stability of the EM scheme. The EM Algorithm scheme developed is able to estimate the parameters of the mixture which are very close to the parameters of the postulated model. The repetitions of the simulation give parameters closer and closer to the postulated models, as the number of repetitions increases, with relatively small standard errors.


Introduction
The survival time data analysis is concerned with the analysis of time to occurrence of a particular event of interest.The data are usually related to clinical studies of human or laboratory studies of animal or studies to test the life time of some devices.Historically, nonparametric techniques were used to handle survival data.Parametric distributions are the conventional techniques in statistics; they are very useful if the selected parametric probability distribution fits the data properly.The most frequently used parametric distributions in survival time data analysis includes the Exponential, Gamma, and Weibull among others (Ibrahim, Chen, & Sinha, 2001;Kalbfleisch & Prentice, 2002;Lawless, 2003;Lee & Wang, 2003).In cases of data with heterogeneous structure, mixture distributions are more convenient to handle such data.Recently, a considerable number of authors used mixture model technique to analyse survival time data.Cheng and Fu (1982) proposed a parametric mixture model of Weibull distribution where they employed the weighted least squares method to estimate the parameters.Jiang and Kececioglu (1992a) estimated the parameters of a mixture model of Weibull distribution using graphical approach.They (Jiang & Kececioglu, 1992b) also developed a new procedure to estimate the parameters of a mixture model of Weibull.Zhang (2008) proposed a two-component mixture model of the Weibull-Weibull distribution to model survival time data and investigated the suitability of the model in survival analysis.Also Erisoglu, Erisoglu and Erol (2012) modelled heterogeneous survival time data by a mixture model of Gamma-Gamma, a mixture of Lognormal-Lognormal and a mixture of the Weibull-Weibull distributions, where they investigated the best fit model to real survival time data.A mixture model of mixed distributions was proposed by Ersioglu and Erol (2010), to model heterogeneous survival time data, where they employed a two component mixture model of the Extended Exponential-Geometric (EEG) distribution.In Erisoglu, Erisoglu and Erol (2011), a mixture of two different distributions Exponential-Gamma, Exponential-Weibull and Gamma-Weibull were used to model heterogeneous survival data.
In the case of open-heart surgery, Blackstone, Naftel and Turner Jr. (1986) identified three overlapping phases of death after surgery which could be modelled by a three component parametric mixture model instead of the conventional parametric survival time model, as was pointed out by Ng, Mclachan, Yau, andLee (2004) andPhilips, Coldman, andMcBride (2002).Mixture of different distributions would be appropriate to model a different mode of hazard in heterogeneous survival time data.The Expectation Maximization Algorithm (EM) is effectively used in cases of data with missing of unobserved observations (Dempster, Laird, & Rubin, 1977).The maximum likelihood estimates of the parameters of the survival mixture model are estimated usually via (EM) (Mclachlan & Peel, 2000;Mclachlan & Krishnan, 2008).
The purpose of this paper is to investigate the consistency and stability of EM in estimating the parameters and the appropriateness of a mixture of three different distributions in analysing heterogeneous survival time data.The article is arranged as follows.Section two to discusses survival analysis and some frequently used theoretical distributions and their properties.Section three will be devoted to discussing the mixture model of three different distributions in the survival time analysis.Section four for the implementation of EM scheme to estimate the maximum likelihood estimator of the model.Section five is devoted to simulation, estimation of the parameters of the model and demonstrates the successful convergence of the EM, consistency and stability of the scheme.

Survival Analysis and Functions
Survival analysis deals with the implementation of certain statistical techniques to model and analyze survival time data.The primary interest in such data is the endpoint time when an event of interest occurs.Generally, the events of interest are referred to as failures.They could be; the time to death of a patient, time to learning a new skill, time to exit from unemployment, time to promotion for employees and time to breakdown of some device.The response of primary interest, T is a non-negative random variable representing survival time of an individual and can be described by three important functions.The probability density function (pdf) denoted by ( ) f t , which can be written as Where ( ) F t is the the responsetion function of response variable T. The probability density function can also be presented graphically, the graph of ( ) f t , is known as the density curve.The density function ( ) f t is a nonnegative function and the area between the curve and the t axis is equal to 1.The survival function denoted by ( ) S t , which can be written as that represents the probability that an individual survives beyond time .Note that the survival function is a monotonic decreasing continuous function with   The hazard function which is denoted by ( ) h t , and can be written as representing the probability that an individual fails within a small interval ( , Δ ) , given that the individual survived to the beginning of the interval.The cumulative hazard function of the survival time T is defined by; Therefore, when 0 That is, the cumulative hazard function can assume any value between zero to infinity.The hazard function specifies the instantaneous rate of failure at time given that the individual survived up to time , and sometimes it is known as the instantaneous failure rate, force of mortality, conditional mortality rate, and age-specific failure rate.The hazard function is also presented graphically.These three functions are equivalent if any one of them is known then the two others can be derived (Lee & Wang, 2003).
Parametric statistical techniques are convenient tools in survival analysis; provided that the selected parametric distribution adequately fit the data at hand.In the literature the Exponential, Gamma and Weibull probability distributions are the most frequently used density functions in modelling survival time data (Cheng & Fu, 1982;Erisoglu, & Erol, 2011, 2012).The probability density function ( ) f t , survival functions ( ) S t and hazard functions ( ) h t of these distributions are highlighted below.
x  is known as the incomplete Gamma function.

Parametric Mixture of Three Different Distributions
Mixture models are implemented to analyse survival time data in different situations, because of their flexibility, and they are the best choice in situations where a single parametric distribution may not suffice (Mclachlan & Peel, 2000;Fruhwirth-Schnatter, 2006).A mixture model of three different distributions is considered where it is assumed that it is sampled from a population consisting of three subpopulation or subgroups.The mixture model can be expressed as Where the vector , contains all the unknown parameters in the mixture model.The functions are known as the mixture component density functions for some parameters , In this paper a mixture of three different distributions of Exponential, Gamma and Weibull is proposed to model heterogeneous survival time data, the different distribution takes care of different hazard mode in the heterogeneous data, and the model defined as Where ' i s  are the mixing proportions or mixing probabilities and as defined in ( 5), ( 8) and ( 11), are the probability density functions of Exponential, Gamma and Weibull distributions respectively.

The Expectation Maximization Algorithm (EM) and Parameter Estimation
The EM Algorithm is frequently employed in the literature as an efficient technique to estimate the maximum likelihood estimators of finite mixture models (Mclachlan & Krishnan, 2008).
Let 1 2 , , , n t t t  be a set of observations of n incomplete data and 1 2 3 , , z z z be a set of missing observations, where  , if the observation belongs to the k th component and 0 otherwise for 1, 2,3 k  and 1, , i n   .Here z`s are treated as missing values when applying the EM Algorithm to the mixture distribution.
The EM Algorithm proceeds in two steps, the Expectation step or the E-step and Maximization step or the M-step.
In the E step the i z Variables are considered as missing data, the expectation , ( | ) ki i E z t is obtained to estimate the hidden variable vector The functions     E z t calculated in the E step will be maximized in the M step of the EM Algorithm under the condition


. The Lagrange method can be employed to estimate the mixing probabilities and parameter vector [ , , ] The estimated mixing probabilities are; The maximum likelihood estimator of the parameter for the proposed model can be obtained by the equation ( 22) The maximum likelihood estimators of the parameters and for the proposed model can be estimated from the equations ( 23) and ( 24) respectively and Where is the number of Newton-Raphson iteration within EM Algorithm and Ψ .and Ψ ′ .are a digamma and trigamma functions respectively.
The maximum likelihood estimators of the parameters and for the proposed model can be estimated from the equations ( 25) and ( 26) respectively. Where and r is the number of Newton-Raphson iteration within EM Algorithm.

Simulation
Simulations were performed to investigate the convergence of the proposed EM scheme.Samples of size 400 observations were generated, each of them randomly sampled from three-component survival mixture model of Exponential, Gamma and Weibull.There was no restriction imposed on the number of iterations and convergence was achieved when the differences between successive estimates were less than 10 -4 .Three different postulated models were considered with a different set of parameters.The result of the parameter estimation of the three sets of mixture model is given below:

The First Postulated Mixture Model
The set of parameters of the postulated mixture model is , and the model can be written as Where the density functions f Exp , f Gm and f Wbl are Exponential, Gamma and Weibull densities as defined in ( 5), ( 8) and ( 11).The mixing probabilities for each component are 1 0.1   , 2 0.6   and 3 0.3   .
The result of the parameter estimation, shown in Table 1, shows that the estimated parameters of the model are almost similar to the true postulated parametric mixture model parameters.

The Second Postulated Mixture Model
The set of parameters of the postulated mixture model is , = ( = 0.3, = 0.375, 3, 11, 7, 4, 6.3), and the model can be written as Where the density functions f Exp , f Gm and f Wbl are Exponential, Gamma and Weibull densities as defined in ( 5), ( 8) and ( 11).The mixing probabilities for each component are 0.3, 0.375 and 0.325.
The result of the parameter estimation shown in Table 2 shows that the model estimated parameters are almost similar to the parameters of the true postulated parametric mixture model.
Where the density functions f Exp , f Gm and f Wbl are Exponential, Gamma and Weibull densities as defined in ( 5), ( 8) and ( 11).The mixing probabilities for each component are 0.25, 0.4 and 0.35.
The result of the parameter estimation shown in Table 3 shows that the model estimated parameters are almost similar to the parameters of the postulated parametric mixture model.Likewise, the estimates of the Exponential, Gamma and Weibull distributions get close to the parameters of the postulate model.Note that the shape parameters of the Weibull distribution are estimated exactly as that of the postulated models for all the repetitions.Note: av π, θ , se π, θ are the averages and standard errors of the estimated parameters respectively.
The table's show that the EM scheme converged to the true values of the parameter in 10, 50, 100 and 500 repetitions and that emphasizes the stability of the algorithm in estimating the parameters with different proportion of mixing probabilities.The averages are close to the true values of the parameters and the standard errors are relatively small which suggest that the EM parameter estimates performed consistently.

Conclusions
The paper proposed a mixture model of three different distributions namely, Exponential, Gamma and Weibull to model the heterogeneous survival time data.EM algorithm was employed to estimate the maximum likelihood estimator of the parameter of the parametric mixture model.The convergence of the EM was investigated through the simulations performed.The results revealed that the EM successfully estimated the parameters of the three component mixture model.The mixture model of three different distributions, Exponential, Gamma and Weibull could be successfully applied to model heterogeneous survival time data instead of the conventional parametric models.

Figure 1
Figure 1 displays the comparison between the probability density function of the parametric Exponential, Gamma and Weibull mixture and the probability density functions of each single distribution.The histogram represents the simulated data.As can be seen the mixture model fits the simulated data far better than the single distributions.

Figure 1 .
Figure 1.The density function of three component parametric mixture versus the single distributions of the first postulated model

Figure 2
Figure2displays the comparison between the probability density function of the parametric Exponential, Gamma and Weibull mixture and the probability density functions of each single distribution.Also here it can be observed that the mixture model fit the simulated data far better than the distribution of each component.

Figure 2 .
Figure 2. The density function of three component parametric mixture versus the each single distribution of the second postulated model

Figure 3
Figure3displays the comparison between the probability density function of the parametric Exponential, Gamma and Weibull mixture and the probability density functions of each single distribution.Also here it can be observed that the mixture model fit the simulated data far better than the distribution of each component.

Figure 3 .
Figure 3.The density function of three component parametric mixture versus the each single distribution of the third postulated model

Table 1 .
The result of the simulation of the first postulated model

Table 2 .
The result of the simulation of the second postulated model

Table 3 .
The result of the simulation of the third postulated model

Table 4 .
The result of the repeated simulation of the first postulated model Note: av π, θ , se π, θ are the averages and standard errors of the estimated parameters respectively.

Table 5 .
The result of the simulation of the second postulated model , se π, θ are the averages and standard errors of the estimated parameters respectively.

Table 6 .
The result of the repeated simulation of the third postulated model