Adjusted Adaptive LASSO in High-Dimensional Poisson Regression Model

The LASSO has been widely studied and used in many applications, but it does not possess the oracle properties. Given a consistent initial estimate of the parameter vector, the adaptive LASSO does possess them: it is consistent in variable selection and asymptotically normal in coefficient estimation. In the Poisson regression model, the usual adaptive LASSO based on maximum likelihood coefficient estimators can perform very poorly when there is multicollinearity. In this study, we propose an adjustment of the adaptive LASSO that takes into account the maximum likelihood standard errors of the coefficient estimates. The performance of the adjusted adaptive LASSO is demonstrated through simulation and real data. The results show that the adjusted adaptive LASSO has an advantage in both prediction and variable selection over other existing adaptive penalized methods when the explanatory variables are highly correlated. Hence we conclude that the adjusted adaptive LASSO is a reliable adaptive penalized method for the Poisson regression model.


Introduction
With the advancement of technology, massive amounts of data with increasing dimensions have been generated in many areas, such as genetics, medicine, economics, and the social sciences. The data expand in two dimensions: the number of variables and the number of observations. Such high-dimensional data pose new challenges to statistical analysis, because many classical statistical methods do not automatically apply to these data sets. For example, the curse of dimensionality makes many classical regression models, such as the Poisson regression model, ineffective: the statistical issues associated with modeling high-dimensional data include model overfitting, estimation instability, and computational difficulty (Pourahmadi, 2013).
How to reduce dimensionality has been an important research question in statistical applications. One way to handle high-dimensional data is to perform data reduction. To this end, various penalized methods have been proposed, beginning with the ridge penalty (Hoerl & Kennard, 1970), which estimates the regression coefficients under an $\ell_2$-norm penalty. It is well known that ridge regression shrinks the coefficients of correlated predictor variables toward each other, allowing them to borrow strength from each other (Friedman, Hastie, & Tibshirani, 2010). The LASSO (least absolute shrinkage and selection operator), introduced by Tibshirani (1996), is another frequently used penalized method. The LASSO imposes an $\ell_1$-norm penalty on the residual sum of squares. Because of the $\ell_1$-norm property, the LASSO can perform variable selection by setting some explanatory variable coefficients to zero. For this reason, the LASSO is popular for high-dimensional data. Despite its advantages, the LASSO has some shortcomings. First, it cannot select more explanatory variables than the number of observations. Second, when there is multicollinearity, the LASSO tends to select only one variable from a group of correlated variables. Last, the LASSO does not have the oracle properties, which state that the probability of selecting the right set of explanatory variables (those with nonzero coefficients) converges to one and that the estimators of the nonzero coefficients are asymptotically normal with the same means and covariances as if the zero coefficients were known in advance. To overcome the first two limitations, Zou & Hastie (2005) proposed the elastic net penalty, which is a linear combination of the $\ell_1$-norm and $\ell_2$-norm penalties.
Regarding the last limitation, Fan & Li (2001) showed that the LASSO does not enjoy the oracle properties because its model selection can be inconsistent, that is, it cannot guarantee consistent identification of the true model. In addition, its estimators cannot be as efficient as the oracle. To overcome this problem, Zou (2006) proposed the adaptive LASSO (ALASSO), in which adaptive weights are used to penalize different coefficients in the $\ell_1$-norm penalty. He proved that, if the weights are small for the important explanatory variables and large for the unimportant ones, the adaptive LASSO selects the model consistently. Zou (2006) also showed that the ALASSO may fail to outperform the LASSO in terms of prediction when the ordinary least squares (OLS) estimators are highly variable, which happens when the explanatory variables are highly correlated. Therefore, Wang et al. (2011) introduced the random LASSO, a novel extension of the LASSO based on the bootstrap idea, to handle multicollinearity; the bootstrap estimates are used as initial weights in the second stage of their method. Lu & Fine (2012) studied the adaptive LASSO under several model misspecifications in GLMs, where the OLS estimates were used as initial weights. Lian (2012) proposed using the LASSO estimates as initial weights. Sampson et al. (2013) studied how to control the false positive rate in the adaptive LASSO, again with the OLS estimates as initial weights. Recently, Qian & Yang (2013) proposed an adjusted adaptive LASSO based on standard errors: to deal with multicollinearity, they use the ratio of the standard error of the OLS estimator to the OLS coefficient as the initial weight. Furthermore, Zeng et al. (2013) proposed using the adaptive LASSO in count regression models, in particular for zero-inflated count data, with the maximum likelihood estimators as initial weights in low-dimensional data.
In this study, the ratio of the standard error of the maximum likelihood (ML) estimator to the ML estimator is proposed as the initial weight in the adaptive LASSO (RALASSO). Simulation and real data results show that the proposed method can perform better than both the LASSO and the adaptive LASSO based on the maximum likelihood estimator. The remainder of this paper is organized as follows. Section 2 covers the penalized Poisson regression methods. The ALASSO and RALASSO are described in Sections 3 and 4, respectively. Sections 5 and 6 are devoted to the simulation study and its results. The real data analysis is covered in Section 7. We end the paper with a conclusion in Section 8.

Penalized Poisson Regression Model
Poisson regression models have received much attention in the econometrics and medical literature as models for count data, which take integer values corresponding to the number of events occurring in a given interval. The Poisson regression model is the most basic such model, in which the mean of the distribution is a function of the explanatory variables. Its defining characteristic is that the conditional mean of the outcome equals the conditional variance (Algamal, 2012; Zeng et al., 2013). Penalization, a procedure commonly used for variable selection in high-dimensional data, attaches a penalty term $P_\lambda(\beta)$ to the log-likelihood function in order to obtain a better estimate of the prediction error by avoiding overfitting. Recently, there has been growing interest in applying penalization to Poisson regression models. Friedman, Hastie, and Tibshirani (2010) developed an efficient algorithm for estimating generalized linear models, including Poisson regression, with a convex penalty. Fan & Lv (2011) developed methodology for nonconcave penalization in generalized linear models. Hossain & Ahmed (2012) proposed a Stein-type shrinkage estimator for the parameters of the Poisson regression model. Wang et al. (2014) proposed a combination of the minimax concave and ridge penalties and a combination of the smoothly clipped absolute deviation and ridge penalties.
In the Poisson regression model, the number of events $y_i$ has a Poisson distribution with a conditional mean that depends on individual characteristics according to the structural model
$$P(y_i \mid x_i) = \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots \quad (1)$$
and the conditional mean parameter
$$\mu_i = \exp(x_i^T \beta). \quad (2)$$
Under the assumption of independent observations, the log-likelihood function is given by
$$\ell(\beta) = \sum_{i=1}^{n} \left\{ y_i x_i^T \beta - \exp(x_i^T \beta) - \ln y_i! \right\}. \quad (3)$$
The penalized Poisson regression (PPR) is defined as
$$\mathrm{PPR} = -\sum_{i=1}^{n} \left\{ y_i x_i^T \beta - \exp(x_i^T \beta) - \ln y_i! \right\} + \lambda P(\beta), \quad (4)$$
where $\lambda \geq 0$ is a tuning parameter. It controls the strength of the shrinkage of the coefficients: the larger the value of $\lambda$, the more weight is given to the penalty term. Since the value of $\lambda$ depends on the data, it can be chosen by cross-validation (Y. Fan & Tang, 2013; James, Witten, Hastie, & Tibshirani, 2013). Before solving the PPR, it is worthwhile to center $y$ and standardize each $x_j$, so that $\sum_{i=1}^{n} y_i = 0$, $\sum_{i=1}^{n} x_{ij} = 0$, and $n^{-1}\sum_{i=1}^{n} x_{ij}^2 = 1$; the intercept ($\beta_0$) is then equal to zero. The LASSO for the Poisson regression model was originally proposed by Park & Hastie (2007). This technique is in some sense similar to ridge regression, but it can shrink some coefficients exactly to zero and can therefore perform variable selection. The LASSO estimates the coefficients by minimizing the negative log-likelihood subject to the constraint that the sum of the absolute values of the model coefficients is bounded above by some positive number. The LASSO estimator is
$$\hat{\beta}_{LASSO} = \arg\min_{\beta} \left[ -\sum_{i=1}^{n} \left\{ y_i x_i^T \beta - \exp(x_i^T \beta) - \ln y_i! \right\} + \lambda \sum_{j=1}^{p} |\beta_j| \right], \quad (5)$$
where $\lambda \geq 0$ is the tuning parameter. For large values of $\lambda$, Eq. (5) produces shrunken estimates of $\beta$ and sets some coefficients exactly to zero. The LASSO can be solved efficiently by several methods, such as the least angle regression algorithm (LARS) (Efron, Hastie, Johnstone, & Tibshirani, 2004) and the coordinate descent algorithm (Friedman et al., 2010).
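To make the penalized objective concrete, the following minimal Python sketch evaluates the negative Poisson log-likelihood and the LASSO-penalized objective for a given coefficient vector. It assumes a log link with no intercept (as after the centering described above); the function names and the toy data are hypothetical and serve only as an illustration.

```python
import math

def poisson_neg_loglik(beta, X, y):
    """Negative Poisson log-likelihood with mu_i = exp(x_i' beta), no intercept."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor x_i' beta
        # ln(y_i!) is computed as lgamma(y_i + 1)
        total -= y_i * eta - math.exp(eta) - math.lgamma(y_i + 1)
    return total

def lasso_objective(beta, X, y, lam):
    """Negative log-likelihood plus lam times the l1 norm of beta."""
    return poisson_neg_loglik(beta, X, y) + lam * sum(abs(b) for b in beta)

# Hypothetical toy data: 3 observations, 2 explanatory variables.
X = [[1.0, -1.0], [0.0, 1.0], [-1.0, 0.0]]
y = [0, 1, 2]

obj_at_zero = lasso_objective([0.0, 0.0], X, y, lam=0.5)
obj_at_beta = lasso_objective([0.2, -0.1], X, y, lam=0.5)
```

A larger `lam` increases the objective for any nonzero `beta` while leaving its value at `beta = 0` unchanged, which is why larger tuning parameters push more coefficients to zero.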
Compared to the classical variable selection methods, the LASSO has two advantages. First, the selection process in the LASSO is continuous, which makes the selection more stable than subset selection. Second, the LASSO is computationally feasible in high-dimensional generalized linear models (GLMs). On the other hand, the LASSO has three main drawbacks. First, it selects at most $n$ explanatory variables because of the nature of the convex optimization problem. Second, it cannot handle multicollinearity: when the pairwise correlations among a group of explanatory variables are very high, the LASSO tends to select only one explanatory variable from the group, regardless of which one. Last, the LASSO lacks the oracle properties stated in Fan & Li (2001). The elastic net is a penalized method for variable selection introduced by Zou & Hastie (2005) to deal with the first two drawbacks of the LASSO. It merges the $\ell_2$-norm and $\ell_1$-norm penalties, using the ridge penalty to deal with high correlation while retaining the variable selection property of the LASSO penalty.
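As an illustration of how the two penalties combine, the elastic net criterion for the Poisson model can be sketched as follows. This is written by analogy with the LASSO estimator; the symbols $\lambda_1$ and $\lambda_2$ (separate tuning parameters for the two penalties) are assumed notation, and $\ell(\beta)$ denotes the Poisson log-likelihood.

$$\hat{\beta}_{EN} = \arg\min_{\beta} \left[ -\ell(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \right], \quad \lambda_1 \geq 0,\ \lambda_2 \geq 0.$$

Setting $\lambda_2 = 0$ recovers the LASSO, and setting $\lambda_1 = 0$ recovers the ridge penalty.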

Adaptive LASSO
According to Fan & Li (2001), a good penalty term should result in an estimator with three properties: unbiasedness, sparsity, and continuity. Unbiasedness means that the resulting estimator does not over-penalize large parameters, which avoids unnecessary modeling bias. Sparsity means that the resulting estimator automatically sets insignificant parameters to zero. Continuity means that the resulting estimator is continuous in the data, which avoids instability in model prediction. Using the language of Fan & Li (2001), a penalized procedure is said to enjoy the oracle properties if its estimator $\hat{\beta}$ satisfies the following, where $\mathcal{A} = \{ j : \beta_j \neq 0 \}$ denotes the true subset model: a) it identifies the right subset model, i.e., $\{ j : \hat{\beta}_j \neq 0 \} = \{ j : \beta_j \neq 0 \}$ with probability tending to one; b) it is asymptotically normal, i.e., $\sqrt{n}\,(\hat{\beta}_{\mathcal{A}} - \beta_{\mathcal{A}}) \xrightarrow{d} N(0, \Sigma^{*})$, where $\Sigma^{*}$ represents the covariance matrix when the true subset model is known.
One of the main reasons why the LASSO is not consistent, i.e., lacks the oracle properties (J. Fan & Li, 2001), is that it penalizes all regression coefficients equally, which over-penalizes the large coefficients of the relevant explanatory variables and leads to a biased estimator. To alleviate this drawback, Zou (2006) proposed the adaptive LASSO, in which adaptive weights are used to penalize different coefficients in the $\ell_1$-norm penalty. The basic idea behind the ALASSO is that, by assigning a higher weight to small coefficients and a lower weight to large coefficients, it is possible to reduce the selection bias and therefore to select the model consistently. Furthermore, the ALASSO solution is continuous by definition, and the ALASSO enjoys the oracle properties. The penalized likelihood estimator using the adaptive LASSO is defined by
$$\hat{\beta}_{ALASSO} = \arg\min_{\beta} \left[ -\sum_{i=1}^{n} \left\{ y_i x_i^T \beta - \exp(x_i^T \beta) - \ln y_i! \right\} + \lambda \sum_{j=1}^{p} w_j |\beta_j| \right], \quad w_j = \left( |\hat{\beta}_j| \right)^{-\gamma}, \quad (6)$$
where $\hat{\beta}_j$ is an initial estimate of the $j$th coefficient and $\gamma$ is a positive constant, usually set equal to one. Eq. (6) also reduces the variance of the conditional mean of the response, and for a fixed value of $n$, larger values of $\lambda$ lead to higher bias and less variability in the predicted values (Hui, Warton, & Foster, 2014).
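The effect of the weights can be seen in a simplified one-dimensional setting. The display below is only an illustration under the assumption of a quadratic loss in a single coefficient $\beta$ with unpenalized solution $z$ and weight $w$; it is not the Poisson objective itself.

$$\hat{\beta} = \arg\min_{\beta} \left\{ \tfrac{1}{2}(z - \beta)^2 + \lambda w |\beta| \right\} = \operatorname{sign}(z)\,\max\left( |z| - \lambda w,\ 0 \right).$$

A coefficient with a large weight $w$ (a small initial estimate) is set to zero unless $|z| > \lambda w$, while a coefficient with a small weight (a large initial estimate) is shrunk by only $\lambda w$, which limits the bias on large coefficients.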

RALASSO
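The actual variable selection performance of the ALASSO depends on the choice of weights. Zou (2006) used the OLS estimator to build the initial weights, i.e., $w_j = (|\hat{\beta}_j^{OLS}|)^{-\gamma}$. In the same manner, when other GLMs are used, the initial weights are $w_j = (|\hat{\beta}_j^{ML}|)^{-\gamma}$. These weights are not appropriate when there is multicollinearity. For this reason, the ratio of the standard error of the ML estimator to the ML estimator is proposed as the initial weight in the adaptive LASSO. The advantage of using the standard error of the ML estimator, $s_{\hat{\beta}^{ML}}$, is that it adjusts the adaptive LASSO when ML estimates are used as initial values.
Let $\hat{\beta}^{ML}$ be the vector of ML estimates and $s_{\hat{\beta}_j^{ML}}$ the standard error of its $j$th component. With the weights $w_j = s_{\hat{\beta}_j^{ML}} / |\hat{\beta}_j^{ML}|$, the RALASSO estimator is $$\hat{\beta}_{RALASSO} = \arg\min_{\beta} \left[ -\sum_{i=1}^{n} \left\{ y_i x_i^T \beta - \exp(x_i^T \beta) - \ln y_i! \right\} + \lambda \sum_{j=1}^{p} w_j |\beta_j| \right].$$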
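The difference between the two weighting schemes can be illustrated with a short Python sketch. The ML estimates and standard errors below are hypothetical numbers chosen for illustration, not output from any fitted model, and the function names are assumptions. The sketch also assumes that no initial estimate is exactly zero.

```python
def alasso_weights(beta_ml, gamma=1.0):
    """Usual adaptive LASSO weights: w_j = 1 / |beta_j|^gamma."""
    return [1.0 / abs(b) ** gamma for b in beta_ml]

def ralasso_weights(beta_ml, se_ml):
    """Ratio weights: w_j = se_j / |beta_j| (standard error over estimate)."""
    return [s / abs(b) for b, s in zip(beta_ml, se_ml)]

# Hypothetical ML output: the first two coefficients have the same estimate,
# but the second has an inflated standard error, as can happen under
# multicollinearity. The third coefficient is close to zero.
beta_ml = [0.8, 0.8, 0.05]
se_ml = [0.1, 0.6, 0.1]

w_a = alasso_weights(beta_ml)
w_r = ralasso_weights(beta_ml, se_ml)
```

In this toy example the first two coefficients receive identical ALASSO weights, whereas the ratio weight of the second is about six times that of the first, so the imprecisely estimated coefficient is penalized more heavily.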

Simulation Study
In this section, simulation studies are used to investigate the performance of the RALASSO and to compare it with the LASSO and ALASSO. In all simulations the response variable was generated from a Poisson distribution whose conditional mean is determined by the linear predictor $x_i^T \beta$. All simulation scenarios are replicated 100 times. For every scenario and in each replication we generate training, validation, and testing data. The training data are used for model fitting, the validation data for determining the tuning parameters, and the testing data for evaluating the penalization methods. For each scenario, the numbers of observations in the corresponding data sets are denoted by training/validation/testing. Based on the simulated data, we use four metrics to evaluate all penalization methods studied in this paper: the mean-squared errors for the training data set ($\text{MSE}_{\text{train}}$) and the testing data set ($\text{MSE}_{\text{test}}$); hits, the number of correctly identified true variables; and false positives (FP), the number of zero variables wrongly selected as true variables.
Since we investigate a penalization method in the presence of both variable selection and multicollinearity, we use three simulations with different values of the correlation and different numbers of training, validation, and testing observations. In addition, we use different numbers of variables.
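The evaluation metrics described above can be sketched in a few lines of Python. This is a minimal illustration: the function names, the zero tolerance `tol`, and the convention of computing the MSE between observed and predicted values are assumptions, since the text does not spell them out.

```python
from statistics import median

def selection_metrics(beta_hat, beta_true, tol=1e-8):
    """Return (hits, false_positives) for an estimated coefficient vector."""
    selected = {j for j, b in enumerate(beta_hat) if abs(b) > tol}
    truth = {j for j, b in enumerate(beta_true) if b != 0}
    hits = len(selected & truth)        # true variables that were selected
    false_pos = len(selected - truth)   # zero variables that were selected
    return hits, false_pos

def mse(y, y_pred):
    """Mean-squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)

# Hypothetical single replication with 4 variables, 2 of them truly nonzero.
hits, fp = selection_metrics([0.5, 0.0, -0.2, 0.1], [1.0, 0.0, 0.5, 0.0])

# Medians over replications would be taken like this (hypothetical values).
median_mse = median([3.1, 3.4, 3.2])
```

Here the estimate selects both true variables and one zero variable, so `hits` is 2 and `fp` is 1.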

Simulation Results
To examine the performance of the RALASSO penalty, we compare it with two well-known penalized methods, the LASSO and ALASSO. The $\text{MSE}_{\text{train}}$ and $\text{MSE}_{\text{test}}$ are used as the evaluation criteria. The reduction in $\text{MSE}_{\text{test}}$ relative to the ALASSO is usually substantial: in simulations 1, 2, and 3 it is 2.701%, 9.230%, and 15.062%, respectively. Moreover, the RALASSO performs well in terms of both $\text{MSE}_{\text{train}}$ and $\text{MSE}_{\text{test}}$ when multicollinearity is present. The simulation results also show that the LASSO ranks last, which is due to its limitation when there are grouping effects among variables. Regarding variable selection accuracy, a penalization method should include all important (non-zero) variables; hits and FP were used to measure the performance of the RALASSO, ALASSO, and LASSO in selecting the non-zero variables. From Table 1, the RALASSO succeeds in selecting the true non-zero variables in most cases in terms of hits. For example, in simulation 1 the RALASSO selects all ten non-zero variables. Moreover, as the correlation coefficient varies from small to medium to high, both the LASSO and ALASSO select fewer non-zero variables than the RALASSO. Such a result is expected because the LASSO and ALASSO are limited in the presence of grouping effects. In terms of FP, the RALASSO selects fewer irrelevant variables than the ALASSO and LASSO in most cases. In summary, the simulation results show that the RALASSO performs best in terms of $\text{MSE}_{\text{train}}$, $\text{MSE}_{\text{test}}$, hits, and FP, followed by the ALASSO and the LASSO, for small, medium, and high correlation, and that it has a greater advantage in variable selection under multicollinearity in the Poisson regression model.

Real Data Set Application
The real data set, which comes from a study of the distribution of freshwater mussels, was taken from Sepkoski and Rex (1974). The study aims to estimate the number of species of mussels in 41 rivers in the US from various explanatory variables. The nine explanatory variables are: area, number of stepping stones (intermediate rivers) to 4 major species-source river systems (Alabama-Coosa (AC), Apalachicola (AP), St. Lawrence (SL), and Savannah (SV)), nitrate concentration, hydronium concentration ($10^{-\mathrm{pH}}$), and solid residue.
To investigate the performance of the RALASSO method, the data set was split 50 times at random into a training set of 28 observations and a test set of 12 observations. Model fitting and tuning parameter selection were done by fivefold cross-validation on the training data set.

Conclusion
An improved adaptive LASSO, the RALASSO, was proposed for the Poisson regression model. The RALASSO was compared with two penalized methods, the LASSO and ALASSO, using both simulation studies and real data analysis. The simulation and real data results show that the RALASSO outperforms the other two methods in terms of the mean-squared error on the training and testing data sets and in terms of variable selection accuracy. We conclude that the RALASSO is more reliable than the ALASSO when there is multicollinearity among the variables in a penalized regression model.
A coordinate descent method can then be used to solve the RALASSO problem. The computational details are given in Algorithm 1 (the coordinate descent method for the RALASSO).
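Since the steps of Algorithm 1 are not reproduced in this text, the following Python sketch shows one common way to solve a weighted $\ell_1$-penalized Poisson problem by coordinate descent, in the spirit of Friedman et al. (2010): an outer loop forms a weighted least-squares approximation of the likelihood, and an inner loop updates one coefficient at a time by soft-thresholding. It is an assumption-laden illustration, not the authors' algorithm: the function names are hypothetical, there is no intercept, and no step-size control is included.

```python
import math

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_l1_poisson(X, y, lam, weights, n_outer=50, n_inner=100, tol=1e-10):
    """Minimize -loglik(beta) + lam * sum_j weights[j] * |beta_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_outer):
        beta_old = beta[:]
        # Quadratic approximation of the Poisson likelihood at the current beta
        eta = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        mu = [math.exp(e) for e in eta]              # working weights
        resid = [(y[i] - mu[i]) / mu[i] for i in range(n)]  # working response minus eta
        # Coordinate descent on the penalized weighted least-squares problem
        for _ in range(n_inner):
            max_change = 0.0
            for j in range(p):
                denom = sum(mu[i] * X[i][j] ** 2 for i in range(n))
                if denom == 0.0:
                    continue
                rho = sum(mu[i] * X[i][j] * (resid[i] + X[i][j] * beta[j])
                          for i in range(n))
                new_bj = soft_threshold(rho, lam * weights[j]) / denom
                delta = new_bj - beta[j]
                if delta != 0.0:
                    for i in range(n):
                        resid[i] -= X[i][j] * delta  # keep residuals in sync
                    beta[j] = new_bj
                    max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        if max(abs(a - b) for a, b in zip(beta, beta_old)) < tol:
            break
    return beta

# Hypothetical toy data; the weights would be the ratio weights of the RALASSO.
X = [[1.0, -1.0], [0.0, 1.0], [-1.0, 0.0]]
y = [0, 1, 2]
beta_hat = weighted_l1_poisson(X, y, lam=1.5, weights=[1.0, 1.0])
```

With equal weights this reduces to the LASSO; passing the ratio of standard error to estimate as `weights` gives the RALASSO criterion. In practice an established solver with safeguards against divergence would be preferable to this bare sketch.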
is equal to 1, we generate data sets with sample sizes 200/200/400 and 20 explanatory variables.
Figures 1 and 2 display the corresponding boxplots of the $\text{MSE}_{\text{train}}$ and $\text{MSE}_{\text{test}}$, respectively, for the three methods across the three simulations. The RALASSO clearly shows less variability than the LASSO and ALASSO.

Figure 1. Comparison of median mean-squared error of the training data for three methods

Table 1 summarizes the median MSE, the median number of hits, and FP. Bold font indicates the best method in terms of MSE, hits, and FP. Table 1 reveals that the RALASSO method produces the smallest median $\text{MSE}_{\text{train}}$ and $\text{MSE}_{\text{test}}$ among all methods in all simulation scenarios. For example, in simulation 1 the median $\text{MSE}_{\text{train}}$ of the RALASSO is 3.245, which is smaller than 3.314 and 3.346 for the ALASSO and LASSO methods, respectively.
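As a worked example of how such figures translate into percentage reductions, the training values quoted for simulation 1 can be compared as follows. The relative-reduction formula used here is an assumption, since the text does not state how its percentages were computed.

$$\frac{3.314 - 3.245}{3.314} \times 100\% \approx 2.08\% \ \text{(vs. ALASSO)}, \qquad \frac{3.346 - 3.245}{3.346} \times 100\% \approx 3.02\% \ \text{(vs. LASSO)}.$$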

Table 2. Comparison among methods for the real data set
Table 2 shows the median number of explanatory variables selected by the LASSO, ALASSO, and RALASSO in the training data set, together with the corresponding median $\text{MSE}_{\text{test}}$. The RALASSO performs best in terms of prediction error: its $\text{MSE}_{\text{test}}$ is approximately 7.45% lower than that of the ALASSO and 9.43% lower than that of the LASSO. Moreover, the RALASSO selects fewer explanatory variables than the other two methods.