A New Algorithm in Maximum Likelihood Estimation for Generalized Linear Models

We intrduce a new algorithm for 1 L regularized generalized linear models. The 1 L regularization procedure is useful,especially because it ,in effect,selects variables according to the amount of penalization on the 1 L norm of the coefficients,in a manner less greedy than forward selection/backward deletion. The algorithm efficiently computes solutions along the entire regularization path using the predictor-corrector method of convex-optimization. Selecting the step length of the regularization parameter is critical in controlling the overall accuracy of the paths; we suggest intuitive and flexible strategies for choosing appropriate values.


Introduction
In this paper we propose a preditor-corrector algorithm for 1 L regularized generalized linear models (GLM).GLM models a random variable Y that follows a distribution in the exponential family using a linear combination of the predictors, β x′ , where x and β denote vectors of the predictors and the coefficients, respectively.The random and the systematic components may be linked through a non-linear function; therefore,we estimate the coefficient β by solving a set of non-linear equations that satisfy the maximum likelihood criterion.
( ) where L denotes the likelihood function with respect to the given data ( ) { } When the number of predictors p exceeds the number of observations n , or when insignificant predictors are present, we can impose a penalization on the 1 L norm of the coefficients for an automatic variable selection effect.Analogous to Lasso (Tibshirani 1996) that added a penalty term, to the squared error loss criterion, we modify criterion (1) with a regularization: where 0 > λ is the regularization parameter.Logistic regression with 1 L penalization has been introduced and applied by other researchers, for example in Shevade & Keerthi (2003).

Problem setup
be n pairs of p factors and a response.Y follows a distribution in the exponential family with mean and variance . Depending on its distribution, the domain of i y could be a subset of ℜ .GLM models the random componentY by equating its mean µ with the systematic component η through a link function g : The likelihood of Y is expressed as follows (McCullagh & Nelder 1989): , and ) (⋅ c are functions that vary according to the distributions.Assuming that the dispersion parameter φ is known, we are interested in finding the maximum likelihood solution for the natural parameter θ , and thus with a penalization on the size of the 1 L norm of the coefficients ( 1 β ).Therefore, our criterion with a fixed λ is , which minimizes the following: Assuming that none of the components of β is zero and differentiating ) , ( λ β l with respect to 4565, we define a function H : where X is an n by 1 + p matrix including the column of s 1′ ,W is a diagonal matrix with n diagonal elements 2 1 , and δµ δη µ) Our goal is to compute the entire solution path for the coefficients β , with λ varying from ∞ to 0. We achieve this by drawing the uniquely determined curve 0

Predictor-Corrector algorithm
We introduced an algorithm that implements the predictor-corrector method to determine the entire path of the coefficient estimates as λ varies , i.e., to find ( ) , our algorithm computes a series of solution sets, each time estimating the coefficients with a smaller λ based on the previous estsmate.Each round of optimization consists of three steps: determining the step size in λ ; predicting the corresponding change in the coefficients, and correcting the error in the previous prediction.
The following lemma provides the initialization of the coefficient paths: Lemma 1: When λ exceeds a certain threshold, the intercept is the only nonzero coefficient: ( ) Proof.The Karush-Kuhn-Kuhn-Tucker (KKT) optimality conditions for minimizing (5) imply As λ is decreased further, other variables join the active set, beginning with the variable Reducing λ , we alternate between a predictor and a corrector step.

Predictor step
In the k-th perdictor step, ( ) A X denote the current weight matrix and the columns of X for the factors in the current active set,respectively.β in the above equations are composed only of current nonzero coefficients.This linearization is equivalent to making a quadratic approximation of the log-likeihood and extending the current solution k β ˆby taking a weighted lasso step (as in LARS).
Define ( ) ( ) ( ) ; in the domain that yields the current active set, ) (λ f is zero for all λ .By differentiating f with respect to λ ,we obtain ( ) from which we compute δβ δλ .

Corrector step
In the following corrector step,we use ˆk β + as the initial value to find the β that minimizes ( ) for β ).Any (convex) optimization method that applies to the minimization of a differentiable objective function with linear constrains may be implemented.The previous predictor step has provided a warm start;because + k β is usually close to the exact solution 1 k ˆ+ β , the cost of solving for the exact solution is low.The corrector steps not only find the exact solution at a given λ but also yield the directions of β for the subsequent predictor steps.

Active set
The active set Α begins from the intercept as in Lemma 3.1;after each corrector step,we check to see if Α should have been augmented.The following procedure for checking is justified and used by Rosset&Zhu(2003) and Rosset( 2004): We repeat the corrector step with the modified active set until the active set is not augmented further.We then remove the variables with zero coefficients from the active set.This is,

Step length
Two natural choices for the step length As we decrease the step size,the exact solutions are computed on a finer grid of λ values,and the coefficient path becomes more accurate.
We propose a more efficient and useful strategy: (3) select the smallest k ∆ that will change the active set of variables.
We give an intuitive explanation of how we achieve this,by drawing on analogies with the Lars algorithm (Efron et al.2004).At the end of the th k − iteration,the corrector step can be characterized as finding a weighted Lasso solution that satisfies ( ) .This weighted Lasso also produces the direction for the next predictor step.If the weights k W were fixed,the weight Lars algorithm would be able to compute the exact step length to the next active-set change point.We use this step length,even though in practice the weights change as the path progresses.
Lemma 2: Let µ ˆ be the estimates of y from a corrector step,and denote the corresponding weighted correlations as The absolute correlations of the factors in Α (except for the intercept) are λ ,while the values are smaller than λ for the factors in c Α .
The next predictor step extends β ˆ as in (11),and,thus,the current correlations change.Denoting the vector of changes in correlation for a unit decrease in λ as a , ( ) Where 0 > h is a given decrease in λ .For the factors in Α ,the values of a are those of .To find the h with which any factor in c Α yields the same absolute correlation as the ones in Α ,we solve the following equations: ( ) The equations suggest an estimate of the step length in λ as In addition,to check if any variable in the active set reaches 0 before λ decreases by h ,we solve the equations ( ) ( ) ,we expect that the corresponding variable will be eliminated from the active set before any other variable joins it;therefore, h ~ rather than h is used as the next step length.
Letting the coefficient paths be piecewise linear with the knots placed where the active set changes is a reasonable simplification of the truth based on our experience(using both simulated and real datasets).If the smallest step length that modifies the active set were to be larger than the value we have estimated,the active set remains the same,even after the corrector step.If the true step length were smaller than expected,and,thus,we missed the entering point of a new active variable by far,we would repeat a corrector step with an increased λ .Therefore,our path algorithm almost precisely detects the values of λ at which the active set changes,in the sense that we compute the exact coefficients at least once before their absolute values grow larger than δ (a small fixed quantity).
We can easily show that in the case of Gaussian distribution with the identity link, the piecewise linear paths are exact. .The step lengths are computed with no error; in addition, since the predictor steps yield the exact coefficient values, corrector steps are not deccessary.In fact, the paths are identical to those Lasso.

Discussion
We can extend the use of the predictor-corrector scheme by generalizing the loss+penalty function to any convex and almost differentiable functions.For example, we can find the entire regularization path for the Cox proportional hazards models with 1 L penalization,.Just as the solution paths for Gaussian distribution were computed with no error through the predictor-corrector method, so any other piecewise linear solution paths can be computed exactly by open range of λ that yields a certain active set of variables; the existence of such mappings