Regularized Single Index Quantile Regression Model

This article proposes a new approach to variable selection in the single-index quantile regression model. Compared to existing methods, the new approach produces sparse solutions for the index vector. Performance is further enhanced by a fully adaptive penalty function. Finite-sample performance is studied through a simulation study that compares the proposed method with existing work under several criteria. A data analysis highlights the usefulness of the proposed methodology.


Introduction
With the advances in data generation, collection, and storage, modeling data with a large number of covariates has become the rule rather than the exception. However, when the number of covariates is large, classical modeling approaches often suffer from the curse of dimensionality. In the context of nonparametric regression this issue is especially challenging, because data sparseness in local neighborhoods makes it virtually impossible to perform a fully nonparametric estimation of a regression function with multiple covariates. To overcome this issue Ichimura (1990) proposed the single-index model, a semiparametric model that combines the strengths of the parametric and nonparametric paradigms to alleviate the issues surrounding high dimensionality. However, single-index models suffer from the same drawbacks as classical regression models: 1. the typical single-index model assumes that the errors have finite variance, an assumption that may not hold in practice, especially with heavy-tailed error distributions; 2. the single-index model targets the conditional mean of the response variable, so the estimates become highly sensitive to extreme values. To overcome these difficulties and allow the estimation of various conditional quantiles of the response variable, Koenker & Bassett (1978), in their pioneering work, introduced the quantile regression framework. In this article we examine estimation and variable selection for the single-index quantile regression model.

Consider the single-index quantile regression model

y_i = g_τ(θ⊤x_i) + ϵ_i,   i = 1, . . ., n,   (1)

where the {y_i} are univariate response variables, the {x_i} are p × 1 covariate vectors, and g_τ is a univariate "link" function. The {ϵ_i} are independent random error terms whose τth conditional quantile is zero, i.e. P(ϵ_i ≤ 0 | x_i) = τ.
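To make the quantile restriction concrete, the following sketch simulates from a model of this form and checks that the τth conditional quantile of y recovers g(θ⊤x). The link, index vector, and error distribution here are illustrative choices, not the paper's:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
tau, n, p = 0.25, 200000, 3

theta = np.ones(p) / np.sqrt(p)          # illustrative unit index vector
g = lambda t: np.sin(4.0 * t)            # illustrative link, not from the paper

x = rng.uniform(0.0, 1.0, size=(n, p))
# Shift a standard normal so its tau-th quantile is exactly zero,
# which enforces the identifying condition P(eps <= 0 | x) = tau.
eps = rng.standard_normal(n) - norm.ppf(tau)
y = g(x @ theta) + 0.1 * eps

# The tau-th conditional quantile of y given theta'x = t should be g(t):
t = x @ theta
band = np.abs(t - 0.8) < 0.02            # narrow slice around t = 0.8
print(np.quantile(y[band], tau))         # approximately sin(3.2)
```

The empirical τth quantile of y within the slice is close to g(0.8), which is exactly what model (1) asserts.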
Estimation of model (1) was studied by Wu et al. (2010), who proposed an iterative method to estimate the index vector and the link function. Their algorithm alternates between local linear estimation and ordinary quantile regression estimation until convergence. The drawback of their procedure is that it does not perform variable selection; therefore, when applied to data with a large number of covariates, it performs poorly in identifying the predictors relevant to the given response. To remedy this issue, Alkenani and Yu (2013) introduced regularization into the method of Wu et al. (2010), proposing the Least Absolute Shrinkage and Selection Operator (LASSO) and the adaptive LASSO as regularizing mechanisms to select variables in model (1). Although the adaptive LASSO has been shown to perform quite well under various settings (Zou (2006), Wu & Liu (2009), Zheng et al. (2013)), their estimation procedure is computationally expensive, involving the minimization of a double-weighted summation of n² terms, where n is the sample size.
To overcome this difficulty Lv et al. (2014) proposed an improved estimation algorithm by extending the approach of Carroll et al. (1997). In effect, they cut the computational burden from n² to n for the estimation of the index vector and link function of model (1). To select important variables, they propose LASSO-type regularization as a follow-up step to their estimation. The use of regularization as a secondary step makes their approach perform poorly in estimating the link function when the number of covariates, p, is considerably large (see Figures 1 and 2). Furthermore, the resulting estimate of the index vector is not sparse, and there is a large bias and variance in the estimated coefficients and the link function. In addition, they rely on methods such as the average derivative estimation (ADE) of Chaudhuri et al. (1997) to obtain initial values for their algorithm. Unfortunately, the ADE method requires multidimensional kernel smoothing, which is computationally expensive and sometimes infeasible with high-dimensional covariates; it is therefore counterproductive to use ADE in a single-index modeling framework. In order to overcome these drawbacks, we propose the following two improvements to Lv's algorithm:
1. Use the standard unpenalized linear quantile regression estimator as the initial value for the algorithm, thereby reducing the overall computational burden of the estimation.
2. Penalize the estimated coefficients iteratively, so that the final estimate of the index vector is sparse.
This proposal is motivated by the following observations: 1. Zhu et al. (2012) showed that the linear quantile regression estimator is a √n-consistent estimator of the index parameter even under link violations; we exploit this property in the single-index framework to obtain a "good" starting point for the estimation. 2. Although the algorithm proposed by Lv et al. (2014) performs well with a small number of non-significant covariates, when a large number of extra (non-significant) covariates are introduced into the estimation, as is likely in practice with large datasets, we see a dramatic drop in the performance of Lv's method. We demonstrate this effect in the following example, with two cases:
1. p = 10, with p₀ = 3 non-zero coefficients and the remaining 7 coefficients equal to zero.
2. Same as above, except p = 50, with p₀ = 3 non-zero coefficients and the remaining 47 coefficients equal to zero.
Figure 1 shows the estimated coefficients and the link function estimates for the 1000 simulated samples from Case 1. As is evident in Figure 1, both methods perform very well in recovering the link function. However, when we introduce a large number of non-significant covariates into the estimation problem (Case 2), the method proposed by Lv et al. (2014) fails to recover the link function, as shown in Figure 2. This drop in effectiveness is a direct consequence of how Lv et al. (2014) chose to combine the estimation and variable selection steps. In essence, their approach follows a two-step structure: first, they estimate the index vector using an unpenalized minimization algorithm; then, they use the resulting index vector and link function in a penalized estimation procedure to arrive at their final estimate of the index vector. Although this method works well when relatively few covariates are non-significant (as in Figure 1), it fails to perform adequately when a large number of non-significant covariates are introduced (see Figure 2). In addition to the unsatisfactory link function estimation, it fails to yield a sparse index vector estimate when many non-significant covariates are present in the data.
For example, Table 1 gives a brief summary of the estimated coefficients for Cases 1 and 2 with the linear link g(t) = 2t + 3. When p = 10, with only 7 non-significant covariates, both methods perform quite well. However, when the number of non-significant covariates increases to 47, Lv et al.'s method fails to recover most of the zero coefficients. Furthermore, the mean squared errors of their three estimated significant coefficients are much larger (about five times) than with our method. As is evident in this motivational example (Figures 1 and 2 and Table 1), the overall performance of the approach proposed by Lv et al. (2014) is not satisfactory. To remedy this issue, we now present the details of our proposal and discuss how we improve the estimation of the index vector and link function in model (1).
Step 1: Obtain an initial estimate θ^(0) of the index vector using the standard unpenalized linear quantile regression estimator.

Step 2: Given θ^(0), use local linear estimation to obtain initial estimates of the link function g (and its derivative) at θ^(0)⊤x_j, j = 1, . . ., n, by solving the minimization problem

min_{(a_j, b_j)} Σ_{i=1}^n ρ_τ( y_i − a_j − b_j θ^(0)⊤(x_i − x_j) ) K( θ^(0)⊤(x_i − x_j) / h ).

Here ρ_τ(s) = |s| + (2τ − 1)s is the loss function (commonly known as the "check" function), K(·) is a univariate kernel function, and h is the bandwidth that governs the smoothness of the local linear estimation.
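A minimal sketch of this local linear minimization at a single point, assuming a Gaussian kernel and a general-purpose optimizer (the authors' actual solver is not specified in this excerpt):

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(s, tau):
    # rho_tau(s) = |s| + (2*tau - 1)*s, the check function from Step 2
    return np.abs(s) + (2.0 * tau - 1.0) * s

def local_linear_quantile(u, y, u0, tau, h):
    """Local linear estimates (a, b) of (g(u0), g'(u0)), where the u_i
    are the current index values theta0' x_i (Gaussian kernel assumed)."""
    w = np.exp(-0.5 * ((u - u0) / h) ** 2)   # kernel weights K((u_i - u0)/h)
    def objective(par):
        a, b = par
        return np.sum(w * check_loss(y - a - b * (u - u0), tau))
    res = minimize(objective, x0=[np.median(y), 0.0], method="Nelder-Mead",
                   options={"maxiter": 2000, "xatol": 1e-6, "fatol": 1e-6})
    return res.x
```

Running this at each u_j = θ^(0)⊤x_j yields the intercepts and slopes that play the roles of â_j and b̂_j in Step 3.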
Step 3: Using the estimates â_j = ĝ(θ^(0)⊤x_j) and b̂_j = ĝ′(θ^(0)⊤x_j) from Step 2, obtain an updated estimate of the index parameter θ by solving the following penalized minimization problem:

min_θ Σ_{j=1}^n ρ_τ( y_j − â_j − b̂_j (θ − θ^(0))⊤x_j ) + λ_n Σ_{k=1}^p ω_k |θ_k|,

where λ_n is the penalty parameter and ω ∈ R^p is a data-driven weight vector that controls the penalty on each index coefficient.
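The penalized update can be sketched as follows, assuming a first-order expansion of the link around θ^(0) and a generic optimizer. The tiny-coefficient threshold and the unit-norm rescaling are illustrative conventions, not details taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(s, tau):
    return np.abs(s) + (2.0 * tau - 1.0) * s

def penalized_index_update(x, y, theta0, a_hat, b_hat, tau, lam, omega):
    """One Step-3 update (a sketch): the link is linearized around the
    current index theta0, so the fitted value for observation j is
    a_hat[j] + b_hat[j] * (theta - theta0)' x_j, and an adaptive L1
    penalty lam * sum_k omega_k * |theta_k| is added to the check loss."""
    def objective(theta):
        resid = y - a_hat - b_hat * (x @ (theta - theta0))
        return np.sum(check_loss(resid, tau)) + lam * np.sum(omega * np.abs(theta))
    res = minimize(objective, x0=theta0, method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
    theta_new = res.x
    theta_new[np.abs(theta_new) < 1e-4] = 0.0     # zero out tiny coefficients
    return theta_new / np.linalg.norm(theta_new)  # unit norm for identifiability
```

With λ = 0 and noiseless linear data the update recovers the true index; increasing λ shrinks the weighted coefficients toward zero.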
Note: this step is where we differ from Lv et al. (2014): we introduce the penalty function as a main step of the estimation rather than as a post-estimation step. Furthermore, the penalty function that we propose is fully adaptive, unlike the one proposed by Lv et al. (2014).
Step 4: Repeat Steps 2 and 3 until convergence. Denote the final estimate of θ by θ̂.
Once the index parameter θ is estimated, an improved estimate of the link function is obtained by smoothing {(η̂_i, y_i)}_{i=1}^n with local linear estimation, where η̂_i = θ̂⊤x_i and the minimization is identical to the one in Step 2. Denote the final estimate of g_τ by ĝ.

Tuning Parameter Selection
It is well known that any estimation procedure that depends on nonparametric smoothing requires careful attention to bandwidth selection. Most bandwidth selection procedures rely on minimizing the mean squared error of the estimate. For local linear quantile regression, Yu & Jones (1998) developed a useful rule-of-thumb bandwidth by minimizing the asymptotic mean squared error (AMSE). This bandwidth, h_τ, is calculated as

h_τ = h_mean [ τ(1 − τ) / φ(Φ^{−1}(τ))² ]^{1/5},

where h_mean is the AMSE-optimal bandwidth for local linear nonparametric mean regression, and φ(·) and Φ(·) are the probability density function and the distribution function of the standard normal distribution, respectively. This is a commonly used (and computationally efficient) way to obtain AMSE-optimal bandwidths for nonparametric quantile regression. We use this optimal bandwidth selector in Step 2 of our algorithm.
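The rule-of-thumb adjustment is a one-liner; the sketch below transcribes the Yu & Jones (1998) factor, scaling a given mean-regression bandwidth for a target quantile:

```python
import numpy as np
from scipy.stats import norm

def rule_of_thumb_bandwidth(h_mean, tau):
    """Yu & Jones (1998) rule of thumb: rescale the AMSE-optimal
    mean-regression bandwidth h_mean for the tau-th quantile."""
    z = norm.ppf(tau)                      # Phi^{-1}(tau)
    return h_mean * (tau * (1.0 - tau) / norm.pdf(z) ** 2) ** 0.2
```

At τ = 0.5 the factor reduces to (π/2)^{1/5} ≈ 1.09, and it grows as τ moves toward the tails, reflecting the scarcer data there.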
The penalty parameters λ_n and ω can be chosen by leave-one-out cross-validation. We propose a two-dimensional grid search, with cross-validation in terms of check loss, to select the optimal parameters. Let ω = 1/|θ̃|^δ with δ > 0, where θ̃ is any √n-consistent estimator of the index vector. Define the leave-one-out cross-validation score for the τth conditional quantile at (λ, δ) as

CV(λ, δ) = Σ_{i=1}^n ρ_τ( y_i − Ŷ_(−i) ),

where Ŷ_(−i) is the fitted value of the τth conditional quantile with the ith observation removed. Although this approach yields check-loss-optimal tuning parameters, their statistical properties are not well understood, especially under the heavy-tailed error distributions where quantile regression is most applicable. Furthermore, the method is computationally expensive because of the two-dimensional nature of the minimization. Therefore, we illustrate this method only in our data analysis in Section 4. To save computational time in our simulations, we used a coarse grid search to find the check-loss-optimal tuning parameters. In particular, we used the two-dimensional grid {(λ, δ) : λ = 2, 4, . . ., 10; δ = 0.5, 1, 2, 3}.
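A sketch of the coarse grid search, with the leave-one-out fitting routine abstracted away as a user-supplied function (a hypothetical interface, not the authors' code):

```python
import numpy as np

def check_loss(s, tau):
    return np.abs(s) + (2.0 * tau - 1.0) * s

def cv_grid_search(fit_predict_loo, y, tau,
                   lams=(2, 4, 6, 8, 10), deltas=(0.5, 1, 2, 3)):
    """Coarse (lambda, delta) grid search by leave-one-out check loss.
    fit_predict_loo(lam, delta, i) must return the fitted tau-th
    conditional quantile for observation i with that observation held
    out; the fitting routine itself is whatever estimator is tuned."""
    best, best_score = None, np.inf
    for lam in lams:
        for delta in deltas:
            preds = np.array([fit_predict_loo(lam, delta, i)
                              for i in range(len(y))])
            score = np.sum(check_loss(y - preds, tau))
            if score < best_score:
                best, best_score = (lam, delta), score
    return best, best_score
```

The default grid mirrors the one used in the simulations; the double loop makes the cost of the two-dimensional search explicit.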

Simulation Results
We conducted a simulation study to compare the performance of our method with that of Lv et al. (2014). It covers several aspects of estimation: mean squared error in the estimation of the non-zero indices; recovery of the link function; sparsity of the estimates; and robustness to outliers. We present three examples. Example 1 focuses on the estimation of a highly non-linear link function and on variable selection; Example 2 highlights robustness to outliers; Example 3 demonstrates the behavior under a heteroscedastic error structure. In each example the index vector is chosen according to two cases:
1. p = 10, with p₀ = 3 non-zero coefficients and the remaining 7 coefficients equal to zero.
2. Same as above, except p = 50, with p₀ = 3 non-zero coefficients and the remaining 47 coefficients equal to zero.
For each example we generated 1000 random samples, each of size n = 200, and compared the performance of our method with that of Lv et al. (2014). To assess performance we compare: 1. the mean squared error of the estimated non-zero components; 2. the average number of correctly classified zeros (denoted 'Correct'); 3. the average number of incorrectly classified zeros (denoted 'Wrong'). We used a threshold of 10⁻⁴ to label a coefficient estimate as zero. To assess the effectiveness of the link function estimation, we plotted the link function estimates in summary form, i.e. the 5th, 50th and 95th percentiles of the estimated link functions over the 1000 simulated samples. All simulations and analyses were carried out using the R statistical software (R Core Team (2015)).
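The three performance metrics can be computed as below; the 10⁻⁴ threshold is the one stated in the text, while the function itself is just an illustration:

```python
import numpy as np

def selection_summary(theta_hat, theta_true, tol=1e-4):
    """MSE over the truly non-zero components, plus the 'Correct' and
    'Wrong' zero counts; estimates with |coef| < tol count as zero."""
    nonzero = theta_true != 0
    est_zero = np.abs(theta_hat) < tol
    mse = np.mean((theta_hat[nonzero] - theta_true[nonzero]) ** 2)
    correct = int(np.sum(est_zero & ~nonzero))  # true zeros estimated as zero
    wrong = int(np.sum(est_zero & nonzero))     # true non-zeros zeroed out
    return mse, correct, wrong
```

Averaging 'Correct' and 'Wrong' over the 1000 replications gives the entries reported in Tables 1 through 4.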

Example 1
In this first example, we illustrate the behavior of the estimators under a highly non-linear link function and a homoscedastic error distribution. In particular, we generated data from y = sin(4t) + 0.1ϵ, where t = θ⊤x and θ is chosen according to the two scenarios described in Cases 1 and 2. The error term ϵ follows a normal distribution with mean 0 and standard deviation 1. This is the same type of link function used by Lv et al. (2014) in their first example. Figures 3 and 4 show the link function estimates for Lv et al. (2014) and our method, respectively. As shown in Figure 3, the link function estimates for Lv's approach are quite good when p = 10 (Figure 3, top row), but the performance degrades considerably when a large number of non-significant covariates are introduced (Figure 3, bottom row). In contrast, our approach performs quite well under both settings (p = 10 and p = 50), as is evident in Figure 4.
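The data-generating mechanism of Example 1, Case 1 can be sketched as follows. The equal non-zero loadings 1/√3 and the Unif[0,1] covariates are assumptions for illustration (the covariate distribution is stated only in the later examples, and the exact values of θ are not given in this excerpt):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, p0 = 200, 10, 3                   # Case 1; Case 2 uses p = 50
theta = np.zeros(p)
theta[:p0] = 1.0 / np.sqrt(p0)          # assumed equal non-zero loadings
x = rng.uniform(0.0, 1.0, size=(n, p))  # assumed Unif[0,1] covariates
t = x @ theta
y = np.sin(4.0 * t) + 0.1 * rng.standard_normal(n)
```

Repeating this draw 1000 times and fitting both estimators reproduces the setting summarized in Figures 3 and 4 and Table 2.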
Next, we present the performance measures for the variable selection aspect of the two methods. Results are given for five quantiles, τ = 0.1, 0.25, 0.5, 0.75, and 0.9. Table 2 summarizes the following performance metrics: the mean squared error of the three estimated non-zero indices; the average number of correctly estimated zeros; and the average number of incorrectly estimated zeros.
In Case 1 (p = 10), both methods were successful in selecting the three significant variables and shrinking the other 7 coefficients to zero. However, in Case 2 (p = 50), Lv's method was not effective in identifying the 47 non-significant variables and setting them to zero. The mean squared error in Case 2 is also much higher for Lv's approach than for our method, a direct consequence of the lack of sparsity in their index vector estimates.

Example 2
In this example we examine how the two approaches fare when the data are contaminated by outliers. We generated data from y = t² + 0.1ϵ, where, as before, t = θ⊤x, θ is chosen according to the two cases described above, and the x_i are i.i.d. Unif[0, 1]. The error term ϵ follows the mixture distribution 0.9N(0, 1) + 0.1Cauchy(0, 1), which ensures that about 10% of the data are contaminated with very large outliers. Figures 5 and 6 show the link function estimates for Lv et al. (2014) and our method, respectively. As shown in Figure 5, the link function estimates for Lv's approach are quite good when p = 10 (Figure 5, top row), even with outliers in the data. However, as in Example 1, the performance degrades considerably when a large number of non-significant covariates are introduced (Figure 5, bottom row). In contrast, as is evident in Figure 6, our approach performs quite well under both settings (p = 10 and p = 50).

Figure 6. Comparison of estimated link functions using the proposed method with n = 200: solid = true; dashed = 50th, dotted = 5th and 95th percentiles of the quantile function estimates of 1000 simulated samples. Top row: p = 10 with 3 non-zero coefficients and 7 zero coefficients. Bottom row: p = 50 with 3 non-zero coefficients and 47 zero coefficients.
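The contaminated errors of this example can be generated as below; drawing the mixture via latent Bernoulli indicators is an implementation choice rather than anything specified by the authors:

```python
import numpy as np

def mixture_error(n, rng):
    """Draw from 0.9*N(0,1) + 0.1*Cauchy(0,1): roughly 10% of the
    observations are contaminated, often with very large outliers."""
    from_cauchy = rng.uniform(size=n) < 0.1
    eps = rng.standard_normal(n)
    eps[from_cauchy] = rng.standard_cauchy(size=int(from_cauchy.sum()))
    return eps

rng = np.random.default_rng(2)
eps = mixture_error(100000, rng)
frac_big = np.mean(np.abs(eps) > 10.0)  # gross outliers come only from the Cauchy part
```

Because the Cauchy component has no finite variance, even a 10% contamination rate produces errors far outside the normal range, which is exactly the stress test intended here.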
As before, the variable selection performance is studied at five quantiles, τ = 0.1, 0.25, 0.5, 0.75, and 0.9. Table 3 summarizes, for Cases 1 and 2, the mean squared error of the three estimated non-zero components of the index vector and the average numbers of correctly and incorrectly estimated zeros for both methods. We see the same pattern as in Example 1: both methods select the correct variables when few non-significant variables are present (p = 10), but the proposed method outperforms Lv's approach when many non-significant variables are involved (p = 50).

Example 3
In this final example, we explore how the two approaches behave under a heteroscedastic error structure. We generated data from y = g(t) + σ(t)ϵ with a linear link g, where σ(t) = 2 + cos(2πt) and t = θ⊤x. As before, θ corresponds to the two cases mentioned above, with the x_i i.i.d. Unif[0, 1], and the error term ϵ follows a 0.25N(0, 1) distribution. The following two figures show the estimated link functions for Lv et al. (2014) and our method, respectively.
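A sketch of this heteroscedastic design; the linear link 2t + 3 is borrowed from the motivating example and the equal loadings are assumed, since the exact choices are not stated in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, p0 = 200, 10, 3                   # Case 1; Case 2 uses p = 50
theta = np.zeros(p)
theta[:p0] = 1.0 / np.sqrt(p0)          # assumed equal non-zero loadings
x = rng.uniform(size=(n, p))
t = x @ theta
sigma = 2.0 + np.cos(2.0 * np.pi * t)   # sigma(t) from the text
eps = 0.25 * rng.standard_normal(n)     # the 0.25*N(0,1) error
y = (2.0 * t + 3.0) + sigma * eps       # linear link assumed (see lead-in)
```

Since σ(t) oscillates between 1 and 3, the noise level varies threefold along the index, which is what makes this design a test of behavior under heteroscedasticity.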
As shown in the top row of Figure 7, when the number of covariates is small (p = 10), the link function estimates for Lv's approach are quite good. However, the performance degrades considerably when a large number of non-significant covariates are introduced (Figure 7, bottom row). In contrast, our approach performs quite well under both settings (p = 10 and p = 50), as is evident in Figure 8.
Finally, the variable selection performance of both methods is summarized in Table 4. As before, it is quite evident that the proposed method outperforms Lv's method when a large number of non-significant variables are involved.

Data Analysis

The main goal of Hirst et al. (1994) is to use neural networks and logic programming to describe quantitative structure-activity relationships (QSAR) in the inhibition of dihydrofolate reductase (DHFR). A QSAR is a model that correlates measurable or calculable physical or molecular properties with a specific biological response. Once a valid QSAR has been determined, it should be possible to predict the biological activity of related drug candidates before they are put through expensive and time-consuming biological testing (Hansch & Fujita (1964)). The QSAR method examined by Hirst et al. (1994) is intended for drug design problems where no receptor structure is available. The goal of the study is to identify the important attributes (variables) that relate to the inhibition level, measured by log(1/K), where K is the inhibition level as experimentally assayed.
The actual data set contains structural information on 74 drug variants of a particular DHFR inhibitor in E. coli.
For each drug variant there are 3 positions where chemical activity occurs, with 9 attributes per position, giving 27 variables in total. These variables represent the attributes: polarity, size, flexibility, number of hydrogen-bond donors and acceptors, presence and strength of π-acceptors and π-donors, polarizability of the molecular orbitals, σ-effect, and branching. The attributes were assigned to reproduce general trends rather than precise values. For example, the size of a substituent was based on the number of carbon, nitrogen and oxygen atoms it contains, with size = 0 for hydrogen, size = 1 for single atoms and size = 2 for substituents containing two to four carbon, nitrogen or oxygen atoms. One of the variables (variable 25) had no variability and was removed from the data set.
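The coarse attribute coding can be mirrored in code; the function below is hypothetical, paraphrasing only the size rule quoted above (the coding for larger substituents is not given in this excerpt):

```python
def size_code(n_heavy_atoms):
    """Coarse 'size' attribute in the style of Hirst et al. (1994):
    0 for hydrogen, 1 for a single heavy atom, 2 for substituents
    with two to four C/N/O atoms. Rule paraphrased from the text."""
    if n_heavy_atoms == 0:
        return 0          # hydrogen
    if n_heavy_atoms == 1:
        return 1          # single heavy atom
    if 2 <= n_heavy_atoms <= 4:
        return 2          # small substituent
    raise ValueError("coding for >4 heavy atoms not given in this excerpt")
```

The deliberately lossy coding reflects the authors' intent to capture general trends rather than precise physical values.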
A complete description of the variables is given in Appendix A.
We analyzed these data under the single-index quantile regression framework to identify the variables that are strongly associated with the inhibition level log(1/K). The following table summarizes the variables (attributes) selected by three methods: 1. linear quantile regression (LQR); 2. the method of Lv et al. (2014), denoted LV; and 3. the proposed method. Hirst et al. (1994), using classical mean regression in their original work, claimed that the quadratic term of the variable X8 (3-position polarizability) is statistically significant. This highlights the importance and flexibility of the single-index approach in determining the appropriate link in exploratory data analysis.

Conclusion
The iterative estimation method that we propose in this article outperforms that of Lv et al. (2014) both in estimating the link function and in selecting relevant variables, especially when a large number of non-significant variables are involved. The key to its success is the iterative penalization technique and the adaptive penalty weights used in the estimation algorithm. The asymptotic properties of the proposed estimators, especially the consistency and the oracle property of the variable selection method, are of great interest and are currently being investigated.

Figure 3. Comparison of estimated link functions using the method in Lv et al. (2014) with n = 200: solid = true; dashed = 50th, dotted = 5th and 95th percentiles of the quantile function estimates of 1000 simulated samples. Top row: p = 10 with 3 non-zero coefficients and 7 zero coefficients. Bottom row: p = 50 with 3 non-zero coefficients and 47 zero coefficients.

Figure 4. Comparison of estimated link functions using the proposed method with n = 200: solid = true; dashed = 50th, dotted = 5th and 95th percentiles of the quantile function estimates of 1000 simulated samples. Top row: p = 10 with 3 non-zero coefficients and 7 zero coefficients. Bottom row: p = 50 with 3 non-zero coefficients and 47 zero coefficients.

Figure 5. Comparison of estimated link functions using the method in Lv et al. (2014) with n = 200: solid = true; dashed = 50th, dotted = 5th and 95th percentiles of the quantile function estimates of 1000 simulated samples. Top row: p = 10 with 3 non-zero coefficients and 7 zero coefficients. Bottom row: p = 50 with 3 non-zero coefficients and 47 zero coefficients.

Figure 9. Comparison of estimated link functions for the DHFR inhibitor data. Top row: method of Lv et al. (2014). Bottom row: proposed method.

Table 1.
Summary of results: Mean Squared Error (MSE) of the non-zero coefficients and the identification of zero coefficients in Cases 1 and 2 with the linear link at τ = 0.5.

Table 2.
Summary of results: Mean Squared Error (MSE) of the non-zero coefficients and the identification of zero coefficients in Cases 1 and 2 with the sin link and normal errors.

Table 3.
Summary of results: Mean Squared Error (MSE) of the non-zero coefficients and the identification of zero coefficients in Cases 1 and 2 with the quadratic link and a mixture error distribution.

Table 4.
Summary of results: Mean Squared Error (MSE) of the non-zero coefficients and the identification of zero coefficients in Cases 1 and 2 with the linear link and a heteroscedastic error distribution.