A Simulation Study Comparing Knot Selection Methods With Equally Spaced Knots in a Penalized Regression Spline

Penalized regression splines are a commonly used method to estimate complex non-linear relationships between two variables. The fit of a penalized regression spline to the data depends on the number of knots, knot placement, and the value of the smoothing parameter. In this paper, we use a simulation study to compare knot selection methods with equidistant knots in a penalized regression spline model. We found that one method generally performed better than others. The results provide guidance in selecting the number of equidistant knots in a penalized regression spline model.


Introduction
Consider the general regression model with a single explanatory variable that takes the form where x i ∈ [a, b], y i is a response variable, x i is a covariate, g(x) is the regression function dependent on the covariate, n is the number of observations, and ε i iid ∼ N(0, σ 2 ) for all i.The goal with model (1) is to estimate the function g(x), which describes the relationship between x and y in the presence of uncertainty.There are various approaches for estimating the true underlying curve, g(x) for all x ∈ [a, b], depending on what we know about g(x).
Non-parametric methods do not assume a parametric form for g(x) in model ( 1), but instead the estimation of g(x) is driven by the data.For example, to obtain a cubic penalized regression spline estimate of g(x), one may find g(x) to minimize the following criterion n i=1 where λ is called the smoothing parameter and a = x 1 < • • • < x n = b.To minimize fitting criterion (2), a penalized regression spline approximates g(x) by a linear combination of basis functions.The number of basis functions that approximate g(x) depends on the number of knots K.The cubic penalized regression spline estimate obtained by minimizing fitting criterion (2) is dependent on the value of λ, K, and the location of the knots.There are various families of basis functions that may be used in a penalized regression spline.To confine the study to a reasonable length, we only consider a truncated power basis for illustration in Section 2.
When a knot is placed at each data point, the penalized regression spline is referred to as a smoothing spline.Using smoothing splines, Lee (2003) compares, among others, Akaike's information criterion corrected, Cross-Validation, Generalized Cross-Validation, and Mallow's Cp criterion to select the smoothing parameter.The location of the knots may also be chosen to control the shape of the curve estimate.Stone et al. (1997) and references cited therein discuss various stepwise selection procedures to select the number of knots and their location in a regression spline setting.In a penalized regression spline setting, Spiriti et al. (2013) proposes two random search algorithms for selecting the knot locations.
Using a penalized regression spline, Ruppert (2002) compares the performance of two different algorithms (Myopic and Full Search) to select the number of knots when placing the knots at quantiles of the x i 's with generalized cross validation for smoothing parameter selection and found that both algorithms perform well.Ruppert et al. (2003) further expands on this discussion and provides a default choice for the number of knots when placing the knots at quantiles of the x i 's.Wand and Ormerod (2008) also place the knots at quantile locations using a similar default rule as Ruppert et al. (2003) to select the number of knots in a penalized regression spline.Using equidistant knot locations, Eilers and Marx (2010) compare two types of penalized regression splines.They demonstrated that a large number of equally spaced knots outperformed quantile spacing of the knots in their investigation, but an algorithm for determining what is large was not discussed.See Wand (2000) for a discussion on some of these procedures and others in a regression and penalized regression spline setting.
The term "mixed model spline" has been used by some to indicate a spline smooth estimated via the mixed model methodology.Welham et al. (2007) provide an overview of smoothing splines and penalized regression splines in a linear mixed model setting.For quantile knot placement, they suggest using the default choice, Myopic algorithm, or Full Search algorithm proposed by Ruppert et al. (2003), but the performance of these methods in their setting were not investigated.Kauermann and Opsomer (2011) provide a likelihood based approach to select the number of knots using quantile knot spacing in a mixed model spline.Software implementation and other examples of mixed model splines are discussed in Ngo and Wand (2004).
In this article, we build upon the analysis of Ruppert (2002) and Ruppert et al. (2003) in regards to selecting the number of knots in a penalized regression spline; a notable difference is that we use equally spaced knots and several smoothing parameter selection methods under various simulation settings.Our aim is to determine if one method for choosing the number of equally spaced knots outperforms another, thereby providing a preferred approach for selecting the number of equally spaced knots.To our knowledge, the performance of these knot selection approaches when using equidistant knots have not been examined previously.In Section 2, we review a penalized regression spline with a truncated power basis.We henceforth refer to a penalized regression spline with a truncated power basis as a TSM.In Section 3, we present several commonly used criteria for selecting the smoothing parameter λ in a penalized regression spline.In Section 4, we review three different methods presented in Ruppert et al. (2003) to select the number of knots.In Section 5, we present a simulation study to examine the performance of the knot selection methods with equally spaced knots.We conclude with discussion in Section 6.

A Regression Spline With a Truncated Power Basis
We use a cubic penalized regression spline with a truncated power basis to estimate g(x) in model (1).Key references for regression splines with a truncated power basis include Hastie et al. (2001) and Ruppert et al. (2003).Using a cubic penalized regression spline with a truncated power basis, we model g(x) as where k j is the j th knot, and (x − k j ) + = (x − k j ) if x ≥ k j or zero if x < k j .Assuming that the observation points x i are unique, an estimate of g(x) can be found to minimize (2) if λ and K are held constant.To illustrate, we follow a matrix-based approach as described in Ruppert et al. (2003), among others. and Using this notation, the TSM may be expressed as where y is a column vector, X is a n × (4 + K) matrix.To estimate β, one may use the estimator where and I K×K is a K × K identify matrix.To increase numerical stability when using a truncated power basis in our simulation study in Section 5, we replace λR in (2) by 10 −10 R + λR (Ruppert, 2002).This results in a penalty lower bound of 10 −10 instead of 0. Using the estimate from (2), we obtain This estimate depends on the value of λ and K.

Selecting the Smoothing Parameter
In this section, we discuss various data driven criteria for selecting λ in fitting criterion (2) conditional on the value of K: Akaike's information criterion corrected (AICc), Bayesian information criterion (BIC), Cross-Validation (CV) and Generalized Cross-Validation (GCV) criterion.Each of these data driven criteria provides an approach to select the value of λ conditional on the number of knots, and each criterion is a function of λ.Selection of the value of K is addressed in Section 4.
For the AICc, BIC, CV, and GCV criteria, the value of λ that gives the minimum value of the criteria is taken to be a good value for the smoothing parameter conditional on the value of K.Each of these methods is dependent on the sum of squares error, as well as the effective degrees of freedom of g(x), d f λ = tr(S λ ) as defined in Buja et al. (1989), where The matrix S λ is known as the smoother matrix.The more complex or "wiggly" the estimate of g(x), the higher its The method of CV recycles the data by using training and test samples.Specifically, for each i = 1, ..., n, we obtain the estimate of g(x) that minimizes n l=1,l i conditional on the value of λ.Denote the minimizing solution based on all the data except the i th observation by g −i (x i ).This process is repeated for each observation.There are n cases that one can delete; therefore, for each λ the cross-validation score is defined as Because of the implementation, the CV method is also known as the leave-one-out method.The value of λ that gives the minimum CV score is taken to be a good choice for the smoothing parameter.If a smoother matrix exists, then the CV formula (Silverman, 1985) may be expressed as where S ii denotes the i th diagonal element of S λ and g(x i ) is the estimate that minimizes (2).Using this formula we only have to fit the data once for each λ.
Akaike's information criterion corrected, AICc, was introduced by Hurvich et al. (1998) because the commonly used Akaike's information criterion (Eubank, 1999) may have a tendency to over-fit the curve estimate for small samples.The AICc criterion may be expressed as The BIC criterion (Schwarz, 1978) may be defined as, which is similar to AICc but penalizes a model fit with a larger d f λ more strongly than the AICc for large n.
Developed by Craven and Wahba (1979), GCV may be defined by Each of these smoothing parameter selection methods consists of trading off the complexity of an estimate of g(x) against how well the model fits the data.Using an adequate number of knots, the size of the smoothing parameter controls the influence of the penalty in (2).

Knot Selection Methods
We discuss three methods given in Ruppert et al. (2003) to select the number of knots K.These methods were presented using the quantiles of the x i 's as the knot locations with GCV for smoothing parameter selection.In Section 6, we study these methods using equally spaced knots with several smoothing parameter selection methods.
The three methods to select the number knots K are the fixed selection method, Myopic algorithm, and Full Search algorithm.When using fitting criterion (2) selecting too many knots or too few knots may lead to over-fitting or under-fitting, respectively.The intent of these knot selection methods is to provide an adequate number of knots to allow a flexible enough fit when controlling the smoothness of the curve estimate with a smoothing parameter.

Fixed Selection Method
The fixed selection approach sets the number of knots to This is referred to as the default choice by Ruppert et al. (2003).

Myopic Algorithm
The Myopic algorithm is an algorithm that uses a smoothing parameter selection method.For illustration, we will use the AICc criterion.Begin with a set of candidate values for the number of knots K.In this study, we use candidate values similar to those suggested in Ruppert et al. (2003).Specifically, we set {K 1 , K 2 , K 3 , K 4 , K 5 , K 6 } = {5, 10, 20, 40, 80, 160}, assuming 160 ≤ n.If 160 > n, then reduce the maximum of the set of candidate knot values to n.For w = 1, . . ., 6, proceed as follows: 1) Fit the model with AICc to select λ using K w knots.Let λ 1 denote the smoothing parameter chosen by AICc.
2) Fit the model with AICc to select λ using K w+1 knots.Let λ 2 denote the smoothing parameter chosen by AICc.
2) For w = 1, . . ., 6, select the value of AICc(λ w ) that is the smallest, and then use K w knots.
Although we used AICc to illustrate the Myopic and Full Search algorithm, one may use any of the smoothing parameter selection methods discussed in Section 3.More details regarding these knot selection methods may be found in Ruppert (2002) and Ruppert et al. (2003).

A Simulation Study
In this section, we perform a simulation study using 12 different combinations for selecting λ and K to compare the knot selection methods when using equally spaced knots.Recall the data-driven smoothing parameter selection approaches discussed in Section 3: AICc, BIC, CV, and GCV.We use each of these smoothing parameter selection methods combined with a knot selection method (fixed knot selection method (FM), Myopic algorithm knot selection method (MA), or Full Search algorithm knot selection method (FA)) to select the number of equally spaced knots K.For comparison, we also consider a smoothing spline estimate by placing a knot at each data point, and we denote this approach as the SMS method.Note that any penalized regression or smoothing spline estimator g(x) will always depend on the chosen smoothing parameter λ and the number of knots K, but we suppress the notation of this dependence for better readability.
To compare the performance of an estimator g(x) to the true function g(x), we use the median of a measure used by Lee (2003) to compare smoothing parameter selection methods for smoothing splines.The measure is defined as (3) such that the estimate g λ (x i ) depends on the smoothing parameter λ chosen to minimize The closer the median of ( 3) is to 1, the better we consider the estimator g(x) relative to the true function g(x).
To obtain an analytical expression of the median of (3), we require the sampling distribution of the estimator g(x).However, the sampling distribution will not be analytically derivable due to the smoothing parameter selection.Therefore, for each method of choosing λ and K, we estimate the median of (3) via simulation using where the subscript l in g l denotes the corresponding estimate of g(x) for the l th simulated data set.
Using different simulation settings, (4) is obtained with L = 1000 simulated data sets under model (1).For the simulations, we use three different true functions g(x), each with three different sample sizes, as previously used in a simulation study by Wang and Wahba (1995) to study the performance of confidence intervals for smoothing splines.We denote the three forms of true function g(x) as g 1 (x), g 2 (x), and g 3 (x), and they are defined as Beta 5,10 (x), and where These curves are presented based on the ascending order of their signal to noise ratio (SNR), S D(g)/σ, as defined in Luo and Wahba (1997).
We use R (R Core Team, 2012) to carry out the simulations and to write a function to fit the TSM with any of the smoothing parameter selection methods described in Section 3, combined with a knot selection method described in Section 4. For each of L = 1000 simulated datasets, (4) is computed for each combination of g(x) ∈ {g 1 (x), g 2 (x), g 3 (x)}, n ∈ {32, 64, 128}, and σ ∈ {.3, .5}under each combination of the smoothing parameter and knot selection methods previously discussed when using equidistant knot locations.A set of observation locations, {x 1 , . . ., x n }, for this simulation study are generated randomly from a uniform distribution on the (0, 1) interval.As an illustration, Figure 1 shows a single simulated data set for each respective sample (n = 32, 64, or 128) along with its corresponding true curve (g 1 (x), g 2 (x), or g 3 (x)) when σ = .3.The simulated median estimates given in (4) for each knot selection method under the various simulation settings when the true curve is g 1 (x), g 2 (x), and g 3 (x) are provided in Tables 1, 2, and 3, respectively.
Table 1.The SED, when using each smoothing parameter selection criterion, under each knot selection method, for true curve g 1 (x) σ = .3σ = .5σ = .3σ = .5σ = .3σ = .5σ = .3σ = .2. The SED, when using each smoothing parameter selection criterion, under each knot selection method, for true curve g 2 (x) = .3σ = .5σ = .3σ = .5σ = .3σ = .5σ = .3σ = .5 AICc  3. The SED, when using each smoothing parameter selection criterion, under each knot selection method, for true curve g 3 (x) When the true curve is g 1 (x), the FM provides the lowest SED when compared to the MA and FA across all simulation settings.In addition, the FM provides a lower SED than the SMS for all but one simulation setting.Under g 2 (x), the FM provides the lowest SED when compared to the MA and FA for all but one simulation setting.When comparing the FM to the S MS , the FM tended to outperform the S MS for n = 32 and n = 128, but not for n = 64.When the true curve is g 3 (x), the FM provides the lowest SED for all cases when compared to the MA and FA.The FM also provides a lower SED than the S MS for all cases.The FM clearly outperforms both the MA and FA for almost all circumstances in our study in terms of (4).In addition, the FM provides a lower SED than the SMS for most cases.
Boxplots of the replications of D(g, g l ) for l = 1, ..., 1000 on the log scale for each simulation setting when σ = .3and σ = .5are shown in Figures 2 and 3, respectively.Note that the results under GCV with the FM tend to, more often than not, exhibit less variability that the other smoothing parameter selection methods with the FM in terms of (4).While the differences among the SED's for the each simulation setting may not seem large, we conduct all pairwise comparisons of the medians using a Kruskal-Wallis test as described in Siegel and Castellan (1988) at significance level .05 to compare the FM, MA, and FA.This approach performs Wilcox rank-sum tests on all possible pairwise comparisons of the FM, MA, and FA under a given smoothing parameter selection method, true curve, sample size, and noise level.Furthermore, we compare the FM with the S MS using a Wilcox rank-sum test.The result of all comparisons when σ = .3and σ = .5are summarized in Figures 2 and 3, respectively.
Figure 2. Replications of D( g l , g) on the log scale, for l = 1, . . ., 1000 when σ = .3under different simulation settings.Top left corner panel: Under each sample size, the four box plots contain the replications when the true curve is g 1 (x) when AICC is used for smoothing parameter section for the FM, MA, FA, and S MS , respectively.
An FM * * denotes that the median for the FM was significantly lower than the median of the MA and FA.An FM * denotes that the median for the FM was significantly lower than the MA or FA.An FM without a * or * * indicates that the median for FM was not significantly different from median of the MA or FA.An + FM denotes that the median for the FM was significantly lower than the median of the S MS .An FM without a + preceding it indicates that the median for FM was not significantly different from median of the S MS .An +S MS denotes that the median for the S MS was significantly lower than the median of the FM.Remaining panels: Analogous top left corner panel for the true curve and smoothing parameter selection method shown in the title of the panel Under σ = .3,the tests indicate that the FM has a lower SED than the MA and FM under the true curves g 1 (x) and g 3 (x) for a given sample size and smoothing parameter selection method.Under g 2 (x), we find the FM has an SED that is lower or no different than the SED under the MA and FA.When comparing the FM and S MS , the results indicate that the FM performed just as well or better for most cases.For σ = .5, the results show the FM performs better or not significantly different than the MA and FA across all settings.Furthermore, the results indicate that the FM performed just as well or better than the S MS for all cases.
Tables 4, 5, and 6 show the median number of knots selected, when using each smoothing parameter selection criterion, under each knot selection method, for true curve g 1 (x), g 2 (x), and g 3 (x), respectively.In addition to the FM providing favorable results based on (4), the FM also uses less knots than the S MS .This shows that the FM is more computationally attractive in terms of model fitting.Tables 4, 5, and 6 also provide the median computational time, in terms of elapsed time in seconds with our computing resources, when using each smoothing parameter selection criterion, under each knot selection method, for true curve g 1 (x), g 2 (x), and g 3 (x), respectively.The results show that the FM has a lower computational time than the MA, FA, and S MS .The difference in time between the methods increases as the sample size increases.Boxplots of the smoothing parameter selected on the log scale for each simulated data when σ = .3and σ = .5are shown in Figures 4 and 5, respectively.Under a given true curve and smoothing parameter selection method, the selected smoothing parameters exhibit similar variability for a given sample size.We also note that the variation in the smoothing parameters selected tends to be smaller for the larger sample sizes.This shows that when there is more information, the smoothing parameter is less variable.
Table 4.The median number of knots selected, when using each smoothing parameter selection criterion, under each knot selection method, for true curve g 1 (x).The corresponding median computing time is shown in parentheses below the median number of knots selected σ = .3σ = .5σ = .3σ = .5σ = .3σ = .5σ = .3σ = .

Conclusions
We compared the performance of three knot selection methods using the median of (3) to provide guidance in selecting the number of equidistant knots in a penalized regression spline model.In addition, for comparison, we also considered a smoothing spline estimate by placing a knot at each data point.One should keep in mind that our findings arise from a simulation study.However, under our simulation settings, the FM tended to perform just as well or better than the MA, FA, and S MS across all smoothing parameter selection methods considered.The boxplots given in Figures 2 and 3 show that GCV with the FM generally exhibits lower variability than the other smoothing parameter selection methods with the FM in terms of (4).Since the smoothing parameter selection methods considered have the same computational complexity, GCV should be preferred.Overall, our results suggest that when using equidistant knots in a penalized regression spline, the default rule provided by the FM with GCV should be used.

Figure 1 .
Figure 1.Top row: g 1 (x) (left panel), g 2 (x) (center panel) and g 3 (x) (right panel) and a corresponding single simulated data set (one of the l simulated datasets) of size n = 32 when using σ = .3.Second and third rows: analogous for n = 64 and n = 128 for a single simulated dataset

Figure 4 .
Figure 4.The smoothing parameter λ selected on the log scale for each simulated data set, when σ = .3under different simulation settings.Top left corner panel: Under each sample size, the three box plots contain the smoothing parameter selected when the true curve is g 1 (x) when AICC is used for smoothing parameter section for the FM, MA, FA, and S MS , respectively.Remaining panels: Analogous top left corner panel for the true curve and smoothing parameter selection method shown in the title of the panel

Table 5 .
Analogous to Table 4 for true curve g 2 (x)

Table 6 .
Analogous to Table 4 for true curve g 3 (x)