Strict Interpolation of a Smooth Function and Its First Derivative Using a Linearly-Trained Radial Basis Function Neural Network

We present a neural network, based on Gaussian functions, for interpolating a univariate function and its ﬁrst derivative. The network is linearly trained, and constitutes a continuous piecewise approximation. It is based on the superposition of three standard Gaussian-based radial basis function networks. Analysis indicates that this network is a better approximation than the standard network.


Introduction
Neural network approximation of continuous functions typically involves training algorithms that use discrete values of the function only, rather than its first derivative, in addition.In this paper, we suggest a linear algorithm for training a network to approximate a smooth function, using values of the function and its first derivative at discrete points.In a sense, we are trying to reproduce the idea of Hermite polynomial interpolation in the context of neural network approximation.
In the next section, we briefly describe standard radial basis function neural network interpolation.Thereafter, we describe our algorithm, discuss its stability present an error analysis, consider the possibility of optimizing the network, and present some numerical examples.[Bishop, 2000;Haykin, 1999] Let f : R → R be a continuous function.Assume that (x i , f (x i )) , i = 1, ..., N are available, where the nodes x i are unique.The data (x i , f (x i )) are used to construct an approximation of the form

Standard Radial Basis Function Neural Network interpolation
where the w i are referred to as weights and the φ i are N linearly independent radially symmetric basis functions (also called neurons).The linear combination above is called a radial basis function neural network (RBFNN).The most commonly used basis functions are Gaussians where a i > 0 is the width parameter and c i is the center.Note that where σ i is the standard deviation of the Gaussian.Often, σ i is chosen as the minimum of or some average of {(x i − x i−1 ), (x i+1 − x i )} so that, if the nodes are equispaced, a i is the same for each basis function.In this work we will consider the width parameter to be an adjustable parameter whose value will be determined subject to an optimizing condition.
Determination of the weights is referred to as training the network.In the case of strict interpolation (sometimes known as exact interpolation), we require that R(x i ) = f (x i ) at each of the available nodes x i .So, if there are N distinct nodes we have in which the interpolation matrix Φ, the weight vector W and the target vector F have been defined.Solvability of this system depends on the determinant of the interpolation matrix; it has been shown that Φ is invertible for several types of basis function, including Gaussians [Michelli, 1986], provided that the nodes are distinct (for non-strict interpolation, when the number of nodes is different to the number of basis functions, the matrix Φ is not square and the linear system is solved by taking the pseudoinverse of Φ).
The strict interpolation problem described above is thus a straightforward one -invertibility is ensured for Gaussian basis functions, and solving linear systems computationally is easily accomplished using modern numerical software.
Our objective is to design a strict interpolation RBFNN that is trained linearly and that also interpolates the first derivative of f (x), in the spirit of Hermite polynomial interpolation [Mhaskar & Pai, 2000].Of course, one expects, as in the case of Hermite interpolation, that interpolation of the first derivative, in addition to the function itself, will result in a more accurate approximation.To the best of our knowledge, this has not been attempted before, and so constitutes a gap in the body of research in this field.It is our intention to attempt to fill this gap.Our algorithm, designated the superposition network (SPN), is described in the next section.

Description of the SPN algorithm
Assume that x i , f i , f i , i = 1, ..., N are available, where f i ≡ f (x i ) and f i ≡ f (x i ).An algorithm for constructing an RBFNN approximation to f (x) on a subinterval [x i , x i+1 ], which interpolates (x i , f i ) , x i , f i , (x i+1 , f i+1 ) and x i+1 , f i+1 proceeds as follows: 1. Choose a width parameter a i by simply taking σ i to be the average spacing h ave between the nodes on [ We then have 2. Construct three standard RBFNNs, using width parameters a i √ 2 , a i and 3 2 a i .These networks each interpolate (x i , f i ) and (x i+1 , f i+1 ) .The number of neurons in each network should be at least two, although we will use four in this work, centered at the nodes {x i−1 , x i , x i+1 , x i+2 }.We will refer to these networks as component networks, and we denote them by R 1 , R 2 and R 3 .These component networks are constructed on the subinterval [x i−1 , x i+2 ] and each of them interpolates ] to determine the width parameter a i .It is important to note that the guaranteed inversion of Φ ensures that these three component networks do exist.

We now determine
for v ∈ {i, i + 1} and p = 1, 2, 3.These Δs are the differences between the derivatives of the component networks and the objective function f (x) at the nodes x i and x i+1 , and we will refer to them as derivative residuals.
4. We now seek α, β and γ such that Hence, which may be written as 5. If we now form the linear combination Our notation also emphasizes the dependence of R i,i+1 , R 1 , R 2 and R 3 on the parameter a i .
6.For the subintervals at the extreme left and extreme right of the interval, [x 1 , x 2 ] and [x N−1 , x N ], the same procedure applies except that we determine the derivative residuals at {x 1 , x 2 } and {x N−1 , x N } , respectively.
7. We apply the above procedure on each subinterval, so generating N − 1 networks R i,i+1 , i = 1, ..., N − 1.If we now define 8. Note that since R S PN is a linear combination of standard RBFNNs, any convergence results that hold for a standard RBFNN [Williamson, 1995;Liu & Si, 1994;Park & Sandberg, 1991] also hold for R S PN .
9. Comments: The choice of width parameters a i √ 2 , a i and 3 2 a i for the component networks is somewhat arbitrary.Whatever values are used, they must be such that the derivative residuals for the component networks are not identical -otherwise we run the risk of having a row or column of zeros in the coefficient matrix T in Step 4 (see next section).Also, the number of neurons in each component network could be as high as N. We do not need to base the approximation on subinterval approximation; it is easy to extend this idea to a superposition of N + 1 networks, each of which interpolates f (x) on the entire set (x i , f i ) , although the resulting coefficient matrix T is large (N × N).The virtue of a 2 × 2 coefficient matrix T is that it is potentially easier to ensure invertibility (see next section).From a stability point of view, the inversion of T could present a problem if it is not well-conditioned.Our approach in this work is subinterval (piecewise) approximation.

Invertibility of the matrix Φ
In consideration of the stability of SPN, we note that as a i → 0, Φ tends to the unit matrix, which is singular.However, as mentioned earlier, for nonzero width parameters Φ is invertible.Nevertheless, we should be careful not to choose width parameters too small, lest Φ becomes ill-conditioned.
In a 'direct' interpolation algorithm, the interpolation matrix would contain the first derivative of the Gaussians, in addition to the Gaussians themselves.As far as we are aware, no guarantee can be given for the invertibility of such a matrix; this is one of our prime motivations for developing SPN, in which we can rely on the guaranteed invertibility of Φ.

Invertibility of the matrix T
We have As a i → ∞, T → 0, since the derivative of a Gaussian at its center is zero, and the derivative of a Gaussian at any other point tends to zero as a i → ∞.This means that the entries of the superposition matrix T tend to zero and so T becomes singular.Also, as a i → 0, the Gaussians become horizontal on [x i , x i+1 ] , in which case all derivatives in T are zero.
There is a special case that must be considered.If the objective function is symmetric or antisymmetric on the interval and the nodes {x i−1 , x i , x i+1 , x i+2 } are uniformly distributed, then the component networks R 1 , R 2 and R 3 will also be symmetric or antisymmetric (because the Gaussian basis functions are symmetric).Examples of such functions are sin(x) on [0, π] (symmetric) and sin(x) on [0, 2π] (antisymmetric).For the antisymmetric case we would have which is singular.For the symmetric case we would have which is also singular.Naturally, the symmetry can be broken simply by choosing nodes that are not symmetrical about the centre of the interval.
However, if it is not possible to redefine the nodes, we can still offer a solution in the form of the linear system where the first equation comes from the condition This condition is motivated by requiring that the coefficient of the network R 3 is the average of the coefficients of R 1 and R 2 , although any other similar condition could be imposed.The coefficient matrix in this system has determinant We must show that this determinant is nonzero: assume and, since the Gaussian functions are linearly independent, for each j.Now, since w j and z j cannot be expected to be identically zero in all cases, we must have a 1 = 0 and a 2 = 0 (24) in order for (23) to be generally true.But the widths of the component networks are intentionally chosen to be nonzero.Thus, we have a contradiction and we conclude We would resort to the system ( 16) only if we detected the presence of the type of (anti)symmetry problem described here.

Stability test
For brevity we will write R(x; a i ) instead of R i,i+1 (x; a i ) from now on.If either of the matrices T and Φ is illconditioned, SPN will fail to properly interpolate ( ) gives an indication of the stability of SPN for any given width parameter a i .A tolerance on these values could be specified so that if this tolerance is breached, the corresponding a i -value is rejected.
Indeed, we define the end-point-conditions (EPCs) by The denominators in these expressions contain a 1 in case f i or f i+1 are close to zero.These EPCs measure the extent to which SPN properly interpolates f i , f i+1 , f i and f i+1 , as is intended.

Error analysis
We have that R(x; a i ) approximates f (x), such that f = R at {x i−1 , x i , x i+1 , x i+2 } and f = R at {x i , x i+1 } .Hence, the function Ω(x; a i ) ≡ f (x) − R(x; a i ) has two roots of multiplicity one at {x i−1 , x i+2 } and two roots of multiplicity two at {x i , x i+1 } .So, we may write where Υ(x; a i ) is a function to be determined.

Now, define the function F(z) by
So F(z) = 0 at {x i−1 , x i , x i+1 , x i+2 } and at some x {x i−1 , x i , x i+1 , x i+2 } , a total of five distinct points in [x i−1 , x i+2 ].These five points define four adjacent subintervals, and somewhere on each subinterval there is at least one point such that F (z) = 0 (by Rolle's theorem).Thus, there are four points distinct from the nodes at which F (z) = 0, and F (z) = 0 also at x i and x i+1 .Hence, F (z) = 0 at six distinct points.This means that F (z) = 0 at five distinct points and so on, until we have that F (6) (z) = 0 at one point in [x i−1 , x i+2 ] , say z = ζ(x; a i ).We indicate that ζ can generally be expected to be dependent on x and a i .
Therefore, we have since the sixth derivative of (x and so This is true for all It is easily shown that which is a sixth-order bound, wherein h max denotes the maximum separation of the nodes x i−1 , x i , x i+1 and x i+2 .Note that we have assumed that f (x) is suitably differentiable in this derivation; certainly, R (x; a i ) is infinitely differentiable w.r.t.
x since it is a linear combination of Gaussians.
Other upper bounds may be derived in a similar manner by assuming any of the following: The first two of these lead to second-order error expressions, and the last one leads to a fourth-order expression and, consequently, second-and fourth-order bounds.Clearly, these bounds are less stringent than the sixth-order bound in (33).It must be noted here that the standard four-neuron network on [x i−1 , x i+2 ] cannot have error better than O h 4 max , so we see that R S PN is expected to be a more accurate approximation than the standard network.

Optimal a i
The traditional approximation problem is to replace a given f (x) with a linear combination of preselected basis functions.In such case, f (x) is known explicitly and it is easy to measure the quality of R(x; a i ).We simply determine the cost function , where M is large.We then determine C(a i ) for a number of a i -values in some interval [a i min , a i max ] and settle on that a i -value which gives a minimum C(a i ).This is, admittedly, a brute force approach, but it is effective in a numerical context.However, it must be noted that this approach is easy and fast, due to the linear character of SPN.
If f (x) is not known explicitly, but rather only values of f (x) are available at discrete nodes x i , it is more difficult to estimate C(a i ) reliably.If the f i are close together it might be a good idea to choose [x i , x i+1 ] such that this subinterval actually contains Q values of f (x) at nodes x q between x i and x i+1 .These data points are not used in the construction of R(x; a i ), but rather are used to determine This idea of omitting some data for the purpose of quality control -termed generalization -is often used in neural networking.Other cost functions that could be used are We will use both ( 35) and (37) in our examples in the next section.

Numerical examples
We have approximated f (x) = e x on 4 3 , 5 3 , f (x) = sin(x) on π 6 , π 3 , and f (x) = x on [0, 1] .For f (x) = e x SPN was about three orders of magnitude better than the standard network; for f (x) = sin(x) SPN was about two orders better; and for f (x) = x SPN was about three orders better.
Figure 1 shows the cost C(a) defined in (37) for the f (x) = e x example, for both the standard RBFNN with four neurons (denoted STN from now on), and SPN.Clearly, there is a range of a-values for which SPN is considerably more accurate than the standard algorithm.The jagged effect for small a-values is due to the ill-conditioning of the interpolation matrix Φ.The quantity a h indicates the heuristic width parameter given by 1 √ 2h ave as described previously.We see that the optimal width parameter is different to a h .Figure 1 is typical of the results obtained for the other numerical examples.
Figure 2 shows the EPCs for the f (x) = e x example.In this figure 'Function EPC' is Γ (a i ) , and 'Function derivative EPC' is Γ (a i ) .We see that the EPCs become large for both small and large values of a, and that there is an intermediate range of a-values for which the EPCs are of the order of machine precision (∼ 10 −16 ).
We have also approximated the function which has both a strong and a weak maximum on the interval [0, 1].Our approach is to choose 100 evenly spaced nodes on [0, 1], so that [0, 1] is divided into 99 subintervals.We apply SPN and STN to find approximations on each subinterval, each optimized with respect to the width parameter a i .Figure 3 shows the cost function (35) on each subinterval for STN and SPN.Clearly, SPN is better than STN everywhere, and generally by several orders.Figure 4 shows the optimal width parameters per subinterval for this example, for both STN and SPN, demonstrating that they can differ from subinterval to subinterval.
Lastly, we approximate sin x on [0, π] at the equispaced nodes 0, π 3 , 2π 3 , π , using the heuristic width parameter, so as to illustrate the use of ( 16).Because of the symmetric character of sin x on this interval, and the uniform spacing of the nodes, the matrix T is singular.However, we find that the coefficient matrix in ( 16) has determinant of −0.1114... and condition number ∼ 18, and so is reliably inverted.Hence, we find that STN has a maximum error of 0.05, and SPN has a maximum error of 8 × 10 −4 , on π 3 , 2π 3 .

Conclusion
The use of RBF networks to approximate nonlinear functions, using a discrete set of values of both the function and its first derivative has, to the best of our knowledge, not been attempted before.In this paper, we have addressed this issue by developing a strict interpolation approximation algorithm based on Gaussian basis functions, that not only interpolates the target function but also its first derivative.The approximation is piecewise, and is based on a superposition of three standard RBF networks.Coefficients in the superposition are determined linearly, using a discrete set of values of both the function and its first derivative as training data.By implementing the algorithm in such a manner, we are able to exploit the guaranteed invertibility of the coefficient matrices associated with standard RBF networks.In the special case of approximating a symmetrical/antisymmetrical function, a variation in the training algorithm has been given that avoids potential singularities in the relevant coefficient matrix.The possibility of using cost functions to determine an optimal width parameter has been suggested.Stability of the algorithm has been investigated, and a simple test for instability, w.r.t. the width parameter, has been proposed.A theoretical derivation of the approximation error indicates that the algorithm can be expected to be sixth-order in the node spacing h, while the standard network is only fourth-order.Numerical examples demonstrate the superiority of the algorithm over the standard approach, as anticipated.

Figure 2 .
Figure 2. End-point-conditions (EPCs) vs width parameter for SPN when approximating f (x) = e x