Approximating an Infinite Horizon Model in the Presence of Optimal Experimentation



Introduction
Recently there has been renewed interest in optimal experimentation, referred to in the engineering literature as active learning; see, e.g., Amman and Tucci (2020), Buera et al. (2011), and Savin and Blueschke (2016). There are two dominant approaches for solving this class of models: the first is based on the value function and the second on an approximation method. The former uses Bellman's (1957) dynamic programming approach for the closed loop (value function) form of the problem, as in studies by Prescott (1972), Taylor (1974), Easley and Kiefer (1988), Kiefer (1989), Kiefer and Nyarko (1989), Aghion et al. (1991), Beck and Wieland (2002), Coenen et al. (2005), Levin et al. (2003), Wieland (2000) and many more.
In principle, the value function approach is theoretically the preferred method as it derives the optimal values for the policy variables. Unfortunately, it suffers from the curse of dimensionality and is applicable only to problems of low dimension, because the solution space needs to be discretized. The approximation methods described in Cosimano (2008), Cosimano and Gapen (2005), Kendrick (1981) and Hansen and Sargent (2007) use approaches that are applied in the neighborhood of the linear regulator problem (Note 1). Because of this local nature with respect to the random elements, the method allows for models of larger dimension.
In Amman and Tucci (2020) both the value function approach and the approximation method are used to solve the same problem and their solutions are compared. For this purpose we used the common testbed model presented in MacRae (1975) and Beck and Wieland (2002) (Note 2). In that paper the focus is on comparing the policy function results reported in Beck and Wieland (2002), obtained through the value function, with those obtained through an approximation method. In this paper we present the full derivation of the testbed model, thereby providing insight into the nature of the approximation approach.
The problem is to find the controls that minimize the expected cost

$$J_N=\mathrm{E}_0\!\left[\sum_{t=0}^{N}\Big(w_t\,(x_t-\tilde{x}_t)^2+\lambda_t\,(u_t-\tilde{u}_t)^2\Big)\right] \qquad (1)$$

where $\mathrm{E}_0$ is the expectation operator conditional on the information available at time 0, subject to

$$x_{t+1}=\alpha x_t+\beta u_t+\gamma+\varepsilon_{t+1} \qquad (2)$$

with $x_t$ and $u_t$ the state and control variables, respectively, and the tilde indicating the desired path of the specified variable. Also α, β and γ are the parameters of the system equation and $\varepsilon_t$ is an error term identically and independently distributed (i.i.d.) normal with mean zero and variance q. Finally, the initial state $x_0$ and the penalty weights (the w's and λ's) are given constants. The parameter β associated with the control is assumed constant but unknown, with mean $\hat\beta_t$ and variance $\sigma^2_t$ at time t. Also, the state is measured without error (Note 3).
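To make the setup concrete, the following sketch simulates the system equation (2) and updates the estimate of the unknown β recursively as new states are observed without error. It is a minimal sketch: the parameter values and the placeholder control rule are our own illustrative choices, not the calibration used in the paper.

```python
import numpy as np

# Minimal sketch of the BWM system equation (2) with recursive learning of the
# unknown control multiplier beta. All numerical values are hypothetical
# illustrations, not the paper's calibration.
rng = np.random.default_rng(0)
alpha, beta_true, gamma, q = 0.7, -0.5, 0.0, 1.0   # q = Var(eps)
b, sigma2 = -0.3, 0.5          # mean and variance of the beta estimate at t = 0
x = 1.0                        # initial state x_0, measured without error

for t in range(10):
    u = -0.5 * x               # placeholder control rule, for illustration only
    x_next = alpha * x + beta_true * u + gamma + rng.normal(0.0, np.sqrt(q))
    # Observing x_{t+1} exactly identifies beta up to the noise eps, so the
    # estimate is updated with the standard Gaussian (Kalman-type) formulas.
    innovation = x_next - (alpha * x + b * u + gamma)
    gain = sigma2 * u / (u * u * sigma2 + q)
    b += gain * innovation
    sigma2 *= q / (u * u * sigma2 + q)
    x = x_next
    print(f"t={t + 1:2d}  b_hat={b:+.3f}  sigma2={sigma2:.4f}")
```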
Following the Tse and Bar-Shalom (1973) methods for solving active learning stochastic control problems, Tucci et al. (2010) compute, for each time period, the approximate cost-to-go at different values of the control and then choose the value which yields the minimum approximate cost (Note 4). This approximate cost-to-go is decomposed into three terms and, for the present problem, written as

$$J_N=J_{D,N}+J_{C,N}+J_{P,N} \qquad (3)$$

where $J_N$ is the total cost-to-go with N periods remaining and $J_{D,N}$, $J_{C,N}$ and $J_{P,N}$ are the deterministic, cautionary and probing components, respectively. The deterministic component includes only terms which are not stochastic. The cautionary component includes uncertainty only in the next time period, and the probing term contains uncertainty in all future time periods. Thus the probing term captures the motivation to perturb the controls in the present time period in order to reduce future uncertainty about parameter values (Note 5).
In the following pages, this model is rewritten as an infinite horizon model and the associated formulae for the approximate cost-to-go are derived. The problem now is to find the set of controls $u_t$, for t = 0, 1, ..., ∞, where t = 0 denotes the current period, which minimizes the quadratic functional

$$J_\infty=\mathrm{E}_0\!\left[\sum_{t=0}^{\infty}\rho^{t}\big(w\,x_t^2+\lambda\,u_t^2\big)\right] \qquad (4)$$

with the desired paths for the state and control set equal to 0, i.e. $\tilde{x}_t=\tilde{u}_t=0$, subject to the system equation (2) and with $w_t=\rho^t w$ and $\lambda_t=\rho^t\lambda$, where ρ is the discount factor between 0 and 1.
The control problem (2) and (4) is solved by treating the stochastic parameter as an additional state variable, as in Kendrick (1981; 2002, Chapter 10), and restating it in terms of an augmented state vector $z_t=(x_t,\ \beta)'$: find the controls $u_t$ for t = 0, 1, ..., ∞ minimizing

$$J_\infty=\mathrm{E}_0\!\left[\sum_{t=0}^{\infty}\rho^{t}\big(z_t' W z_t+\lambda\,u_t^2\big)\right] \qquad (5)$$

with W having w in the top left corner and zeros elsewhere, subject to the discrete-time system equations, with no measurement equation,

$$z_{t+1}=f(z_t,u_t)+\zeta_{t+1} \qquad (6)$$

with the arrays defined as

$$f(z_t,u_t)=\begin{bmatrix}\alpha x_t+\beta u_t+\gamma\\ \beta\end{bmatrix},\qquad \zeta_{t+1}=\begin{bmatrix}\varepsilon_{t+1}\\ 0\end{bmatrix}. \qquad (7)$$

Problems (2) and (4) and (5)-(7) are equivalent; "however the first is described as a linear quadratic problem with random coefficients and the second as a nonlinear (in x, u and β) stochastic control problem" as noted in Kendrick (1981; 2002, p. 94).
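The nonlinearity noted by Kendrick is easy to see in code: the augmented transition multiplies the "state" β by the control. A minimal sketch of (6)-(7), with function names and parameter values of our own choosing:

```python
import numpy as np

# Sketch of the augmented transition (6)-(7): z = (x, beta)'. The product
# beta * u makes the problem nonlinear in (z, u) even though the original
# system is linear in x and in u separately. Names and values are illustrative.
def f_augmented(z, u, alpha=0.7, gamma=0.0):
    x, beta = z
    return np.array([alpha * x + beta * u + gamma, beta])

def jacobian_z(z, u, alpha=0.7):
    # Derivative of f with respect to z, evaluated on a nominal path; this is
    # the kind of linearization the approximation method operates on.
    return np.array([[alpha, u],
                     [0.0, 1.0]])

z0 = np.array([1.0, -0.5])
print(f_augmented(z0, 0.8))
print(jacobian_z(z0, 0.8))
```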

One-Period Ahead Projection of the Mean and Variance of the Augmented State Vector Z
For this simple model the one-period ahead projection of the mean of the augmented state vector, after the control at time zero is applied, is

$$\hat{z}_{1|0}=\begin{bmatrix}\hat{x}_{1|0}\\ \hat\beta_{1|0}\end{bmatrix}=\begin{bmatrix}\alpha x_0+\hat\beta_0 u_0^{\tau}+\gamma\\ \hat\beta_0\end{bmatrix} \qquad (8)-(9)$$

where $\hat\beta_0$ is the estimate of the unknown parameter at time 0, with estimated variance denoted $\sigma^2_0$ to save on notation, $x_0$ is the initial condition for the state, and $u_0^{\tau}$ is the search control at iteration τ, with the Certainty Equivalence (CE) solution being the first search control, i.e. $u_0^{1}=u_0^{CE}$. The projected mean of the parameter is equal to its current estimate because the unknown parameter is assumed constant.
For the model presented in Beck and Wieland (2002) and MacRae (1975) (BWM) with no measurement error, the projected variances look like (Note 6)

$$\Sigma_{1|0}=\begin{bmatrix}\sigma^{xx}_{1|0} & \sigma^{x\beta}_{1|0}\\ \sigma^{\beta x}_{1|0} & \sigma^{\beta\beta}_{1|0}\end{bmatrix}=\begin{bmatrix}(u_0^{\tau})^2\sigma^2_0+q & u_0^{\tau}\sigma^2_0\\ u_0^{\tau}\sigma^2_0 & \sigma^2_0\end{bmatrix}. \qquad (10)$$
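The projection step, as reconstructed in (8)-(10), amounts to a few array operations. A minimal sketch, with names and numerical values our own:

```python
import numpy as np

# One-period-ahead projection of the augmented state, cf. (8)-(10): the mean
# propagates through the system equation with beta at its current estimate,
# and the (x, x) entry of the covariance picks up the disturbance variance q.
def project(x0, b0, sigma2_0, u, alpha=0.7, gamma=0.0, q=1.0):
    z_mean = np.array([alpha * x0 + b0 * u + gamma, b0])      # (8)-(9)
    z_cov = np.array([[u * u * sigma2_0 + q, u * sigma2_0],   # (10)
                      [u * sigma2_0,         sigma2_0]])
    return z_mean, z_cov

mean, cov = project(x0=1.0, b0=-0.5, sigma2_0=0.5, u=0.8)
print(mean)
print(cov)
```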

The Nominal Path for the State and Control
At this point the nominal, or CE, path for the state and control is needed. It is obtained by solving the CE problem for the un-augmented system from time 1 on, using $\hat{x}_{1|0}$ as initial condition and the nominal path for the parameters. Given that in the present case all parameters are assumed constant, at this stage the estimate $\hat\beta_0$ is treated as the true parameter for all future periods. When the conditions for the existence of an infinite horizon solution are satisfied, see, e.g., DeKoning (1982) and Hansen and Sargent (2007, section 4.2.1), the optimal control law is time invariant, i.e. $u_j=G\hat{x}_{j|0}+g$ (Note 8). Following the results in Tucci et al. (2010), it can be shown by repeated substitution that in the infinite horizon problem the j-th nominal control can be written as the sum of two components (Appendix A): one associated with $\hat{x}_{1|0}$, and therefore depending upon the control applied at time 0, and the other due solely to the system parameters and exogenous forces, in this case the constant term γ; this gives equation (19). In the special case where γ = 0, the nominal state and control are simply

$$\hat{x}_{j|0}=(\alpha+\hat\beta_0 G)^{\,j-1}\hat{x}_{1|0} \qquad (20)$$

and

$$u_j=G\,(\alpha+\hat\beta_0 G)^{\,j-1}\hat{x}_{1|0}. \qquad (21)$$
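Since the CE problem for the un-augmented system is a scalar discounted LQ problem (cf. Note 7), the fixed point of the Riccati recursion and the time-invariant feedback gain can be computed in a few lines. The sketch below uses the standard discounted linear-regulator formulas with illustrative parameter values; it is a stand-in for, not a transcription of, the paper's equations (19)-(21).

```python
# Scalar discounted LQ benchmark: iterate the Riccati recursion to its fixed
# point and form the time-invariant CE feedback rule. Standard discounted
# linear-regulator formulas; all parameter values are illustrative only.
def ce_feedback(alpha, beta, w, lam, rho, tol=1e-12, max_iter=100_000):
    k = w
    for _ in range(max_iter):
        k_new = (w + rho * alpha**2 * k
                 - (rho * alpha * beta * k) ** 2 / (lam + rho * beta**2 * k))
        if abs(k_new - k) < tol:
            break
        k = k_new
    G = -rho * alpha * beta * k / (lam + rho * beta**2 * k)   # u_j = G * x_j
    return k, G

k, G = ce_feedback(alpha=0.7, beta=-0.5, w=1.0, lam=0.1, rho=0.95)
x = 1.0                       # stands in for the initial condition x_hat_{1|0}
for j in range(1, 6):         # nominal CE path, cf. (20)-(21) with gamma = 0
    u = G * x
    x = (0.7 + (-0.5) * G) * x
    print(f"j={j}  u_CE={u:+.4f}  x_hat={x:+.4f}")
```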

Riccati Equations for the Arrays of the Augmented System
The K and p Riccati arrays of the augmented system are partitioned as

$$K_j=\begin{bmatrix}k^{xx}_j & k^{x\beta}_j\\ k^{x\beta}_j & k^{\beta\beta}_j\end{bmatrix},\qquad p_j=\begin{bmatrix}p^{x}_j\\ p^{\beta}_j\end{bmatrix},$$

where the scalar $k^{xx}_j$ corresponds to the Riccati quantity of the unaugmented system discussed in the previous section when the condition for stabilization holds, i.e. $\alpha+\hat\beta_0 G$ is stable, and γ = 0; see Appendix C and Appendix F (Note 8). The elements of the p Riccati vector are defined in Appendix C.

Updating the Covariances of the Augmented System
For the BWM problem the updating equations for the covariances of the augmented system look like (29) (Note 9); the elements of the updated covariance matrix are then defined as in (30), where the projected covariances take the form in (10) with j and j-1 replacing 1 and 0, respectively. Combining (30) and (10) yields, for j = 1, equation (31), and in general it can be shown (Appendix D) that (32) holds, with (33) and (34).
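Because the state is observed exactly, the update of the parameter variance implied by (29)-(30) reduces, in this model, to the textbook Gaussian formula: in information form, the precision of the β estimate simply accumulates $u_j^2/q$ each period. A minimal sketch of that standard update, with illustrative values:

```python
# Measurement update of the parameter variance for the BWM problem: with no
# measurement error, conditioning the projected covariance (10) on the realized
# state is a standard Gaussian update, which in information form reads
#   1/sigma2_j = 1/sigma2_{j-1} + u_j**2 / q.
# This is the textbook result for this model; values below are illustrative.
def update_sigma2(sigma2_prev, u, q=1.0):
    return 1.0 / (1.0 / sigma2_prev + u * u / q)

sigma2 = 0.5
for j, u in enumerate([0.8, 0.6, 0.4], start=1):
    sigma2 = update_sigma2(sigma2, u, q=1.0)
    print(f"j={j}  sigma2_j={sigma2:.4f}")
```

Note how larger controls shrink the variance faster: this is precisely the learning channel that the probing component of the cost-to-go prices.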

The Approximate Cost-to-Go
As in Kendrick (1981; 2002, Chapter 10) the approximate cost-to-go associated with the "search" control $u_0^{\tau}$ is decomposed into three parts: deterministic $J_D$, cautionary $J_C$ and probing $J_P$. The deterministic component for the control at time 0 is given in (35); see, e.g., equation (10.49) in Kendrick (1981; 2002). For the model at hand, equation (35) can be rewritten as (36) and, more compactly, as

$$J_D=\psi_1 (u_0^{\tau})^2+\psi_2 u_0^{\tau}+\psi_3 \qquad (37)$$

where the ψ's collect the relevant Riccati quantities. The parameters in equation (37) simplify to (38) when there is no constant term and the desired paths for the state and control are zero (Appendix E). The cautionary component looks like (39); using the definitions of the k's and rearranging terms, it becomes

$$J_C=\delta_1 (u_0^{\tau})^2+\delta_2 u_0^{\tau}+\delta_3 \qquad (40)$$

with the δ's defined as in (41), as is apparent from Appendix F once the relevant identity is used. Finally, the probing component takes the form (42). Similarly to Amman and Kendrick (1995) and Tucci et al. (2010), equation (42) can be rewritten as (43), with (44) and (45), and with the φ's defined as in (46), as shown in Appendix G. At this point, substituting (37), (40) and (43) into (3) yields the approximate total cost-to-go (47), with the parameters defined as in (38), (41) and (46). As shown in Appendix H through Appendix J, these new definitions are perfectly consistent with those associated with the two-period finite horizon model reported in Amman and Kendrick (1995) and Tucci et al. (2010).
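In practice, (47) is used by evaluating the approximate cost-to-go on a grid of candidate controls and picking the minimizer, as described in the introduction. The sketch below does exactly that, but with stylized component functions and made-up coefficients standing in for (38), (41) and (46): a quadratic deterministic part, a cautionary part increasing in projected uncertainty, and a probing part that rewards the variance reduction produced by a larger control.

```python
import numpy as np

# Grid search over the candidate control u_0: evaluate the approximate
# cost-to-go and pick the minimizer. The component functions are stylized
# stand-ins for (37), (40) and (43); all coefficients are made up.
def approx_cost(u, sigma2=0.5, q=1.0,
                psi=(1.0, 0.5, 0.0), delta=(0.8, 0.0, 9.5), phi=1.2):
    j_d = psi[0] * u**2 + psi[1] * u + psi[2]                       # deterministic
    j_c = delta[0] * (u**2 * sigma2 + q) + delta[1] * u + delta[2]  # cautionary
    j_p = phi * sigma2 * q / (u**2 * sigma2 + q)  # probing: a larger |u| lowers
    return j_d + j_c + j_p                        # the post-learning variance

grid = np.linspace(-4.0, 4.0, 2001)
costs = np.array([approx_cost(u) for u in grid])
print("u* =", grid[np.argmin(costs)], " J* =", costs.min())
```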

Numerical Example
In this section the DUAL infinite horizon control is computed using the parameter set in Beck and Wieland (2002, Figure 1, p. 1367), which translates to (48) in the present context. With this parameter set, the fixed point solution to the usual Riccati recursions for the unaugmented system is (49), and the time invariant optimal control law simplifies to (50). It follows that the relevant terms for the computation of the approximate cost-to-go described in the previous section specialize to (51) and (52), and the coefficients characterizing the deterministic, cautionary and probing components follow. By comparing these results with those associated with the two-period model reported in Tucci et al. (2010, equations 34-39), some interesting features emerge. First of all, the ψ's in the deterministic component are the same in the finite and infinite model, except that the former uses undiscounted penalty weights on the state while the latter assumes $w_t=\rho^t w$ with w = 1. The same consideration explains the slight difference between the new and old coefficient δ1 in the cautionary component and φ1 in the probing one. It is noteworthy that the coefficient δ2 in the cautionary component, and φ2 and φ3 (in absolute value) in the probing one, are identical in the finite and infinite model; these coefficients are therefore not affected by the penalty weight on the state. The main difference between the finite and infinite model lies in δ3, the constant term in the cautionary component, which jumps from 1, the variance of the system disturbance, to 9.5, approximately half the inverse of the discount rate 1 - ρ. This coefficient reflects the infinite sum of the discount factors: with ρ = 0.95, $\sum_{t=1}^{\infty}\rho^t=\rho/(1-\rho)=19$, half of which is 9.5.
The results for $J_D$, $J_C$, $J_P$ and $J_\infty$, using the above parameters, are plotted in Figure 1; for certain levels of the parameters the total cost-to-go is not globally convex. As mentioned earlier and in Amman and Kendrick (1995), the solution may suffer from non-convexities and can have several (local) minima for certain parameter sets. Figure 2 shows clearly that as the uncertainty about the parameter β increases, the chance of encountering multiple minima increases. Hence, caution is required when doing numerical optimization with the model.
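The non-convexity is easy to reproduce with the stylized cost from the previous sketch: counting interior local minima on a grid for increasing values of the parameter variance shows a single minimum splitting into several. The functional form and the numbers are, again, our own illustration, not the paper's calibration.

```python
import numpy as np

# Count interior local minima of the stylized approximate cost-to-go as the
# parameter variance sigma2 grows: the probing "bump" around u = 0 eventually
# creates multiple local minima. Functional form and values are illustrative.
def approx_cost(u, sigma2, q=1.0):
    return (1.0 * u**2 + 0.5 * u                       # deterministic (stylized)
            + 0.8 * (u**2 * sigma2 + q) + 9.5          # cautionary (stylized)
            + 1.2 * sigma2 * q / (u**2 * sigma2 + q))  # probing (stylized)

grid = np.linspace(-4.0, 4.0, 2001)
for sigma2 in (0.1, 0.5, 2.0, 8.0):
    c = np.array([approx_cost(u, sigma2) for u in grid])
    mins = [float(round(grid[i], 2)) for i in range(1, len(grid) - 1)
            if c[i] < c[i - 1] and c[i] < c[i + 1]]
    print(f"sigma2={sigma2:4.1f}  local minima near u = {mins}")
```

A single local search started from the CE solution can land in the wrong basin once several minima appear, which is the practical reason for the caution urged above.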

Conclusion
By applying a well-known testbed model, we presented the full derivation of an approximation approach to the optimal experimentation problem, thereby providing insight into the nature of the approximation relative to the value function solution. The appropriate Riccati quantities for the augmented system have been derived and the time-invariant feedback rule defined. The resulting formulas are easy to compute and allow problems of higher dimension to be solved in feasible time.
Due to the local nature of the approximation, caution is required when the model exhibits a high degree of stochasticity, as reflected in the various (co)variances in the model. With high levels of randomness, the solution may produce multiple local optima.

Notes
Note 1. For consistency and clarity in the main text, we use the term approximation method instead of adaptive or dual control. The adaptive or dual control approach in MacRae (1975), see Kendrick (1981), Amman (1996) and Tucci (2004), uses methods that draw on earlier work in the engineering literature by Bar-Shalom and Sivan (1969) and Tse (1973). There are differences between this approach and the approximation approaches in Cosimano (2008) and Savin and Blueschke (2016) which we will not discuss in detail here. Throughout the paper we use the approach in Kendrick (1981).
Note 2. Throughout the paper we will use the abbreviation BWM for the testbed model.
Note 3. This is equivalent to setting H = I and R = O in Kendrick (1981; 2002, Chapters 10-11) or Tucci (2004, Chapters 2-5).
Note 7. In this case the Riccati equation is a scalar function and can easily be solved. The multi-dimensional case can be more complicated to solve; see Amman and Neudecker (1997).
Note 8. This compares with the analogous G and g quantities in the two-period finite horizon model. The double summation on the right-hand side is equal to (B-19) when the system is stable and ρ < 1, so that (B-20) follows; the component then depends only upon $g_{0,1}\equiv g$ and $(g_{0,2}-g_{0,1})\equiv(g_{0,2}-g)$ for j = 1, ..., ∞.

Note 9. See, e.g., Kendrick (1981; 2002, Chapter 10, p. 103) or Tucci (2004, Chapter 2, pp. 27-28) for details.
