A Q-learning Approach to a Consumption-Investment Problem

The consumption-investment problem initially arose in work on the portfolio problem by Samuelson (1969). Samuelson considers an investor who wants to increase wealth at each point in time and wishes to allocate this wealth between investment and consumption. The objective is then to maximize the investor's utility of consumption, where the investor chooses the best consumption-investment strategy for allocating the total investment among various assets. See (Mendelssohn & Sobel, 1980; White, 1993; Cruz-Suárez, Montes-de-Oca, & Zacarías, 2011; Vitoriano & Parlier, 2017).


Introduction
The consumption-investment problem initially arose in work on the portfolio problem by Samuelson (1969). Samuelson considers an investor who wants to increase wealth at each point in time and wishes to allocate this wealth between investment and consumption. The objective is then to maximize the investor's utility of consumption, where the investor chooses the best consumption-investment strategy for allocating the total investment among various assets. See (Mendelssohn & Sobel, 1980; White, 1993; Cruz-Suárez, Montes-de-Oca, & Zacarías, 2011; Vitoriano & Parlier, 2017).
The main objective of this study is to propose a procedure based on machine learning techniques to approximate the investment strategy in a consumption-investment problem. Machine learning is the study of computer algorithms that automatically improve through experience. Machine learning algorithms build a mathematical model based on sample data, known as training data, to make predictions or decisions without being explicitly programmed to do so. The Q-learning technique belongs to this class of algorithms; it is a reinforcement learning algorithm introduced by Watkins in 1989 (Watkins, 1989). Q-learning is an asynchronous dynamic programming method that allows the controller to learn to act optimally in Markovian domains. Watkins and Dayan (1992) prove that Q-learning converges to the optimum action values with unit probability. Therefore, Q-learning is an adequate technique to implement in the solution of the consumption-investment problem. In a closely related study, Weissensteiner (2009) investigated the optimal consumption problem with a finite planning horizon, where scenarios were simulated via a geometric Brownian motion. This work proposes using Q-learning combined with classic value iteration algorithms for Markov decision processes in discrete time with the Robbins-Monro stochastic approximation method (Robbins & Monro, 1951): by defining and optimizing an auxiliary function and discretizing the elements of interest, we achieve an approximation of the optimal policy. To this end, a logarithmic function is assumed for the utility of consumption, see (Liang & Ma, 2019), and the states of the system satisfy a discrete dynamic system (see (5), below). Moreover, we consider a continuous state space and a continuous action space, and dedicate a section to the process of discretizing such spaces so they can be adapted to Q-learning.
The conditions that guarantee the existence of a solution are covered, both for the original problem and for the convergence of the Q-learning approach. A consumption-investment problem with finite horizon is presented, and a numerical solution via the Q-learning approach is provided.
The paper is organized as follows. Section 2 states the problem, considering the source of the law of capital transition through the Langevin equation. Section 3 illustrates the process of discretization of state and action spaces. Section 4 presents the Q-learning method. Finally, in Section 5, the theory is implemented in a numerical example, see (Cruz-Suárez, Montes-de-Oca, & Zacarías, 2011).

Problem Statement
Let {W_t, t ≥ 0} be a Brownian motion with respect to a filtered probability space (Ω, F, P, (F_t)_{t≥0}), where F_t = σ{W_s : 0 ≤ s ≤ t}. Consider {Y_t, t ≥ 0} a stochastic process defined on {F_t : t ≥ 0} that satisfies the following stochastic equation:

dY_t = −µY_t dt + σ dW_t, 0 ≤ t ≤ T, (1)

where µ, σ, T ∈ R_+, with initial condition Y_0 = y ∈ R. Expression (1) is known as the Langevin equation, and {Y_t, t ≥ 0} is the Ornstein-Uhlenbeck process with state space R, see (Mikosch, 1998). Equation (1) can be rewritten in integral form as

Y_t = y − µ ∫_0^t Y_s ds + σW_t, 0 ≤ t ≤ T, (2)

where µ and σ are parameters that are commonly called the drift and diffusion coefficients, respectively, of the process {Y_t}; for details see p. 136 in (Mikosch, 1998).
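As an illustration, the Ornstein-Uhlenbeck dynamics can be simulated with an Euler-Maruyama scheme; the sketch below assumes the standard Langevin form dY_t = −µY_t dt + σ dW_t, and the parameter values are illustrative, not from the paper:

```python
import math
import random

def simulate_ou(y0, mu, sigma, T, n_steps, seed=0):
    """Euler-Maruyama discretization of the Langevin equation
    dY_t = -mu * Y_t dt + sigma * dW_t, with Y_0 = y0."""
    rng = random.Random(seed)
    dt = T / n_steps
    y = y0
    path = [y]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        y = y - mu * y * dt + sigma * dw
        path.append(y)
    return path

path = simulate_ou(y0=1.0, mu=0.5, sigma=0.2, T=1.0, n_steps=1000)
```

The negative drift term −µY_t pulls the process back toward zero, which is the mean-reverting behavior characteristic of the Ornstein-Uhlenbeck process.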

Consumption Plan
We now present a controlled version of the previous problem (Cruz-Suárez, Montes-de-Oca, & Zacarías, 2011). First, suppose that an investor's wealth is governed by the law defined by the difference equation

x_{t+1} = (h(x_t) − a_t)ξ_t, t = 0, 1, 2, ..., (4)

where x_t is the current wealth at time t, δ < 1, and {ξ_t} is a sequence of independent and identically distributed random variables with density function ∆. Suppose that the investor wants to optimally manage his or her current capital x_t, dedicating part of this capital to consumption a_t and the remainder, h(x_t) − a_t, to investment. In particular, consider the production function h(x) := x^δ with x ∈ X := [0, ∞), called the state space. The transition law is thus given by

x_{t+1} = (x_t^δ − a_t)ξ_t, t = 0, 1, 2, ..., (5)

and X_0 = x := e^y is known. Loans are not allowed. Therefore,

A(x_t) := [0, x_t^δ]

is the admissible action space at time t, and A := [0, +∞) is the admissible action set.
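A minimal simulation of the wealth dynamics can make the transition law concrete; the sketch below assumes x_{t+1} = (x_t^δ − a_t)ξ_t with lognormal shocks as in the numerical example, and the consumption rule (consume half of production) is purely illustrative:

```python
import math
import random

def next_state(x, a, xi, delta=0.75):
    """Transition law: x_{t+1} = (x_t^delta - a_t) * xi_t,
    with consumption constrained to A(x) = [0, x^delta] (no borrowing)."""
    assert 0.0 <= a <= x ** delta, "action must lie in A(x) = [0, x^delta]"
    return (x ** delta - a) * xi

rng = random.Random(1)
x = 1.0
trajectory = [x]
for t in range(5):
    # shock with Log(xi) ~ N(0, 0.5); 0.5 read as the variance (an assumption)
    xi = math.exp(rng.gauss(0.0, math.sqrt(0.5)))
    a = 0.5 * x ** 0.75  # consume half of current production (illustrative rule)
    x = next_state(x, a, xi)
    trajectory.append(x)
```

Because a ≤ x^δ and ξ > 0, the wealth trajectory never becomes negative, consistent with the no-borrowing constraint.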
The dynamics described in this consumption-investment system are as follows: if the system is observed at time t, the state considered is x_t = x ∈ X = [0, +∞) and action a_t = a ∈ A(x) is applied. A reward r(a) is then obtained and the system moves to the next state, x_{t+1} ∈ X, by means of transition law (5). This process is repeated as discounted rewards accumulate at each time point t up to a horizon N according to a performance criterion. We denote K := {(x, a) | x ∈ X, a ∈ A(x)} as the set of admissible state-action pairs.

Definition 1 A consumption plan is a sequence π = {π_n}_{n=0}^∞ of stochastic kernels π_n over the set of controls, given the history h_n = (x, a_0, x_1, a_1, ..., x_{n−1}, a_{n−1}, x_n), for each n = 0, 1, .... That is, for n ≥ 0, π_n is a stochastic kernel if it satisfies the following properties: a) π_n(· | h_n) is a probability measure on A, for each history h_n.
b) π_n(B | ·) is a random variable for each B ∈ B(A), where B(A) denotes the Borel σ-algebra of A. The set of all plans is denoted by Π.
We thus have an initial capital X_0 = x = e^y ∈ X, a plan π ∈ Π, and a utility of consumption U : A → R defined as

U(a) := Log(a), a ∈ A.

The performance criterion used to evaluate the quality of the plan π ∈ Π is given by

V(π, x) := E_x^π [ ∑_{t=0}^{N} α^t U(a_t) ],

where E_x^π[·] is the expectation operator when x ∈ X is the initial condition and policy π ∈ Π is applied. Future utilities of consumption are discounted according to a discount factor α ∈ (0, 1).
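For a realized consumption path, the discounted criterion can be evaluated directly; the sketch below assumes the logarithmic utility U(a) = Log(a) used in the paper, and the sample path is illustrative:

```python
import math

def discounted_log_utility(actions, alpha=0.5):
    """Realized value of the performance criterion sum_t alpha^t * Log(a_t)
    for a consumption path, with logarithmic utility U(a) = Log(a)."""
    return sum((alpha ** t) * math.log(a) for t, a in enumerate(actions))

v = discounted_log_utility([1.0, 2.0, 4.0], alpha=0.5)
# 0 + 0.5*Log(2) + 0.25*Log(4) = Log(2)
```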
The investor's objective is to maximize the discounted consumption utility over all plans π ∈ Π, that is, to find π* ∈ Π such that

V*(x) := sup_{π∈Π} V(π, x) = V(π*, x), x ∈ X.

In this case, π* is the optimal policy, and V* is the optimal value function. Then, the optimal control problem consists in determining the optimal policy π* ∈ Π.

Existence of Solution
The production function and the consumption utility function satisfy certain usual conditions, see (De La Fuente, 2000).
Let C 2 (X, Y) denote the set of functions l : X → Y with a continuous second derivative for Euclidean spaces X and Y.
Assumption 1 In the context of the previous problem, the production and utility functions satisfy the following conditions: the production function h is strictly increasing and strictly concave, and e) U is an invertible function, U′(0) = ∞, and lim_{a→∞} U′(a) = 0.
The following definitions will be subsequently used.
Definition 2 A measurable function ϑ : X → R is said to be a solution to the optimality equation if it satisfies

ϑ(x) = max_{a∈A(x)} { U(a) + α ∫_X ϑ(y) Q(dy | x, a) }, x ∈ X, (9)

where Q is the corresponding transition law induced by (5) according to Section 1.2 in (Hernández-Lerma & Lasserre, 1996); in other words, for B ∈ B(X),

Q(B | x, a) = ∫_0^∞ I_B((x^δ − a)s) ∆(s) ds, (x, a) ∈ K, (10)

where I_B(·) is the indicator function of set B.
Remark 1 Under certain assumptions (Hernández-Lerma & Lasserre, 1996), one can show that the value iteration functions ϑ_n, n = 1, 2, ..., given by

ϑ_n(x) = max_{a∈A(x)} { U(a) + α ∫_X ϑ_{n−1}(y) Q(dy | x, a) }, x ∈ X, (11)

with ϑ_0 ≡ 0, are well defined, and that for each n = 1, 2, ..., there exists a stationary policy f_n ∈ F := {f : X → A | f(x) ∈ A(x)} such that the maximum in (11) is attained at f_n(x), that is,

ϑ_n(x) = U(f_n(x)) + α ∫_X ϑ_{n−1}(y) Q(dy | x, f_n(x)), x ∈ X.

Lemma 1 For logarithmic utility and transition law (5), the following conditions are satisfied: a) The optimal value function V* is a solution to the optimality equation (9). b) For every x ∈ X, ϑ_n(x) → V*(x) as n → ∞.

We approach the solution of the control problem with a discrete-time machine learning technique, in particular, a Q-learning method, see (Watkins & Dayan, 1992; Powell, 2011; Bertsekas, 2019). To this end, in the next section, we present a discretization procedure in the state-action space.

Discretization of State and Action Spaces
Approximate solutions using the Q-learning method require discretization of the state and action spaces X and A, respectively, and discretization requires a new set of states and actions. We therefore first consider a common finite truncated space of the original state space, namely,

I := [0, x̄] ⊂ X,

for some fixed x̄ > 0. We must partition the interval I into a finite number η of subintervals. We define ρ := x̄/η and call ρ the grid size. Each subinterval is a nonempty subset I_i of the form

I_0 := {0}, I_i := ((i − 1)ρ, iρ], i = 1, 2, ..., η.

We choose a representative element from each I_i and let Ĩ be the set of all representative elements. For any x ∈ I, we let σ_x be the element I_i (i = 0, 1, 2, ..., η) to which x belongs. We also use σ̃_x to denote the representative element of the set σ_x. For convenience, we take the right end of each subinterval I_i as the representative element.
Thus, the discretized state space is given by

X̃ := {x̃_i | x̃_i = iρ, i = 0, 1, 2, ..., η}.

For the discretization of the action space A, we first simulate a finite number k_max of random values of ξ. Then, the simulated actions a_k are generated from transition law (5), such that a_k ≥ 0 and a_k < x^δ, because borrowing is not allowed. For simplicity, these simulated values a_k are rounded and denoted by a_k^r. Set z_1 := Min{a_k^r} and z_2 := Max{a_k^r}, for k ∈ {1, 2, ..., k_max}, and consider r_1 := Round{z_1} and r_2 := Round{z_2}. Finally, we obtain the discretized admissible action space, as follows:

Ã(x̃_i) = { ã_j ∈ A(x̃_i) | ã_j = r_1 + j/100, for j = 0, 1, ..., (r_2 − r_1)·100 }, i ∈ {1, 2, ..., η}.
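The two discretization steps above can be sketched as follows; the truncation bound x_bar and the sample action values are illustrative assumptions:

```python
def discretize_state_space(x_bar, eta):
    """Partition I = [0, x_bar] into eta subintervals of grid size rho = x_bar/eta
    and take the right endpoint of each as its representative element."""
    rho = x_bar / eta
    return [rho * i for i in range(eta + 1)]  # includes the cell {0}

def discretize_action_space(samples):
    """Build the action grid with step 1/100 between r1 = Round(min) and
    r2 = Round(max) of the rounded simulated admissible actions."""
    r1, r2 = round(min(samples)), round(max(samples))
    return [r1 + j / 100 for j in range((r2 - r1) * 100 + 1)]

states = discretize_state_space(x_bar=2.0, eta=10)
actions = discretize_action_space([0.05, 0.31, 0.96])
```

With x_bar = 2.0 and η = 10, the state grid is {0, 0.2, ..., 2.0}; the sample actions round to r_1 = 0 and r_2 = 1, giving 101 grid actions.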
The transition law Q(· | ·) is a stochastic kernel on X, given K.
Define the function Q̂ : X̃ × Ã × X̃ → [0, 1] according to (Chow & Tsitsiklis, 1991) by letting

Q̂(x̃_j | x̃_i, ã) := Q(σ_{x̃_j} | x̃_i, ã) / ∑_{l=0}^{η} Q(σ_{x̃_l} | x̃_i, ã), x̃_i, x̃_j ∈ X̃, ã ∈ Ã(x̃_i), (15)

where Q is as in (10), that is, the probability function obtained by normalizing a lognormal density function ∆; here {ξ_t} is a sequence of independent and identically distributed random variables. Therefore, ∑_{j=0}^{η} Q̂(x̃_j | x̃_i, ã) = 1, and Q̂(x̃_j | x̃_i, ã) is interpreted as the probability of the transition to state x̃_j when the current state is x̃_i and action ã is chosen.
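One plausible reading of the normalized kernel is to weight each grid state by the lognormal density of the shock that would carry the current state-action pair to it, then normalize the row to sum to one. The helper names, the treatment of zero investment, and the parameters below are illustrative assumptions, not from the paper:

```python
import math

def lognormal_pdf(s, sigma2=0.5):
    """Density Delta of the shock xi with Log(xi) ~ N(0, sigma2)."""
    if s <= 0.0:
        return 0.0
    return math.exp(-math.log(s) ** 2 / (2.0 * sigma2)) / (
        s * math.sqrt(2.0 * math.pi * sigma2))

def q_hat_row(x_i, a, states, delta=0.75):
    """One row of the normalized kernel: weight each grid state x_j by the
    density of the shock xi = x_j / (x_i^delta - a) that would reach it,
    then normalize so the row sums to one."""
    invested = x_i ** delta - a
    if invested <= 0.0:
        # nothing invested: all mass on the smallest state (a modeling assumption)
        return [1.0] + [0.0] * (len(states) - 1)
    weights = [lognormal_pdf(x_j / invested) for x_j in states]
    total = sum(weights)
    return [w / total for w in weights]

states = [0.2 * i for i in range(1, 11)]
row = q_hat_row(1.0, 0.3, states)
```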

Development of the Q-Learning Method
Bellman's optimality equation, see (Gosavi, 2015), yields

J*(x̃_i) = max_{ã∈Ã(x̃_i)} { r(ã) + α ∑_{j=0}^{η} Q̂(x̃_j | x̃_i, ã) J*(x̃_j) },

for all x̃_i ∈ X̃, where J*(x̃_i) denotes the ith element of the objective function vector associated with the optimal policy.
Definition 4 For x̃_i ∈ X̃, ã ∈ Ã(x̃_i), we define the Q-factors function as follows:

Q(x̃_i, ã) := r(ã) + α ∑_{j=0}^{η} Q̂(x̃_j | x̃_i, ã) J*(x̃_j).

Then,

J*(x̃_i) = max_{ã∈Ã(x̃_i)} Q(x̃_i, ã).

Therefore, we can write the following equation, which is known as the version of the Bellman optimality equation for Q-factors:

Q(x̃_i, ã) = r(ã) + α ∑_{j=0}^{η} Q̂(x̃_j | x̃_i, ã) max_{ã'∈Ã(x̃_j)} Q(x̃_j, ã'),

for x̃_i ∈ X̃, ã ∈ Ã(x̃_i).
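The Bellman equation for Q-factors can be solved by synchronous value iteration when the kernel is known; the sketch below uses a made-up two-state kernel purely for illustration:

```python
import math

def q_value_iteration(n_states, actions, q_hat, reward, alpha=0.5, n_iter=200):
    """Synchronous value iteration on Q-factors:
    Q(i, a) <- r(a) + alpha * sum_j Q_hat(j | i, a) * max_a' Q(j, a')."""
    Q = {(i, a): 0.0 for i in range(n_states) for a in actions}
    for _ in range(n_iter):
        V = [max(Q[(j, a)] for a in actions) for j in range(n_states)]
        Q = {(i, a): reward(a) + alpha * sum(p * V[j]
                                             for j, p in enumerate(q_hat[i][a]))
             for i in range(n_states) for a in actions}
    return Q

# toy two-state example with an illustrative (made-up) kernel
actions = [0.1, 0.2]
q_hat = [{0.1: [0.7, 0.3], 0.2: [0.9, 0.1]},
         {0.1: [0.4, 0.6], 0.2: [0.6, 0.4]}]
Q = q_value_iteration(2, actions, q_hat, math.log)
```

Because the backup is an α-contraction, 200 iterations with α = 0.5 leave the iterate essentially at the fixed point of the Q-factor Bellman equation.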

Robbins-Monro Algorithm for Q-Factors
Recalling the Q-factor definition in the form of the Bellman equation, we have

Q(x̃_i, ã) = E [ r(ã) + α max_{ã'∈Ã(x̃_j)} Q(x̃_j, ã') ],

where x̃_j is the (random) next state. The expectation can be estimated using the Monte Carlo method (Ross, 2013), for instance; that is, it is possible to estimate the Q-factors by using the Robbins-Monro algorithm scheme, see (Robbins & Monro, 1951):

Q^{k+1}(x̃_i, ã) = (1 − λ_k) Q^k(x̃_i, ã) + λ_k [ r(ã) + α max_{ã'∈Ã(x̃_j)} Q^k(x̃_j, ã') ],

where the function λ_k is the reinforcement or learning rate. We derive λ_k = 1/(k + 1) from the Robbins-Monro algorithm. Other reinforcement rates can also be considered, as long as they satisfy the conditions:

∑_{k=1}^{∞} λ_k = ∞ and ∑_{k=1}^{∞} λ_k² < ∞.
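A single Robbins-Monro update of a visited Q-factor, with λ_k = 1/(k + 1), might look like the sketch below; the one-state usage example has constant reward 1 and known fixed point r/(1 − α) = 2:

```python
def q_update(Q, i, a, r, j, actions, k, alpha=0.5):
    """One Robbins-Monro update of the visited Q-factor with lam_k = 1/(k+1):
    Q(i,a) <- (1 - lam) * Q(i,a) + lam * (r + alpha * max_a' Q(j,a'))."""
    lam = 1.0 / (k + 1)
    target = r + alpha * max(Q[(j, ap)] for ap in actions)
    Q[(i, a)] = (1.0 - lam) * Q[(i, a)] + lam * target
    return Q

# one state, one action, constant reward 1: fixed point Q = 1/(1 - 0.5) = 2
Q = {(0, 0.0): 0.0}
for k in range(10000):
    q_update(Q, 0, 0.0, 1.0, 0, [0.0], k)
```

The iterates approach the fixed point slowly, at the polynomial rate induced by the averaging step size, rather than geometrically as in exact value iteration.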

Q-learning Modified Algorithm
The algorithm for applying Q-learning to Markov decision processes with a discounted reward through value iteration is presented as Algorithm 2.
Algorithm 2: Q-learning modified algorithm
Initialization: specify k_max and the parameters of {ξ};
for k = 0 to k_max do
    Simulate ξ_k;
    a_k ← x^δ − x/ξ_k, such that a_k ≥ 0 and a_k < x^δ;
    a_k^r ← Round[a_k];
end
Set r_1 := Round{Min{a_k^r}}, r_2 := Round{Max{a_k^r}};
for j = 0 to (r_2 − r_1)·100 do
    ã_j ← r_1 + j/100 (ã_j ∈ Ã(x));
    Simulate the next state x̃_j, until returning to x = x̃_i;
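A hedged Python sketch of the modified Q-learning loop on the discretized spaces follows; the truncation bound, grid sizes, starting state, and random exploration scheme are illustrative choices not fully specified in the paper, while the shock distribution Log(ξ) ∼ N(0, 0.5) and step size Log(k)/k come from the numerical example:

```python
import math
import random

def modified_q_learning(x_bar=2.0, eta=20, delta=0.75, alpha=0.5,
                        k_max=20000, seed=0):
    """Sketch of the modified Q-learning algorithm: states are the right
    endpoints of a grid on [0, x_bar], actions lie on a 1/100 grid inside
    (0, x^delta), and the visited Q-factor is updated with step Log(k)/k."""
    rng = random.Random(seed)
    rho = x_bar / eta
    states = [rho * (i + 1) for i in range(eta)]

    def snap(x):
        # representative (right endpoint) of the cell containing x, clipped to grid
        i = min(eta - 1, max(0, math.ceil(x / rho) - 1))
        return states[i]

    def action_grid(x):
        # admissible actions 0 < a < x^delta on a 1/100 grid
        n = max(2, int(100 * x ** delta))
        return [j / 100 for j in range(1, n)]

    Q = {}
    x = snap(1.0)
    for k in range(2, k_max):
        a = rng.choice(action_grid(x))               # random exploration
        xi = math.exp(rng.gauss(0.0, math.sqrt(0.5)))
        x_next = snap((x ** delta - a) * xi)         # transition law (5)
        lam = math.log(k) / k                        # reinforcement rate
        nxt = max((Q.get((x_next, ap), 0.0) for ap in action_grid(x_next)),
                  default=0.0)
        target = math.log(a) + alpha * nxt
        Q[(x, a)] = (1.0 - lam) * Q.get((x, a), 0.0) + lam * target
        x = x_next
    return Q

Q = modified_q_learning()
```

After the run, a greedy policy can be read off as argmax over the actions of each visited state.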

Q-learning Convergence
Definition 5 The asynchronous stochastic approximation scheme is given by the recursive formula of the Q-learning method:

Q^{k+1}(x̃_i, ã) = Q^k(x̃_i, ã) + λ_k(x̃_i, ã) I((x̃_i, ã) = Θ_k) [ r(ã) + α max_{ã'∈Ã(x̃_j)} Q^k(x̃_j, ã') − Q^k(x̃_i, ã) ],

for k = 1, 2, ..., k_max, where k_max is the number of iterations, x̃_j is the simulated next state, and Θ_k is the state-action pair visited in the kth iteration (see Remark 2).
Note that the following statement holds.
Remark 2 Let Θ_k be the index of the state-action pair visited in the kth iteration of the algorithm, and I(·) the indicator function. We then define the number of visits to the pair (x̃_i, ã) up to iteration k as

ν(x̃_i, ã, k) := ∑_{m=1}^{k} I((x̃_i, ã) = Θ_m);

that is, the indicator function will return a zero for a Q-factor that is not visited in the kth iteration, which implies that Q-factors that are not visited in that iteration will not be updated. There is a deterministic scalar χ > 0 such that we almost surely obtain

lim inf_{k→∞} ν(x̃_i, ã, k)/k ≥ χ,

for all x̃_i ∈ C := X̃ \ {0} and ã ∈ [0, x̃_i^δ], because C is an irreducible closed set of recurrent states. Moreover, for any z > 0, the limit lim_{k→∞} ν(x̃_i, ã, [zk])/ν(x̃_i, ã, k) exists almost surely.

Proposition 1 The step size λ_k(x̃_i, ã) = Log(k)/k and the action selected in the Q-learning algorithm satisfy the following statements:
a) The step size λ_k(x̃_i, ã) satisfies the following conditions for all x̃_i ∈ X̃ and ã ∈ [0, x̃_i^δ]:
i) ∑_{k=1}^{∞} λ_k = ∞ and ∑_{k=1}^{∞} λ_k² < ∞.
ii) For any z ∈ (0, 1), sup_k λ_{[zk]}/λ_k < ∞.
iii) For any z ∈ (0, 1), the limit lim_{k→∞} (∑_{m=1}^{[zk]} λ_m)/(∑_{m=1}^{k} λ_m) exists. Now, applying the properties of the zeta function ζ, we can demonstrate that ∑_{k=1}^{∞} (Log(k)/k)² = ζ″(2) < ∞. Therefore, the result follows, given (20) and (21).
b) Moreover, λ_k satisfies the following properties: i) λ_k is strictly decreasing for all k ≥ 3.
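The two series conditions on λ_k = Log(k)/k can be checked numerically: the partial sums of λ_k keep growing (they diverge roughly like Log(n)²/2), while the partial sums of λ_k² stabilize below ζ″(2) ≈ 1.99. The cutoff values below are illustrative:

```python
import math

def partial_sums(n):
    """Partial sums of lam_k = Log(k)/k and of lam_k^2, for k = 2..n."""
    s1 = sum(math.log(k) / k for k in range(2, n + 1))
    s2 = sum((math.log(k) / k) ** 2 for k in range(2, n + 1))
    return s1, s2

s1_a, s2_a = partial_sums(1_000)
s1_b, s2_b = partial_sums(100_000)
# sum lam_k grows like Log(n)^2 / 2 (divergence), while sum lam_k^2 stabilizes
```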
Proposition 2 The sequence of policies generated by the Q-learning algorithm, {π k } ∞ k=1 , converges to π * .

Numerical Example
To exemplify Algorithm 2, we consider the performance criterion to be optimized, as follows:

V(π, x) = E_x^π [ ∑_{t=0}^{N} (0.5)^t Log(a_t) ],

where the initial capital is x and α = 0.5 is the discount factor.
The transition of the system is governed by (5), where δ = 0.75 and {ξ_t} is such that Log(ξ_t) ∼ N(0, 0.5), which leads to the normalized probability mass function on the space X̃ × Ã × X̃ given by (15).
For the Q-learning method, set the step size λ_k(x̃_i, ã) := Log(k)/k as the reinforcement rate and k_max = 1000. Table 1 and Figure 1 summarize some results, compared with the exact solution.
Cruz-Suárez, Montes-de-Oca, and Zacarías (2011) find that the optimal policy is given by

a*(x) = x^δ(1 − αδ), x ∈ X.

Table 1 shows approximate optimal actions for some of the values of x, obtained using the method described in Algorithm 2, contrasted with the exact optimal policy. Figure 1 shows further approximate optimal actions, denoted by red dots, which exhibit behavior similar to that of the exact optimal policy, graphed by the solid blue line.
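The closed-form policy is easy to evaluate for comparison with the learned actions; with the paper's parameters α = 0.5 and δ = 0.75:

```python
def exact_policy(x, alpha=0.5, delta=0.75):
    """Closed-form optimal policy of Cruz-Suarez, Montes-de-Oca, and
    Zacarias (2011): a*(x) = x^delta * (1 - alpha * delta)."""
    return x ** delta * (1.0 - alpha * delta)

a_at_1 = exact_policy(1.0)    # 1 * (1 - 0.375) = 0.625
a_at_16 = exact_policy(16.0)  # 16^0.75 = 8, so 8 * 0.625 = 5.0
```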
The proposed modified Q-learning algorithm provides a good approximation of the optimal actions for the given parameters, without the need to resort to a closed-form solution, should one even exist.

Conclusion
The method proposed in this work offers an alternative way to approximate the optimal policy of a control problem. This method is particularly useful when a closed-form solution is not available or when classical resolution tools fail to find the optimal policy directly.
In optimal investment strategies derived in closed form, candidate functions are usually proposed based on the general form of the solution. Such candidates can be very difficult to find when working with more complex or unknown utility or production functions. Reinforcement learning methods have an advantage in terms of implementation: they assign goodness values to state-action pairs, put them to the test, and discard action selections that produce adverse results.