Nonlinear Markov Games on a Finite State Space (Mean-field and Binary Interactions)

When managing large complex stochastic systems with competitive interests, in which one or several players control the behavior of a large number of particles (agents, mechanisms, vehicles, subsidiaries, species, police units, etc.), say N_k for player k, the complexity of the game-theoretic (or Markov decision) analysis can become immense as N_k → ∞. However, under rather general assumptions, the limiting problem as all N_k → ∞ can be described by a well manageable deterministic evolution. In this paper we analyze some simple situations of this kind, proving the convergence of Nash equilibria for finite games to equilibria of a limiting deterministic differential game.


Introduction
A steady increase in complexity is one of the characteristic features of modern technological development. It requires an appropriate (or better, optimal) management of complex stochastic systems consisting of a large number of interacting components (agents, mechanisms, vehicles, subsidiaries, species, police units, etc.), which may have competitive or common interests. Carrying out a traditional Markov decision analysis for a large state space is often unfeasible. However, under rather general assumptions, the limiting problem as the number of components tends to infinity can be described by a well manageable deterministic evolution, which represents a performance of a dynamic law of large numbers (LLN). In general, this limiting deterministic evolution is measure-valued (it is an evolution of probability laws on the initial state space), and its probabilistic analysis has led to the notion of a nonlinear Markov process, see the monograph (Kolokoltsov, 2010) and references therein. Its controlled version can be naturally called a nonlinear Markov control process or (in the case of competitive interests) a nonlinear Markov game (Kolokoltsov, 2009). In the case of a finite initial state space, the corresponding space of measures is a finite-dimensional Euclidean space (more precisely, its positive orthant R^d_+), so that the limiting measure-valued evolution becomes a deterministic control process or a differential game in R^d. This paper is devoted to the analysis of the simplest situations of this kind, aiming at the identification of the deterministic limit and a proof of convergence with explicit rates. More precisely, we shall assume that there is a fixed number of players {1, ..., K}, each controlling a stochastic system consisting of a large number N_1, ..., N_K → ∞ of components respectively. These can be generals controlling armies, engineers controlling robot swarms, managers of large banks controlling subsidiaries, etc. The components can interact between themselves and with agents of other groups. The limit N_1, ..., N_K → ∞ will be described by a differential game in R^K_+.

The plan of the paper is as follows. In a preliminary Section 2 we set the stage by describing the dynamic law of large numbers for interacting Markov chains (without control). This topic is rather well developed by now, but we present it on the level of generality needed for what follows, including time-nonhomogeneous chains (with discontinuous dependence on time) and rates of convergence resulting from Hölder continuity assumptions on the r.h.s. of the limiting ODE. Section 3 introduces control without competition. Strong progress in the analysis of such systems was made recently in (Gast, Gaujal & Le Boudec, 2010). We present it in a quite different form (see the discussion in the last section), which paves the road for the extension to competitive interests developed further on. The main results are presented in Sections 4-6, where we discuss consecutively two-player zero-sum games with mean-field interaction, two-player zero-sum games with binary interaction, and a K-player noncooperative game. The proofs are given in Sections 7 and 8. The last section is devoted to a short review of relevant literature and to further perspectives.
The following (rather standard) notations for functional spaces will be used throughout the paper. For a closed subset Ω of a Euclidean space we shall denote by C(Ω) the Banach space of bounded continuous functions on Ω equipped with the usual sup-norm (which will be denoted simply ∥.∥ everywhere), and by C^k(Ω), k ∈ N, the Banach space of k times continuously differentiable functions in the interior of Ω such that f and all its derivatives up to and including order k have continuous and bounded extensions to Ω, equipped with the norm ∥f∥_{C^k(Ω)}, which is the sum of the sup-norms of f and all its derivatives up to and including order k. Finally, for α ∈ (0, 1], we denote by C^{k,α}(Ω) the subspace of C^k(Ω) consisting of functions whose kth order derivatives are Hölder continuous of index α. The Banach norm on this space is defined as the sum of the norm in C^k(Ω) and the minimal Hölder constant. For an operator Φ in a Banach space B we shall denote by ∥Φ∥_B the corresponding operator norm of Φ.

Preliminaries: LLN for Interacting Markov Chains
Let us first recall the basic setting of mean-field interacting particle systems with a finite number of types. Suppose our initial state space is a finite set {1, ..., d}, which can be interpreted as the types of particles (say, possible opinions of individuals on a certain subject, or the levels of fitness in a military unit, or the types of robots in a robot swarm). Let {Q(t, x)} = {(Q_ij)(t, x)} be a family of d × d Q-matrices or Kolmogorov matrices (i.e. the non-diagonal elements of these matrices are non-negative and the elements of each row sum up to zero) depending continuously on a vector x from the closed simplex

Σ_d = { x ∈ R^d_+ : x_1 + ... + x_d = 1 }

and piecewise continuously on time t ≥ 0. For any x, the family {Q(., x)} specifies a Markov chain on the state space {1, ..., d} with the generator

(Q(t, x) f)_i = Σ_j Q_ij(t, x) f_j = Σ_{j≠i} Q_ij(t, x) (f_j − f_i),

and with the intensity of jumps at a state i being |Q_ii(t, x)|. In other words, the transition matrices P(s, t, x) = (P_ij(s, t, x))^d_{i,j=1} of this chain satisfy the Kolmogorov forward equations

∂P(s, t, x)/∂t = P(s, t, x) Q(t, x).

Remark 1. Instead of piecewise continuous dependence on t we can assume that Q is uniformly bounded and depends measurably on t. Everything remains the same. This extension is relevant if one is interested in arbitrary discontinuous controls.
Suppose we have a large number of particles distributed arbitrarily among the types {1, ..., d}. More precisely, our state space S is Z^d_+, the set of sequences of d non-negative integers N = (n_1, ..., n_d), where each n_i specifies the number of particles in the state i. Let |N| denote the total number of particles in a state N: |N| = n_1 + ... + n_d. For i ≠ j and a state N with n_i > 0, denote by N^{ij} the state obtained from N by removing one particle of type i and adding a particle of type j, that is, n_i and n_j are changed to n_i − 1 and n_j + 1 respectively. The mean-field interacting particle system specified by the family {Q} is defined as the Markov process on S specified by the generator

(1) L_t f(N) = Σ_{i≠j} n_i Q_ij(t, N/|N|) [ f(N^{ij}) − f(N) ].

The probabilistic description of this process is as follows. Starting from any time and current state N, one attaches to each particle a |Q_ii|(N/|N|)-exponential random waiting time (where i is the type of this particle). If the shortest of the waiting times τ turns out to be attached to a particle of type i, this particle jumps to a state j according to the distribution

Q_ij(N/|N|) / |Q_ii|(N/|N|),  j ≠ i.

Briefly, with this distribution and at rate |Q_ii|(N/|N|), any particle of type i can turn (migrate) into a type j. After any such transition the process starts again from the new state N^{ij}. Notice that since the number of particles |N| is preserved by any jump, this process is in fact a Markov chain with a finite state space.
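The probabilistic description above translates directly into a Gillespie-type simulation: attach an exponential clock of rate |Q_ii|(N/|N|) to each particle and resolve the first jump. The sketch below is illustrative rather than the paper's construction; the function name, the encoding of Q as a callable returning a Kolmogorov matrix, and the single aggregated exponential clock (equivalent to per-particle clocks by the competing-exponentials property) are our assumptions.

```python
import random

def simulate_mean_field_chain(Q, N0, T, seed=0):
    """Simulate the mean-field interacting particle system on types {0, ..., d-1}.

    Q(t, x) returns a d x d Kolmogorov matrix (non-negative off-diagonal,
    rows summing to zero) evaluated at time t and empirical distribution x;
    N0 is the initial occupation vector (n_1, ..., n_d)."""
    rng = random.Random(seed)
    N = list(N0)
    total = sum(N)
    t = 0.0
    while True:
        x = [n / total for n in N]
        q = Q(t, x)
        # each particle of type i jumps at rate |Q_ii|(x); aggregate over types
        rates = [N[i] * (-q[i][i]) for i in range(len(N))]
        R = sum(rates)
        if R <= 0:
            break
        t += rng.expovariate(R)
        if t >= T:
            break
        # type of the jumping particle, then destination j ~ Q_ij / |Q_ii|
        i = rng.choices(range(len(N)), weights=rates)[0]
        dest = [q[i][j] if j != i else 0.0 for j in range(len(N))]
        j = rng.choices(range(len(N)), weights=dest)[0]
        N[i] -= 1
        N[j] += 1
    return N
```

Since every jump moves exactly one particle between types, |N| is conserved, so the normalized state N/|N| indeed stays on a fixed grid of the simplex.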
Remark 2. Yet another way of describing the chain generated by L_t is via the forward Kolmogorov (or master) equation for its transition probabilities P_MN(s, t).

To shorten the formulas, we shall denote the inverse number of particles by h, that is h = 1/|N|. Normalizing the states to N/|N| ∈ Σ^h_d, where Σ^h_d is the subset of Σ_d with coordinates proportional to h, leads to the generator of the form

(2) L^h_t f(x) = (1/h) Σ_{i≠j} x_i Q_ij(t, x) [ f(x − h e_i + h e_j) − f(x) ],

or equivalently

(3) L^h_t f(x) = (1/h) Σ_{i≠j} x_i Q_ij(t, x) [ f(x + h(e_j − e_i)) − f(x) ],

where e_1, ..., e_d denotes the standard basis in R^d. With some abuse of notation, let us denote by hN_{t,h} the corresponding Markov chain. The transition operators of this chain will be denoted by Ψ^h_{s,t}:

(4) Ψ^h_{s,t} f(x) = E_{s,x} f(hN_{t,h}),

where E_{s,x} denotes the expectation of the chain started at x at time s. These operators are known to form a propagator, i.e. they satisfy the chain rule (or Chapman-Kolmogorov equation) Ψ^h_{s,t} Ψ^h_{t,r} = Ψ^h_{s,r}, s ≤ t ≤ r. We shall be interested in the asymptotic behavior of these chains as h → 0. To this end, let us observe that, for f ∈ C^1(Σ_d),

lim_{h→0} L^h_t f(x) = Λ_t f(x),

where

(5) Λ_t f(x) = Σ_{i,j} x_i Q_ij(t, x) ( ∂f/∂x_j − ∂f/∂x_i ).

The limiting operator Λ_t is a first-order PDO with characteristics solving the equation

(6) ẋ_j = Σ_i x_i Q_ij(t, x),  j = 1, ..., d,

called the kinetic equations for the process of interaction described above. The characteristics specify the dynamics of the deterministic time-nonhomogeneous Markov Feller process in Σ_d defined via the generator Λ_t. The corresponding transition operators act on C(Σ_d) as

(7) Φ_{s,t} f(x) = f(X_{s,x}(t)),

where X_{s,x}(t) is the solution to (6) with the initial condition x at time s. These operators form a Feller propagator (i.e. the Φ_{s,t} depend strongly continuously on s, t and satisfy the chain rule Φ_{s,t} Φ_{t,r} = Φ_{s,r}, s ≤ t ≤ r). Of course, in the case of Q that do not depend on time t explicitly, Φ_{s,t} depends only on the difference t − s, and the operators Φ_t = Φ_{0,t} form a Feller semigroup.
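For intuition, the kinetic equation (6), which in matrix form reads ẋ = x Q(t, x), can be integrated by any ODE solver; a minimal forward-Euler sketch (our naming and discretization, not the paper's) reads:

```python
def kinetic_trajectory(Q, x0, T, steps=1000):
    """Forward-Euler integration of the kinetic equation
    dx_j/dt = sum_i x_i Q_ij(t, x), i.e. dx/dt = x Q(t, x), on the simplex.
    Q(t, x) returns a Kolmogorov matrix (rows summing to zero)."""
    d = len(x0)
    x = list(x0)
    dt = T / steps
    for k in range(steps):
        q = Q(k * dt, x)
        # row sums of Q vanish, so sum(x) is preserved by every Euler step
        x = [x[j] + dt * sum(x[i] * q[i][j] for i in range(d)) for j in range(d)]
    return x
```

Because the rows of a Kolmogorov matrix sum to zero, the total mass Σ_j x_j is conserved, mirroring the conservation of |N| in the particle system.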
Remark 3. It is easy to see that if x_k ≠ 0, then (X_{s,x}(t))_k ≠ 0 for any t ≥ s. Hence the boundary of Σ_d is not attainable by this semigroup, but, depending on Q, it can be gluing or not. For instance, if all elements of Q never vanish, then the points X_{s,x}(t) never belong to the boundary of Σ_d for t > s, even if the initial point x does.
The convergence of the Markov chains with generators of type (2) to a deterministic evolution, and various versions of this result, are well known; see e.g. (Darling & Norris, 2008; Kolokoltsov, 2010; Benaïm & Le Boudec, 2008) and references therein.
We present here an extension (to time-nonhomogeneous chains with discontinuous time dependence) of a result from (Kolokoltsov, 2011, Sect. 5.11), on a level of generality which allows us to get the corresponding convergence results for controlled problems as more or less straightforward corollaries.
Theorem 1. (i) Let all the elements Q_ij(t, .) belong to C^{1,α}(Σ_d), α ∈ (0, 1], with norms uniformly bounded in t. Then, if for some s > 0 and x ∈ R^d the initial data hN_s converge to x in R^d as h → 0, the Markov chains hN(t, h) with the initial data hN_s (generated by L^h_t and with transitions Ψ^h_{s,t}) converge in distribution and in probability to the deterministic characteristic X_{s,x}(t). For the corresponding converging propagators of transition operators the following rates of convergence hold:

(8) ∥(Ψ^h_{s,t} − Φ_{s,t}) f∥ ≤ C(T) h^α ∥f∥_{C^{1,α}(Σ_d)},  s ≤ t ≤ T,

(9) ∥(Ψ^h_{s,r} − Φ_{s,r}) Φ_{r,t} f∥ ≤ C(T) h^α ∥f∥_{C^{1,α}(Σ_d)},  s ≤ r ≤ t ≤ T,

where C(T) depends only on the supremum in t of the C^{1,α}(Σ_d)-norm of the functions Q(t, x).
(ii) Assuming a weaker regularity condition, namely that the Q_ij(t, .) belong to C^1(Σ_d) uniformly in t, the convergence of the Markov chains hN(t, h) in distribution and in probability to the deterministic characteristics still holds, but instead of (8) we have weaker rates, expressed in terms of the modulus of continuity w_h of ∇f and Q (see (39) for the definition), with C(T) depending on the C^1(Σ_d)-norm of Q. A similar modification of (9) holds.
Our objective is to extend this result to interacting and competitively controlled families of Markov chains.

Mean Field Markov Control
As a warm-up, let us start with mean-field controlled Markov chains without competition. Suppose we are given a family of Kolmogorov matrices Q(t, u, x) depending on time t ≥ 0 and a parameter u from a metric space U, interpreted as control. The main assumption will be that Q ∈ C^{1,α}(Σ_d) as a function of x with the norm bounded uniformly in t, u, and that Q depends continuously on t and u.
Any given bounded measurable curve u(t), t ∈ [0, T], defines a Markov chain on Σ^h_d with the time-dependent family of generators of type (2), that is

(11) L^{h,u}_t f(x) = (1/h) Σ_{i≠j} x_i Q_ij(t, u(t), x) [ f(x − h e_i + h e_j) − f(x) ].

For simplicity (and effectively without loss of generality), we shall stick further to controls u(.) from the class C_pc[0, T] of piecewise-continuous curves (with a finite number of discontinuities). The corresponding controlled characteristics are governed by the equations

(14) ẋ_j = Σ_i x_i Q_ij(t, u(t), x),  j = 1, ..., d.

For a given T > 0 and continuous functions J (current payoff) and V_T (terminal payoff), let Γ(T, h) denote the problem of a centralized controller of the chain with |N| = 1/h particles, aiming at maximizing the payoff

∫_t^T J(s, u(s), hN_s) ds + V_T(hN_T).

The optimal payoff will be denoted by V^h(t, x):

(16) V^h(t, x) = sup_{u(.)} E^{u(.)}_{t,x} [ ∫_t^T J(s, u(s), hN_s) ds + V_T(hN_T) ],

where E^{u(.)}_{t,x} denotes the expectation with respect to the Markov chain on Σ^h_d generated by (11) and started at x = hN at time t.
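For a fixed control curve, the deterministic counterpart of this payoff can be evaluated numerically by combining a Riemann sum for the integral of J with Euler steps for the characteristics (14). The following sketch is illustrative only; the names and the discretization are our assumptions.

```python
def deterministic_payoff(Q, J, VT, u_of_t, x0, t0, T, steps=500):
    """Payoff  int_t0^T J(s, u(s), x(s)) ds + V_T(x(T))  along the controlled
    characteristics dx/dt = x Q(t, u(t), x), with forward-Euler steps and a
    left-endpoint Riemann sum for the running payoff."""
    d = len(x0)
    x = list(x0)
    dt = (T - t0) / steps
    total = 0.0
    for k in range(steps):
        t = t0 + k * dt
        u = u_of_t(t)
        q = Q(t, u, x)
        total += dt * J(t, u, x)  # running payoff
        x = [x[j] + dt * sum(x[i] * q[i][j] for i in range(d)) for j in range(d)]
    return total + VT(x)
```

Maximizing this quantity over a finite family of candidate controls gives a crude approximation to the deterministic optimal payoff; by Theorem 2, a control that is nearly optimal for the limit is then nearly optimal for the |N| = 1/h particle system as well.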
We are aiming at approximating V^h(t, x) by the optimal payoff for the controlled dynamics (14):

(17) V(t, x) = sup_{u(.)} [ ∫_t^T J(s, u(s), X_{t,x}(s)) ds + V_T(X_{t,x}(T)) ],

where X_{t,x}(s) denotes the solution of (14) with control u(.) and initial condition x at time t.
We can also obtain an approximate optimal synthesis for the problems Γ(T, h) with large |N| = 1/h, at least if a regular enough synthesis is available for the limiting system. Let us recall that a function γ(t, x) is called an optimal synthesis (or an adaptive policy) for the problem Γ(T, h) if

(18) V^h(t, x) = E^γ_{t,x} [ ∫_t^T J(s, γ(s, hN_s), hN_s) ds + V_T(hN_T) ]

for all t ≤ T and x ∈ Σ^h_d, where E^γ_{t,x} denotes the expectation with respect to the Markov chain on Σ^h_d generated by (11) with u(t) = γ(t, x) and starting at x = hN at time t. A function γ(t, x) is called an ϵ-optimal synthesis or an ϵ-adaptive policy if the r.h.s. of (18) differs from its l.h.s. by not more than ϵ. Similarly, an optimal synthesis or an adaptive policy is defined for the limiting deterministic system.
Theorem 2. (i) Assume that Q, J depend continuously on t, u and Q, J ∈ C^{1,α}(Σ_d), α ∈ (0, 1], as functions of x, with the norms bounded uniformly in t, u. Then

(19) sup_{x ∈ Σ^h_d} |V^h(t, x) − V(t, x)| ≤ C(T) h^α,  0 ≤ t ≤ T,

with C(T) depending only on the bounds of the norms of Q in C^{1,α}(Σ_d). Moreover, if u(t) is an ϵ-optimal control for the deterministic dynamics (14), that is, the payoff obtained by using u(.) differs by at most ϵ from V(t, x), then u(.) is also an (ϵ + C(T)h^α)-optimal control for the |N| = 1/h particle system.
(ii) Suppose additionally that u belongs to a convex subset of a Euclidean space and that Q(t, u, x) depends Lipschitz continuously on u. Let ϵ ≥ 0, and let γ(t, x) be a Lipschitz continuous function of x, uniformly in t, that represents an ϵ-optimal synthesis for the limiting deterministic control problem. Then, for any δ > 0, there exists h_0 such that, for h ≤ h_0, γ(t, x) is an (ϵ + δ)-optimal synthesis for the approximating problem Γ(T, h). We omit the details. The same remark concerns the other theorems given below.
Notice finally that, by standard dynamic programming (see e.g. (Fleming & Soner, 2006; McEneaney, 2006)), the optimal payoff V(t, x) given by (17) represents the unique viscosity solution of the HJB equation

(20) ∂V/∂t (t, x) + max_u [ J(t, u, x) + Σ_j ( Σ_i x_i Q_ij(t, u, x) ) ∂V/∂x_j (t, x) ] = 0,  V(T, x) = V_T(x),

and the optimal payoff V^h(t, x) given by (16) solves the HJB equation

(21) ∂V^h/∂t (t, x) + max_u [ J(t, u, x) + L^{h,u}_t V^h(t, .)(x) ] = 0,  V^h(T, x) = V_T(x).

Thus, as a corollary of Theorem 2, we have proved the convergence of the solutions of the Cauchy problem for equation (21) to the viscosity solution of (20).

Two Players with Mean-field Interaction
Let us turn to a game-theoretic setting, starting with the simplest model of two competing mean-field interacting Markov chains. Suppose we are given two families of Kolmogorov matrices, Q(t, u, x, y) and P(t, v, x, y), depending on the pair of normalized empirical distributions (x, y) of the two groups of particles and on control parameters u ∈ U, v ∈ V of players I and II respectively, where U, V are metric spaces. A state of the system is a pair (N, M), where N counts the particles controlled by player I and M those controlled by player II. We shall assume for simplicity that |N| = |M| and set h = 1/|N|. Then, in the normalized variables x = hN, y = hM, the generator of the corresponding Markov chain takes the form

(22) L^h_t f(x, y) = (1/h) Σ_{i≠j} x_i Q_ij(t, u, x, y) [ f(x − h e_i + h e_j, y) − f(x, y) ] + (1/h) Σ_{i≠j} y_i P_ij(t, v, x, y) [ f(x, y − h e_i + h e_j) − f(x, y) ].

The corresponding controlled characteristics are governed by the equations

(25) ẋ_j = Σ_i x_i Q_ij(t, u(t), x, y),
(26) ẏ_j = Σ_i y_i P_ij(t, v(t), x, y).

For a given T > 0, let us denote by Γ(T, h) the stochastic game with the dynamics specified by the generator (22) and with the objective of player I (controlling Q via u) to maximize the payoff

(27) E^{u(.),v(.)}_{t,x,y} [ ∫_t^T J(s, u(s), v(s), X_s, Y_s) ds + V_T(X_T, Y_T) ]

for given functions J (current payoff) and V_T (terminal payoff), and with the objective of player II (controlling P via v) to minimize this payoff (zero-sum game). As previously, we want to approximate it by the deterministic zero-sum differential game Γ(T), defined by the dynamics (25), (26) and the payoff of player I given by

(28) ∫_t^T J(s, u(s), v(s), x(s), y(s)) ds + V_T(x(T), y(T)).

Recall the basic notions of the upper and lower values for a game Γ(T); see e.g. (Fleming & Soner, 2006) or (Malafeyev, 2000). As above, we shall use controls u(.) and v(.) from the classes C_pc([0, T]; U) and C_pc([0, T]; V) of piecewise-continuous curves with values in U and V respectively. A progressive strategy of player I is defined as a mapping β from C_pc([0, T]; V) to C_pc([0, T]; U) such that if v_1(.) and v_2(.) coincide on some initial interval [0, t], t < T, then so do u_1 = β(v_1(.)) and u_2 = β(v_2(.)). Similarly, progressive strategies are defined for player II. Let us denote the sets of progressive strategies for players I and II by S_p([0, T]; U) and S_p([0, T]; V). Then the upper and the lower values for the game Γ(T) are defined as

(29) V_+(t, x, y) = sup_{β ∈ S_p([0,T];U)} inf_{v(.) ∈ C_pc([0,T];V)} ( payoff (28) along the pair β(v(.)), v(.) ),
V_−(t, x, y) = inf_{α ∈ S_p([0,T];V)} sup_{u(.) ∈ C_pc([0,T];U)} ( payoff (28) along the pair u(.), α(u(.)) ).

If the so-called Isaacs condition holds, that is, for any x, y and any vectors p = (p_k), q = (q_k),

max_u min_v H(t, u, v, x, y, p, q) = min_v max_u H(t, u, v, x, y, p, q),  H = J(t, u, v, x, y) + Σ_j (Σ_i x_i Q_ij(t, u, x, y)) p_j + Σ_j (Σ_i y_i P_ij(t, v, x, y)) q_j,

then the upper and lower values coincide: V_+ = V_−. Similarly, the upper and the lower values V^h_+(t, x, y) and V^h_−(t, x, y) for the stochastic game Γ(T, h) are defined.

Theorem 3.
Assume that Q, P, J depend continuously on t, u, v and Q, P, J, V_T ∈ C^{1,α}(Σ_d), α ∈ (0, 1], as functions of x, y, with the norms bounded uniformly in t, u, v. Then

(31) sup_{x,y} |V^h_±(t, x, y) − V_±(t, x, y)| ≤ C(T) h^α,  0 ≤ t ≤ T,

with C(T) depending only on the bounds of the norms of Q in C^{1,α}(Σ_d). Moreover, if β ∈ S_p([0, T]; U) and v(.) ∈ C_pc([0, T]; V) are ϵ-optimal for the minimax problem (29), then this pair is also (ϵ + C(T)h^α)-optimal for the corresponding stochastic game Γ(T, h).
As in Theorem 2 (ii), one can also approximate optimal (equilibrium) adaptive policies for Γ(T, h), if regular enough (i.e. Lipschitz continuous) equilibrium adaptive policies exist for the limiting game Γ(T). In fact, as is known from differential games (see e.g. (Fleming & Soner, 2006; Malafeyev, 2000) or (Petrosjan & Zenkevich, 1996)), the upper value V_+(t, x, y) represents the unique viscosity solution of the upper Isaacs equation

(32) ∂V/∂t + min_v max_u [ J(t, u, v, x, y) + Σ_j (Σ_i x_i Q_ij(t, u, x, y)) ∂V/∂x_j + Σ_j (Σ_i y_i P_ij(t, v, x, y)) ∂V/∂y_j ] = 0,  V(T, x, y) = V_T(x, y),

and V_−(t, x, y) of the lower Isaacs equation (with min and max placed in the opposite order). Similar equations are satisfied by the values V^h_±(t, x, y) of the stochastic games; see e.g. (Fleming & Souganidis, 1989). Now, if V* is a solution to the Cauchy problem (32) and there exist Lipschitz continuous functions v*(t, x, y) and u*(t, v, x, y) such that u*(t, v, x, y) delivers the inner maximum in (32) for a given v and v*(t, x, y) delivers the outer minimum, then the pair (u*, v*) forms a saddle point for the differential game Γ_+(T) giving the information advantage to the maximizing player I; see e.g. (Fleming & Soner, 2006, Theorem 3.1). Analogously to Theorem 2 (ii), we can conclude by Theorem 3 that the policies v*(t, x, y) and u*(t, v, x, y) represent ϵ-equilibria for the corresponding stochastic game Γ_+(T, h).
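The role of the Isaacs condition can be illustrated already for a static finite game: the lower value max_u min_v and the upper value min_v max_u of a payoff matrix always satisfy lower ≤ upper, but need not coincide. A tiny sketch (the matrix encoding is our assumption):

```python
def upper_lower_values(A):
    """Upper (min-max) and lower (max-min) values of a finite zero-sum game
    with payoff matrix A[u][v] to the maximizing player, who chooses u."""
    lower = max(min(row) for row in A)               # max_u min_v
    upper = min(max(A[u][v] for u in range(len(A)))  # min_v max_u
                for v in range(len(A[0])))
    return upper, lower
```

For A = [[1, 0], [0, 1]] one gets upper = 1 > 0 = lower, so the order of optimization matters; equality of the two is exactly what an Isaacs-type condition guarantees for the differential game.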

Two Players with Binary Interaction
In a slightly different setting one can assume that changes in a competitively controlled process occur as a result of group interactions, and are not determined just by the overall mean-field distribution. Let us discuss a simple situation with binary interaction.
As in the previous section, assume we have two groups of d states (of objects or agents) controlled by players I and II respectively. Suppose now that any particle from a state i of the first group can interact with any particle from a state j of the second group (binary interaction), producing changes of i to l and of j to r with certain rates Q^{lr}_{ij}(t, u, v) that may depend on the controls u and v of the players. Assuming, as usual, that our particles are indistinguishable (any particle from a state is selected for interaction with equal probability) leads to the process generated by the operators

L_t f(N, M) = Σ_{i,j,l,r} n_i m_j Q^{lr}_{ij}(t, u, v) [ f(N^{il}, M^{jr}) − f(N, M) ].

Again let us assume for simplicity that |M| = |N| and define h = 1/|N| = 1/|M|. To get a reasonable scaling limit, it is necessary to scale time by the factor h, leading to the generators

(33) L^h_t f(x, y) = (1/h) Σ_{i,j,l,r} x_i y_j Q^{lr}_{ij}(t, u, v) [ f(x − h e_i + h e_l, y − h e_j + h e_r) − f(x, y) ],

which, for x = hN, y = hM and h → 0, tends to

Λ_t f(x, y) = Σ_{i,j,l,r} x_i y_j Q^{lr}_{ij}(t, u(t), v(t)) [ (∂f/∂x_l − ∂f/∂x_i) + (∂f/∂y_r − ∂f/∂y_j) ].

The corresponding kinetic equations (characteristics of this first-order partial differential operator) have the form

(34) ẋ_k = Σ_{i,j,r} x_i y_j Q^{kr}_{ij}(t, u, v) − x_k Σ_{j,l,r} y_j Q^{lr}_{kj}(t, u, v),
ẏ_k = Σ_{i,j,l} x_i y_j Q^{lk}_{ij}(t, u, v) − y_k Σ_{i,l,r} x_i Q^{lr}_{ik}(t, u, v).

As in the previous section, we are interested in the zero-sum stochastic game, which will again be denoted by Γ(T, h), with the dynamics specified by the generator (33) and with the objective of player I (controlling Q via u) to maximize a payoff of the same type (27), and in an approximation of this game by the limiting deterministic zero-sum differential game Γ(T), defined by the payoff (28) of player I.
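The gain-loss structure of the binary-interaction kinetic equations can be made concrete with a single Euler step. In the sketch below, the encoding Q[i][j][l][r] of the rates Q^{lr}_{ij} (with the controls frozen and absorbed into the array) is our assumption:

```python
def binary_kinetic_step(Q, x, y, dt):
    """One forward-Euler step for the binary-interaction kinetic equations:
    a pair (i, j) -- i from the first population, j from the second --
    turns into (l, r) at rate x_i * y_j * Q[i][j][l][r], (l, r) != (i, j)."""
    d = len(x)
    dx = [0.0] * d
    dy = [0.0] * d
    for i in range(d):
        for j in range(d):
            for l in range(d):
                for r in range(d):
                    if (l, r) == (i, j):
                        continue
                    rate = x[i] * y[j] * Q[i][j][l][r]
                    dx[i] -= rate; dx[l] += rate  # loss in i, gain in l
                    dy[j] -= rate; dy[r] += rate  # loss in j, gain in r
    return ([xi + dt * di for xi, di in zip(x, dx)],
            [yi + dt * di for yi, di in zip(y, dy)])
```

Each interaction conserves the mass of both populations separately, in line with |N| = |M| being preserved by the jumps.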
Theorem 4. Assume that Q, J depend continuously on t, u, v and Q, J, V_T ∈ C^{1,α}(Σ_d), α ∈ (0, 1], as functions of x, y, with the norms bounded uniformly in t, u, v. Then the same estimate (31) holds for the difference of the upper and lower values of the limiting and approximating games.
Moreover, literally the same approximations (as in the game of the previous section) hold for optimal strategies and policies.

Several Players with Coupled Mean-field Interaction
Suppose now that there are K players, each one controlling a mean-field interacting Markov chain, the chains being coupled via the joint mean-field distribution. Namely, suppose we are given K families of Kolmogorov matrices Q^k(t, u_k, x^1, ..., x^K), k = 1, ..., K, where u_k is the control parameter of player k and x^k denotes the normalized empirical distribution of the N_k particles controlled by player k. In the normalized variables x^k = h_k N^k, h_k = 1/N_k, this leads, as in Section 3, to generators of type (2) for each group; we denote by h the maximum of all the h_k. The corresponding controlled characteristics are governed by the equations

(38) ẋ^k_j = Σ_i x^k_i Q^k_ij(t, u_k(t), x^1, ..., x^K),  j = 1, ..., d, k = 1, ..., K.

For a given T > 0, suppose the objective of player k is to maximize the payoff

∫_t^T J_k(s, u_k(s), x^1(s), ..., x^K(s)) ds + V^k_T(x^1(T), ..., x^K(T)),

with given functions J_k (current payoffs) and V^k_T (terminal payoffs). Let us denote by Γ_K(T, h) the corresponding stochastic game. In the limit h → 0, with all the ratios N_k/N_j uniformly bounded, we get the deterministic differential game Γ_K(T) of K players with separated dynamics (38) and the corresponding payoffs. K-player differential games are much less understood than two-player zero-sum games; see e.g. (Friedman, 1971; Malafeyev, 2000; Olsder, 2001; Ramasubramanian, 2007; Tolwinski, Haurie & Leitmann, 1986) for informative discussions, including links with viscosity solutions of systems of HJB equations. It is not our objective here to contribute to this development. We just want to stress that, as our method shows, most of the natural equilibria of various kinds that can be analyzed for the limiting differential game Γ_K(T) do approximate the corresponding equilibria of the games Γ_K(T, h). As the simplest examples, let us consider open-loop equilibria and K-player analogs of the upper and lower values of zero-sum games. A vector curve u(.) = (u_1(.), ..., u_K(.)) is called an open-loop Nash equilibrium for Γ_K(T) if, for each k, the curve u_k(.) maximizes the payoff of player k when the curves u_j(.), j ≠ k, of the other players are kept fixed. Similarly, open-loop Nash equilibria are defined for the games Γ_K(T, h).
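The separated dynamics (38), where each population evolves by its own Kolmogorov matrix and the chains are coupled only through the joint mean-field argument, can be sketched as follows (the naming and the freezing of the controls inside Q_k are our assumptions):

```python
def coupled_kinetics(Qs, xs0, T, steps=1000):
    """Forward-Euler integration of dx^k/dt = x^k Q_k(t, x^1, ..., x^K).

    Qs is a list of K callables; Qs[k](t, xs) returns player k's d x d
    Kolmogorov matrix given the full profile xs of distributions."""
    dt = T / steps
    xs = [list(x) for x in xs0]
    for s in range(steps):
        t = s * dt
        qs = [Qk(t, xs) for Qk in Qs]  # all matrices from the same profile
        xs = [[x[j] + dt * sum(x[i] * q[i][j] for i in range(len(x)))
               for j in range(len(x))]
              for x, q in zip(xs, qs)]
    return xs
```

Note the simultaneous update: every Q_k is evaluated at the same profile before any distribution is advanced, which is what "coupled only via the joint mean-field distribution" means at the level of the discretization.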
On the other hand, for any permutation π of the K players one can define a vector value V^π arising from the information discrimination specified by π. That is, the set of strategies of player π(1) is just C_pc([0, T]; U_{π(1)}), and the progressive strategies of a player π(k), k ≠ 1, are defined as mappings β_k from the collections of curves (u_{π(1)}(.), ..., u_{π(k−1)}(.)) to C_pc([0, T]; U_{π(k)}) such that if two collections u^1(.) and u^2(.) coincide on some initial interval [0, t], t < T, then so do β_k(u^1) and β_k(u^2). Let us denote the sets of progressive strategies of the players π(k) by S_p([0, T]; U_{π(k)}).
The discriminated vector value is defined as a Nash equilibrium payoff of the game (in normal form) with these strategy spaces.
Theorem 5. Assume that Q_k, J_k depend continuously on all parameters and Q_k, J_k ∈ C^{1,α} as functions of the x-variables, with the norms bounded uniformly in t and the controls. Then the same estimate (31) holds for the difference of payoffs in open-loop Nash equilibria of the limiting and approximating games, as well as for the difference of payoffs in Nash equilibria of the limiting and approximating discriminatory games specified by any permutation π.
There do not seem to exist any general results on the regularity of adaptive policies for K-player games. On the other hand, in many examples, see (Case, 1967), and in some sense in general position, see (Malafeyev, 2000), the state space can be decomposed into a finite number of open sets, where equilibrium adaptive policies are smooth, these sets being separated by lower-dimensional manifolds where switching occurs. Under this condition, one could possibly prove the convergence of adaptive policies of the large-|N| approximations to adaptive policies of the limiting deterministic differential games, but such an analysis is beyond the present contribution.

Auxiliary Lemmas
We shall need two simple lemmas (possibly well known to experts): one on bounds for semigroups arising from deterministic processes, and another on the coupling of jump-type Markov processes.
Let us recall that the modulus of continuity of a continuous function ϕ on Σ_d is defined as follows:

(39) w_h(ϕ) = sup { |ϕ(x) − ϕ(y)| : x, y ∈ Σ_d, |x − y| ≤ h }.

Below we denote by ∇ the derivative operator with respect to the variable x ∈ Σ_d.
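The modulus of continuity (39) can be approximated on a finite grid of simplex points; the helper below (ours, for illustration: the true supremum runs over all of Σ_d, not just over a grid) is handy for checking the rate estimates of Theorem 1 (ii) numerically.

```python
import math

def modulus_of_continuity(phi, grid, h):
    """Grid approximation of w_h(phi) = sup{|phi(x) - phi(y)| : |x - y| <= h},
    where `grid` is a finite list of points of the simplex."""
    best = 0.0
    for x in grid:
        for y in grid:
            if math.dist(x, y) <= h:
                best = max(best, abs(phi(x) - phi(y)))
    return best
```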
Lemma 1. (i) Suppose that Q(t, .) belongs to C^1(Σ_d) as a function of x uniformly in t and depends measurably on t. Let Φ_{s,t} denote the linear operators (7), where X_{s,x}(t) is the solution to (6) with the initial condition x at time s. Then the Φ_{s,t} preserve the space C^1(Σ_d) and

(40) ∥Φ_{s,t} f∥_{C^1(Σ_d)} ≤ e^{ω(t−s)} ∥f∥_{C^1(Σ_d)},

where ω is the supremum over t of the norms of the functions xQ(t, x) (as functions of x) in C^1(Σ_d). Moreover, let w̃_h denote the supremum over t of the modulus of continuity of the functions ∇(xQ(t, x)) (as functions of x). Then

(41) w_h(∇(Φ_{s,t} f)) ≤ e^{ω(t−s)} w_{h'}(∇f) + (t − s) e^{ω(t−s)} w̃_{h'} ∥f∥_{C^1(Σ_d)},  h' = h e^{ω(t−s)}.

(ii) Suppose additionally that Q(t, .) ∈ C^{1,α}(Σ_d), α ∈ (0, 1], uniformly in t, and let ω_α denote a uniform upper bound for the corresponding Hölder constants. Then the space C^{1,α}(Σ_d) is preserved by Φ_{s,t} and

(42) ∥Φ_{s,t} f∥_{C^{1,α}(Σ_d)} ≤ e^{(1+α)ω(t−s)} ( ∥f∥_{C^{1,α}(Σ_d)} + ω_α (t − s) ∥f∥_{C^1(Σ_d)} ).

Proof. (i) The derivative of X_{s,x}(t) with respect to x solves the linear ODE

∂/∂t ( ∂X_{s,x}(t)/∂x ) = ( ∂X_{s,x}(t)/∂x ) DF(t, X_{s,x}(t)),  F(t, x) = xQ(t, x),

where DF denotes the derivative of F with respect to the space variable, and hence (40) follows by Gronwall's lemma. The proof of (41) is the same as that of (42) below, and is therefore omitted.
(ii) The function ∂X_{s,x}(t)/∂x solves the same linear ODE as above, whose coefficient DF(t, X_{s,x}(t)) is now Hölder continuous of index α as a function of x (with constant bounded by ω_α e^{αω(t−s)}, by the Lipschitz bound on x ↦ X_{s,x}(t)). Hence the Hölder constant (of index α) of the function ∂X_{s,x}(t)/∂x, as a function of x, is bounded by ω_α (t − s) e^{(1+α)ω(t−s)}, implying (42) by the product rule applied to ∇(f(X_{s,x}(t))) = ∇f(X_{s,x}(t)) ∂X_{s,x}(t)/∂x.
The next lemma will be used only for the proof of the second part of Theorem 2. It is needed to compare the effects of applying different adaptive policies.
Lemma 2. Let Z_z(t) and W_w(t) be two jump-type Markov processes in R^d (z and w stand for the initial points) with integral generators

L_Z f(x) = ∫ (f(y) − f(x)) ν(x, y) dy,  L_W f(x) = ∫ (f(y) − f(x)) μ(x, y) dy,

where the kernels ν, μ are bounded, depend Lipschitz continuously on x (with a common constant dominated by κ in the estimate below), and are close to each other in the sense that

sup_x ∫ |ν(x, y) − μ(x, y)| (1 + |y − x|) dy ≤ ϵ.

Then there exists a Markov process X_{z,w}(t) on R^d × R^d that couples Z_z(t) and W_w(t), in the sense that the distribution of the first (respectively second) coordinate of X_{z,w}(t) coincides with the distribution of Z_z(t) (respectively W_w(t)), and such that, with respect to this coupling,

(44) E |Z_z(t) − W_w(t)| ≤ e^{κt} (|z − w| + tϵ).

Proof. Let X_{z,w}(t) be specified by the so-called marching coupling, that is, by the generator

L f(x_1, x_2) = ∫ [f(y, y) − f(x_1, x_2)] m(x_1, x_2, y) dy + ∫ [f(y, x_2) − f(x_1, x_2)] (ν(x_1, y) − m(x_1, x_2, y)) dy + ∫ [f(x_1, y) − f(x_1, x_2)] (μ(x_2, y) − m(x_1, x_2, y)) dy,

where m(x_1, x_2, y) = min(ν(x_1, y), μ(x_2, y)).
Clearly, if f does not depend on the second argument, then L f(x_1, x_2) = L_Z f(x_1), and if f does not depend on the first argument, then L f(x_1, x_2) = L_W f(x_2), so that X_{z,w}(t) is really a coupling of Z_z(t) and W_w(t). Moreover, as one sees by inspection, for the function f(x_1, x_2) = |x_1 − x_2| the assumptions of the lemma imply

L f(x_1, x_2) ≤ κ |x_1 − x_2| + ϵ.

Consequently, by Dynkin's formula, the process

f(X(t)) − ∫_0^t L f(X(s)) ds

is a martingale (where X = (X_1, X_2)), so that

d/dt E |X_1(t) − X_2(t)| ≤ κ E |X_1(t) − X_2(t)| + ϵ,

which by Gronwall's lemma implies (44).

Proof of the Theorems
Proof of Theorem 1.
Let us start by noting that the transition operators Ψ^h_{s,t} of the Markov chains hN_{t,h} satisfy (outside the finite number of discontinuity points of Q(t, x)) the standard Kolmogorov equations

d/ds Ψ^h_{s,t} f = −L^h_s Ψ^h_{s,t} f,  d/dt Ψ^h_{s,t} f = Ψ^h_{s,t} L^h_t f,

for any function f on Σ^h_d. Similarly, the transition operators Φ_{s,t} of the deterministic Markov process X_{s,x}(t) satisfy the equations

d/ds Φ_{s,t} f = −Λ_s Φ_{s,t} f,  d/dt Φ_{s,t} f = Φ_{s,t} Λ_t f

(again outside the discontinuity points of Q(t, x)) for any f ∈ C^1(Σ_d), which is of course easily checked.
To compare these propagators, we shall use the following standard trick. We write

Ψ^h_{s,t} f − Φ_{s,t} f = ∫_s^t (d/dr) ( Ψ^h_{s,r} Φ_{r,t} f ) dr = ∫_s^t Ψ^h_{s,r} ( L^h_r − Λ_r ) Φ_{r,t} f dr.

Let us apply this equation to an f ∈ C^{1,α}(Σ_d). By Lemma 1 (ii), Φ_{r,t} f ∈ C^{1,α}(Σ_d) for all r with a uniform bound, so that the above equation applies. Moreover, from (3) and (5) we conclude that

∥(L^h_r − Λ_r) g∥ ≤ C(T) h^α ∥g∥_{C^{1,α}(Σ_d)},

with a constant C(T) depending on the sup-norm of Q(t, x) only. Consequently, by (42),

∥Ψ^h_{s,t} f − Φ_{s,t} f∥ ≤ C(T) (t − s) h^α ∥f∥_{C^{1,α}(Σ_d)},

implying (8). The estimate (9) is a straightforward corollary, taking into account (40). The convergence of the chains in distribution follows from the convergence of the transition operators by a well known general result, see e.g. (Kallenberg, 2002) or (Kolokoltsov, 2011). Moreover, weak convergence of random variables to a constant implies their convergence in probability.
Proof of Theorem 2.
(i) By Theorem 1 (and taking into account that the integral of the function J can be approximated by Riemannian sums), for any u(.) ∈ C_pc([t, T]; U), the expected payoff of the chain computed along u(.) differs from the deterministic payoff computed along the same u(.) by at most C(T) h^α, uniformly in u(.).
Hence the same estimate holds for the difference of the suprema of these two functions of u(.).
(ii) First of all, approximating J by a smooth function (and noting that all payoffs will then be uniformly approximated) reduces the problem to the case of a smooth J. Next, we can approximate γ(t, x) arbitrarily closely by a function γ̃(t, x) that is smooth in x, so that the corresponding V(t, x, γ) and V(t, x, γ̃) differ by an arbitrarily small amount. By Theorem 1, V^h(t, x, γ̃) will converge to V(t, x, γ̃) as h → 0, and hence γ̃ becomes an (ϵ + δ)-optimal policy of Γ(T, h) for small enough h. It remains to compare V^h(t, x, γ) and V^h(t, x, γ̃). But they are close by Lemma 2.
Proof of Theorems 3, 4, 5. It is the same as the proof of Theorem 2, being based on Theorem 1 and the evident observation that if two families of functions are uniformly close, then so are their minimax values.

Conclusion and Bibliographical Comments
The work on deterministic LLN limits for interacting particles and its representation in terms of nonlinear kinetic equations goes back to (Leontovich, 1935) and (Bogolyubov, 1946). A deduction of kinetic equations in a very general setting of kth order (binary, ternary, etc.) interactions can be found in (Maslov & Tariverdiev, 1982) or (Belavkin & Kolokoltsov, 2003).
In particular, for the corresponding limit in a game-theoretic setting (replicator dynamics), we can refer to (Benaïm & Weibull, 2003) or the last section of (Kolokoltsov & Malafeyev, 2010).
The closest to our setting seems to be the recent work (Gast, Gaujal, & Le Boudec, 2010), which is devoted to a convergence result similar to our Theorem 2. However, in (Gast, Gaujal, & Le Boudec, 2010) a continuous-time mean-field control model is obtained as a limit of discrete-time models, while we work directly with continuous-time models. More essentially, under slightly stronger regularity assumptions on the model (basically, continuous differentiability of the coefficients instead of Lipschitz continuity) we obtain explicit rates of convergence for the averages over empirical measures, and in some cases even for adaptive control policies. Moreover, our main objective was to develop a general framework to treat the competitive control problems presented in Sections 4-6. Even further, we developed a method to treat not only the simplest mean-field type interactions, but more involved binary or ternary ones.
As was mentioned, our initial state space was finite, resulting in the corresponding measure-valued limit being a finite-dimensional differential game. In a more general setting (which will be discussed elsewhere), for an arbitrary initial state space of a single particle (in our large ensemble), the corresponding limit becomes a truly measure-valued controlled nonlinear evolution (a controlled nonlinear Markov process) specified by kinetic equations of rather general type. Other possible extensions include models with a variable number of particles (robots going out of order, military units being destroyed, etc.).
A slightly different class of models describes situations when each agent in a large group pursues his/her own interests. This class of models is often referred to as mean-field games. The related equilibrium concept was called the Nash certainty equivalence principle; see (Lasry & Lions, 2006; Achdou & Capuzzo-Dolcetta, 2010; Gomes, Mohr, & Souza, 2010; Huang, Caines, & Malhamé, 2003; Huang, Malhamé, & Caines, 2006; Huang, 2010; Kolokoltsov, Li, & Yang, 2011) and references therein. Unlike our centralized setting above, the analysis of these models does not lead to the control of distributions in the limit, but rather to a certain consistency condition on homogeneous individual controls (Nash certainty equivalence). However, the recent paper (Huang, Malhamé, & Caines, 2007) aims at linking individual and centralized controls.
The main objective of this paper was to approximate discrete stochastic systems with a large number of particles by a simpler continuous-state limit. One can also look at these results from the opposite point of view: as approximating differential games by discrete Markov chains. This point of view would relate our results with Kushner (2002) and Kushner and Dupuis (2001).