Simulation-Based Optimization for Convex Functions Over Discrete Sets

We propose a new iterative algorithm for finding a minimum point of f* : X ⊂ R^d → R, when f* is known to be convex, but only noisy observations of f*(x) are available at x ∈ X for a finite set X. At each iteration of the proposed algorithm, we estimate the probability of each point x ∈ X being a minimum point of f* using the fact that f* is convex, and sample r points from X according to these probabilities. We then make observations at the sampled points and use these observations to update the probability of each point x ∈ X being a minimum point of f*. Therefore, the proposed algorithm not only estimates the minimum point of f* but also provides the probability of each point in X being a minimum point of f*. Numerical results indicate that the proposed algorithm converges to a minimum point of f* as the number of iterations increases and that the convergence is especially fast in the early stage of the iterations.


Introduction
We consider the problem of finding a minimum point x* of a convex function f* : X ⊂ R^d → R when f*(x) cannot be evaluated exactly but can be observed only with noise at each x ∈ X, possibly via stochastic simulation. Specifically, we assume we can obtain a noisy observation Y(x) of f*(x), satisfying Y(x) = f*(x) + ε(x) for some random variable ε(x) with a mean of zero and a variance of σ²(x) at each x ∈ X. Our focus in this paper is on the case where X is a finite set, i.e., X = {x_1, ..., x_l} ⊂ R^d. Accordingly, we say f* : X ⊂ R^d → R is convex if there exists a convex function f̄ : R^d → R satisfying f̄(x_i) = f*(x_i) for 1 ≤ i ≤ l. Such convexity arises in many applications. For example, the expected loss function in the classical newsvendor problem is proved to be convex in the order quantity, which takes on values in a discrete set (Porteus, 1990). The expected completion time of the ith customer, for i = 1, 2, ..., in a tandem queueing system is proved to be convex in the service rate (Shanthikumar & Yao, 1989).
In this paper, we propose an iterative algorithm for finding a minimum point x * of f * . At each iteration of the proposed algorithm, we sample r points from X, obtain m observations from each sampled point, and use these observations to decide which points to sample in the next iteration. The key feature of the proposed algorithm is that at each iteration, we compute the probability of x being a minimum point of f * for each x ∈ X using the fact that f * is convex and use these probabilities to sample points from X in the next iteration.
More specifically, our algorithm proceeds as follows. Initially, we assume every point in X is equally likely to be a minimum point of f*. Thus, we sample r points, say {x_1, ..., x_r}, uniformly from X. We then obtain m independent realizations of Y(x_i) for i = 1, 2, ..., r. We will denote the average and variance of these m copies of Y(x_i) by Ȳ(x_i) and σ̂²(x_i) for i = 1, 2, ..., r, respectively. The next step is to compute the probability of x being a minimum point of f* at each x ∈ X. To compute this, we consider the following quantity:

q(x) = min_{f ∈ F_x} Σ_{j=1}^{r} (m/σ̂²(x_j)) (Ȳ(x_j) − f(x_j))²,    (1)

where F_x is the set of convex functions f : X ⊂ R^d → R that have x as a minimum point. If x is indeed a minimum point of f*, then f* ∈ F_x, and hence

q(x) ≤ Σ_{j=1}^{r} (m/σ̂²(x_j)) (Ȳ(x_j) − f*(x_j))² = Σ_{j=1}^{r} (m/σ̂²(x_j)) ε̄(x_j)²,

where ε̄(x_j) is the average of m independent copies of ε(x_j). We note that the rightmost term in the above equation, Σ_{j=1}^{r} (m/σ̂²(x_j)) ε̄(x_j)², converges weakly to a χ²-distribution with r degrees of freedom as m increases. Therefore, we can approximate the probability of x being a minimum point of f* by the probability of observing a value that is as extreme as q(x) from a χ²-distribution with r degrees of freedom, which in turn is P(χ²(r) ≥ q(x)), where χ²(r) represents the χ²-distribution with r degrees of freedom. The remaining question is how to compute q(x). Proposition 1 of this paper shows that q(x) can be computed by solving a quadratic program. With the probability of x being a minimum point of f* at each x ∈ X, our next iteration proceeds by sampling another set of r points from X proportionately to these probabilities. A similar procedure is repeated in each iteration. At each iteration, our estimate of the minimum point of f* is the point that achieves the minimum value of the average of all observations made at each point sampled so far.
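As a concrete sketch of this step, the tail probability P(χ²(r) ≥ q(x)) has a simple closed form when the degrees of freedom are even, and the resulting probabilities can serve as sampling weights for the next iteration. The q(x) values below are purely hypothetical placeholders; in the algorithm they come from the quadratic program of Proposition 1.

```python
import math
import random

def chi2_sf(q, df):
    """Survival function P(chi2(df) >= q) for even df, via the closed-form
    Poisson-sum identity: exp(-q/2) * sum_{i < df/2} (q/2)^i / i!."""
    assert df % 2 == 0 and df > 0
    half = q / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Hypothetical q(x) values over a 5-point domain after sampling r = 4 points.
q_values = {1: 18.2, 2: 6.1, 3: 0.9, 4: 4.7, 5: 21.5}
r = 4

# Tail probabilities, normalized into a sampling distribution over X.
p = {x: chi2_sf(q, r) for x, q in q_values.items()}
total = sum(p.values())
p = {x: v / total for x, v in p.items()}

# Sample the next r points proportionally to these probabilities.
random.seed(0)
next_points = random.choices(list(p), weights=list(p.values()), k=r)
```

Points with small q(x) (a good convex fit having x as a minimum point) receive large weights and are sampled more often in the next iteration.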
Simulation-based optimization over discrete sets has received a great deal of attention in the research community. On the one hand, when one assumes no structure on f*, random search type methods are the most widely used; see Glover (1989) for the tabu search, Liepins and Hilliard (1989) for the genetic algorithm, Chen (1996) and Kim and Nelson (2006) for the ranking and selection method, Yan and Mukai (1992) for the stochastic ruler method, Alrefaei and Andradóttir (1999) for the simulated annealing method, Gong, Ho and Zhai (1999) for the stochastic comparison method, Shi and Ólafsson (2000) for the nested partitions method, and Hong and Nelson (2006) for convergent optimization via most-promising-area stochastic search (COMPASS).
On the other hand, when one has a priori knowledge of the shape of f * , such as convexity, one can exploit it to design a more efficient algorithm for finding a minimum point of f * . A number of local search methods have been proposed, and they can be applied to the case where f * is known to be convex; see Gerencsér et al. (1999) and Wang et al. (2013). However, most of them are gradient-based methods, which use a gradient estimate to update their estimate of the minimum point of f * at each iteration. The performance of gradient-based methods is, in general, highly sensitive to the choice of the step-size sequence, and how to choose the step-size sequence remains challenging. Another drawback of a gradient-based method is that at each iteration, it uses the most recently sampled point and its gradient estimate only, ignoring all the previously sampled points and the observations made at these points.
In this paper, we take a different view, beginning with the following two observations.
1. When f* is known to be convex, it is desirable to utilize all the previously sampled points and the observations made at these points because a point that is far from x* can still add information on x*. For example, suppose we wish to estimate a convex regression function g* : R^d → R. By sampling n points from the domain randomly, making observations at the sampled points, and fitting a convex function to the data set, one can consistently estimate g*(x) at each x in the domain as n increases to infinity; see Seijo & Sen (2011). This shows that each observation in the data set contributes to the estimation of g*(x), even if it is located far from x.
2. Most existing methods generate an estimate of a minimum point of f * at each iteration, but they do not provide the probability of their choice being a minimum of f * . It is nonetheless desirable to do so because the probability will give a sense of how confident we can be about the estimate.
In our proposed algorithm, we make use of all the previously sampled points and all the observations made at these points. We also provide the probability of each point in X being a minimum point of f * . To the best of our knowledge, this paper is the first to achieve this latter result.
The rest of this paper is organized as follows. In Section 2, we introduce some notation and definitions. In Section 3, we describe the proposed algorithm in greater detail. In Section 4, we present the numerical performance of the proposed algorithm in two settings: (1) a stylized model in which f * is given by a quadratic equation and (2) a more realistic model in which f * is the expected completion time of the 10th customer in a tandem queueing system. We provide concluding remarks and suggest future research directions in Section 5.

Notation and Definitions
For x = (x_1, x_2, ..., x_d) ∈ R^d, x^T and ‖x‖ denote the transpose of x and (x_1² + ··· + x_d²)^{1/2}, respectively. For any set S, |S| denotes the number of elements in S. For a function g : R^m → R, a vector ξ ∈ R^m is said to be a subgradient of g at x ∈ R^m if g(y) ≥ g(x) + ξ^T(y − x) for all y ∈ R^m.
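The subgradient inequality can be checked numerically. A small illustrative sketch (one-dimensional, our own construction): g(x) = |x| is convex, and at its kink x = 0 every ξ ∈ [−1, 1] is a subgradient, while away from the kink the subgradient is unique.

```python
def is_subgradient(g, x, xi, points, tol=1e-12):
    """Numerically check the subgradient inequality g(y) >= g(x) + xi*(y - x)
    over a finite list of test points (one-dimensional case for illustration)."""
    return all(g(y) >= g(x) + xi * (y - x) - tol for y in points)

# Test points covering [-3, 3] in steps of 0.1.
grid = [i / 10.0 for i in range(-30, 31)]
```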

Proposed Algorithm
In this section, we give a full description of the proposed algorithm. In each iteration of the proposed algorithm, we will sample r points from X according to the probabilities (p(x) : x ∈ X), where p(x) is an estimate of the probability of x being a minimum point of f*, and take m observations from each of the r sampled points. Thus, the total number of observations made at x ∈ X up to iteration k, which we will denote by N_k(x), is updated at each iteration. We will also denote the average and variance of all the observations made at x ∈ X up to iteration k by Ȳ_k(x) and σ̂²_k(x), respectively. In addition, S_k is the set of all the points in X sampled up to iteration k. At iteration k, our estimate of x* is the point in S_k that has the minimum value of Ȳ_k(x) and will be denoted by x̂*_k. Below is the detailed description of the proposed algorithm.
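The running quantities N_k(x), Ȳ_k(x), and σ̂²_k(x) can be maintained incrementally as observations accumulate across iterations. A minimal sketch (not the authors' code; Welford's online update is one standard numerically stable choice):

```python
class RunningStats:
    """Pooled running mean and sample variance of all observations made at a
    single point x, updated one observation at a time (Welford's algorithm)."""
    def __init__(self):
        self.n = 0        # N_k(x): number of observations at x so far
        self.mean = 0.0   # running average, i.e. Y-bar_k(x)
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def add(self, y):
        self.n += 1
        delta = y - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (y - self.mean)

    @property
    def var(self):
        # sigma-hat^2_k(x); defined once at least two observations exist
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

# Feed in a batch of m = 4 illustrative observations at one point.
stats = RunningStats()
for y in [4.0, 6.0, 5.0, 7.0]:
    stats.add(y)
```

In the algorithm, one such accumulator per point in S_k suffices; each iteration's m new observations at a sampled point are simply pushed through `add`.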

Proposed Algorithm
Step 0: Set the iteration count k = 0. Let (p(x) : x ∈ X) be the uniform distribution over X, i.e., p(x) = 1/|X| for each x ∈ X. Set S_k = ∅ and N_k(x) = 0 for each x ∈ X.
Step 1: Increase k by 1. Sample r points from X according to (p(x) : x ∈ X) and obtain m observations of Y(x) at each sampled point x. Add the sampled points to S_k, increase N_k(x) by m for each sampled point x, and update Ȳ_k(x) and σ̂²_k(x), the average and variance of all the observations made at x so far.
Step 2: For each x ∈ X, compute p(x), the probability of x being a minimum point of f*. To compute p(x), one must compute

q(x) = min_{f ∈ F_x} Σ_{z ∈ S_k} (N_k(z)/σ̂²_k(z)) (Ȳ_k(z) − f(z))²,

where F_x is the set of convex functions f : X ⊂ R^d → R having x as a minimum point. One can compute q(x) by solving the quadratic program in Proposition 1, which can be found below. Once q(x) is computed, we set p(x) = P(χ²(|S_k|) ≥ q(x)), where χ²(|S_k|) is the χ²-distribution with |S_k| degrees of freedom and |S_k| is the total number of points in X that have been sampled up to iteration k.
Step 3: Set x̂*_k = arg min_{x ∈ S_k} Ȳ_k(x). Go to Step 1.
One missing piece of information in the proposed algorithm is how to compute q(x). The following proposition suggests q(x) can be computed by solving a quadratic program.
Proposition 1. At iteration k, assume N_k(x), Ȳ_k(x), and σ̂²_k(x) are given for all x ∈ S_k. For x ∈ S_k, q(x), as defined in (1), equals the optimal objective value of the following convex quadratic program:

minimize Σ_{z ∈ S_k} (N_k(z)/σ̂²_k(z)) (Ȳ_k(z) − g_z)²
subject to g_z ≥ g_y + ξ_y^T (z − y) for all y, z ∈ S_k,    (2)
g_z ≥ g_x for all z ∈ S_k,

in the decision variables g_z ∈ R and ξ_z ∈ R^d for all z ∈ S_k. In (2), g_z represents g(z) for g ∈ F_x. For x ∉ S_k, q(x) equals the optimal objective value of the following convex quadratic program:

minimize Σ_{z ∈ S_k} (N_k(z)/σ̂²_k(z)) (Ȳ_k(z) − g_z)²
subject to g_z ≥ g_y + ξ_y^T (z − y) for all y, z ∈ S_k ∪ {x},    (3)
g_z ≥ g_x for all z ∈ S_k,

in the decision variables g_z ∈ R and ξ_z ∈ R^d for all z ∈ S_k, g_x ∈ R, and ξ_x ∈ R^d. In (3), g_z and g_x represent g(z) and g(x), respectively, for g ∈ F_x.

Proof. First, we focus on the case where x ∈ S_k. It suffices to show that (i) for any f ∈ F_x, there exist g_z ∈ R and ξ_z ∈ R^d for all z ∈ S_k satisfying the constraints of (2) with g_z = f(z), and (ii) for any g_z ∈ R and ξ_z ∈ R^d, z ∈ S_k, satisfying the constraints of (2), there exists a function f ∈ F_x with f(z) = g_z for all z ∈ S_k. To prove (i), let f ∈ F_x. Since f is convex, there exist subgradients of f at each z ∈ S_k. Let ξ_z be a subgradient of f at z and let g_z = f(z) for z ∈ S_k. Then g_z and ξ_z, z ∈ S_k, satisfy the constraints of (2); see, for example, Boyd and Vandenberghe (2004), p. 338. Conversely, let g_z and ξ_z, z ∈ S_k, satisfy the constraints of (2), and define f(y) = max{g_x, max_{z ∈ S_k} (g_z + ξ_z^T (y − z))} for y ∈ R^d. Then f is convex as the pointwise maximum of affine functions, the first set of constraints of (2) yields f(z) = g_z for all z ∈ S_k, and f(y) ≥ g_x = f(x) for all y, so x is a minimum point of f and f ∈ F_x.
In the case where x ∉ S_k, the proof follows in a similar fashion. □
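To make the constraint set of program (2) concrete, the sketch below (one-dimensional, with hypothetical names of our own) checks whether candidate values g_z and slopes ξ_z satisfy the subgradient inequalities and the minimality-at-x constraint. As part (i) of the proof guarantees, the values and slopes of a convex function minimized at x are feasible.

```python
def feasible_for_program_2(points, g, xi, x, tol=1e-9):
    """Check the constraints of program (2) for candidate values g[z] and
    slopes xi[z]: the subgradient inequalities g[z] >= g[y] + xi[y]*(z - y),
    which force the interpolant to be convex, plus minimality at x
    (g[z] >= g[x]). One-dimensional for brevity; in R^d the product
    xi[y]*(z - y) becomes a dot product."""
    convex = all(g[z] >= g[y] + xi[y] * (z - y) - tol
                 for y in points for z in points)
    minimal = all(g[z] >= g[x] - tol for z in points)
    return convex and minimal

# Values and slopes of the convex function f(z) = (z - 2)^2, minimized at x = 2.
S_k = [1.0, 2.0, 3.0]
g = {z: (z - 2.0) ** 2 for z in S_k}
xi = {z: 2.0 * (z - 2.0) for z in S_k}
```

Solving (2) then amounts to minimizing the weighted least-squares objective over this feasible set, for which any off-the-shelf quadratic-programming solver applies.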

Numerical Results
In this section, we investigate the performance of the proposed algorithm in two settings: (1) a one-dimensional case where f*(x) = (x − 50)² for x ∈ {1, 2, ..., 100} and (2) a two-dimensional case where f* is the expected completion time of the 10th customer in a tandem queueing system with two servers connected in series.

One-Dimensional Case Where d = 1
We consider the case where X = {1, 2, ..., 100}, f*(x) = (x − 50)² for x ∈ X, Y(x) = f*(x) + ε(x) for x ∈ X, and ε(x) follows a normal distribution with a mean of 0 and a standard deviation of 30. It should be noted that x* = 50. We applied the proposed algorithm with r = m = 10. Figure 1 illustrates how (p(x) : x ∈ X) changes as the iterations progress in a particular instance. One can observe that after the first iteration, the proposed algorithm successfully screened out 75% of the points in X. In the first iteration, we sampled 10 points from X randomly. Thus, the proposed algorithm screened out 75% of the points in X correctly by sampling only 10% of the points in X, using the fact that f* is convex. In the second and third iterations, the proposed algorithm screened out 85% and 90%, respectively, of the points in X.
We computed x̂*_k and ‖x̂*_k − x*‖ for k = 1, 2, ..., 15 and repeated this procedure 100 times independently. We denote the mean and standard deviation of the 100 copies of ‖x̂*_k − x*‖ by m̂_1 and ŝ_1, respectively. Then m̂_1 ± 1.96 ŝ_1/√100 represents the 95% confidence interval of E[‖x̂*_k − x*‖] based on the 100 independent copies. In Table 1, we report the 95% confidence intervals for a range of iteration counts. The results show that the proposed algorithm converges to x* as the iteration count increases, and the convergence is especially rapid in the early stage of the iterations.
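The confidence-interval computation used here is elementary; a sketch with illustrative error values of our own (not the paper's data):

```python
import math

def mean_and_ci(samples, z=1.96):
    """95% confidence interval m-hat +/- 1.96 * s-hat / sqrt(n) for the mean
    of i.i.d. replications, as used for E[||x-hat*_k - x*||] above."""
    n = len(samples)
    m = sum(samples) / n
    s2 = sum((v - m) ** 2 for v in samples) / (n - 1)  # sample variance
    half = z * math.sqrt(s2 / n)
    return m, (m - half, m + half)

# Illustrative distances ||x-hat*_k - x*|| from 10 hypothetical replications.
errors = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
m, (lo, hi) = mean_and_ci(errors)
```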

Two-Dimensional Case Where d = 2
We consider a tandem queueing network with two servers connected in series. The external arrival process to the first server follows a Poisson process with a rate of 100 customers an hour. A customer departing from the first server joins the queue for the second server. There are no external arrivals to the second server. A customer departing from the second server leaves the system. The decision variables for this example are the numbers of resources allocated to the first and second servers, say x_1 and x_2; thus, x_1 and x_2 take discrete values. We assume we have a total of 20 resources available, so X = {(x_1, x_2) ∈ Z² : x_1 + x_2 = 20, x_1 ≥ 1, x_2 ≥ 1}. The service times for the first and second servers follow exponential distributions with rates of 4x_1 and 10x_2, respectively, for (x_1, x_2) ∈ X. There is a buffer of infinite capacity in front of each server. Each server serves customers on a first-in, first-out basis. We assume the system is empty at time 0. For x ∈ X, f*(x) is the expected completion time of the 10th customer in the system. It should be noted that f*(x) is proved to be convex in x; see Shanthikumar & Yao (1989).

Figure 1. In all graphs, the x axis is x ∈ X and the y axis is p(x). The graphs on top, in the middle, and at the bottom represent p(x) after the first, second, and third iterations, respectively.
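A noisy observation Y(x) of f*(x) in this model can be generated by simulating the tandem queue directly via the standard departure-time recursions; a minimal sketch under the stated model assumptions (the function name and seed handling are our own):

```python
import random

def completion_time_10th(x1, x2, seed=None):
    """One noisy observation Y(x) of the completion time (in hours) of the
    10th customer in the two-server tandem queue: Poisson arrivals at rate
    100 per hour, exponential services at rates 4*x1 and 10*x2, system
    empty at time 0, infinite buffers, FIFO service."""
    rng = random.Random(seed)
    d1 = d2 = 0.0   # departure times of the previous customer from servers 1, 2
    arrival = 0.0
    for _ in range(10):
        arrival += rng.expovariate(100.0)                    # next external arrival
        d1 = max(arrival, d1) + rng.expovariate(4.0 * x1)    # departs server 1
        d2 = max(d1, d2) + rng.expovariate(10.0 * x2)        # departs server 2
    return d2  # completion time of the 10th customer

# One observation at the estimated optimum x* = (13, 7).
y = completion_time_10th(13, 7, seed=1)
```

Averaging many independent replications of this function at each x ∈ X yields the estimate of x* described below.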
We applied the proposed algorithm, with r = 5 and m = 100, to obtain x̂*_k and ‖x̂*_k − x*‖ for k = 1, 2, ..., 7. We estimated x* = (13, 7) by generating 1,000,000 copies of Y(x) for each x ∈ X, computing the average of the 1,000,000 copies at each x ∈ X, and finding the minimum of these averages across x ∈ X.
We computed x̂*_k and ‖x̂*_k − x*‖ for k = 1, 2, ..., 7 and repeated this procedure 100 times independently. We denote the mean and standard deviation of the 100 copies of ‖x̂*_k − x*‖ by m̂_2 and ŝ_2, respectively. Then m̂_2 ± 1.96 ŝ_2/√100 represents the 95% confidence interval of E[‖x̂*_k − x*‖] based on the 100 independent copies. In Table 2, we report the 95% confidence intervals for k = 1, 2, ..., 7. The results indicate the proposed algorithm converges to the true minimum point of f* as the iterations progress.

Conclusions
In this paper, we proposed a new algorithm for finding a minimum point x * of f * : X → R when f * is known to be convex, but f * (x) for x ∈ X cannot be evaluated exactly and must be estimated via simulation. At each iteration, the proposed algorithm not only suggests an estimate of x * but also computes the probability of each point x ∈ X being a minimum point of f * . The numerical results indicate the proposed algorithm shows fast convergence, especially in the early stage of the iterations.
There are multiple directions for future research. The proposed algorithm can be extended to the case where X is a continuous space or to the case where the observations are correlated. It also can be extended to the case where f * has other shape characteristics such as unimodality or Lipschitz continuity.