# Workshop on restless bandits, index policies and applications in reinforcement learning

**Keywords:** Restless bandits, indexability, weakly-coupled MDPs, decomposition

**Abstract:** The goal of this workshop is to gather the community working on restless bandits and their applications in MDPs and learning. Topics of interest include:

- Application of restless bandits
- Index policies and indexability, including algorithms to compute indices
- Decomposition methods to solve large MDPs (including weakly-coupled MDPs and factored MDPs)
- Mean field control

The workshop will feature **invited talks** and presentations of **accepted abstracts**.

## Abstract submission (intended for PhD students or young researchers)

**Deadline**: September 29.

**Notification of acceptance:** October 10.

PhD students are invited to propose talks that fall within the themes of the workshop or, more broadly, within the topics of the GDR COSMOS. This includes bandits, restless bandits, MDPs, stochastic models and queueing theory.

Anyone can submit an abstract but priority will be given to PhD students, post-docs or young researchers.

Submission form.

## Confirmed speakers

- Konstantin Avrachenkov (Inria)
- Urtzi Ayesta (CNRS, IRIT)
- Dorian Baudry (CNRS, ENSAE)
- Ana Busic (Inria, Ens Paris)
- Céline Comte (CNRS)
- Jean-Philippe Chancelier and Michel De Lara (École des Ponts ParisTech)
- Christopher Dance (Naverlabs)
- Manu Gupta (IIT Roorkee)
- Peter Jacko (University of Lancaster)
- Dong Li (University of Lancaster)
- Aditya Mahajan (McGill University)
- José Niño-Mora (Carlos III University of Madrid) (**Keynote**)
- Cem Tekin (Bilkent University)
- Chen Yan (Inria and Inrae)

## Venue and registration

Location: Grenoble campus, Bâtiment IMAG, main amphitheatre (ground floor). See also How to get here.

Date: November 20-21, 2023.

- By plane, the closest airport is Lyon St Exupéry. There are direct buses and trains from Lyon St Exupéry to Grenoble (about 1h). Geneva is also an option, but it is about 2h-2h30 from the airport to Grenoble.
- By train, Grenoble is about 3h from Paris, 2h from Geneva.

**Registration** is free but mandatory (in order to plan for catering). Registration form.

## Technical Program and Schedule

The workshop runs from 10am on Monday to 4pm on Tuesday. Click on titles for abstracts.

#### Monday, Nov 20, 2023

- 9h30-10h15: coffee
- 10h15 - 10h30: Introduction (Nicolas Gast)
- Session 1: 10h30-12h00 (chair: Konstantin Avrachenkov)

**Aditya Mahajan.** **Abstract:** Reinforcement learning is an attractive approach for learning good policies based on data when the system model is unknown. However, the cumulative regret of most RL algorithms scales linearly with the size of the state space. Such bounds are prohibitively large for resource allocation and scheduling problems, where the state space is typically exponential in the number of alternatives. In this talk, we present a model-based RL algorithm for such problems which has scalable regret. In particular, we consider a restless bandit model, and propose a Thompson-sampling based learning algorithm which is tuned to the underlying structure of the model. We present two characterizations of the regret of the proposed algorithm with respect to the Whittle index policy. First, we show that for a restless bandit with $n$ arms and at most $m$ activations at each time, the regret scales either as $\tilde{O}(mn\sqrt{T})$ or $\tilde{O}(n^2 \sqrt{T})$ depending on the reward model. Second, under an additional technical assumption, we show that the regret scales as $\tilde{O}(n^{1.5} \sqrt{T})$ or $\tilde{O}(\max\{m\sqrt{n}, n\} \sqrt{T})$.

**Cem Tekin.** **Abstract:** Surrounded by a rich set of applications, restless bandits have intrigued the minds of many researchers over the last decades. In addition to the many computational challenges in finding an optimal policy, a question of significant interest is how to play when the transition probabilities and rewards are unknown.

In this talk, I will answer this important question by reviewing advances in rested and restless bandits with unknown statistics. I will motivate this setup with real-world applications and then discuss regret notions and algorithm design techniques that cope with uncertainties in problem parameters. In particular, I will talk about the regenerative cycle approach to algorithm design and its regret analysis. Finally, several extensions will be discussed, such as multiple-play and decentralized multiplayer settings.

**Céline Comte.** **Abstract:** Stochastic networks and queueing systems often lead to Markov decision processes (MDPs) with large state and action spaces as well as non-convex objective functions, which hinders the convergence of many reinforcement learning (RL) algorithms. In this paper, we show that these difficulties can be circumvented by exploiting the simple structure of the underlying MDP. We focus on a particular class of RL algorithms, called policy-gradient methods, which directly optimize the policy parameter via stochastic gradient ascent, without necessarily relying on value-function estimation. These methods are known to perform better on MDPs with large state and action spaces, but they sometimes experience slow convergence due to the high variance of the gradient estimator. Our first contribution is therefore to design a new family of gradient estimators, called score-aware gradient estimators (SAGEs), that apply to policy parameterizations under which the stationary distribution of the MDP forms an exponential family parameterized by the policy parameter; in the context of stochastic networks and queueing systems, this essentially means that the stationary distribution has a product form. Our second contribution is a convergence result showing that, under appropriate assumptions, the policy under SAGE-based policy-gradient methods has a large probability of converging to an optimal policy, provided that it starts sufficiently close. This also holds when the objective function is nonconvex and has multiple optimal policies. Lastly, we numerically compare the performance of a SAGE-based policy-gradient method with that of actor-critic, a classical policy-gradient method.

This is joint work with Jaron Sanders (Eindhoven University of Technology), Matthieu Jonckeere (LAAS CNRS), and Albert Senen-Cerda (Eindhoven University of Technology).

- Lunch break
- Session 2: 14h00-15h30 (chair: Céline Comte)

**Dong Li.** **Abstract:** Reward-based crowdfunding plays a crucial role in fundraising for start-up entrepreneurs. However, the actual success rate of fundraising projects is surprisingly low across multiple crowdfunding platforms. This paper considers crowdfunding platforms' decision-making when selecting projects to highlight on their homepage to boost the projects' chance of success, and investigates optimal promotion strategies that maximize platforms' revenue over a fixed period. We characterize backers' investment decisions by a discrete choice model and formulate the problem as a stochastic dynamic program, which is however computationally intractable. To address this issue, we follow Whittle's restless bandit approach to decompose the problem into a collection of single-project problems and prove indexability for each project under a mild condition. We show that the index values of the proposed index policy can be derived directly from the value-to-go of each project under a non-promotion policy, which is calculated recursively with linear-time complexity. Moreover, to further enhance scalability, we also develop a closed-form approximation to the index values. Extensive numerical experiments show that the proposed index policies outperform the other benchmark heuristics in most scenarios considered.

**Christopher Dance.** **Abstract:** We consider the problem of choosing when to make noisy observations of a time series, if each observation has a fixed cost and we wish to minimize a discounted sum of penalties for uncertainty about the state of the time series (e.g., the posterior variance) plus the fixed observation costs. We prove the optimality of a threshold policy (which observes if and only if the posterior variance exceeds a threshold) for time series whose posterior variance updates are given by a scalar Kalman filter. It follows that this threshold policy is also optimal for control of such processes if we must minimize a quadratic penalty on the state plus a fixed cost per observation (i.e., for LQG control with costly observations). Furthermore, we show that the associated restless bandit problem, in which we must observe (and potentially control) multiple time series, is indexable. We deployed the associated Whittle index policy as part of a parking management project, to select the placement of mobile cameras used to measure parking occupancy.

The idea of the proof is to determine the possible sequences of actions resulting from the threshold policy: after an initial prefix, any such sequence is either periodic (repeating a Christoffel word) or aperiodic (given by a Sturmian word). This characterization of action sequences enables us to show that the ratio of marginal cost to marginal work under a threshold policy is a Lipschitz continuous and nondecreasing function of the threshold. It then follows (from results of Niño-Mora) that a threshold policy is optimal.

**Peter Jacko.** **Abstract:** Multi-armed bandit problems (MABPs) are a special type of optimal control problem that has been studied in the fields of operations research, statistics, machine learning, economics, and others. It is a framework well suited to modelling (dynamic) resource allocation under uncertainty in a wide variety of contexts. The use of bandit models to optimally design clinical trials is one of the typical motivating applications for designing so-called patient-centric trials, which take into account the benefit of the in-trial patients. Nevertheless, the resulting theory has had little influence on the actual design of clinical trials. Contrary to similar learning problems arising, for instance, in digital marketing, where interventions can be tested on millions of users at negligible cost, clinical trials are about "small data", as recruiting patients is remarkably expensive and (in many cases) ethically challenging. Due to the focus on small sizes, we do not resort to the normal approximation of the binomial distribution, which is a common practice for large samples either "for simplicity" or "for ease of computation". We evaluate and compare the performance of a variety of operations research and machine learning procedures (e.g., those based on decision theory and solved by dynamic programming, the restless bandits' Whittle index rule, the frequentist Upper Confidence Bound index rule, ...) for the finite-horizon MABP, including the traditional and still dominant clinical trial design choice of equal fixed randomization, and interpret them in the context of designing clinical trials. Our results illustrate how bandit approaches could offer significant advantages, mainly in terms of allocating more patients to better interventions, but still pose important inferential challenges, particularly in terms of their lower statistical power, potential for bias in estimation, and the existence of closed-form test distributions or asymptotic theory. We illustrate some promising modifications to bandit procedures to address power and bias issues, and we reflect upon the open challenges that remain for an increased uptake of bandit models in clinical trials.

This talk is based on joint work mainly with Sofia S. Villar (Cambridge University) and with other collaborators and research students.
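As a toy companion to Christopher Dance's abstract above, the scalar Kalman-filter variance recursion and the observe-above-a-threshold rule can be simulated in a few lines. This is a minimal sketch, not material from the talk: all parameter values (`a`, `q`, `r`, `obs_cost`, `threshold`) are illustrative, and only the variance recursion itself is the standard Kalman filter.

```python
def simulate_threshold_policy(a=1.0, q=0.5, r=1.0, obs_cost=1.0,
                              threshold=1.2, gamma=0.95, horizon=50):
    """Discounted cost and observation schedule of the threshold rule."""
    p = 0.0                          # posterior variance of the state estimate
    total_cost, schedule = 0.0, []
    for t in range(horizon):
        p = a * a * p + q            # Kalman prediction: variance grows
        observe = p > threshold      # observe iff predicted variance is high
        if observe:
            p = p * r / (p + r)      # Kalman update: variance shrinks
        total_cost += gamma ** t * (p + obs_cost * observe)
        schedule.append(observe)
    return total_cost, schedule

cost, schedule = simulate_threshold_policy()
```

With these parameters the schedule settles, after a short prefix, into a periodic observe-every-other-step pattern, echoing the Christoffel-word structure of action sequences mentioned in the abstract.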

- Coffee break
- Session 3: Young researchers: 16h - 17h30 (chair: Bruno Gaujal)

**Olivier Bilenne** (Avignon University). Extensions of fictitious play to stochastic games have recently been examined in combination with reinforcement learning techniques inherent to Markov decision processes. We revisit this approach in the context of partially observable stochastic games. For this, we consider a two-player (finite-state) zero-sum stochastic game where one player (the attacker) has full visibility of the system, whereas the other player (the defender) has no access to the state of the opponent and must instead make do with public sources of information (in our setting: the actions played and their associated payoffs). We study a fictitious play dynamics where the players best-respond to the estimated empirical frequencies of action of their opponent. This sequence of play requires the players to form beliefs on both their opponent's strategy and their own continuation payoff (modeled by a Q-function), based on the (full or partial) information that is available to them. The strategy estimation scheme, in particular, features a correction mechanism making up for delayed symptoms in the partially observable setting, thus enabling the defender to better predict the attacker's strategy. The convergence of the fictitious play sequence is analyzed as a perturbed solution to a differential inclusion. The solution strategies of play, which map belief states to mixed best-response strategies, are specified by only a finite number of parameters that form a stationary Nash equilibrium with respect to expected payoff.

**Orso Forghieri** (École Polytechnique). The high dimensionality of model-based reinforcement learning and Markov decision processes can be reduced using an abstraction of the state and action spaces. Hierarchy and abstraction methods have been explored over the last decades. However, those works rarely provide explicit methods to build useful abstractions of models.

We study a class of iterative aggregation algorithms for solving infinite-horizon problems in the discounted setting with Approximate Value and Q-Value Iteration. Our scheme, inspired by previous approximations of the Bellman operator, is to iteratively refine an aggregation of states according to a value function in order to reduce its range on each piece of the partition. We provide a simple proof of convergence for this algorithm without making any assumptions about the structure of the problem. The main interest of this method is that its numerical convergence is fast.

Indeed, an important new feature of our method is to progressively aggregate along Value Iteration steps to associate states with similar properties. This process decreases the computational complexity of the iteration of the Bellman operator and provides useful abstractions. These operations can also be used to partition states and actions through an approximate Q-Value Iteration. A numerical comparison on some classical MDPs shows that our algorithm is faster than traditional dynamic programming methods and than recent aggregative methods with a fixed number of adaptive partitions.

Joint work with Erwan Le Pennec (École Polytechnique), Hind Castel (Télécom SudParis) and Emmanuel Hyon (Sorbonne).

**Henrique Donâncio** (Inria Grenoble). Deep reinforcement learning methods are sample-inefficient when exploring the environment from scratch. In this work, we introduce an approach to knowledge transfer using the value function combined with curriculum learning, which aims to leverage the learning process by transferring knowledge across progressively increasing task complexity. Our main contribution is demonstrating the effectiveness of this approach by modifying the degrees of freedom of the target task, breaking it down into simpler sub-tasks, and leveraging learning by transferring the knowledge along the curriculum steps. We empirically demonstrate the broad possibilities of modifying the degrees of freedom of the target task to leverage learning in classical reinforcement learning problems and a real-world control task.

**Victor Boone** (Université Grenoble Alpes). This talk is about the one-shot behavior of no-regret algorithms for stochastic bandits. Many algorithms already achieve asymptotically optimal regret. Yet, their behaviors over a single run are very different: their pseudo-regrets can be either smooth or bumpy. This tendency is measured with a new metric that we call the sliding regret, measuring the worst pseudo-regret over a time window of fixed length sliding to infinity. We show that index policies (UCB, KL-UCB, IMED, MOSS, etc.) all have *linear* sliding regret as soon as their index satisfies a few regularity assumptions. Meanwhile, Thompson Sampling and MED are better behaved and achieve constant sliding regret.

**Milad Malekipirbazari** (Chalmers University of Technology). This study explores the integration of risk into the multi-armed bandit problem (MAB) with switching penalties. Specifically, we investigate scenarios in which decision-makers incur a cost each time they switch between arms, which is a common occurrence in many real-world applications. Unlike the risk-neutral case, the structure of the optimal policy and the performance of priority-index algorithms for allocating resources in risk-averse MAB problems with switching penalties remain unknown, and no heuristic method exists for generating high-quality, interpretable solutions. To address this gap, we explore the qualitative features of optimal policies to streamline the search for an efficient allocation strategy. Then, we develop an index-based heuristic algorithm and show that it achieves near-optimal performance in practice.
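The sliding-regret metric from Victor Boone's talk earlier in this session is simple enough to state in code: the worst increase of the pseudo-regret over any time window of a fixed length. The sketch below is illustrative only; the two trajectories are synthetic, not data from the talk.

```python
def sliding_regret(pseudo_regret, window):
    """Worst pseudo-regret increase over any window of the given length."""
    return max(pseudo_regret[t + window] - pseudo_regret[t]
               for t in range(len(pseudo_regret) - window))

# A 'smooth' run grows steadily; a 'bumpy' run has the same slope plus
# occasional jumps (e.g. from a forced exploration phase every 20 steps).
smooth = [0.1 * t for t in range(100)]
bumpy = [0.1 * t + 5 * (t // 20) for t in range(100)]

sr_smooth = sliding_regret(smooth, 10)   # slope * window = 1.0
sr_bumpy = sliding_regret(bumpy, 10)     # one jump fits in a window: 6.0
```

Both runs accumulate comparable total pseudo-regret, but the bumpy one has a much larger sliding regret, which is the distinction the metric is designed to capture.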

#### Tuesday, Nov 21, 2023

- Session 4: 9h-10h30 (chair: Aditya Mahajan)

**Chen Yan.** **Abstract:** The Whittle index for restless bandits (RB) is typically defined by relaxing the strict budget constraint to be met merely in expectation, closely aligning with the certainty equivalence control (CEC) principle. In this talk, I'll elucidate this connection. Starting with an introduction to CEC, I'll formulate the RB problem in such a way that CEC can be readily applied. Subsequently, I'll demonstrate the design of policies based on this methodology to ensure close-to-optimal performance, and discuss some intuition on its convergence speed to the optimal value. I'll conclude by offering perspectives on extensions of this model to weakly coupled MDPs and problems with nonlinear convex structures. This presentation stems from joint work with Nicolas Gast, Bruno Gaujal, and Alexandre Reiffers-Masson.

**Jean-Philippe Chancelier and Michel De Lara.** **Abstract:** This work establishes a bridge between stochastic multistage optimization and multi-armed bandit problems. Multi-armed bandit algorithms have become a fundamental component of reinforcement learning. Their performance is backed by a substantial body of theoretical regret guarantees. We consider the case where the decision-maker has a prior on the arms and knows the time horizon. Then, the multi-armed bandit problem can be scrutinized under the lens of stochastic multistage optimization. Notably, while the bandit literature mostly focuses on regret guarantees, the stochastic multistage optimization literature focuses on sub-optimality: how much worse is a candidate policy compared to the optimal, adapted policy. Decomposition methods, based on the combination of dynamic programming and Lagrangian relaxation, have proven very efficient in practice for solving large-scale stochastic multistage optimization problems. They have never been assessed on finite-horizon Bayesian bandit problems. In this experimental work, we aim to fill this gap by applying such a decomposition method to the Bayesian Bernoulli bandit. We observe that it performs empirically very well. This motivates further work on the optimality gap provided by decomposition methods.

**Manu Gupta.** **Abstract:** The multi-armed restless bandit framework allows us to model various decision-making problems in areas as diverse as industrial engineering, computer communication, operations research, financial engineering, communication networks, etc. In a seminal work, Whittle developed a methodology to derive well-performing (Whittle's) index policies that are obtained by solving a relaxed version of the original problem. However, the computation of Whittle's index is itself a complex problem; hence, researchers have focused on the calculation of Whittle's index on a problem-by-problem basis. Our main contribution is the derivation of a closed-form expression for Whittle's index when the bandit has Markovian evolution, which is valid as long as the technical condition of indexability is satisfied. Our solution approach provides a unifying expression for Whittle's index, and in particular cases we retrieve a few known results from the literature, including the classical machine repairman problem, content delivery networks, the transmission control protocol (TCP), etc. We also use Whittle indices for load balancing in heterogeneous servers.

- Coffee break
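As a toy illustration of the Whittle-index computation discussed in Manu Gupta's talk, the index of a single arm can be found numerically by bisection on the passive subsidy: the index of a state is the subsidy at which being active and being passive are equally attractive there. This is a minimal sketch, not the talk's closed-form approach; the two-state sanity check, discount factor and bisection bounds are all illustrative, and indexability is assumed.

```python
import numpy as np

def whittle_index(state, r_active, P_active, P_passive, gamma=0.9,
                  lo=-5.0, hi=5.0, iters=60):
    """Whittle index of `state`: the passive subsidy making both actions
    equally attractive there, found by bisection (assumes indexability)."""
    n = len(r_active)

    def action_gap(lam):
        # Solve the single-arm MDP with subsidy `lam` for passivity
        # by value iteration, then compare the two Q-values at `state`.
        V = np.zeros(n)
        for _ in range(2000):
            q_act = r_active + gamma * P_active @ V
            q_pas = lam + gamma * P_passive @ V
            V_new = np.maximum(q_act, q_pas)
            if np.max(np.abs(V_new - V)) < 1e-12:
                V = V_new
                break
            V = V_new
        q_act = r_active + gamma * P_active @ V
        q_pas = lam + gamma * P_passive @ V
        return q_act[state] - q_pas[state]  # positive: active preferred

    for _ in range(iters):  # bisection on the subsidy
        mid = 0.5 * (lo + hi)
        if action_gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Sanity check: if both actions freeze the state, the Q-value gap at a
# state is simply (active reward - subsidy), so the index equals the
# active reward of that state.
r = np.array([0.0, 1.0])
identity = np.eye(2)
index_1 = whittle_index(1, r, identity, identity)  # close to 1.0
```

For a genuinely restless arm one would plug in distinct `P_active` and `P_passive` matrices; the bisection then recovers numerically what closed-form expressions, where available, give directly.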
- 11h-12h
**José Niño-Mora.** **Abstract:** In this talk I will first review the approach to establishing indexability and computing the Whittle index and extensions based on partial conservation laws, which I introduced in 2001 and have since extended to increasingly broad settings. I will then review past illustrative applications demonstrating the effectiveness of this approach, and highlight recent and ongoing developments in methodology, algorithms, and applications to specific models. Finally, I will discuss open challenges providing avenues for future research.

- Lunch break
- Session 5: 14h--16h (chair: Manu Gupta)

**Urtzi Ayesta.** **Abstract:** In this talk I will consider the characterization of Whittle's index in multi-armed restless bandits under modulation. Modulation is modeled as an exogenous process whose state affects the transitions of the RMAB, and its state might be observable or unobservable.

**Konstantin Avrachenkov.** **Abstract:** We discuss the design of two-timescale reinforcement learning algorithms for weakly coupled Markov decision processes (MDPs). The MDPs are coupled through a common constraint on the actions. Restless multi-armed bandit problems (RMABPs) are examples of weakly coupled MDPs with only two possible actions for each arm. The Whittle index policy is a heuristic that has shown remarkably good performance when applied to RMABPs. In this talk, we first present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index. Then, we discuss some generalizations to multi-action RMABPs and continuous-action weakly coupled MDPs. The talk is based on a series of joint works with U. Ayesta, V.S. Borkar and F. Robledo.

**Dorian Baudry.** **Abstract:** There has been a recent surge of interest in non-parametric bandit algorithms based on subsampling. One drawback of these approaches, however, is the additional complexity required by random subsampling and the storage of the full history of rewards. Our first contribution is to show that a simple deterministic subsampling rule, proposed in the recent work of Baudry et al. (2020) under the name of "last-block subsampling", is asymptotically optimal in one-parameter exponential families. In addition, we prove that these guarantees also hold when limiting the algorithm's memory to a polylogarithmic function of the time horizon. These findings open up new perspectives, in particular for non-stationary scenarios in which the arm distributions evolve over time. We propose a variant of the algorithm in which only the most recent observations are used for subsampling, achieving optimal regret guarantees under the assumption of a known number of abrupt changes. Extensive numerical simulations highlight the merits of this approach, particularly when the changes are not only affecting the means of the rewards.

Based on http://proceedings.mlr.press/v139/baudry21b/baudry21b.pdf

**Ana Busic.** **Abstract:** The theory and application of mean field games has grown significantly since its origins less than two decades ago. We consider a special class in which the game is cooperative and the cost includes a control penalty defined by the Kullback-Leibler divergence, as commonly used in reinforcement learning and other fields. Its use as a control cost or regularizer is often preferred because it leads to a tractable solution. We will focus on a particular control paradigm, called Kullback-Leibler Quadratic optimal control, and illustrate its applications to distributed control of flexible power demand.