# Multi Armed Bandit Workshop

Date: 25th-26th September 2o19

Location: Imperial College London, Lecture theatre: Skempton 60 on Wednesday and Electrical and Electronic Engineering Building, Room 509, on Thursday. Campus map. Further travel information.

**** Skempton 60 is in the basement. Room 509 is on floor 5 but the elevator does not stop on floor 5, so it is best to take the stairs, or the elevator to floor 6 and then walking down one floor; the ground floor is floor 2, so walking up is not terrible. ****

**** Social event: We will go to the Gloucester arms pub (SW7 4RB), which is close to Imperial College, around 20:30-21:00. Everyone is welcome to join! ****

The Statistics group at the Department of Mathematics at Lancaster University, Imperial College London and Google Deepmind will be holding a specialist workshop on multi-armed bandits in London on the 25th and 26th of September.

The multi-armed bandit problem has become classic because it is well suited for modelling many real world problems while being amenable to theoretical analyses due to its simplicity. Researchers from different fields, especially statistics, operational research, applied probability and computer science, have made theoretical contributions that in the recent years started gaining wider popularity in practice.

This workshop will provide a forum for stimulating discussions, especially across the boundaries between computer science, statistics and operations research.

#### schedule

Wednesday (25.9)

09:30 – 09:40 Welcome

09:40 – 10:30 “Bandit PCA”
Gergely Neu (abstract) (slides)
10:30 – 11:20 “Laplacian-regularized Graph Bandits”
Laura Toni (abstract) (slides)

11:20 – 11:40 Coffee break

11:40 – 12:30 “Provably efficient reinforcement learning with rich observations”
Akshay Krishnamurthy (abstract) (slides)

12:30 – 14:00 Lunch break

14:00 – 14:50 “Weighted Linear Bandits for Non-Stationary Environments”
Olivier Cappé (abstract) (slides)
14:50 – 15:10 “Bandits with Distributional Context”
Johannes Kirschner, Ilija Bogunovic and Andreas Krause
(abstract)
15:10 – 15:30
“Non-Asymptotic Pure Exploration by Solving Games”
Rémy Degenne, Wouter M. Koolen and Pierre Ménard (abstract)

15:30 – 16:00 Coffee break

16:00 – 16:50 “Recent advances in both rested and restless rotting bandits”
Michal Valko (abstract)
16:50 – 17:10
“Fidelity Bandits”
Gábor Lugosi, Ciara Pike-Burke and Pierre-André Savalle (abstract)
17:10 – 17:30 “Sequential Monte Carlo Bandits”
Iñigo Urteaga and Chris H. Wiggins
(abstract)(slides)

Thursday (26.9)

09:40 – 10:30 “Nonstochastic Multiarmed Bandits with Unrestricted Delays”
Nicolò Cesa-Bianchi (abstract) (slides)
10:30 – 11:20
“Learning in structured MDPs with convex cost function: improved regret bounds for inventory management”
Shipra Agrawal (abstract)

11:20 – 11:40 Coffee break

11:40 – 12:30 “SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”
Vianney Perchet (abstract)

12:30 – 13:30 Lunch break

13:30 – 15:00 Poster session

15:00 – 15:20 Coffee break

15:20 – 16:10 “Stochastic Bandits with Changing Reward Distributions”
Ronald Ortner (abstract) (slides)
16:10 – 17:00
“Matching Regret Lower Bounds in Structured Stochastic Bandits”
Wouter Koolen (abstract) (slides)

#### Poster session

• “Online learning for optimal model adaptation? An empirical evaluation.” R. Bakirov.
• “BelMan: An Information Geometric Approach to Bandits.” D. Basu.
• “Preselection Bandits under the Plackett-Luce Model.” V. Bengs and E. Hüllermeier.
• “Reinforcement Learning on Online Markov Decision Processes with Concave Rewards.” W. Cheung.
• “A Thompson Sampling Algorithm for Cascading Bandits.” W. Cheung, V. Tan and Z. Zhong.
• “On Thompson Sampling for Lipschitz Bandits.” J. Grant and D. Leslie.
• “Best Arm Identification in Spectral Bandits.” T. Kocák and A. Garivier.
• “Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces.” J. Kirschner, M. Mutný, N. Hiller, R. Ischebeck and A. Krause
• “A Combinatorial Multi-Armed Bandit Approach for Online Multi-Item Pricing.” C. Kocyigit.
• “Boundary focused Thompson sampling.” P. Libin, T. Verstraeten, D. Roijers, W. Wang, K. Theys and A. Nowé.
• “Bandits in a changing environment.” G. Lugosi, G. Neu and J. Olkhovskaya.
• “Taking the Gini Out of the Bottle: Introducing the Equitable Bandit.” A. Lupu, C. Lyle and A. Durand.
• “Private and Warm Bandits.” M. Malekzadeh, D. Athanasakis, A. Nappa, B. Livshits.
• “Decentralized Cooperative Stochastic Multiarmed Bandits.” D. Martínez-Rubio, V. Kanade and P. Rebeschini.
• “Optimal Strategy for Graph-Structured Bandits.” O. Maillard, P. Ménard and H. Saber.
• “Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities.” A. Tossou, D. Basu and C. Dimitrakakis.
• “Near-optimal Egalitarian Bargaining Solution For Two Players Multi-Armed Bandits.” A. Tossou, C. Dimitrakakis, J. Rzepecki and K. Hofmann.
• “Censored Semi-Bandits: A Framework for Resource Allocation with Censored Feedback.” A. Verma, M. Hanawal, A. Rajkumar and R. Sankaran.
• “Multi-Agent Thompson Sampling.” T. Verstraeten, E. Bargiacchi, P. Libin, D. Roijers and A. Nowé.
• “The Off-Policy Evaluation Problem Through the Lens of Classical Statistics.” A. Wise and S. Grünewälder.

#### Abstract Submission

Format: 1 page, 12pt font, pdf only.

Send the abstract to s.grunewalder@lancaster.ac.uk

#### Registration

Register by email (s.grunewalder@lancaster.ac.uk). The workshop is free of charge.

** Please register early if you are likely to attend. This will allow us to have a good estimate of the number of attendees. **

** Abstract submission is optional and not a requirement for attendance **

#### Student support

**** We still have a significant amount of funding to support UK based students. Let us know if you need travel support ****

We have funding to support UK based students with travel expenses. Please send us an email if you need financial support.

#### Organizers

Contact: Steffen Grünewälder (s.grunewalder@lancaster.ac.uk)

London Mathematical Society, Google Deepmind, Data Science Institute Lancaster, Data Science Institute Imperial College London, Amazon Research.

#### Talk abstracts

###### Bandit PCA

We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment’s choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T \log T})$ after $T$ rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in $O(d)$ time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of $\Omega(d\sqrt{T})$.

###### Laplacian-regularized Graph Bandits

We study a stochastic linear bandit problem with multiple users, where we model user relationship as a graph, and assume parameters (preferences) of users form smooth signals on the graph. We propose a bandit algorithm employing a novel \textit{Upper Bound Confidence} (UCB) defined based on the random-walk normalized graph Laplacian. Our algorithm yields a regret bound of $\Tilde{O}(\Psi d \sqrt{T})$ for each user where $\Psi$ depends on the connectivity between this user and its neighbors. Specifically, $\Psi \in (0,1]$ and denser connectivity leads to smaller $\Psi$, hence to lower regret. This proves the gain in exploring the graph smoothness prior, as opposed to consider all users independently (with a single-user regret bound of $\Tilde{O}(d \sqrt{T})$). Moreover, involving users graph into bandit algorithms typically results in a computation complexity scaling quadratically with the number of users. Hence, we propose a simplified algorithm, which enjoys the same regret bound theoretically but scales linearly with the number of user. We also present a finite-time analysis on proposed algorithms, and demonstrate their advantage in comparison with state-of-the-art graph-based bandit algorithms on both synthetic and real-world datasets.

###### Provably efficient reinforcement learning with rich observations

I will present some new advances in provably efficient reinforcement learning with rich, high-dimensional observations. In the first part, I will describe a very general algorithm, which, while computationally inefficient, is provably sample efficient in a wide variety of rich-observation environments. Among these environments is a class that we call Block-MDPs, where each observation is associated with one of a small number of hidden states. In the second part, I will focus on this class and describe a computationally and statistically efficient approach. Both algorithms provide exponential improvements over prior approaches, and, for the latter, I will present some empirical results verifying these improvements.

###### Weighted Linear Bandits for Non-Stationary Environments

In this talk (based on joint work with Yoan Russac and Claire Vernade), we consider a  stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time.  To address this problem, we propose to use an optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. Obtaining theoretical guarantees on the behaviour of this approach, in both slowly-varying and abruptly-changing environments, involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions.

###### Bandits with Distributional Context

Contextual bandit algorithms have been successfully used in many applications, including computational advertising, recommender systems and experimental design. In many applications, the context is itself a noisy measurement or the output of some forecasting mechanism. To address this, we introduce a novel stochastic contextual bandit model, where at each step the adversary chooses a distribution over a context set. The learner observes only the context distribution, while the exact context realization remains hidden. By leveraging the UCB algorithm to this setting, we propose an algorithm that achieves a O(dT^(1/2)) high-probability regret bound for linearly parametrized reward functions. Our results strictly generalize previous work: Both our model and algorithm reduce to the linear bandit setting when the environment chooses Dirac delta distributions, i.e., provides the exact context to the learner. Moving on, we then consider a robust variant of this setting, akin to distributionally robust optimization (DRO). Here, the learner’s task is to propose a robust action choice for a context distribution that is chosen by an adversary within an epsilon-ball of a provided reference distribution.

###### Non-Asymptotic Pure Exploration by Solving Games

Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of the art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem.

###### Recent advances in both rested and restless rotting bandits

This stationarity of reward distribution  is often violated in practice (e.g., in recommender systems), where the reward of an arm may change whenever is selected, i.e., in rested bandits. We first consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the Filtering on Expanding Window Average (FEWA) algorithm that constructs moving averages of increasing windows to identify arms that are more likely to return high rewards when pulled once more. We prove that for an unknown horizon $T$, without any knowledge on the decreasing behavior of the $K$ arms, FEWA achieves problem-dependent regret bound of $\tcO(\log{(KT)}),$ and a problem-independent one of $\tcO(\sqrt{KT})$. Our result substantially improves over the algorithm of Levine et al. (2017), which suffers regret $\tcO(K^{1/3}T^{2/3})$.  Next, we give  a refined algorithm, the eXpanding Upper Confidence Bound (X-UCB) which computes the same statistics as FEWA, but uses them as upper confidence bound index strategy This algorithm shows better empirical behaviour and has a better agreement between the empirical and the theoretical tuning. Our algorithms recover the  same asymptotic guarantees as the near-optimal UCB1 in the stochastic bandits, thus showing that the rotting bandits are not more difficult. We finally consider two decreasing restless settings, i.e. where rewards decay independently of the actions we take. These problems have been considered as significantly more challenging since Levine et al. (2017) showed that known algorithms for restless bandits don’t learn in the rotting setting. By contrast, we show that both FEWA and X-UCB have a great level of adaptivity, as they recover near-optimal gap-free and gap-dependent bounds while being adaptive to the multiple rotting settings, the amount of change and the horizon. As no algorithm is known to achieve these characteristics for increasing reward setting, we believe that the decreasing property does make the problem easier than the increasing one. Joint work with Alexandra Carpentier, Alessandro Lazaric, Andrea Locatelli, Pierre Ménard, and Julien Seznec.

###### Fidelity Bandits

The fidelity bandits problem is a variant of the K-armed bandit problem in which the reward of each arm is augmented by a known nonnegative fidelity reward that provides the player with an additional payoff depending on how ‘loyal’ the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model the additional reward depends on the current number of consecutive draws of the arm. For each of these models, we study two natural classes of fidelity rewards, namely increasing, and coupons (where the player gets an additional reward every m plays of an arm). We consider both the stochastic and adversarial settings. Our aim is to determine for which models it is possible to achieve sub-linear regret. For those models which exhibit sub-linear regret, we provide simple and intuitive algorithms and bound their regret. For the models which do not necessarily enjoy sub-linear regret, we provide a worst case lower bound. For several of these models, we further show that there are classes of fidelity functions that enable us to avoid this worst case bound and achieve sub-linear regret.

###### Sequential Monte Carlo Bandits

This work extends existing multi-armed bandit (MAB) algorithms beyond their original settings by leveraging advances in sequential Monte Carlo (SMC) methods from the approximate inference community. By combining SMC with Bayesian state-of-the-art MAB algorithms, we extend their applicability to complex models: those for which sampling may be performed even if analytic computation of summary statistics is infeasible — nonlinear reward functions and dynamic bandits. We combine SMC both for Thompson sampling and upper confident bound-based (Bayes-UCB) policies, and study different dynamic bandit models: those with contextual linear-Gaussian, logistic and categorical-softmax rewards.

###### Nonstochastic Multiarmed Bandits with Unrestricted Delays

We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that the “delayed” Exp3 achieves the $O(\sqrt{(KT + D)\ln K})$ regret bound conjectured in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes this requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_\beta (|S_\beta|+\beta \ln K + (KT + D_\beta)\/\beta)$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_\beta$ is the total delay of observations with delay below $\beta$. The bound relaxes to $O(\sqrt{(KT + D)\ln K})$, but we also provide examples where $D_\beta \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters. Joint work with Tobias Thune and Yevgeny Seldin.

###### Learning in structured MDPs with convex cost function: improved regret bounds for inventory management

The stochastic inventory control problem under censored demands  is a fundamental problem in revenue and supply chain management. A simple class of policies called “base-stock policies” is known to be optimal for this problem, and further, convexity of long run average-cost under such policies has been established. In this work, we present a learning algorithm for the stochastic inventory control problem under lost sales penalty and positive lead times, when the demand distribution is a priori unknown. Our main result is a regret bound of O(L\sqrt{T}+D) for the algorithm, where T is the time horizon, L is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the number of time steps needed to generate enough demand for depleting one unit of inventory. Our results significantly improve the existing regret bounds for this problem. Notably, even though the state space of the underlying Markov Decision Process (MDP) in this problem is continuous and L-dimensional, our regret bounds depend linearly on L. Our techniques utilize convexity of the long run average cost  and a newly derived bound on `bias’ of base-stock policies, to establish an almost blackbox  connection between the problem of learning and optimization in such MDPs and stochastic convex bandit optimization. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex cost functions.

###### SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits

Motivated by cognitive radio networks, we consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and collisions occur if one of them is pulled by several players at the same stage. We present a decentralized algorithm that achieves the same performance as a centralized one, contradicting the existing lower bounds for that problem. This is possible by “hacking” the standard model by constructing a communication protocol between players that deliberately enforces collisions, allowing them to share their information at a negligible cost. This motivates the introduction of a more appropriate dynamic setting without sensing, where similar communication protocols are no longer possible. However, we show that the logarithmic growth of the regret is still achievable for this model with a new algorithm.

###### Stochastic Bandits with Changing Reward Distributions

This talk considers a variant of the stochastic multi-armed bandit problem where the reward distributions may change over time. If there is a fixed number of abrupt changes L and L is known to the learner, various algorithms with regret guarantees have been proposed in the literature. The focus of the talk (based on joint work with Peter Auer and Pratik Gajane) is on the case when L is unknown to the learner.

###### Matching Regret Lower Bounds in Structured Stochastic Bandits

UCB is king in stochastic bandits; it is computationally efficient, performs very well in practise and matches the regret rate lower bound. Can we achieve these same features under structural assumptions on the arm constellation (Lipschitzness, unimodality, sparsity, …)?We present a new strategy for doing so. Our starting point is the generic Graves & Lai instance-dependent regret lower bound, which takes the form of a semi-infinite linear program. We first recast it as a two-player zero-sum game. We then discuss iterative methods for equilibrium computation based on full-information online learning. We then obtain efficient strategies for the structured bandit problem from the resulting iterates. Moreover, we prove finite-time, asymptotically optimal regret guarantees for any structured bandit problem by leveraging the regret guarantees for the employed online learners.