# On the performance of online concurrent reinforcement learners

Multiagent Reinforcement Learning (MARL) is significantly more complicated than single-agent Reinforcement Learning (RL) because multiple learners render each other's environments non-stationary. While RL, like most of machine learning, focuses on learning a fixed target function, MARL must learn a moving target: each learner faces an extra level of uncertainty in the form of the behaviors of the other learners in the domain. Existing learning methods provide guarantees about a learner's performance only in the limit, since a learner approaches its desired behavior asymptotically; there is little insight into how well or how poorly an online learner can perform while it is still learning. This is the core problem studied in this dissertation, resulting in the following contributions. First, the dissertation analyzes some existing MARL algorithms and their online performance when pitted against each other in adversarial situations. This analysis yields a novel characteristic of many MARL algorithms, which we call reactivity [1], that explains the observed behaviors of the algorithms.
Optimizing this characteristic could produce safe learners, i.e., learners that can guarantee good payoffs in competitive situations, but the optimization involves a tradeoff with the agent's noise sensitivity. Second, the dissertation sets up a novel mix of goals for a new MARL algorithm that achieves some basic learning objectives without knowing the types of the other agents: (1) learn the best-response behavior when the other agents in the domain exhibit (eventually) stationary behavior, and (2) jointly converge to a mutual equilibrium behavior when the other agents use the same learning algorithm, while also ensuring that, when the other agents are of neither type, the learner achieves some minimum average payoff that is 'good' in some sense. Third, the dissertation extends the class of no-regret algorithms to yield a class of algorithms that are shown [3] to achieve, in polynomial time with high likelihood, (1) close to best-response payoffs against (eventually) stationary opponents, (2) close to the best possible asymptotic payoffs against converging opponents, and (3) close to at least the minimax payoffs against any other opponents. Fourth, the dissertation explores the cost of learning when the opponents are also adaptive. Lastly, the dissertation validates all of its novel techniques and algorithms empirically, comparing them with existing techniques in simulations. (Abstract shortened by UMI.)