Structure in Reinforcement Learning
Disciplines
Computer Sciences (50%); Mathematics (50%)
Keywords
- Reinforcement Learning
- Regret
- Markov Decision Processes
- Computational Learning Theory
Markov decision processes (MDPs) are a generic tool for modeling stochastic environments and have found various applications since their introduction in the 1950s by Richard Bellman. In the 1980s, Artificial Intelligence research discovered MDPs as models for learning optimal behavior in environments with "delayed feedback". While various algorithms for reinforcement learning in unknown MDPs have been developed, these methods have not achieved a breakthrough, despite some success stories like the backgammon algorithm of Gerald Tesauro.

The major practical obstacle to applying reinforcement learning in many potential domains is that typical algorithms are not efficient in environments with large state spaces. While many real-world problems could in principle be represented as MDPs, such representations usually have a large state space or a large action space (and often both). Typical reinforcement learning algorithms are thus too costly, as their complexity and regret (the total reward lost with respect to an optimal strategy) grow linearly or even polynomially with the number of states and actions. The reason is that, unlike humans, who can exploit symmetries and similarities in a learning problem, most reinforcement learning algorithms are unable to make use of the environment's structure.

The main focus of the proposed project lies in the investigation of similarity structures for MDPs and the development of algorithms able to exploit such structures. The availability of tools that can deal with structured environments will make reinforcement learning much more attractive for problem domains that are currently handled by heuristics, by task-specific expert knowledge, or not at all. Applications would thus no longer be restricted to toy problems or to typical reinforcement learning domains like game playing.
Instead, more general control problems in areas such as robotics and logistics would become accessible to reinforcement learning methods. The proposed project will concentrate on the following two topics: First, similarity structures for state aggregation in MDPs shall be examined and, in a further step, exploited by adaptive online aggregation algorithms. Second, these aggregation techniques shall be applied to MDPs with continuous state space, a setting of particular importance for applications. In the design and analysis of algorithms, suitable upper confidence bounds will play a key role. The project shall be conducted within the SequeL team of INRIA Lille, an interdisciplinary center for reinforcement learning. Collaboration will not be confined to the SequeL group, however, as INRIA hosts other groups in neighboring fields such as optimization, statistics, and control theory, which may contribute to the success of the project.
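To give a flavor of the upper confidence bounds referred to above, the sketch below shows the classic UCB1 index of Auer et al. for the bandit setting, a simpler analogue of the confidence bounds used for full MDPs. The function name and the sample numbers are illustrative assumptions, not the project's actual algorithm.

```python
import math

# Illustrative sketch of the UCB1 index: empirical mean plus a confidence
# radius. Acting optimistically on this index drives exploration of
# poorly sampled actions. Inputs below are hypothetical.

def ucb_index(mean_estimate, n_pulls, t):
    """Optimistic value of an action after n_pulls samples at time t."""
    return mean_estimate + math.sqrt(2.0 * math.log(t) / n_pulls)

# An action tried only twice gets a larger exploration bonus than one
# tried 50 times, even though its empirical mean is lower:
print(ucb_index(0.5, 2, 100) > ucb_index(0.6, 50, 100))  # True
```

The design choice is that the bonus shrinks as an action is sampled more often, so the index concentrates on the empirical mean exactly when that mean becomes trustworthy.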
- Inria Lille - Nord Europe - 100%
Research Output
- 42 Citations
- 2 Publications
2012
- Ortner R. "Regret Bounds for Restless Markov Bandits." Book chapter, Springer Nature, pp. 214-228. DOI 10.1007/978-3-642-34106-9_19
- Ortner R. "Adaptive aggregation for reinforcement learning in average reward Markov decision processes." Annals of Operations Research, pp. 321-336. DOI 10.1007/s10479-012-1064-y