Structured and Continuous Reinforcement Learning
Disciplines
Computer Sciences (50%); Mathematics (50%)
Keywords
Reinforcement Learning, Regret Analysis, Computational Learning Theory
In reinforcement learning, an agent tries to learn optimal behavior in an unknown environment by evaluating the feedback (usually some quantifiable and comparable reward) to its actions. As the learner's actions may not pay off immediately, it must also be able to learn from delayed feedback, for example by accepting short-term discouraging feedback to achieve a long-term goal that yields a large positive reward. Thus, in typical reinforcement learning applications like robotics, control, or game playing, the learner will receive rewarding feedback only when a given task is finished after a series of coordinated actions which individually give no or even misleading feedback. While various reinforcement learning algorithms have been developed, a major breakthrough in practice has so far eluded these methods. One of the main problems with applying reinforcement learning algorithms to real-world problems is that typical algorithms are not efficient in large domains. Thus, while many potential applications could in principle be handled by reinforcement learning algorithms, from a practical point of view they are too costly, as their complexity and regret (the total reward lost with respect to an optimal strategy) grow linearly or even polynomially with the size of the underlying domain. One reason for this is that, unlike humans, reinforcement learning algorithms are usually not able to exploit similarities and structures in the domain of a problem. In a precursor project, together with scientists from the SequeL team at Inria Lille, an interdisciplinary center for reinforcement learning, we were able to define very general similarity structures for reinforcement learning problems in finite domains and to achieve improved theoretical regret bounds when the underlying similarity structure is known. The developed techniques and algorithms also led to the first theoretical regret bounds for reinforcement learning in continuous domains. The proposed project aims to take the research on continuous reinforcement learning, a setting which is of particular importance for applications, a step further, not only by improving over the known bounds, but also by developing efficient algorithms. Moreover, we also want to investigate more general settings where the learner does not have direct access to the domain information, but only to a set of possible models. For this setting as well, the precursor project has produced first theoretical results, assuming finite domains and that the set of possible models contains the correct model. In the proposed project, we aim at generalizing this to infinite domains and at loosening the assumption on the model set, which need not necessarily contain the correct model, but only a good approximation of it.
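To make the notion of regret used above concrete, the following is a minimal Python sketch under assumed values: the two-action toy environment, the horizon, and the optimal average reward are hypothetical and not taken from the project; it only illustrates regret as the total reward lost relative to an optimal strategy.

# Minimal sketch (not the project's algorithm): measuring the regret of a
# simple learner against an optimal strategy in a hypothetical toy environment.
import random

random.seed(0)

T = 10_000                  # learning horizon
OPTIMAL_AVG_REWARD = 0.8    # average per-step reward of the best action (assumed known here)

def environment(action):
    """Hypothetical two-action environment: action 1 is better on average."""
    return random.uniform(0.0, 1.6) if action == 1 else random.uniform(0.0, 0.6)

collected = 0.0
for t in range(T):
    # stand-in learner: mostly plays the good action, explores occasionally
    action = 1 if random.random() > 0.1 else random.choice([0, 1])
    collected += environment(action)

# Regret as described above: total reward lost with respect to the optimal strategy.
regret = OPTIMAL_AVG_REWARD * T - collected
print(f"regret after {T} steps: {regret:.1f}")

Because the stand-in learner occasionally plays the inferior action, its regret grows roughly linearly in T; the project's theoretical results concern algorithms whose regret grows much more slowly.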
In reinforcement learning, a learner wants to learn optimal behavior in an unknown environment. For example, the goal of the learner could be to reach a certain location or state, or to solve a complex task. The learning process itself is governed only by feedback from the environment. That is, the learner can observe the environment's reaction to its actions and, for example, obtains a reward for solving a given task. Since the solution of a task may require the execution of a longer sequence of coordinated actions, the learner must also be able to learn from delayed feedback, for example by accepting short-term discouraging feedback to achieve a long-term goal that yields a high reward. Problem settings of this kind are in principle solvable by existing reinforcement learning algorithms, which can even be shown theoretically to be able to solve any task, provided that the task has certain properties (such as that it is possible to recover from mistakes). At the same time, however, these algorithms are hardly applicable to real-world problems. This is mainly due to the fact that the representation of even the simplest problems gives rise to huge state spaces, so that the algorithms cannot solve these problems in reasonable time. In the project at hand we managed to develop reinforcement learning algorithms for problems with continuous state spaces, which are of particular importance in the context of applications but for which only a few theoretical results had been available so far. It could be shown that in well-behaved environments the new algorithm provably learns faster than known algorithms. Another question dealt with in the project was whether a learning algorithm can learn to use simpler representations in the learning process. More precisely, the learner is given a set of possible representations, some of which are suitable, while others can even be misleading. In this setting it could be shown that a learning algorithm developed in the project can learn successfully even if no completely correct representation is at its disposal. Instead, it is sufficient that at least one representation is a good approximation of the environment. It is particularly interesting that successful learning does not require identifying this representation, which can be more difficult and sometimes is even impossible.
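As an illustration of the representation-selection setting described above, here is a minimal hypothetical sketch (not the algorithm developed in the project): among several candidate representations, each making a different prediction about the environment, the learner keeps the one whose predictions best match its observations, without ever having to identify a perfectly correct representation. All names and numbers are invented for illustration.

# Hypothetical sketch: choosing among candidate representations (models) of an
# environment by comparing their reward predictions with observed rewards.
import random

random.seed(1)

TRUE_MEAN = 0.75   # unknown mean reward of the environment's best action

# Candidate models and their predicted mean reward; none is exactly correct,
# one is a good approximation, another is misleading.
models = {"good_approximation": 0.7, "rough": 0.5, "misleading": 0.2}

def environment():
    """Hypothetical environment: noisy reward around the true mean."""
    return TRUE_MEAN + random.uniform(-0.2, 0.2)

# Collect observations, then score each model by how far its prediction
# is from the empirical average reward.
observations = [environment() for _ in range(1000)]
empirical_mean = sum(observations) / len(observations)

errors = {name: abs(pred - empirical_mean) for name, pred in models.items()}
selected = min(errors, key=errors.get)

print(f"empirical mean reward: {empirical_mean:.3f}")
print(f"selected representation: {selected} (prediction error {errors[selected]:.3f})")

In this toy version the learner settles on the approximately correct representation simply because its predictions are consistent with the data; as stated above, a good approximation suffices, and the truly correct representation never needs to be identified.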
- Montanuniversität Leoben - 100%
Research Output
- 31 Citations
- 9 Publications
- 2015: Lakshmanan K, "Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning", journal article, JMLR Workshop and Conference Proceedings, Vol. 37: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015).
- 2014: Ortner R, "Regret bounds for restless Markov bandits", journal article, Theoretical Computer Science, pp. 62-76, DOI 10.1016/j.tcs.2014.09.026.
- 2016: Bartlett P et al., "Improved Learning Complexity in Combinatorial Pure Exploration Bandits", journal article, JMLR Workshop and Conference Proceedings, Vol. 51: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016).
- 2016: Auer P, "Pareto Front Identification from Stochastic Bandit Feedback", journal article, JMLR Workshop and Conference Proceedings, Vol. 51: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016).
- 2016: Auer P, "An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits", journal article, JMLR Workshop and Conference Proceedings: Proceedings of the 29th Conference on Learning Theory (COLT 2016).
- 2014: Ortner R, "Selecting Near-Optimal Approximate State Representations in Reinforcement Learning", book chapter, Springer Nature, pp. 140-154, DOI 10.1007/978-3-319-11662-4_11.
- 2014: Ortner R, "Selecting Near-Optimal Approximate State Representations in Reinforcement Learning", preprint, DOI 10.48550/arxiv.1405.2652.
- 2016: Ortner R, "Optimal Behavior is Easier to Learn than the Truth", journal article, Minds and Machines, pp. 243-252, DOI 10.1007/s11023-016-9389-y.
- 2016: Auer P, "An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits", preprint, DOI 10.48550/arxiv.1605.08722.