Structured and Continuous Reinforcement Learning
Disciplines
Computer Sciences (50%); Mathematics (50%)
Keywords
Reinforcement Learning, Regret Analysis, Computational Learning Theory
In reinforcement learning, an agent tries to learn optimal behavior in an unknown environment by evaluating the feedback (usually some quantifiable and comparable reward) to its actions. As the learner's actions may not pay off immediately, it must also be able to learn from delayed feedback, for example by accepting short-term discouraging feedback to achieve a long-term goal that yields a large positive reward. Thus, in typical reinforcement learning applications like robotics, control, or game playing, the learner will receive rewarding feedback only when a given task is finished after a series of coordinated actions which individually give no or even misleading feedback. While various reinforcement learning algorithms have been developed, a major breakthrough in practice has so far eluded these methods. One of the main problems with applying reinforcement learning algorithms to real-world problems is that typical algorithms are not efficient in large domains. Thus, while many potential applications could in principle be handled by reinforcement learning algorithms, from a practical point of view they are too costly, as their complexity and regret (the total reward lost with respect to an optimal strategy) grow linearly or even polynomially with the size of the underlying domain. One reason for this is that, unlike humans, reinforcement learning algorithms are usually not able to exploit similarities and structures in the domain of a problem. In a precursor project, together with scientists from the SequeL team at Inria Lille, an interdisciplinary center for reinforcement learning, we were able to define very general similarity structures for reinforcement learning problems in finite domains and to achieve improved theoretical regret bounds when the underlying similarity structure is known. The developed techniques and algorithms also led to the first theoretical regret bounds for reinforcement learning in continuous domains. The proposed project aims to take the research on continuous reinforcement learning, a setting which is of particular importance for applications, a step further, not only by improving over the known bounds, but also by developing efficient algorithms. Moreover, we also want to investigate more general settings where the learner does not have direct access to the domain information, but only to a set of possible models. For this setting as well, the precursor project has produced first theoretical results, assuming finite domains and that the set of possible models contains the correct model. In the proposed project, we aim at generalizing this to infinite domains and at loosening the assumption on the model set, which need not necessarily contain the correct model, but only a good approximation of it.
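To make the notion of regret used above concrete, the following is a minimal Python sketch under assumed values: the two-action toy environment, the horizon, and the optimal average reward are hypothetical and not taken from the project; it only illustrates regret as the total reward lost relative to an optimal strategy.

# Minimal sketch (not the project's algorithm): measuring the regret of a
# simple learner against an optimal strategy in a hypothetical toy environment.
import random

random.seed(0)

T = 10_000                  # learning horizon
OPTIMAL_AVG_REWARD = 0.8    # average per-step reward of the best action (assumed known here)

def environment(action):
    """Hypothetical two-action environment: action 1 is better on average."""
    return random.uniform(0.0, 1.6) if action == 1 else random.uniform(0.0, 0.6)

collected = 0.0
for t in range(T):
    # stand-in learner: mostly plays the good action, explores occasionally
    action = 1 if random.random() > 0.1 else random.choice([0, 1])
    collected += environment(action)

# Regret as described above: total reward lost with respect to the optimal strategy.
regret = OPTIMAL_AVG_REWARD * T - collected
print(f"regret after {T} steps: {regret:.1f}")

Because the stand-in learner occasionally plays the inferior action, its regret grows roughly linearly in T; the project's theoretical results concern algorithms whose regret grows much more slowly.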
In reinforcement learning, a learner wants to learn optimal behavior in an unknown environment. For example, the goal of the learner could be to reach a certain location or state, or to solve a complex task. The learning process itself is governed only by feedback from the environment. That is, the learner can observe the environment's reaction to its actions and, for example, obtains a reward for solving a given task. Since the solution of a task may require the execution of a longer sequence of coordinated actions, the learner must also be able to learn from delayed feedback, for example by accepting short-term discouraging feedback to achieve a long-term goal that yields a high reward. Problem settings of this kind are in principle solvable by existing reinforcement learning algorithms, which can even be shown theoretically to be able to solve any task, provided that the task has certain properties (such as that it is possible to recover from mistakes). At the same time, however, these algorithms are hardly applicable to real-world problems. This is mainly due to the fact that the representation of even the simplest problems gives rise to huge state spaces, so that the algorithms cannot solve these problems in reasonable time. In the project at hand we managed to develop reinforcement learning algorithms for problems with continuous state spaces, which are of particular importance in the context of applications but for which only a few theoretical results had been available so far. It could be shown that in well-behaved environments the new algorithm provably learns faster than known algorithms. Another question dealt with in the project was whether a learning algorithm can learn to use simpler representations in the learning process. More precisely, the learner is given a set of possible representations, some of which are suitable, while others can even be misleading. In this setting it could be shown that a learning algorithm developed in the project can learn successfully even if no completely correct representation is at its disposal. Instead, it is sufficient that at least one representation is a good approximation of the environment. It is particularly interesting that successful learning does not require identifying this representation, which can be more difficult and sometimes is even impossible.
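As an illustration of the representation-selection setting described above, here is a minimal hypothetical sketch (not the algorithm developed in the project): among several candidate representations, each making a different prediction about the environment, the learner keeps the one whose predictions best match its observations, without ever having to identify a perfectly correct representation. All names and numbers are invented for illustration.

# Hypothetical sketch: choosing among candidate representations (models) of an
# environment by comparing their reward predictions with observed rewards.
import random

random.seed(1)

TRUE_MEAN = 0.75   # unknown mean reward of the environment's best action

# Candidate models and their predicted mean reward; none is exactly correct,
# one is a good approximation, another is misleading.
models = {"good_approximation": 0.7, "rough": 0.5, "misleading": 0.2}

def environment():
    """Hypothetical environment: noisy reward around the true mean."""
    return TRUE_MEAN + random.uniform(-0.2, 0.2)

# Collect observations, then score each model by how far its prediction
# is from the empirical average reward.
observations = [environment() for _ in range(1000)]
empirical_mean = sum(observations) / len(observations)

errors = {name: abs(pred - empirical_mean) for name, pred in models.items()}
selected = min(errors, key=errors.get)

print(f"empirical mean reward: {empirical_mean:.3f}")
print(f"selected representation: {selected} (prediction error {errors[selected]:.3f})")

In this toy version the learner settles on the approximately correct representation simply because its predictions are consistent with the data; as stated above, a good approximation suffices, and the truly correct representation never needs to be identified.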
- Montanuniversität Leoben - 100%
Research Output
- 31 Citations
- 9 Publications
- 2015: Lakshmanan K, "Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning", journal article, JMLR Workshop and Conference Proceedings, Vol. 37: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015).
- 2014: Ortner R, "Regret bounds for restless Markov bandits", journal article, Theoretical Computer Science, pp. 62-76, DOI 10.1016/j.tcs.2014.09.026.
- 2016: Bartlett P et al., "Improved Learning Complexity in Combinatorial Pure Exploration Bandits", journal article, JMLR Workshop and Conference Proceedings, Vol. 51: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016).
- 2016: Auer P, "Pareto Front Identification from Stochastic Bandit Feedback", journal article, JMLR Workshop and Conference Proceedings, Vol. 51: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016).
- 2016: Auer P, "An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits", journal article, JMLR Workshop and Conference Proceedings: Proceedings of the 29th Conference on Learning Theory (COLT 2016).
- 2014: Ortner R, "Selecting Near-Optimal Approximate State Representations in Reinforcement Learning", book chapter, Springer Nature, pp. 140-154, DOI 10.1007/978-3-319-11662-4_11.
- 2014: Ortner R, "Selecting Near-Optimal Approximate State Representations in Reinforcement Learning", preprint, DOI 10.48550/arxiv.1405.2652.
- 2016: Ortner R, "Optimal Behavior is Easier to Learn than the Truth", journal article, Minds and Machines, pp. 243-252, DOI 10.1007/s11023-016-9389-y.
- 2016: Auer P, "An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits", preprint, DOI 10.48550/arxiv.1605.08722.