Master's Dissertations in Reinforcement Learning

Since the foundation of our research group, a number of MSc students (Master of Science) have completed their dissertation projects in our group. These dissertations have covered a range of current topics in reinforcement learning, and some were published in top-tier conferences. This blog post lists MSc dissertations which were completed in our group.

Context: In the School of Informatics, the MSc degree is typically a 1-year programme with a final dissertation project completed over a period of around 3 months.


Data Collection for Policy Evaluation in Reinforcement Learning
Author: Rujie Zhong
In reinforcement learning (RL), policy evaluation is the task of estimating the expected return obtained when deploying a particular policy (called the evaluation policy) in the real environment. One common strategy for policy evaluation is On-policy Sampling (OS), which collects data by sampling actions from the evaluation policy and estimates the policy value by averaging the collected returns. However, due to sampling error, the empirical distribution of the data collected by OS can be arbitrarily different from the distribution under the evaluation policy, so the value estimate suffers from large variance. In this work, we propose two data collection strategies, Robust On-policy Sampling and Robust On-policy Acting, both of which take the previously collected data into account and reduce sampling error in future data collection. Experiments in different RL domains show that, for the same amount of data, this low-sampling-error data generally enables more accurate policy value estimation.
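
To make the notion of sampling error concrete, the following is a minimal sketch of plain On-policy Sampling on a hypothetical two-action problem with deterministic rewards. The policy, the rewards and the function name `on_policy_estimate` are illustrative assumptions, and the sketch does not implement the Robust On-policy Sampling or Robust On-policy Acting strategies proposed in the dissertation.

```python
import random

def on_policy_estimate(policy, reward_fn, n_samples, rng):
    """Average the returns of actions sampled from the evaluation policy."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    counts = {a: 0 for a in actions}
    total = 0.0
    for _ in range(n_samples):
        action = rng.choices(actions, weights=weights, k=1)[0]
        counts[action] += 1
        total += reward_fn(action)
    # Empirical action frequencies; these rarely match the policy exactly.
    empirical = {a: counts[a] / n_samples for a in actions}
    return total / n_samples, empirical

# Hypothetical evaluation policy and deterministic rewards (assumptions).
policy = {"left": 0.9, "right": 0.1}
rewards = {"left": 0.0, "right": 10.0}
true_value = sum(policy[a] * rewards[a] for a in policy)

rng = random.Random(0)
estimate, empirical = on_policy_estimate(policy, rewards.get, n_samples=20, rng=rng)
print("true value:", true_value)
print("estimate:", estimate)
print("empirical action frequencies:", empirical)
```

Because the rewards here are deterministic, any gap between the estimate and the true value comes entirely from the mismatch between the empirical action frequencies and the evaluation policy.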

Reinforcement Learning with Non-Conventional Value Function Approximation
Author: Antreas Tsiakkas
The use of a function approximation model to represent the value function has been a major contributor to the success of Reinforcement Learning methods. Despite its importance, the choice of model defaults to either a linear or a neural network-based approach, mainly due to their good empirical performance, the vast amount of research on them, and available resources. Still, these conventional approaches are not without limitations, and alternative approaches could offer advantages under certain criteria. Yet, these remain under-utilised due to a lack of understanding of the problems or requirements where their use would be beneficial. This dissertation provides empirical results on the performance, reliability, sample efficiency, training time, and interpretability of non-conventional value function approximation methods, which are evaluated under a consistent evaluation framework and compared to the conventional approaches. This allows the identification of the relative strengths and weaknesses of each model, which are highly dependent on the environment and task at hand. Results suggest that both the linear and neural network models suffer from sample inefficiencies, whilst alternative models (such as support vector regression) are significantly more sample efficient. Further, the neural network model is widely considered a black-box model, whilst alternative models (such as decision trees) offer interpretable architectures. Both limitations have become increasingly important in the adoption of Reinforcement Learning models in real-world applications, where they are required to be efficient, transparent and accountable. Hence, this work can inform future research and promote the use of non-conventional approaches as a viable alternative to the conventional approaches.

Multi-Agent Deep Reinforcement Learning: Revisiting MADDPG
Author: Joshua Wilkins
Multi-agent reinforcement learning (MARL) is a rapidly expanding field at the forefront of current research into artificial intelligence. We examine MADDPG, one of the first MARL algorithms to use deep reinforcement learning, on discrete action environments to determine whether its application of Gumbel-Softmax impacts its performance in terms of average and maximum returns. Our findings suggest that while Gumbel-Softmax negatively impacts performance, the deterministic policy that necessitates its use performs far better with the multi-agent actor-critic method than the stochastic alternative we propose without Gumbel-Softmax.
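
For readers unfamiliar with Gumbel-Softmax, here is a minimal sketch of the sampling step in plain Python. It only shows how a relaxed one-hot action is drawn from logits; the logits, temperatures and function name are illustrative assumptions, and the sketch is not the MADDPG implementation studied in the dissertation (which would also need gradients through this operation).

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample over discrete actions."""
    noisy = []
    for logit in logits:
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    norm = sum(exps)
    return [e / norm for e in exps]

logits = [1.0, 0.5, -1.0]  # hypothetical action preferences
# The same seed gives both calls identical noise, isolating the temperature effect.
soft = gumbel_softmax(logits, temperature=1.0, rng=random.Random(0))
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(0))
print("temperature 1.0:", soft)
print("temperature 0.1:", sharp)
```

With identical noise, lowering the temperature pushes the sample towards a one-hot vector, which is why the temperature is usually treated as a tunable hyperparameter.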

Mitigating Exploitation Caused by Incentivization in Multi-Agent Reinforcement Learning
Author: Paul Chelarescu
This thesis focuses on the issue of exploitation enabled by reinforcement learning agents being able to incentivize each other via reward-sending mechanisms. Motivated by the goal of creating cooperation in the face of sequential social dilemmas, incentivization is a method that allows agents to directly shape the learning process of other agents by influencing their rewards to incentivize cooperative actions. This method is versatile, but it also leads to scenarios in which agents can be exploited via their collaboration, when their environmental returns together with their received incentives end up being lower than if they had rejected collaboration and independently accumulated only environmental rewards. In my work, I have defined this behavior as exploitation via a metric concerned with how a set of decision makers can act to maximize their cumulative rewards in the presence of each other. I have expanded the action space to include a "reject reward" action that allows an agent to control whether it receives incentives from others in order to prevent exploitation. My results indicate that cooperation without exploitation is a delicate scenario to achieve, and that rejecting rewards from others usually leads to failure of cooperation.

Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?
Author: Panagiotis Kyriakou
Reinforcement learning is a machine learning sub-field, involving an agent performing sequential decision making and learning through trial and error inside a predefined environment. An important design decision for a reinforcement learning algorithm is the return formulation, which formulates the future expected returns that the agent receives after following any action in a specific environment state. In continuing tasks with value function approximation (VFA), average rewards and discounted returns can be used as the return formulation but it is unclear how the two formulations compare empirically. This dissertation aims at empirically comparing the two return formulations. We experiment with three continuing tasks of varying complexity, three learning algorithms and four different VFA methods. We conduct three experiments investigating the average performance over multiple hyperparameters, the performance with near-optimal hyperparameters and the hyperparameter sensitivity of each return formulation. Our results show that there is an apparent performance advantage in favour of the average rewards formulation because it is less sensitive to hyperparameters. Once hyperparameters are optimized, the two formulations seem to perform similarly.
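
As a small illustration of the two return formulations, the sketch below computes a discounted return and an average reward for a short, hypothetical reward sequence. In a true continuing task the average reward is defined as a long-run limit, so the finite sequence here is only a truncated stand-in, and the function names are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t, accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def average_reward(rewards):
    """Mean reward per step over the observed sequence."""
    return sum(rewards) / len(rewards)

rewards = [1.0, 0.0, 2.0]  # hypothetical rewards from three consecutive steps
print("discounted return (gamma=0.5):", discounted_return(rewards, gamma=0.5))
print("average reward:", average_reward(rewards))
```

The discounted return depends on the choice of the discount factor, whereas the average reward has no such hyperparameter.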


Deep Reinforcement Learning with Relational Forward Models: Can Simpler Methods Perform Better?
Author: Niklas Höpner
Learning in multi-agent systems is difficult due to the multi-agent credit assignment problem. Agents must obtain an understanding of how their reward depends on the other agents' actions. One approach to encourage coordination between a learning agent and teammate agents is augmenting deep reinforcement learning algorithms with an agent model. Relational forward models (RFM) (Tacchetti et al., 2019), a class of recurrent graph neural networks, try to leverage the relational inductive bias inherent in the set of agents and environment objects to provide a powerful agent model. Augmenting an actor-critic algorithm with action predictions from an RFM has been shown to improve the sample efficiency of the actor-critic algorithm. However, it is an open question whether agents equipped with an RFM also achieve a fixed level of reward in shorter computation time compared to agents without an embedded RFM. Since utilizing an RFM introduces a computational overhead, this need not be the case. To investigate this question, we follow the experimental methodology of Tacchetti et al. (2019) but additionally look at the computation time of the learning algorithm with and without an agent model. Further, we look into whether replacing the RFM with a computationally more efficient conditional action frequency (CAF) model provides an option for a better trade-off between sample efficiency and computational complexity. We replicate the improvement in sample efficiency caused by incorporating an RFM into the learning algorithm on one of the proposed environments from Tacchetti et al. (2019). Further, we show that in terms of computation time an agent without an agent model achieves the same reward up to ten hours faster. This suggests that in practice only reporting the sample efficiency gives an incomplete picture of the properties of any newly proposed algorithm. It should therefore become standard to report both the sample efficiency of an algorithm and the computation time needed to achieve a fixed reward.


Curiosity in Multi-Agent Reinforcement Learning
Author: Lukas Schäfer
Multi-agent reinforcement learning has seen considerable achievements on a variety of tasks. However, suboptimal conditions involving sparse feedback and partial observability, as frequently encountered in applications, remain a significant challenge. In this thesis, we apply curiosity as exploration bonuses to such multi-agent systems and analyse their impact on a variety of cooperative and competitive tasks. In addition, we consider modified scenarios involving sparse rewards and partial observability to evaluate the influence of curiosity on these challenges. We apply the independent Q-learning and state-of-the-art multi-agent deep deterministic policy gradient methods to these tasks with and without intrinsic rewards. Curiosity is defined using pseudo-counts of observations or relying on models to predict environment dynamics. Our evaluation illustrates that intrinsic rewards can cause considerable instability in training without benefiting exploration. This outcome can be observed on the original tasks and, against our expectation, under partial observability, where curiosity is unable to alleviate the introduced instability. However, curiosity leads to significantly improved stability and converged performance when applied to policy-gradient reinforcement learning with sparse rewards. While the sparsity causes training of such methods to be highly unstable, additional intrinsic rewards assist training and agents show intended behaviour on most tasks. This work contributes to understanding the impact of intrinsic rewards in challenging multi-agent reinforcement learning environments and will serve as a foundation for further research to expand on.
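
The idea of a count-based exploration bonus can be sketched as follows. This simplified version uses exact visit counts of hashable observations, whereas pseudo-counts generalise the same idea to large observation spaces; the class name, the `beta` scale and the inverse-square-root form are illustrative assumptions rather than the definitions used in the thesis.

```python
import math
from collections import defaultdict

class CountBonus:
    """Intrinsic reward that shrinks as an observation is seen more often."""

    def __init__(self, beta=1.0):
        self.beta = beta                # scale of the bonus (assumed hyperparameter)
        self.counts = defaultdict(int)  # visit count per observation

    def intrinsic_reward(self, observation):
        key = tuple(observation)
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])

bonus = CountBonus(beta=1.0)
visits = [bonus.intrinsic_reward((0, 0)) for _ in range(4)]
novel = bonus.intrinsic_reward((3, 1))
print("bonus over four visits to the same observation:", visits)
print("bonus for a new observation:", novel)
```

In training, such a bonus would typically be added to the environment reward, so frequently seen observations become less attractive than novel ones.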

Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?
Author: Lucas Descause
Discounted returns, a common metric used in Reinforcement Learning to estimate the value of a state or action, have been claimed to be ill-defined for continuing tasks using value function approximation. Instead, average rewards have been proposed as an alternative. We conduct an empirical comparison of discounted returns and average rewards on three different continuing tasks using various value function approximation methods. Results show a clear performance advantage for average rewards. Through further experiments we show that this advantage may not be caused by partial observability, as hypothesized by the claims against discounted returns. Instead, we show that this advantage may be due to average reward values requiring a simpler function to estimate than discounted returns.

Variational Autoencoders for Opponent Modelling in Multi-Agent Systems
Author: Georgios Papoudakis
Multi-agent systems exhibit complex behaviors that come from the interactions of multiple agents in a shared environment. Therefore, modelling the behavior of other agents is a key requirement in order to better understand the local perspective of each agent in the system. In this thesis, taking advantage of recent advances in unsupervised learning, we propose an approach to model opponents using variational autoencoders. Additionally, most existing methods in the literature make the assumption that models of opponents require access to opponents’ observations and actions both during training and execution. To address this limitation, we propose an inference method that tries to identify the underlying opponent model, using only local information, such as the observation, action and reward sequence of our agent. We provide a thorough experimental evaluation of the proposed methods, both for reinforcement learning as well as unsupervised learning. The experiments indicate that our opponent modelling method achieves equal or better performance in reinforcement learning tasks than another recent modelling method and other baselines.

Exploration in Multi-Agent Deep Reinforcement Learning
Author: Filippos Christianos
Much emphasis in reinforcement learning research is placed on exploration, ensuring that optimal policies are found. In multi-agent deep reinforcement learning, efficiency in exploration is even more important since the state-action spaces grow beyond our computational capabilities. In this thesis, we motivate and experiment on coordinating exploration between agents, to improve efficiency. We propose three novel methods of increasing complexity, which coordinate agents that explore an environment. We demonstrate through experiments that coordinated exploration outperforms non-coordinated variants.


Communication and Cooperation in Decentralized Multi-Agent Reinforcement Learning
Author: Nico Ring
Many real-world problems can only be solved when multiple agents work together. Each agent often has only a partial view of the whole environment. Therefore, communication and cooperation are important abilities for agents in these environments. In this thesis, we test the ability of an existing deep multi-agent actor-critic algorithm to cope with partially observable scenarios. In addition, we propose to adapt this algorithm to use recurrent neural networks, which enables using information from past steps to improve action-value predictions. We evaluate the algorithms in cooperative environments that require cooperation and communication. Furthermore, we simulate varying degrees of partial observability of actions and observations from other agents. Our experiments show that our proposed recurrent adaptation achieves equally good or significantly better success rates compared to the original method when only actions of other agents are partially observed. However, for other settings our adaptation can also perform worse than the original method, especially when the observations of other agents are only partially observed.