# Master's Dissertations in Reinforcement Learning

Since our research group was founded, a number of MSc (Master of Science) students have completed their dissertation projects with us. These dissertations have covered a range of current topics in reinforcement learning, and some were published in top-tier conferences. This blog post lists the MSc dissertations completed in our group.

*Context:* In the School of Informatics, the MSc degree is typically a 1-year programme with a final dissertation project completed over a period of around 3 months.

## 2021

**Data Collection for Policy Evaluation in Reinforcement Learning**

Author: Rujie Zhong

In reinforcement learning (RL), policy evaluation is the task of estimating the expected
return when deploying a particular policy (called the evaluation policy) in the real
environment. One common strategy for policy evaluation is On-policy Sampling
(OS), which collects data by sampling actions from the evaluation policy and
estimates the value by averaging the collected returns. However, due to sampling error,
the empirical distribution of the data collected by OS can differ arbitrarily from the
distribution under the evaluation policy, causing the value estimate to suffer from large variance.
In this work, we propose two data collection strategies, Robust On-policy Sampling
and Robust On-policy Acting, both of which take previously collected data into
account to reduce sampling error in future data collection. Experiments in different RL
domains show that, given the same amount of data, this low-sampling-error data generally
enables more accurate policy value estimation.
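
To make the sampling-error issue concrete, the following is a minimal sketch of plain On-policy Sampling on a one-step toy task. The task, the action names, and the reward values are hypothetical and are not taken from the dissertation; the sketch only illustrates the baseline estimator, not the proposed robust strategies.

```python
import random
from collections import Counter


def on_policy_estimate(policy, rewards, n_samples, seed=0):
    """Monte Carlo value estimate for a one-step (bandit-style) toy task.

    policy  : dict mapping action -> probability under the evaluation policy
    rewards : dict mapping action -> deterministic reward (hypothetical values)
    Returns the estimate and the empirical action frequencies of the sample.
    """
    rng = random.Random(seed)
    actions = list(policy)
    weights = [policy[a] for a in actions]
    # On-policy Sampling: draw actions from the evaluation policy itself.
    sampled = rng.choices(actions, weights=weights, k=n_samples)
    # Estimate the policy value by averaging the collected returns.
    estimate = sum(rewards[a] for a in sampled) / n_samples
    counts = Counter(sampled)
    empirical = {a: counts[a] / n_samples for a in actions}
    return estimate, empirical


def true_value(policy, rewards):
    """Exact expected return under the evaluation policy."""
    return sum(policy[a] * rewards[a] for a in policy)


policy = {"left": 0.9, "right": 0.1}
rewards = {"left": 0.0, "right": 10.0}
estimate, empirical = on_policy_estimate(policy, rewards, n_samples=20)
print("true value:", true_value(policy, rewards))
print("estimate:", estimate)
print("empirical action frequencies:", empirical)
```

With few samples, the empirical action frequencies rarely match the policy probabilities exactly, and the estimate is simply the true value formula evaluated with those empirical frequencies. This gap is the sampling error that the abstract refers to.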

**Reinforcement Learning with Non-Conventional Value Function Approximation**

Author: Antreas Tsiakkas

The use of a function approximation model to represent the value function has
been a major contributor to the success of Reinforcement Learning methods. Despite
its importance, the choice of model defaults to either a linear- or neural network-based
approach mainly due to their good empirical performance, vast amount of research,
and available resources. Still, these conventional approaches are not without limitations
and alternative approaches could offer advantages under certain criteria. Yet,
these remain under-utilised due to a lack of understanding of the problems or requirements
where their use would be beneficial. This dissertation provides empirical results
on the performance, reliability, sample efficiency, training time, and interpretability of
non-conventional value function approximation methods, which are evaluated under a
consistent evaluation framework and compared to the conventional approaches. This
allows the identification of the relative strengths and weaknesses of each model, which are
highly dependent on the environment and task at hand. Results suggest that both the
linear and neural network models suffer from sample inefficiencies, whilst alternative
models, such as support vector regression, are significantly more sample efficient.
Further, the neural network model is widely considered a black-box model whilst alternative
models, such as decision trees, offer interpretable architectures. Both limitations
have become increasingly important in the adoption of Reinforcement Learning
models in real-world applications where they are required to be efficient, transparent
and accountable. Hence, this work can inform future research and promote the use of
non-conventional approaches as a viable alternative to the conventional approaches.

**Multi-Agent Deep Reinforcement Learning: Revisiting MADDPG**

Author: Joshua Wilkins

Multi-agent reinforcement learning (MARL) is a rapidly expanding field at the forefront
of current research into artificial intelligence. We examine MADDPG, one of
the first MARL algorithms to use deep reinforcement learning, on discrete action environments
to determine whether its application of a Gumbel-Softmax impacts its performance
in terms of average and maximum returns. Our findings suggest that while
Gumbel-Softmax negatively impacts performance, the deterministic policy that necessitates
its use performs far better with the multi-agent actor-critic method than the
stochastic alternative we propose without Gumbel-Softmax.
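
For readers unfamiliar with the Gumbel-Softmax, the following is a minimal sketch of how a relaxed sample over discrete actions can be drawn. It uses only the Python standard library and does not compute gradients, so it illustrates the sampling step only and is not the implementation used in the dissertation or in MADDPG. The function name, the logits, and the temperature values are illustrative assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Draw one relaxed (soft) one-hot sample over discrete actions.

    Adds independent Gumbel(0, 1) noise to each logit, divides by the
    temperature, and applies a softmax. Lower temperatures push the
    result closer to a one-hot vector.
    """
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax.
    shift = max(noisy)
    exps = [math.exp(x - shift) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # hypothetical action preferences
soft = gumbel_softmax(logits, temperature=1.0, rng=random.Random(0))
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(0))
print("temperature 1.0:", soft)
print("temperature 0.1:", sharp)
```

Because both calls use the same seed, they share the same noise, so the most likely action is identical and only the sharpness of the distribution changes with the temperature.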

**Mitigating Exploitation Caused by Incentivization in Multi-Agent Reinforcement Learning**

Author: Paul Chelarescu

This thesis focuses on the issue of exploitation enabled by reinforcement learning
agents being able to incentivize each other via reward sending mechanisms. Motivated
by creating cooperation in the face of sequential social dilemmas, incentivization is a
method of allowing agents to directly shape the learning process of other agents by
influencing their rewards to incentivize cooperative actions. This method is versatile,
but it also leads to scenarios in which agents can be exploited via their collaboration
when their environmental returns together with their received incentives end up being
lower than if they had rejected collaboration and independently accumulated only environmental
rewards. In my work, I have defined this behavior as exploitation via a
metric with how a set of decision makers can act to maximize their cumulative rewards
in the presence of each other. I have expanded the action space to include a "reject reward"
action that allows an agent to control whether it receives incentives from others
in order to prevent exploitation. My results indicate that cooperation without exploitation
is a delicate scenario to achieve, and that rejecting rewards from others usually
leads to failure of cooperation.

**Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?**

Author: Panagiotis Kyriakou

Reinforcement learning is a machine learning sub-field, involving an agent performing
sequential decision making and learning through trial and error inside a predefined
environment. An important design decision for a reinforcement learning algorithm is
the return formulation, which formulates the future expected returns that the agent receives
after following any action in a specific environment state. In continuing tasks
with value function approximation (VFA), average rewards and discounted returns can
be used as the return formulation but it is unclear how the two formulations compare
empirically. This dissertation aims at empirically comparing the two return formulations.
We experiment with three continuing tasks of varying complexity, three learning
algorithms and four different VFA methods. We conduct three experiments investigating
the average performance over multiple hyperparameters, the performance with
near-optimal hyperparameters and the hyperparameter sensitivity of each return formulation.
Our results show that there is an apparent performance advantage in favour of
the average rewards formulation because it is less sensitive to hyperparameters. Once
hyperparameters are optimized, the two formulations seem to perform similarly.
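
The difference between the two return formulations can be illustrated on a finite reward sequence. The sketch below is an illustration under simplifying assumptions, not code from the dissertation: in a continuing task the average reward is a long-run quantity, which is approximated here by the mean over a finite window, and the reward values are made up.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**k, computed backwards in time."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def average_reward(rewards):
    """Mean reward per step over the observed window."""
    return sum(rewards) / len(rewards)


def differential_return(rewards):
    """Sum of rewards measured relative to the average reward."""
    r_bar = average_reward(rewards)
    return sum(r - r_bar for r in rewards)


rewards = [1.0, 0.0, 1.0, 0.0]  # hypothetical rewards from a continuing task
print("discounted return (gamma=0.5):", discounted_return(rewards, 0.5))
print("average reward:", average_reward(rewards))
print("differential return:", differential_return(rewards))
```

The discounted formulation needs a discount factor, which is an extra hyperparameter, whereas the average reward formulation compares rewards against their long-run mean.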

## 2020

**Deep Reinforcement Learning with Relational Forward Models: Can Simpler Methods Perform Better?**

Author: Niklas Höpner

Learning in multi-agent systems is difficult due to the multi-agent credit assignment
problem. Agents must obtain an understanding of how their reward depends on the other
agents' actions. One approach to encourage coordination between a learning agent and
teammate agents is augmenting deep reinforcement learning algorithms with an agent
model. Relational forward models (RFM) (Tacchetti et al., 2019), a class of recurrent
graph neural networks, try to leverage the relational inductive bias inherent in the set
of agents and environment objects to provide a powerful agent model. Augmenting an
actor-critic algorithm with action predictions from an RFM has been shown to improve
the sample efficiency of the actor-critic algorithm. However, it is an open question
whether agents equipped with an RFM also achieve a fixed level of reward in shorter
computation time compared to agents without an embedded RFM. Since utilizing an
RFM introduces a computational overhead, this need not be the case. To investigate
this question, we follow the experimental methodology of Tacchetti et al. (2019) but
additionally look at the computation time of the learning algorithm with and without
an agent model. Further, we look into whether replacing the RFM with a computationally
more efficient conditional action frequency (CAF) model provides an option for
a better trade-off between sample efficiency and computational complexity. We replicate
the improvement in sample efficiency caused by incorporating an RFM into the
learning algorithm on one of the proposed environments from Tacchetti et al. (2019).
Further, we show that, in terms of computation time, an agent without an agent model
achieves the same reward up to ten hours faster. This suggests that in practice only
reporting the sample efficiency gives an incomplete picture of the properties of any
newly proposed algorithm. It should therefore become standard to report both results
on the sample efficiency of an algorithm and results on the computation time
needed to achieve a fixed reward.

## 2019

**Curiosity in Multi-Agent Reinforcement Learning**

Author: Lukas Schäfer

Multi-agent reinforcement learning has seen considerable achievements on a variety
of tasks. However, suboptimal conditions involving sparse feedback and partial observability,
as frequently encountered in applications, remain a significant challenge.
In this thesis, we apply curiosity as exploration bonuses to such multi-agent systems
and analyse their impact on a variety of cooperative and competitive tasks. In addition,
we consider modified scenarios involving sparse rewards and partial observability to
evaluate the influence of curiosity on these challenges.
We apply the independent Q-learning and state-of-the-art multi-agent deep deterministic
policy gradient methods to these tasks with and without intrinsic rewards. Curiosity
is defined using pseudo-counts of observations or relying on models to predict
environment dynamics.
Our evaluation illustrates that intrinsic rewards can cause considerable instability
in training without benefiting exploration. This outcome can be observed on the original
tasks and against our expectation under partial observability, where curiosity is
unable to alleviate the introduced instability. However, curiosity leads to significantly
improved stability and converged performance when applied to policy-gradient reinforcement
learning with sparse rewards. While the sparsity causes training of such
methods to be highly unstable, additional intrinsic rewards assist training and agents
show intended behaviour on most tasks.
This work contributes to understanding the impact of intrinsic rewards in challenging
multi-agent reinforcement learning environments and will serve as a foundation for
further research to expand on.
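
As a rough illustration of count-based curiosity, the sketch below adds an exploration bonus that shrinks as an observation is revisited. It is a simplified stand-in and not the method from the thesis: it uses exact visit counts on hashable observations, whereas pseudo-counts are typically derived from a density model so that they generalise over large observation spaces. The class name, the `beta` scale, and the bonus form `beta / sqrt(N)` are assumptions made for illustration.

```python
import math
from collections import defaultdict


class CountBonus:
    """Intrinsic reward that decays with the visit count of an observation."""

    def __init__(self, beta=1.0):
        self.beta = beta                 # scale of the exploration bonus
        self.counts = defaultdict(int)   # visit count per observation

    def __call__(self, observation):
        key = tuple(observation)         # make the observation hashable
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])


bonus = CountBonus(beta=0.5)
extrinsic_reward = 0.0                   # sparse-reward setting: usually zero
for obs in [(0, 0), (0, 1), (0, 0), (0, 0)]:
    total_reward = extrinsic_reward + bonus(obs)
    print(obs, total_reward)
```

Novel observations receive the full bonus, while repeated ones receive progressively less, which is the basic mechanism by which such intrinsic rewards encourage exploration when extrinsic rewards are sparse.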

**Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?**

Author: Lucas Descause

Discounted returns, a common metric used in Reinforcement Learning to estimate the
value of a state or action, have been claimed to be ill-defined for continuing tasks using
value function approximation. Instead, average rewards have been proposed
as an alternative. We conduct an empirical comparison of discounted returns
and average rewards on three different continuing tasks using various value function
approximation methods. Results show a clear performance advantage for average rewards.
Through further experiments we show that this advantage may not be caused by
partial observability, as hypothesized by the claims against discounted returns.
Instead, we show that this advantage may be due to average reward values requiring a
simpler function to estimate than discounted returns.

**Variational Autoencoders for Opponent Modelling in Multi-Agent Systems**

Author: Georgios Papoudakis

Multi-agent systems exhibit complex behaviors that come from the interactions of
multiple agents in a shared environment. Therefore, modelling the behavior of other
agents is a key requirement in order to better understand the local perspective of each
agent in the system. In this thesis, taking advantage of recent advances in unsupervised
learning, we propose an approach to model opponents using variational autoencoders.
Additionally, most existing methods in the literature make the assumption that models
of opponents require access to opponents’ observations and actions both during training
and execution. To this end, we propose an inference method that tries to identify the
underlying opponent model, using only local information, such as the observation, action
and reward sequence of our agent. We provide a thorough experimental evaluation of
the proposed methods, both for reinforcement learning as well as unsupervised learning.
The experiments indicate that our opponent modelling method achieves equal or better
performance in reinforcement learning tasks than another recent modelling method and
other baselines.

**Exploration in Multi-Agent Deep Reinforcement Learning**

Author: Filippos Christianos

Much emphasis in reinforcement learning research is placed on exploration, ensuring that optimal policies are found. In multi-agent deep reinforcement learning, efficiency in exploration is even more important since the state-action spaces grow beyond our computational capabilities. In this thesis, we motivate and experiment on coordinating exploration between agents, to improve efficiency. We propose three novel methods of increasing complexity, which coordinate agents that explore an environment. We demonstrate through experiments that coordinated exploration outperforms non-coordinated variants.

## 2018

**Communication and Cooperation in Decentralized Multi-Agent Reinforcement Learning**

Author: Nico Ring

Many real-world problems can only be solved when multiple agents work together.
Each agent often has only a partial view of the whole environment. Therefore,
communication and cooperation are important abilities for agents in these environments.
In this thesis, we test the ability of an existing deep multi-agent
actor-critic algorithm to cope with partially observable scenarios. In addition, we
propose to adapt this algorithm to use recurrent neural networks, which enables
using information from past steps to improve action-value predictions. We evaluate
the algorithms in cooperative environments that require cooperation and
communication. Furthermore, we simulate varying degrees of partial observability
of actions and observations from other agents. Our experiments show that
our proposed recurrent adaptation achieves equally good or significantly better
success rates compared to the original method when only actions of other agents
are partially observed. However, for other settings our adaptation can also perform
worse than the original method, especially when the observations of other
agents are only partially observed.