Unsupervised Environment Design for Zero-Shot Transfer in Reinforcement Learning

Author: Samuel Garcin

Date: 2024-04-28


Autonomous agents trained using deep reinforcement learning (RL) tend to overfit to the environment instances, or levels, encountered during training. This causes RL agents to perform significantly worse when transferring zero-shot (i.e. without additional training) to levels unseen during training. To quantify overfitting in RL agents we may measure the generalisation gap: the difference between expected returns over training levels and over held-out test levels, both sampled from the same distribution. Even for simple domains, such as gridworlds [1] or Atari-like games [2], it takes upwards of 10,000 training levels to make this gap disappear.

A level is instantiated within a simulator using the corresponding combination of simulator inputs, or level parameters. In a practical setting these parameters are often too costly to collect or manually specify in large numbers. This makes RL tricky to scale to real-world applications, even with access to a powerful simulator and large amounts of computing power.

In this blog post, we will discuss how Unsupervised Environment Design (UED) could help circumvent level parameter scarcity and enable RL agents to transfer to unseen levels more effectively. We will also argue that a reframing of UED may be necessary to unlock its full potential for zero-shot transfer.

What is UED?

UED [3] automates the selection of level parameters by reformulating RL as a game played between the agent and a “level generator”.

The principal difference between UED and robust adversarial RL [4] is that the agent and generator play a minimax regret game (instead of a maximin return game). In any given level, the regret is the difference between the maximum achievable return in that level and the agent’s current expected return.
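In symbols, for a level with parameters \( \theta \), the regret of a policy \( \pi \) is

\[ \text{Regret}(\pi, \theta) = \max_{\pi'} V_\theta(\pi') - V_\theta(\pi), \]

where \( V_\theta(\pi) \) denotes the expected return of \( \pi \) in the level instantiated by \( \theta \).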

In a minimax regret game, the agent follows the traditional return-maximisation objective (which, in this setting, minimises its regret), while the generator consistently plays the level with the highest regret. At Nash equilibrium the agent follows the minimax regret policy: the policy achieving the lowest maximum regret over the space of playable levels. In other words, it is the policy with the best possible worst-case performance.

UED for zero-shot transfer

For the remainder of this post, we will cover how UED may be employed to improve zero-shot transfer in RL, starting with a focus on recent developments.

The level replay buffer

Buffer-based strategies perform UED through a form of rejection sampling. Scores are attributed to generated levels and act as a proxy for the regret. For example, a possible scoring function would be the difference between the agent's current return and the maximum return it reached for that level. High scoring levels get added to a level replay buffer, which can be described as a library of training levels for the agent. During training, high scoring levels have a greater probability of being sampled from the buffer than levels with lower scores.
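As a concrete illustration, a minimal level replay buffer with score-proportional sampling might look like the sketch below. This is our own simplified illustration, not the implementation from any particular paper: the class name `LevelBuffer` and the keep-if-better eviction rule are illustrative choices.

```python
import numpy as np

class LevelBuffer:
    """Toy level replay buffer: keeps the highest-scoring levels and
    samples them with probability proportional to their score."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.levels = []   # level parameters
        self.scores = []   # proxy-regret scores (assumed positive)

    def add(self, level, score):
        # Rejection step: keep the level only if the buffer has room
        # or the level outscores the current worst entry.
        if len(self.levels) < self.capacity:
            self.levels.append(level)
            self.scores.append(score)
        else:
            worst = int(np.argmin(self.scores))
            if score > self.scores[worst]:
                self.levels[worst] = level
                self.scores[worst] = score

    def sample(self, rng):
        # Higher-scoring levels are replayed more often.
        probs = np.asarray(self.scores, dtype=np.float64)
        probs = probs / probs.sum()
        idx = rng.choice(len(self.levels), p=probs)
        return self.levels[idx]
```

With a capacity-2 buffer, adding levels scored 1.0, 3.0 and 2.0 evicts the lowest-scoring one, and subsequent training episodes replay the two survivors in proportion to their scores.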

Early UED algorithms such as PAIRED [3] parametrise the generator as another RL agent which generates a new level each episode. In contrast, buffer-based strategies maintain an adaptive training distribution over all previously generated levels, enabling them to repeatedly sample levels with high learning potential. While PAIRED is usually outclassed by a well-tuned implementation of Domain Randomisation (DR) [5-6], which samples parameters within pre-determined ranges, modern buffer-based approaches such as Robust PLR [7] or ACCEL [8] have demonstrated excellent zero-shot transfer on a variety of benchmarks. A direct reproduction of results from [7] is included below, in which buffer-based UED strategies (green, pink and blue) are shown to be competitive with or outperform DR when transferring zero-shot to reproductions of Formula-1 tracks in the CarRacing benchmark [9].

Addressing distributional shift

In most practical settings, the set of levels we wish the agent to generalise to doesn’t map to the entire level parameter space. For example, consider a fictional scenario in which we wish to train an agent to find its way around underground caves, which we model here as 2D gridworlds. Each square below corresponds to a different training level instantiated by the simulator. The agent (blue triangle) has a limited field of view and must find the exit (in bright green). It cannot pass through lava (orange) or wall (grey) tiles but may go through empty (black) and vegetation (green) tiles.

To generalise successfully, an agent should identify common semantics (i.e. shared rules or underlying meaning) between all levels. In our domain, vegetation tiles are more likely to be encountered near the exit, whereas orange lava tiles tend to appear deep within the cave. To transfer to unseen levels the agent needs to leverage these semantics and direct its search for the exit towards vegetation tiles, and away from lava tiles.

However, the full set of instantiable gridworlds consists of levels constructed from any possible tile arrangement, and UED methods may generate levels with different semantics. UED-trained agents may therefore learn behaviours that are not only suboptimal but could even be completely incompatible with the desired behaviour. For example, the generator could decide to systematically place lava tiles near the exit because these levels happen to be more challenging for the agent, reversing the original semantics.

Eliminating distributional shift over a known subset of parameters

SAMPLR [10] considers a setting in which we have knowledge of which “semantically relevant” simulator parameters change the desired policy, and we know the ground truth distribution of these parameters. Trajectory rollouts are collected from levels sampled via UED, and a second simulator is used to resample individual transitions and maintain consistency with the ground truth distribution.

The authors propose a motivating example sharing some similarity with the one explored above, in which an agent must find its way to a treasure room before being presented with a choice of either eating an apple or a banana. Which fruit is correct depends on the level, but under the ground truth distribution picking the banana maximises returns on average, making it the optimal choice. However, a UED strategy may induce a training distribution where this is not the case. In the figure below, directly reproduced from the paper, the agent fails to learn when directly sampling levels from the ground truth distribution (in black), because without the curriculum induced by UED it is difficult to achieve any reward. In contrast, the UED strategy (in green) does well during training but transfers poorly to levels sampled from the ground truth distribution. SAMPLR (in red) outperforms both other methods at test time by correcting for UED-induced distributional shift.

While SAMPLR ensures zero distributional shift over a subset of environment parameters, it makes several strong assumptions:

  1. The simulator parameters may be separated or mapped to semantically relevant and irrelevant sets.
  2. The ground truth distribution of the semantically relevant variables is known.
  3. The simulator can be reset to any state (instead of being resettable to any level).

While the last assumption is reasonable, the first two are rarely practical. In our example we cannot formulate a relationship between the vegetation and the exit tiles without first computing the shortest path from any tile to the exit tile, breaking assumption 1. Even if we can do so, we may not know the ground truth distribution of vegetation tiles with respect to the distance to the exit, breaking assumption 2.

Addressing distributional shift when the desired distribution is not known

Our recent work proposes a framework designed to work in settings where the above assumptions do not hold. Data Regularised Environment Design (DRED) [11] only requires access to a limited starting set of in-distribution level parameters. The level generator is replaced by a generative model trained to approximate the starting set’s underlying distribution.

New levels (below, on the right) are obtained by interpolating between pairs of dataset levels (on the left) within the generative model's latent space. These levels retain the semantics of the dataset levels.
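Schematically, the interpolation step can be sketched as below. This is a hedged illustration of the general idea, not DRED's actual implementation: `encoder` and `decoder` stand in for a trained generative model (e.g. a VAE over level parameters), and linear interpolation is one simple choice of latent-space path.

```python
import numpy as np

def interpolate_levels(encoder, decoder, params_a, params_b, alpha=0.5):
    """Generate a new level by encoding two dataset levels, linearly
    interpolating between their latent codes, and decoding the result.
    `alpha` controls how close the new level is to each parent."""
    z_a = encoder(params_a)
    z_b = encoder(params_b)
    z_new = (1.0 - alpha) * z_a + alpha * z_b  # linear interpolation in latent space
    return decoder(z_new)
```

Because the generative model is trained on the starting set, decoded midpoints tend to stay close to the data manifold, which is why the generated levels retain the dataset's semantics.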

When evaluated on held-out test levels, DRED outperforms agents restricted to the starting set, such as agents trained with PLR or uniform sampling. In contrast, UED methods such as Robust PLR or ACCEL diverge from the target distribution early on, and end up performing poorly on both the starting set and the test levels, even when the level buffer is initialised to the starting set.

In DRED, the mixing coefficient \( \eta \) controls the relative proportion of dataset and generated levels being sampled. We find that generated levels (even when sampled under an approximation of the true distribution) induce distributional shift and are particularly harmful in the early stages of training, so we limit their occurrence early on. DRED therefore minimises distributional shift instead of eliminating it, and we find that the additional data diversity obtained by gradually increasing \( \eta \) over the course of training may even be beneficial. In fact, DRED transfers two to three times better than the next best baseline to more difficult “edge cases”: levels with similar semantics to the starting set but slightly different underlying distributions.
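A minimal sketch of such a schedule is shown below. The linear annealing, the endpoint values, and the helper names are our own illustrative choices; the paper's actual schedule may differ.

```python
import random

def mixing_coefficient(step, total_steps, eta_start=0.0, eta_end=0.5):
    """Linearly anneal the proportion of generated levels sampled per
    episode, keeping generated levels rare early in training."""
    frac = min(step / total_steps, 1.0)
    return eta_start + frac * (eta_end - eta_start)

def sample_level(rng, dataset_levels, generated_levels, eta):
    """With probability eta sample a generated level, otherwise a dataset level."""
    pool = generated_levels if rng.random() < eta else dataset_levels
    return rng.choice(pool)
```

Early in training `eta` is near zero and the agent almost always sees dataset levels; by the end of training roughly half of the sampled levels are generated.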

UED has received a lot of attention since its introduction three years ago. Nevertheless, it is still a new research area, even by machine learning standards. As such, scaling UED to complex domains and studying its theoretical framing raise important open questions in the context of zero-shot transfer, which our research group is investigating. We will cover two of these open questions over the next two sections.

UED and the curse of dimensionality

The effectiveness of UED methods has so far only been demonstrated in domains with small level parameter spaces. In fact, fully unsupervised level parameter generators would not scale to high dimensional parameter spaces used in many practical applications (for example a robotic simulator may admit the space of all YAML files as its level parameter space, which would be impossible to navigate efficiently without any prior knowledge).

Fortunately, we can reasonably assume that the manifold hypothesis holds, i.e. the level parameters we care about occupy a small but highly structured subset of the much larger level space. In this case, scaling to arbitrarily large parameter spaces should be achievable by a data-driven framework such as DRED. Indeed, DRED’s learned generative model should improve in its approximation of the true parameter distribution as the amount, quality and diversity of the training data grows.

Challenging UED's minimax regret formulation

We’ve discussed above how UED fails to account for the fact that some levels may lead the agent to learn undesirable behaviours. However, there is another important contentious point with UED, which may necessitate employing a different theoretical framework when applying UED to zero-shot transfer than when applying it to continual, open-ended learning.

The minimax regret policy does not transfer to unseen levels

We point out that the Nash equilibrium of the minimax regret formulation becomes meaningless when the agent stops training and has to transfer to unseen levels zero-shot. We will demonstrate that, when training on a finite set of training levels (which is always the case in finite compute time), the policy that is maximally overfit to the training set is guaranteed to be a minimax regret policy. On the other hand, the policy maximising expected returns over the test set is in general not a minimax regret policy. This may seem counterintuitive, so let’s consider two minimal examples to demonstrate this notion.

In the above example, the agent (yellow) must navigate to the goal (green) in as few steps as possible. It cannot pass through walls (grey) and only observes tiles directly adjacent to itself (highlighted yellow). The agent trains over levels (a-c) but should transfer zero-shot to level (d) if it learns the policy of following blue tiles to the goal location. In contrast, learning an ensemble of level-specific policies will not, in general, result in successful transfer. In fact, level-specific policies make the agent act according to spurious causal relationships, such as

I must be in level (a) because out of (a,b,c), only (a) has this initial observation -> When in (a) I can follow black tiles to reach the goal faster -> The optimal policy in this level is to follow black tiles.

The (a)-specific policy won't transfer to (d) because this causal relationship does not apply to (d), even if (a) and (d) share the same initial observation.

By definition, a level-specific optimal policy has zero regret in that level, so the ensemble of level-specific optimal policies is always a minimax regret policy over the training set. Applying level-specific policies necessitates level identification, which requires the agent to have seen the level at least once during training. It naturally follows that an agent may only learn level-specific policies for training levels. On the other hand, a level-agnostic policy that successfully transfers to unseen test levels isn't necessarily optimal in any given training level. In our example, the policy following blue tiles has non-zero regret in level (a), and is therefore not a minimax regret policy.
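This argument can be made concrete with a toy calculation. The returns below are invented numbers for illustration, not measurements from the gridworld above: we compare an overfit ensemble of level-specific policies against a transferable "follow blue tiles" policy that is slightly suboptimal in level (a).

```python
# Best achievable return in each training level.
optimal = {"a": 10, "b": 10, "c": 10}

# returns[policy][level]: expected return of each policy in each level.
returns = {
    "ensemble":   {"a": 10, "b": 10, "c": 10},  # level-specific optimal policies
    "blue_tiles": {"a": 8,  "b": 10, "c": 10},  # transferable, suboptimal in (a)
}

def max_regret(policy):
    """Worst-case regret of a policy over the training set."""
    return max(optimal[level] - returns[policy][level] for level in optimal)

# The overfit ensemble achieves zero maximum regret, so it is a minimax
# regret policy; the policy that actually transfers is not.
assert max_regret("ensemble") == 0
assert max_regret("blue_tiles") == 2
```

Minimax regret over the training set is thus indifferent between the two, even though only the blue-tiles policy transfers to level (d).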

This issue isn’t restricted to partially observable settings. For example, let's consider the following scenario in which Mario must get over the pipe without touching the piranha plant

Mario may identify features unique to each training level to infer their identities and learn a level-specific policy. In this scenario both the ensemble of depicted level-specific policies and a policy directly avoiding piranha plants are minimax regret policies over the training set, but only the second policy will transfer.

Re-interpreting UED in the context of mutual information

There is therefore a trade-off when using a level buffer: on the one hand, the significantly improved sample efficiency and learning stability of buffer-based UED methods are in part possible because agents revisit levels with high learning potential many times during training. On the other hand, these repeated visits enable agents to identify individual training levels and learn level-specific policies.

Prior work [7] claims that value loss prioritisation strategies bias the agent towards learning minimax regret policies, and that this explains their effectiveness for zero-shot transfer. We find this explanation unconvincing given the arguments made above. In addition, the claim relies on a relationship between the positive value loss and the regret that does not formally hold. In [11], we propose a different explanation for the effectiveness of value loss prioritisation: we establish a connection between the value loss and the mutual information between the agent’s learned representation and the set of training levels. This lets us relate value loss prioritisation to the minimisation of an existing upper bound on the generalisation gap [12].

This has important implications for UED, as our work re-casts buffer-based UED as a combination of data regularisation techniques, which implicitly challenges the minimax regret formalism. Mutual information generalisation bounds have been extensively studied in supervised and self-supervised learning [13], and a re-interpretation of UED under this lens could help it reach its full potential in improving transfer for RL.

UED codebases

Many new UED benchmarks and codebases have been made available to researchers and practitioners in the last year. We will conclude this post with a non-exhaustive list of repositories to help you get started with your next UED project.

UED algorithm implementations across multiple benchmarks:

Additional benchmarks and environments directly relevant to UED:


  1. Zhang, C., Vinyals, O., Munos, R., & Bengio, S. (2018). A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893.
  2. Cobbe, K., Hesse, C., Hilton, J. & Schulman, J.. (2020). Leveraging Procedural Generation to Benchmark Reinforcement Learning. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:2048-2056.
  3. Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., & Levine, S. (2020). Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems, 33, 13049-13061.
  4. Pinto, L., Davidson, J., Sukthankar, R., & Gupta, A. (2017). Robust adversarial reinforcement learning. In International Conference on Machine Learning (pp. 2817-2826). PMLR.
  5. Jakobi, N. (1997). Evolutionary Robotics and the Radical Envelope-of-Noise Hypothesis. Adaptive Behavior, 6, 325 - 368.
  6. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 23-30). IEEE.
  7. Jiang, M., Dennis, M., Parker-Holder, J., Foerster, J., Grefenstette, E., & Rocktäschel, T. (2021). Replay-guided adversarial environment design. Advances in Neural Information Processing Systems, 34, 1884-1897.
  8. Parker-Holder, J., Jiang, M., Dennis, M., Samvelyan, M., Foerster, J., Grefenstette, E., & Rocktäschel, T. (2022, June). Evolving curricula with regret-based environment design. In International Conference on Machine Learning (pp. 17473-17498). PMLR.
  9. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  10. Jiang, M., Dennis, M., Parker-Holder, J., Lupu, A., Küttler, H., Grefenstette, E., ... & Foerster, J. (2022). Grounding aleatoric uncertainty for unsupervised environment design. Advances in Neural Information Processing Systems, 35, 32868-32881.
  11. Garcin, S., Doran, J., Guo, S., Lucas, C. G., & Albrecht, S. V. (2024). DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design. In International Conference on Machine Learning (ICML).
  12. Bertran, M., Martinez, N., Phielipp, M., & Sapiro, G. (2020). Instance-based generalization in reinforcement learning. Advances in Neural Information Processing Systems, 33, 11333-11344.
  13. Xu, A., & Raginsky, M. (2017). Information-theoretic analysis of generalization capability of learning algorithms. Advances in neural information processing systems, 30.