Multi-Agent Learning Environments

Author: Lukas Schäfer

Date: 2021-03-19

Follow @UoE_Agents on Twitter

This blog post provides an overview of a range of multi-agent reinforcement learning (MARL) environments with their main properties and learning challenges. We list the environments and properties in the below table, with quick links to their respective sections in this blog post. At the end of this post, we also mention some general frameworks which support a variety of environments and game modes.

| Env | Type | Observations | Actions | Observability |
| --- | --- | --- | --- | --- |
| LBF | Mixed | Discrete | Discrete | Both |
| PressurePlate | Collaborative | Discrete | Discrete | Partial |
| RWARE | Collaborative | Discrete | Discrete | Partial |
| MPE | Mixed | Continuous | Both | Both |
| SMAC | Cooperative | Continuous | Discrete | Partial |
| MALMO | Mixed | Continuous (Pixels) | Discrete | Partial |
| Pommerman | Competitive | Discrete | Discrete | Both |
| DM Lab | Mixed | Continuous (Pixels) | Discrete | Partial |
| DM Lab2D | Mixed | Discrete | Discrete | Partial |
| Derk's Gym | Competitive | Discrete | Mixed | Partial |
| Flatland | Collaborative | Continuous | Discrete | Both |
| Hanabi | Cooperative | Discrete | Discrete | Partial |
| Neural MMO | Competitive | Discrete | Multi-Discrete | Partial |

We use the term "task" to refer to a specific configuration of an environment (e.g. setting a specific world size, number of agents, etc.), as we did in our SEAC [5] and MARL benchmark [16] papers. We say a task is "cooperative" if all agents receive the same reward at each timestep. The task is "competitive" if there is some form of competition between agents, i.e. one agent's gain is another agent's loss. We loosely call a task "collaborative" if the agents' ultimate goals are aligned and agents cooperate, but their received rewards are not identical. Based on these task types, we say an environment is cooperative, competitive, or collaborative if it only supports tasks of the respective type. We call an environment "mixed" if it supports more than one type of task.

For observations, we distinguish between discrete feature vectors, continuous feature vectors, and Continuous (Pixels) for image observations. For actions, we distinguish between discrete actions, multi-discrete actions where agents choose multiple (separate) discrete actions at each timestep, and continuous actions. The action space is "Both" if the environment supports discrete and continuous actions.

Level-Based Foraging

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Mixed | Discrete | Discrete | Environment | [5, 16] |

General Description

The Level-Based Foraging environment consists of mixed cooperative-competitive tasks focusing on the coordination of the involved agents. The task for each agent is to navigate the grid-world map and collect items. Each agent and item is assigned a level, and items are randomly scattered in the environment. In order to collect an item, an agent has to select the loading action while positioned next to the item. However, the collection is only successful if the sum of the involved agents' levels is equal to or greater than the item's level. Agents receive a reward equal to the level of the collected item. Below, you can see visualisations of a collection of possible tasks. By default, every agent can observe the whole map, including the positions and levels of all entities, and can choose to act by moving in one of four directions or attempting to load an item. In the partially observable version, denoted with ‘sight=2’, agents can only observe entities in a \(5 \times 5\) grid surrounding them. Rewards are fairly sparse depending on the task, as agents might have to cooperate (by picking up the same item at the same timestep) to receive any rewards.

For more details, see our blog post here.

(a) Illustration of LBF-8x8-2p-3f

(b) Illustration of LBF-8x8-2p-2f-coop

(c) Illustration of LBF-8x8-3p-1f-coop

(d) Illustration of LBF-10x10-2p-8f

Example Tasks

LBF-8x8-2p-3f: An \(8 \times 8\) grid-world with two agents and three items placed in random locations. Item levels are random and might require agents to cooperate, depending on the level.

LBF-8x8-2p-2f-coop: An \(8 \times 8\) grid-world with two agents and two items. This is a cooperative version and agents will always need to collect an item simultaneously (cooperate).

LBF-8x8-3p-1f-coop: An \(8 \times 8\) grid-world with three agents and one item. This is a cooperative version and all three agents will need to collect the item simultaneously.

LBF-10x10-2p-8f: A \(10 \times 10\) grid-world with two agents and eight items. The time limit (25 timesteps) is often not enough for all items to be collected. Therefore, the agents need to spread out and collect as many items as possible in the short amount of time.

LBF-8x8-2p-3f, sight=2: Similar to the first variation, but partially observable. The agent’s vision is limited to a \(5 \times 5\) box centred around the agent.
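To try one of these tasks, a minimal random-agent loop looks roughly like the following sketch. It assumes the lbforaging package is installed; the environment IDs follow the task names above, although the exact ID pattern and version suffix may differ between releases.

```python
# Minimal random-agent sketch for Level-Based Foraging.
# Assumes the lbforaging package is installed; the "-v2" suffix may differ.
import gym
import lbforaging  # noqa: F401  (importing registers the Foraging-* Gym IDs)

env = gym.make("Foraging-8x8-2p-3f-v2")

observations = env.reset()               # one observation per agent
dones = [False]
while not all(dones):
    actions = env.action_space.sample()  # one discrete action per agent
    observations, rewards, dones, info = env.step(actions)
env.close()
```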

PressurePlate

Visualisation of PressurePlate linear task with 4 agents

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Collaborative | Discrete | Discrete | Environment | / |

General Description

PressurePlate is a multi-agent environment, based on the Level-Based Foraging environment, that requires agents to cooperate during the traversal of a gridworld. The grid is partitioned into a series of connected rooms, each containing a pressure plate and a closed doorway. At the beginning of an episode, each agent is assigned a plate that only they can activate by moving to and staying on its location. Activating the pressure plate opens the doorway to the next room. Therefore, the agents must move through the sequence of rooms together, and in each room the agent assigned to that room's pressure plate has to stay behind, activating the plate, so that the remaining agents can proceed to the next room. The task is considered solved when the goal (depicted with a treasure chest) is reached.

In this environment, agents observe a grid centred on their location, with the size of the observed grid being parameterised. The observed 2D grid has several layers indicating the locations of agents, walls, doors, plates and the goal location in the form of binary 2D arrays. Agents receive these 2D grids as a flattened vector together with their x- and y-coordinates. The action space is identical to Level-Based Foraging, with actions for each cardinal direction and a no-op (do nothing) action. Rewards in PressurePlate tasks are dense, reflecting the distance between an agent's location and its assigned pressure plate. Agents need to cooperate but receive individual rewards, making PressurePlate tasks collaborative. Currently, three PressurePlate tasks with four to six agents are supported, with rooms structured in a linear sequence. However, an interface is provided to define custom task layouts.

For more details, see the documentation in the Github repository.
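The sketch below is hypothetical: the environment ID is an assumption based on the repository's naming scheme and may not match the installed version exactly, so check the repository's documentation first.

```python
# Hypothetical usage sketch for PressurePlate; the environment ID below is an
# assumption based on the repository's naming scheme and may differ.
import gym
import pressureplate  # noqa: F401  (registers the PressurePlate Gym IDs)

env = gym.make("pressureplate-linear-4p-v0")   # linear layout, four agents
observations = env.reset()
dones = [False]
while not all(dones):
    actions = env.action_space.sample()        # cardinal moves plus no-op
    observations, rewards, dones, info = env.step(actions)
env.close()
```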

Multi-Robot Warehouse

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Collaborative | Discrete | Discrete | Environment | [5, 16] |

(a) Illustration of RWARE tiny size, two agents

(b) Illustration of RWARE small size, two agents

(c) Illustration of RWARE medium size, four agents

General Description

The multi-robot warehouse environment simulates a warehouse with robots moving and delivering requested goods. In real-world applications [23], robots pick up shelves and deliver them to a workstation. Humans assess the content of a shelf, and then robots can return it to an empty shelf location. In this simulation of the environment, agents control robots and the action space for each agent is

\(A = \{\text{Turn Left},\ \text{Turn Right},\ \text{Forward},\ \text{Load/Unload Shelf}\}\)

Agents can move beneath shelves when they are not carrying anything, but when carrying a shelf, agents must use the corridors in between (see visualisation above). The observation of an agent consists of a \(3 \times 3\) square centred on the agent. It contains information about the surrounding agents (location/rotation) and shelves. At any given time, a fixed number of shelves \(R\) is requested. When a requested shelf is brought to a goal location, another currently not requested shelf is uniformly sampled and added to the current requests. Agents receive a reward of 1 for successfully delivering a requested shelf to a goal location. A major challenge in this environment is for agents not only to deliver requested shelves but also to afterwards find an empty location at which to return the previously delivered shelf, since agents need to put down their current shelf before they can pick up a new one. This leads to a very sparse reward signal. Since this is a collaborative task, we use the sum of undiscounted returns of all agents as a performance metric. The multi-robot warehouse task is parameterised by the warehouse size (e.g. tiny, small, medium), the number of agents, and the number of requested shelves \(R\).

For more details, see our blog post here.
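A minimal random-agent sketch, assuming the rware package is installed (the environment ID and version suffix may differ between releases):

```python
# Minimal sketch for the multi-robot warehouse (RWARE); the "rware-tiny-2ag-v1"
# ID is assumed to be registered by the installed rware package.
import gym
import rware  # noqa: F401  (registers the rware-* Gym IDs)

env = gym.make("rware-tiny-2ag-v1")        # tiny warehouse, two agents
observations = env.reset()
for _ in range(500):
    # one discrete action per agent (turn, move forward, load/unload)
    actions = env.action_space.sample()
    observations, rewards, dones, info = env.step(actions)
    if all(dones):
        break
env.close()
```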

Multi-Agent Particle Environment

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Mixed | Continuous | Continuous / Discrete | Environment | [15, 12, 7, 20] |

General Description

This environment contains a diverse set of 2D tasks involving cooperation and competition between agents. In all tasks, particles (representing agents) interact with landmarks and other agents to achieve various goals. Observations consist of high-level feature vectors containing relative distances to other agents and landmarks, as well as, in some tasks, additional information such as communication or velocity. The action space across all tasks and agents is discrete and usually includes five possible actions corresponding to no movement, move right, move left, move up or move down, with additional communication actions in some tasks. However, there are also options to use continuous action spaces (although all publications I am aware of use discrete action spaces). Reward signals in these tasks are dense, and tasks range from fully cooperative to competitive and team-based scenarios. Most tasks are defined by Lowe et al. [12], with additional tasks introduced by Iqbal and Sha [7] (code available here) and partially observable variations defined as part of my MSc thesis [20] (code available here). It is comparably simple to modify existing tasks or even create entirely new tasks if needed.

(a) Illustration of Speaker-Listener

(b) Illustration of Spread

(c) Illustration of Adversary

(d) Illustration of Predator-Prey

(e) Illustration of Multi Speaker-Listener

(f) Illustration of Collect Treasure

Common Tasks [12, 7]

MPE Speaker-Listener [12]: In this fully cooperative task, one static speaker agent has to communicate a goal landmark to a listening agent capable of moving. There are a total of three landmarks in the environment and both agents are rewarded with the negative Euclidean distance of the listener agent to the goal landmark. The speaker agent only observes the colour of the goal landmark. Meanwhile, the listener agent receives its velocity, relative position to each landmark and the communication of the speaker agent as its observation. The speaker agent chooses between three possible discrete communication actions while the listener agent follows the typical five discrete movement actions of MPE tasks.

MPE Spread [12]: In this fully cooperative task, three agents are trained to move to three landmarks while avoiding collisions with each other. All agents receive their velocity, position, relative position to all other agents and landmarks. The action space of each agent contains five discrete movement actions. Agents are rewarded with the sum of negative minimum distances from each landmark to any agent and an additional term is added to punish collisions among agents.

MPE Adversary [12]: In this competitive task, two cooperating agents compete with a third adversary agent. There are two landmarks, one of which is randomly selected to be the goal landmark. Cooperative agents receive their relative position to the goal as well as relative positions to all other agents and landmarks as observations. The adversary agent, however, observes all relative positions without receiving information about the goal landmark. All agents have five discrete movement actions. Agents are rewarded with the negative minimum distance to the goal, while the cooperative agents are additionally rewarded for the distance of the adversary agent to the goal landmark. Therefore, the cooperative agents have to move to both landmarks to prevent the adversary from identifying which landmark is the goal and reaching it as well.

MPE Predator-Prey [12]: In this competitive task, three cooperating predators hunt a fourth agent controlling a faster prey. Two landmarks are placed in the environment as obstacles. All agents receive their own velocity and position as well as relative positions to all other landmarks and agents as observations. Predator agents also observe the velocity of the prey. All agents choose among five movement actions. The agent controlling the prey is punished for any collisions with predators as well as for leaving the observable environment area (to prevent it from simply running away and instead encourage it to learn to evade). Predator agents are collectively rewarded for collisions with the prey.

MPE Multi Speaker-Listener [7]: This collaborative task was introduced by [7] (where it is also referred to as "Rover-Tower") and includes eight agents. Four agents represent rovers whereas the remaining four agents represent towers. In each episode, rover and tower agents are randomly paired with each other and a goal destination is set for each rover. Rover agents can move in the environment but do not observe their surroundings, while tower agents observe all rover agents' locations as well as their destinations. Tower agents can send one of five discrete communication messages to their paired rover at each timestep to guide it to its destination. Rover agents choose two continuous action values representing their acceleration along both axes of movement. Each pair of rover and tower agents is negatively rewarded by the distance of the rover to its goal.

MPE Treasure Collection [7]: This collaborative task was introduced by [7] and includes six agents representing treasure hunters while two other agents represent treasure banks. Hunting agents collect randomly spawning, colour-coded treasures. Depending on the colour of a treasure, it has to be delivered to the corresponding treasure bank. All agents observe the relative positions and velocities of all other agents as well as the relative positions and colours of treasures. Hunting agents additionally receive their own position and velocity as observations. All agents have a continuous action space, choosing their acceleration along both axes to move. Agents are rewarded for the correct deposit and collection of treasures. However, the task is not fully cooperative as each agent also receives further reward signals. Each hunting agent is additionally punished for collisions with other hunter agents and receives a reward equal to the negative distance to the closest relevant treasure bank or treasure, depending on whether the agent already holds a treasure or not. Treasure banks are further punished with respect to the negative distance to the closest hunting agent carrying a treasure of the corresponding colour and the negative average distance to any hunter agent.
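As a sketch of how tasks are instantiated with the original implementation (assuming the multiagent-particle-envs repository is installed), a scenario is loaded and wrapped into a MultiAgentEnv. Note that, depending on the environment's configuration flags, actions are expected either as integer indices or as one-hot vectors, so this only shows construction and reset.

```python
# Sketch of loading an MPE task with the original implementation
# (github.com/openai/multiagent-particle-envs); names follow its README.
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios

scenario = scenarios.load("simple_spread.py").Scenario()   # MPE Spread
world = scenario.make_world()
env = MultiAgentEnv(world, scenario.reset_world, scenario.reward,
                    scenario.observation)

obs_n = env.reset()          # list with one observation vector per agent
print(len(obs_n), env.action_space)
```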

StarCraft Multi-Agent Challenge

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Fully-cooperative | Continuous | Discrete | Environment and codebase | [19] |

General Description

The StarCraft Multi-Agent Challenge is a set of fully cooperative, partially observable multi-agent tasks. The environment implements a variety of micromanagement tasks based on the popular real-time strategy game StarCraft II and makes use of the StarCraft II Learning Environment (SC2LE) [22]. Each task is a specific combat scenario in which a team of agents, each agent controlling an individual unit, battles against an army controlled by the centralised built-in game AI of StarCraft II. These tasks require agents to learn precise sequences of actions to enable skills like kiting, as well as to coordinate their actions to focus their attacks on specific opposing units. Many tasks are symmetric in their structure, i.e. both armies consist of the same units. Below, you can find visualisations of each considered task in this environment. Rewards are dense and task difficulty varies widely, spanning from (comparably) simple to very difficult tasks. All tasks naturally contain partial observability through a visibility radius of agents.

(a) Illustration of SMAC 3m

(b) Illustration of SMAC 8m

(c) Illustration of SMAC 3s5z

(d) Illustration of SMAC 2s3z

(e) Illustration of SMAC 1c3s5z

Example Tasks

SMAC 3m: In this scenario, each team consists of three space marines. These ranged units have to be controlled to focus fire on a single opponent unit at a time and attack collectively to win the battle.

SMAC 8m: In this scenario, each team controls eight space marines. While the general strategy is identical to the 3m scenario, coordination becomes more challenging due to the increased number of controlled and opposing units.

SMAC 2s3z: In this scenario, each team controls two stalkers and three zealots. While stalkers are ranged units, zealots are melee units, i.e. they are required to move closely to enemy units to attack. Therefore, controlled units still have to learn to focus their fire on single opponent units at a time. Additionally, stalkers are required to learn kiting to consistently move back in between attacks to keep a distance between themselves and enemy zealots to minimise received damage while maintaining high damage output.

SMAC 3s5z: This scenario requires the same strategy as the 2s3z task. Both teams control three stalker and five zealot units. Due to the increased number of agents, the task becomes slightly more challenging.

SMAC 1c3s5z: In this scenario, both teams control one colossus in addition to three stalkers and five zealots. A colossus is a durable unit with ranged, spread attacks that can hit multiple enemy units at once. Therefore, the controlled team now has to coordinate to avoid having many of its units hit by the enemy colossus at once, while positioning its own colossus to hit multiple enemies together.
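The following random-agent loop closely follows the example in the SMAC README and assumes StarCraft II and the smac package are installed.

```python
# Random-agent episode in SMAC (github.com/oxwhirl/smac).
import numpy as np
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="3m")        # three marines per team
n_agents = env.get_env_info()["n_agents"]

env.reset()
terminated = False
episode_return = 0
while not terminated:
    actions = []
    for agent_id in range(n_agents):
        # only available actions may be taken (e.g. dead units can only no-op)
        avail = env.get_avail_agent_actions(agent_id)
        actions.append(np.random.choice(np.nonzero(avail)[0]))
    reward, terminated, info = env.step(actions)
    episode_return += reward              # shared team reward
print("Episode return:", episode_return)
env.close()
```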

MALMO

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Mixed | Continuous (Pixels) | Discrete | Environment | [9] |

Visualisation of MALMO task

General Description

The MALMO platform [9] is an environment based on the game Minecraft. Its 3D world contains a very diverse set of tasks and environments. Agents interact with other agents, entities and the environment in many ways. Tasks can be created with a provided configurator and are by default partially observable, as agents perceive the environment as pixels from their own perspective. For more information on this environment, see the official webpage, the documentation, the official blog and the public Tutorial, or have a look at the following slides. For instructions on how to install MALMO (for Ubuntu 20.04) as well as a brief script to test a MALMO multi-agent task, see the scripts at the bottom of this post. Further tasks can be found in the Multi-Agent Reinforcement Learning in MalmÖ (MARLÖ) competition [17], which was part of a NeurIPS 2018 workshop. Code for this challenge is available in the MARLO github repository with further documentation available. Another challenge in the MALMO environment with more tasks is the Malmo Collaborative AI Challenge, with its code and tasks available here. However, I am not sure about the compatibility and versions required to run each of these environments.

Recently, a novel repository has been created with a simplified launch script, setup process and example IPython notebooks. I recommend having a look to familiarise yourself with the MALMO environment.

Downsides of MALMO

I believe the variety exhibited across the many tasks of this environment makes it very appealing for RL and MARL research, together with the ability to (comparably) easily define new tasks in XML format (see the documentation and the tutorial above for more details). However, the environment suffers from technical issues and compatibility difficulties across the various tasks contained in the challenges above. Also, for each agent, a separate Minecraft instance has to be launched to connect to over a (by default local) network. I found the connection of agents to environments to crash from time to time, often requiring multiple attempts to start any runs. The latter should be simplified by the new launch scripts provided in the new repository.

Pommerman

| Type | Observations | Actions | Code | Papers | Others |
| --- | --- | --- | --- | --- | --- |
| Competitive | Discrete | Discrete | Environment | [18] | Website |

Visualisation of Pommerman

General Description

The Pommerman environment [18] is based on the game Bomberman. It contains competitive \(11 \times 11\) gridworld tasks and team-based competition. Agents can interact with each other and the environment by destroying walls in the map as well as attacking opponent agents. The observations include the board state as \(11 \times 11 = 121\) one-hot encodings representing the state of each location in the gridworld. Additionally, each agent receives information about its location, ammo, teammates, enemies and further information. All this makes the observation space fairly large, making learning without convolutional processing (as used for image inputs) difficult. Agents choose one of six discrete actions at each timestep: stop, move up, move down, move left, move right, or lay a bomb; in variants with communication, agents additionally choose a message. Observation and action spaces remain identical throughout tasks, and partial observability can be turned on or off. A framework for communication among allies is implemented. This environment serves as an interesting setting for competitive MARL, but its tasks are largely identical in experience: while maps are randomised, the tasks share the same objective and structure. For more information, I can highly recommend having a look at the project's website.
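A random-agent episode, following the example in the playground repository (assuming the pommerman package is installed):

```python
# Random episode in Pommerman's free-for-all variant
# (github.com/MultiAgentLearning/playground).
import pommerman
from pommerman import agents

agent_list = [agents.SimpleAgent(), agents.SimpleAgent(),
              agents.SimpleAgent(), agents.SimpleAgent()]
env = pommerman.make("PommeFFACompetition-v0", agent_list)

state = env.reset()
done = False
while not done:
    actions = env.act(state)              # each agent acts on its observation
    state, reward, done, info = env.step(actions)
env.close()
```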

Deepmind Lab

| Name | Type | Observations | Actions | Code | Papers | Others |
| --- | --- | --- | --- | --- | --- | --- |
| Deepmind Lab | Mixed | Continuous (Pixels) | Discrete | Environment | [3] | Blog |
| Deepmind Lab2D | Mixed | Discrete | Discrete | Environment | [4] | / |

Deepmind Lab environments

(a) Observation space

(b) Action space

(c) From [4]: Deepmind Lab2D environment - ”Running with Scissors” example

General Description - Deepmind Lab

DeepMind Lab [3] is a 3D learning environment based on Quake III Arena with a large, diverse set of tasks. Examples of tasks include the DMLab30 set [6] (blog post here) and PsychLab [11] (blog post here), which can be found under game scripts/levels/demos together with multiple smaller problems. However, there is currently no support for multi-agent play (see this Github issue), despite publications using multiple agents in e.g. Capture-The-Flag [8]. Also, the setup turned out to be more cumbersome than expected. Setup code can be found at the bottom of this post.

General Description - Deepmind Lab2D

Fairly recently, Deepmind also released the Deepmind Lab2D [4] platform for two-dimensional grid-world environments. It contains a generator for (also multi-agent) grid-world tasks, with various tasks already defined and further tasks added since [13]. See the bottom of this post for setup scripts.

Derk's Gym

| Type | Observations | Actions | Code | Papers | Others |
| --- | --- | --- | --- | --- | --- |
| Competitive | Discrete | Mixed | Environment documentation | / | Website |

General Description

Derk's Gym is a MOBA-style multi-agent competitive team-based game. "Two teams battle each other, while trying to defend their own “statue”. Each team is composed of three units, and each unit gets a random loadout. The goal is to try to attack the opponents statue and units, while defending your own. With the default reward, you get one point for killing an enemy creature, and four points for killing an enemy statue." Agents observe discrete observation keys (listed here) for all agents and choose among 5 different action types with discrete or continuous action values (see details here). One of this environment's major selling points is its ability to run very fast on GPUs.
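The sketch below is purely illustrative: the constructor defaults and return structure are assumptions from memory and may not match the current gym_derk API, so please consult the official documentation.

```python
# Illustrative sketch only; gym_derk API details below are assumptions.
from gym_derk.envs import DerkEnv

env = DerkEnv()                           # one 3-vs-3 arena by default
observation_n = env.reset()               # one observation per controlled unit
done = False
while not done:
    # one mixed (discrete + continuous) action per controlled unit
    action_n = [env.action_space.sample() for _ in range(env.n_agents)]
    observation_n, reward_n, done_n, info = env.step(action_n)
    done = all(done_n)
env.close()
```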

Further information on getting started with an overview and "starter kit" can be found on this AICrowd's challenge page.

One downside of the Derk's Gym environment is its licensing model. Licenses for personal use only are free, but academic licenses are available at a cost of $5/month (or $50/month with source code access) and commercial licenses come at higher prices.

Flatland

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Collaborative | Mixed | Discrete | Environment | [14] |

Visualisation of Flatland task

General Description

This multi-agent environment is based on a real-world problem of coordinating the railway traffic infrastructure of the Swiss Federal Railways (SBB). The Flatland environment aims to simulate the vehicle rescheduling problem by providing a grid-world environment and allowing for diverse solution approaches. Agents represent trains in the railway system. There are three schemes for observation: global, local and tree. In these, agents observe either (1) global information as a 3D state array with various channels (similar to image inputs), (2) only local information in a similarly structured 3D array, or (3) a graph-based encoding of the railway system and its current state (for more details see the respective documentation). Agents can choose one out of 5 discrete actions: do nothing, move left, move forward, move right, stop moving (more details here). Agents receive two reward signals: a global reward (shared across all agents) and a local agent-specific reward.

I strongly recommend checking out the environment's excellent documentation at its webpage. More information on multi-agent learning can be found here. There have been two AICrowd challenges in this environment: the Flatland Challenge and the Flatland NeurIPS 2020 Competition. Both of these webpages also provide an overview of the environment and further resources to get started. A new competition is also taking place at NeurIPS 2021 through AICrowd.
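As a rough sketch of creating a small Flatland instance with tree observations: the Flatland API has changed across releases, so the generator and parameter names below (taken from the 2020-era flatland-rl package) may need adjusting.

```python
# Sketch of a small Flatland environment with tree observations;
# parameter and generator names may differ between flatland-rl versions.
from flatland.envs.rail_env import RailEnv
from flatland.envs.rail_generators import sparse_rail_generator
from flatland.envs.observations import TreeObsForRailEnv

env = RailEnv(
    width=30,
    height=30,
    rail_generator=sparse_rail_generator(max_num_cities=2),
    number_of_agents=3,
    obs_builder_object=TreeObsForRailEnv(max_depth=2),
)

obs, info = env.reset()
# dictionary of actions: one of the 5 discrete actions per agent handle
actions = {handle: 2 for handle in range(env.get_num_agents())}  # 2 = forward
obs, rewards, dones, info = env.step(actions)
print(dones["__all__"])   # True once all agents are done
```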

Hanabi

| Type | Observations | Actions | Code | Papers |
| --- | --- | --- | --- | --- |
| Fully-cooperative | Discrete | Discrete | Environment | [2] |

From [2]: Example of a four player Hanabi game from the point of view of player 0. Player 1 acts after player 0 and so on.

General Description

The Hanabi challenge [2] is based on the card game Hanabi. This fully cooperative game for two to five players is built around partial observability and cooperation under limited information. Players have to coordinate their played cards, but they are only able to observe the cards of other players; their own cards are hidden from themselves, and communication is a limited resource in the game. The main challenge of this environment is its significant partial observability, focusing on agent coordination under limited information. Another challenge in applying multi-agent learning in this environment is its turn-based structure: in Hanabi, players take turns and do not act simultaneously as in other environments. In each turn, they can select one of three types of discrete actions: giving a hint, playing a card from their hand, or discarding a card.
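A random-play sketch using the Hanabi Learning Environment's rl_env interface (assuming the hanabi_learning_environment package is installed; the variant name follows its documentation):

```python
# Random legal play in the Hanabi Learning Environment
# (github.com/deepmind/hanabi-learning-environment).
import random
from hanabi_learning_environment import rl_env

env = rl_env.make("Hanabi-Full", num_players=2)
observations = env.reset()
done = False
while not done:
    # only the current player acts; it picks a random legal move
    current = observations["player_observations"][observations["current_player"]]
    action = random.choice(current["legal_moves"])
    observations, reward, done, info = env.step(action)
print("Final reward:", reward)
```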

Neural MMO

| Type | Observations | Actions | Code | Papers | Others |
| --- | --- | --- | --- | --- | --- |
| Competitive | Discrete | Multi-Discrete | Environment | [21] | Blog |

From [21]: ”Neural MMO is a massively multiagent environment for AI research. Agents compete for resources through foraging and combat. Observation and action representation in local game state enable efficient training and inference. A 3D Unity client provides high quality visualizations for interpreting learned behaviors. The environment, client, training code, and policies are fully open source, officially documented, and actively supported through a live community Discord server.”

General Description

Neural MMO [21] is based on the gaming genre of MMORPGs (massively multiplayer online role-playing games). Its large 3D environment contains diverse resources, and agents progress through a comparably complex progression system. Agents compete with each other and are restricted to partial observability, observing a square crop of tiles centred on their current position (including terrain types) and the health, food, water, etc. of occupying agents. Interaction with other agents is given through attacks, and agents can interact with the environment through its given resources (like water and food). Agents choose one movement and one attack action at each timestep. The main downsides of the environment are its large scale (expensive to run), its complicated infrastructure and setup, as well as its comparably monotonic objective despite the significant diversity of its environments.

Multi-Agent Frameworks/Libraries

In addition to the individual multi-agent environments listed above, there are some very useful software frameworks/libraries which support a variety of multi-agent environments and game modes.

OpenSpiel

OpenSpiel is an open-source framework for (multi-agent) reinforcement learning that supports a multitude of game types. "OpenSpiel supports n-player (single- and multi- agent) zero-sum, cooperative and general-sum, one-shot and sequential, strictly turn-taking and simultaneous-move, perfect and imperfect information games, as well as traditional multiagent environments such as (partially- and fully- observable) grid worlds and social dilemmas." It has support for Python and C++ integration. However, due to the diverse supported game types, OpenSpiel does not follow the otherwise standard OpenAI gym-style interface.

For more information and documentation, see their Github (github.com/deepmind/open_spiel) and the corresponding paper [10] for details including setup instructions, introduction to the code, evaluation tools and more.
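As a minimal example of the pyspiel API, the following plays a random game of Kuhn poker:

```python
# Random play-through of Kuhn poker with OpenSpiel's Python bindings.
import random
import pyspiel

game = pyspiel.load_game("kuhn_poker")
state = game.new_initial_state()
while not state.is_terminal():
    if state.is_chance_node():
        # sample the chance outcome (card deal) according to its probabilities
        outcomes, probs = zip(*state.chance_outcomes())
        state.apply_action(random.choices(outcomes, probs)[0])
    else:
        state.apply_action(random.choice(state.legal_actions()))
print("Returns per player:", state.returns())
```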

PettingZoo

PettingZoo is a Python library for conducting research in multi-agent reinforcement learning. It contains multiple MARL problems, follows a multi-agent version of the OpenAI Gym interface, and includes multiple families of environments, among them the MPE tasks described above.

Website with documentation: pettingzoo.ml

Github link: github.com/PettingZoo-Team/PettingZoo
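A sketch of PettingZoo's agent-iteration interface on an MPE task; the version suffix and some API details change between PettingZoo releases, so treat this as illustrative:

```python
# Random agents on PettingZoo's MPE Spread task via the agent-iteration API.
# Version suffix ("_v2") and return signature may differ between releases.
from pettingzoo.mpe import simple_spread_v2

env = simple_spread_v2.env()
env.reset()
for agent in env.agent_iter():
    observation, reward, done, info = env.last()   # data for the acting agent
    action = None if done else env.action_spaces[agent].sample()
    env.step(action)
env.close()
```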

Megastep

Megastep is an abstract framework for creating multi-agent environments which can be fully simulated on GPUs for fast simulation speeds. It already comes with some pre-defined environments, and information can be found on the website with detailed documentation: andyljones.com/megastep

Scripts

For the following scripts to set up and test environments, I use a system running Ubuntu 20.04.1 LTS on a laptop with an Intel i7-10750H CPU and a GTX 1650 Ti GPU. To organise dependencies, I use Anaconda.

MALMO

Deepmind Lab

Deepmind Lab2D

References

  1. Stefano V Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, 2013.
  2. Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, et al. The Hanabi Challenge: A New Frontier for AI Research. Artificial Intelligence, 2020.
  3. Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Vı́ctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. ArXiv preprint arXiv:1612.03801, 2016
  4. Charles Beattie, Thomas Köppe, Edgar A Duéñez-Guzmán, and Joel Z Leibo. Deepmind Lab2d. ArXiv preprint arXiv:2011.07027, 2020.
  5. Filippos Christianos, Lukas Schäfer, and Stefano Albrecht. Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning. Advances in Neural Information Processing Systems, 2020.
  6. Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, 2018.
  7. Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, 2019.
  8. Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. ArXiv preprint arXiv:1807.01281, 2018
  9. Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In Proceedings of the International Joint Conferences on Artificial Intelligence Organization, 2016.
  10. Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan et al. OpenSpiel: A framework for reinforcement learning in games. ArXiv preprint arXiv:1908.09453, 2019.
  11. Joel Z Leibo, Cyprien de Masson d’Autume, Daniel Zoran, David Amos, Charles Beattie, Keith Anderson, Antonio Garcı́a Castañeda, Manuel Sanchez, Simon Green, Audrunas Gruslys, et al. Psychlab: a psychology laboratory for deep reinforcement learning agents. ArXiv preprint arXiv:1801.08116, 2018.
  12. Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 2017.
  13. Kevin R. McKee, Joel Z. Leibo, Charlie Beattie, and Richard Everett. Quantifying environment and population diversity in multi-agent reinforcement learning. ArXiv preprint arXiv:2102.08370, 2021.
  14. Sharada Mohanty, Erik Nygren, Florian Laurent, Manuel Schneider, Christian Scheller, Nilabha Bhattacharya, Jeremy Watson et al. Flatland-RL: Multi-Agent Reinforcement Learning on Trains. ArXiv preprint arXiv:2012.05893, 2020.
  15. Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. ArXiv preprint arXiv:1703.04908, 2017.
  16. Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V Albrecht. Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  17. Diego Perez-Liebana, Katja Hofmann, Sharada Prasanna Mohanty, Noburu Kuno, Andre Kramer, Sam Devlin, Raluca D Gaina, and Daniel Ionita. The multi-agent reinforcement learning in malmö (marlö) competition. ArXiv preprint arXiv:1901.08129, 2019.
  18. Cinjon Resnick, Wes Eldridge, David Ha, Denny Britz, Jakob Foerster, Julian Togelius, Kyunghyun Cho, and Joan Bruna. PommerMan: A multi-agent playground. ArXiv preprint arXiv:1809.07124, 2018.
  19. Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and Multi-Agent Systems, 2019.
  20. Lukas Schäfer. Curiosity in multi-agent reinforcement learning. Master’s thesis, University of Edinburgh, 2019.
  21. Joseph Suarez, Yilun Du, Igor Mordatch, and Phillip Isola. Neural MMO v1.3: A Massively Multiagent Game Environment for Training and Evaluating Neural Networks. ArXiv preprint arXiv:2001.12004, 2020.
  22. Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani et al. "StarCraft II: A New Challenge for Reinforcement Learning." ArXiv preprint arXiv:1708.04782, 2017.
  23. Peter R. Wurman, Raffaello D’Andrea, and Mick Mountz. Coordinating Hundreds of Cooperative, Autonomous Vehicles in Warehouses. In AI Magazine, 2008.