# Common Inaccuracies in Multi-Agent RL Research

Multi-agent reinforcement learning (MARL), one of the core focus areas of our research group, has received a lot of attention in recent years. With more people drawn to MARL research, I have noticed some common inaccuracies starting to appear in papers, even in highly-cited ones. This blog post looks at some common inaccuracies in MARL research pertaining to models, problem statements, and complexity, and provides recommendations to address each.

__Disclaimer:__ My intention here is not to point out specific papers, nor to lessen the value of their contributions. Our own group has been guilty of some of these charges. My goal is to increase awareness of these inaccuracies, which I hope will help to reduce their propagation.

> As the field grows, it's good to nip bad habits in the bud. Stefano @UoE_Agents points out some ways people in multi-agent reinforcement learning can be clearer and more accurate when citing papers and models. PSA my advisor taught me: Do NOT cite papers you haven't read. :-) https://t.co/2Slo0e6juV
>
> — Michael Littman (@mlittmancs) May 4, 2021

## Markov games and Littman (1994)

Many recent MARL papers define the learning problem based on the Markov game, and cite Littman (1994) for Markov games [1]. While the Littman paper is an important early work in MARL, it is not the first to study Markov games. Indeed, Littman cites Van Der Wal (1981) [2], and a quick search with Google Scholar shows that the term "Markov game" appeared in papers as early as the 1960s. As far as I can see, these early papers all link back to stochastic games, which are the same model as Markov games and were first introduced by Shapley in 1953 [3].

Another inaccuracy I sometimes see is citing Littman (1994) for partially-observable Markov games, while Littman in fact assumed full observability of states and actions.

While it is regrettable that two different names for the same model are now widely used, I believe researchers in MARL should be aware of the connection to avoid fragmentation of communities (e.g. MARL and game theory; the latter mostly uses the term "stochastic game"). Some may prefer the name "Markov game" because it highlights the Markov assumption and looks similar to "Markov decision process", which is the standard model in single-agent RL. However, strictly speaking, "stochastic game" is the original name of the model.

**Recommendation:** If you prefer "Markov game", instead of writing something like "We use Markov games [1]" write "We use Markov games (e.g. [1])" to indicate that your citation of [1] is not meant as the origin of the model. You may also want to add "(also known as stochastic game [3])", so that new people in MARL who read your paper can see the connection. For partial observability, consider using partially-observable stochastic games (POSG) (e.g. [4]) or other POMDP-style multi-agent models (e.g. Dec-POMDP); at least avoid writing "We use partially-observable Markov games [1]" and instead write e.g. "We use Markov games (e.g. [1]) with partial observability, defined as follows..."

## Agents maximise their own returns in a vacuum

MARL papers sometimes lack a proper problem definition. This is a more subtle issue, but it is an issue nonetheless. After defining the interaction model, which typically is a Markov game (or stochastic game), the learning problem is often stated as agents seeking to maximise their own expected returns (e.g. discounted sum of own rewards) \[ R_i = \sum_{t=0}^{\infty} \gamma^t r_{i,t} \] where \(r_{i,t}\) is agent \(i\)'s reward at time \(t\). However, written this way, this is an ill-posed problem statement because it does not tell us what a solution is nor when one has been found. By contrast, the learning problem in the single-agent Markov decision process (MDP) is clearly defined: find a policy that maximises the expected return in each state. When given a policy, we can in theory decide whether the policy satisfies the learning objective of the MDP. This is not so in a Markov game, because the returns of one agent's policy may (and usually do) depend on the policies of the other agents. Without knowledge of these other policies, we cannot decide whether a given policy for one agent achieves its learning goal. Thus, the expectation over \( R_i \) needs to include the policies of the other agents, but this detail is often missing or left implicit in MARL papers.
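To make this concrete, here is a minimal Python sketch using one-shot Rock-Paper-Scissors (an illustrative single-state game, not from the post) as a stand-in: the expected return of a fixed policy for one agent is simply not determined until the other agent's policy is specified.

```python
# Illustrative payoff matrix for agent i in one-shot Rock-Paper-Scissors:
# rows = agent i's action, columns = the other agent's action.
R_i = [[ 0, -1,  1],
       [ 1,  0, -1],
       [-1,  1,  0]]

def expected_return(pi_i, pi_j, payoff):
    """Expected one-step return of agent i under mixed policies pi_i, pi_j."""
    return sum(pi_i[a] * pi_j[b] * payoff[a][b]
               for a in range(3) for b in range(3))

pi_i = [1.0, 0.0, 0.0]  # agent i always plays Rock

# The same policy yields different expected returns depending on the
# (often left implicit) policy of the other agent:
print(expected_return(pi_i, [1/3, 1/3, 1/3], R_i))  # 0.0 against uniform play
print(expected_return(pi_i, [0.0, 1.0, 0.0], R_i))  # -1.0 against always-Paper
```

Asking whether `pi_i` "maximises agent i's return" is unanswerable here without fixing `pi_j`, which is exactly the missing detail in the ill-posed problem statement above.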

**Recommendation:** In line with a large body of work in game theory, a cleaner way to define the learning objective in a game is by defining the objective over all agents' policies. For example, the learning objective of the \(N\) agents may be to find policies \( \pi = (\pi_1,...,\pi_N) \) such that
\[ \forall_i : \pi_i \in \arg\max_{\pi'_i} \mathbb{E}[R_i | \pi'_i,\pi_{-i}] \]
where \( \pi_{-i} = \pi \setminus \{\pi_i\} \) denotes the policies of the agents other than \(i\). This solution type, in which \(\pi\) is a Nash equilibrium, is effectively the one used in most MARL papers.
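As a hypothetical illustration of this objective, the sketch below checks the best-response condition over pure policies in a small two-agent, one-shot coordination game (the game and its rewards are made up for this example); the joint policies that satisfy it are exactly the pure Nash equilibria.

```python
from itertools import product

# Illustrative 2-agent, 2-action coordination game:
# rewards[(a1, a2)] = (reward of agent 1, reward of agent 2).
rewards = {
    (0, 0): (1, 1),   # both choose action 0: coordinated
    (0, 1): (0, 0),   # miscoordination
    (1, 0): (0, 0),   # miscoordination
    (1, 1): (2, 2),   # both choose action 1: best coordination
}

def is_equilibrium(joint):
    """Check: for all i, pi_i maximises agent i's return given pi_{-i}."""
    for i in (0, 1):
        current = rewards[joint][i]
        for deviation in (0, 1):          # all unilateral deviations of agent i
            alt = list(joint)
            alt[i] = deviation
            if rewards[tuple(alt)][i] > current:
                return False              # agent i can profitably deviate
    return True

equilibria = [j for j in product((0, 1), repeat=2) if is_equilibrium(j)]
print(equilibria)  # both coordinated outcomes: [(0, 0), (1, 1)]
```

Note that the condition is a property of the whole joint policy, which is the point of the recommendation: no single agent's policy can be declared a solution on its own.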

## Exponential joint actions in number of agents

A common statement I often see in MARL papers is that the number of joint actions between agents increases exponentially with the number of agents. This may indeed be the case if each agent comes with its own "physical" representation and associated action variables (e.g. a robot in a 2D spatial environment), such that adding more agents adds more action variables. However, it is worth noting that multi-agent systems (MAS) are a more general approach to modelling decision problems. One important use of MAS modelling (following the "divide-and-conquer" principle often used in computer science) is to divide an intractable decision problem into smaller, hopefully more tractable, ones by reducing the action space for each agent. In this case, the number of agents in the system may not affect the number of joint actions. Here is an example:

Let's say we want to control a power plant which has 1000 control variables, and each variable can take one of \(k\) possible values. So, an action is a vector of length 1000, and there are \(k^{1000}\) possible actions (value assignments). A fully-centralised control problem of this size is intractable with current single-agent RL methods. To make this problem more tractable, we could take a decentralised approach: factor the action vector into \(N\) smaller vectors and have one agent control each of these smaller action vectors. Each of the \(N\) agents now deals with a smaller action space (e.g. \(k^{1000/N}\) actions if the factored action vectors have the same length). However, the total number of joint actions between agents, \(k^{1000}\), is independent of the number of agents.
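The arithmetic in this example can be sketched in a few lines of Python (the value of \(k\) is assumed for illustration):

```python
k = 10          # assumed number of values per control variable
n_vars = 1000   # control variables in the power plant

total_joint_actions = k ** n_vars  # fully-centralised action space: k^1000

for N in (1, 10, 100):                 # candidate numbers of agents
    per_agent = k ** (n_vars // N)     # actions per agent (equal-length split)
    joint = per_agent ** N             # joint actions across all N agents
    # The joint action space stays k^1000, independent of N:
    assert joint == total_joint_actions
    print(f"N={N:>3}: per-agent actions = 10^{n_vars // N}")
```

Adding agents shrinks each agent's action space (from \(10^{1000}\) down to \(10^{10}\) per agent with \(N=100\)) while leaving the joint action space unchanged.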

The added difficulty now is that agents need to find ways to coordinate their decisions, which is one of the exciting aspects of MARL research.

**Recommendation:** Researchers should be careful when postulating an exponential growth relation between joint actions and agents, and should make such claims specific to the types of MARL applications considered. Not all MARL applications (such as the example above) exhibit this kind of growth.

## Acknowledgements

I thank Marc Lanctot, Lukas Schäfer, Filippos Christianos, Georgios Papoudakis, and Arrasy Rahman for helpful comments.

## References

- [1] Littman, Michael L. (1994). "Markov games as a framework for multi-agent reinforcement learning". International Conference on Machine Learning.
- [2] Van Der Wal, J. (1981). "Stochastic dynamic programming: successive approximations and nearly optimal strategies for Markov decision processes and Markov games". Mathematical Centre Tracts 139.
- [3] Shapley, Lloyd S. (1953). "Stochastic games". PNAS, 39 (10): 1095–1100.
- [4] Hansen, Eric A., Daniel S. Bernstein, Shlomo Zilberstein (2004). "Dynamic programming for partially observable stochastic games". AAAI Conference on Artificial Intelligence.