# Revisiting the Gumbel-Softmax in MADDPG

Author: Callum Tilbury

Date: 2022-11-01

"Scottish video-game character wearing a kilt, moving in a Mario-style square discretised grid-world, running to get a coin" (DALL·E 2)

When using MADDPG [1] in environments with discrete actions, such as grid worlds, discrete gradient estimation needs to be performed. Originally, this estimation was done with the Gumbel-Softmax [2,3], but many other options exist. We show here that an alternative estimator choice can yield both higher returns and better sample efficiency for a range of tasks, with minimal development overhead. For more details on these ideas, see my master's dissertation on the topic.

## Background

An important type of problem in Reinforcement Learning is where not only a single agent acts, but multiple agents. These agents may interact adversarially, co-operatively, or in some combination thereof. Broadly, this paradigm is termed Multi-Agent Reinforcement Learning (MARL). Algorithms developed for single-agent contexts can be applied to multiple agents, where each agent simply acts independently, e.g. Independent Q-learning [4]. Though suitable for some tasks, this approach struggles to achieve good returns in certain complex environments, such as those with partial observability [5]. Moreover, single-agent methods commonly assume environmental stationarity, which can cause learning problems in the multi-agent context [6]. As an alternative, researchers have developed MARL-specific algorithms, either by extending extant single-agent approaches to multi-agent scenarios, or by developing new algorithms altogether.

One of the earliest approaches proposed for deep MARL (that is, MARL with the integration of deep learning) was the MADDPG algorithm, by Lowe et al. [1]. In this work, the authors extended the single-agent DDPG [7] method, which is itself an extension of the DPG [8] method, to multi-agent scenarios. Being aware of this heritage is important because the theory underpinning DPG works only with continuous-action problems. In such cases, each action comes from an uncountable, continuous domain (e.g. the torque applied to a pendulum); the alternative is a discrete-action space, which has a countable set of possibilities (e.g. applying a force to move left or right). A canonical example for each case is shown below, from OpenAI's Gym library.

Pendulum: continuous action-space

Cart Pole: discrete action-space

DPG only works in continuous domains because the gradient of the state-action value function, taken with respect to the action, must exist—which is not the case for a discrete-action context. MADDPG, as a descendant of DPG, inherits the same requirement.
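Concretely, the deterministic policy gradient theorem of Silver et al. [8] expresses the policy gradient as

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s)\,\nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\right]$$

where $$\mu_\theta$$ is the deterministic policy and $$\rho^\mu$$ its state-visitation distribution. The factor $$\nabla_a Q^\mu(s,a)$$ is precisely the gradient that does not exist when $$a$$ ranges over a discrete set.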

## The Gumbel-Softmax

It seems that the authors of MADDPG desired a "universal" algorithm: one which could be applied to all sorts of action-spaces, while still building on the foundations of DPG. For continuous-domain problems, the underlying DPG algorithm works as is. But for discrete action-spaces, the authors applied a mathematical trick: the Gumbel-Softmax reparameterisation, introduced independently by Jang et al. [2] and Maddison et al. [3]. Essentially, this trick "relaxes" the discrete, non-differentiable action-space into a somewhat-equivalent, continuous space—thus allowing an approximation of the gradient to exist. The degree of relaxation is termed the temperature.

Graphical depiction of how a discrete distribution is relaxed with the Gumbel-Softmax, with increasing temperature. Notice how, as the temperature increases, the distribution approaches a uniform distribution and loses all parametric information. Diagram adapted from Jang et al. [2].

Specifically, MADDPG employs the Straight-Through Gumbel-Softmax (STGS) on the action distribution, where only the backward (i.e. gradient) computation is relaxed, and the forward computation remains unmodified. Relaxing the discrete space, however, introduces statistical bias—in this case, into the computed gradient values. The higher the temperature, the higher the corresponding bias.
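To make the trick concrete, here is a minimal NumPy sketch of a single STGS sample. The function name and the inverse-transform Gumbel sampling are illustrative only; in practice, frameworks provide this directly (e.g. `torch.nn.functional.gumbel_softmax` with `hard=True`).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def straight_through_gumbel_softmax(logits, tau=1.0):
    """One STGS sample: a hard one-hot action plus its relaxed surrogate.

    Forward pass: the hard one-hot sample.
    Backward pass (in an autograd framework): gradients flow through
    the relaxed `y_soft` instead -- the straight-through trick.
    """
    # Gumbel(0, 1) noise via inverse transform sampling
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y_soft = softmax((logits + g) / tau)   # relaxed, differentiable sample
    y_hard = np.zeros_like(y_soft)
    y_hard[np.argmax(y_soft)] = 1.0        # discrete one-hot sample
    # In PyTorch the straight-through combination would be:
    #   y_hard + (y_soft - y_soft.detach())
    return y_hard, y_soft

logits = np.array([2.0, 0.5, -1.0])
hard, soft = straight_through_gumbel_softmax(logits, tau=1.0)
```

Note how the temperature `tau` divides the perturbed logits: as it grows, `y_soft` flattens towards a uniform distribution, which is the source of the bias discussed above.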

Recently, a benchmarking paper by Papoudakis et al. [5] found that MADDPG achieved decent performance in certain MARL environments, but performed markedly worse in grid-world situations, where the action space is discrete. The authors suggested that this degradation of performance may be due to the aforementioned bias introduced by the STGS.

Fortunately, the STGS is just one tool in a diverse toolbox, part of a field broadly termed discrete gradient estimation. This field is relevant in a host of contexts outside of MARL (e.g. discrete variational auto-encoders [9,10]). Accordingly, the biased relaxation of the STGS is problematic in many domains, which has encouraged a significant research effort to improve the method. The result? A wide variety of alternatives! Naturally, this made us wonder:

Can we benefit from using these alternative techniques in place of the Gumbel-Softmax, when learning in grid-world tasks with MADDPG?

## Methodology

### Candidate Alternatives

To explore this question, we surveyed the gradient-estimation literature and identified a wide range of alternatives to try. From these, we selected four techniques, primarily for their simplicity. Two are simple modifications of the STGS, and two are recent contributions from the literature:

• Straight-Through Gumbel-Softmax (STGS-T):
The original estimator used in MADDPG was the STGS, with a temperature of 1 (denote this baseline estimator as STGS-1). As a simple alternative, we let the temperature be a tunable hyperparameter, termed the STGS-T.
• Temperature-Annealed Gumbel-Softmax (TAGS):
We extended the above idea by introducing an annealing scheme to the temperature: letting it decay exponentially over the course of training.
• Gumbel-Rao Monte Carlo (GRMC$$K$$):
Introduced by Paulus et al. [11], this method performs "Rao-Blackwellisation" of the original STGS estimator, using $$K$$ samples.
• Gapped Straight-Through (GST):
Introduced by Fan et al. [12], this method uses "deterministic perturbations" instead of Gumbel noise in the estimator.
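As an illustration of the TAGS idea, a minimal exponential annealing schedule might look as follows. The schedule shape matches the description above, but the function name and hyperparameter values are purely illustrative.

```python
import math

def annealed_temperature(step, tau_start=1.0, tau_min=0.1, decay_rate=1e-4):
    """Exponentially decay the Gumbel-Softmax temperature over training,
    with a floor so the relaxation never becomes degenerate."""
    return max(tau_min, tau_start * math.exp(-decay_rate * step))

# Temperature at the start, partway through, and late in training
taus = [annealed_temperature(t) for t in (0, 10_000, 100_000)]
```

The STGS-T variant corresponds to holding this value fixed at some tuned constant rather than decaying it.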

### Environments

Next, we chose a set of grid-world environments in which we could test these new estimators. We used a sensible subset of the choices made by Papoudakis et al. [5] in their benchmarking paper: eleven tasks across three distinct environments. The environments are depicted and linked below.

### Evaluation Metrics

Finally, we defined evaluation metrics. Specifically, we focused on two metrics for performance:

• Firstly, the returns (i.e. cumulative rewards) achieved by MADDPG with each estimator—both the maximum returns (indicating raw performance), and the average returns (indicating sample efficiency).
• Secondly, the time required for computation in a "toy" gradient estimation problem, with various input dimensionalities. Here, we considered three classes of estimator: STGS (which accounts for STGS-1, STGS-T, and TAGS); GRMC (with three different $$K$$ values); and GST.
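The flavour of this timing experiment can be sketched as below. Note that `grmc_cost_proxy` is not the real Rao-Blackwellised estimator; it is a hypothetical stand-in that merely mimics the $$K$$-sample cost, which is all this toy benchmark measures.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def stgs(logits, tau=1.0):
    # one relaxed Gumbel-Softmax sample
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return softmax((logits + g) / tau)

def grmc_cost_proxy(logits, tau=1.0, k=10):
    # stand-in for GRMC's cost profile: K relaxed samples per estimate
    return np.mean([stgs(logits, tau) for _ in range(k)], axis=0)

def mean_time(fn, dim, reps=200):
    """Average wall-clock time per call for a given input dimensionality."""
    logits = rng.normal(size=dim)
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(logits)
    return (time.perf_counter() - t0) / reps

t_stgs = mean_time(stgs, 1000)
t_grmc = mean_time(lambda x: grmc_cost_proxy(x, k=10), 1000)
```

Sweeping `dim` over several orders of magnitude then reveals how each estimator class scales with input dimensionality.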

## Results

For a thorough treatment of the results achieved, we encourage readers to see the full dissertation report. Here, we highlight several interesting observations.

### Returns

Out of the eleven tasks explored, there were a handful of simpler ones—tasks in which MADDPG performed comparatively well in the benchmarking paper by Papoudakis et al. [5]. Perhaps unsurprisingly, using the alternative gradient estimators made no significant changes in these tasks. We believe that the bias introduced by the STGS was not particularly problematic for these simple situations; hence, changing the estimator had little effect.

Things became far more interesting in the harder tasks—ones in which MADDPG originally struggled. In four of these situations, we observed statistically significant results. The training curves for these tasks are presented below:

LBF: 15x15-3p-5f

LBF: 15x15-4p-3f

LBF: 15x15-4p-5f

RWARE: tiny-4ag

Firstly, we acknowledge that TAGS was a poor estimator choice, underperforming the baseline in each of these tasks. As for the other estimators, though, consider the significant improvements over the baseline! In particular, we highlight the superior results of the GST estimator. In each of the four graphs, this estimator yields higher maximum returns and faster convergence (i.e. better sample efficiency).

### Compute Time

After observing superior performance from the alternative estimators, the question remained: how do the computational requirements compare?

As a high-level summary:

• The STGS estimators scaled well with dimensionality: the computational overhead of increasing the dimensionality was not significant.
• The GRMC estimator was at least $$3$$ times slower than the baseline STGS approach. Moreover, using a larger $$K$$ value increased the computational burden of the relaxation—not substantially for low-dimensional inputs, but markedly for higher-dimensional problems.
• The computational burden of GST sat somewhat in-between the other two approaches. Importantly, though, this method also scaled well with dimensionality, staying at just over $$2.5$$ times slower than the baseline, irrespective of the input size.

### Remarks

To answer our original question: can we benefit from using alternative estimators, instead of the Gumbel-Softmax? Evidently, yes! We believe the results observed support this conclusion, and hopefully the graphs presented are sufficiently convincing. Notice the benefit of such an outcome: we can take the extant MADDPG algorithm, replace only the gradient estimation technique (e.g. swap out the STGS for the GST, and leave everything else the same), and the resulting performance is likely to improve. Though the algorithm becomes slightly more expensive computationally, we witness faster convergence and higher returns, with minimal development overhead. This is an easy win, we'd say!

Based on our findings, we can draft some general guidelines for choosing an alternative estimator: if minimising the computational burden is paramount for a given problem, it may be worth using the STGS-T, for it has the same overhead as the STGS-1, yet yields improvements in both the achieved returns and sample efficiency. However, if one can afford a more expensive relaxation procedure, the GST is a good fit. It is somewhat slower, but the benefits are significant.

## But... why?

We now face a pressing question: it is wonderful that we observed improvements, but why exactly? What in these new estimators enables higher returns? The working hypothesis has been that the bias introduced by the STGS was the problem; perhaps, by lowering this bias, the new estimators yield better results. However, this question needs further investigation. We need concrete evidence.

Such investigations remain future work, but to catalyse those efforts, we briefly explored one possible avenue: the variance of the computed gradients across mini-batches during training. We hypothesise that uninformative gradients (i.e. those due to a poor discrete-gradient estimator) have low variance across a mini-batch, since no elements in particular "stand out". In contrast, we believe that informative gradients will have higher variance across the mini-batch, for the opposite reason: particular components in the gradient vectors matter more than others, yielding large differences between them.
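As a hypothetical sketch of this diagnostic, one can flatten each sample's gradient into a vector and compute the elementwise variance across the batch. The function name, data layout, and synthetic gradients below are illustrative, not our actual instrumentation.

```python
import numpy as np

def minibatch_grad_variance(per_sample_grads):
    """Mean elementwise variance of gradients across a mini-batch.

    per_sample_grads: shape (batch_size, n_params) -- one flattened
    gradient vector per sample (a hypothetical layout).
    """
    return per_sample_grads.var(axis=0).mean()

rng = np.random.default_rng(0)
# Uninformative gradients: every sample pushes the same way -> zero variance
uniform_grads = np.ones((32, 8))
# "Informative" gradients: some components dominate, differing per sample
informative_grads = rng.normal(size=(32, 8)) * np.array([5.0] + [1.0] * 7)
```

Tracking this statistic per layer over training is what produces curves like those shown below.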

Reconsidering one of the tasks from before (LBF: 15x15-4p-5f), we retrained with two estimators: the baseline STGS-1, and the best performing alternative, the GST. The variance of the gradients is plotted below:

Gradient variance for each layer in the policy networks, with the two estimators: STGS-1 and GST

Notice the clear trend in these graphs: the variance of the gradients increased more rapidly for the GST algorithm than for the baseline. Though not definitive (at present, merely a hunch), we believe these results indicate the presence of more informative gradients in the policy networks when using the GST. Informative gradients, in turn, allow the algorithm to achieve higher returns and converge faster.

## Future Work

Of course, much more work needs to be done. This project opens up an array of new questions, and we list some ideas below:

• Firstly, it would be useful to evaluate the performance of the proposed algorithms on a wider variety of tasks, especially in challenging environments like RWARE.
• Secondly, much more investigation ought to be done into why the alternative techniques achieve better performance. Though an interesting foray, our analysis of the gradient variance was just a first step. We need to dig deeper!
• Thirdly, we used a "vanilla" MADDPG algorithm in this project, which didn't incorporate other extensions suggested in the literature. Combining the benefits observed here with other strong extensions elsewhere would be an interesting exercise. A wider hyperparameter search could be helpful too.
• Finally, we note that only two alternative methods from the literature were explored here: the GRMC and the GST. Though sufficient for our analysis, many other options exist in the literature. Some of these, though far more complex, boast many attractive properties, and may prove to be even more fruitful.

Ultimately, we hope that the work here will be a doorway for future research. And based on the results observed, it seems to be an inviting one!

## References

1. Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments." Advances in Neural Information Processing Systems, volume 30, 2017.
2. Eric Jang, Shixiang Gu, and Ben Poole. "Categorical Reparameterization with Gumbel-Softmax". International Conference on Learning Representations, 2017.
3. Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables." International Conference on Learning Representations, 2017.
4. Ming Tan. "Multi-agent reinforcement learning: Independent vs. cooperative agents." International Conference on Machine Learning, 1993.
5. Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. "Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks". Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
6. Georgios Papoudakis, Filippos Christianos, Arrasy Rahman, and Stefano V. Albrecht. "Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning". arXiv:1906.04737 [cs, stat], 2019.
7. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous control with deep reinforcement learning." International Conference on Learning Representations, 2016.
8. David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. "Deterministic Policy Gradient Algorithms." Proceedings of the 31st International Conference on Machine Learning, pages 387-395, 2014.
9. Jason Tyler Rolfe. "Discrete Variational Autoencoders". International Conference on Learning Representations, 2017.
10. Diederik P. Kingma and Max Welling. "An introduction to variational autoencoders". Foundations and Trends in Machine Learning, 2019.
11. Max B. Paulus, Chris J. Maddison, and Andreas Krause. "Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator." International Conference on Learning Representations, 2021.
12. Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky, and Peter J Ramadge. "Training Discrete Deep Generative Models via Gapped Straight-Through Estimator". Proceedings of the 39th International Conference on Machine Learning, PMLR 162:6059-6073, 2022.