Revisiting the Gumbel-Softmax in MADDPG

Author: Callum Tilbury

Date: 2022-11-01

Follow @UoE_Agents on Twitter

Cartoon depiction of a video-game character wearing a kilt, running in a Mario-style square discretised grid-world.

"Scottish video-game character wearing a kilt, moving in a Mario-style square discretised grid-world, running to get a coin" (DALL·E 2)

When using MADDPG [1] in environments with discrete actions, such as grid worlds, discrete gradient estimation needs to be performed. Originally, this estimation was done with the Gumbel-Softmax [2,3], but many other options exist. We show here that an alternative estimator choice can yield both higher returns and better sample efficiency for a range of tasks, with minimal development overhead. For more details on these ideas, see my master's dissertation on the topic.


An important type of problem in Reinforcement Learning is where not only a single agent acts, but multiple agents. These agents may act together, either adversarially, co-operatively, or some combination thereof. Broadly, this paradigm is termed Multi-Agent Reinforcement Learning (MARL). Algorithms developed for single-agent contexts can be applied for multiple agents, where each agent simply acts independently, e.g. Independent Q-learning [4]. Though suitable for some tasks, this approach struggles to achieve good returns in certain complex environments, such as those with partial observability [5]. Moreover, single-agent methods commonly assume environmental stationarity, which can cause learning problems in the multi-agent context [6]. As an alternative, researchers have developed MARL-specific algorithms, either by extending extant single-agent approaches to multi-agent scenarios, or by developing new algorithms altogether.

One of the earliest approaches proposed for deep MARL (that is, MARL with the integration of deep learning) was the MADDPG algorithm, by Lowe et al. [1]. In this work, the authors extended the single-agent DDPG [7] method, which is itself an extension of the DPG [8] method, to multi-agent scenarios. Being aware of this heritage is important because the theory underpinning DPG works only with continuous-action problems. In such cases, each action comes from an uncountable, continuous domain (e.g. the torque applied to a pendulum); the alternative is a discrete-action space, which has a countable set of possibilities (e.g. applying a force to move left or right). A canonical example for each case is shown below, from OpenAI's Gym library.

GIF showing the OpenAI Gym's pendulum environment

Pendulum: continuous action-space

GIF showing the OpenAI Gym's cart-pole environment

Cart Pole: discrete action-space

DPG only works in continuous domains because the gradient of the state-action value function, taken with respect to the action, must exist—which is not the case for a discrete-action context. MADDPG, as a descendant of DPG, inherits the same requirement.

The Gumbel-Softmax

It seems that the authors of MADDPG desired a "universal" algorithm: one which could be applied to all sorts of action-spaces, while still building on the foundations of DPG. For continuous-domain problems, the underlying DPG algorithm works as is. But for discrete action-spaces, the authors applied a mathematical trick: the Gumbel-Softmax reparameterisation, introduced independently by Jang et al. [2] and Maddison et al. [3]. Essentially, this trick "relaxes" the discrete, non-differentiable action-space into a somewhat-equivalent, continuous space—thus allowing an approximation of the gradient to exist. The degree of relaxation is termed the temperature.

Depiction of how the Gumbel-Softmax relaxes a discrete distribution.

Graphical depiction of how a discrete distribution is relaxed with the Gumbel-Softmax, with increasing temperature. Notice how, as the temperature increases, the distribution approaches a uniform distribution and loses all parametric information. Diagram adapted from Jang et al. [2].

Specifically, MADDPG employs the Straight-Through Gumbel-Softmax (STGS) on the action distribution, where only the backward (i.e. gradient) computation is relaxed, and the forward computation remains unmodified. Relaxing the discrete space, however, introduces statistical bias—in this case, into the computed gradient values. The higher the temperature, the higher the corresponding bias.

Recently, a benchmarking paper by Papoudakis et al. [5] found that MADDPG achieved decent performance in certain MARL environments, but performed markedly worse in grid-world situations, where the action space is discrete. The authors suggested that this degradation of performance may be due to the aforementioned bias introduced by the STGS.

Fortunately, the STGS is just one tool in a diverse toolbox, part of a field broadly termed discrete gradient estimation. This field is relevant in a host of contexts outside of MARL (e.g. discrete variational auto-encoders [9,10]). Accordingly, the biased relaxation of the STGS is problematic in many domains, which has encouraged a significant research effort to improve the method. The result? A wide variety of alternatives! Naturally, this made us wonder:

Can we benefit from using these alternative techniques in place of the Gumbel-Softmax, when learning in grid-world tasks with MADDPG?


Candidate Alternatives

To explore this question, we surveyed the gradient-estimation literature, and identified a wide range of alternatives to try. Thereafter, four techniques were selected, primarily for their simplicity. Two were easy changes to the STGS, and two were recent contributions from the community:


Next, we chose a set of grid-world environments in which we could test these new estimators. We used a sensible subset of the choices made by Papoudakis et al. [5] in their benchmarking paper: eleven tasks across three distinct environments. The environments are depicted and linked below.

Evaluation Metrics

Finally, we defined evaluation metrics. Specifically, we focused on two metrics for performance:


For a thorough treatment of the results achieved, we encourage readers to see the full dissertation report. Here, we highlight several interesting observations.


Out of the eleven tasks explored, there were a handful of simpler ones—tasks in which MADDPG performed comparatively well in the benchmarking paper by Papoudakis et al [5]. Perhaps unsurprisingly, using the alternative gradient estimators made no significant changes in these tasks. We believe that the bias introduced by the STGS was not particularly problematic for these simple situations; hence, changing the estimator had little effect.

Things became far more interesting in the harder tasks—ones in which MADDPG originally struggled. In four of these situations, we observed statistically significant results. The training curves for these tasks are presented below:

Graph showing the returns over the course of training for the LBF-15x15-3p-5f task. The GST estimator is seen to achieve higher returns and faster convergence than the others.

LBF: 15x15-3p-5f

Graph showing the returns over the course of training for the LBF-15x15-4p-3f task. The GST estimator is seen to achieve similar returns but faster convergence compared the others.

LBF: 15x15-4p-3f

Graph showing the returns over the course of training for the LBF-15x15-4p-5f task. The GST estimator is seen to achieve higher returns and faster convergence than the others.

LBF: 15x15-4p-5f

Graph showing the returns over the course of training for the rware-tiny-4ag task. The GST estimator is seen to achieve higher returns and faster convergence than the others.

RWARE: tiny-4ag

Firstly, we notice and acknowledge that TAGS was a bad estimator choice, with it underperforming in each of these tasks. As for the other estimators, though, consider the significant improvements over the baseline! In particular, we highlight the superior results of the GST estimator. In each of the four graphs, we see that this estimator yields higher maximum returns, and faster convergence (i.e. better sample efficiency).

Compute Time

After observing superior performance from the alternative estimators, the question remained: how do the computational requirements compare?

As a high-level summary:


To answer our original question: can we benefit from using alternative estimators, instead of the Gumbel-Softmax? Evidently, yes! We believe the results observed support this conclusion, and hopefully the graphs presented are sufficiently convincing. Notice the benefit of such an outcome: we can take the extant MADDPG algorithm, replace only the gradient estimation technique (e.g. swap out the STGS for the GST, and leave everything else the same) and the resulting performance may likely improve. Though our algorithm becomes slightly more expensive computationally, we witness faster convergence and higher returns, with minimal development overhead. This is an easy win, we'd say!

Based on our findings, we can draft some general guidelines for choosing an alternative estimator: if minimising the computational burden is paramount for a given problem, it may be worth using the STGS-T, for it has the same overhead as the STGS-1, yet yields improvements in both the achieved returns and sample efficiency. However, if one can afford a more expensive relaxation procedure, the GST is a good fit. It is somewhat slower, but the benefits are significant.

But... why?

We now face a pressing question: it is wonderful that we noticed improvements, but why exactly? What in these new estimators is enabling higher returns? The postulation has been that the bias introduced by the STGS was a problem. So, perhaps by lowering the bias, these new estimators yield better results. However, we need to investigate this question further. We need concrete evidence.

Such investigations remain as future research, but to catalyse such efforts, we briefly explored a possible avenue: the variance of the computed gradients across mini-batches in the training. We hypothesise that uninformative gradients (i.e. those due to a poor discrete-gradient estimator) yield a mini-batch with low variance, since there are no elements in particular which "stand out". In contrast, we believe that informative gradients will have higher variance across the mini-batch, for the opposite reason—particular components in the gradient vectors are more important than others, yielding large differences between them.

Reconsidering one of the tasks from before (LBF: 15x15-4p-5f), we retrained with two estimators: the baseline STGS-1, and the best performing alternative, the GST. The variance of the gradients is plotted below:

Graphs showing the variance of the gradients across mini-batches for the two estimators. The metrics for the GST estimator increase more rapidly than those for the baseline.

Gradient variance for each layer in the policy networks, with the two estimators: STGS-1 and GST

Notice the clear trend in these graphs: the variance of the gradients increased more rapidly for the GST algorithm than those for the baseline. Though not definitive—at present, merely a hunch—we believe that such results indicate the presence of more informative gradients in the policy networks, when using GST. Informative gradients, in turn, allow the algorithm to achieve higher returns and converge faster.

Future Work

Of course, much more work needs to be done. This project opens up an array of new questions, and we list some ideas below:

Ultimately, we hope that the work here will be a doorway for future research. And based on the results observed, it seems to be an inviting one!


  1. Ryan Lowe, Wu Yi, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments." Advances in Neural Information Processing Systems, volume 30, 2017.
  2. Eric Jang, Shixiang Gu, and Ben Poole. "Categorical Reparameterization with Gumbel-Softmax". International Conference on Learning Representations, 2017.
  3. Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables." International Conference on Learning Representations, 2017.
  4. Ming Tan. "Multi-agent reinforcement learning: Independent vs. cooperative agents." International Conference on Machine Learning, 1993.
  5. Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. "Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks". Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  6. Georgios Papoudakis, Filippos Christianos, Arrasy Rahman, and Stefano V. Albrecht. "Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning". arXiv:1906.04737 [cs, stat], 2019.
  7. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous control with deep reinforcement learning." International Conference on Learning Representations, 2016
  8. David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. "Deterministic Policy Gradient Algorithms." Proceedings of the 31st International Conference on Machine Learning, pages 387-395, 2014.
  9. Jason Tyler Rolfe. "Discrete Variational Autoencoders". International Conference on Learning Representations, 2017.
  10. Diederik P. Kingma and Max Welling. "An introduction to variational autoencoders". Foundations and Trends in Machine Learning, 2019.
  11. Max B. Paulus, Chris J. Maddison, and Andreas Krause. "Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator." International Conference on Learning Representations, 2021.
  12. Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky, and Peter J Ramadge. "Training Discrete Deep Generative Models via Gapped Straight-Through Estimator". Proceedings of the 39th International Conference on Machine Learning, PMLR 162:6059-6073, 2022.