# Gumbel Machinery

Recently, Laurent Dinh wrote a great blogpost asking (and answering) whether it was possible to invert the Gumbel-Max trick: given a sample from a discrete random variable, can we sample from the Gumbels that produced it?

We thought it would be valuable to show an alternative approach to Laurent’s question. Taking this tack we can prove the Gumbel-Max trick, answer Laurent’s question, and more, all in a few short lines. All together this results in four central properties that form a sort of Gumbel machinery of intuitions, which were indispensable during the development of our NIPS 2014 paper A* Sampling.

Implemented in Python, the Gumbel-Max trick looks like this:

The trick is to show that K sampled in such a way produces a sample from the discrete distribution proportional to alpha[i]. In mathematical notation,

where $\alpha_i > 0$ and $Z = \sum_i \alpha_i$. We will get to the proof, but for now Laurent’s question was simply, given K, what is the distribution over gumbels?

## The Gumbels

The apparently arbitrary choice of noise in the Gumbel-Max trick is its namesake; If $U \sim \mathrm{uniform}[0,1]$, then

has a Gumbel distribution with location $\log \alpha$. The cumulative distribution function (CDF) of a Gumbel is

The derivative of this is the density of the Gumbel,

Thus, the joint density of gumbels is

#### Product of Gumbel CDFs.

The keystone for any understanding of Gumbels is that multiplying their CDFs accumulates the $\alpha_i$ parameters.

The derivation of this property is just some simple algebra,

## The Joint

Our strategy will be to write down the joint distribution of the gumbels and K in an intuitive form, then to manipulate it to reveal a structure like

which answers Laurent’s question. This section is a bit tedious, but it only needs to be done once before getting at the valuable properties of Gumbels.

Knowing K restricts the possible Gumbel events to ones in which gumbels[K] > gumbels[i] for i != K. So, intuitively the joint of gumbels and K is

where $K$ is one of $\{1, \ldots, n\}$ and $[A]$ is the Iverson bracket notation for the indicator function of set $A$. If you’re still not convinced, then sum over $K$ or integrate over $g_1, \ldots, g_n$. In both cases you will get the correct marginal events.

First, we multiply by a judicious choice of 1,

pull the density of gumbels[K] out of the product,

apply the product of Gumbel CDFs property in reverse,

distribute into the product,

expand the density of gumbels[K] and apply the product of Gumbel CDFs,

finally, multiply by $\exp(\log Z - \log Z)$ to get:

## Four Important Properties of Gumbels

The value of that tedious algebra is what it reveals. We can now simply read off the following properties from our form of the joint density of K and gumbels. Refering back to the code

#### 1. Gumbel-Max Trick.

K is distributed as $\mathbb{P}(K = i) = \alpha_i / Z$, since

#### 2. The max Gumbel integrates over alphas.

gumbels[K] is distributed as a Gumbel with location $\log Z$ where $Z = \sum_i \alpha_i$, since

#### 3. Argmax and max are independent.

K and gumbels[K] are independent,

#### 4. The remaining Gumbels are still independent but truncated.

Given K and gumbels[K] consider

The remaining Gumbels remaining_gumbels[i] are independent Gumbels with location remaining_alpha[i] truncated at gumbels[K], since

## Laurent’s Question and Beyond

Reading off the density, we also get a simple answer to Laurent’s question. To sample from $p(g_1 \ldots, g_n | K)$, sample the top Gumbel $g_K$ with location $\log Z$ and for $i \neq K$ sample truncated Gumbels with locations $\log \alpha_i$. This code samples a truncated Gumbel,

This code samples from the desired posterior,

For reference, here is the rejection version

Note that this code differs slightly from the routine in Laurent’s blogpost, up to a shift of $\log \alpha_i$ for each Gumbel.

The above alternative factorization of the joint and the properties of Gumbels that it implies have been valuable to us, allowing manipulation of Gumbels at a higher level of abstraction. In fact, the factorization is a special case of the Top-Down construction that is at the core of A* Sampling. By applying these intuitions recursively, we can easily derive algorithms for sampling a set of Gumbels in decreasing order, or even sampling a heap of Gumbels from the top down. We hope to cover the more general version in a future blog post.