Adversarial Generator-Encoder Networks [arxiv:1704.02304]

This post is about…

  • a review of AGE  https://arxiv.org/abs/1704.02304

Greetings!

G: Hello! This is Guguchi! In this week’s Young Jump, Kunoichi-no-Ichi was on hiatus. What a pain. Nice to meet you!

M: Hi hi! This is Maso. I haven’t bought Young Jump this week. Damn. Kunoichi-no-Ichi is that manga about a dude posing as a female ninja, right? A total badass.

G: LOL, we sound like idiots!

M: Agreed LOL. We are going to begin with AGE, correct? https://arxiv.org/abs/1704.02304

G: Yes, yes, Adversarial Generator-Encoder Networks. By the way, we are still having trouble with this markdown. Like, how do we indent this thing? I think it is hard to read. If anyone is familiar with dialogue format in markdown, please help!!!

The real stuff

M: I think we’ve got to begin with WGAN.

G: Chill, chill, one step at a time. This paper trains an autoencoder in an adversarial manner. In particular, if $g$ is the generator and $e$ is the encoder, the goal is to train $g$ and $e$ through an adversarial game.

M: Right, we make $g$ and $e$ butt against each other.

G: Well, this confused me at first, because it made no sense to me why we had to maximize the objective function over the encoder.

M: On page 2, the authors claim: “Below, we show how to design an adversarial game between e and g that ensures the alignment of g(Z) = X in the data space, while only evaluating divergences in the latent space.” So the encoder is trained to push the two distributions as far apart as possible in the latent space, while the generator tries to pull them together.

G: Well, that said, what is going on with the order of max and min?

M: I am a little confused about this too. As a matter of fact, we have the max-min inequality

\begin{align} \max_x \min_y f(x, y) \le \min_y \max_x f(x, y) \end{align}

So I totally agree, swapping the order of these operations can be a problem. But algorithmically, it seems like no one cares. People just update the two sets of parameters in turn.
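(A toy illustration of that “just update in turn” recipe. Nothing here is from the paper: the objective $f(x, y) = -x^2 + 2xy + y^2$ is made up, with its saddle point at $(0, 0)$, and we simply alternate an ascent step for the max player with a descent step for the min player.)

```python
# Made-up saddle objective f(x, y) = -x^2 + 2xy + y^2:
# concave in x (the "max" player), convex in y (the "min" player),
# with its unique saddle point at (0, 0).
df_dx = lambda x, y: -2 * x + 2 * y
df_dy = lambda x, y:  2 * x + 2 * y

x, y, lr = 1.0, 1.0, 0.1
for _ in range(500):
    x += lr * df_dx(x, y)   # ascent step for the max player
    y -= lr * df_dy(x, y)   # descent step for the min player

print(f"x = {x:.4f}, y = {y:.4f}")  # both end up near the saddle at (0, 0)
```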

G: Why did they swap this order in the first place? Everyone writes min max instead of max min. In GAN, we do

\begin{align} \min_G \max_D V(G, D), \qquad V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \end{align}

where we find the most pathological way to measure the ‘distance’ between the generator and the target, and then try to make the generator and the target close in terms of that ‘distance’. And that distance is based on $D$.
Also, our goal is ultimately to drive the divergence that $\max_D V(G, D)$ measures down to zero.
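(A quick reminder of why that inner maximization is a ‘distance’; this is the standard computation from the original GAN paper, not anything specific to AGE. For a fixed $G$, the optimal discriminator and the resulting value are \begin{align} D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}, \qquad \max_D V(G, D) = -\log 4 + 2\, \mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g), \end{align} so minimizing over $G$ drives the Jensen–Shannon divergence term, rather than $V$ itself, to zero.)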

M: Well, I agree with you 100%. I think this min-max has a lot to do with the infinity norm on a vector space.

The analogous story is like this. If $q$ is a vector and $F$ is a vector subspace, we are interested in finding the projection of $q$ onto $F$ in the sup metric, that is, we want to find the $p^*$ that achieves

\begin{align} \min_{p \in F} \sup_k | p_k - q_k | \end{align}
G: Wait, $k$ is the vector’s index, correct? So $p_k$ is the $k$th coordinate. So we are looking for a vector $p$ in $F$ such that the distance from $q$ at the ‘worst’ coordinate is minimal.
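(A quick sanity check with made-up numbers: the ‘worst coordinate’ distance is exactly the infinity norm that numpy computes.)

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])   # made-up vectors
p = np.array([0.5, 0.3, 0.2])

# "worst coordinate" distance: sup_k |p_k - q_k|
print(np.max(np.abs(p - q)))            # 0.2 (the first coordinate)
print(np.linalg.norm(p - q, np.inf))    # the same thing, as the infinity norm
```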

M: Right! And I think we can go a little further. We can write

\begin{align} p_k - q_k = \langle p - q, e(k) \rangle \end{align}

where $e(k)$ is the one-hot vector with its nonzero entry at the $k$th coordinate. So the sup here is taken over the choice of one-hot vectors. Instead of searching within the set of one-hot vectors using

\begin{align} \sup_k | p_k - q_k | = \sup_k | \langle p - q, e(k) \rangle | \end{align}

we can extend our search to all unit vectors in the entire vector space $V$.

\begin{align} \sup_{v \in V, |v| = 1} {| \langle p - q, v \rangle| } = \sup_{v \in V, |v| = 1} {| \langle p, v \rangle - \langle q, v \rangle| } \end{align} We know all too well that this sup is achieved by the $v$ pointing in the direction of $p - q$, so this is in fact a measurement in the $L^2$ distance. Anyway, we will now be looking for the $p^*$ that achieves

\begin{align} \min_{p \in F} \sup_{v \in V, |v| = 1} | \langle p, v \rangle - \langle q, v \rangle | \end{align}
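(Another quick numerical check, again with made-up vectors: searching over many random unit directions, the largest $|\langle p - q, v \rangle|$ does approach $\|p - q\|_2$.)

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # made-up vectors
q = np.array([0.7, 0.2, 0.1])

# Random search over unit vectors v for the largest |<p - q, v>|.
v = rng.normal(size=(100_000, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)
print(np.max(np.abs(v @ (p - q))))   # approaches the exact value below
print(np.linalg.norm(p - q))         # ||p - q||_2, the supremum
```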

G: This looks like WGAN?

M: That is what I think. An integral is an inner product:

\begin{align} \langle p, v \rangle = \int p(x)\, v(x)\, dx \end{align}

So if $p$ is a probability distribution,

\begin{align} \langle p, v \rangle = \mathbb{E}_{x \sim p}[v(x)] \end{align}

so that the equation above becomes

\begin{align} \min_{p \in F} \sup_{v} \left| \mathbb{E}_{x \sim p}[v(x)] - \mathbb{E}_{x \sim q}[v(x)] \right| \end{align}

So if $V$ is a function space and $F$ is a space of probability distributions, I think this is exactly WGAN, where the ‘unit ball’ of test functions is the set of 1-Lipschitz functions.
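(A toy check of that reading, with made-up Gaussians: for $p = N(0, 1)$ and $q = N(2, 1)$, the Wasserstein-1 distance is exactly the mean shift of 2, and a linear 1-Lipschitz ‘critic’ attains it.)

```python
import numpy as np

rng = np.random.default_rng(0)
xs_p = rng.normal(loc=0.0, size=100_000)   # samples from p = N(0, 1)
xs_q = rng.normal(loc=2.0, size=100_000)   # samples from q = N(2, 1)

# WGAN dual form: sup over 1-Lipschitz v of E_p[v(x)] - E_q[v(x)].
# For a pure mean shift, v(x) = -x (slope 1) is an optimal critic.
v = lambda x: -x
print(v(xs_p).mean() - v(xs_q).mean())   # ~ 2.0, the W1 distance
```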

G: Do you think we can give a similar interpretation to AGE, our main topic? We have gone astray LOL

M: Ah, right, right. Sorry, I could not come up with one.

G: LOL !

M: By the way, I always thought…. instead of comparing ‘the expectation of $v$ with respect to distribution $p$’ and ‘the expectation of $v$ with respect to distribution $q$’, why don’t we compare the distributions of $v(x)$ themselves,

\begin{align} v(p) \quad \text{and} \quad v(q), \end{align}

say, with KL divergence?
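(A toy version of that idea, with everything made up: push two sample sets through a test function $v$, fit a Gaussian to each pushforward, and compare the fits with the closed-form Gaussian KL.)

```python
import numpy as np

rng = np.random.default_rng(0)
xs_p = rng.normal(loc=0.0, size=100_000)   # samples from p
xs_q = rng.normal(loc=2.0, size=100_000)   # samples from q
v = lambda x: np.tanh(x)                   # a made-up test function

# Fit a univariate Gaussian to v(x) under each distribution, then use
# the closed-form KL divergence between the two Gaussian fits.
m1, s1 = v(xs_p).mean(), v(xs_p).std()
m2, s2 = v(xs_q).mean(), v(xs_q).std()
kl = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(kl)   # > 0: the two pushforward distributions clearly differ
```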

G: That sounds right, we shall definitely do that!!

M: Wait, so we will be looking at

\begin{align} \min_g \max_e \, \mathrm{KL}\big( e(g(Z)) \,\big\|\, e(X) \big) \end{align}

and our $F$ is the set of distributions that can be generated from $Z$ with some parametric family of generators $G$, so

\begin{align} F = \{\, g(Z) : g \in G \,\} \end{align}

Oh shit… this is AGE.

G: Darn LOL This is what I wanted to do!!!

M: Sigh. People are smart. What a pain.
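For concreteness, here is a minimal sketch of the max-min game as we have been reading it, in PyTorch. To be clear, this is our toy interpretation and not the paper’s actual algorithm: AGE anchors both encoded distributions to a fixed reference distribution and adds reconstruction terms (without which the unconstrained maximization below can easily diverge), and the network shapes, toy data, and Gaussian moment-matching KL here are all made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_x, dim_z = 2, 2   # made-up toy dimensions

# g: latent -> data, e: data -> latent (tiny MLPs, purely illustrative)
g = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_x))
e = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, dim_z))
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(e.parameters(), lr=1e-3)

def gaussian_kl(a, b):
    """KL divergence between diagonal-Gaussian moment fits of batches a, b."""
    mu1, var1 = a.mean(0), a.var(0) + 1e-6
    mu2, var2 = b.mean(0), b.var(0) + 1e-6
    return 0.5 * (var1 / var2 + (mu1 - mu2) ** 2 / var2
                  - 1.0 + torch.log(var2 / var1)).sum()

def sample_data(n):
    # stand-in for the real data distribution X: a shifted Gaussian blob
    return torch.randn(n, dim_x) + torch.tensor([2.0, 0.0])

for step in range(2000):
    x = sample_data(256)
    z = torch.randn(256, dim_z)

    # encoder step: push e(g(Z)) and e(X) apart (the max player)
    div = gaussian_kl(e(g(z).detach()), e(x))
    opt_e.zero_grad()
    (-div).backward()
    opt_e.step()

    # generator step: pull e(g(Z)) toward e(X) (the min player)
    div = gaussian_kl(e(g(z)), e(x))
    opt_g.zero_grad()
    div.backward()
    opt_g.step()
```

Which player steps first is exactly the max-min versus min-max question from earlier; in the alternating loop it only changes the update order, which is the sense in which, algorithmically, no one cares.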

Written on April 22, 2017