Taming Transformers for High-Resolution Image Synthesis
Motivation
- Goal: combine the inductive bias of CNNs with the expressivity of transformers
- use a CNN to learn a codebook that encodes visual information
- use a transformer to model the long-range interactions between the resulting tokens
- use a GAN to ensure that the codebook captures perceptually rich information
Method
Learning an Effective Codebook of Image Constituents for Use in Transformers
To use a transformer, an image first has to be represented as a sequence of discrete tokens
Use a learned codebook to enable this[^1]
Make two models, an encoder $E$ and a decoder $G$, which are CNNs.
- $$ \mathbf{z}_q = \mathbf{q}(\hat{\mathbf{z}}) := \left( \arg\min_{z_k \in \mathcal{Z}} \left\| \hat{z}_{ij} - z_k \right\| \right) \in \mathbb{R}^{h \times w \times n_z} $$
Then get the reconstructed image $\hat x \approx x$ by $\hat{x} = G(\mathbf{z}_q)$
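As a concrete illustration, here is a minimal PyTorch sketch of the encode → quantize → decode pipeline. The `VQAutoencoder` class, layer widths, and strides are illustrative assumptions rather than the paper's exact architecture; only the nearest-codebook lookup and the straight-through gradient trick follow the description above.

```python
import torch
import torch.nn as nn

class VQAutoencoder(nn.Module):  # hypothetical sketch, not the paper's architecture
    def __init__(self, n_z=256, K=1024):
        super().__init__()
        # E: CNN encoder, x (B, 3, H, W) -> z_hat (B, n_z, H/4, W/4)
        self.E = nn.Sequential(
            nn.Conv2d(3, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, n_z, 4, stride=2, padding=1),
        )
        # G: CNN decoder, quantized z_q -> reconstruction x_hat
        self.G = nn.Sequential(
            nn.ConvTranspose2d(n_z, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1),
        )
        # Z: codebook of K entries z_k in R^{n_z}
        self.Z = nn.Embedding(K, n_z)

    def quantize(self, z_hat):
        B, C, h, w = z_hat.shape
        flat = z_hat.permute(0, 2, 3, 1).reshape(-1, C)    # (B*h*w, n_z)
        # argmin_k ||z_hat_ij - z_k||: nearest codebook entry per position
        idx = torch.cdist(flat, self.Z.weight).argmin(dim=1)
        z_q = self.Z(idx).view(B, h, w, C).permute(0, 3, 1, 2)
        return z_q, idx.view(B, h, w)                      # z_q and token indices s

    def forward(self, x):
        z_hat = self.E(x)
        z_q, s = self.quantize(z_hat)
        # straight-through estimator: identity in the forward pass,
        # copies gradients from z_q back to z_hat (and thus into E)
        z_q = z_hat + (z_q - z_hat).detach()
        return self.G(z_q), s                              # x_hat = G(z_q)
```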
The final loss function is as follows:
$$ \mathcal{L}_{\text{VQ}}(E, G, \mathcal{Z}) = \|x - \hat{x}\|^2 + \|\text{sg}[E(x)] - \mathbf{z}_q\|_2^2 + \|\text{sg}[\mathbf{z}_q] - E(x)\|_2^2 $$
For the details of this loss function (the stop-gradient $\text{sg}[\cdot]$, the codebook term, and the commitment term), please refer to VQ-VAE.
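A minimal sketch of this loss, assuming `z_e = E(x)` and `z_q` are the feature maps before and after quantization; `.detach()` plays the role of the stop-gradient $\text{sg}[\cdot]$. The formula above corresponds to `beta=1.0`; the paper exposes a commitment weight $\beta$ as a hyperparameter.

```python
import torch.nn.functional as F

def vq_loss(x, x_hat, z_e, z_q, beta=1.0):
    rec = F.mse_loss(x_hat, x)                  # ||x - x_hat||^2
    codebook = F.mse_loss(z_q, z_e.detach())    # ||sg[E(x)] - z_q||^2, moves codes toward encodings
    commit = F.mse_loss(z_e, z_q.detach())      # ||sg[z_q] - E(x)||^2, commits E(x) to its code
    return rec + codebook + beta * commit
```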
- Learning a Perceptually Rich Codebook
- rather than using a PixelCNN, use a transformer to model $p(z)$
- replace the L2 reconstruction loss with a perceptual loss, and add an adversarial loss from a patch-based discriminator $D$
- The complete objective is:
$$ \mathcal{Q}^* = \arg\min_{E, G, \mathcal{Z}} \max_D \, \mathbb{E}_{x \sim p(x)} \left[ \mathcal{L}_{\text{VQ}}(E, G, \mathcal{Z}) + \lambda \mathcal{L}_{\text{GAN}}(\{E, G, \mathcal{Z}\}, D) \right] $$
In short, this swaps the plain L2 reconstruction error for a perceptual loss plus a discriminator loss.
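The note does not spell out $\mathcal{L}_{\text{GAN}}$; one common instantiation (a hinge loss, which the official taming-transformers code supports) looks like the sketch below, with the patch-based discriminator $D$ itself omitted.

```python
import torch.nn.functional as F

def d_hinge_loss(logits_real, logits_fake):
    # discriminator side: push logits on real patches above +1, on fakes below -1
    return 0.5 * (F.relu(1.0 - logits_real).mean() + F.relu(1.0 + logits_fake).mean())

def g_adv_loss(logits_fake):
    # generator side of L_GAN: make D score the reconstructions as real
    return -logits_fake.mean()
```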
And the $\lambda$ is computed adaptively as:
$$ \lambda = \frac{\nabla_{G_L}[\mathcal{L}_{\text{rec}}]}{\nabla_{G_L}[\mathcal{L}_{\text{GAN}}] + \delta} $$
- It is a scaling between the reconstruction loss and the GAN loss
- $\mathcal{L}_{\text{rec}}$ is the perceptual reconstruction loss, $\nabla_{G_L}[\cdot]$ denotes the gradient with respect to the last layer $L$ of the decoder, and $\delta = 10^{-6}$ is for numerical stability
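A sketch of this adaptive weight, assuming `last_layer_weight` is the weight tensor of $G$'s final layer; in practice the ratio of gradient norms is used, and the clamp mirrors the official code.

```python
import torch

def adaptive_lambda(L_rec, L_gan, last_layer_weight, delta=1e-6):
    # lambda = ||grad_{G_L}[L_rec]|| / (||grad_{G_L}[L_GAN]|| + delta)
    g_rec = torch.autograd.grad(L_rec, last_layer_weight, retain_graph=True)[0]
    g_gan = torch.autograd.grad(L_gan, last_layer_weight, retain_graph=True)[0]
    lam = g_rec.norm() / (g_gan.norm() + delta)
    return lam.clamp(0.0, 1e4).detach()  # treated as a constant during backprop
```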
Learning the Composition of Images with Transformers
Latent Transformers
- To generate an image, we need to sample a plausible sequence $s$ of codebook indices
- Rather than using a PixelCNN, this paper uses a transformer
- predict the distribution of possible next indices, i.e. $p(s_i \mid s_{<i})$
- $p(s)$ is modeled as $p(s) = \prod_i p(s_i \mid s_{<i})$, trained by maximizing the log-likelihood of the index sequences
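A sketch of this autoregressive sampling loop. The `transformer` interface (index prefix in, per-position logits over the $K$ codebook entries out) and the start-of-sequence token are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def sample_latents(transformer, seq_len, sos_token, device="cpu"):
    # start from an assumed start-of-sequence token
    s = torch.full((1, 1), sos_token, dtype=torch.long, device=device)
    for _ in range(seq_len):
        logits = transformer(s)[:, -1, :]          # logits for p(s_i | s_<i)
        nxt = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        s = torch.cat([s, nxt], dim=1)             # append the sampled index
    return s[:, 1:]  # drop SOS; reshape to (h, w), look up z_q, decode with G
```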
Conditioned Synthesis
- for conditioned image generation, model $p(s \mid c) = \prod_i p(s_i \mid s_{<i}, c)$; when the condition $c$ has spatial extent (e.g., a segmentation map), encode it with another VQGAN into an index sequence $r$ and prepend $r$ to $s$ (see the sketch below)
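Under the same assumed interface as above, conditioning then just means sampling the image tokens after the prefix $r$:

```python
import torch

@torch.no_grad()
def sample_conditional(transformer, r, seq_len):
    # r: condition index sequence (1, M) from a second VQGAN; image tokens are
    # drawn from p(s_i | s_<i, c) by treating r as a fixed prefix
    s = r
    for _ in range(seq_len):
        logits = transformer(s)[:, -1, :]
        s = torch.cat([s, torch.multinomial(logits.softmax(dim=-1), 1)], dim=1)
    return s[:, r.shape[1]:]  # keep only the generated image tokens
```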
Discussion
[^1]: At that time, the Vision Transformer had not been released yet; the two papers came out almost simultaneously, so this paper has a similar goal to the Vision Transformer paper.