DCGAN

Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434 (2015).

The following are my reading notes on this paper.

1 Introduction

In this paper, they propose that one way to build image representations is by training GANs1, and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks.

GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs.

Authors’ contributions:

  • They propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. They name this class of architectures Deep Convolutional GANs (DCGAN).
  • They use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.
  • They visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.
  • They show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.

2.1 Representation learning from unlabeled data

  • do clustering on the data (e.g. K-means)2
  • train auto-encoders
    • stacked denoising autoencoders3
    • stacked what-where autoencoders4
    • ladder structures5
  • DBN6

2.2 Generating natural images

2.3 Visualizing the internals of CNNs

3 Approach and Model Architecture

Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.

Summary - architecture guidelines for stable DCGANs:

  • Replace any pooling layers with strided convolutions7 (discriminator) and fractionally-strided convolutions (generator). This lets the network learn its own spatial downsampling and upsampling.
  • Use Batch Normalization8 in both the generator and the discriminator. BN stabilizes learning by normalizing the input to each unit to have zero mean and unit variance, and it helps gradient flow in deeper models. To avoid sample oscillation and model instability, do not apply BN to the generator output layer or the discriminator input layer.
  • Remove fully connected hidden layers for deeper architectures. Global average pooling9 increased model stability but hurt convergence speed.
  • Use ReLU activation in the generator for all layers except the output, which uses Tanh. A bounded activation lets the model learn more quickly to saturate and cover the color space of the training distribution.
  • Use LeakyReLU activation in the discriminator for all layers, in contrast to the maxout activation used in the original GAN. (A minimal generator sketch following these guidelines appears after this list.)
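
To make these guidelines concrete, here is a minimal PyTorch sketch of the 64 × 64 generator from Figure 1. This is my own rendering of the guidelines, not the authors' code; `z_dim = 100` matches the paper, while the base width `ngf` is an arbitrary choice.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, ngf=128, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # Project z to a 4x4 spatial extent with many feature maps.
            nn.ConvTranspose2d(z_dim, ngf * 8, 4, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(inplace=True),
            # Fractionally-strided convolutions double the resolution each step:
            # 4 -> 8 -> 16 -> 32 -> 64. No pooling, no fully connected layers.
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(inplace=True),
            # Output layer: no BN here, and Tanh bounds the image to [-1, 1].
            nn.ConvTranspose2d(ngf, channels, 4, stride=2, padding=1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, z_dim) -> (batch, z_dim, 1, 1) for the first convolution.
        return self.net(z.view(z.size(0), -1, 1, 1))
```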

4 Details of adversarial training

  • Datasets:
    • Large-scale Scene Understanding (LSUN)10
    • Imagenet-1k
    • a newly assembled Faces dataset
  • No pre-processing was applied to training images besides scaling to the range of the tanh activation function $[-1, 1]$.
  • All models were trained with mini-batch SGD with mini-batch size of $128$.
  • All weights were initialized from a zero-centered Normal distribution with standard deviation $0.02$.
  • In the LeakyReLU, the slope of the leak was set to $0.2$ in all models.
  • Use Adam optimizer11 with tuned hyperparameters.
  • Learning rate is $0.0002$ ($0.001$ is too high).
  • Reduce the momentum term $\beta_{1}$ from the suggested value of $0.9$ to $0.5$, which helped stabilize training (see the configuration sketch below).
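
A hedged sketch of this training configuration, assuming PyTorch models `gen` and `disc` (e.g. the generator sketch above):

```python
import torch
import torch.nn as nn

def init_weights(m):
    # All weights drawn from a zero-centered Normal with std 0.02.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

gen.apply(init_weights)
disc.apply(init_weights)

# Adam with the tuned learning rate 0.0002 and beta1 reduced to 0.5.
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4, betas=(0.5, 0.999))

def preprocess(x):
    # Only pre-processing: scale uint8 images to [-1, 1], the range of tanh.
    return x.float() / 127.5 - 1.0

batch_size = 128  # mini-batch SGD with mini-batches of size 128
```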

4.1 LSUN

By showing samples after one epoch of training, in addition to samples after convergence, they demonstrate that the model is not producing high-quality samples by simply overfitting/memorizing training examples.

4.1.1 Deduplication

4.2 Faces

4.3 Imagenet-1k

5 Empirical Validation of DCGANs' capabilities

5.1 Classifying CIFAR-10 using GANs as a feature extractor

To evaluate the quality of the representations learned by DCGANs for supervised tasks, they train the model on Imagenet-1k and then use the discriminator's convolutional features to classify CIFAR-10: each layer's feature maps are max-pooled to a $4 \times 4$ spatial grid, then flattened and concatenated, and a regularized linear L2-SVM classifier is trained on top, achieving a competitive result (a sketch of this pipeline follows).
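
A hedged sketch of that pipeline, assuming a trained discriminator whose convolutional blocks are exposed in order via a hypothetical `disc.conv_layers` list, and hypothetical CIFAR-10 tensors:

```python
import torch
import torch.nn.functional as F
from sklearn.svm import LinearSVC  # L2-regularized linear SVM

def extract_features(disc, images):
    feats, h = [], images
    with torch.no_grad():
        for layer in disc.conv_layers:  # hypothetical attribute
            h = layer(h)
            # Max-pool each layer's maps to a 4x4 grid, then flatten.
            feats.append(F.adaptive_max_pool2d(h, 4).flatten(start_dim=1))
    return torch.cat(feats, dim=1).cpu().numpy()

X_train = extract_features(disc, cifar_train_images)  # hypothetical tensors
X_test = extract_features(disc, cifar_test_images)
clf = LinearSVC()  # regularized linear L2-SVM on the frozen features
clf.fit(X_train, train_labels)  # hypothetical label arrays
print("test accuracy:", clf.score(X_test, test_labels))
```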

5.2 Classifying SVHN digits using GANs as a feature extractor

Following the same preparation as in the CIFAR-10 experiments, they achieve state-of-the-art test error on the SVHN dataset. Additionally, they train a purely supervised CNN with the same architecture on the same data and get a significantly worse result, which demonstrates that the CNN architecture used in DCGAN is not the key contributing factor.

6 Investigating and visualizing the internals of the networks

6.1 Walking in the latent space

Figure 4: Top rows: Interpolation between a series of 9 random points in Z shows that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, you see a room without a window slowly transforming into a room with a giant window. In the 10th row, you see what appears to be a TV slowly being transformed into a window.
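
Walking the latent space is just linear interpolation between random points in Z, decoded by the trained generator. A minimal sketch, assuming a trained `gen` such as the one above:

```python
import torch

z_start = torch.rand(1, 100) * 2 - 1  # Z is uniform, here in [-1, 1]
z_end = torch.rand(1, 100) * 2 - 1

steps = torch.linspace(0, 1, 10).view(-1, 1)
z_path = z_start + steps * (z_end - z_start)  # 10 points along the segment

with torch.no_grad():
    frames = gen(z_path)  # ten images that should morph smoothly
```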

6.2 Visualizing the Discriminator Features

Figure 5: On the right, guided backpropagation visualizations of maximal axis-aligned responses for the first 6 learned convolutional features from the last convolution layer in the discriminator. Notice that a significant minority of features respond to beds, the central object in the LSUN bedrooms dataset. On the left is a random filter baseline; compared to the learned features, it shows little to no discrimination and random structure.
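
Guided backpropagation7 modifies the ReLU backward pass so that only positive gradients flow back to the input. A generic sketch, assuming plain ReLU activations for simplicity (the DCGAN discriminator actually uses LeakyReLU) and a hypothetical `disc.features` handle for the target layer:

```python
import torch
import torch.nn as nn

def guided_relu_hook(module, grad_input, grad_output):
    # Standard ReLU backward already zeroes positions whose input was
    # negative; additionally clamp so only positive gradients pass.
    return (grad_input[0].clamp(min=0),)

for m in disc.modules():
    if isinstance(m, nn.ReLU):
        m.register_full_backward_hook(guided_relu_hook)

feature_idx = 0  # hypothetical: index of the feature map to visualize
x = images.clone().requires_grad_(True)  # hypothetical input batch
disc.features(x)[:, feature_idx].sum().backward()  # hypothetical handle
saliency = x.grad  # guided-backprop visualization of that feature
```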

6.3 Manipulating the Generator Representation

6.3.1 Forgetting to draw certain objects

Figure 6: Top row: un-modified samples from the model. Bottom row: the same samples generated with “window” filters dropped out. Some windows are removed, others are transformed into objects with a similar visual appearance, such as doors and mirrors. Although visual quality decreased, overall scene composition stayed similar, suggesting the generator has done a good job of disentangling scene representation from object representation. Extended experiments could be done to remove other objects from the image and modify the objects the generator draws.
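
A hedged sketch of the filter-dropout intervention, assuming the generator exposes a mid-level layer under a hypothetical name `gen.deconv3` and that the “window” filter indices were already identified (the paper finds them with a logistic-regression probe on the feature activations):

```python
import torch

window_filter_idx = [12, 47, 80]  # hypothetical indices of "window" filters

def drop_windows(module, inputs, output):
    out = output.clone()
    out[:, window_filter_idx] = 0.0  # silence those feature maps
    return out

handle = gen.deconv3.register_forward_hook(drop_windows)  # hypothetical layer
with torch.no_grad():
    edited = gen(z)  # same z as the unmodified samples, windows dropped
handle.remove()
```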

6.3.2 Vector arithmetic on face samples

Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are averaged. Arithmetic is then performed on the mean vectors, creating a new vector Y. The center sample on the right-hand side is produced by feeding Y as input to the generator. To demonstrate the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples) results in noisy overlap due to misalignment.

Figure 8: A “turn” vector was created from four averaged samples of faces looking left vs. looking right. By adding interpolations along this axis to random samples, they were able to reliably transform the pose.
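
Both figures follow the same recipe: average a few Z vectors per concept, do arithmetic on the means, and decode. A hedged sketch, assuming a hypothetical `z_bank` mapping concept names to the hand-picked sample vectors:

```python
import torch

# smiling woman - neutral woman + neutral man ≈ smiling man (Figure 7)
y = (z_bank["smiling woman"].mean(dim=0)
     - z_bank["neutral woman"].mean(dim=0)
     + z_bank["neutral man"].mean(dim=0))

# Uniform noise with scale ±0.25 around Y yields the 8 nearby samples.
noise = (torch.rand(8, y.numel()) - 0.5) * 0.5
grid = torch.cat([y.unsqueeze(0), y + noise], dim=0)

with torch.no_grad():
    samples = gen(grid)  # center sample plus its 8 neighbors
```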

7 Conclusion and Future Work

“We propose a more stable set of architectures for training generative adversarial networks and we give evidence that adversarial networks learn good representations of images for supervised learning and generative modeling.”

Remaining forms of model instability: as models are trained longer, they sometimes collapse a subset of filters to a single oscillating mode.

  1. Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014. 

  2. Coates, Adam, and Andrew Y. Ng. “Learning feature representations with k-means.” Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 561-580. 

  3. Vincent, Pascal, et al. “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.” Journal of machine learning research 11.Dec (2010): 3371-3408. 

  4. Zhao, Junbo, et al. “Stacked what-where auto-encoders.” arXiv preprint arXiv:1506.02351 (2015). 

  5. Rasmus, Antti, et al. “Semi-supervised learning with ladder networks.” Advances in neural information processing systems. 2015. 

  6. Lee, Honglak, et al. “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations.” Proceedings of the 26th annual international conference on machine learning. ACM, 2009. 

  7. Springenberg, Jost Tobias, et al. “Striving for simplicity: The all convolutional net.” arXiv preprint arXiv:1412.6806 (2014). 

  8. Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015). 

  9. Mordvintsev, Alexander, Christopher Olah, and Mike Tyka. “Inceptionism: Going deeper into neural networks.” Google Research Blog (2015). 

  10. Yu, Fisher, et al. “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop.” arXiv preprint arXiv:1506.03365 (2015). 

  11. Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014). 

Written on November 19, 2019