Sunday 14 July 2019

Generative Adversarial Networks - Part VI

This is Part 6 of a short series of posts introducing and building generative adversarial networks, known as GANs.

In this post we will develop a system for testing a GAN using controllable synthetic data. Too often GANs are tested against datasets which are very varied and this makes assessing the GAN very difficult.

We'll also do some experiments with some of the many GAN design options to see if they help or hinder. Using controlled and simpler synthetic image data makes this assessment easier.

Output from a conditioned GAN learning four classes of synthetic image.

Previously:
  • Part 1 introduced the idea of adversarial learning and started to build the machinery of a GAN implementation.
  • Part 2 extended our code to learn a simple 1-dimensional pattern 1010.
  • Part 3 developed our code to learn to generate 2-dimensional grey-scale images that look like handwritten digits.
  • Part 4 extended our code to learn full colour faces, and also developed convolutional networks to encourage learning localised image features.
  • Part 5 developed a conditional GAN that can be trained to output images of a desired class.


Synthetic Training Data

If we want to refine a GAN architecture and choose hyper-parameter options we need a more robust way of testing whether these design choices are having a positive effect. That means being able to control the input to the GANs so that we can draw conclusions about the outputs.

And that means we need a way of generating image training data sets where the images only contain the shapes, colours, patterns and arrangements that we want. 

This level of control means we could, for example, test if a GAN architecture is as good at learning lines as circles.


OpenCV To Draw Into Arrays

We could write our own code to colour the pixels in a numpy array or pytorch tensor to create shapes, but that is hard work. Libraries exist for drawing shapes into numerical arrays:
  • The pycairo bindings to the modern Cairo 2D graphics system generate high quality images. Sadly the dependencies are a little too complex for this experiment.
  • The computer vision opencv toolkits contains drawing functions which are simple enough to use. 

Let's look at an educational example. First we import the modules we'll need: numpy to provide the arrays, matplotlib to draw them as images, and cv2, the opencv bindings for python.


import numpy
import matplotlib.pyplot as plt

import cv2


Let's now create an empty image and draw it:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8)

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


The first line of code creates a 128 by 128 numpy array of zeros. The additional numpy.uint8 tells numpy to use an integer type for the numbers, not the normal floating point numbers. Most image formats have pixels with integer values, usually in the range 0-255.

The next line draws the array using a grayscale colour palette. It needs to be told which palette to use because the default one applies colours even to grayscale data. The vmin and vmax tell matplotlib the full scale of greys; otherwise it would use the minimum and maximum of whatever is in the array, which might not be a wide range.

The code couldn't be much simpler! Here is the result.


Easy!

We can change the pixels to be white by adding 255 to the array of zeros:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8) + 255

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


And the result is a white square.


Drawing a line is pretty self-explanatory:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8) + 255

# draw line
cv2.line(img, (30, 30), (90, 90), 0, 2, cv2.LINE_AA)

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


The cv2.line draws a line from (30, 30) to (90, 90) using a colour value 0 (black) with width 2. The additional option cv2.LINE_AA asks for anti-aliased smoothing to avoid jagged edges. Here's the result.


Let's draw a filled circle, and an unfilled circle outline:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8) + 255

# draw unfilled circle
cv2.circle(img, (50,50), 20, 0, 2, cv2.LINE_AA)

# draw filled circle
cv2.circle(img, (80, 80), 20, 0, -1, cv2.LINE_AA)

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


The first cv2.circle draws a circle centred at (50, 50) with radius 20, a line colour of 0 (black) and width 2. The next cv2.circle draws a circle at (80, 80) but this one is filled as indicated by the line width of -1. Here's the result.


Let's now introduce colour. Colours are created by mixing red, green and blue light. These RGB values are between 0 and 255. The following code draws a red line and a blue circle.


# start with blank image
img = numpy.zeros((128, 128, 3), numpy.uint8) + 255

# draw red line
cv2.line(img, (30, 30), (90, 90), (255, 0, 0), 2, cv2.LINE_AA)

# draw filled circle
cv2.circle(img, (80, 80), 20, (0, 0, 255), -1, cv2.LINE_AA)

# show image
plt.imshow(img)


The numpy array is now created with an additional third dimension of depth 3, one for each of the RGB values. For example, at img[10, 10] there will be an element with 3 values. This time the colours are coded as tuples of 3 values, for example (255, 0, 0) is red. Note we don't use a grayscale palette, and vmin/vmax aren't used for colour images.

Here's the result.



Learning Monochrome Circles and Lines

Following the philosophy of starting small and simple, let's see if we can train a GAN to learn from a dataset of synthetic images which only contain circles and lines. For now, we'll also avoid colour.

The synthetic image generator draws one of two kinds of images:
  • happy - in a 64 by 64 square, five unfilled circles of radius 5 are randomly drawn with centres between positions 10 and 54 horizontally and vertically.
  • angry - in a 64 by 64 square, five lines of width 1 are drawn from random starting positions between 10 and 24 horizontally and vertically, each ending 20 pixels along and 20 pixels down.


They're called happy and angry, because later we will try to condition a GAN with such emotion labels.
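
As a concrete illustration, here is a minimal sketch of such a generator using the opencv drawing calls from earlier. The function name generate_image and the exact details are illustrative assumptions; the real notebook may differ.


import numpy
import random
import cv2

def generate_image(label):
    # white 64 x 64 grayscale canvas
    img = numpy.zeros((64, 64), numpy.uint8) + 255
    if label == 'happy':
        # five unfilled circles of radius 5, centres between 10 and 54
        for _ in range(5):
            cx, cy = random.randint(10, 54), random.randint(10, 54)
            cv2.circle(img, (cx, cy), 5, 0, 1, cv2.LINE_AA)
    else:
        # five lines of width 1, starting between 10 and 24, ending 20 along and down
        for _ in range(5):
            x, y = random.randint(10, 24), random.randint(10, 24)
            cv2.line(img, (x, y), (x + 20, y + 20), 0, 1, cv2.LINE_AA)
    return img
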

The following are six examples of happy images. You can see the circles are placed randomly but don't overlap the outer edge. They do sometimes overlap each other.


The following are six examples of angry images. They look remarkably like pencil strokes.


Let's see if we can train a simple GAN made of fully-connected (dense) layers to learn to generate these images.

The following shows the results of two separate experiments after three epochs of training. The first experiment only used the happy circles in the training data, and the second only used the angry lines. For simplicity, we're not creating a mixed training data set yet.


We can see the GAN has learned to draw the diagonal lines fairly well, and without mode collapse either. However it hasn't learned to draw the unfilled circles well at all. If we think again about how GANs learn the probability distribution of the training dataset, unfilled circles are harder to learn than filled shapes or simple lines because their probability distribution is more complex.

A solution to this might be to make our GAN architecture larger, with more layers, and train for much longer.

An alternative is to switch to using convolution layers because these learn localised image features, features which when put together form objects like faces, lines or circles.

Let's try using a convolutional GAN instead. The GAN follows the same architecture that we developed in part 4 of this series but with smaller 64 x 64 images.


This time the results are better but not ideal. If we look closely the convolutional GAN has learned to draw both lines and circles. The circles aren't perfect but they are circular shapes with an unfilled centre. This is pretty impressive if we consider that the circles are constructed from much smaller learned features.

There also appears to be no mode collapse.

However the images appear to be plagued by a checkerboard / striped pattern. Experimenting with adjusting the size of the convolution filters, the strides and padding didn't remove this undesirable pattern.

This led to a reworking of the GAN architecture to use only convolution layers and no fully-connected layer at the start of the generator or at the end of the discriminator. This follows the well-known, and often copied, DCGAN architecture. You can read the seminal paper here (link).

The discriminator is built as follows:


# conv layers
self.model = nn.Sequential(            
        
    nn.Conv2d(1, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
    
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
            
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
            
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
            
    nn.Conv2d(256, 1, kernel_size=4, stride=2, bias=False),
    nn.LeakyReLU(0.2),
    # no final batch norm
     
    View((1,1)),
    nn.Sigmoid()            
)


That specific combination of convolution kernel sizes, strides and padding takes a 64 x 64 image and squishes it down to a single 1 x 1 value, which is perfect for the discriminator, whose output is a single number.
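
As a quick sanity check of that arithmetic, we can push a dummy tensor through the same sequence of convolutions and watch the spatial size shrink. This is a standalone sketch of the shape calculation, separate from the discriminator class itself.


import torch
import torch.nn as nn

# out = floor((in + 2*padding - kernel) / stride) + 1, so 64 -> 32 -> 16 -> 8 -> 4 -> 1
x = torch.zeros(1, 1, 64, 64)
for (in_c, out_c, pad) in [(1, 256, 1), (256, 256, 1), (256, 256, 1), (256, 256, 1), (256, 1, 0)]:
    x = nn.Conv2d(in_c, out_c, kernel_size=4, stride=2, padding=pad, bias=False)(x)
    print(x.shape)

# the final shape is (1, 1, 1, 1), a single value per image
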

As before, we can check that we have some confidence in the discriminator by training it to tell apart random noise from real images, which in this case are a random mix of the generated happy circle and angry line images. The following shows the loss falling during a short run of 2000 training iterations.


It is interesting to test this trained discriminator separately against happy circle and angry line images.


We can see that the discriminator correctly distinguishes random noise images from happy and angry images. There doesn't appear to be a significant difference between the two kinds of synthetic images.

The generator is built as follows:


# define neural network layers
self.model = nn.Sequential(
    # reshape seed to tensor
    View((1, 100, 1, 1)),

    # reshape tensor to 256 filters
    nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),

    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),

    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),

    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),

    nn.ConvTranspose2d(256, 1, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(1),

    View((1, 1, 64, 64)),
    nn.Tanh()
)


There is no fully connected layer at the start to map a random seed vector into a starting square image. Instead, the 100-length vector is considered a 1 x 1 image with 100 channels of depth.
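
The equivalent check for the generator shows that same 1 x 1 "image" growing back to 64 x 64 through the transposed convolutions. Again, this is only a sketch of the shape arithmetic, not the generator class itself.


import torch
import torch.nn as nn

# out = (in - 1)*stride - 2*padding + kernel, so 1 -> 4 -> 8 -> 16 -> 32 -> 64
z = torch.zeros(1, 100).view(1, 100, 1, 1)
z = nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, bias=False)(z)                 # 4 x 4
for _ in range(3):
    z = nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False)(z)  # 8, 16, 32
z = nn.ConvTranspose2d(256, 1, kernel_size=4, stride=2, padding=1, bias=False)(z)        # 64 x 64
print(z.shape)   # torch.Size([1, 1, 64, 64])
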

The following shows the output of an untrained generator, initialised with weights from a normal distribution, as is often recommended for many kinds of neural network, not just GANs.


It is encouraging that even before training there is no evident striping or checkerboard patterning.

Let's see the results from this fully convolutional architecture. The following shows the images generated after 10000, 20000 and 30000 training iterations. Again, there are separate experiments for the happy circle and angry line datasets.


This time the images are much clearer, with no checkerboard or striping pattern. There is also no mode collapse. The images do seem to be getting progressively cleaner, but 30000 training iterations is fairly small, so for further experiments we'll extend the training.

The code for this experiment is online:



Learning Circles And Lines Together

Let's see what happens if we train our GAN on a synthetic data set which has both instances of the happy circles and the angry lines.

The following shows six samples of what the generator creates after training epochs 1 to 16, each of 10000 training iterations.


Initially the outputs are messy, but they soon become very clear. For example, after epochs 14, 15 and 16 the generator draws very nice clear circles. However, we can also see that the training is unstable, because epochs 9 and 13 produce very messy results.

Another observation is that we are suffering from mode collapse. At each stage of learning, the generator is only producing one of the two kinds of images - happy circles or angry lines.

The literature suggests that, if possible, we should group together examples of each class when training so the training sees consecutive samples of the same kind.
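
A minimal sketch of what that grouping might look like, reusing the illustrative generate_image function from earlier; the group size of 10 matches the experiment below.


def grouped_training_data(group_size=10):
    # yield consecutive runs of the same class: 10 happy images, then 10 angry, and so on
    while True:
        for label in ('happy', 'angry'):
            for _ in range(group_size):
                yield generate_image(label), label
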

The following shows what the generator creates after epochs 1 to 16, but this time the training schedule groups together samples of 10 images of the same kind, happy circles or angry lines. Each epoch still has 10000 iterations.


This time the learning is cleaner. There are no periods of large instability, and the shapes become coherent very quickly. There appears to be a transitionary phase at epochs 10 and 11. At epoch 10 we have our only example of diversity, with both lines and circles being generated. The rest of the examples are effectively mode collapsed.


This time we see some incoherence in the early stages of training, but at later epochs we do see some cases without mode collapse, at epochs 13, 15 and 16 for example.

The improvements are not clear cut, but there is evidence of some improvement from this moderately scientific test.

Batch Normalisation Before or After Activation

Another choice is between having the batch normalisation before or after the activation function.
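
To make the two options concrete, here is the same discriminator block written both ways; only the ordering of the layers changes.


import torch.nn as nn

# ordering used so far: activation first, then batch normalisation
block_activation_then_norm = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
)

# alternative tested here: batch normalisation before the activation
block_norm_then_activation = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
)
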

The following shows the same experiment with class grouping of 10, but with batch normalisation taking place before the activation function, LeakyReLU.


The difference isn't drastic. The coherence of the shapes is worse, with examples of unusual shapes being rendered throughout the training epochs, but the effect isn't severe.

This isn't a very scientific test, but it is enough for us to stick with the logic of normalisation being applied before a signal is fed into the next layer of a network.

In this case, normalisation before a LeakyReLU means half the signals are mostly lost, as they fall below zero.

Introducing Colour

Let's now introduce colour to the data sets and extend our code to handle the additional dimensions.

The following shows examples of the happy circles, now coloured a sky blue, or RGB (95, 114, 231).


The following shows the angry lines, now coloured a crimson red, or RGB (169, 20, 54).


Let's see if the addition of a colour dimension changes how the GAN learns.

The change to the discriminator is trivial: the previous incoming channel depth of 1 is changed to 3 for the first convolution layer, and that's it.


    nn.Conv2d(3, 256, kernel_size=4, stride=2, padding=1, bias=False),


The change to the generator is similarly trivial. The final deconvolution reduces the channels from 256 to 3, not 1 as before.


nn.ConvTranspose2d(256, 3, kernel_size=4, stride=2, padding=1, bias=False),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(3),
            
View((1, 3, 64, 64)),
nn.Tanh()



Saving Images During Training

We don't have to manually run the training for 16 epochs and generate a sample of six images from the generator after each epoch, saving them to an image file. We can automate this in the main training loop.

The following code saves the images using the current time as a fairly unique filename. It does require a from google.colab import files at the top of our notebook.


# plot 6-image samples, save to file
# needs: import datetime, time (and from google.colab import files, as noted above)
f, axarr = plt.subplots(2,3, figsize=(16,8))
for i in range(2):
  for j in range(3):
    # generate an image from the trained generator G using a random 100-length seed
    img = G.forward(generate_random(100)).permute(0,2,3,1).view(64,64,3).detach().cpu().numpy()
    # rescale the generator's tanh output from (-1, 1) to (0, 1) for plotting
    img = (img + 1.0)/2.0
    axarr[i,j].imshow(img, interpolation='none', cmap='gray', vmin=0, vmax=1)
    pass
  pass

# use the current time as a fairly unique filename
t = datetime.datetime.fromtimestamp(time.time())
fname = "img_%d_%s.png" % (epoch+1, t.strftime('%H-%M-%S'))
plt.savefig(fname)
files.download(fname)
plt.close(f)


The automatic file downloads are very convenient, but very occasionally the code fails with an error.


Learning Colour

Since grouping the classes into batches of 100 seemed to work well, we'll do the same for this colour experiment.


Compared to the monochrome experiment, this one seems to suffer more strongly from mode collapse. Not only are almost all the images angry red lines, but within an epoch of training there is no variation of the pattern itself.

What is it about the addition of colour that has done this? If we swap the colours so the circles are red and the lines are blue, that might reveal an insight. Here are the results.


This time there is more switching between the circles and lines, although at any particular epoch there is mode collapse.

One possible explanation is that the red lines are denser than the circles, making them easier to learn, and that the crimson red is darker, contrasting more strongly with the white background than the lighter sky blue does.

So a useful experiment is to pick red as RGB (255, 0, 0) and blue as (0, 0, 255). This way both have the same numerical intensity, so we should get results similar to the monochrome experiments. Let's see.


The results aren't conclusively better or different from the previous experiments.

Let's go back to our monochrome experiments and try other ideas.


More Experiments - ReLU and LeakyReLU

One difference between our implementation and common implementations of DCGAN is that they use a ReLU in the generator and a LeakyReLU in the discriminator. We've used LeakyReLUs for both.

The rationale for a LeakyReLU is to avoid zero gradients so that update information can flow backwards through the discriminator to the generator. Why this logic isn't applied within the generator isn't clear.
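
Expressed as code, the change being tested only touches the generator's activations; a sketch of one block each way (the discriminator keeps LeakyReLU in both cases):


import torch.nn as nn

# what we have used so far in the generator
leaky_block = nn.Sequential(
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
)

# the DCGAN recommendation: plain ReLU in the generator
relu_block = nn.Sequential(
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.ReLU(),
    nn.BatchNorm2d(256),
)
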

Let's see if this recommendation makes a difference.


It looks like the non-leaky ReLU activations in the generator cause training to fail.


More Experiments - Default (Xavier) Initialisation

Our code has so far been initialising the GAN network weights from a gaussian distribution. This is a very common recommendation and has some logic to it.
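
For reference, the gaussian initialisation has been along the following lines. The standard deviation of 0.02 is the common DCGAN choice and is an assumption here; the helper name initialise_weights is also illustrative.


import torch.nn as nn

def initialise_weights(module):
    # draw convolution and transposed convolution weights from a gaussian distribution
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(module.weight.data, mean=0.0, std=0.02)

# applied to both networks after construction, for example:
# D.apply(initialise_weights)
# G.apply(initialise_weights)
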

However, an alternative recommendation from a Stanford lecture on training neural networks suggests that a Xavier initialisation is better for deeper networks.


The following link should take you to the relevant section starting at 37 minutes.

Pytorch initialises weights using recent recommendations similar to Xavier (link). Let's see the results of not initialising the weights ourselves.


Compared to the equivalent monochrome experiment, the results seem cleaner, that is, there is less instability visible. The mode collapse isn't as severe in that at each epoch there is variety in spatial arrangement if not of the patterns themselves.

Let's see the effect when we include colour.


Compared to the experiment with weights initialised to a gaussian distribution, these results are marginally cleaner. Strictly speaking, we should be cautious about making a general claim on the basis of a small number of experiments. What we can say is that the difference isn't huge, the learning didn't destabilise, nor did it solve the mode collapse issue.

The code for this experiment is online:


In real world images, colours are rarely pure. Regions of red, for example, consist of different but similar reds. We can simulate this by adding a small random variation to the RGB values for both the red and the blue colours:


# r, g, b are the base colour values; add a small random variation to each
# (requires: import random)
r2 = r + random.randint(-50,50)
g2 = g + random.randint(-50,50)
b2 = b + random.randint(-50,50)


The following two experiments compare how well our GAN learns pure colours against the more realistic randomly varied colour.

Pure colour:


Varied colour:


We can see there is no significant difference. The GAN learns both pure and varied colours.

Varying colours in this way doesn't help solve the mode-collapse issue.


Conditional Learning of More Classes

Let's extend our GAN to be a conditional GAN, that is, one which learns to associate particular classes of image with a label. We covered this in part 5 of this series.

Let's extend the synthetic image code to create not just happy blue circles and angry red lines, but also sad green waves and joyful yellow stars.

As before, let's also add some mild randomness to the colours so that all the blues, reds, greens and yellows have a small random variation added to their RGB values.

The following shows examples of the synthetic images for each class.


You can see the colours of each shape now vary around a base colour. You can see the code which draws the waves and stars here: python notebook. Opencv doesn't provide bezier or similar curves, so the waves are made of points along a sine wave. The stars are simply lines out from a centre.
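
As an illustration of the sine-wave approach, a wave can be drawn by sampling points along a sine curve and joining them with cv2.polylines; the exact amplitude and colour here are illustrative, and the real notebook linked above may differ.


import numpy
import cv2

# white 64 x 64 colour canvas
img = numpy.zeros((64, 64, 3), numpy.uint8) + 255

# sample points along a sine curve and join them into an anti-aliased polyline
xs = numpy.arange(5, 60)
ys = 32 + 10 * numpy.sin(xs / 5.0)
points = numpy.stack([xs, ys], axis=-1).astype(numpy.int32).reshape(-1, 1, 2)
cv2.polylines(img, [points], False, (0, 128, 0), 1, cv2.LINE_AA)
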

Previously we simply concatenated the label tensor to the noise seed for the generator, and the image tensor for the discriminator. For our fully convolutional GAN, we can't concatenate this label tensor so easily. Some work has been done to explore other methods of combining the label tensor with GAN inputs, here for example.

Here I have instead added the label tensor to the inputs, repeated (tiled) to match the size of the noise or image tensor.

The following shows how this is done for the generator, in its forward() function.


def forward(self, noise_tensor, condition_tensor):
    # add condition tensor to noise tensor
    conditioned_noise_tensor = noise_tensor + condition_tensor.repeat(1,25)
    return self.model(conditioned_noise_tensor)


The noise tensor is of size (1, 100) and the condition tensor is of size (1, 4). This is why the condition tensor is repeated 25 times along the second dimension, to create a tensor that matches the noise tensor size of (1, 100).

The following shows how this is similarly done for the discriminator.


def forward(self, image_tensor, target_tensor):
    # add condition tensor to image tensor
    conditioned_image_tensor = image_tensor + target_tensor.repeat(1, 3, 64, 16)
    return self.model(conditioned_image_tensor)
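

A quick check of the tiling shapes used in both forward functions; the one-hot label values here are just an example.


import torch

label = torch.FloatTensor([[1.0, 0.0, 0.0, 0.0]])    # size (1, 4)
print(label.repeat(1, 25).shape)                      # torch.Size([1, 100]), matches the noise seed
print(label.repeat(1, 3, 64, 16).shape)               # torch.Size([1, 3, 64, 64]), matches the image tensor
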


The following shows the results of the GAN learning this synthetic dataset, now with four classes.


Success! The conditional GAN has successfully learned to generate each of the four classes on demand. If there is a capacity limit, we haven't hit it yet.

As an additional experiment, let's try adding gaussian noise to the entire synthetic image after the shapes have been applied:


# add random noise
img = img + numpy.random.normal(0, 1, (64, 64,3))*10
img = numpy.clip(img, 0.0, 255.0)


This causes a mild random texture to be applied to the whole image:


The idea is to see if this additional random noise makes GAN learning easier by smoothing the probability distributions of the source images, and by helping gradient flow, an effect that has been observed in others' work.

Here are the GAN's outputs after being trained on the images with added noise:


Although this isn't a statistically robust experiment, this single example does show an improvement in the training. There is no instability like that seen at epoch 5 of the previous experiment, and the shapes seem to be learned much more quickly, with a smoother GAN output too.

This effect may be beneficial to sparse images like these, and less beneficial to already busy images in diverse datasets where there is a lot of variation in the probability distributions being learned by the GAN.

The mode collapse issue is still not improved.

So far we have seen some GAN design choices which improve the stability of learning and quality of outputs:
  • mild gaussian noise added to sparse images seems to improve training speed and stability
  • layer and batch normalisation between network layers
  • collating training data into groups of the same class
  • Xavier or similar size-sensitive weight initialisation, not naive gaussian initialisation
  • LeakyReLU, not basic ReLU

The code for combining all the ideas up to this point is online:




Mode Collapse Challenge

We have almost all the elements working. We have both dense and fully convolutional GANs working well. We have experimented with architectural choices and found those that can help, such as adding noise to the source datasets. We have an initial synthetic image data tool which allows us to test our GANs more easily. And in enough cases we have successful GAN training.

The only issue that seems to persist is mode collapse. This will be the focus for the next phase of work.


References