Sunday 14 July 2019

Generative Adversarial Networks - Part VI

This is Part 6 of a short series of posts introducing and building generative adversarial networks, known as GANs.

In this post we will develop a system for testing a GAN using controllable synthetic data. Too often GANs are tested against datasets which are very varied and this makes assessing the GAN very difficult.

We'll also do some experiments with some of the many GAN design options to see if they help or hinder. Using controlled and simpler synthetic image data makes this assessment easier.

Output from a conditioned GAN learning four classes of synthetic image.

Previously:
  • Part 1 introduced the idea of adversarial learning and we started to build the machinery of a GAN implementation.
  • Part 2 we extended our code to learn a simple 1-dimensional pattern 1010.
  • Part 3 we developed our code to learn to generate 2-dimensional grey-scale images that look like handwritten digits
  • Part 4 extended our code to learn full colour faces, and also developed convolutional networks to encourage learning localised image features
  • Part 5 developed a conditional GAN that can be trained to output images of a desired class.


Synthetic Training Data

If we want to refine a GAN architecture and choose hyper-parameter options we need a more robust way of testing that these design choices are having a positive effect. That means being able to control the input to the GANs so that we can draw conclusions about the outputs.

And that means we need a way of generating image training data sets where the images only contain the shapes, colours, patterns and arrangements that we want. 

This level of control means we could, for example, test if a GAN architecture is as good at learning lines as circles.


OpenCV To Draw Into Arrays

We could write our own code to colour the pixels in a numpy array or pytorch tensor to create shapes, but that is hard work. Libraries exist for drawing shapes into numerical arrays:
  • The pycairo bindings to the modern Cairo 2D graphics system generate high quality images. Sadly the dependencies are a little too complex for this experiment.
  • The opencv computer vision toolkit contains drawing functions which are simple enough to use.

Let's look at an educational example. First we import the modules we'll need: numpy to provide the arrays, matplotlib to draw them as images, and cv2, which is the opencv binding for python.


import numpy
import matplotlib.pyplot as plt

import cv2


Let's now create an empty image and draw it:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8)

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


The first line of code creates a 128 by 128 numpy array of zeros. The additional numpy.uint8 tells numpy to use an integer type for the numbers, not the normal floating point numbers. Most image formats have pixels with integer values, usually in the range 0-255.

The next line draws the array using a grayscale colour palette. It needs to be told which colour palette to use because the default one uses colours even if we're only using grayscale colours. The vmin and vmax tell matplotlib the full scale of greys, otherwise it will use minimum and maximum of whatever is in the array, which might not be a wide range.

The code couldn't be much simpler! Here is the result.


Easy!

We can change the pixels to be white by adding 255 to the array of zeros:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8) + 255

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


And the result is a white square.


Drawing a line is pretty self-explanatory:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8) + 255

# draw line
cv2.line(img, (30, 30), (90, 90), 0, 2, cv2.LINE_AA)

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


The cv2.line draws a line from (30, 30) to (90, 90) using a colour value 0 (black) with width 2. The additional option cv2.LINE_AA asks for anti-aliased smoothing to avoid jagged edges. Here's the result.


Let's draw a filled circle, and an unfilled circle outline:


# start with blank image
img = numpy.zeros((128, 128), numpy.uint8) + 255

# draw unfilled circle
cv2.circle(img, (50,50), 20, 0, 2, cv2.LINE_AA)

# draw filled circle
cv2.circle(img, (80, 80), 20, 0, -1, cv2.LINE_AA)

# show image
plt.imshow(img, cmap='gray', vmin=0, vmax=255)


The first cv2.circle draws a circle centred at (50, 50) with radius 20, a line colour of 0 (black) and width 2. The next cv2.circle draws a circle at (80, 80) but this one is filled as indicated by the line width of -1. Here's the result.


Let's now introduce colour. Colours are created by mixing red, green and blue light. These RGB values are between 0 and 255. The following code draws a red line and a blue circle.


# start with blank image
img = numpy.zeros((128, 128, 3), numpy.uint8) + 255

# draw red line
cv2.line(img, (30, 30), (90, 90), (255, 0, 0), 2, cv2.LINE_AA)

# draw filled circle
cv2.circle(img, (80, 80), 20, (0, 0, 255), -1, cv2.LINE_AA)

# show image
plt.imshow(img)


The numpy array is now created with an additional third dimension of depth 3, one for each of the RGB values. For example, at img[10, 10] there will be an element with 3 values. This time the colours are coded as tuples of 3 values, for example (255, 0, 0) is red. Note we don't use a grayscale palette, and vmin/vmax aren't used for colour images.
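A quick check of the array shape can make the extra dimension concrete. This is a minimal sketch using only numpy; the pixel position (10, 10) is the one mentioned above, and nothing here is drawn to the screen.

```python
import numpy

# white colour image: 128 rows, 128 columns, 3 channels (R, G, B)
img = numpy.zeros((128, 128, 3), numpy.uint8) + 255
print(img.shape)  # (128, 128, 3)

# every pixel position holds 3 values, one per colour channel
print(img[10, 10].shape)  # (3,)

# set that single pixel to pure red
img[10, 10] = (255, 0, 0)
red_pixel = tuple(int(v) for v in img[10, 10])
print(red_pixel)  # (255, 0, 0)
```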

Here's the result.



Learning Monochrome Circles and Lines

Following the philosophy of starting small and simple, let's see if we can train a GAN to learn from a dataset of synthetic images which only contain circles and lines. For now, we'll also avoid colour.

The synthetic image generator draws one of two kinds of images:
  • happy - in a 64 by 64 square, five unfilled circles of radius 5 are randomly drawn with centres between positions 10 and 54 horizontally and vertically.
  • angry - in a 64 by 64 square, five lines of width 1 are drawn from a random position starting between 10 and 24 horizontally and vertically, and ending 20 along and down.


They're called happy and angry, because later we will try to condition a GAN with such emotion labels.

The following are six examples of happy images. You can see the circles are placed randomly but don't overlap the outer edge. They do sometimes overlap each other.


The following are six examples of angry images. They look remarkably like pencil strokes.


Let's see if we can train a simple GAN made of fully-connected (dense) layers to learn to generate these images.

The following shows the results of two separate experiments after three epochs of training. The first experiment only used the happy circles in the training data, and the second only used the angry lines. For simplicity, we're not creating a mixed training data set yet.


We can see the GAN has learned to draw the diagonal lines fairly well, and without mode collapse either. However it hasn't learned to draw the unfilled circles well at all. If we think again about how GANs learn the probability distribution of the training dataset, learning filled circles is easier than learning unfilled circles because the probability distribution of the unfilled circles is more complex.

A solution to this might be to make our GAN architecture larger, with more layers, and train for much longer.

An alternative is to switch to using convolution layers because these learn localised image features, features which when put together form objects like faces, lines or circles.

Let's try using a convolutional GAN instead. The GAN follows the same architecture that we developed in part 4 of this series but with smaller 64 x 64 images.


This time the results are better but not ideal. If we look closely the convolutional GAN has learned to draw both lines and circles. The circles aren't perfect but they are circular shapes with an unfilled centre. This is pretty impressive if we consider that the circles are constructed from much smaller learned features.

There also appears to be no mode collapse.

However the images appear to be plagued by a checkerboard / striped pattern. Experimenting with adjusting the size of the convolution filters, the strides and padding didn't remove this undesirable pattern.

This led to a reworking of the GAN architecture to use only convolution layers and no fully-connected layer at the start of the generator or at the end of the discriminator. This follows the well-known, and often copied, DCGAN architecture. You can read the seminal paper here (link).

The discriminator is built as follows:


# conv layers
self.model = nn.Sequential(            
        
    nn.Conv2d(1, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
    
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
            
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
            
    nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.LeakyReLU(0.2),
    nn.BatchNorm2d(256),
            
    nn.Conv2d(256, 1, kernel_size=4, stride=2, bias=False),
    nn.LeakyReLU(0.2),
    # no final batch norm
     
    View((1,1)),
    nn.Sigmoid()            
)


That specific combination of convolution kernel sizes, strides and padding will take a 64 x 64 image and squish it down to a 1 x 1 number, which is perfect for the output of the discriminator which is a single number.
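The size arithmetic can be checked by hand with the standard convolution output-size formula. The sketch below is plain Python and only mirrors the kernel sizes, strides and padding listed in the discriminator above; the helper name conv_out is made up for this illustration.

```python
def conv_out(size, kernel, stride, padding=0):
    # standard output size of a convolution along one dimension
    return (size + 2 * padding - kernel) // stride + 1

size = 64
sizes = [size]
# four layers with padding=1, then a final layer with no padding
for padding in (1, 1, 1, 1, 0):
    size = conv_out(size, kernel=4, stride=2, padding=padding)
    sizes.append(size)

print(sizes)  # [64, 32, 16, 8, 4, 1]
```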

As before we can check that we have some confidence in the discriminator by training it to tell apart random noise from real images, which in this case are a random mix of generated happy circle and angry line images. The following shows the loss falling during a short run of 2000 training iterations.


It is interesting to test this trained discriminator separately against happy circle and angry line images.


We can see that the discriminator correctly distinguishes random noise images from happy and angry images. There doesn't appear to be a significant difference between the two kinds of synthetic images.

The generator is built as follows:


# define neural network layers
self.model = nn.Sequential(
    # reshape seed to tensor
    View((1, 100, 1, 1)),
            
     # reshape tensor to 256 filters
     nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, bias=False),
     nn.LeakyReLU(0.2),
     nn.BatchNorm2d(256),
            
     nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
     nn.LeakyReLU(0.2),
     nn.BatchNorm2d(256),
            
     nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
     nn.LeakyReLU(0.2),
     nn.BatchNorm2d(256),
            
     nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1, bias=False),
     nn.LeakyReLU(0.2),
     nn.BatchNorm2d(256),
            
     nn.ConvTranspose2d(256, 1, kernel_size=4, stride=2, padding=1, bias=False),
     nn.LeakyReLU(0.2),
     nn.BatchNorm2d(1),
            
     View((1, 1, 64, 64)),
     nn.Tanh()
)


There is no fully connected layer at the start to map a random seed vector into a starting square image. Instead, the 100-length vector is considered a 1 x 1 image with 100 channels of depth.
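The way the 1 x 1 seed grows to a 64 x 64 image can be checked with the output-size formula for transposed convolutions (without output padding or dilation). This is a plain Python sketch that mirrors the generator layers above; the helper name is made up for this illustration.

```python
def conv_transpose_out(size, kernel, stride, padding=0):
    # output size of a transposed convolution along one dimension
    return (size - 1) * stride - 2 * padding + kernel

size = 1
sizes = [size]
# (stride, padding) for each of the five ConvTranspose2d layers
for stride, padding in [(1, 0), (2, 1), (2, 1), (2, 1), (2, 1)]:
    size = conv_transpose_out(size, kernel=4, stride=stride, padding=padding)
    sizes.append(size)

print(sizes)  # [1, 4, 8, 16, 32, 64]
```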

The following shows the output of an untrained generator, initialised with weights from a normal distribution, as is often recommended for many kinds of neural network, not just GANs.


It is encouraging that even before training there is no evident strong striping or checkerboard patterning.
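One way to initialise convolution weights from a normal distribution in pytorch is sketched below. The function name and the standard deviation of 0.02 are assumptions for illustration (0.02 is the value given in the DCGAN paper); the post does not state which value was used.

```python
import torch
import torch.nn as nn

def init_weights_normal(module, std=0.02):
    # std=0.02 is an assumed value, not taken from this post
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(module.weight, mean=0.0, std=std)

# a cut-down stand-in for the first generator layer
model = nn.Sequential(
    nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, bias=False),
    nn.LeakyReLU(0.2),
)

# apply() visits every sub-module, so each conv layer is re-initialised
model.apply(init_weights_normal)
weight_std = model[0].weight.std().item()
```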

Let's see the results from this fully convolutional architecture. The following shows the images generated after 10000, 20000 and 30000 training iterations. Again, there are separate experiments for the happy circle and angry line datasets.


This time the images are much clearer, with no checkerboard or striping pattern. There is also no mode collapse. The images do seem to be getting cleaner progressively, but 30000 training iterations is fairly small so for further experiments we'll extend the training.

The code for this experiment is online:



Learning Circles And Lines Together

Let's see what happens if we train our GAN on a synthetic data set which has both instances of the happy circles and the angry lines.

The following shows six samples of what the generator creates after training epochs 1 to 16, each of 10000 training iterations.


Initially the outputs are messy, but they soon become very clear. For example, after epochs 14, 15 and 16 the generator draws very nice clear circles. However, we can also see that the training is unstable, because epochs 9 and 13 produce very messy results.

Another observation is that we are suffering from mode collapse. At each stage of learning, the generator is only producing one of the two kinds of images - happy circles or angry lines.

The literature suggests that, if possible, we should group together examples of each class when training so the training sees consecutive samples of the same kind.
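A grouped training schedule could be sketched as follows. This is an assumption about how such grouping might be done, not the code from the experiments: here the classes simply alternate in fixed-size groups, and the group size is a parameter.

```python
def grouped_schedule(iterations, group_size=10, classes=("happy", "angry")):
    # consecutive iterations share a class; the class changes every group_size steps
    return [classes[(i // group_size) % len(classes)] for i in range(iterations)]

schedule = grouped_schedule(40, group_size=10)
print(schedule[8:12])  # ['happy', 'happy', 'angry', 'angry']
```

Each entry of the schedule would then decide which kind of synthetic image is drawn for that training iteration.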

The following shows what the generator creates after epochs 1 to 16, but this time the training schedule groups together samples of 10 different images of each kind, happy circle or angry lines. Each epoch still has 10000 iterations.


This time the learning is cleaner. There are no periods of large instability, and the shapes become coherent very quickly. There appears to be a transitionary phase at epochs 10 and 11. At epoch 10 we have our only example of diversity with both lines and circles being generated. The rest of the examples are effectively mode collapsed.


This time we see some incoherence in the early stages of training, but at later epochs we do see some non-mode-collapse, epochs 13, 15 and 16 for example.
The improvements are not clear cut, but there is evidence of some improvement from this moderately scientific test.

Batch Normalisation Before or After Activation

Another choice is between having the batch normalisation before or after the activation function.

The following shows the grouping-of-10 experiment again, but with batch normalisation taking place before the activation function, LeakyReLU.


The difference isn't drastic. The coherence of the shapes is worse, with examples of unusual shapes being rendered through the training epochs, but this isn't very severe.

This isn't a very scientific test, but is enough for us to stick with the logic of normalisation being applied before a signal is fed into the next layer of a network.

In this case, normalisation before a LeakyReLU means half the signals are mostly lost as they are below zero.
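The effect can be illustrated with numpy. This is a simplified sketch: the normalisation below has no learnable scale or shift, and random numbers stand in for the output of a layer, so it shows the idea rather than the behaviour of nn.BatchNorm2d itself.

```python
import numpy

rng = numpy.random.default_rng(0)
signal = rng.normal(3.0, 2.0, 10000)  # stand-in for a layer's output

def normalise(x):
    # zero mean, unit standard deviation
    return (x - x.mean()) / x.std()

def leaky_relu(x, slope=0.2):
    return numpy.where(x > 0, x, slope * x)

# normalising first centres the signal, so about half of it is negative
# and is then scaled down to a fifth by the LeakyReLU
fraction_negative = numpy.mean(normalise(signal) < 0)
norm_then_act = leaky_relu(normalise(signal))

# activating first and normalising afterwards hands the next layer
# a signal with zero mean and unit standard deviation
act_then_norm = normalise(leaky_relu(signal))
```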

Introducing Colour

Let's now introduce colour to the data sets and extend our code to handle the additional dimensions.
The following shows examples of the happy circles, now coloured a sky blue, or RGB (95, 114, 231).


The following shows the angry lines, now coloured a crimson red, or RGB (169, 20, 54).


Let's see if the addition of a colour dimension changes how the GAN learns.

The change to the discriminator is trivial: the incoming channel depth of the first convolution layer is changed from 1 to 3, and that's it.


    nn.Conv2d(3, 256, kernel_size=4, stride=2, padding=1, bias=False),


The change to the generator is similarly trivial. The final deconvolution reduces the channels from 256 to 3, not 1 as before.


nn.ConvTranspose2d(256, 3, kernel_size=4, stride=2, padding=1, bias=False),
nn.LeakyReLU(0.2),
nn.BatchNorm2d(3),
            
View((1, 3, 64, 64)),
nn.Tanh()



Saving Images During Training

We don't have to manually run the training for 16 epochs and generate a sample of six images from the generator after each epoch, saving them to an image file. We can automate this in the main training loop.

The following code saves the images using the current time as a fairly unique filename. It requires from google.colab import files, as well as import datetime and import time, at the top of our notebook.


# plot 6-image samples, save to file
f, axarr = plt.subplots(2,3, figsize=(16,8))
for i in range(2):
  for j in range(3):
    img = G.forward(generate_random(100)).permute(0,2,3,1).view(64,64,3).detach().cpu().numpy()
    img = (img + 1.0)/2.0
    axarr[i,j].imshow(img, interpolation='none', cmap='gray', vmin=0, vmax=1)
    pass
  pass
  
t = datetime.datetime.fromtimestamp(time.time())
fname = "img_%d_%s.png" % (epoch+1, t.strftime('%H-%M-%S'))
plt.savefig(fname)
files.download(fname)
plt.close(f)


The automatic file downloads are very convenient, but very occasionally the code fails with an error.


Learning Colour

Since grouping the classes into batches of 100 seemed to work well, we'll do the same for this colour experiment.


Compared to the monochrome experiment, this one seems to suffer more strongly from mode collapse. Not only are almost all the images angry red lines, within an epoch of training there is no variation of the pattern itself.

What is it about the addition of colour that has done this? If we swap the colours so the circles are red and the lines are blue, that might reveal an insight. Here are the results.


This time there is more switching between the circles and lines, although at any particular epoch there is mode collapse.

One possible explanation is that the red lines are denser than the circles making it easier to learn, and that the red colour was darker also making it easier to learn compared to that sky blue which was less in contrast with the background.

So a useful experiment is to pick red as RGB (255, 0, 0) and blue as (0, 0, 255). This way both have the same numerical intensity, so we should get similar results to the monochrome experiments. Let's see.


The results aren't conclusively better than, or different from, the previous experiments.

Let's go back to our monochrome experiments and try other ideas.


More Experiments - ReLU and LeakyReLU

One difference between our implementation and common implementations of DCGAN is that they use a ReLU in the generator and a LeakyReLU in the discriminator. We've used LeakyReLUs for both.

The rationale for a LeakyReLU is to avoid zero gradients so that update information does flow backwards along the discriminator to the generator. Why this logic isn't applied within the generator isn't clear.

Let's see if this recommendation makes a difference.


It looks like the non-leaky ReLU activations in the generator do fail.


More Experiments - Default (Xavier) Initialisation

Our code has so far been initialising the GAN network weights from a gaussian distribution. This is a very common recommendation and has some logic to it.

However, an alternative recommendation from a Stanford lecture on training neural networks suggests that a Xavier initialisation is better for deeper networks.


The following link should take you to the relevant section starting at 37 minutes.

Pytorch initialises weights using recent recommendations similar to Xavier (link). Let's see the results of not initialising the weights ourselves.


Compared to the equivalent monochrome experiment, the results seem cleaner, that is, there is less instability visible. The mode collapse isn't as severe in that at each epoch there is variety in spatial arrangement if not of the patterns themselves.

Let's see the effect when we include colour.


Compared to the experiment with weights initialised to a gaussian distribution, these results are marginally cleaner. Strictly speaking, we should be cautious about making a general claim on the basis of a small number of experiments. What we can say is that the difference isn't huge, the learning didn't destabilise, nor did it solve the mode collapse issue.

The code for this experiment is online:


In real world images, colours are rarely pure. Regions of red, for example, consist of different but similar reds. We can simulate this by adding a small random variation to the RGB values of both the red and the blue:


r2 = r + random.randint(-50,50)
g2 = g + random.randint(-50,50)
b2 = b + random.randint(-50,50)
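
Because each channel can move by up to 50 in either direction, a value near 0 or 255 can leave the valid 0-255 range. The sketch below shows one way to keep the varied colour valid by clamping each channel. The function name and the clamping step are additions for illustration, not part of the original code.

```python
import random

def vary_colour(rgb, spread=50):
    # add a random offset to each channel, then clamp to the 0-255 range
    return tuple(min(255, max(0, c + random.randint(-spread, spread))) for c in rgb)

crimson = (169, 20, 54)
varied = vary_colour(crimson)
print(varied)  # a slightly different red each time
```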


The following two experiments compare how well our GAN learns pure colours against the more realistic randomly varied colour.

Pure colour:


Varied colour:


We can see there is no significant difference. The GAN learns both pure and varied colours.

Varying colours in this way doesn't help solve the mode-collapse issue.


Conditional Learning More Classes

Let's extend our GAN to be a conditional GAN, that is, one that learns to associate each class of image with a label. We covered this in part 5 of this series.

Let's extend the synthetic image code to create not just happy blue circles and angry red lines, but also sad green waves and joyful yellow stars.

As before, let's also add some mild randomness to the colours so that all the blues, reds, greens and yellows have a small random variation added to the RGB values.

The following shows examples of the synthetic images for each class.


You can see the colours of each shape now vary around a base colour. You can see the code which draws the waves and stars here: python notebook. Opencv doesn't provide bezier or similar curves so the waves are made of points along a sine wave. The stars are simply lines out from a centre.

Previously we simply concatenated the label tensor to the noise seed for the generator, and the image tensor for the discriminator. For our fully convolutional GAN, we can't concatenate this label tensor so easily. Some work has been done to explore other methods of combining the label tensor with GAN inputs, here for example.

Here I have added the label tensor, tiled repeatedly to match the size of the image tensor, to the image tensor.

The following shows how this is done for the generator, in its forward() function.


def forward(self, noise_tensor, condition_tensor):
    # add condition tensor to noise tensor
    conditioned_noise_tensor = noise_tensor + condition_tensor.repeat(1,25)
    return self.model(conditioned_noise_tensor)


The noise tensor is of size (1, 100) and the condition tensor is of size (1, 4). This is why the condition tensor is repeated 25 times along the second dimension to create a tensor that matches the noise tensor size of (1, 100).

The following shows how this is similarly done for the discriminator.


def forward(self, image_tensor, target_tensor):
    # add condition tensor to image tensor
    conditioned_image_tensor = image_tensor + target_tensor.repeat(1, 3, 64, 16)
    return self.model(conditioned_image_tensor)
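
The shapes produced by the two repeat() calls can be checked in isolation. This is a small sketch with a made-up one-hot label for the second of four classes; it runs on the CPU and does not involve the GAN itself.

```python
import torch

# one-hot label for the second of four classes, shape (1, 4)
condition = torch.tensor([[0.0, 1.0, 0.0, 0.0]])

# generator: tile 25 times along the second dimension to match the seed
for_generator = condition.repeat(1, 25)
print(tuple(for_generator.shape))  # (1, 100)

# discriminator: repeat() treats the label as (1, 1, 1, 4), then tiles it
# over 3 channels, 64 rows and 16 times across each row of 64 pixels
for_discriminator = condition.repeat(1, 3, 64, 16)
print(tuple(for_discriminator.shape))  # (1, 3, 64, 64)
```

Every row of every channel therefore carries the same repeating 4-value pattern, which is what gets added to the image.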


The following shows the results of the GAN learning this synthetic dataset, now with four classes.


Success! The conditional GAN has successfully learned to generate each of the four classes on demand. If there is a capacity limit, we haven't hit it yet.

As an additional experiment, let's try adding gaussian noise to the entire synthetic image after the shapes have been applied:


# add random noise
img = img + numpy.random.normal(0, 1, (64, 64,3))*10
img = numpy.clip(img, 0.0, 255.0)


This causes a mild random texture to be applied to the whole image:


The idea is to see if this additional random noise makes GAN learning easier by smoothing the probability distributions of the source images, and by helping gradient flow, an effect that has been observed in others' work.
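One detail worth checking is the data type after the noise is added. In the sketch below a plain white image stands in for a synthetic image; adding floating point noise to a uint8 array gives a float64 array, so it needs converting back if an integer image is wanted. The final conversion step is an addition for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)

# plain white image as a stand-in for a synthetic image
img = numpy.zeros((64, 64, 3), numpy.uint8) + 255

# gaussian noise with standard deviation 10, then clip to the valid range
noisy = img + rng.normal(0, 1, (64, 64, 3)) * 10
noisy = numpy.clip(noisy, 0.0, 255.0)
print(noisy.dtype)  # float64

# convert back to integer pixel values if needed
noisy_uint8 = noisy.astype(numpy.uint8)
```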

Here are the GAN's outputs after being trained on the images with added noise:


Although this isn't a statistically robust experiment, this single example does show an improvement in the training. There is no instability as in the previous experiment at epoch 5, and the shapes seem to be learned much quicker with a smoother GAN output too.

This effect may be beneficial to sparse images like these, and less beneficial to already busy images in diverse datasets where there is a lot of variation in the probability distributions being learned by the GAN.

The mode collapse issue is still not improved.

So far we have seen some GAN design choices which improve the stability of learning and quality of outputs:
  • mild gaussian noise added to sparse images seems to improve training speed and stability
  • layer and batch normalisation between network layers.
  • collating together training data into groups of the same class
  • Xavier or similar size sensitive weight initialisation, not naive gaussian initialisation
  • LeakyReLU not basic ReLU

The code for combining all the ideas up to this point is online:




Mode Collapse Challenge

We have almost all the elements working. We have both dense and fully convolutional GANs working well. We have experimented with architectural choices and found those that can help, such as adding noise to the source datasets. We have an initial synthetic image data tool which allows us to more easily test our GANs. And in enough cases we have seen successful GAN training.

The only issue that seems to persist is mode collapse. This will be the focus for the next phase of work.


References


Tuesday 18 June 2019

Generative Adversarial Networks - Part V

This is Part 5 of a short series of posts introducing and building generative adversarial networks, known as GANs.

In this post we'll learn about a different architecture called a conditional GAN which enables us to direct the GAN to produce images of a class that we want, rather than images of a random class.


Previously:
  • Part 1 introduced the idea of adversarial learning and we started to build the machinery of a GAN implementation.
  • Part 2 we extended our code to learn a simple 1-dimensional pattern 1010.
  • Part 3 we developed our code to learn to generate 2-dimensional grey-scale images that look like handwritten digits
  • Part 4 extended our code to learn full colour faces, and also developed convolutional networks to encourage learning localised image features


Controlling What A GAN Creates

In parts 3 and 4 of this series, we trained a GAN on data that contained unique and diverse images. Each handwritten digit in the MNIST dataset is different, and each face in the CelebA data set is unique.

When we use the trained generator to create an image, we have no control over what kind of image it creates. All we know is that the image will be plausible enough to get past the discriminator.


We can't ask the generator from part 3 to create a 7 or a 9 for us. All we can do is feed the generator a random vector of numbers as input and see what image pops out.

If we experiment with that random input, in an attempt to control what comes out of the generator, we find that it doesn't give us sufficient, if any, control over the output.

Is it possible to train the generator so that we can influence the output?

The answer is yes, and that is what a conditional GAN architecture aims to do.


Conditional GAN Architecture

The following picture shows the conditional GAN architecture.


You can see that both the generator and discriminator are provided with additional information about the image. For us, this additional information can be a label, such as the digit an MNIST image represents.

It is not immediately clear how this helps. Let's break it down:
  • The discriminator can use the label to improve how it identifies whether an image is real or fake. How it does this is up to the discriminator itself, but additional information can only help. Without the label, the discriminator has a set amount of information on which to make the decision. With the label, it has additional information.
  • The generator can learn to associate the label with the image it generates. It doesn't have to - it could choose to ignore the additional information. But the generator learns by getting feedback from the discriminator, which has learned to associate the label with an image, so the generator is encouraged to make this association too by generating images that match the image-label pair the discriminator sees from the training set.


Training A Conditional GAN

The training loop is unchanged from a vanilla GAN. The only difference is the additional information appended to the inputs to the generator and the discriminator:

  • The discriminator is shown a real image from the training dataset, as well as that image's label. It is trained to output a 1 for real.
  • The discriminator is shown a fake image from the generator together with its label, and is trained to output a 0 for fake.
  • The generator is trained to cause the discriminator to output a 1 for real.

The labels associated with the real images are part of the training data.

The labels associated with the generator are randomly chosen one-hot vectors of the same length as the labels in the training dataset. We just need to make sure that this randomly chosen label remains the same when fed into the generator as part of the seed, and when associated with the generated image for the discriminator to test. We can't have a different label for these two parts of the training.
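A random one-hot label can be produced along the following lines. This is a hypothetical stand-in for the generate_random_target() helper used in the training loop, whose actual code is not shown in this post; it creates a CPU tensor of shape (1, size).

```python
import random

import torch

def random_one_hot(size):
    # hypothetical helper: all zeros except a single randomly placed 1.0
    label = torch.zeros(1, size)
    label[0, random.randint(0, size - 1)] = 1.0
    return label

label = random_one_hot(10)
print(tuple(label.shape))  # (1, 10)
```

The same label tensor would be passed both to the generator and, alongside the generated image, to the discriminator.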

When feeding the generator, the one-hot label vectors are combined with the random seed by concatenating the tensors like this:


def forward(self, noise_tensor, label_tensor):
    # combine image and label
    inputs = torch.cat((noise_tensor, label_tensor),1)
    
    # simply run model
    return self.model(inputs)


Similarly, when feeding the discriminator the one-hot label vectors are combined with the image data like this:


def forward(self, image_tensor, label_tensor):
        # combine image and label
        inputs = torch.cat((image_tensor.view(1, 784), label_tensor),1)
        
        # simply run model
        return self.model(inputs)
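
The effect of the concatenation on tensor sizes can be checked on its own. The sketch below uses random CPU tensors in place of a real seed and a real MNIST image, and a made-up label for the digit 7.

```python
import torch

noise = torch.rand(1, 100)        # stand-in for the random seed
image = torch.rand(1, 1, 28, 28)  # stand-in for an MNIST image
label = torch.zeros(1, 10)
label[0, 7] = 1.0                 # one-hot label for the digit 7

generator_input = torch.cat((noise, label), 1)
discriminator_input = torch.cat((image.view(1, 784), label), 1)

print(tuple(generator_input.shape))      # (1, 110)
print(tuple(discriminator_input.shape))  # (1, 794)
```

This means the first layer of the generator has to accept 110 values, and the first layer of the discriminator 794.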


The following shows code for the training loop:


# train Discriminator and Generator

epochs = 12

for i in range(epochs):
    print('training epoch', i+1, "of", epochs)
    
    for label, image_data_tensor, target_tensor in mnist_dataset:
      
        # train discriminator on real data
        D.train(image_data_tensor.view(1, 1, 28, 28), target_tensor, torch.cuda.FloatTensor([1.0]).view(1,1))
        
        # random 1-hot label for generator
        random_target_tensor = generate_random_target(10)

        # train discriminator on false
        # use detach() so only D is updated, not G
        # label softening doesn't apply to 0 labels
        D.train(G.forward(generate_random(100).view(1,100), random_target_tensor).detach(), random_target_tensor, torch.cuda.FloatTensor([0.0]).view(1,1))
        
        # random 1-hot label for generator
        random_target_tensor = generate_random_target(10)
        
        # train generator
        G.train(D, generate_random(100).view(1,100), random_target_tensor, torch.cuda.FloatTensor([1.0]).view(1,1))
        
        pass
    
    pass


The full code is online:



Results

The results of training should be a generator that can create images of a desired class by providing it with the label as well as the normal random seed. So feeding the generator a label of 1 should result in images that look like a hand-drawn 1.

The following shows the results of 12 epochs of training.


The zeros at the top left are produced by feeding the trained generator a random seed augmented with a one-hot vector corresponding to the label 0, which would be 1000000000.

We can see that for each input label, the generator does indeed produce images of that label.

The following shows the results for 24 epochs of training:


The quality of the digits has improved.

As an experiment to see how important the labels are to training, the following set of results are from the same code but with the one-hot vector to the discriminator set to 0000000000.


We can see two things:

  • the generator no longer creates images of the desired class
  • the image quality overall is lower than without the label

This shows that it is important for the discriminator to learn the association between an image and its class, for it to then feed the generator useful gradients to learn from. The lower quality is likely a result of the fact that we have, in effect, an enlarged image to learn, which calls for a longer training time or perhaps a more efficient neural network design (the design choices referred to as hyper-parameters).


Experimenting With Input Labels

Let's see what happens when we give the trained generator input labels that are not 1-hot but have several elements activated.

We can use the plot_images() generator method to activate more than one location by supplying a tuple of labels. The following sets the input vector to be [0, 0, 0, 0, 0, 0, 1, 0, 0, 1].


G.plot_images((6,9))


The resulting images are shapes which are intermediate between 6 and 9.
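
The plot_images() method itself is not shown here, so the following is only a minimal sketch of how a tuple of labels could be turned into such an input vector. The helper name multi_hot() is made up for illustration.

```python
import torch

def multi_hot(labels, size=10):
    # hypothetical helper: start with all zeros, then switch on each requested label
    label_tensor = torch.zeros(size)
    label_tensor[list(labels)] = 1.0
    return label_tensor

# positions 6 and 9 are set to 1, everything else stays 0
print(multi_hot((6, 9)).tolist())
```

A 1-hot vector is just the special case where the tuple has a single entry.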


This is interesting as it shows that we can manipulate the input vector in ways that have a visual meaning.

The following shows the results for G.plot_images((3, 5)).


That also broadly works. Let's try a more challenging combination, G.plot_images((1, 7)).


The results are understandable as it is hard to find a shape that is both 1 and 7 in nature.


Conclusion

If we think about these results, they're quite impressive.

We've managed not only to train a GAN to generate plausible images, where the generator has not directly seen the training data, but also to control the class of image being generated by associating the learned representation with a label we provide.

Previously the learned representation was entangled, and it was difficult to induce the generator to produce an image of the class we wanted just by manipulating the random seed.

We also saw how we can manipulate the input vector to create images which have shapes representing combinations of more than one class. 


More Reading



Sunday 2 June 2019

Generative Adversarial Networks - Part IV

This is Part 4 of a short series of posts introducing and building generative adversarial networks, known as GANs.


Previously:
  • Part 1 introduced the idea of adversarial learning and we started to build the machinery of a GAN implementation.
  • Part 2 we extended our code to learn a simple 1-dimensional pattern 1010.
  • Part 3 we developed our code to learn to generate 2-dimensional grey-scale images that look like handwritten digits


In this post we'll extend our code again to learn to generate full-colour images, learning from a dataset of celebrity face photos. The ideas should be the same, and the code shouldn't need much new added to it.


Celebrity Faces

A popular dataset for human faces is the celebA dataset which contains 202,599 photos, annotated with some features.

A revised version was developed, called the aligned celebA dataset, where the location of the eyes is consistent across the dataset and the orientation of the heads is vertical so the mouth is below the eyes where possible. The following shows 6 samples from the dataset.


From a code perspective, it is cumbersome to use a folder of over 200,000 images. It is much easier to work with a data structure which contains the images as numerical arrays.

The terms of use prevent me sharing this repackaged dataset, but the following snippet of code will convert the provided zip into a hdf5 file.


HDF5 is a format designed to store large amounts of data for efficient access and processing in a portable manner. Python's pickle approach is not as scalable, and has additional security concerns.

The following code illustrates how to use the python h5py library to extract images from this hdf5 file:


import h5py
import numpy
import matplotlib.pyplot as plt

# open the hdf5 file read-only and pull out a single image
with h5py.File('my_data/My Drive/Colab Notebooks/gan/celeba_dataset/celeba_aligned_small.h5py', 'r') as file_handle:
  dataset = file_handle['img_align_celeba']
  image = numpy.array(dataset['000007.jpg'])
  plt.imshow(image, interpolation='none')


You can see that a hdf5 file is opened just like a normal file. As the hdf5 format is hierarchical, we first select which dataset we're interested in, here img_align_celeba. That gives us a dictionary-like structure where the keys are the image file names. Here we pick 000007.jpg and convert the returned data into a numpy array before plotting it.

The image data is of the form (height, width, 3) where the 3 is required for the red, green and blue colour values.


Simple Discriminator and Generator

Following our philosophy of starting small and simple, we'll see how well a very simple discriminator and generator made of a single hidden layer of fully connected nodes works.

The following is a simple discriminator model consisting of an input layer of size 3*218*178 = 116412 nodes, a hidden layer of 100 nodes, and a final output layer of 1 node which is sufficient for a 1 (true) and 0 (false) output.


# define neural network layers
# input shape is (1, 3, height, width)
self.model = nn.Sequential(
            
    View((1,3*218*178)),
            
    nn.Linear(3*218*178, 100),
    nn.LeakyReLU(),
        
    nn.LayerNorm(100),
            
    nn.Linear(100, 1),
    nn.Sigmoid()
)
        
# create error function
self.error_function = torch.nn.BCELoss()

# create optimiser, using Adam
self.optimiser = torch.optim.Adam(self.parameters(), lr=0.0001)


The incoming data is reshaped using View() to be 1-dimensional so it can be considered a single input layer of nodes. We're using the leaky relu and layer normalisation for the middle layer as we previously found that to be beneficial for GAN training.
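
View() is not a built-in pytorch layer. A minimal sketch of such a helper, assuming it simply wraps the tensor's view() method so that a reshape can sit inside nn.Sequential, might look like this:

```python
import torch
import torch.nn as nn

class View(nn.Module):
    # assumed helper: reshapes the incoming tensor to a fixed shape
    def __init__(self, shape):
        super().__init__()
        self.shape = shape

    def forward(self, x):
        return x.view(*self.shape)

# a (1, 3, 218, 178) image tensor flattened to a single row of 3*218*178 values
image = torch.rand(1, 3, 218, 178)
flat = View((1, 3*218*178))(image)
print(flat.shape)
```

Wrapping the reshape in a module keeps the whole network definition in one nn.Sequential list.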

As before, let's first check this network has the capacity to learn to discriminate between real data and random noise. If it can't even do that then it is intuitive that it can't tell the difference between images from the training set and images from the generator.

The core code for training the discriminator is as follows:


for image_data_tensor in celeba_dataset:
        
    # train discriminator on real data
    D.train(image_data_tensor.permute(2,0,1).contiguous().view(1, 3, 218, 178), torch.cuda.FloatTensor([1.0]).view(1,1))
    
    # train discriminator on false (random) data
    D.train(generate_random(3*218*178).view(1, 3, 218, 178), torch.cuda.FloatTensor([0.0]).view(1,1))
    
    pass


The code looks a little complicated so let's take it step by step. The data from the celeba dataset is of the form (height, width, 3). This needs to be reshaped to (1, 3, height, width), the 4-dimensional tensor expected by pytorch. The first 1 is a batch size. The permute() function re-arranges the axes and contiguous() is needed to repack the tensor as permute can cause the memory layout to become non-contiguous.

Similarly, the generate_random() creates a 1-dimensional array of random numbers, which we need to reshape to (1, 3, height, width).
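
To make the reshaping concrete, here is a small sketch using a random tensor as a stand-in for one image from the dataset. Only standard pytorch tensor methods are used.

```python
import torch

# stand-in for one image from the dataset, shaped (height, width, 3)
image_data_tensor = torch.rand(218, 178, 3)

# move the colour axis to the front, giving (3, height, width)
permuted = image_data_tensor.permute(2, 0, 1)
print(permuted.shape, permuted.is_contiguous())

# repack the memory so that view() can add the batch dimension
batch = permuted.contiguous().view(1, 3, 218, 178)
print(batch.shape)
```

The pixel values are unchanged, only the order of the axes differs, so batch[0, c, y, x] is the same number as image_data_tensor[y, x, c].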

As we're developing code, I've only taken 19,999 images from the 202,599 for the hdf5 file. The following shows the loss as the discriminator is trained once on this data.


We can see the loss falls to zero as training proceeds. It is interesting that a large number of the losses seem to be concentrated on a tight path.

Manually testing the trained discriminator shows it has been trained successfully.


We can now proceed to defining the generator. Again, let's keep its architecture as simple as possible.


# define neural network layers
# input shape is 1-dimensional array
self.model = nn.Sequential(

    nn.Linear(100, 3*10*10),
    nn.LeakyReLU(),

    nn.LayerNorm(3*10*10),

    nn.Linear(3*10*10, 3*218*178),
    nn.Sigmoid(),

    View((1, 3, 218, 178))
)

# create error function
self.error_function = torch.nn.BCELoss()

# create optimiser, using Adam
self.optimiser = torch.optim.Adam(self.parameters(), lr=0.0001)


As before, the generator takes a random input, in this case a 1-dimensional array of size 100. We use a middle layer of size 3*10*10 = 300 and again use the leaky relu and layer normalisation. The final layer grows to 3*218*178 which is the number needed for an image of size 218 by 178 with 3 channels for red, green and blue.

The code to train the generator follows the same pattern as before - we train the discriminator to label real images as 1, images from the generator as 0, and we train the generator to get the discriminator to label its images as 1.


for image_data_tensor in celeba_dataset:
      
    # train discriminator on real data
    D.train(image_data_tensor.permute(2,0,1).contiguous().view(1, 3, 218, 178), torch.cuda.FloatTensor([1.0]).view(1,1))
    
    # train discriminator on false
    # use detach() so only D is updated, not G
    # label softening doesn't apply to 0 labels
    D.train(G.forward(generate_random(100)).detach(), torch.cuda.FloatTensor([0.0]).view(1,1))
    
    # train generator
    G.train(D, generate_random(100), torch.cuda.FloatTensor([1.0]).view(1,1))
    
    pass
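
The role of detach() can be seen in isolation with a small sketch. The two tiny linear layers below are stand-ins chosen only for illustration, they are not the networks from this post.

```python
import torch
import torch.nn as nn

# tiny stand-in networks, only to illustrate detach()
G = nn.Linear(4, 3)
D = nn.Linear(3, 1)

fake = G(torch.rand(1, 4))

# the discriminator sees a detached copy, so the backward pass stops at D
D(fake.detach()).sum().backward()

# True means no gradient was calculated for that parameter
print([p.grad is None for p in G.parameters()])
print([p.grad is None for p in D.parameters()])
```

Because no gradients reach the generator, an optimiser step taken at this point can only change the discriminator.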


Here are the results from one epoch through the dataset of 19,999 images.


Yippee!

Our GAN did actually learn to create faces. That's pretty amazing - especially when we remember the generator has not directly seen any of the celebrity photos.

Another nice thing is that the faces are different - so we've avoided mode collapse. The variety of images is really good - we can see both male and female faces, as well as different styles of hair and shapes for eyes and heads.

We can see the losses from the discriminator and the generator stabilise which suggests an equilibrium has been reached, which is good. Equilibrium in GAN training is sometimes hard to achieve, and a lack of it can lead to a collapse in useful training of the generator.

For a first attempt, these results are great.

The images were a bit grainy so let's see what a second training round does.


The images have improved significantly. The equilibrium between the discriminator and generator is holding.

Let's continue to more training epochs. The following shows the results for 4 epochs.


The results are improving, albeit we still have some of the grain. It is good to see the diversity maintained with different hair colour and even one oblique pose. This means we've avoided mode collapse.

Continuing to 6 epochs gives slightly improved images again.


The lower middle image is starting to show over-saturation which could be a sign we're reaching the limit of this training session.

At 8 training epochs we're starting to see both an improvement in some of the generated faces but also over-saturated blotching, and also some mode collapse. Two pairs of the images in the apparently random set above are very similar.


The loss charts are starting to show instability, which corresponds with this worsening of the generator.

The code for this simple GAN is online, and includes instructions to take advantage of a cuda gpu through google's hosted colab notebook service:



The animation at the top of this blog is the output of the generator as the input array is varied in a controlled way, moving a set of consecutive 1's along the 100 length array in steps of 5, with the resulting images smoothly transitioned for effect.
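
A sketch of how such a sequence of input arrays could be built is shown below. The helper name sliding_seeds() is made up, and the number of consecutive 1's (run_length) is an assumption because it isn't stated here.

```python
import numpy

def sliding_seeds(size=100, run_length=10, step=5):
    # run_length is an assumed value, chosen only for illustration
    seeds = []
    for start in range(0, size - run_length + 1, step):
        seed = numpy.zeros(size)
        seed[start:start + run_length] = 1.0
        seeds.append(seed)
    return seeds

seeds = sliding_seeds()

# show the number of seeds and the start of the second one
print(len(seeds), seeds[1][:20])
```

Each array in the list would be fed to the trained generator in turn to produce one frame.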


Experiments - Change Size of Random Input

What we've developed is an intentionally simple architecture to get us started. We can do many experiments varying different elements of the network's design and architecture to see if they result in an improvement or not.

A simple experiment is to change the size of the random number array that feeds the generator. So far we've been using a size of 100. That number is effectively the first layer of the generator network. If it is too large, it makes training the generator harder. If it is too small, it may limit the variety of images the generator can create.

Here are the images resulting from 4 epochs of training, with the input array size varying as 5, 10, 100 and 200.


Overall the quality of the images doesn't change much. For very small input sizes, the image quality is poor, but actually surprisingly good. Having an input of size 1 into the generator still creates diverse images that do look like faces, even if the quality is poor and there is mode collapse.

The quality seems to improve slowly as that input size grows. At size 400, the images are diverse and contain different features but are starting to look like they need more training. This is expected, because larger networks take longer to train.

The only slight surprise here is that an input of 1 random number into the generator still results in faces being generated.


Convolutions for Selecting Features

A very common improvement in GANs, and indeed neural networks more generally, is the use of convolution layers. Instead of connecting every input node to every node in the next layer, we can limit the connections to a smaller area of the input. This means the next layer picks out local features of the input.

The following diagram (src) shows how a convolution kernel K picks out diagonal features from an image I. The feature map S has a high value of 1 in the bottom right because the original image has a diagonal pattern in the bottom right. Similarly, the top right has no diagonal pattern and that's why S has a 0 there. Partial patterns in the image, such as the top left, result in partial values in S.
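
The same idea can be tried with a few lines of numpy. The image and kernel values below are made up for illustration and are not the ones in the linked diagram.

```python
import numpy

# a 5x5 image with a diagonal line in its bottom right corner
I = numpy.zeros((5, 5))
I[2, 2] = I[3, 3] = I[4, 4] = 1.0

# a 3x3 kernel that responds to diagonals, scaled so a perfect match gives 1
K = numpy.eye(3) / 3

# slide the kernel over the image to build the 3x3 feature map
S = numpy.zeros((3, 3))
for row in range(3):
    for col in range(3):
        S[row, col] = (I[row:row+3, col:col+3] * K).sum()

print(S.round(2))
```

The bottom right of S is 1 because the kernel lines up fully with the diagonal there, the top right is 0, and the top left has a partial value because only one pixel of the diagonal falls under the kernel.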


The following notebook implements a simple classifier neural network for the handwritten digits dataset from part 3, and shows that learning localised image features results in the accuracy jumping from 90% to 98%:




The following animation (src) shows how a convolution reduces the size of the image to a smaller feature map:


You can see how this locally limited passing of information from the input (blue) to the next layer (green) can allow image features to be learned. For this reason convolution layers are popular for classifying neural networks - and in our work, the discriminator.

For the generator, we can use convolutions again but they need to work in the opposite direction. Instead of shrinking the input, we need the generator to expand the input noise array towards the size of the output image. These are called transposed convolutions, or sometimes deconvolutions.

It is worth noting that generators are often designed to be similar to their discriminator but reversed in direction. There is no real reason to do this, other than as a loose heuristic approach to balancing the generator and discriminator.

The following animation (src) shows this working. The 3x3 input (blue) is expanded to the 5x5 output.


Initial experiments failed to produce results. Here's an early experiment.


The striping or moire-like pattern problem is common when trying to build images from inverse convolutions. This is because the kernel outputs can overlap unevenly if not spaced apart just right. Even overlap is achieved by making sure the stride or step size divides the size of the kernel.
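
A simple way to see the uneven overlap is to count how many times each output position is written to. The following is a 1-dimensional sketch in plain python, with a made-up helper name overlap_counts().

```python
def overlap_counts(n_inputs, kernel, stride):
    # count how many kernel placements write to each output position
    out_size = (n_inputs - 1) * stride + kernel
    counts = [0] * out_size
    for i in range(n_inputs):
        for k in range(kernel):
            counts[i * stride + k] += 1
    return counts

# stride 2 does not divide kernel size 3, so the interior counts alternate
print(overlap_counts(4, kernel=3, stride=2))

# stride 2 divides kernel size 4, so the interior counts are even
print(overlap_counts(4, kernel=4, stride=2))
```

Positions that are written to more often tend to end up with larger values, and that regular alternation is what shows up as stripes in the image.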

Here is another example which failed to generate celebrity faces, but did succeed in creating monsters from some horror film!


After much experimentation I did find a working, and still simple, solution.

The following shows how the generator and the discriminator are broadly balanced. The images themselves have been cropped to be a square 128x128 because the inverse convolutions are much easier to design to have an output that is 128x128 and avoid having a linear layer at the end of the generator.


The discriminator has three convolution layers with kernels of size 8, which move in steps of 2. After the convolutions have reduced the input to a 3x10x10 feature map, a linear layer reduces these 300 values down to 1.

self.model = nn.Sequential(
    
    nn.Conv2d(3, 256, kernel_size=8, stride=2, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),
    
    nn.Conv2d(256, 256, kernel_size=8, stride=2, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),
        
    nn.Conv2d(256, 3, kernel_size=8, stride=2, bias=False),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),
 
    View((1,3*10*10)),
        
    nn.Linear(3*10*10, 1),
    nn.Sigmoid()
)
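
The View((1,3*10*10)) layer relies on the three convolutions shrinking a 128x128 image down to 10x10. This can be checked with the standard formula for the output size of a convolution. The helper name conv_output_size() is made up for this sketch.

```python
def conv_output_size(size, kernel, stride, padding=0):
    # standard output size formula for a convolution, rounding down
    return (size + 2 * padding - kernel) // stride + 1

size = 128
for layer in range(3):
    size = conv_output_size(size, kernel=8, stride=2)
    print(size)
```

The sizes go 128, 61, 27 and finally 10, which with 3 output channels gives the 3*10*10 = 300 values fed to the linear layer.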


We are also expanding the 3 channels to 256 for the convolutions, which is simply having 256 different potential feature selectors at each layer. This gives the network a larger capacity to learn more potentially useful features.

This discriminator also has dropout. This means that some of the network signals are zeroed during training, which helps avoid overfitting by preventing nodes from co-adapting.

Testing the discriminator by training it to separate real images from random noise shows that it does learn very well. The loss plot shows an interesting residual loss but the bulk of loss values fall towards zero.


Manually testing the discriminator shows very confident scores. This again confirms the general belief that convolution neural networks are better at image classification because they learn meaningful features.

The following is code for the generator. It is a bit smaller than the discriminator because there are only two convolution layers. Researchers are finding that in a balanced architecture, generators can be a bit smaller than the discriminators.


self.model = nn.Sequential(
            
    # input is a 1d array
    nn.Linear(100, 3*28*28),
    nn.LeakyReLU(0.2),
    
    # reshape to 2d
    View((1, 3, 28, 28)),
     
    nn.ConvTranspose2d(3, 256, kernel_size=8, stride=2, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
            
    nn.ConvTranspose2d(256, 3, kernel_size=8, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(3),
    nn.LeakyReLU(0.2),
               
    View((1,3,128,128)),
    nn.Tanh()
)
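
The sizes in the generator can be checked in the same spirit, using the standard formula for the output size of a transposed convolution (with no output padding or dilation). The helper name is made up for this sketch.

```python
def conv_transpose_output_size(size, kernel, stride, padding=0):
    # standard output size formula for a transposed convolution
    return (size - 1) * stride - 2 * padding + kernel

first = conv_transpose_output_size(28, kernel=8, stride=2)
second = conv_transpose_output_size(first, kernel=8, stride=2, padding=1)
print(first, second)
```

The 28x28 input grows to 62x62 and then to 128x128, which is why the second layer needs padding=1 to land exactly on the image size we want.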


Here are the results from 1 epoch of training:


These aren't photo-realistic portraits but the results are impressive because we've succeeded, even with very simple code, in constructing images from localised features, as learned by the convolution layers. These features are, in effect, the eyes, nose, hair, lips, and so on. Looking closely we can see the patching together of these features.

Also impressive is the diversity of the images, each generated by a random input to the generator. We've avoided mode collapse.

Here are the results from 2 epochs:


Here are the results from 4 epochs:


As training progresses, the composition of faces improves. Each of the faces does have two eyes, a mouth and a nose, with hair in roughly the right place. We do see mismatched eyes and hair.

Here are the results from 6 epochs:


Here are the results from 8 epochs:


Again, the quality of the faces is improving, with the patching becoming smoother.

Here are the results from 10 epochs:


Here are the results from 12 epochs:


Each of these training epochs took about half an hour using colab, and it would be interesting to continue the training for more epochs, and also use the full dataset of 200,000 images and not the smaller 20,000 used for these experiments.

There is a loss of contrast as the training progresses which I haven't yet understood. The geometric features are the primary means of getting past the discriminator. The relative strengths of colours, the contrast, might not be learned by the feature based discriminator.

The following shows a selection of generated images, animated using a smooth transition.


The final code is online:




Tips and Heuristics

The theory that underpins GANs is still being developed and for this reason the design and training is too often unsuccessful or not efficient.

Researchers have mostly coalesced around a few heuristics and tips, mostly derived from experience and empirical results:



  • Gaussian weight initialisation can help, just as it seems to help some traditional classification networks.
  • For GANs, convolution kernel sizes should be larger than the 3x3 or 4x4 often found in textbooks. Our own example uses 8x8 kernels.
  • Using a stride that exactly divides the size of the deconvolution kernel can avoid striping or moire-like patterns.
  • Avoid normalising the last generator and first discriminator layers. The explanation offered is that this ensures the network can learn the actual mean and variance of images. 
  • Square images are much easier to work with when trying to get inverse convolutions to produce images that are the size we want.
  • Avoid overconfidence in the discriminator by training it to target 0.9 instead of 1.0 which can lead to gradient saturation, and so slow or no learning. This is called soft-labelling. That 0.9 can be varied by small random amounts.
  • Occasionally flipping the true/false training target values helps the networks get kicked out of local-minima or periods of low-gradients.
  • Normalising the data to the range -1 to +1 helps it match the activation functions where they have the best gradients for learning. 
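
Two of these tips, soft-labelling and normalising the data, are easy to sketch in plain python. The function names and the size of the random variation are assumptions made for illustration.

```python
import random

def soft_true_label(base=0.9, jitter=0.05):
    # target for real images: 0.9 varied by a small random amount
    # the jitter size is an assumption
    return base + random.uniform(-jitter, jitter)

def normalise_pixel(value):
    # map a pixel value from the range 0..255 to the range -1..+1
    return value / 127.5 - 1.0

print(soft_true_label())
print(normalise_pixel(0), normalise_pixel(255))
```

Data normalised to -1 to +1 matches a generator whose final activation is tanh, as the convolutional generator above uses.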



Published Results

The following is an example of the current state of the art (src):



What does it take to produce these images?

The linked paper shows the hardware and training times:


Each GPU costs about £8,000 and 8 of them were used, with training times extending into days if not weeks!


Talk at Algorithmic Art June 2019

I gave a talk on this journey at the June 2019 meeting of the Algorithmic Art group.


A video recording is here: https://skillsmatter.com/skillscasts/13999-algorithmic-art-june-meetup

Slides for the talk are here: https://tinyurl.com/y3n55acf


Conclusion

The adversarial approach for training competing learning models is a markedly different idea to the large bulk of machine learning.

The theoretical underpinning is still being developed, and so for now not all architectures work well - with notable problems like mode collapse.

Even so, the results possible from very simple, as well as the very expensive state of the art, are impressive.

The future of generative adversarial machine learning looks very promising!


More Reading