Sunday 2 June 2019

Generative Adversarial Networks - Part IV

This is Part 4 of a short series of posts introducing and building generative adversarial networks, known as GANs.


Previously:
  • Part 1 introduced the idea of adversarial learning and we started to build the machinery of a GAN implementation.
  • In Part 2 we extended our code to learn a simple 1-dimensional pattern 1010.
  • In Part 3 we developed our code to learn to generate 2-dimensional grey-scale images that look like handwritten digits.


In this post we'll extend our code again to learn to generate full-colour images, learning from a dataset of celebrity face photos. The ideas should be the same, and the code shouldn't need much new added to it.


Celebrity Faces

A popular dataset for human faces is the celebA dataset which contains 202,599 photos, annotated with some features.

A revised version was developed, called the aligned celebA dataset, where the location of the eyes is consistent across the dataset and the orientation of the heads is vertical so the mouth is below the eyes where possible. The following shows 6 samples from the dataset.


From a code perspective, it is cumbersome to use a folder of over 200,000 images. It is much easier to work with a data structure which contains the images as numerical arrays.

The terms of use prevent me sharing this repackaged dataset, but the following snippet of code will convert the provided zip into a hdf5 file.


HDF5 is a format designed to store large amounts of data for efficient access and processing in a portable manner. Python's pickle approach is not as scalable, and has additional security concerns.

The following code illustrates how to use the python h5py library to extract images from this hdf5 file:


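import h5py
import numpy
import matplotlib.pyplot as plt

# open the hdf5 file read-only, just like a normal file
with h5py.File('my_data/My Drive/Colab Notebooks/gan/celeba_dataset/celeba_aligned_small.h5py', 'r') as file_handle:
  # select the group that holds the images
  dataset = file_handle['img_align_celeba']
  # pick one image by file name, convert it to a numpy array and plot it
  image = numpy.array(dataset['000007.jpg'])
  plt.imshow(image, interpolation='none')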


You can see that a hdf5 file is opened just like a normal file. As the hdf5 format is hierarchical, we first select which dataset we're interested in, here img_align_celeba. That gives us a dictionary-like structure where the keys are the image file names. Here we pick 000007.jpg and convert the returned data into a numpy array before plotting it.

The image data is of the form (height, width, 3) where the 3 is required for the red, green and blue colour values.


Simple Discriminator and Generator

Following our philosophy of starting small and simple, we'll see how well a very simple discriminator and generator made of a single hidden layer of fully connected nodes works.

The following is a simple discriminator model consisting of an input layer of size 3*218*178 = 116412 nodes, a hidden layer of 100 nodes, and a final output layer of 1 node which is sufficient for a 1 (true) and 0 (false) output.


# define neural network layers
# input shape is (1, 3, height, width)
self.model = nn.Sequential(
            
    View((1,3*218*178)),
            
    nn.Linear(3*218*178, 100),
    nn.LeakyReLU(),
        
    nn.LayerNorm(100),
            
    nn.Linear(100, 1),
    nn.Sigmoid()
)
        
# create error function
self.error_function = torch.nn.BCELoss()

# create optimiser, using Adam
self.optimiser = torch.optim.Adam(self.parameters(), lr=0.0001)


The incoming data is reshaped using View() to be 1-dimensional so it can be considered a single input layer of nodes. We're using the leaky relu and layer normalisation for the middle layer as we previously found that to be beneficial for GAN training.
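View() isn't something PyTorch provides out of the box; it is a small helper module carried over from the earlier code in this series. As a minimal sketch, assuming it does nothing more than wrap a tensor's view() method so a reshape can sit inside nn.Sequential, it might look like this. The random tensor is just a stand-in for one image.

```python
import torch
import torch.nn as nn

class View(nn.Module):
    """Reshape the incoming tensor to a fixed shape (assumed behaviour)."""

    def __init__(self, shape):
        super().__init__()
        self.shape = tuple(shape)

    def forward(self, x):
        # view() re-interprets the same values with a new shape
        return x.view(*self.shape)

flatten = View((1, 3 * 218 * 178))
image_batch = torch.rand(1, 3, 218, 178)   # stand-in for one image
flat = flatten(image_batch)
print(flat.shape)
```

The flattened tensor has 3*218*178 = 116412 values, matching the size of the first linear layer.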

As before, let's first check this network has the capacity to learn to discriminate between real data and random noise. If it can't even do that then it is intuitive that it can't tell the difference between images from the training set and images from the generator.

The core code for training the discriminator is as follows:


for image_data_tensor in celeba_dataset:
        
    # train discriminator on real data
    D.train(image_data_tensor.permute(2,0,1).contiguous().view(1, 3, 218, 178), torch.cuda.FloatTensor([1.0]).view(1,1))
    
    # train discriminator on false (random) data
    D.train(generate_random(3*218*178).view((1, 3, 218, 178)), torch.cuda.FloatTensor([0.0]).view(1,1))
    
    pass


The code looks a little complicated so let's take it step by step. The data from the celeba dataset is of the form (height, width, 3). This needs to be reshaped to (1, 3, height, width), the 4-dimensional tensor expected by pytorch. The first 1 is a batch size. The permute() function re-arranges the axes and contiguous() is needed to repack the tensor as permute can cause the memory layout to become non-contiguous.

Similarly, generate_random() creates a 1-dimensional array of random numbers, which we need to reshape to (1, 3, height, width).
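The two reshapes can be tried out on their own. This is a small sketch using CPU tensors and random values as a stand-in for a real image. The body of generate_random() is an assumption here (uniform random values); the real helper comes from the earlier parts of this series.

```python
import torch

def generate_random(size):
    # assumed helper: uniform random values between 0 and 1
    return torch.rand(size)

height, width = 218, 178

# stand-in for one image as stored in the hdf5 file: (height, width, 3)
image = torch.rand(height, width, 3)

# move the colour axis to the front; the memory layout is no longer contiguous
permuted = image.permute(2, 0, 1)
needs_repacking = not permuted.is_contiguous()

# repack, then add the batch dimension of 1
real_input = permuted.contiguous().view(1, 3, height, width)

# random noise of the same total size, reshaped the same way
noise_input = generate_random(3 * height * width).view(1, 3, height, width)

print(real_input.shape, noise_input.shape, needs_repacking)
```

The pixel at row 5, column 7 of the blue channel is image[5, 7, 2] before the permute and real_input[0, 2, 5, 7] after it.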

As we're developing code, I've only taken 19,999 images from the 202,599 for the hdf5 file. The following shows the loss as the discriminator is trained once on this data.


We can see the loss falls to zero as training proceeds. It is interesting that a large number of the losses seem to be concentrated on a tight path.

Manually testing the trained discriminator shows it has been trained successfully.


We can now proceed to defining the generator. Again, let's keep its architecture as simple as possible.


# define neural network layers
# input shape is 1-dimensional array
self.model = nn.Sequential(

    nn.Linear(100, 3*10*10),
    nn.LeakyReLU(),

    nn.LayerNorm(3*10*10),

    nn.Linear(3*10*10, 3*218*178),
    nn.Sigmoid(),

    View((1, 3, 218, 178))
)

# create error function
self.error_function = torch.nn.BCELoss()

# create optimiser, using Adam
self.optimiser = torch.optim.Adam(self.parameters(), lr=0.0001)


As before, the generator takes a random input, in this case a 1-dimensional array of size 100. We use a middle layer of size 3*10*10 = 300 and again use the leaky relu and layer normalisation. The final layer grows to 3*218*178 which is the number needed for an image of size 218 by 178 and 3 red, green and blue channels.

The code to train the generator follows the same pattern as before - we train the discriminator to label real images as 1, images from the generator as 0, and we train the generator to get the discriminator to label its images as 1.


for image_data_tensor in celeba_dataset:
      
    # train discriminator on real data
    D.train(image_data_tensor.permute(2,0,1).contiguous().view(1, 3, 218, 178), torch.cuda.FloatTensor([1.0]).view(1,1))
    
    # train discriminator on false
    # use detach() so only D is updated, not G
    # label softening doesn't apply to 0 labels
    D.train(G.forward(generate_random(100)).detach(), torch.cuda.FloatTensor([0.0]).view(1,1))
    
    # train generator
    G.train(D, generate_random(100), torch.cuda.FloatTensor([1.0]).view(1,1))
    
    pass
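
The D.train() and G.train() methods called in this loop were built up in the earlier parts of this series and aren't repeated here. The following is only a rough sketch of how such methods might look, assuming a plain binary cross entropy update. The tiny layer sizes, the CPU tensors and the method bodies are illustrative assumptions, not the code behind the results in this post.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, n_pixels):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(n_pixels, 8), nn.LeakyReLU(),
                                   nn.Linear(8, 1), nn.Sigmoid())
        self.error_function = nn.BCELoss()
        self.optimiser = torch.optim.Adam(self.parameters(), lr=0.0001)

    def forward(self, inputs):
        return self.model(inputs)

    # note: this name shadows nn.Module.train(), mirroring the calls in the loop
    def train(self, inputs, targets):
        loss = self.error_function(self.forward(inputs), targets)
        self.optimiser.zero_grad()
        loss.backward()
        self.optimiser.step()
        return loss.item()

class Generator(nn.Module):
    def __init__(self, n_random, n_pixels):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(n_random, 8), nn.LeakyReLU(),
                                   nn.Linear(8, n_pixels), nn.Sigmoid())
        self.optimiser = torch.optim.Adam(self.parameters(), lr=0.0001)

    def forward(self, inputs):
        return self.model(inputs)

    def train(self, D, inputs, targets):
        # the error is measured at the output of D ...
        loss = D.error_function(D.forward(self.forward(inputs)), targets)
        self.optimiser.zero_grad()
        loss.backward()
        # ... but only the generator's own weights are updated
        self.optimiser.step()
        return loss.item()

torch.manual_seed(0)
D, G = Discriminator(12), Generator(4, 12)
real, one, zero = torch.rand(1, 12), torch.ones(1, 1), torch.zeros(1, 1)

d_real_loss = D.train(real, one)                                   # real -> 1
d_fake_loss = D.train(G.forward(torch.rand(1, 4)).detach(), zero)  # fake -> 0
g_loss = G.train(D, torch.rand(1, 4), one)                         # fool D
```

The detach() call stops the error from the second step flowing back into the generator, and in the third step only the generator's optimiser takes a step, so the discriminator's weights stay as they are.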


Here are the results from one epoch through the dataset of 19,999 images.


Yippee!

Our GAN did actually learn to create faces. That's pretty amazing - especially when we remember the generator has not directly seen any of the celebrity photos.

Another nice thing is that the faces are different - so we've avoided mode collapse. The variety of images is really good - we can see both male and female faces, as well as different styles of hair and shapes for eyes and heads.

We can see the losses from the discriminator and the generator stabilise which suggests an equilibrium has been reached, which is good. Equilibrium in GAN training is sometimes hard to achieve, and a lack of it can lead to a collapse in useful training of the generator.

For a first attempt, these results are great.

The images were a bit grainy so let's see what a second training round does.


The images have improved significantly. The equilibrium between the discriminator and generator is holding.

Let's continue to more training epochs. The following shows the results for 4 epochs.


The results are improving, albeit we still have some of the grain. It is good to see the diversity maintained with different hair colour and even one oblique pose. This means we've avoided mode collapse.

Continuing to 6 epochs gives slightly improved images again.


The lower middle image is starting to show over-saturation which could be a sign we're reaching the limit of this training session.

At 8 training epochs we see an improvement in some of the generated faces, but also over-saturated blotching and some mode collapse. Two pairs of the images in the apparently random set above are very similar.


The loss charts are starting to show instability, which corresponds with this worsening of the generator.

The code for this simple GAN is online, and includes instructions to take advantage of a cuda gpu through google's hosted colab notebook service:



The animation at the top of this blog is the output of the generator as the input array is varied in a controlled way, moving a set of consecutive 1's along the 100 length array in steps of 5, with the resulting images smoothly transitioned for effect.
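Building that sequence of input arrays is simple enough to sketch. The number of consecutive 1's isn't stated above, so the run of 10 used here is an assumption, as is the function name.

```python
import numpy as np

def sliding_seeds(length=100, run=10, step=5):
    """Return input arrays with a block of `run` consecutive 1's moved along."""
    seeds = []
    for start in range(0, length - run + 1, step):
        seed = np.zeros(length, dtype=np.float32)
        seed[start:start + run] = 1.0
        seeds.append(seed)
    return np.stack(seeds)

seeds = sliding_seeds()
print(seeds.shape)
```

Each row would be fed to the trained generator in turn, and the resulting images blended from one to the next.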


Experiments - Change Size of Random Input

What we've developed is an intentionally simple architecture to get us started. We can do many experiments varying different elements of the network's design and architecture to see if they result in an improvement or not.

A simple experiment is to change the size of the random number array that feeds the generator. So far we've been using a size of 100. That number is effectively the first layer of the generator network. If it is too large, it makes training the generator harder. If it is too small, it may limit the variety of images the generator can create.

Here are the images resulting from 4 epochs of training, with the input array size varying as 5, 10, 100 and 200.


Overall the quality of the images doesn't change much. For very small input sizes, the image quality is poor, but actually surprisingly good. Having an input of size 1 into the generator still creates diverse images that do look like faces, even if the quality is poor and there is mode collapse.

The quality seems to improve slowly as that input size grows. At size 400, the images are diverse and contain different features but are starting to look like they need more training. This is expected, because larger networks take longer to train.

The only slight surprise here is that an input of 1 random number into the generator still results in faces being generated.


Convolutions for Selecting Features

A very common improvement in GANs, and indeed neural networks more generally, is the use of convolution layers. Instead of connecting every input node to every node in the next layer, we can limit the connections to a smaller area of the input. This means the next layer picks out local features of the input.

The following diagram (src) shows how a convolution kernel K picks out diagonal features from an image I. The feature map S has a high value of 1 in the bottom right because the original image has a diagonal pattern in the bottom right. Similarly, the top right has no diagonal pattern and that's why S has a 0 there. Partial patterns in the image, such as the top left, result in partial values in S.
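The same idea can be tried in a few lines of numpy. The image and kernel values below are made up for illustration and are not the ones in the diagram, but they show the same behaviour: a full diagonal under the kernel gives 1, a partial diagonal gives a partial value, and no diagonal gives 0.

```python
import numpy as np

def feature_map(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    S = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            S[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return S

# 5x5 image with a full diagonal in the bottom right and a broken one top left
I = np.array([[1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]], dtype=float)

# 3x3 kernel that responds to a diagonal, scaled so a perfect match gives 1
K = np.eye(3) / 3

S = feature_map(I, K)
print(S)
```

Here S is 1 in the bottom right, 2/3 in the top left where only two of the three diagonal pixels are set, and 0 in the top right.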


The following notebook implements a simple classifier neural network for the handwritten digits dataset from part 3, and shows that learning localised image features results in the accuracy jumping from 90% to 98%:




The following animation (src) shows how a convolution reduces the size of the image to a smaller feature map:


You can see how this locally limited passing of information from the input (blue) to the next layer (green) can allow image features to be learned. For this reason convolution layers are popular for classifying neural networks - and in our work, the discriminator.

For the generator, we can use convolutions again but they need to work in the opposite direction. Instead of shrinking the input, we need the generator to expand the input noise array towards the size of the output image. These are called transposed convolutions, or sometimes deconvolutions.

It is worth noting that generators are often designed to be similar to their discriminator but reversed in direction. There is no real reason to do this, other than as a loose heuristic approach to balancing the generator and discriminator.

The following animation (src) shows this working. The 3x3 input (blue) is expanded to the 5x5 output.


Initial experiments failed to produce results. Here's an early experiment.


The striping or moire-like pattern is a common problem when trying to build images from inverse convolutions. It happens because the kernel footprints overlap unevenly in the output if they are not spaced apart just right. It can be avoided by making sure the stride or step size exactly divides the size of the kernel.
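A quick way to see the uneven overlap is to count, in one dimension, how many kernel placements land on each output position. This is a small sketch with arbitrarily chosen sizes, not taken from the experiments here.

```python
import numpy as np

def overlap_counts(n_inputs, kernel_size, stride):
    """Count how many kernel placements touch each output position (1-D)."""
    out_len = (n_inputs - 1) * stride + kernel_size
    counts = np.zeros(out_len, dtype=int)
    for i in range(n_inputs):
        counts[i * stride : i * stride + kernel_size] += 1
    return counts

uneven = overlap_counts(4, kernel_size=3, stride=2)
even = overlap_counts(4, kernel_size=4, stride=2)
print(uneven)   # [1 1 2 1 2 1 2 1 1]
print(even)     # [1 1 2 2 2 2 2 2 1 1]
```

With a kernel of 3 and a stride of 2 the coverage alternates between 1 and 2, which is the source of the stripes. With a kernel of 4 and a stride of 2 the coverage is even everywhere except the edges.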

Here is another example which failed to generate celebrity faces, but did succeed in creating monsters from some horror film!


After much experimentation I did find a working, and still simple, solution.

The following shows how the generator and the discriminator are broadly balanced. The images themselves have been cropped to be a square 128x128 because the inverse convolutions are much easier to design to have an output that is 128x128 and avoid having a linear layer at the end of the generator.


The discriminator has three convolution layers with kernels of size 8, which move in steps of 2. After the convolutions have reduced the input to a 3x10x10 feature map, a linear layer reduces these 300 values down to 1.

self.model = nn.Sequential(
    
    nn.Conv2d(3, 256, kernel_size=8, stride=2, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),
    
    nn.Conv2d(256, 256, kernel_size=8, stride=2, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),
        
    nn.Conv2d(256, 3, kernel_size=8, stride=2, bias=False),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),
 
    View((1,3*10*10)),
        
    nn.Linear(3*10*10, 1),
    nn.Sigmoid()
)


We are also expanding the 3 channels to 256 for the convolutions, which is simply having 256 different potential feature selectors at each layer. This gives the network a larger capacity to learn more potentially useful features.
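The size of the feature map after each convolution layer can be checked by hand. A small sketch, using the standard output size formula for a convolution with no padding or dilation (the function name is just for illustration):

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    # output length of a convolution along one axis
    return (size + 2 * padding - kernel_size) // stride + 1

sizes = [128]
for _ in range(3):
    sizes.append(conv_out(sizes[-1], kernel_size=8, stride=2))

print(sizes)   # [128, 61, 27, 10]
```

The final 10x10 map with 3 channels gives the 3*10*10 = 300 values that feed the linear layer.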

This discriminator also has dropout. This means that some of the network signals are zeroed during training, which helps avoid overfitting by preventing nodes from co-adapting.
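The effect of nn.Dropout2d can be seen in isolation. It zeroes whole channels (feature maps) rather than individual values, and scales the surviving ones up so the expected total signal stays the same. The tensor of ones below is just a convenient stand-in for a real feature map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout2d(0.3)
features = torch.ones(1, 256, 4, 4)

# training mode: roughly 30% of the channels are zeroed
dropout.train()
dropped = dropout(features)
channel_means = dropped.mean(dim=(2, 3)).flatten()

# evaluation mode: dropout does nothing
dropout.eval()
unchanged = dropout(features)

print(int((channel_means == 0).sum()), "of 256 channels zeroed")
```

Each channel mean is either 0 or 1/(1-0.3), about 1.43. In evaluation mode the output is identical to the input.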

Testing the discriminator by training it to separate real images from random noise shows that it does learn very well. The loss plot shows an interesting residual loss but the bulk of loss values fall towards zero.


Manually testing the discriminator shows very confident scores. This again confirms the general belief that convolution neural networks are better at image classification because they learn meaningful features.

The following is code for the generator. It is a bit smaller than the discriminator because there are only two convolution layers. Researchers are finding that in a balanced architecture, generators can be a bit smaller than the discriminators.


self.model = nn.Sequential(
            
    # input is a 1d array
    nn.Linear(100, 3*28*28),
    nn.LeakyReLU(0.2),
    
    # reshape to 2d
    View((1, 3, 28, 28)),
     
    nn.ConvTranspose2d(3, 256, kernel_size=8, stride=2, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
            
    nn.ConvTranspose2d(256, 3, kernel_size=8, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(3),
    nn.LeakyReLU(0.2),
               
    View((1,3,128,128)),
    nn.Tanh()
)
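
The sizes in the generator can be checked in the same way as for a normal convolution, but with the formula for a transposed convolution, which grows the image instead of shrinking it. A small sketch, assuming no output padding or dilation (the function name is just for illustration):

```python
def conv_transpose_out(size, kernel_size, stride=1, padding=0):
    # output length of a transposed convolution along one axis
    return (size - 1) * stride - 2 * padding + kernel_size

after_first = conv_transpose_out(28, kernel_size=8, stride=2)
after_second = conv_transpose_out(after_first, kernel_size=8, stride=2, padding=1)

print(after_first, after_second)   # 62 128
```

The 28x28 starting size and the padding of 1 on the second layer are what make the output land exactly on 128x128.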


Here are the results from 1 epoch of training:


These aren't photo-realistic portraits but the results are impressive because we've succeeded, even with very simple code, in constructing images from localised features, as learned by the convolution layers. These features are, in effect, the eyes, nose, hair, lips, and so on. Looking closely we can see the patching together of these features.

Also impressive is the diversity of the images, each generated by a random input to the generator. We've avoided mode collapse.

Here are the results from 2 epochs:


Here are the results from 4 epochs:


As training progresses, the composition of faces improves. Each of the faces does have two eyes, a mouth and a nose, with hair in roughly the right place. We do see mismatched eyes and hair.

Here are the results from 6 epochs:


Here are the results from 8 epochs:


Again, the quality of the faces is improving, with the patching becoming smoother.

Here are the results from 10 epochs:


Here are the results from 12 epochs:


Each of these training epochs took about half an hour using colab, and it would be interesting to continue the training for more epochs, and also use the full dataset of 200,000 images and not the smaller 20,000 used for these experiments.

There is a loss of contrast as the training progresses which I haven't yet understood. The geometric features are the primary means of getting past the discriminator. The relative strengths of colours, the contrast, might not be learned by the feature based discriminator.

The following shows a selection of generated images, animated using a smooth transition.


The final code is online:




Tips and Heuristics

The theory that underpins GANs is still being developed and for this reason their design and training are too often unsuccessful or inefficient.

Researchers have mostly coalesced around a few heuristics and tips, largely derived from experience and empirical results:



  • Gaussian weight initialisation can help, just as it seems to help some traditional classification networks.
  • For GANs, convolution kernel sizes should be larger than the 3x3 or 4x4 often found in textbooks. Our own example uses 8x8 kernels.
  • Using a stride that exactly divides the size of the deconvolution kernel can avoid striping or moire-like patterns.
  • Avoid normalising the last generator and first discriminator layers. The explanation offered is that this ensures the network can learn the actual mean and variance of images. 
  • Square images are much easier to work with when trying to get inverse convolutions to produce images that are the size we want.
  • Avoid overconfidence in the discriminator, which can lead to gradient saturation and so slow or no learning, by training it to target 0.9 instead of 1.0. This is called soft-labelling. That 0.9 can be varied by small random amounts.
  • Occasionally flipping the true/false training target values helps the networks get kicked out of local-minima or periods of low-gradients.
  • Normalising the data to the range -1 to +1 helps it match the activation functions where they have the best gradients for learning. 
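
Three of these tips, soft-labelling, occasional target flipping and normalising the data, are easy to sketch in plain Python. The function names, the size of the random variation and the flip probability are all assumptions chosen for illustration; the tips themselves don't fix these values.

```python
import random

def soft_real_label(rng, base=0.9, jitter=0.05):
    # target for real images: 0.9 varied by a small random amount
    return base + rng.uniform(-jitter, jitter)

def maybe_flip(label, rng, flip_probability=0.05):
    # occasionally swap the true/false target
    return 1.0 - label if rng.random() < flip_probability else label

def normalise_pixels(pixels):
    # map pixel values from 0..255 to -1..+1
    return [p / 127.5 - 1.0 for p in pixels]

rng = random.Random(0)
label = soft_real_label(rng)
scaled = normalise_pixels([0, 127.5, 255])
print(scaled)   # [-1.0, 0.0, 1.0]
```

If the data is normalised to -1 to +1 like this, the generator's final activation should produce the same range, which is what a tanh does.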



Published Results

The following is an example of the current state of the art (src):



What does it take to produce these images?

The linked paper shows the hardware and training times:


Each GPU costs about £8,000 and 8 of them were used, with training times extending into days if not weeks!


Talk at Algorithmic Art June 2019

I gave a talk on this journey at the June 2019 meeting of the Algorithmic Art group.


A video recording is here: https://skillsmatter.com/skillscasts/13999-algorithmic-art-june-meetup

Slides for the talk are here: https://tinyurl.com/y3n55acf


Conclusion

The adversarial approach for training competing learning models is a markedly different idea to the large bulk of machine learning.

The theoretical underpinning is still being developed, and so for now not all architectures work well - with notable problems like mode collapse.

Even so, the results possible from very simple architectures, as well as from the very expensive state of the art, are impressive.

The future of generative adversarial machine learning looks very promising!

