Autoencoders, Tied Weights and Dropout

Training deep neural networks (i.e., networks with several hidden layers) is challenging, because normal training easily gets stuck in undesired local optima. This prevents the lower layers from learning useful features. This problem can be partially circumvented by pre-training the layers in an unsupervised fashion, thereby initialising them in a region of the parameter space from which training (or fine-tuning) with steepest-descent techniques is easier.

One of these unsupervised learning techniques is the autoencoder. An autoencoder is a feed-forward neural network with one hidden layer which is trained to map its input to itself via the representation formed by the hidden units. The optimisation problem for input data \(\vec{x}_1,\dots,\vec{x}_N\) is stated as:

\[\min_{\theta} \frac 1 N \sum_{i=1}^N \|\vec x_i - f_{\theta}(\vec x_i)\|^2 \enspace .\]

Of course, without any constraints this is a simple task, as the model will just learn the identity. It becomes more challenging when we restrict the size of the intermediate representation (i.e., the number of hidden units). An image with several hundred input dimensions cannot be squeezed into a representation of a few hidden neurons, so it is assumed that this intermediate representation learns something meaningful about the problem. Of course, this simple technique only works if the number of hidden neurons is smaller than the number of dimensions of the image. We need more advanced regularisation techniques to work with overcomplete representations (i.e., if the size of the intermediate representation is larger than the input dimension). Especially for images it is obvious that a good intermediate representation may need to be more complex: the number of objects that can appear in an image is far larger than the number of its pixels.

Shark offers a wide range of possible training algorithms for autoencoders. This basic tutorial shows how to set up and train autoencoders, including variants with tied weights and dropout.

In tutorials building on this one, we will show how to use these building blocks in conjunction with training an :doxy:`FFNet` in an unsupervised fashion.

As a dataset for this tutorial, we use a subset of the MNIST dataset which needs to be unzipped first. It can be found in examples/Supervised/data/mnist_subset.zip.

The following includes are needed for this tutorial:

#include <shark/Data/Pgm.h> //for exporting the learned filters
#include <shark/Data/SparseData.h>//for reading in the images as sparseData/Libsvm format
#include <shark/Models/Autoencoder.h>//normal autoencoder model
#include <shark/Models/TiedAutoencoder.h>//autoencoder with tied weights
#include <shark/ObjectiveFunctions/ErrorFunction.h>
#include <shark/Algorithms/GradientDescent/Rprop.h>// the RProp optimization algorithm
#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h> // squared loss used for regression
#include <shark/ObjectiveFunctions/Regularizer.h> //L2 regularisation
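Loading the unzipped dataset is not shown in the snippets below. A minimal sketch, assuming the importSparseData helper from SparseData.h and that the archive unpacks to a LibSVM/sparse-format file named data/mnist_subset.libsvm (adjust the path to your setup), could look like this:

LabeledData<RealVector, unsigned int> train;//images together with their digit labels
importSparseData(train, "data/mnist_subset.libsvm");//read the sparse/LibSVM file
std::cout<<"images: "<<train.numberOfElements()<<", dimension: "<<inputDimension(train)<<std::endl;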

Training autoencoders

Training an autoencoder is straightforward in Shark. We just create an instance of the :doxy:`Autoencoder` class, initialize its structure and perform a simple regression task where we set the labels to be the same as the inputs.

We will start by creating a simple function which creates and trains a given type of autoencoder model. This will enable us to compare different structures and model types during the evaluation:

template<class AutoencoderModel>
AutoencoderModel trainAutoencoderModel(
        UnlabeledData<RealVector> const& data,//the data to train with
        std::size_t numHidden,//number of features in the autoencoder
        std::size_t iterations, //number of iterations to optimize
        double regularisation//strength of the regularisation
){

The parameters are the dataset we use as inputs for the autoencoder, the number of hidden units (i.e., the size of the intermediate representation), the number of iterations to train, and the strength of the two-norm regularisation. The template parameter is the type of autoencoder we want to use, for example Autoencoder<LogisticNeuron,LogisticNeuron>. Its two template parameters define the types of the activation functions used in the hidden and output layer, respectively.

Next we create the model:

//create the model
std::size_t inputs = dataDimension(data);
AutoencoderModel model;
model.setStructure(inputs, numHidden);
initRandomUniform(model,-0.1*std::sqrt(1.0/inputs),0.1*std::sqrt(1.0/inputs));

All autoencoders need only two parameters for setStructure: the number of input dimensions and the number of hidden units. The number of outputs is implicitly given by the input dimensionality.

Next, we set up the objective function; this should look familiar by now. We set up an :doxy:`ErrorFunction` with the model and the squared loss. We create the :doxy:`LabeledData` object from the input data by setting the labels to be the same as the inputs. Finally, we add two-norm regularisation by creating an instance of the :doxy:`TwoNormRegularizer` class:

//create the objective function
LabeledData<RealVector,RealVector> trainSet(data,data);//labels identical to inputs
SquaredLoss<RealVector> loss;
ErrorFunction error(trainSet, &model, &loss);
TwoNormRegularizer regularizer(error.numberOfVariables());
error.setRegularizer(regularisation,&regularizer);
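Putting these pieces together, the function handed to the optimiser is the reconstruction error from the beginning of the tutorial plus the weighted two-norm penalty (writing \(\lambda\) for the regularisation strength and ignoring constant scaling conventions inside the regulariser):

\[E(\theta) = \frac 1 N \sum_{i=1}^N \|\vec x_i - f_{\theta}(\vec x_i)\|^2 + \lambda \|\theta\|^2 \enspace .\]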

Lastly, we optimize the objective using :doxy:`IRpropPlusFull`, a variant of the Rprop algorithm with full weight backtracking, which is more stable on the complicated error functions formed by neural networks:

IRpropPlusFull optimizer;
optimizer.init(error);
std::cout<<"Optimizing model: "+model.name()<<std::endl;
for(std::size_t i = 0; i != iterations; ++i){
        optimizer.step(error);
        std::cout<<i<<" "<<optimizer.solution().value<<std::endl;
}
//copy the best parameters found back into the model and return it
model.setParameterVector(optimizer.solution().point);
return model;
}

Experimenting with different architectures

We want to use the code above to train different architectures. We start with the standard autoencoder, which has the formula:

\[f(\vec x) = \sigma_2(\vec b_2+W_2\sigma_1(\vec b_1+W_1\vec x))\]

This is the normal equation for a feed-forward neural network with a single hidden layer. The hidden and output activation functions \(\sigma_1\) and \(\sigma_2\) can be chosen among the same types as used for :doxy:`FFNet`. The problem with this architecture is that, since the weight matrices \(W_1\) and \(W_2\) are independent, the autoencoder can easily learn the identity given a big enough hidden layer. A way around this is to use a :doxy:`TiedAutoencoder`, which has the formula:

\[f(\vec x) = \sigma_2(\vec b_2+W^T\sigma_1(\vec b_1+W\vec x))\]

Here we set \(W_2=W_1^T\), eliminating a lot of degrees of freedom.
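To get a feeling for how much this saves, count the parameters for input dimension \(d\) and \(h\) hidden units (a simple count based on the two formulas above: two weight matrices and two bias vectors versus one weight matrix and two bias vectors):

\[\underbrace{2dh + d + h}_{\text{Autoencoder}} \qquad \text{vs.} \qquad \underbrace{dh + d + h}_{\text{TiedAutoencoder}} \enspace .\]

For the \(28 \times 28\) MNIST images used below, this roughly halves the number of weights to be learned.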

Additionally, we can use a relatively new technique called dropout [SrivastavaEtAl2014]. It works entirely on the level of the activation functions by randomly setting a neuron's output to 0 with probability 0.5. Dropout acts only on the hidden units. It makes it harder for individual hidden units to specialise; instead, redundant features need to be learned.
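Written out, dropout multiplies each hidden activation by an independent binary mask that is redrawn for every evaluation during training:

\[\tilde h_j = m_j \cdot \sigma_1(\vec b_1 + W_1 \vec x)_j, \qquad m_j \sim \text{Bernoulli}(0.5) \enspace ,\]

so every forward pass effectively uses a different random subset of the hidden units.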

We will now use all four combinations of tied weights and dropout and compare the features generated on the MNIST dataset (we omit loading and preprocessing of the dataset for brevity):

typedef Autoencoder<LogisticNeuron, LogisticNeuron> Autoencoder1;
typedef TiedAutoencoder<LogisticNeuron, LogisticNeuron> Autoencoder2;
typedef Autoencoder<DropoutNeuron<LogisticNeuron>, LogisticNeuron> Autoencoder3;
typedef TiedAutoencoder<DropoutNeuron<LogisticNeuron>, LogisticNeuron> Autoencoder4;

Autoencoder1 net1 = trainAutoencoderModel<Autoencoder1>(train.inputs(),numHidden,iterations,regularisation);
Autoencoder2 net2 = trainAutoencoderModel<Autoencoder2>(train.inputs(),numHidden,iterations,regularisation);
Autoencoder3 net3 = trainAutoencoderModel<Autoencoder3>(train.inputs(),numHidden,iterations,regularisation);
Autoencoder4 net4 = trainAutoencoderModel<Autoencoder4>(train.inputs(),numHidden,iterations,regularisation);

exportFiltersToPGMGrid("features1",net1.encoderMatrix(),28,28);
exportFiltersToPGMGrid("features2",net2.encoderMatrix(),28,28);
exportFiltersToPGMGrid("features3",net3.encoderMatrix(),28,28);
exportFiltersToPGMGrid("features4",net4.encoderMatrix(),28,28);

Visualizing the autoencoder

After training the different architectures, we exported the feature maps (i.e., the input weights of the hidden neurons, arranged according to the pixels they are connected to). Let's have a look.

Normal autoencoder:

Plot of features learned by the normal autoencoders

Autoencoder with tied weights:

Plot of features learned by the tied autoencoders

Autoencoder with dropout:

Plot of features learned by the dropout autoencoders

Autoencoder with dropout and tied weights:

Plot of features learned by the tied autoencoders with dropout
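These feature maps are just the rows of the encoder weight matrix. A small sketch for inspecting them directly, assuming the uBLAS-style size1()/size2() interface of Shark's RealMatrix:

//each row of the encoder matrix holds the 28*28 = 784 input weights of one hidden unit;
//exportFiltersToPGMGrid reshapes every such row into a 28x28 image
RealMatrix const& filters = net1.encoderMatrix();
std::cout<<"filters: "<<filters.size1()<<", weights per filter: "<<filters.size2()<<std::endl;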

Full example program

The full example program is :doxy:`AutoEncoderTutorial.cpp`.

Attention

The parameter settings in the program will reproduce the filters shown above. However, the program takes some time to run, which might be too long for weaker machines.

References

[SrivastavaEtAl2014] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15:1929-1958, 2014.