Generating Captions for Images using Neural Networks

Akash Chauhan
Published in Level Up Coding · Jul 24, 2020

We all know that Neural Networks can replicate much of what the human brain does in certain tasks. Their applications in Computer Vision and Natural Language Generation have been truly remarkable.

This article introduces one such application and shows how we can generate captions (descriptions) for images using a hybrid network of CNNs and RNNs (LSTM). The dataset we are using is the popular Flickr8k Image Dataset, a benchmark for this task, which can be accessed via the link below.

Kaggle — https://www.kaggle.com/adityajn105/flickr8k

Note: We will split the dataset into 7k images for training and 1k for testing.

We will first discuss the different components (layers) of our hybrid neural network and their functions. Along with this, we will look at the practical implementation of the network using TensorFlow, Keras, and Python.

Overall Architecture of Neural Network

Let's look at the overall architecture of our Neural Network that we will be using for generating captions.

Neural Network Architecture (Source: Google Images)

In simple terms, the above neural network has 3 main components (sub-networks), each assigned a specific task: the Convolutional Network (for extracting features from images), the RNN (LSTM) (for text generation), and a Decoder (for combining the two networks).

Now let's talk about each component in detail and understand their working.

Image Feature Extractor

To generate features from an image we will use a Convolutional Neural Network with just a bit of modification. Let's look at a ConvNet that is used for image recognition.

CNN Architecture (Source: Bing Images)

As you can see, our CNN has two sub-networks —

  1. Feature Learning Network — Network responsible for generating feature maps from the Images (Network of multiple Convolution & Pooling Layers).
  2. Classification Network — Fully connected Deep Neural Network responsible for Image classification (Network of multiple Dense Layers and a single Output Layer).

Since we only want to extract features from images and not classify them, we keep just the feature learning part of the CNN and drop the classification network.

The code below can be used to extract features from any set of images:

Designing your own Image Feature Extractor
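Here is a minimal sketch of such an extractor built with the Keras Sequential API; the layer sizes, the 96x96 input size, and the helper names build_feature_extractor and extract_features are illustrative assumptions rather than a definitive design.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# A minimal CNN feature extractor: only the feature learning part
# (Conv + Pooling blocks) is kept; there is no classification head.
def build_feature_extractor(input_shape=(96, 96, 3)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),  # flattened feature vector instead of class scores
    ])

def extract_features(image_paths, extractor, target_size=(96, 96)):
    # Returns a dictionary mapping each image path to its feature vector.
    features = {}
    for path in image_paths:
        img = tf.keras.preprocessing.image.load_img(path, target_size=target_size)
        arr = tf.keras.preprocessing.image.img_to_array(img) / 255.0
        arr = np.expand_dims(arr, axis=0)           # add a batch dimension
        features[path] = extractor.predict(arr)[0]  # 1-D feature vector
    return features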

Yes, anyone can build their own image feature extractor using the above code, but there is a catch…

The above model is too simple to extract every important detail from our set of images, which will hurt the performance of our overall model. Making the model too complex (multiple dense layers with a high number of neurons) is also challenging without high-performance GPUs and systems.

To solve this problem, we can use popular pre-trained CNN models (VGG-16, ResNet50, etc., developed by researchers at different universities and organizations) that are available in TensorFlow and can be used to extract features from images. Just remember to remove the output layer from the model before using it for feature extraction.

The code below shows how to use these pre-trained models in TensorFlow to extract features from images.

Using ResNet50 Model for Image Feature Extraction
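A sketch of this step is shown below. It assumes the Flickr8k images sit in an "Images" folder (as in the Kaggle upload) and resizes them to 96x96, so that the flattened ResNet50 feature map has shape (18432,), i.e. 3 x 3 x 2048; the folder name and image size are assumptions.

import os
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# ResNet50 pre-trained on ImageNet; include_top=False drops the
# classification (output) layers, keeping only the feature learning part.
resnet = ResNet50(weights="imagenet", include_top=False, input_shape=(96, 96, 3))

image_dir = "Images"  # folder containing the Flickr8k images (an assumption)
image_feature_dictionary = {}

for name in os.listdir(image_dir):
    img = load_img(os.path.join(image_dir, name), target_size=(96, 96))
    arr = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    # (1, 3, 3, 2048) feature map flattened into a (18432,) vector
    image_feature_dictionary[name] = resnet.predict(arr).flatten()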

If you execute the above code, you will see that each image feature is simply a NumPy array of shape (18432,):

image_feature_dictionary[list(image_feature_dictionary.keys())[0]].shape
(18432,)

Next we will develop our LSTM Network (RNN) for generating captions for images.

LSTM for generating Captions

Text generation is one of the most popular applications of LSTM networks. An LSTM cell (the basic building block of LSTM networks) can generate output based on what it has seen earlier in the sequence, i.e. it retains the previous outputs (the memory) and uses that memory to generate (predict) the next output in the sequence.

In our dataset, we have 5 captions for each image, i.e. 40k captions in total.

Let's look at our dataset:

Image from Flickr8k dataset

Captions for Image

  1. A child in a pink dress is climbing up a set of stairs in an entry way.
  2. A girl going into a wooden building.
  3. A little girl climbing into a wooden playhouse.
  4. A little girl climbing the stairs to her playhouse.
  5. A little girl in a pink dress going into a wooden cabin.

As seen, all the captions describe the image quite well. Our task now is to design an RNN that can replicate this for any similar set of images.

Coming back to the original task, we first have to look at how an LSTM network generates text. To an LSTM network, a caption is nothing but a long sequence of individual words (encoded as integers) placed together. Using this information, it tries to predict the next word in the sequence based on the previous words (its memory).

In our case, since the captions can be of varying length, we first need to mark the start and end of every caption. Let's see what this actually means:

Preparing Training Dataset

First we will add <start> and <end> to every caption in our dataset. After that we will tokenize each caption in the training dataset before creating the final vocabulary. When training the model, we will drop tokens (words) from the vocabulary that have a frequency of 10 or less. This step improves the generalization of our model and prevents it from overfitting the training dataset.

The following code can be used to implement this:

Code for Loading Captions
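A minimal sketch of this loading step, assuming the captions live in the captions.txt file of the Kaggle dataset with one image,caption pair per line:

# Build a dictionary mapping each image name to its list of captions,
# wrapping every caption with the <start> and <end> markers.
train_image_captions = {}

with open("captions.txt", "r") as f:
    next(f)  # skip the "image,caption" header line
    for line in f:
        image_name, caption = line.strip().split(",", 1)
        caption = "<start> " + caption.strip() + " <end>"
        train_image_captions.setdefault(image_name, []).append(caption)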

The above code generates the following output:

train_image_captions[list(train_image_captions.keys())[150]]
['<start> A brown dog chases a tattered ball around the yard . <end>',
'<start> A brown dog is chasing a tattered soccer ball across a low cut field . <end>',
'<start> Large brown dog playing with a white soccer ball in the grass . <end>',
'<start> Tan dog chasing a ball . <end>',
'<start> The tan dog is chasing a ball . <end>']

Once we have loaded the captions, we will tokenize everything using spaCy and Tokenizer (from the tensorflow.keras.preprocessing.text module).

Tokenization is nothing but breaking a sentence down into individual words while removing special characters and lowercasing everything. The result is a corpus of meaningful words (tokens) that we can encode before using them as input to our model.

Code for tokenizing captions
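A sketch of the tokenization step using the Keras Tokenizer (the spaCy pre-processing and the rare-word filtering described above are omitted here for brevity). The Tokenizer's default filters strip special characters, including the angle brackets around <start> and <end>:

from tensorflow.keras.preprocessing.text import Tokenizer

# Flatten the caption dictionary into a single list of strings.
all_captions = [cap for caps in train_image_captions.values() for cap in caps]

# lower=True lowercases everything; the default filters remove punctuation
# and special characters before the words are indexed.
tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(all_captions)

print(tokenizer.word_index)  # e.g. {'a': 1, 'end': 2, 'start': 3, ...}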

The above code generates a dictionary mapping every token to an integer (and vice versa). A sample of the output looks like this:

tokenizer.word_index
{'a': 1,
'end': 2,
'start': 3,
'in': 4,
'the': 5,
'on': 6,
'is': 7,
'and': 8,
'dog': 9,
'with': 10,
'man': 11,
'of': 12,
'two': 13,
'black': 14,
'white': 15,
'boy': 16,
'woman': 17,
'girl': 18,
'wearing': 19,
'are': 20,
'brown': 21.....}

After this, we need to find the size of our vocabulary and the length of the longest caption. Let's see why both of these matter when creating our model.

  1. Vocabulary Length → The vocabulary length is simply the count of unique words in our corpus. The number of neurons in our output layer equals vocabulary length + 1 (the +1 accounts for index 0, which is reserved for sequence padding), since at every iteration we need our model to predict a word from this corpus.
  2. Maximum Caption Length → In our dataset, captions vary in length, even for the same image. Let's look at this in a bit more detail:
Captions Sequence Length

As you can see, every caption has a different length, so we cannot use them directly as input to our LSTM model. To fix this, we will pad every caption to the maximum caption length.

Padded Sequence

Notice that every sequence is padded with extra 0s to bring it up to the maximum sequence length.

Finding Vocabulary and Max Caption Length
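A sketch of how both values can be computed from the tokenizer and the caption dictionary built above:

# Vocabulary length: unique tokens + 1, since index 0 is reserved for padding.
vocabulary_size = len(tokenizer.word_index) + 1

# Maximum caption length, measured in tokens.
max_caption_length = max(
    len(caption.split())
    for captions in train_image_captions.values()
    for caption in captions
)

print(vocabulary_size, max_caption_length)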

Next we need to create the training dataset for our model, specifying the inputs and output. For our problem we have 2 inputs and 1 output. Let's look at this in a bit more detail:

Training Dataset

For every image we have-

  1. Input Image Feature (X1) → NumPy array of shape (18432,) extracted using the ResNet50 model.
  2. Input Sequence (X2) → This requires a bit more explanation. Every caption is just a sequence of tokens, and our model tries to predict the next element of that sequence. So for every caption we start with the first element of the sequence, and the corresponding output is the very next element. In the next iteration the previous output is appended to the input (the memory), and this goes on until we reach the end of the sequence.
  3. Output (y) → The next word in the sequence.

The code below can be used to implement the above logic for creating the training dataset:

Code for creating training dataset
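The following is a sketch of that logic. It keeps everything in memory as plain NumPy arrays (a generator could be used instead for large datasets), and the helper name create_training_data is an assumption:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_training_data(captions_dict, feature_dict, tokenizer,
                         max_caption_length, vocabulary_size):
    X1, X2, y = [], [], []
    for image_name, captions in captions_dict.items():
        feature = feature_dict[image_name]  # (18432,) image feature
        for caption in captions:
            seq = tokenizer.texts_to_sequences([caption])[0]
            # Split the caption into (partial sequence -> next word) pairs.
            for i in range(1, len(seq)):
                in_seq = pad_sequences([seq[:i]], maxlen=max_caption_length)[0]
                out_word = to_categorical([seq[i]], num_classes=vocabulary_size)[0]
                X1.append(feature)
                X2.append(in_seq)
                y.append(out_word)
    return np.array(X1), np.array(X2), np.array(y)

X1, X2, y = create_training_data(train_image_captions, image_feature_dictionary,
                                 tokenizer, max_caption_length, vocabulary_size)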

Combining the two Sub Networks

Now that we have developed our two sub-networks (the image feature extractor and the LSTM for generating captions), let's combine them to create our final model.

Final Model Architecture

For any new image, our model will generate a caption based on the knowledge it acquired while training over a similar set of images and captions.

Note: For good results, the testing images should be similar to the ones used during training.

The following code creates our final model.
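Below is a sketch of one way to wire the two branches together with the Keras functional API. The 256-unit layer sizes and the dropout rates are illustrative assumptions, while the (18432,) feature size, the 300-dimensional embedding and the softmax over the vocabulary follow from the earlier steps:

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Image branch: (18432,) ResNet feature -> dense projection.
image_input = Input(shape=(18432,))
image_branch = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Caption branch: padded word-index sequence -> embedding -> LSTM.
caption_input = Input(shape=(max_caption_length,))
embedding_layer = Embedding(vocabulary_size, 300, mask_zero=True)
caption_branch = LSTM(256)(Dropout(0.5)(embedding_layer(caption_input)))

# Decoder: merge both branches and predict the next word in the sequence.
decoder = Dense(256, activation="relu")(add([image_branch, caption_branch]))
output = Dense(vocabulary_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)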

Before compiling the model, we need to add weights to our embedding layer. This is done by creating a word embedding (a representation of a token in a high-dimensional vector space) for every token present in our corpus (vocabulary). There are some very popular word embedding models that can be used for this purpose (GloVe, Gensim word embedding models, etc.).

We will use spaCy's built-in "en_core_web_lg" model to create vector representations of our tokens (i.e. every token is represented as a (300,) NumPy array).

The following code can be used to create the word embeddings and add them to our model's embedding layer.

Create word embeddings and adding to Embedding Layer
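A sketch of that step, assuming the tokenizer and the embedding_layer handle from the model code above:

import numpy as np
import spacy

# en_core_web_lg ships 300-dimensional word vectors.
nlp = spacy.load("en_core_web_lg")

embedding_dim = 300
embedding_matrix = np.zeros((vocabulary_size, embedding_dim))

# Row i holds the spaCy vector of the token with index i; words unknown to
# spaCy simply keep a zero vector.
for word, idx in tokenizer.word_index.items():
    embedding_matrix[idx] = nlp(word).vector

# Copy the matrix into the embedding layer and freeze it so the pre-trained
# vectors are not updated during training.
embedding_layer.set_weights([embedding_matrix])
embedding_layer.trainable = False

Freezing the embedding layer is a design choice: leaving it trainable would let the vectors adapt to the captioning task, at the cost of more parameters to fit.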

Now that we have created everything, we just have to compile and train our model.

Note: The training time of this network will be very high (with a high number of epochs) due to the complex nature of our task.

Training Model
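A minimal sketch of the compile-and-fit step; the optimizer, batch size and epoch count below are illustrative, not tuned values:

# Categorical cross-entropy matches the one-hot "next word" targets.
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Both inputs (image feature and partial caption sequence) are fed together.
model.fit([X1, X2], y, epochs=50, batch_size=64)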

To generate a caption for a new image, we first need to convert the image into a feature array of the same shape as those in the training dataset, (18432,), and use <start> as the initial input to the model.

During sequence generation, we terminate the process as soon as we encounter an <end> token in the output.

Method for generating captions
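A sketch of such a method using simple greedy decoding (always taking the most probable next word); the function name generate_caption is an assumption:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(image_feature, model, tokenizer, max_caption_length):
    # Start from <start> and keep predicting the next word until <end> is
    # produced or the maximum caption length is reached.
    caption = "<start>"
    for _ in range(max_caption_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_caption_length)
        preds = model.predict([image_feature.reshape(1, -1), seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(preds)))
        if word is None or word == "end":  # the tokenizer strips the <> brackets
            break
        caption += " " + word
    return caption.replace("<start>", "").strip()

Greedy decoding keeps the code simple; beam search would usually produce slightly better captions at the cost of extra computation.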

Now let's check the output of our model.

As you can see, our model generates reasonably good captions for some images, but for others the captions are not very descriptive.

This can be improved by increasing the number of epochs, adding more training data, or adding layers to our final model, but all of this requires high-end machines (GPUs) for processing.

So this is how we can generate captions for images using our own deep learning model.
