Hands-On Practice with Time Distributed Layers Using TensorFlow

Soorya Prakash · Published in Level Up Coding · Aug 16, 2021 · 8 min read


In this article, I will walk you through solving a problem that takes a sequence of images as input, using TensorFlow. I faced this problem in the final round of Data Unchained, a Machine Learning competition by Elysium, IIIT Delhi.

Problem Statement

You are the tech guy in the Los Angeles Police Department. A recent surge in illegal racing is causing a lot of trouble for the citizens, so the police are on alert. Yesterday they brought in a person for this very reason. You got access to his dashboard cam, but you can't determine whether he was speeding because the hard drive was corrupted. Still, you managed to recover some pieces of the video clip, but from those alone his speed is not conclusive just by looking.
Being a highly motivated person, you decide to fetch clips from another car's dash cam with the velocity noted down. Your task now is to train a model that determines the speed of a car from a series of dash-cam images (i.e., 8 frames). The camcorder records the clip at 20 fps.

Kaggle Link for the Competition

An example of training data:

Image Sequence in Order

Output: Car’s Velocity in X and Y direction: [-1.2287469463, -0.0101401592]

Velocity values are scaled.

Understanding the Problem Statement

The problem is to train a model that can determine the speed of a car from just a series of images taken by its dash camera. We input 8 images to the model, and it outputs the X and Y components of the car's velocity. In simple terms, we predict the speed of the car from 8 images taken by its dashboard camera.

Understanding Time Distributed Layers

A common Computer Vision task is building a model that classifies images (for example, dog vs. cat). You input a single image to the model, and it outputs 0 or 1 (0 being Cat and 1 being Dog).

Dog Cat Classifier

But in our case, we have 8 images per sample.

One way to approach this problem is to merge all the images into a single image and feed the merged image to the model, but then we lose key information that could help the model predict. In our problem, small differences in the locations of objects across the images help the model determine the speed of the car, and merging the images throws this information away.

So we need a way to pass 8 images to the model and get one output without losing information. Here comes our savior: the TimeDistributed layer from TensorFlow.

This specialized layer applies the same inner layer to several inputs and produces an output for each input, so we can combine those outputs and pass them to another layer to make predictions.

In this way, we use only one layer that performs its operations on 8 separate images and gives an output for each.

Key note: all the convolution layers share the same weights, so they are technically the same layer. For easy understanding, you can think of them as clones of one layer.

The TimeDistributed layer applies the same instance of the layer to each of the 8 images, so we don't have 8 different sets of weights for this layer; the same set of weights is applied to all the images.

So by using this layer, we do not increase the complexity of the model (the number of parameters), yet we give the model the ability to learn from 8 different images separately with just one layer.

Intuitively, these images are taken at timestamps close together and do not differ much, so one layer is enough to extract the key features. It's like taking photos in burst mode on a phone: the same layer can be applied to all 8 images and still identify the key features in each of them.
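To see this in action, here is a minimal sketch (not from the competition code) of TimeDistributed wrapping a single Conv2D: the one wrapped layer is applied to every frame of a (batch, time, height, width, channels) input, yet the parameter count stays that of a single Conv2D:

import numpy as np
from tensorflow.keras.layers import Conv2D, TimeDistributed

# One Conv2D, wrapped so the same weights run on every frame of the sequence
td_conv = TimeDistributed(Conv2D(16, (3, 3), activation='relu'))

# Dummy batch: 2 samples, 8 frames each, 180x320 RGB images
frames = np.random.rand(2, 8, 180, 320, 3).astype('float32')
features = td_conv(frames)

print(features.shape)  # (2, 8, 178, 318, 16): one feature map per frame
# A single shared weight set: 3*3*3*16 kernel weights + 16 biases = 448 parameters
print(td_conv.layer.count_params())  # 448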

To Learn More about Time Distributed Layers: How to work with Time Distributed data in a neural network

I thank Patrice Ferlet, the writer of that article. It is gold and has helped me a lot in solving this problem.

Knowing the basics required to solve this problem, let us dive into the code.

Structure of the Data provided:

Importing necessary libraries:

import cv2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import json
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense, LSTM, Flatten, TimeDistributed, Conv2D, Dropout
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.applications.vgg16 import VGG16

Mounting Google drive and Extracting the Data:

from google.colab import drive
drive.mount('/content/gdrive/')

import zipfile

final_showdown_zip = '/content/gdrive/My Drive/final-showdown.zip'
zip_ref = zipfile.ZipFile(final_showdown_zip, 'r')
zip_ref.extractall('/content/gdrive/My Drive/final_showdown')
zip_ref.close()

The images in our problem have dimensions (720, 1280, 3). We have to reduce the image dimensions for faster processing and training. To preserve the aspect ratio while shrinking, we divide each dimension by 4: (720/4, 1280/4) = (180, 320), keeping the 3 color channels, i.e., (180, 320, 3).

A function to resize an image from (720, 1280, 3) to (180, 320, 3) for faster computation while preserving the aspect ratio:

def get_img(img_path, printer=True):
    original_img = cv2.imread(img_path, cv2.IMREAD_COLOR)
    if printer: print("original dim:", original_img.shape)
    # Note: cv2.resize takes (width, height), so (320, 180) yields a 180x320 image
    resized_img = cv2.resize(original_img, (320, 180), interpolation=cv2.INTER_CUBIC)
    if printer: print("resized dim:", resized_img.shape)
    return resized_img

Testing our Function:

img_path = "/content/gdrive/My Drive/final_showdown/Train/Train/422/imgs/001.jpg"
resized_img = get_img(img_path)
plt.imshow(resized_img)

The next step is to concatenate the 8 images of a sample for our TimeDistributed layer.

Input shape of our TimeDistributed layer = (no. of images per sample, height of the image, width of the image, no. of channels)

prefix = "/content/gdrive/My Drive/final_showdown/Train/Train/400/imgs/00"

## Testing the concatenation method
X_sample = []
for idx in range(1, 9):
    img_path = prefix + str(idx) + ".jpg"
    resized_img = get_img(img_path, printer=False)
    X_sample.append(resized_img)
print(np.array(X_sample).shape)

Converting the training data into the required format to feed into the model:

main_prefix = "/content/gdrive/My Drive/final_showdown/Train/Train/"
X_train_check = []
for i in range(1, 457):
    path = main_prefix + str(i) + "/imgs/00"
    X_sample = []
    for idx in range(1, 9):
        img_path = path + str(idx) + ".jpg"
        resized_img = get_img(img_path, printer=False)
        X_sample.append(resized_img)
    X_train_check.append(np.array(X_sample))
print(np.array(X_train_check).shape)

Our NumPy array X_train_check has shape (456, 8, 180, 320, 3), which corresponds with our idea of (456 training examples, 8 images per example, the dimensions of the images).
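A quick sanity check (a small added snippet, assuming the loading loop above ran successfully; X_train is a hypothetical name) confirms the array matches the (samples, frames, height, width, channels) layout that TimeDistributed expects:

X_train = np.array(X_train_check)
assert X_train.shape == (456, 8, 180, 320, 3), X_train.shape
print(X_train.shape)  # (456, 8, 180, 320, 3)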

Extracting the velocity in the X and Y directions from each annotation file and making a NumPy array with 2 columns (velocity in X and Y):

main_prefix = "/content/gdrive/My Drive/final_showdown/Train/Train/"
y_train_check = []
for i in range(1, 457):
    path = main_prefix + str(i) + "/annotation.json"
    with open(path) as f:
        data = json.load(f)
    y_train_check.append(np.array(data[0]["velocity"]))
print(np.array(y_train_check).shape)
print(np.array(y_train_check)[:5])

Model Structure

We will use Time Distributed convolutional layers, followed by an LSTM to capture the sequential information, and then Dense layers to produce the final output. As we have a limited amount of training data, we use the pre-trained InceptionResNetV2 model inside the TimeDistributed layer to extract details from each image.

LSTMs are a special kind of RNN capable of learning long-term dependencies. We use an LSTM to process the chronological features produced by the TimeDistributed layers above and extract meaningful information from them.
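To make the data flow concrete, here is a rough shape walkthrough of the architecture (a sketch; h', w', and c' stand for the backbone's output feature-map dimensions, which depend on the chosen pre-trained model):

# Shape walkthrough (sketch):
# Input:                       (batch, 8, 180, 320, 3)    8 dash-cam frames
# TimeDistributed(backbone):   (batch, 8, h', w', c')     one feature map per frame
# TimeDistributed(Flatten()):  (batch, 8, h' * w' * c')   one feature vector per frame
# LSTM(256):                   (batch, 256)               summary of the whole sequence
# Dense(64) -> Dense(2):       (batch, 2)                 [velocity_x, velocity_y]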

Visual Explanation of the Model. Image Source

What are Pretrained Models?

A pre-trained model has been previously trained on a dataset and contains the weights and biases that represent the features of whichever dataset it was trained on. Learned features are often transferable to different data.

Inception-ResNet-v2 is a convolutional neural network that is trained on more than a million images from the ImageNet database. The network is 164 layers deep and can classify images into 1000 object categories, such as a keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images.

Feeding the features of this pre-trained model into our LSTM will increase our accuracy. We use the InceptionResNetV2 model but train only its last 4 layers; fully retraining such a huge model would take a lot of computation power and would defeat the purpose of pretraining.

# The best model for this problem was InceptionResNetV2
inceptionresnet = tf.keras.applications.InceptionResNetV2(
    include_top=False,
    weights="imagenet",
    input_tensor=None,
    input_shape=(180, 320, 3)
)

# We train only the last 4 layers of this model while freezing the other layers.
for layer in inceptionresnet.layers[:-4]:
    layer.trainable = False

Building the Model:

model = Sequential()

# Add the Inception model for 8 input images (keeping the right shape),
# using a TimeDistributed layer to feed the image sequence
model.add(TimeDistributed(inceptionresnet, input_shape=(8, 180, 320, 3)))

# Flatten each output so the LSTM receives 8 one-dimensional feature vectors
model.add(TimeDistributed(Flatten()))

# LSTM to capture the sequential information
model.add(LSTM(256, activation='relu', return_sequences=False))

# Dense layer
model.add(Dense(64, activation='relu'))

# Dropout, then the final Dense layer of 2 neurons
# [velocity in X direction and velocity in Y direction]
model.add(Dropout(.5))
model.add(Dense(2))

model.compile(optimizer='adam', loss='mean_squared_error',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.summary()
The model summary describes the layers used and the number of parameters in each.

Similar Model Structure:

Model Structure. Image Source
# Fitting the model to our training data
r = model.fit(np.array(X_train_check), np.array(y_train_check),
              validation_split=0.2, batch_size=38, epochs=10)
Metrics

Training the model for 10 epochs was found to be enough while avoiding overfitting.

The loss reduces with each epoch. The training and validation losses converge downward significantly with each epoch, signifying that the model has learned well.
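To visualize this, the loss curves can be plotted from the History object r returned by model.fit (a minimal sketch using the default Keras history keys):

# Plot training vs. validation loss per epoch from the History object r
plt.plot(r.history['loss'], label='training loss')
plt.plot(r.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('MSE loss')
plt.legend()
plt.show()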

Predicting with this model on the test data gave me a root mean squared error of 1.76 and helped me secure second position in the competition.
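For completeness, inference follows the same preprocessing. In this sketch, X_test_check is a hypothetical array built from the Test folder exactly like X_train_check was built from the Train folder:

# Hypothetical: X_test_check is prepared the same way as X_train_check
preds = model.predict(np.array(X_test_check))
print(preds.shape)  # (num_test_samples, 2)
print(preds[:5])    # first five [velocity_x, velocity_y] predictions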

GitHub Link to access the notebook.
