Run a self-driving car using JavaScript and TensorFlow.js

Alex Bakoushin · Level Up Coding · May 30, 2020

In this tutorial, we will learn how to train a deep learning model to autonomously drive a car in a virtual driving simulator using Node.js, TensorFlow.js, and the Udacity Self-Driving Car Simulator. We will see how easy it is to create and train a deep learning model entirely in JavaScript.

You can find the full code for this project in the GitHub repository.

Simulator

The car will run in a simulated environment provided by the open-source Udacity Self-Driving Car Simulator. You can download pre-built versions for Linux, Mac, and Windows from their GitHub repo.

We will use one of the builds under the “Term 1” header. These provide two virtual tracks to drive through; we will use the first one.

Plan of action

To make a car drive autonomously using deep learning, we have to follow these 5 steps:

  1. Collect data to train the model.
  2. Define a deep learning model.
  3. Load and prepare data for training.
  4. Train the deep learning model.
  5. Use the model to drive a car.

1. Collecting the data

Basically, all we have to do is show the car how we normally drive. This is called behavioral cloning: we want our car to mimic the actions of a human driver in a similar environment.

The single skill we will teach our car is steering. We want it to steer left on left turns, steer right on right turns, and not steer too much when the road goes straight. This is an intentionally oversimplified example, but it captures the basics of self-driving cars.

In fact, NVIDIA researchers were solving this very steering problem back in 2016. Read their great article and watch the video.

Technically, we want to collect a bunch of images from a camera looking through the car’s windshield, along with the corresponding steering values. Our deep learning model will then look at these images and learn how to steer accordingly.

Images from car camera associated with different steering values
Steering wheel icon by zidney from the Noun Project

With Udacity Simulator we can collect that data with ease. Just drive a virtual car like in a video game, and record the journey. The simulator will automatically extract and save all necessary images and steering values for us.

The final dataset is saved in a folder with the following structure: an IMG directory with a bunch of images, and a driving_log.csv file with steering and other telemetry data mapped to the images.

data
+- driving_log.csv
+- IMG
   +- center_2020_05_24_20_51_38_536.jpg
   +- left_2020_05_24_20_51_38_536.jpg
   +- right_2020_05_24_20_51_38_536.jpg
   ...

Instructions

  1. Run the simulator in the Training Mode.
  2. Click on the Recording button in the upper right corner of the screen.
  3. Select a folder to store the data.
  4. Click on the Recording button again to start recording.
  5. Drive the car showing exemplary steering skills. Drive at least one full lap; two or three are better. Remember: this is what the car will learn from!
  6. Click on the Recording button again to stop recording. Wait until it finishes. Meanwhile, enjoy the replay of your epic journey.

2. Defining a deep learning model

The deep learning model is basically a series of operations which takes an input (in our case, an image), passes it through a set of filters, and outputs a result. Our result is a steering value which the model considers appropriate for the given image.

If you are familiar with what JavaScript’s Array.prototype.map() does, our model does basically the same thing: it maps a given image to a suitable steering value. But instead of writing the mapping algorithm ourselves, we let the model figure it out on its own. That is what training is all about.
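To stretch the analogy a little (everything in this snippet is purely illustrative), the trained model plays the role of the callback:

// Purely illustrative: mapping "images" to steering values
const images = ['left-curve.jpg', 'straight.jpg', 'right-curve.jpg'];

// With classic code we would write the mapping rule ourselves...
const handWritten = images.map((name) =>
  name.includes('left') ? -0.5 : name.includes('right') ? 0.5 : 0
);
console.log(handWritten); // [-0.5, 0, 0.5]

// ...with deep learning, the model *is* the mapping, learned from examples:
// const steering = images.map((image) => model.predict(image));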

There are several types of network architectures for deep learning. We will use one called a CNN, which stands for Convolutional Neural Network. This type of neural network is extremely powerful at analyzing images. Typically, a CNN extracts a bunch of features from an image, such as lines, shapes, and colors. Then it uses that information to produce the result.

We will use a very basic yet powerful CNN architecture: one cropping layer, followed by two convolutional layers, each followed by a max-pooling layer, all followed by three dense layers. For all layers except the very last one, we will use the ReLU activation function. The last layer will output only one value: our steering value.

If that sounds like nonsense to you for now, fear not: we will discuss each element of this architecture to get a basic understanding of what is going on.

Here is how we can define a model in code using TensorFlow.js:

const model = tf.sequential({
  layers: [
    // Cropping layer
    tf.layers.cropping2D({
      cropping: [[75, 25], [0, 0]],
      inputShape: [160, 320, 3]
    }),
    // 1st convolutional layer
    tf.layers.conv2d({
      filters: 16,
      kernelSize: [3, 3],
      strides: [2, 2],
      activation: 'relu'
    }),
    // 1st max-pooling layer
    tf.layers.maxPool2d({
      poolSize: [2, 2]
    }),
    // 2nd convolutional layer
    tf.layers.conv2d({
      filters: 32,
      kernelSize: [3, 3],
      strides: [2, 2],
      activation: 'relu'
    }),
    // 2nd max-pooling layer
    tf.layers.maxPool2d({
      poolSize: [2, 2]
    }),
    // Dense layers with a dropout layer
    tf.layers.flatten(),
    tf.layers.dense({ units: 1024, activation: 'relu' }),
    tf.layers.dropout({ rate: 0.25 }),
    tf.layers.dense({ units: 128, activation: 'relu' }),
    tf.layers.dense({ units: 1, activation: 'linear' })
  ]
});

model.compile({
  optimizer: 'adam',
  loss: 'meanSquaredError'
});

First, we create our model using tf.sequential(). This means that our model will process the input sequentially towards the output, step by step. Then we add a series of layers to process our data in sequence. Let’s discuss them one by one.
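As a quick sanity check, you can inspect the resulting architecture at any time with model.summary(), a standard tfjs-layers method:

// Prints each layer's type, output shape, and parameter count
model.summary();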

Cropping layer

Images include many extra details, such as landscape or a car hood. These details give no hint of where to steer at a given moment of time. In order to strengthen signal over noise, we will crop the top and the bottom of the image.

// Cropping layer
tf.layers.cropping2D({
  cropping: [[75, 25], [0, 0]],
  inputShape: [160, 320, 3]
})

We will crop 75px from the top of the image (the landscape) and 25px from the bottom of the image (the car hood) using tf.layers.cropping2D.

We have to specify the dimensions of the initial image in the inputShape parameter. This is obligatory for the very first layer of the model. Here, 160 is the height of the image, 320 is its width, and 3 means that it has three color channels: red, green, and blue (also known as RGB).


Convolutional layers

We create two consecutive convolutional layers using tf.layers.conv2d. Their purpose is to extract prominent features, such as lines and shapes, from a given image.

Convolutional layers do this by applying kernel filters to the output of the previous layer.

Learn more:
Image Kernels by Victor Powell
Convolutions video from Udacity Intro to TensorFlow free course

Which exact filters to use, the model decides itself during training. We just define how many different filters to use, the size of the filter kernel, and the stride (the size of the step when applying the filter).

The result is a set of grayscale images with the important features highlighted by the filters. We will never see any of these images, but the model will use them for further calculations.

// 1st convolutional layer
tf.layers.conv2d({
  filters: 16,
  kernelSize: [3, 3],
  strides: [2, 2],
  activation: 'relu'
})

We will consecutively use 16 and 32 filters with a 3x3 kernel and a stride of 2. Feel free to experiment with other sizes.

We will discuss the activation parameter in a moment.

Max-pooling layers

Each convolutional layer is followed by a max-pooling layer. Think of it as a compressed version of the convolutional layer’s result. Max-pooling simply reduces the amount of data while keeping the most important information. Our goal here is the same as with cropping: strengthening the signal over the noise. Adding a max-pooling layer right after a convolutional one is common practice in computer vision.

See also: Max Pooling video from Udacity Intro to TensorFlow free course

// 1st max-pooling layer
tf.layers.maxPool2d({
  poolSize: [2, 2]
})

Here again, we just specify the size of the pooling filter. We use a size of 2x2, which halves both the height and the width of the data. Feel free to experiment with your own sizes.

Dense layers

Finally, we add three dense layers. These are responsible for the actual inference of the steering value using deep learning magic. Think of them as another set of filters needed to extract the information we need. Here again, we just set the size of the filters, and the model figures out the exact values for those filters during training.

Note that before the first dense layer we insert a flatten() layer. Its job is to convert the data from a multi-dimensional representation into a single flat array, because that is the shape dense layers operate on.

// Dense layers with dropout layer
tf.layers.flatten(),
tf.layers.dense({ units: 1024, activation: 'relu' }),
tf.layers.dropout({ rate: 0.25 }),
tf.layers.dense({ units: 128, activation: 'relu' }),
tf.layers.dense({ units: 1, activation: 'linear' })

We will add two consecutive dense layers of sizes 1024 and 128. Feel free to experiment with other sizes and numbers of dense layers.

The last dense layer has just one unit: the steering value we are looking for. This is the result of applying all the filters in the model to the initial image.

Dropout layer

The tf.layers.dropout() layer is used only during training. Its job is to randomly turn off a specified fraction of the connections between two consecutive layers. This technique helps prevent overfitting and is considered good practice for training.

// Dashed connections are dropped out
tf.layers.dense({ units: 1024, activation: 'relu' }),
tf.layers.dropout({ rate: 0.25 }),
tf.layers.dense({ units: 128, activation: 'relu' })

We will add a 25% dropout between the first two dense layers. Feel free to experiment with your own values, or get rid of this layer altogether.

Activation function

For all layers of our network except the last one, we will use the ReLU activation function. An activation function is like another filter at the end of a layer. ReLU drops values less than zero by applying max(0, value). This is considered efficient for training and is widely used.
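A quick illustration of what ReLU does to raw values, using standard tfjs tensor ops:

const tf = require('@tensorflow/tfjs-node');

// ReLU keeps positive values and zeroes out negative ones: max(0, value)
tf.tensor1d([-2, -0.5, 0, 0.5, 2]).relu().print();
// [0, 0, 0, 0.5, 2]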

For the last layer, we don’t want to drop negative values, because our steering value should be between -1 and 1. So we use a linear activation function, which simply means “do nothing with the value, just pass it along”.

Loss and optimizer functions

Lastly, we have to specify which loss and optimizer functions our model will use. These are the tools the model uses during training to set appropriate values for all its filters.

Loss function: used by the model to measure how well the filters work. We will use the one named mean squared error. This is a standard go-to when we want a single numerical value as an output. It measures how far the value returned by the model is from the real value. The lower the loss, the better the model. The goal of training is to minimize the loss value.
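As a quick worked example (the numbers are purely illustrative), here is mean squared error computed with tfjs:

const tf = require('@tensorflow/tfjs-node');

const real = tf.tensor1d([0.1, -0.3]);      // real steering values
const predicted = tf.tensor1d([0.2, -0.1]); // model's predictions

// MSE = ((0.1 - 0.2)^2 + (-0.3 - (-0.1))^2) / 2 = (0.01 + 0.04) / 2 = 0.025
tf.losses.meanSquaredError(real, predicted).print(); // 0.025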

Optimizer function: used by the model to intelligently update all the filters during training. Adam is a standard go-to here.

3. Loading and preparing data for training

In order to feed our model with data, we will create a generator function which will infinitely loop through our driving_log.csv and yield a Buffer with image data along with a steering value.

We are using a generator function because that is what the TensorFlow fitDataset() method expects (we will see it later). Basically, TensorFlow wants us to provide a function which lets it iterate over each item in a dataset.

Learn more on generators:
The definitive guide to the JavaScript generators by Gajus Kuizinas
Generators in Dr. Axel Rauschmayer book, Exploring ES6

Note that each line of driving_log.csv actually contains 3 images: one from the center camera of the car, and two from the side cameras.

/data/IMG/center_2020_05_24_15_52_29_872.jpg,/data/IMG/left_2020_05_24_15_52_29_872.jpg,/data/IMG/right_2020_05_24_15_52_29_872.jpg,-0.35,0,0,25.13947

We will use all three images. For the images from the side cameras, we will apply an offset of 0.333 to the steering value, since those cameras are shifted a little from the center.

Where does this offset come from? It is the result of empirical experiments. People have reported good results with values in the range between 0.2 and 0.4. We will pick the middle ground, but feel free to experiment on your own.

Our data file driving_log.csv contains other telemetry: throttle, brake, and speed. We will ignore those for now.

Here is how we can implement a generator function:
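The full implementation is in the repository; below is a minimal sketch of the idea, assuming the dataset layout shown above. The dataDir constant, the imagePath helper, and the hard-coded column names are illustrative:

const fs = require('fs');
const path = require('path');
const csv = require('csv-parser');

const dataDir = 'data'; // assumed dataset directory
const pathToCSV = path.join(dataDir, 'driving_log.csv');

// Illustrative helper: map a path from the log to our local IMG directory
const imagePath = (logged) => path.join(dataDir, 'IMG', path.basename(logged.trim()));

async function* dataGenerator() {
  const headers = ['center', 'left', 'right', 'steering', 'throttle', 'brake', 'speed'];
  while (true) {
    // (Re)open the CSV and loop through it line by line
    const csvStream = fs.createReadStream(pathToCSV).pipe(csv({ headers }));
    for await (const row of csvStream) {
      const steering = parseFloat(row.steering);
      // One sample per camera; side cameras get the ±0.333 steering offset
      yield [fs.readFileSync(imagePath(row.center)), steering];
      yield [fs.readFileSync(imagePath(row.left)), steering + 0.333];
      yield [fs.readFileSync(imagePath(row.right)), steering - 0.333];
    }
  }
}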

In this function, we create a csvStream using csv-parser, which reads our driving_log.csv and loops through it line by line.

Note that we have three yield statements, since each line of the driving_log.csv file results in 3 individual pairs of images and steering values.

When the for await loop reaches the end of driving_log.csv, the file is opened again from the beginning. This lets us loop over the data as much as we need.

We can check that our generator works properly using the next() method:

const data = dataGenerator();
data.next() // { value: [<Buffer>, <Number>], done: false }
data.next() // { value: [<Buffer>, <Number>], done: false }
...

Converting the data for TensorFlow

Lastly, we need to convert our data to the representation which TensorFlow can understand:

  1. Convert each pixel value in the image from a range between 0 and 255 to a range between 0 and 1, because deep learning models work better with values between 0 and 1.
  2. Wrap our numbers into Tensors, because it’s the type of data TensorFlow works with.

Here is how we can do it:

const batchSize = 64;

const dataset = tf.data
  // Use our generator function
  .generator(dataGenerator)
  // Convert each datapoint to a TensorFlow-specific representation
  .map(([imageBuffer, steering]) => {
    const xs = tf.node.decodeJpeg(imageBuffer).div(255);
    const ys = tf.tensor1d([steering]);
    return { xs, ys };
  })
  // Randomly shuffle data within a buffer of the specified size
  .shuffle(batchSize)
  // Return datapoints in batches of the specified size
  .batch(batchSize);

We create a TensorFlow Dataset out of our generator function, then instruct TensorFlow to process each item in the dataset using a map function.

In the map function we convert the JPEG image into a Tensor, then divide each value in that tensor by 255. As a result, we get a representation of the image suitable for deep learning: a Tensor with values between 0 and 1.

Adjusting any value into a range between 0 and 1 is widely used in deep learning.

The steering value is already between -1 and 1, so we just wrap it into a Tensor.

Note that we are returning each item as an object with two keys: xs and ys. This is how we tell our model which is the input value (x) and which is the expected output value (y).

Lastly, we instruct the dataset to provide data in shuffled batches of 64 items. This means that while learning, our model will look at 64 images at once. Batches usually make training more efficient. Common sizes are 32, 64, 128, etc. Feel free to experiment with your own sizes.

4. Training the deep learning model

After the neural network model is defined and the dataset is ready, we want to finally train our network.

The training code is very simple, since TensorFlow does all the heavy lifting for us behind the scenes:

const linesCount = require('file-lines-count');

// Initialize our model
const model = initModel();

// Calculate the total number of samples in our dataset
const totalSamples = (await linesCount(pathToCSV)) * 3;

// Train the model
await model.fitDataset(dataset, {
  epochs,
  batchesPerEpoch: Math.floor(totalSamples / batchSize)
});

// Save the trained model
await model.save(`file://${modelDir}`);

First, we initialize our model using the initModel() function we created before. Then we call the model.fitDataset() method, which does the actual training. Finally, we save the trained model to disk so we can later use it to drive the car.

We provide model.fitDataset() with the following parameters:

dataset: the TensorFlow Dataset we created earlier;

epochs: the number of epochs. An epoch is one pass through the entire dataset during training. By specifying the number of epochs we define how many times we want to loop through the dataset. Usually, we want to loop several times, because each epoch yields a better adjustment of all the filters in the model;

batchesPerEpoch: the number of batches of data in a single epoch. Since we are providing our dataset as a generator, the model doesn’t know how many items there are in total. We have to tell it ourselves by specifying how many batches our dataset contains. We calculate this value as the total number of samples divided by the batch size. Since each line in our CSV file yields 3 samples, we count the total number of samples as the number of lines in the CSV file multiplied by 3. To count the lines of a file in one line of code, we use the tiny file-lines-count library.
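For example (numbers purely illustrative): a driving_log.csv with 9,000 lines yields 9,000 × 3 = 27,000 samples; with a batch size of 64 that gives Math.floor(27000 / 64) = 421 batches per epoch.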

Command-line arguments

We may want to experiment with training using different datasets and for a different number of epochs. It would be convenient to be able to specify the dataset directory, number of epochs, and the final model directory as command-line arguments. We could use minimist for that:

const parseArgs = require('minimist');

const {
  data: dataDir = 'data',
  model: modelDir = 'model',
  epochs = 10
} = parseArgs(process.argv.slice(2));

Here we read the command-line arguments data, model, and epochs, and assign them to the constants dataDir, modelDir, and epochs respectively. The training code can then be invoked like this:

node train.js --data mydatadir --epochs 5 --model mymodeldir

We also provide the default values for each parameter, so they could be omitted.

Training

Finally, we can start training by running our script. For instance, if our recording is stored in the track1 directory within the current user’s Documents directory and we want to train for 3 epochs, here is how we specify it:

node train.js --data ~/Documents/track1 --epochs 3

Here is an example of the output:

Epoch 1 / 3
eta=0.0 ==========================================================>
103621ms 761918us/step - loss=0.0667
Epoch 2 / 3
eta=0.0 ==================================================================>
127557ms 937918us/step - loss=0.0568
Epoch 3 / 3
eta=0.0 ==================================================================>
116276ms 854970us/step - loss=0.0588
Model saved to: model

Note that the loss increased a bit on the third epoch. Usually, this is a signal to stop training, because the model is no longer improving.

On a MacBook Air with a 1.6 GHz Dual-Core Intel Core i5 processor, each epoch takes roughly 2 minutes.

5. Using the model to drive a car

Now we want to send steering commands to the car in the simulator.

Communication with the simulator is done over the WebSocket protocol, wrapped in the socket.io library.

Udacity Self-Driving Car Simulator feedback cycle

In the Autonomous Mode, the Udacity Simulator connects to a socket.io server on port 4567. Once connected, it sends a message with telemetry data. The server can respond to that message with steering instructions, which in turn yields another telemetry message, and so on.

Here is the driving script:
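The full script lives in the repository; below is a minimal sketch of its core. The telemetry and steer message fields follow the simulator’s protocol; modelDir and maxSpeed come from the command-line arguments described below, and the 0.35 throttle constant is illustrative:

const tf = require('@tensorflow/tfjs-node');

(async () => {
  // Load the previously trained model from disk
  const model = await tf.loadLayersModel(`file://${modelDir}/model.json`);

  // The simulator connects to a socket.io server on port 4567
  const server = require('socket.io')(4567);

  server.on('connection', (socket) => {
    socket.on('telemetry', (data) => {
      // Decode the base64-encoded camera image, normalize pixel values
      // to [0, 1], and add the batch dimension expected by the model
      const image = tf.node
        .decodeJpeg(Buffer.from(data.image, 'base64'))
        .div(255)
        .reshape([1, 160, 320, 3]);

      // Predict the steering value and convert the Tensor back to a number
      const steering = model.predict(image).squeeze().arraySync();

      // Simple throttle control: accelerate until we reach the maximum speed
      const throttle = parseFloat(data.speed) < maxSpeed ? 0.35 : 0;

      // The simulator expects all numeric values as strings
      socket.emit('steer', {
        steering_angle: steering.toString(),
        throttle: throttle.toString()
      });
    });
  });
})();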

First, we wait until our model is loaded, then we start listening on port 4567 for socket.io messages.

On the telemetry message, we parse the image data (the image is base64-encoded) and feed it into our model.

Want to know what other telemetry the simulator sends?
Look inside its code on GitHub.

In order to pass the received image to our model, we convert it to a Tensor and normalize all pixel values to the range between 0 and 1. We also have to add another dimension to our tensor—the batch size. It’s what 1 stands for in this statement: reshape([1, 160, 320, 3]).

Next, we get the steering value using model.predict(). The tail of this statement, squeeze().arraySync(), is needed to convert the steering value from a Tensor back to a number.

Finally, we send the steering value into the socket by emitting a steer message. Note that we convert all numerical values to strings, because that is what the simulator expects.

We also send a throttle value, which we calculate so as not to exceed the maximum speed.

The driving script has two command-line arguments for convenience:
model: directory with the model to use;
speed: maximum speed during driving. Note that the car in the simulator is already limited to around 30 mph.

Now, we are all set to drive the autonomous car!

Assuming the model is in the default model directory of our project, let’s start the simulator in autonomous mode and run our driving script:

node drive.js

And it is driving totally autonomously. How rewarding it is to watch!

Next steps

Here are a few next steps on how this project could be improved further.

Training for the second track

We trained our model only on the data collected from the first track of the simulator. In addition, we could train another model explicitly for the second track or train it using data from both tracks to make it universally autonomous.

Validation

For the sake of brevity, we didn’t split our dataset into training and validation parts. Such a split is standard practice in machine learning for controlling the quality of a model. The validation dataset is used only to measure the loss of the model; validation samples are never used in training. Thus we can measure how well our model works on previously unseen data.
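As a sketch (assuming a separate validationDataset built the same way as our training dataset, and the batchesPerEpoch value computed earlier), fitDataset() accepts validation options directly:

await model.fitDataset(dataset, {
  epochs,
  batchesPerEpoch,
  // Measure the loss on held-out data after each epoch
  validationData: validationDataset,
  // How many batches to draw from the validation dataset
  validationBatches: 10
});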

Image augmentation

This technique enriches our image dataset with variations of the same image: shifted, zoomed, with altered brightness, and so on. The more varied data our model sees, the stronger it becomes. However, image processing is computationally intensive and could significantly slow down training.
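As a minimal sketch, one cheap augmentation is a horizontal flip, which also requires negating the steering value. The function below is illustrative and could be applied inside the dataset’s map step:

// Flip the image horizontally and negate the steering value accordingly
const flipHorizontally = ({ xs, ys }) => ({
  // tf.image.flipLeftRight expects a batch, so we add and remove a dimension
  xs: tf.image.flipLeftRight(xs.expandDims(0)).squeeze([0]),
  ys: ys.mul(-1)
});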
