A Neural Network In Under 4KB

Dan McLeran
Published in Level Up Coding
6 min read · Jul 21, 2020


Introduction

Years ago, I began writing C++ template libraries with the goal of instantiating and running neural networks and machine learning algorithms within embedded systems. I did not have an FPU (floating-point unit), GPU (graphics processing unit), or any vectorized instructions (e.g. SIMD) at my disposal, so I needed to ensure I could run these algorithms on only a very simple CPU. Memory to store code and data was also highly constrained in my environment; I didn't have room to store code and data that I wasn't going to use.

My inspiration for these libraries was Andrei Alexandrescu’s Modern C++ Design. This book was my first exposure to the power of C++ templates and template metaprogramming. In it, he describes how to design your code using policy classes as template parameters to customize behavior. This fit my needs perfectly, since I wanted the code to be scalable from one problem to the next (e.g. resolution, activation function(s), etc.). I also used this technique to make the code within these libraries as small and efficient as possible.

To demonstrate what my neural network library can do, I chose to implement a classic “hello world” kind of program for neural networks (xor.cpp). Here I will demonstrate how to instantiate and train a neural network to predict the outcome of the mathematical XOR function. We will generate and feed known data into the neural network and train it to become an XOR predictor. By inspecting the generated output file during compilation, we can prove to ourselves that the whole neural network, with its associated code and data, occupies a bit under 4KB in the final image, assuming we turn on compiler optimizations.

Q Format

To make the neural network super small and fast, we can’t afford the luxury of floating point. My hardware doesn’t have a floating-point unit, so any floating-point math would have to be emulated in software, costing both cycles and code space. We need to use fixed-point numbers to solve this problem. Tinymind contains a C++ template library to help us here: Q format. See this article for a full explanation of how it works. We declare the size and resolution of our Q-format type like this:
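The snippet below is a sketch of that declaration. The QValue template takes the number of integer bits, the number of fractional bits, and the signedness; the exact parameter list is my reading of the library and may differ from the current tinymind headers:

// Q8.8: 8 integer bits, 8 fractional bits, signed
// (exact template signature is an assumption, not verbatim from xor.cpp)
typedef tinymind::QValue<8, 8, true> QValueType;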

We have now declared a signed fixed-point type which can represent values from -128 to 127.99609375. Our resolution is 2^(-8), or 0.00390625, since we are using 8 bits for the fractional part of our Q-format number. You’ll see from this example that we have more than enough dynamic range and resolution to solve this problem without floating point. The integer and fractional portions of the value are concatenated into a 16-bit field. As an example, the floating-point number 1.5 is represented as 0x180 in the Q format chosen above. This is because the integer portion (1) is shifted up by the number of fractional bits (8), resulting in 0x100. We then OR in the value which represents half of the dynamic range of 8 bits (128, or 0x80 in hex).
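To make the encoding concrete, here is a minimal, standalone sketch (plain C++, no tinymind required) that reproduces the arithmetic above:

#include <cstdint>
#include <cstdio>

int main()
{
    // Encode 1.5 in Q8.8 by hand: shift the integer part (1) up by the
    // number of fractional bits (8), then OR in the fraction (0.5 -> 0x80).
    const std::uint16_t integerPart  = 1u << 8;             // 0x100
    const std::uint16_t fractionPart = 1u << 7;             // 0.5 * 2^8 = 0x80
    const std::uint16_t q88 = integerPart | fractionPart;   // 0x180

    std::printf("1.5 in Q8.8 = 0x%X\n", q88); // prints 0x180
    return 0;
}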

Neural Network

We need to specify our neural network architecture. We do it as follows:
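The constants below are a sketch of that specification; the names are illustrative rather than verbatim from xor.cpp:

#define NUMBER_OF_INPUT_NEURONS  2
#define NUMBER_OF_HIDDEN_LAYERS  1
#define NUMBER_OF_HIDDEN_NEURONS 3
#define NUMBER_OF_OUTPUT_NEURONS 1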

We want a neural network with 2 input neurons, 1 hidden layer with 3 neurons in it, and 1 output neuron. The 2 input neurons receive the streams of 1s and 0s from the training data. The output neuron outputs the predicted result of the operation. The hidden layer learns the transfer function which predicts the result, given the input values and the known result.

[Figure: a graphical view of the 2-3-1 neural network]

We typedef the neural network transfer functions policy as well as the neural network itself:
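The typedefs below are a sketch of what this looks like; the template names and parameter order follow the description in the next paragraph and are my assumptions, not the verbatim tinymind API:

// Transfer functions policy: value type, RNG, and the two activation policies
// (names here are illustrative, not verbatim from xor.cpp)
typedef tinymind::FixedPointTransferFunctions<
        QValueType,                                  // ValueType
        RandomNumberGeneratorPolicy,                 // hypothetical RNG policy
        tinymind::TanhActivationPolicy<QValueType>,  // Input->Hidden activation
        tinymind::TanhActivationPolicy<QValueType>   // Hidden->Output activation
    > TransferFunctionsType;

// The neural network itself, parameterized by the architecture constants
typedef tinymind::MultilayerPerceptron<
        QValueType,
        NUMBER_OF_INPUT_NEURONS,
        NUMBER_OF_HIDDEN_LAYERS,
        NUMBER_OF_HIDDEN_NEURONS,
        NUMBER_OF_OUTPUT_NEURONS,
        TransferFunctionsType
    > NeuralNetworkType;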

The TransferFunctionsType specifies the ValueType (our Q-format type), the random number generator for the neural network, the activation policy for the Input->Hidden layer, as well as the activation policy for the Hidden->Output layer. The neural network code within tinymind makes no assumptions about its environment; it needs to be provided with template parameters which encapsulate these policies.

Lookup Tables

Since I didn’t have floating point, I also did not have the luxury of the built-in math functions for tanh, sigmoid, etc. To get around this limitation, tinymind generates and utilizes lookup tables (LUTs) for these activation functions (activationTableGenerator.cpp). The generator produces a header file for every supported activation function and Q-format resolution. Because we don’t want to compile in code we’re not going to use, we need to define preprocessor symbols to select the LUT(s) we need. If you look at the compile command you’ll see the following:

-DTINYMIND_USE_TANH_8_8=1

This instructs the compiler to build the tanh activation LUT for the Q-Format type Q8.8, which is what we’re using in this example. We need to do this for every LUT we plan to use.
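If, for example, we also wanted the sigmoid LUT at the same resolution, we would add a second symbol following the same naming pattern (hypothetically, -DTINYMIND_USE_SIGMOID_8_8=1).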

Generating Training Data

We’ll generate the training data programmatically. The function generateXorTrainingValue is called to generate a single training sample. We use the built-in random number generator to generate the 2 inputs: we generate a random number and then AND it with 0x1. If the random number was even, the result is 0; if it was odd, the result is 1.
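Here is a sketch of what such a generator can look like (illustrative, not verbatim from xor.cpp); the expected output is simply the XOR of the two input bits, with each value scaled into Q8.8:

#include <cstdint>
#include <cstdlib>

struct TrainingSample
{
    std::int16_t inputs[2]; // Q8.8 raw values: 0x000 or 0x100
    std::int16_t output;    // expected XOR result in Q8.8
};

static TrainingSample generateXorTrainingValue()
{
    TrainingSample sample;
    const int a = std::rand() & 0x1; // even -> 0, odd -> 1
    const int b = std::rand() & 0x1;

    // 1 in Q8.8 is 1 << 8 (0x100), 0 is 0
    sample.inputs[0] = static_cast<std::int16_t>(a << 8);
    sample.inputs[1] = static_cast<std::int16_t>(b << 8);
    sample.output    = static_cast<std::int16_t>((a ^ b) << 8);
    return sample;
}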

We feed the data into the neural network, predict an output, and then train if the error is outside of our zero tolerance:
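A sketch of that loop follows; the method names are hypothetical placeholders that mirror the description in the text, not the verbatim tinymind API:

// Predict, measure error, and train only when outside the zero tolerance
// (method and variable names here are hypothetical)
for (int i = 0; i < NUMBER_OF_TRAINING_ITERATIONS; ++i)
{
    const TrainingSample sample = generateXorTrainingValue();

    neuralNetwork.feedForward(sample.inputs);                 // predict
    const QValueType error = neuralNetwork.calculateError(&sample.output);

    if (error > zeroTolerance)                                // close enough?
    {
        neuralNetwork.train(&sample.output);                  // adjust weights
    }
}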

We’ll train the neural network on every iteration where the prediction error is not within our zero tolerance. The error in the neural network’s predicted value(s) will virtually never be exactly zero, so we need a way to decide when we’re close enough; if we are, it’s best to leave the weights alone. The zero-tolerance policy is specified when declaring the transfer functions policy for the neural network. In this example, we’re just using the default values for these policies.

Compiling The Example

Here I will show how I compiled the example on my Linux box. I’ve also compiled the code on Windows using Visual Studio.

make

To run the example, just cd to the output directory and run the generated executable file.
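For example (the executable name here is an assumption; use whatever name the Makefile produced):

cd output
./xor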

Plotting Results

After running the example, you will see that you have a text file output, nn_fixed_xor.txt. I have provided a Python script, nn_plot.py, in the neural network unit test directory to plot the results of neural network training. It plots the neural network weights, expected output, predicted output, as well as the calculated error.

You can see the weights training through time, as well as the error shrinking through time, as the predicted value approximates the expected value. Notice the magnitude of the numbers: expected values vary between 0 and 256. This maps to the representation of 0 and 1 in our Q-format type. The representation of 1 in a signed Q8.8 format (8 bits of integer representation, 8 bits of fractional resolution) is 0x100; this is simply the number 1 left-shifted by the number of bits of fractional resolution. Also notice the weights changing. It almost looks like floating point, but it’s not; we have 2^(-8), or 0.00390625, resolution between values.

To use the plotting script, just pass it the path to your results file from running the example program.
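For example, from the directory containing the results file:

python nn_plot.py nn_fixed_xor.txt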

Determining Neural Network Size

To more easily determine how much space the neural network code and data occupy in this example, I made sure to place all of the neural network code into its own file, xornet.cpp. We then generate and parse the object files produced during compilation. The command line I used to compile xornet.cpp is here:

g++ -c xornet.cpp ../../cpp/lookupTables.cpp -O3 -DTINYMIND_USE_TANH_8_8=1 -I../../cpp

We can use the size utility under Linux to inspect the resulting object files:

size lookupTables.o
text    data    bss    dec     hex    filename
224     0       0      224     e0     lookupTables.o

size xornet.o
text    data    bss    dec     hex    filename
3384    8       388    3780    ec4    xornet.o

You can see that the total space occupied by this neural network, Q-format support, and LUTs is just under 4KB. The neural network’s weights and state occupy 388 bytes (bss), while the associated code takes up 3,384 bytes (text) plus 8 bytes of initialized data. The LUT occupies another 224 bytes, for a grand total of 4,004 bytes.

Conclusion

Using tinymind, neural networks can be instantiated using fixed-point numbers so that they take up a very small amount of code and data space. They can be trained to predict the outcome of a basic function in under 2,000 training samples. I hope that this inspires you to clone tinymind and give the unit tests and examples a look. Maybe even generate one of your own.


Principal Engineer at Solidigm (SSD firmware architecture). Exploring machine learning and neural networks within non-volatile memory controllers.