The Integral Image

Introduction to Computer Vision, Part 3

BenMauss
Level Up Coding

--

Image Source: VentureBeat.com

Last time, we discussed Haar-like features, what they are, how they help the Viola-Jones algorithm detect objects, and how they determine if they’ve found a feature. We covered a lot of information and if you’ve never heard of Haar-like features before, I highly recommend reading the previous article, as any summary I give will not be enough to give you the foundation needed to understand our topic today: The Integral Image.

To really underscore the importance of the Integral Image, let’s talk a little bit about…

The Big O

Image Source: Pinterest.com

No, not that Big O. This Big O:

Image Source: Christopher Webb

Big O notation is a way of describing how the time and resources a task needs grow as its input grows. As you increase the number of variables/elements/inputs, the process takes longer and uses more resources (processing power, memory, etc.). Don’t get bogged down too much in the details; just remember this:

As you increase the complexity of a task, the amount of time and resources needed can grow far faster than the input itself.
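To make that concrete, here is a toy sketch (not tied to the Viola-Jones pipeline) that counts the raw operations two tasks perform. A linear task’s work doubles when the input doubles, but a pairwise task’s work roughly quadruples:

```python
def count_additions(n):
    """Number of additions a naive sum over n values performs: O(n)."""
    ops = 0
    total = 0
    for value in range(n):
        total += value
        ops += 1
    return ops

def pairwise_comparisons(n):
    """Comparisons needed to compare every element with every other: O(n^2)."""
    return sum(1 for i in range(n) for j in range(n) if i != j)

# Linear: doubling the input doubles the work.
print(count_additions(100), count_additions(200))    # 100 200
# Quadratic: doubling the input roughly quadruples the work.
print(pairwise_comparisons(10), pairwise_comparisons(20))  # 90 380
```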

Now, how does this relate to Integral Images?

So remember that Haar-like features are a series of rectangular frames that “pan” over an array representing a grayscale image. These frames can be scaled to nearly any size except a single 1x1 pixel. One reason is that each cell (or element) of the frame (and of the image array) represents a single pixel, which holds only one intensity value. The other is that the entire function of these frames is to compare the “white-ness” and “black-ness” of neighboring pixels, which is impossible within a single pixel.

While many Haar-like features have been developed since the creation of the Viola-Jones algorithm in 2001, we’re focusing on the three that were available to Viola and Jones at the time:

Image Source: OpenCV.org
  • Edge Features: Detect the edges of an object. Smallest scale: 1x2 or 2x1.
  • Line Features: Detect lines and highlights within an object. Smallest scale: 1x3 or 3x1.
  • Four-Rectangle Features: Detect diagonal edges and lines. Smallest scale: 2x2.

A feature is classified by calculating the means of the “lighter” and “darker” sides of the frame and subtracting one from the other. If the difference clears a certain threshold, then the algorithm recognizes that the pixels within that frame make up a Haar-like feature.

This is where the Big O comes into play. Since these frames are scalable, there are thousands of combinations and places where features fit within even a tiny image. As was mentioned in the previous article, a 24x24 pixel image contains over 180,000 possible features. The frames within that image can be between 1x2 (or 2x1) pixels and 24x24 pixels.

To better illustrate this problem, consider this example:

Let’s say that the amount of time it takes for a computer to calculate the mean of a single feature is 1 millisecond (this would be incredibly slow, but let’s run with it). Since there are over 180,000 features in a 24x24 pixel image, it would take 180,000ms (3 minutes) to calculate all of the features in it. That’s not including processing power and memory used up to complete it.
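You can reproduce a count of this order yourself by enumerating every scale and position at which the base shapes fit inside a 24x24 window. This sketch uses the five classic base shapes (both orientations of the edge and line features, plus the four-rectangle feature); the exact total depends on which shapes you include, which is why published figures vary around this ballpark:

```python
def count_features(window=24, base_shapes=((1, 2), (2, 1), (1, 3), (3, 1), (2, 2))):
    """Count every position and scale at which each base shape fits in the window."""
    total = 0
    for bw, bh in base_shapes:
        # Widths and heights must stay whole multiples of the base shape.
        for w in range(bw, window + 1, bw):
            for h in range(bh, window + 1, bh):
                # Number of (x, y) positions where a w x h frame fits.
                total += (window - w + 1) * (window - h + 1)
    return total

print(count_features())  # 162336 for these five shapes in a 24x24 window
```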

The amount of time explodes even more when you consider that, these days, you’re most likely going to be using an image with a resolution of 1920x1080 pixels. In order to reduce the amount of time and resources used up, we need to find a way to reduce the complexity of the task. How did Viola and Jones accomplish this to such a degree that their algorithm could detect faces in real-time on a digital camera?

Enter: the Integral Image

What is an Integral Image anyway? Well, it’s an array, the same size as the image, in which each value is the sum of a region of the original image’s pixel values. Which values, though? And how are they arranged in any meaningful way? Let’s just make one. It’ll be easier to understand when you see it.

Figure 1

Let’s say that this array represents a 10x8 pixel image. As we know, the numbers represent how bright a pixel is on that channel. For the sake of simplicity, let’s say that we didn’t scale these values down to be between 0 and 1. So these values are on their original scale of 0 to 255 with 0 being no brightness and 255 being the brightest (meaning the image above is almost pitch-black).

The algorithm takes this original image and creates an array of sums with the exact same dimensions called the integral image.

Figure 1.02

It then begins to populate this table by calculating the sums of the values in the original image following a specific pattern. Again, it’ll be easier to understand when we see it in action. Let’s say we want the value shown below:

Figure 1.03

Well, this cell is not a pixel, but the sum of all values in a corresponding area on the original image. To calculate its value, go to the pixel of the original image that corresponds to this one. Then draw a line from the bottom-right corner going straight up and another going to the left. Shade in the values that are above and to the left of these lines and you’ll get the following:

Figure 1.04

The value for the cell in our integral image is the sum of all the numbers in the shaded area. The sum of these numbers is 51, so let’s go ahead and put that in our Integral Image.

Figure 1.05

Awesome! Let’s try one more to get the hang of it.

Figure 1.06

What area of the original image would the cell below represent?

Figure 1.07

If this was your answer, you’d be right! Now to put it in our Integral Image.

Figure 1.08

By now, you’ve probably noticed a pattern. A cell in the Integral Image corresponds to a specific area in the original image. The value of that cell is the sum of all the values in that corresponding area in the original image. Now that we have a little bit of intuition for how an Integral Image is made and organized, let’s fill out the rest of it.

Figure 1.09
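The table we just filled in by hand can be built in a single pass with running sums. Here’s a minimal sketch in Python; the pixel values are illustrative, not the ones from the figures:

```python
def integral_image(img):
    """ii[y][x] = sum of every pixel above and to the left of (y, x), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            # add the cumulative total from the row above
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

img = [
    [5, 2, 3, 6],
    [1, 8, 4, 2],
    [9, 0, 7, 5],
]
ii = integral_image(img)

# Every cell matches the brute-force sum of its shaded region:
for y in range(3):
    for x in range(4):
        brute = sum(img[r][c] for r in range(y + 1) for c in range(x + 1))
        assert ii[y][x] == brute

print(ii[2][3])  # 52: the bottom-right cell is the sum of the whole image
```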

Now, go back to the original image. We’re going to select a portion of the image that we want to find the sum of and we’ll outline it in red.

Figure 1.10

This is only 5x3 pixels large, so we could just add up all 15 numbers; it’s not that hard. However, what if we had a frame that was 100x100 pixels? Or 1000x1000 pixels? That’s a lot of numbers to add up! This is where the Integral Image shines. Because each cell of the Integral Image corresponds to an area of the original image, we can find the sum of the area that holds all of the values we have outlined and then subtract the excess values. Let’s try it out!

First, find the cell that corresponds to the bottom-right pixel in the red box.

Figure 1.11

Now, again, this number is the sum of the values located in a specific area of the original image. What area of the original image does this cell correspond to?

Figure 1.12

Ok! We’ve already started to isolate the area in the box! Referring back to the Integral image, we know that the sum of this shaded portion is 215. This still includes values outside of the red box, however, so we need to remove some more.

Luckily, these excess values can be broken down into two shapes seen below:

Figure 1.13
Figure 1.14

Let’s start by removing the top two rows (shaded in orange in Figure 1.13). What cell in the Integral image would correspond to this area?

Figure 1.15

80! When you subtract 80 from 215, you get 135. Let’s check our progress on the original image.

Figure 1.16

So the sum of the values shaded in blue is 135. Now we just need to remove that last part shaded in green in Figure 1.14. There’s a problem that we need to address first, however!

Take a second to compare Figures 1.13 and 1.14. A keen eye would notice that they overlap! Looking at Figure 1.17 below, you’ll see the overlapping area.

Figure 1.17

The problem then is that we’ve already subtracted the sum of this area when we removed the top two rows. If we jump ahead and subtract the green shaded area from Figure 1.14, we’ll be subtracting these values twice! No worries, though! This is an easy fix. We simply need to add these values back into our calculations. So go back to the integral image and find the value that corresponds to the pink area.

Figure 1.18

That’s right! 20! So add this back to our running total and we have 155. More importantly, however, we’ve fixed the problem. Check it out!

Figure 1.19

The remaining blue shaded area now perfectly lines up with the green area in Figure 1.14! Now we just subtract this area! Let’s find the value on the integral image.

Figure 1.20

Perfect! Subtracting this gets us our answer: 155 - 72 = 83. That’s the sum of all of the values in the red box on the original image from Figure 1.10!

Now, what does this all mean? Quite simply, no matter how many pixels are selected in a frame, you only need four numbers to calculate the sum of those values. Whether the area is 24x24 pixels or 1000x1000 pixels, it doesn’t matter. You only need those four numbers. This process drastically reduces the complexity of the task, which reduces the amount of time and resources needed to compute.
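The four-number trick we just walked through generalizes to any rectangle: take the whole area down to the bottom-right corner, subtract the strip above and the strip to the left, then add back the overlap that was subtracted twice. A minimal sketch (the image values are illustrative):

```python
def integral_image(img):
    """ii[y][x] = sum of every pixel above and to the left of (y, x), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def region_sum(ii, top, left, bottom, right):
    """Sum of the box img[top..bottom][left..right] using only four lookups."""
    d = ii[bottom][right]                                     # everything down to the box
    b = ii[top - 1][right] if top > 0 else 0                  # rows above the box
    c = ii[bottom][left - 1] if left > 0 else 0               # columns left of the box
    a = ii[top - 1][left - 1] if top > 0 and left > 0 else 0  # overlap removed twice
    return d - b - c + a

img = [
    [5, 2, 3, 6],
    [1, 8, 4, 2],
    [9, 0, 7, 5],
]
ii = integral_image(img)
print(region_sum(ii, 1, 1, 2, 2))  # 19 = 8 + 4 + 0 + 7, from just four lookups
```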

Applying to Haar-like Features

So how does this work with the Viola-Jones algorithm? Let’s take a look at the original image again.

Figure 2

The frame here represents an Edge feature. The algorithm needs to calculate the means of the lighter and darker halves and subtract them to see if the difference falls within the threshold. For the sake of argument, let’s say that our threshold is 10 (remember that we’re interpreting the values in the array as if they were still on a scale of 0 to 255). So it creates the integral image.

Figure 2.01

Then it calculates the sum of the lighter half by finding the four relevant numbers.

Figure 2.02

255 - 215 + 46 - 55 = 31

Taking this sum, the algorithm calculates the mean by dividing it by the number of pixels in that half of the frame (six, in this case).

31 / 6 = 5.167

Now, it follows the exact same steps to find the mean of the darker half.

Figure 2.03

302 - 255 + 55 - 69 = 33

33 / 6 = 5.5

Now, it subtracts the means to see if it falls within the threshold.

5.5 - 5.167 = 0.333

This falls well below our threshold of 10, so the algorithm determines that there is no Edge feature in this portion of the image and moves on.
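The whole check can be sketched end to end: build the integral image, sum each half of the frame with four lookups, then compare the means against a threshold. This is a simplified illustration, not OpenCV’s implementation; the image, frame, and threshold are made up:

```python
def integral_image(img):
    """ii[y][x] = sum of every pixel above and to the left of (y, x), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def region_sum(ii, top, left, bottom, right):
    """Four-lookup sum of the box img[top..bottom][left..right]."""
    d = ii[bottom][right]
    b = ii[top - 1][right] if top > 0 else 0
    c = ii[bottom][left - 1] if left > 0 else 0
    a = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return d - b - c + a

def edge_feature_found(ii, top, left, height, width, threshold):
    """Horizontal Edge feature: compare the mean of the left half
    with the mean of the right half of the frame."""
    half = width // 2
    light = region_sum(ii, top, left, top + height - 1, left + half - 1)
    dark = region_sum(ii, top, left + half, top + height - 1, left + width - 1)
    n = height * half  # pixels per half
    return abs(dark / n - light / n) >= threshold

img = [
    [5, 2, 3, 6],
    [1, 8, 4, 2],
    [9, 0, 7, 5],
]
ii = integral_image(img)
# A 3x4 frame over the whole image, with an (illustrative) threshold of 10:
print(edge_feature_found(ii, 0, 0, 3, 4, 10))  # False: the halves are too similar
```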

Summary

We covered a lot of ground today, but the key takeaway is that the Integral Image plays a big role in reducing the amount of time and processing power required to calculate and evaluate hundreds of thousands of features in real-time. It’s clever shortcuts like this that made real-time object detection possible on the hardware of 20 years ago. It’s simply brilliant.
