Haar-like Features: Seeing in Black and White

An Introduction to Computer Vision, Part II

BenMauss
Level Up Coding


Image Source: SoulPageIT.com

In the last article, we discussed the foundational knowledge needed to understand computer vision. That material is so important that I’m going to provide a VERY brief recap here. If you haven’t read that article yet, I highly recommend it, as it has a lot of images to help visualize the process.

How Do Computers “See”?

They don’t! Computers only work with numbers! When an image is stored, it goes through the following process:

  1. The image is split into 3 channels: Red, Green, and Blue. These channels correspond to the three colors that compose white light, as well as the three color channels in a single pixel.
  2. These channels are arrays with the same dimensions as the image’s resolution (e.g. a 1920x1080 pixel image will form 3 arrays, each with 1920 columns and 1080 rows).
  3. Each cell of these arrays corresponds to the red, green, or blue channel of a single pixel.
  4. The value stored in each cell represents the brightness of that color channel in that pixel, on a scale of 0 (no light) to 255 (the brightest). Thus, if a pixel is perfectly black, its values on the corresponding channels would be {Red: 0, Green: 0, Blue: 0}. Conversely, if the pixel is white, its stored values would be {R: 255, G: 255, B: 255}.

In short, the image is stored as instructions for how bright each channel should be in each pixel.
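If you’d like to see this for yourself, here’s a minimal sketch in Python using OpenCV (assuming the library is installed and that “face.jpg” stands in for whatever image you have on hand):

    import cv2  # OpenCV; install with `pip install opencv-python`

    # "face.jpg" is a placeholder filename. Note that OpenCV loads the
    # channels in Blue, Green, Red order rather than Red, Green, Blue.
    img = cv2.imread("face.jpg")

    print(img.shape)  # e.g. (1080, 1920, 3): rows, columns, channels

    # Brightness values (0-255) of the top-left pixel on each channel
    blue, green, red = img[0, 0]
    print({"R": red, "G": green, "B": blue})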

With that out of the way, let’s move on to the nitty-gritty!

The Viola-Jones Algorithm

Q: Can computers see electric sheep?

A: If you give me 10,000 labeled images of electric sheep, sure.

The Viola-Jones algorithm was developed in 2001 by Paul Viola and Michael Jones. It’s best described as an object-detection algorithm that was trained on faces. This means that, even though it was designed for and functions as a face-detection algorithm, it can be retrained to detect just about any object.

Now, the world of Computer Vision has advanced a lot over the last 20 years, with deep learning neural networks and transfer learning from pre-trained models slowly becoming the new standard. Despite this, the Viola-Jones algorithm is still a powerful tool used by many. This, along with its creative design, makes it perfect for understanding the basics of computer vision.

So, how does it work?

A Look Under the Hood

The Viola-Jones algorithm starts by converting the image to grayscale. This makes the math easier, especially when you consider the nature of Haar-like features. Haar-like features are scalable, rectangular frames that are used to compare how pixels relate to each other; specifically, how dark one region is relative to another.
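The conversion itself is a one-liner in OpenCV (a quick sketch, with “face.jpg” again standing in for your own image):

    import cv2

    img = cv2.imread("face.jpg")                  # 3-channel BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # single-channel grayscale
    print(img.shape, gray.shape)                  # e.g. (1080, 1920, 3) and (1080, 1920)

With the image in grayscale, the algorithm can start hunting for Haar-like features. See the examples below.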

Image Source: OpenCV.org

There are three basic types of Haar-like features: Edge features, Line features, and Four-rectangle features. The white bars represent pixels that contain parts of the image that are closer to the light source, and would therefore be “whiter” in a grayscale image. The black bars are the opposite: pixels whose features are farther away from the light source (like a background), or are obstructed by another object (such as the eyebrows casting slight shadows over the eyes below them). These regions would appear “blacker” in a grayscale image. This comparison between light and dark pixels is the most important reason that we transform the image to grayscale.

Now, more Haar-like features have been developed since the creation of the algorithm, but back in 2001, these three were what Viola and Jones relied on. Let’s discuss these features.

Important note: Remember that Haar-like features are scalable. In the case of Edge features, it could be 1x2, 100x200, or even 400x50 pixels. It doesn’t matter! The only dimension they can’t be is 1x1 pixels. So keep that in the back of your mind as we continue.

Edge features: These frames detect edges (simple enough). When it comes to face-detection, think of the forehead and the eyes/eyebrows. The forehead is an exposed, flat surface of the face. This allows it to reflect more light, so it is often “lighter”. Eyebrows are generally darker. The algorithm would read the lighter shade of the forehead and the transition to the darker eyebrows as an “edge” boundary.

Line features: These detect? You guessed it! Lines. The pattern can go white-black-white, or black-white-black (like an Oreo). Going back to our face-detection example, think about a nose. The ridge of your nose, stretching from the bridge to the tip, isn’t as flat as the forehead, but it’s still reflective and is the closest point on the face to a light source in front of the subject, so it naturally appears brighter and stands out. The area around the nostrils typically bends away from the light, making it darker. This pattern would be picked up as a Line feature. Another interesting place Line features are being utilized is eye-tracking technology. Think about it: a darker iris sandwiched between the whites of your eye on either side. Pretty clever!

Four-Rectangle Features: These are good for finding diagonal lines and highlights in an image, and are best used on a micro scale. Depending on the lighting, they can pick out the edges of the jaw, the chin, wrinkles, etc. These features typically aren’t as important in general face detection: there are so many of them, and so much variation in every individual’s face, that relying on them would lead to an algorithm that was too slow and might only detect the faces of certain people. In other words, too specialized.
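To make the idea concrete, here’s a small sketch (illustrative only, not the actual Viola-Jones implementation) of how a horizontal Edge feature could be evaluated on a patch of a grayscale image: average the pixels under the light half, average the pixels under the dark half, and compare.

    import numpy as np

    def edge_feature_value(patch):
        # Value of a two-rectangle (Edge) feature on a grayscale patch:
        # split the patch into a top and bottom half and take the
        # difference between their mean brightnesses.
        half = patch.shape[0] // 2
        top = patch[:half, :].mean()     # e.g. forehead region (lighter)
        bottom = patch[half:, :].mean()  # e.g. eyebrow region (darker)
        return top - bottom

    # Toy 4x4 patch with values scaled 0-1: bright rows above dark rows
    patch = np.array([
        [0.9, 0.8, 0.9, 0.8],
        [0.8, 0.9, 0.8, 0.9],
        [0.2, 0.1, 0.2, 0.1],
        [0.1, 0.2, 0.1, 0.2],
    ])
    print(edge_feature_value(patch))  # roughly 0.7, a strong edge-like response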

Let’s check out the image below to get an understanding of how this all comes together.

Image Source: SemanticScholar.org

As you can see, the algorithm classifies the transition from the forehead to the brow, the eyes to the cheeks, the upper-lip to the mouth, and the jaw to the chin as Edge-features. The nose follows a black-white-black pattern as light reflects off of the top. The highlight here creates a line and, thus, it is classified as a Line-feature.

A couple of notes should be made. First, this is just one example of how the algorithm could classify these parts of the face. Depending on the circumstances (whether or not the subject is wearing sunglasses), the lighting (a light source coming from a different angle), and the scale (a group picture in front of the Eiffel Tower, where facial features are a tiny part of a much bigger image), the algorithm could classify them differently.

It should also be noted that the image above, in addition to being converted to grayscale, has had the contrast turned up. A lot! This is not a transformation that happens during the actual process. Instead, the algorithm would look for features on an image more like this one:

Image Source: Spritle.com

Now, Haar-like features are defined by specific patterns of black and white pixels in a certain area. So you might be wondering: how does the algorithm make these decisions on a grayscale image? That’s a great question! The answer is quite simple: thresholds. The algorithm is taught that, within a Haar-like feature’s frame, if the difference between the means of the lighter and darker areas exceeds a certain threshold, it should treat them as white and black. Let’s look at an example:

So remember that an image is just an array with three channels. The values stored in the arrays are just numbers that represent a pixel’s intensity on each channel. In the image above, we have a frame for an Edge feature with dimensions of 6x5 pixels. For the sake of “training”, we’ve scaled the values so that, instead of 0 to 255, they run from 0 to 1, where 0 is black and 1 is white.

(Note: I’ve seen explanations where 0 represented white and 1 represented black, but for the sake of consistency, we’re sticking with the interpretation above)

Now, let’s say that we have a threshold of 0.25. We take the mean of the lighter side (0.58667) and subtract the mean of the darker side (0.28). The difference between the two is 0.30667. This is larger than our threshold, so the algorithm treats this segment as black and white and classifies it as an Edge feature! Remember that a difference close to 0 indicates that the two sides are roughly the same shade.
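In code, that decision is a single comparison. The sketch below simply reproduces the arithmetic above, with the two means hard-coded from the example:

    threshold = 0.25

    light_mean = 0.58667  # mean of the lighter half of the 6x5 frame
    dark_mean = 0.28      # mean of the darker half

    difference = light_mean - dark_mean       # 0.30667
    is_edge_feature = difference > threshold  # True: treat the halves as white and black
    print(round(difference, 5), is_edge_feature)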

Bringing it all together

Look at the image below:

Image Source: MathWorks.com

The Viola-Jones algorithm converts an image to grayscale and slides a frame across the image from left to right, slowly working its way down. It searches for specific patterns called Haar-like features and doesn’t classify a region as a face until all of the relevant features of a face have been found inside that frame (eyes, nose, mouth, etc.). Years ago, if a feature like an eye was missing, the algorithm would struggle to detect the face. This is no longer the case, however.

Although Haar-like features are defined in black and white, the Viola-Jones algorithm is able to interpret areas of a grayscale image as black and white as long as the difference between the means of the darker and lighter areas exceeds a specified threshold.
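In practice, you rarely write that sliding-window loop yourself. OpenCV ships pre-trained Haar cascades, so a Viola-Jones-style face detector can be used in a few lines (a sketch; “photo.jpg” is a placeholder filename):

    import cv2

    # Pre-trained frontal-face Haar cascade bundled with OpenCV
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_cascade = cv2.CascadeClassifier(cascade_path)

    img = cv2.imread("photo.jpg")                 # placeholder filename
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # the detector works on grayscale

    # Slide the detection window across the image at multiple scales
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)  # box each detected face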

An important point is that within a detection window as small as 24x24 pixels, there are over 180,000 features to calculate. That’s a lot of calculations that will eat up the processing power of your device. So how is it that we’re able to achieve real-time face detection on digital cameras and smartphones, with inputs as high as 1080p no less? The answer lies in a neat little trick called the Integral Image. Check it out!
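As a tiny preview of that trick: an integral image stores, at each position, the sum of all pixels above and to the left, which lets you compute the sum of any rectangle (and therefore any Haar-like feature) with just four lookups. Here is a rough NumPy sketch of the idea:

    import numpy as np

    def integral_image(gray):
        # Cumulative sums over rows and columns, padded with a leading row/column of zeros
        return np.pad(gray.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

    def rect_sum(ii, r0, c0, r1, c1):
        # Sum of gray[r0:r1, c0:c1] from only four lookups into the integral image
        return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

    gray = np.arange(16).reshape(4, 4)
    ii = integral_image(gray)
    print(rect_sum(ii, 1, 1, 3, 3), gray[1:3, 1:3].sum())  # both print 30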
