Pixels, Arrays, and Images
An Introduction to Computer Vision, Part I

Computer vision is a rapidly growing field of study with applications ranging from face detection to analyzing X-ray images, and even predicting whether a train car's coupling is about to detach. There seems to be an almost limitless number of applications for computer vision. But how does it work? Today, we're going to learn just that. We'll focus less on the mathematics and more on the intuition. Let's get started!
What’s an Image?
In order to understand how computers “see,” we first need a fundamental understanding of what an image is. Let’s look at the image below:

On the left is the original image. When a computer scans, stores, or retrieves this image, it first breaks it down into three separate channels: red, green, and blue. You might remember from science class that these are the three colors that combine to form white light. It’s very important for us to draw the connection between these channels and light, because it will help us answer our next question:
How do computers see?
Well, they don’t. Not in a conventional way, at least. Computers only understand data in the form of numbers. These numbers are often stored as scalars, vectors, matrices, or higher-dimensional arrays. These numbers can represent literally ANYTHING! However, they are still numbers.
We just went over how a computer breaks an image down into three channels. After this is done, these channels are then converted into a three-dimensional array. Look at the next example.

Each cell of the array corresponds to a pixel in the image. This means the array’s dimensions match the image’s resolution. A color image with a resolution of 1920 x 1080 pixels is therefore broken down into three arrays of those same dimensions, one per channel. The value stored in each cell represents the intensity of that channel at the corresponding pixel.
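The channels-to-array step above can be sketched in a few lines of NumPy. The blank channels and the channels-first layout here are assumptions for illustration; real image libraries may order the dimensions differently.

```python
import numpy as np

# Hypothetical 1920 x 1080 color image. Each channel is its own 2D array
# with one cell per pixel (rows x columns = 1080 x 1920).
height, width = 1080, 1920
red = np.zeros((height, width))
green = np.zeros((height, width))
blue = np.zeros((height, width))

# Stacking the three channels gives the 3D array the computer actually "sees".
img = np.stack([red, green, blue])
print(img.shape)  # (3, 1080, 1920)
```

Note that the first dimension is the channel, matching the [Channel #, Row #, Column #] indexing used in the exercise below.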
Let’s take a quick step back and digest this with a short exercise.
To navigate the above image we’ll use an index that follows this pattern: [Channel #, Row #, Column #]. So if I ask, “what is the value stored at [1, 0, 2]?”, you would go to the green channel (channel 1), find the cell in the top row (row 0) and the third column (column 2), and report that the answer is 0.376.
With that, let’s compare the values of a single pixel. What are the values in the following cells: [0, 0, 1], [1, 0, 1], and [2, 0, 1]?
Answer: 0.482, 0.263, and 0.376, respectively. So what does this mean? Well, remember that these values represent the intensity (or brightness) of the channel at that pixel. So in this pixel, the red channel will be the brightest, followed by the blue channel, with the green channel being the least bright.
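The same lookups can be reproduced in code. The tiny array below is a stand-in for the full image: only the cells used in the exercise are filled in, and the rest are placeholder zeros.

```python
import numpy as np

# Stand-in image array: 3 channels x 1 row x 3 columns.
img = np.zeros((3, 1, 3))
img[1, 0, 2] = 0.376  # green channel, top row, third column (first question)
img[0, 0, 1] = 0.482  # red channel of the pixel at row 0, column 1
img[1, 0, 1] = 0.263  # green channel of that same pixel
img[2, 0, 1] = 0.376  # blue channel of that same pixel

print(img[1, 0, 2])   # 0.376
print(img[:, 0, 1])   # [0.482 0.263 0.376] - all three channels of one pixel
```

The slice `img[:, 0, 1]` grabs every channel at once, which is exactly the “compare the values of a single pixel” step from the exercise.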
You might be asking “Why channels?” and “Why pixel intensity?” Both are good questions that can be answered by looking at a picture of a pixel.

This is an image of 60 pixels. As you can see, each pixel has three channels: red, green, and blue. Just like the three channels of the array above. So the values in the arrays are a recipe to recreate an image! In the exercise we just did, the array is saying, “Set the brightness of the red channel in that pixel to 0.482, the green channel to 0.263, and the blue channel to 0.376. Do this, and that pixel will be the correct color!”
Quick note: the values in this example have been normalized to a scale of 0 to 1. Typically, intensity levels are stored on a scale of 0 to 255, with 0 being no light whatsoever and 255 being the brightest.
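The conversion between the two scales is a single division. In fact, the exercise values above round-trip nicely: 8-bit intensities of 123, 67, and 96 normalize to roughly 0.482, 0.263, and 0.376 (these 8-bit values are back-calculated here for illustration, not taken from the original image).

```python
import numpy as np

# One pixel's channel intensities on the usual 8-bit 0-255 scale.
pixel_255 = np.array([123, 67, 96])

# Normalize to the 0-1 scale used in the article's examples.
pixel_01 = pixel_255 / 255.0
print(pixel_01.round(3))  # [0.482 0.263 0.376]
```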
Bringing It All Together
Let’s recap what we just learned:
- A color image is comprised of three channels: red, green, and blue.
- These channels correspond to those in a single pixel.
- When a computer reads (or writes) an image, it takes the intensity values of each channel in a pixel and stores them in corresponding cells of a 3D-array.
Thus, what a computer “sees” is the array! The task of computer vision, then, is to train an algorithm to recognize patterns in the 3D array and associate that pattern with an object or shape.
In the next article, we’ll discuss how the very powerful Viola-Jones algorithm accomplished this back in 2001 using Haar-like features!
See you then!