Computer Vision 101: Introduction
The fundamental concepts of computer vision
If you’ve ever heard of computer vision but aren’t a professional in the field, you might immediately associate the term with neural networks, particularly CNNs and Vision Transformers. While this is a reasonable association, it is quite a narrow perspective. Computer vision has been an active research area for decades, long before the advent of CNNs, and includes many other techniques and methods beyond deep learning.
And you might be wondering:
Why should I bother learning outdated concepts that are no longer relevant today, when I can simply train a CNN and call it a day? 🤔
Well, consider that they form the foundation upon which all modern approaches to computer vision are built. A firm grasp of these core principles is essential to mastering more advanced concepts, including deep learning (in my personal opinion).
And yes, neural networks are not magic: their power derives from a solid mathematical foundation.
Power is nothing without control.
Computer Vision 101 is a series where we’ll dig under the surface to deeply understand the underlying principles of modern computer vision approaches, starting from the basics and building up to the advanced techniques.
What is computer vision
Computer vision is the science and related technology devoted to creating machines capable of seeing.
More precisely, computer vision studies how to build machines provided with visual intelligence.
While vision (see) refers to the physical act of seeing with our eyes, visual intelligence (observe) refers to the ability to interpret and comprehend visual information. Recognizing that two people are playing football on opposing teams in an image, for example, requires visual intelligence, whereas just seeing two people and a ball does not.
The goal of modern computer vision is to automate tasks that are traditionally performed by the human visual system, and in some cases even to surpass human capabilities.
This technology offers a wide range of possible applications, including facial recognition, medical diagnosis improvement, and the development of self-driving cars.
The computer vision pipeline outlines the main steps to follow to go from an input image (or video) to some type of human-like understanding.
The main stages of the pipeline are:
- Image acquisition, the process of collecting an image with a camera or other sensor. The quality and characteristics of the acquired data can have a significant impact on the pipeline’s subsequent steps. This step will not be discussed in this series of articles.
- Image processing, or pre-processing, transforms the input image into another image whose visual properties are enhanced, making the subsequent knowledge-extraction steps easier, more reliable, and more accurate. Image denoising, for example, involves removing or reducing noise in an image and is an essential pre-processing operation.
- Feature extraction, or image analysis, is a compression task that helps reduce the dimensionality of the data. An image is made up of a great many pixels, and working on all of them directly would be too computationally demanding. Besides that, a whole image may contain much redundant or irrelevant information. The goal here is to represent the image as an n-dimensional vector describing some (limited) visual properties of it. This requires a definition of what good features are, and practical methods to extract and represent them.
- Image understanding tries to infer some information from the image by using previously extracted features. This could be classification, object detection, semantic segmentation, instance segmentation, and many others.
But what exactly is an image, and how is it represented?
What is an image
An image is essentially a two-dimensional function f(x, y), that maps spatial coordinates (x, y) to pixel values. A digital image is a discrete representation of the continuous function.
In computer vision, an image is represented as a matrix of elements.
Each element is referred to as a pixel, and it indicates the brightness of that specific point in the image. Pixel values typically range from 0 (black) to 255 (white); this is the standard representation for 8-bit grayscale images in particular.
Color images, on the other hand, are represented by a set of matrices (or a tensor): each matrix represents the intensity of a single color channel. The RGB representation is a widely used form in which three matrices indicate the intensity of the three colors red, green, and blue. Each pixel is thus represented by a triplet of values.
The histogram is the most basic image representation in computer vision. It represents the distribution of pixel intensities in the image.
Let’s make it clearer. We know that pixel values in grayscale images typically range from 0 to 255. In this case, a histogram is essentially a vector with 256 components. Each element is called a bin and represents the number of pixels in the image that hold the corresponding gray-level value. The image below is a simple representation of this notion.
Bin 67 (x-axis) corresponds to the value 30 (y-axis). This means that 30 of the image’s pixels have a grayscale value of 67.
When working with RGB images, we can have three separate histograms, one for each color channel, or a combined one.
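As a minimal sketch of how a grayscale histogram can be computed (I’m using NumPy here, which is my own choice, not something the article prescribes):

```python
import numpy as np

# A tiny 4x4 grayscale "image" with values in [0, 255].
img = np.array([
    [0,   0,  67,  67],
    [67, 128, 128, 255],
    [0,  67, 128, 255],
    [67, 67, 255, 255],
], dtype=np.uint8)

# One bin per possible gray level: hist[v] counts the pixels equal to v.
hist = np.bincount(img.ravel(), minlength=256)

print(hist[67])  # number of pixels with gray-level value 67
```

For an RGB image, the same call would simply be applied once per channel.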
The normalized histogram is a variant of the histogram in which the histogram is viewed as a discrete approximation of a probability density function.
A probability density function is a function whose value at any given sample (each bin), can be interpreted as the likelihood that the value of the variable would be equal to that sample.
This simply means that each bin no longer represents the number of pixels with a particular value (as in the standard histogram), but the probability that a randomly chosen pixel in the image has that value.
How can we compute it? Simply divide each bin value in the standard histogram by the total number of pixels in the image:

p(i) = n(i) / n

In plain English: the probability of gray-level value i is given by the number of pixels whose value equals i, denoted n(i), over the total number of pixels in the image, n. For example, if the image has 100 pixels and 10 of them have gray-level value i, then p(i) = 10/100 = 0.1.
In this manner, we obtain a probability vector whose components sum to 1.
This is the most common representation in practical applications.
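A small sketch of the normalization step (again assuming NumPy, which the article does not mandate):

```python
import numpy as np

img = np.array([[0, 0, 67, 67],
                [67, 128, 128, 255]], dtype=np.uint8)

hist = np.bincount(img.ravel(), minlength=256)

# Divide each bin by the total pixel count to get a discrete
# probability distribution over gray levels.
p = hist / img.size

print(p[67])   # probability that a random pixel has value 67
print(p.sum()) # the normalized histogram sums to 1
```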
Entropy is a measure of the spread of gray-level values in an image. It is used to assess the information content of an image, under the assumption that a wider distribution of values carries more information. An image with a single gray-level value has entropy = 0 and, in fact, provides no information.

H = − Σ p(k) · log2 p(k)

where p(k) is the normalized histogram value at position k, and the sum runs over all bins.
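Here is a minimal sketch of the entropy computation (NumPy assumed; the convention that 0 · log 0 = 0 is handled by skipping empty bins):

```python
import numpy as np

def entropy(img):
    # Build the normalized histogram, then sum -p(k) * log2(p(k))
    # over the non-empty bins (empty bins contribute 0 by convention).
    hist = np.bincount(img.ravel(), minlength=256)
    p = hist / img.size
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

flat = np.full((8, 8), 42, dtype=np.uint8)       # single gray level
two = np.array([[0, 255], [0, 255]], dtype=np.uint8)  # two equally likely levels

print(entropy(flat))  # 0.0 — no information
print(entropy(two))   # 1.0 — one bit per pixel
```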
The histogram is useful in simple image segmentation applications, to separate the background from the foreground objects, or for contrast enhancement.

Aside from its uses, it is important to keep in mind that a histogram contains no information about the spatial distribution of pixels: many different images, including ones that depict nothing meaningful at all, can share the same histogram. That is an important limitation, one that feature extraction methods try to overcome.
Image processing operators
Image processing operators are mathematical functions that operate on the pixel values of an image to produce a desired effect or transformation. They play an important role in each step of the vision pipeline. Three kinds of operators act on an image to produce another:
- Point operators
- Neighborhood operators
- Global operators
In point operators, the value of a pixel in the output image depends only on the value of the pixel in the same position in the input (original) image.
Image 8 depicts the formal equation for point operators: g(x, y) = h(f(x, y)). The function h is the transformation applied, processing the point of coordinates (x, y) to generate the output value for the same pair of coordinates.
Linear point operators
When the applied mathematical function h is also a linear function, it can be represented as follows:

g(x, y) = s · f(x, y) + k
Where the parameters are:
- s, the scale factor (or gain, or contrast)
- k, the offset constant (or bias, or brightness)
To be a linear function, h must satisfy the superposition principle: applying h to the sum of two images must give the same result as summing the outputs of h applied to each image separately.
Luminance variation is an example of a linear point operator.
With s=1.3 and k=0, we can boost the brightness of the image by 30%.
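A minimal sketch of this linear point operator (NumPy assumed; the clipping back to [0, 255] is my addition, needed to stay within the valid 8-bit range):

```python
import numpy as np

def linear_point_op(img, s=1.0, k=0.0):
    # g(x, y) = s * f(x, y) + k, computed in float to avoid
    # uint8 overflow, then clipped back to the valid range.
    out = s * img.astype(np.float64) + k
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.array([[100, 200]], dtype=np.uint8)
bright = linear_point_op(img, s=1.3, k=0)  # 30% brightness boost
print(bright)  # 100 -> 130, 200 -> 255 (saturated)
```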
Another famous linear point operator is the linear blend. It is used to combine two or more images into a single composite image, and in video processing for producing a linear transition. It is defined as follows:

g(x, y) = (1 − alpha) · f0(x, y) + alpha · f1(x, y)
The parameter alpha ranges from 0 to 1 and controls the trade-off between the two images.
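A minimal sketch of the linear blend (NumPy assumed, names of my own choosing):

```python
import numpy as np

def linear_blend(img0, img1, alpha):
    # g(x, y) = (1 - alpha) * f0(x, y) + alpha * f1(x, y)
    out = (1 - alpha) * img0.astype(np.float64) + alpha * img1.astype(np.float64)
    return np.clip(out, 0, 255).astype(np.uint8)

a = np.full((2, 2), 0, dtype=np.uint8)    # all-black image
b = np.full((2, 2), 200, dtype=np.uint8)  # uniform gray image
mid = linear_blend(a, b, 0.5)             # halfway transition
print(mid)  # every pixel is 100
```

Sweeping alpha from 0 to 1 over successive frames produces the cross-fade transition mentioned above.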
You can discover more about the implementation of these operators in the references below.
Thresholding is a non-linear point operator. It is used mostly for image segmentation, to separate the foreground objects from the background.
Thresholding converts an input image into a binary image (one with only two values, such as black and white) depending on a threshold value T. Each pixel is compared to the threshold and one of the two values is assigned to it. We can write it as follows:

g(x, y) = 255 if f(x, y) > T, else 0
The main challenge with thresholding is not the implementation (which, as you can see, is rather simple), but the choice of a good T for the case at hand. Otsu thresholding is one of several algorithms that can be used to accomplish this.
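A minimal sketch of thresholding (NumPy assumed; the two output values 0 and 255 follow the black/white convention above):

```python
import numpy as np

def threshold(img, T, low=0, high=255):
    # Binary output: high where f(x, y) > T, low otherwise.
    return np.where(img > T, high, low).astype(np.uint8)

img = np.array([[10, 200],
                [127, 128]], dtype=np.uint8)
binary = threshold(img, T=127)
print(binary)  # [[  0 255]
               #  [  0 255]]
```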
In neighborhood operators, the value of a pixel in the output image depends on the value of the pixel in the same position in the original image as well as the value of the pixels around it. This is known as a pixel neighborhood, and its size might vary depending on the task.
To maintain the size of the image, padding can also be added beforehand.
It is merely a pixel border put around the image. Padding pixels can be 0 (zero-padding), a specific constant value (constant-padding), or follow more sophisticated strategies such as clamp-to-edge, mirror, and so on.
In the example above, a 1-pixel padding has been applied around the image to maintain the spatial dimension (both input and output are 4×4 matrices).
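A small sketch of the padding strategies just listed (NumPy assumed; its `np.pad` modes map directly onto them):

```python
import numpy as np

img = np.arange(1, 17, dtype=np.uint8).reshape(4, 4)

# 1-pixel zero padding: a 4x4 input becomes a 6x6 padded image,
# so a full 3x3 neighborhood exists for every original pixel.
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)

# The other strategies mentioned in the text:
edge   = np.pad(img, 1, mode="edge")     # clamp to edge
mirror = np.pad(img, 1, mode="reflect")  # mirror
print(padded.shape)  # (6, 6)
```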
The equation in Image 15 illustrates how the function works. The output value for a pixel of coordinates (x, y) is given by a transformation function h, applied on the same pair of coordinates and on a neighborhood of it, represented as N(x,y).
This type of operation is extremely important when you need to take into account spatial information, which is an essential concept of computer vision processing: when attempting to comprehend something about an image, the value of each individual pixel is not all that relevant.
Linear filtering is a type of neighborhood processing. Its definition is really simple.
Given an image F, linear filtering produces another image G, where the output value of each pixel is given by the weighted sum of the original pixel value and the pixel values around it (its neighbors).
The set of weights used to perform this weighted sum is the same for all the neighborhoods and is referred to as the filter’s kernel. Recalling Image 13, the red mask around the pixel is our kernel.
The process of applying this filter all over the image is called convolution or correlation. We shall not discuss the distinctions between convolution and correlation here.
The above graphical representation shows how linear filtering happens. In particular, here we’re using a stride of 1. The stride is the number of pixels the kernel moves horizontally and vertically at each step.
The formula for correlation, which is the operation shown graphically in the figure above, is:

g(i, j) = Σ f(i + k, j + l) · h(k, l)

where the sum runs over the kernel offsets (k, l), and g(i, j) is a single output pixel obtained as the weighted sum of an image region, with the weights given by the kernel h.
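A minimal sketch of correlation with a stride of 1 and no padding (NumPy assumed; real libraries use much faster implementations, so this is purely illustrative):

```python
import numpy as np

def correlate(f, h):
    # g(i, j) = sum over (k, l) of f(i + k, j + l) * h(k, l)
    kh, kw = h.shape
    H, W = f.shape
    g = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(g.shape[0]):
        for j in range(g.shape[1]):
            # Weighted sum of the kernel-sized neighborhood.
            g[i, j] = np.sum(f[i:i + kh, j:j + kw] * h)
    return g

f = np.ones((4, 4))
box = np.full((3, 3), 1 / 9)  # 3x3 box (mean) filter kernel
g = correlate(f, box)         # each output pixel is the local mean
```

With a constant input, every local mean equals the input value, so the output is a 2x2 grid of ones.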
Neighborhood operators, and linear filtering in particular (though not exclusively), are widely used in image processing.
They are particularly useful in image restoration, which is a subfield of image processing concerned with the reduction of noise in images.
They may also be used to calculate the gradient of an image and hence pinpoint its edges.
And, if you’re wondering if this is the fundamental notion around which CNNs are built, the answer is yes! 😃 As I told you, we’re laying the foundations for the modern approaches too. But don’t rush; we’re only at the beginning.
In global operators, the value of a pixel in the output image depends on all the pixels in the original one. Here h works on a function of all the pixels, for each pixel in the output.
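Histogram equalization is a classic example of a global operator, since each output pixel depends on the gray-level distribution of the entire image. A minimal sketch (NumPy assumed; this ties back to the normalized histogram introduced earlier):

```python
import numpy as np

def equalize(img):
    # Each output pixel depends on the WHOLE image: the mapping is
    # the cumulative distribution function (CDF) of the gray levels.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist) / img.size
    lut = np.round(255 * cdf).astype(np.uint8)  # lookup table from the CDF
    return lut[img]                             # remap every pixel

img = np.array([[0, 64],
                [128, 255]], dtype=np.uint8)
eq = equalize(img)
```

Changing any single pixel changes the CDF, and therefore potentially every output pixel — which is exactly what distinguishes global operators from point and neighborhood ones.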