Why Neural Networks Can Solve Simple Tasks
By simple, I mean Computer Vision
Recently, I posted on when not to use neural networks. This time, I thought it would be nice to go the other way around and talk about why these fancy networks are so damn effective at vision and language problems. So in a sense, this is a piece on the flexibility aspect of machine learning models and why neural networks are the kings of flexibility.
In the previous post, I presented the following rule of thumb: within AI, we use simple models to solve complex-to-human tasks and complex models to solve simple-to-human tasks. Here, complex means problems we need to reason about, whereas simple refers to problems we solve intuitively. For instance, you need to do some thinking to multiply 48976 by 15948, but you can readily recognize faces in pictures. Computationally, it is the other way around: the multiplication is trivial, while recognizing faces is significantly complex.
When it comes to machine learning algorithms, many techniques can solve complex-to-human problems. Still, just one approach is reliably capable of solving simple-to-human tasks: neural networks. It all boils down to their innate flexibility to combine features into more meaningful representations. To understand this, we need some understanding of how algorithms see.
What Models See
Before talking about computer vision, let’s take a slight detour through coffee tasting. Without being too nerdy about it, there are a few features that matter when it comes to good versus bad coffee. Here is an excellent video on buying/finding good coffee. Here is my humble summary:
- The number of days since the bean was harvested
- The number of days since it was roasted
- The roasting strength used
- How much “body” it has
- How acidic it is
- Does it have fermented flavors?
Since we are data scientists, we can model this as a simple classification problem: we have some integer features (number of days since harvest/roast), some categories (roast strength, body, and acidity), and a Boolean (fermented / not fermented). Then, we can buy some coffee packages, hand them to some experts, and record their feedback on each package. In other words, how much they liked/disliked each sample.
Having our dataset ready, we can try some machine learning models to predict if a given coffee package will be good or bad. Within this setting, I highlight how each model sees the task to solve it.
Disclaimer: While I based the above features on Hoffmann’s video, most examples below are fictitious for the explanation. Feel free to share your coffee preferences in the comments section, though ☕️.
Linear Models
A linear classifier essentially learns a function "y = A⋅x + b." Here, y denotes the label, x the features, A the feature weights, and b the feature biases. Note that A and b are vectors holding a single value for each of our six features. Training thus finds the A and b values that yield a positive y when the coffee is good and a negative y otherwise.
Intuitively, b represents how much we value each feature on its own, regardless of the coffee itself (x). This is why the term is called “a bias.” Meanwhile, A weights how much we value the particular features of the coffee. So, for instance, b represents how much we love acidity in general, while A models how each category of acidity changes our base valuation.
Note that linear models perform no inter-feature reasoning: all they do is sum up what they believe about each feature on its own. A linear classifier cannot, for instance, value how the roasting strength interacts with the acidity of the coffee, or how our perception of each element changes given how old the beans are. All it sees is a bunch of independent values.
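As a minimal sketch of this feature-by-feature summing, here is the "y = A⋅x + b" computation in NumPy. The feature encoding and all weight values are made up for illustration; a real model would learn A and b from the expert-labeled dataset.

```python
import numpy as np

# Hypothetical encoding of the six coffee features for one package:
# days since harvest, days since roast, roast strength, body, acidity, fermented.
x = np.array([40.0, 12.0, 2.0, 3.0, 1.0, 0.0])

# Made-up "learned" per-feature weights and biases.
A = np.array([-0.01, -0.05, 0.3, 0.4, 0.2, -0.8])
b = np.array([0.1, 0.2, -0.1, 0.3, 0.5, 0.0])

# The linear model just sums each feature's contribution independently;
# no term ever looks at two features at once.
y = np.sum(A * x + b)
good_coffee = y > 0  # positive score means "good"
```

Notice that each term in the sum touches exactly one feature, which is precisely why no interaction between, say, roast strength and acidity can ever be expressed.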
Decision Trees (DTs)
At each node, a DT looks for the feature that best splits the data. For instance, many packages roasted in the last three months were deemed good, while most stale ones were considered bad. Then, it looks for the next best feature to split each resulting subset. For example, among the recently roasted packages, the coffee body led to the most divided opinions, while for the stale ones, fermented flavors were very informative of the coffee quality.
Again, DTs are incapable of directly mixing features. However, they can exploit some interdependences through their hierarchical nature. In the example, the tree could perceive that fermented flavors matter a lot more for stale coffee than recently roasted ones. Nonetheless, all decisions consider a single feature at a time.
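The example above can be sketched as a hand-written stand-in for a learned tree. The split thresholds and outcomes below are fictitious; the point is the shape of the computation: every decision tests a single feature, yet later tests depend on earlier ones.

```python
def coffee_verdict(days_since_roast: int, fermented: bool, body: str) -> str:
    """Hand-written stand-in for a learned decision tree (fictitious splits)."""
    # Root node: the tree splits first on the single most informative feature.
    if days_since_roast <= 90:
        # For fresh roasts, the next split happens to fall on body.
        return "good" if body in ("medium", "heavy") else "bad"
    else:
        # For stale beans, fermented flavors become the deciding feature —
        # an interdependence the hierarchy captures only indirectly.
        return "bad" if fermented else "good"
```

Fermentation only matters on the "stale" branch, which is exactly the kind of indirect feature interplay described above.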
Neural Networks
A neural network layer is a linear model wrapped in an activation function σ. Mathematically, we get "y = σ(A⋅x + b)." For a three-layer network, we get the composition "y = L₃(L₂(L₁(x))," or, expanding the layer terms, the full expression "y = σ(A₃⋅σ(A₂⋅σ(A₁⋅x + b₁) + b₂) + b₃)." Traditionally, L₃ is named the output layer, while L₁ and L₂ are referred to as hidden layers (for being "hidden" between the input and output).
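The composition above fits in a few lines of NumPy. The weights here are random (untrained) and the layer sizes are arbitrary; the sketch only shows how each layer's output becomes the next layer's input.

```python
import numpy as np

def sigma(z):
    # ReLU as the activation function σ (a common, simple choice).
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
# Random, untrained weights for a 6 -> 4 -> 3 -> 1 network.
A1, b1 = rng.normal(size=(4, 6)), rng.normal(size=4)
A2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)
A3, b3 = rng.normal(size=(1, 3)), rng.normal(size=1)

x = rng.normal(size=6)        # six input features
h1 = sigma(A1 @ x + b1)       # hidden layer 1: first feature mix
h2 = sigma(A2 @ h1 + b2)      # hidden layer 2: a mix of mixes
y = sigma(A3 @ h2 + b3)       # output layer
```

Unlike the linear model, every entry of h1 already blends all six inputs, and h2 blends those blends.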
The critical insight is realizing each layer takes as input a set of features and produces another set of features. In other words, it combines features to create a novel representation of the problem. Then, the second layer takes the combined features and outputs a combination of the already mixed features. Lastly, the third and final layer produces the final output from the feature mixture received.
While linear models are blind to feature relationships and decision trees can only indirectly realize them, neural networks constantly mix features to create more elaborate representations. Within the literature, “hidden” layers produce “hidden” features. However, I find it more enlightening to interpret the word “hidden” as neural networks being like detectives who constantly uncover the hidden patterns in the data.
This constant elaboration of features makes neural networks the kings of model flexibility. It is also where the word "deep" in deep learning comes from. Instead of operating on shallow representations, neural networks create layers of understanding, revealing deeper structures in the data.
The Meaning Lies Within
Previously, we considered a simple supervised learning setting. Now, view the following data:
I ask you, almighty human, what is the content of this image?
Trouble finding out, hm?
Yeah, I just treated you like a linear model treats its data: no context, baby.
See, the thing with visual data is that pixels carry little to no value. In other words, a 1920×1080 full HD image is nothing more than a little over two million meaningless features. So the above strip is just a regular image downsampled and flattened out. This has the effect of robbing the image pixels of all their previous spatial relationships.
Here is another example:
In this version, I just randomized pixels. As above, no meaning can be derived from the image. This illustrates how the meaning of images is encoded in the relationship among pixels. Mathematically, this image has all the same pixels as the original, yet, it is utterly unrecognizable.
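The two manipulations above can be reproduced on a toy "image." Flattening discards the 2-D neighborhoods, and shuffling destroys even the 1-D ordering, yet the pixel values themselves are untouched:

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(4, 4))  # a tiny stand-in "image"

# Flattening throws away the 2-D neighborhood structure...
flat = image.flatten()
# ...and shuffling destroys even the remaining 1-D ordering.
shuffled = rng.permutation(flat)

# Yet, mathematically, both versions contain exactly the same pixels.
assert np.array_equal(np.sort(flat), np.sort(shuffled))
```

A model that treats pixels as an unordered bag of values, the way a linear classifier does, literally cannot tell these three arrays apart.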
For your viewing pleasure, here is the original:
And here comes the most crucial paragraph in this whole article:
Neural networks are great at handling complex inputs, such as images, audio, and text, due to their innate ability to combine smaller features into complex abstractions. Over and over, the meaningless pixels are combined into patterns and patterns of patterns.
Conversely, since simpler models are limited in their ability to extract multi-feature patterns, they cannot operate on high-dimensional problems: scenarios with a massive number of individually meaningless features.
As a meta example, no single word or sentence in this article can capture its whole meaning. Instead, the overall message is encoded, little by little, in each word and every sentence. When it comes to audio, a second of music is represented by 44,100 tiny samples. We need to bring this number down to something more manageable, like sequences of notes or phonemes, for any meaningful operation.
Back to Coffee
In our coffee example, all features were entirely meaningful: roasting date, acidity, body, etc. These features exist because we humans perceive them in our mouths, directly or indirectly. Simpler models work on such problems because all the work of finding useful features has already been done. Using a neural network here will almost always bring no benefit.
Now, consider a machine that actually samples coffee, identifying the micro-chemicals that pass through its sensors. The number of features it could produce is immense. This time around, simpler models are no longer fit for the task. Instead, a neural network model could parse through the data jungle and arrive at a limited set of features that are pretty useful for classification.
Most features the neural network comes up with will likely be similar to things we humans might notice. For instance, it might learn to value a low or high concentration of acidic molecules in the sample. Likewise, the lack of more aromatic compounds might indicate stale beans or an older roast.
The experiment can backfire as well. In a previous piece, I highlight how flexible models can often learn spurious patterns that locally solve a dataset but do not generalize to external data. For instance, we might have failed to clean the machine properly the day we sampled all old-roast coffees, and, as a consequence, the algorithm learned to detect dust as a sign of lousy coffee.
The Days Before Neural Networks
As a final remark, before neural networks got extremely popular in the 2010s, we had other techniques to solve these complex tasks. For instance, there were quite a few exciting approaches to face recognition. In particular, many authors devised pattern-matching algorithms to find facial landmarks, such as eyes, noses, and mouths.
For audio processing, authors used many signal processing techniques, such as the Fast Fourier Transform (FFT), to convert the high-dimensional signal into frequency bands or even musical notes and phonemes. Likewise, textual data could be converted to more economical representations, such as bags-of-words and, more recently, word vectors (text embeddings).
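As a small sketch of this kind of feature engineering, here is an FFT collapsing one second of raw audio, 44,100 samples of a pure 440 Hz tone, into a spectrum where a single frequency bin carries all the meaning:

```python
import numpy as np

sr = 44100                             # CD-quality sample rate
t = np.arange(sr) / sr                 # one second of time stamps
signal = np.sin(2 * np.pi * 440 * t)   # a pure A4 note (440 Hz)

# The FFT turns 44,100 raw samples into frequency components.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(sr, d=1 / sr)
dominant = freqs[np.argmax(spectrum)]  # -> 440.0
```

Forty-four thousand meaningless numbers become one meaningful feature, "there is an A4 playing", which is exactly the dimensionality reduction the hand-engineered pipelines relied on.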
Nevertheless, all these techniques share a common theme: someone came up with valuable features to reduce a high-dimensional problem to a more manageable representation. This approach was so common it got a name: feature engineering, a feat of human ingenuity.
What neural networks, especially convolutional networks, brought to the table is a pretty dramatic turn of events: feature learning. Neural networks automate the tedious process of evolving features into more apt representations. Moreover, networks can repeatedly create multi-level representations, which we humans find pretty challenging to do. Finally, they bypass the need for expensive specialists to solve novel tasks.
Back at the beginning of this piece, I said "just one approach is reliably capable of solving simple-to-human tasks: neural networks." The keyword here is reliably. For instance, kernel methods, the machinery behind SVMs, power some fancy techniques, like Kernel PCA and Spectral Clustering, yet you cannot expect SVMs to work reliably for any input data out-of-the-box. The same goes for XGBoost.
A great example of neural network reliability is MIT's work on detecting humans through walls using wireless signal reflections. The authors simultaneously recorded camera images and Wi-Fi reflections from several actors doing routine activities. Then, they ran a pose estimation framework over the camera images and trained a model to replicate the results using the Wi-Fi data instead of the images. The privacy-concerning results are pretty impressive. How did the network accomplish such a feat? No one knows. It learned it all by itself. There is no feature engineering, only raw data and several matrices.
After all, detecting people through walls is a simple task 🙂
This is all for now. Feel free to comment or connect with me if you have any questions about this article or the papers. You can also subscribe to be notified whenever I publish here. By the way, speaking of coffee, you can buy me one if you would like to support my work directly 🙂
If you are new to Medium, I highly recommend subscribing. Medium articles are the perfect pair to StackOverflow for Data and IT professionals and even more for newcomers. Please consider using my affiliate link when signing up.
Thanks for reading 🙂