Neural Networks and Gradient Descent
Neural Networks use Gradient Descent for the same reason Logistic Regression uses it: to find the point of minimum error, so that you can use the resulting model to make predictions. With linear data, this works really well, and the more variables / features you have, the more precise the line you draw can be. Your model can assume there is a single low error point it needs to find. However, as we saw in the previous post, Neural Networks are often figuring out functions that have many curves, so the error can have many low points instead of one. Take this example:
With Neural Networks, you start at a random point and then use Gradient Descent to find the lowest point (the lowest error) you can. Say for example, you start here:
Gradient Descent would flow down the side of the wall you are on until it reaches the nearest **local** minimum, so you’d end up here:
However, the next time the model is built, you might start here:
…and thus end up here:
The first thing to note is that the local minimum you find is almost completely dependent on your starting point. There are roughly four local minimums in this function:
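The dependence on the starting point can be reproduced with a small sketch. The curve below is a hypothetical stand-in with only two local minimums, not the function pictured in this post, and the names `loss`, `slope`, and `descend` are made up for illustration.

```python
def loss(x):
    # Hypothetical bumpy error curve with two local minimums.
    return x**4 - 3 * x**2 + x


def slope(x):
    # Derivative of loss(x): which way is uphill at x.
    return 4 * x**3 - 6 * x + 1


def descend(start, learning_rate=0.01, steps=1000):
    # Plain Gradient Descent: keep stepping against the slope.
    x = start
    for _ in range(steps):
        x -= learning_rate * slope(x)
    return x


left_end = descend(-2.0)   # start on the left wall
right_end = descend(2.0)   # start on the right wall

print(f"start -2.0 -> x = {left_end:.3f}, loss = {loss(left_end):.3f}")
print(f"start  2.0 -> x = {right_end:.3f}, loss = {loss(right_end):.3f}")
```

Under these assumptions the two runs should settle in different places (near x = -1.30 and x = 1.13), and only the left one is the deeper of the two. Nothing changed between the runs except where they started.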
Leaving things up to random chance and finding a really shallow minimum (like the first one from the left) is definitely a risk. Another risk, however, is finding a really deep but narrow minimum (like the second one from the left). Why is this a bad thing? After all, isn’t that technically the point with the lowest error?
The problem is the narrowness. When you run test data through your model, moving just slightly in one direction or another has a big impact on the prediction. Move a small percentage left or right, and your predicted value changes dramatically.
The goal, then, is to find minimums at wide points in the overall function, where small movements don’t result in large predicted changes. The third and fourth minimums from the left would both work nicely for this. But if the minimum you find depends completely on where you happen to randomly start, you won’t get satisfactory results a good percentage of the time.
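To put rough numbers on the difference between a narrow and a wide minimum, here is a minimal sketch. Both basins and the size of the shift are invented for illustration; they are not measured from the function in this post.

```python
def narrow_basin(x):
    # Hypothetical deep but narrow minimum: lowest error is -1.0 at x = 0.
    return 50.0 * x**2 - 1.0


def wide_basin(x):
    # Hypothetical wide minimum: slightly worse lowest error of -0.8 at x = 0.
    return 0.5 * x**2 - 0.8


shift = 0.1  # a small move away from the bottom, e.g. slightly different data

rise_narrow = narrow_basin(shift) - narrow_basin(0.0)
rise_wide = wide_basin(shift) - wide_basin(0.0)

print(f"narrow basin: error rises by {rise_narrow:.3f}")
print(f"wide basin:   error rises by {rise_wide:.3f}")
```

At the exact bottom the narrow basin wins, but after the same small shift its error has climbed past the wide basin's. That is the sense in which the wide minimum is the safer one to land in.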
Stochastic Gradient Descent
Throwing the Marble at The Bowl I Made in 2nd Grade Art
Enter regularization via Stochastic Gradient Descent. Stochastic is just a fancy word for random in this context. Let’s say you have a training data set with 2,000 examples. With traditional Gradient Descent, you use all of the examples available to determine the direction of each step, meaning you’re going to have a really good chance of finding the lowest point closest to you. With Stochastic Gradient Descent, you randomly pick a small subset of those examples at each step (say, just 32 of them) and base your calculation on that tiny fraction of the data. Rather than dropping a marble in a perfectly concave bowl, that’s more like throwing it every time, at an angle, and intentionally making it bounce pretty hard, and maybe not even in the right direction. It’s intentionally introducing significant noise into your estimate of which way is downhill.
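Here is a rough sketch of how noisy that estimate gets. Everything in it is an assumption made for illustration: the data set is synthetic (2,000 examples of y = 3x plus noise), the model is a single weight `w`, and `gradient` is a made-up helper name.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical training set: 2,000 (x, y) examples of y = 3x plus noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(2000)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.5)) for x in xs]


def gradient(w, examples):
    # Gradient of the mean squared error for the model y_hat = w * x.
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)


w = 0.0
full_gradient = gradient(w, data)  # traditional: all 2,000 examples

# Stochastic: each estimate uses a fresh random batch of 32 examples.
batch_size = 32
mini_gradients = [
    gradient(w, random.sample(data, batch_size)) for _ in range(200)
]

average = sum(mini_gradients) / len(mini_gradients)
spread = max(mini_gradients) - min(mini_gradients)

print(f"full-batch gradient:          {full_gradient:.3f}")
print(f"average of 200 mini-batches:  {average:.3f}")
print(f"spread across mini-batches:   {spread:.3f}")
```

The mini-batch estimates should average out close to the full-batch gradient, but any single one can be well off. That per-step error is the bounce in the marble analogy.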
What does this buy you? Well, for starters, you’re going to bounce well out of shallow minimum areas pretty quickly, thus avoiding the first minimum in our function. Then, if you do happen to land in or near a deep but narrow minimum, you’re going to pretty quickly bounce your way out of that as well. Stochastic Gradient Descent can only settle into a minimum, then, that is both wide enough and deep enough to contain it despite all of the bouncing around. Again, in our example function, either the third or fourth minimum areas would work, and while the fourth minimum is a lower absolute minimum than the third, either one would likely result in satisfactory predictions. Run your model a few times and you’ll likely find both, and then you can average the results (or combine them however you like) to make the final prediction.
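The "run it a few times and average" idea can be sketched as below. This is a toy under stated assumptions, not a Neural Network: the model is a single weight fitted to synthetic y = 3x data, so its error has only one minimum and every run lands near the same answer. With a real network, different runs could settle in different wide minimums. The helper `train_sgd` and all parameter values are hypothetical.

```python
import random

# Hypothetical training set: 2,000 (x, y) examples of y = 3x plus noise.
data_rng = random.Random(0)
data = []
for _ in range(2000):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.5)))


def train_sgd(seed, steps=300, batch_size=32, learning_rate=0.1):
    # One training run: random starting point, random batch at every step.
    rng = random.Random(seed)
    w = rng.uniform(-5.0, 5.0)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        w -= learning_rate * grad  # noisy step downhill
    return w


# Build the model five times, each with a different seed.
weights = [train_sgd(seed) for seed in range(5)]

# Average the five predictions for one new input.
x_new = 0.5
predictions = [w * x_new for w in weights]
ensemble_prediction = sum(predictions) / len(predictions)

print("weights per run:", [round(w, 3) for w in weights])
print(f"averaged prediction for x = {x_new}: {ensemble_prediction:.3f}")
```

Each run ends at a slightly different weight because of the random batches, and averaging the predictions smooths out that run-to-run wobble.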