Implementing a Neural Net in CUDA From Scratch, Part 6: Training


Photo by Balazs Busznyak on Unsplash


In this article, we will train a neural network, which is a sequential module composed of linear layers and ReLUs, on real data using all we’ve done thus far. You can find the GitHub repository here.

Without further ado, let’s get coding!


The fun part of projects like this one is trying them out on actual data and measuring how well they do compared to other models. Sadly, after mulling over which real-world dataset we could train on, I came away empty-handed: ultimately, our model is too rudimentary and inflexible to be used on every (or in fact any) complex dataset. No batch norm means we can't train deep networks, and hence accuracy on highly nonlinear problems would be low. Datasets with categorical features must also be crossed out because we don't have an embedding layer and one-hot encoding sounds like too much work. Finally, the lack of a strong optimizer like Adam immensely handicaps the model's ability to converge quickly.

Therefore, I settled for a synthetic regression dataset with 100,000 rows and 50 features (available in the GitHub repo). I have forgotten the exact method used to generate it, but I do recall it was similar to Scikit-learn’s make_regression function, with the main distinction being that it used a nonlinear model as opposed to make_regression’s linear one.

Assuming the independent and dependent values are in two CSV files, we first need to read them into arrays. There are already many, many, many excellent libraries on par with or even more efficient than Python’s pandas, but it would be a good exercise to implement our own CSV parser. It is surprisingly straightforward at the expense of being slow, which is all right for our usage:
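A minimal sketch of such a parser, under the assumption that the files contain purely numeric fields with no headers or quoting (the function name parse_csv and the stream-based interface are choices made here for illustration; the repo’s version works on file paths directly):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Minimal CSV parser: slow but simple. Reads from any std::istream
// (an std::ifstream in practice), splits each line on commas, and
// converts every cell to a float. Assumes numeric-only data with no
// quoting or header row.
std::vector<std::vector<float>> parse_csv(std::istream& in) {
    std::vector<std::vector<float>> rows;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue; // skip blank lines
        std::vector<float> row;
        std::stringstream ss(line);
        std::string cell;
        while (std::getline(ss, cell, ','))
            row.push_back(std::stof(cell)); // per-cell conversion: slow but clear
        rows.push_back(std::move(row));
    }
    return rows;
}
```

In practice you would open the two CSV files with std::ifstream and call this once per file.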

After loading our data, we would do:


The pieces required for creating and training a neural network are already at our disposal, but it would be nice to have another layer of abstraction on top to make life easier. Specifically, a function that receives a neural network (i.e. a sequential module) in addition to some data and takes care of training for us, so we wouldn’t be forced to write a fit function every time. Again, we’ve encountered the following lines of code many times before; all we’re doing is wrapping them up inside a function for ease of use.

Like always, we start with pure C++. First, an instance of MSE_CPU needs to be created because it’s not given by the user:

Next, we need a copy of the input because every time sequential.update() is performed, the input given to the model is replaced by its gradients, and we need the original input for the subsequent epoch. A placeholder for out in sequential.forward(inp, out) is also necessary:

Then, for n_epochs, we run propagation and update the net:

When training is finished, we can make a final set of predictions and evaluate the model’s loss:

Put together, that’d be:
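Here is a sketch of the overall shape of such a fit function. TinyModel and MSE_CPU below are minimal hypothetical stand-ins for the series’ Sequential and MSE_CPU classes (whose real interfaces live in the earlier parts); the point is the structure of the loop: create the loss internally, copy the input each epoch, forward, backward, update, and evaluate at the end.

```cpp
#include <cmath>
#include <vector>

// Stand-in MSE loss: forward() computes the loss, backward() its
// gradient with respect to the predictions.
struct MSE_CPU {
    float forward(const std::vector<float>& pred, const std::vector<float>& targ) {
        float loss = 0.0f;
        for (std::size_t i = 0; i < pred.size(); ++i) {
            float d = pred[i] - targ[i];
            loss += d * d;
        }
        return loss / pred.size();
    }
    std::vector<float> backward(const std::vector<float>& pred, const std::vector<float>& targ) {
        std::vector<float> g(pred.size());
        for (std::size_t i = 0; i < pred.size(); ++i)
            g[i] = 2.0f * (pred[i] - targ[i]) / pred.size();
        return g;
    }
};

// One-weight model (y = w * x) standing in for the network. Like the
// article's sequential.update(), update() clobbers `inp` with the
// gradients w.r.t. the inputs -- which is exactly why fit() keeps a copy.
struct TinyModel {
    float w = 0.0f;
    void forward(const std::vector<float>& inp, std::vector<float>& out) {
        out.resize(inp.size());
        for (std::size_t i = 0; i < inp.size(); ++i) out[i] = w * inp[i];
    }
    void update(std::vector<float>& inp, const std::vector<float>& grad, float lr) {
        float gw = 0.0f;
        for (std::size_t i = 0; i < inp.size(); ++i) gw += grad[i] * inp[i];
        for (std::size_t i = 0; i < inp.size(); ++i) inp[i] = grad[i] * w;
        w -= lr * gw; // plain SGD step
    }
};

template <typename Model>
float fit(Model& model, const std::vector<float>& inp,
          const std::vector<float>& targ, int n_epochs, float lr) {
    MSE_CPU mse;            // created here, not supplied by the user
    std::vector<float> out; // placeholder for the model's predictions
    for (int epoch = 0; epoch < n_epochs; ++epoch) {
        std::vector<float> cp_inp = inp; // keep the original input intact
        model.forward(cp_inp, out);
        std::vector<float> grad = mse.backward(out, targ);
        model.update(cp_inp, grad, lr);  // cp_inp is sacrificed, inp survives
    }
    // final predictions and loss on the untouched input
    std::vector<float> cp_inp = inp;
    model.forward(cp_inp, out);
    return mse.forward(out, targ);
}
```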

For CUDA, the only differences are the memory management for cp_inp and MSE_CPU being replaced by MSE_GPU:

Almost there!

The Finish Line

Let’s do this end-to-end! To read the data, we need to initialize two arrays, one for the input variables and the other for the target variables:


And call read_csv:

Identical code for CPU and GPU

Then, we create a neural network with two linear layers and a ReLU, although you can test out other architectures and see how the speed & accuracy are affected:


Lastly, we simply call the training function:


Here are the results, benchmarked by me and pitted against PyTorch (on a GPU): our model’s MSE is 8e-2, whereas PyTorch got 5e-1 (averaged over five runs). Initially, I assumed our program was flawed, but I quickly realized that the significant gap between the two scores comes from PyTorch initializing weights & biases differently; after accounting for that, PyTorch too got an error of roughly 8e-2.
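For reference, PyTorch’s default nn.Linear initialization draws both weights and biases from U(−k, k) with k = 1/√fan_in, so one way to align the two programs is to copy that scheme. A sketch (kaiming_uniform is a hypothetical helper name):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Fill a (fan_out x fan_in) weight matrix with samples from
// U(-k, k), k = 1/sqrt(fan_in) -- matching PyTorch's nn.Linear default.
std::vector<float> kaiming_uniform(std::size_t fan_in, std::size_t fan_out,
                                   unsigned seed = 0) {
    std::mt19937 gen(seed);
    const float k = 1.0f / std::sqrt(static_cast<float>(fan_in));
    std::uniform_real_distribution<float> dist(-k, k);
    std::vector<float> W(fan_in * fan_out);
    for (float& w : W) w = dist(gen);
    return W;
}
```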

In terms of execution time, however, PyTorch is the undisputed champion, with training taking roughly 0.9 seconds. For comparison, our CUDA program took 6.3 seconds, while its CPU counterpart took a whopping 116 seconds. So, PyTorch is seven times faster than us. Doesn’t that wholly and absolutely defeat the purpose of CUDA as discussed in the first article?!?! Well, there are three main points to bear in mind:

  1. Our code isn’t efficient. Not one bit. The sole goal was always to get a working solution and to get it with the minimum amount of effort in the shortest amount of time. We didn’t care about memory usage, we didn’t care about how neat the code is, and we certainly didn’t care about speed. By simply fine-tuning the block size, I was able to get the training time down to 2.8 seconds, a 55% decrease. Long story short, there are myriad changes we could make to further optimize things and improve performance.
  2. Getting to even 80% of popular CUDA libraries’ performance is nigh impossible, and our case is no different. Had we opted for a package such as cuBLAS to do the heavy work of matrix multiplication (or better yet, cuDNN for the entire network), we would’ve easily exceeded PyTorch’s speed by an appreciable margin while doing half the work. Bottom line? Never write something from scratch if it is available as an easy-to-use, optimized library curated by trained professionals.
  3. Even if one disregards the previous couple of arguments, note that there’s nothing more low-level than the modules we implemented like the linear layer or ReLU, which are the “atoms” of neural networks, so if they’re even a little too slow, the entire network will consequently be greatly slowed down. However, if we were implementing less fundamental and frequent layers (say, dropout on the input), we would match or even surpass PyTorch.

Despite our model’s not-so-stupendous efficiency, it’s still definitely something: We wrote everything from scratch and not just in any language; in C++, which is known for occasionally being… ugly. Moreover, we had to parallelize our modules with CUDA, thus adding another layer of intricacy on top and plaguing our program with a new collection of potential bugs. Last but not least, we had thorough tests in place to notify us of any inconsistencies between our C++ and CUDA modules.

All in all, I would say not bad at all!


In this article, we trained a neural network on a relatively large, 100,000 × 50 synthetic dataset using the modules we’ve created. That required reading CSV files into C++ arrays and performing gradient descent, the latter of which we wrote a convenience function for.

This was the last part of this series, and there’ll be no more articles after this one. I hope you learned at least one thing from my writings and enjoyed reading them as much as I did producing them!

Please, if you have any questions or feedback at all, feel welcome to post them in the comments below and as always, thank you for reading!

Related articles:

Social media:

