Implementing a Neural Net in CUDA From Scratch, Part 2: Linear Forward Pass

Original Source Here

Implementing a Neural Net in CUDA From Scratch, Part 2: Linear Forward Pass

Photo by Nicole King on Unsplash


In this article, we will write the linear forward pass, which requires delving into the CUDA programming model and learning about its memory hierarchy.

Without further ado, let’s get coding!

Parent Layer

On a high level, deep learning layers must perform two tasks:

  1. Receive the input and transform it, A.K.A. the forward pass.
  2. Differentiate the output with respect to the input, A.K.A. the backward pass.

Of course, this is deep learning 101, but my point in outlining the obvious is that every layer needs two methods, a forward pass and a backward pass, and some layers optionally could be updated. Therefore, there could be a parent layer with three empty functions, forward, backward, and update, that the other layers could inherit for consistency (by the way, after writing this series, I better appreciate the rationale behind deep learning libraries’ design choices such as keras.layers.Layer or nn.Module):

Let’s get to the real deal!

Linear Layer

Arguably the trickiest section of this series is implementing the linear layer. If you are comfortable with the details behind a matrix multiplication, the following should be relatively simple. Otherwise, I urge you to read these compendious notes to clear up any confusion you may have relating to a linear layer (they’ll be useful for the linear backward pass too.).

Note: Throughout the rest of these articles, we are going to implement every function or class in both pure C++ and CUDA since I) Once you implement something in pure C++, it’d be easier to write and understand its CUDA counterpart than it would’ve been had you started with CUDA right off the bat and II) Testing our CUDA code is crucial, and Python is an inefficient option since we would have to port the random elements of our network from CUDA/C++ to Python. Two CUDA or C++ programs, on the other hand, communicate nicely, thus comparing their results would be easier.

First, we create the header for the CPU linear layer (CPU & C++ and GPU & CUDA will be interchangeable henceforth):

Linear_CPU‘s constructor stores the arguments it’s been passed, computes the layer’s output size, and initializes the parameters:

Randomly initializing the weights is all right for shallow networks, but deeper nets require smarter initialization to avoid the exploding/vanishing gradients problem (please check this out for an explanation). Kaiming initialization is the go-to method for networks with non-linearities like ReLU, with the formula being a zero-centred Gaussian distribution with a standard deviation of 2/n_in (in reality, it is more complicated than that, but this should suffice for our usage):

Before we resume, there is something to be cleared up first. Normally, we would have multidimensional arrays representing our input (of the shape bs*n_in), weights (of the shape n_in*n_out), and so on. C++ makes such arrays easy to work with, but in CUDA, dealing with higher-rank tensors is a pain compared to vectors. Ergo, every array, regardless of its dimensions, shall be represented as a vector throughout the rest of this series.

Indexing may therefore sound knotty, but it is really not: Given a matrix m of shape a*b, element m[i][j] is referring to the jth value in the ith row, where i < a and j < b (zero-based indexing). To put it another way, we skip i lines and j columns (e.g. when i = 2 and j = 1, we skip the first two lines to get to row 2, then we skip the first element of row 2 to get to column 1), or just i*b + j elements, for getting the desired value.

With that out of the way, let’s write the forward pass itself, which is basically nothing but a matrix multiplication: Iterate through the rows i and columns k of the input and weights respectively, calculate their vector multiplication, and store the result in output[i][k]. For the bias, just add the kth bias to kth column of output.

The forward pass merely needs to store the input & output and call linear_forward_cpu:

Then there’s CUDA. The header is the same as the one above except for two new variables, n_block_rows and n_block_cols :

Forward pass with CUDA should be leagues more involved and lengthy, right? Take a look yourself:

The loop that evaluates the vector multiplication is the same as the one for CPU, and ind_out , ind_inp, & ind_weights seem extremely familiar, don’t they? As a matter of fact, if you replace row with i , col with k, and i with j, the CUDA code would appear just about indistinguishable from the C++ one. The only disparities would be the first line of the function and the replacement of the two fors with an if. Understanding them needs understanding:

Threads, Blocks, and Kernels

Remember the analogy where we had 100 kids working together on 200 additions? Let’s call each student a “thread”, an entity that performs an elementary calculation like the sum of two numbers. There are thus 100 threads, with IDs from one to 100, and each one has been designated an expression to solve.

We can also group the threads into “blocks” for better organization. For instance, there could be five blocks, each with 20 threads. This way, the blocks also could have IDs, and the threads’ IDs would be block-wise, not global (i.e. a thread’s ID, between one and 20, would be its ID inside its block). A thread’s global ID would hence be (its block’s ID)*(#threads per block) + (its ID in that block) (zero-based indexing again. Notice how the structure is the same as the formula for indexing flattened matrices we derived above), and we could use the same formula from the previous article to pair the right threads with the right questions. The blocks themselves can be grouped into “grids”, but we need not know much about them for now.

That is exactly what we do in the first row of linear_forward_gpu: blockDim is the number of threads per block, blockIdx is the block’s ID, and threadIdx is the thread’s ID in its block. But what does “the” in “the block’s ID” and “the thread’s ID” mean?

Enter kernels. A kernel is just a function that is executed by many threads in parallel on the GPU, and each thread’s job differs based on its ID and its block’s ID. It’s like we give each student/thread a card that reads, “Your ID is the number of threads per block times your block’s ID plus your block-wise ID. Solve questions #2*your ID and #(2*you ID)-1”. The beauty is that the threads are assigned identical commands, so one function can take care of everything.

In short, CUDA has grids, and grids are composed of collections of blocks. Block, in turn, are constituted of threads, but that’s as small as we get. The threads are all given the same command that is almost always made up of two parts: I) Find your ID and II) Do something based on that ID. Often, that something is essentially the same for every thread, like multiplying two values, but where those two values come from varies from thread to thread. For example, if we’d like to add two vectors, every thread would just be adding two numbers, but one thread may add the third elements of the two vectors, whereas another would be summing the sevenths elements, and so on.

Just one more thing: How would we tell the threads to execute our function? It is simply:

The number of threads per block largely depends on the task at hand and one’s hardware. Most GPUs can have a maximum of 512 threads per block, but some can go to 1024. I’ve found that something between 32 and 256 is the sweet spot for optimal performance in most cases but again, your code greatly influences the ideal value (Stack Overflow thread).

n_blocks_per_grid, however, has to be chosen carefully. In the case of adding two vectors, we would like each index in the two vectors to be assigned exactly one thread. Therefore, we need to ensure the number of threads is not smaller than the number of elements in each vector, but we also don’t want surplus threads, so it is set to floor((#elements + #thread per block – 1)/#threads per block). There’d hence be enough threads for every element and at most, we’d have #threads per block-1 extra threads on our hands (if you’re confused why the formula works, I encourage you to play around with different values for the number of elements and threads per block. It is also easy to understand it algebraically.).

Here’s the thing though: Adding two vectors is a one-dimensional task, but summing two matrices is inherently two-dimensional. Since we’ve flattened out our matrices, one might be tempted to treat the matrices as width*height-dimensional vectors. However, for more complex operations like matrix multiplication, doing so would complicate things and wouldn’t be efficient. Instead, CUDA offers two-dimensions blocks and grids (three-dimensional too, but that’s beyond the scope of this series), basically meaning our grid would be rectangular, and threadIdx, blockIdx, & blockDim would each have two different attributes, .x & .y, to represent the x- and y-axis.

Diagram of CUDA’s memory hierarchy with two dimensions. The cells in the blocks represent threads. Source

The syntax for launching a kernel with two-dimensional grids and blocks is the same as above, but the values of n_blocks_per_grid and n_threads_per_block need to be modified. They both need to be dim3 instances, a type provided by CUDA that allows for up to three-dimensional grids & blocks. Our application only requires two dimensions, the batch size and the number of output features. Ergo, we would have:

And the rest is the same. Putting it all together gives:

Another way to look at the code above is this: When you have dim3 n_blocks(x, y, z) as the number of your thread blocks, where x = (a+block_size — 1) / block_size , y = (b+block_size-1) / block_size , and z = (c+block_size-1) / block_size , doing kernel<<<n_blocks, n_threads>>>(args)is like you have three nested for loops where the first one iterates from zero to a, the second one from zero to b, and the third one from zero to c, with the kernel being what you do inside the loops. To get the value of the iterators, we would do i = blockDim.x*blockIdx.x + threadIdx.x , j = blockDim.y*blockIdx.y + threadIdx.y , & z = blockDim.z*blockIdx.z + threadIdx.z, and we can proceed like we would with regular loops.

You may be a bit overwhelmed with all of this, and that’s completely fine. My advice is to reread this article a few more times, but if there were still parts you were unsure about, that’s OK. We are going to see the implementation of other functions in CUDA with more explanation, so you can revisit the parts you don’t currently understand with a different perspective and fresh knowledge later on.


In this article, we implemented the linear forward pass and learned about threads, blocks, and kernels, the backbones of CUDA programming.

In the next article, we will write ReLU, MSE, and a sequential module to wrap up our network’s forward pass.

Please, if you have any questions or feedback at all, feel welcome to post them in the comments below and as always, thank you for reading!

Related articles:

Social media:


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: