Automatic Differentiation in Machine Learning


Manual Differentiation

If you are familiar with calculus, you can calculate the partial derivatives of f with respect to x1 and x2 straight away. But for the benefit of those who need a quick refresher, here are some of the more basic calculus identities we will use:

  • Linear function (i.e. a ∈ R): the derivative of a·x with respect to x is a
  • Polynomial function (i.e. a, n ∈ R): the derivative of a·xⁿ with respect to x is n·a·xⁿ⁻¹
  • Linearity (where f and g are functions): the derivative of f(x) + g(x) is f’(x) + g’(x)

So, given our function f from earlier and the point x1 = 2 and x2 = 3, its partial derivatives with respect to (w.r.t.) x1 and x2 are as follows:

Partial of f w.r.t. x1: ∂f/∂x1 = 2·x1 = 2·2 = 4
Partial of f w.r.t. x2: ∂f/∂x2 = 2
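
As a quick sanity check, here is a minimal Python sketch that evaluates f and these hand-derived partials at the point (2, 3):

x1, x2 = 2, 3
f = x1**2 + 2*x2           # our function f
df_dx1 = 2 * x1            # hand-derived partial w.r.t. x1
df_dx2 = 2                 # hand-derived partial w.r.t. x2
print(f, df_dx1, df_dx2)
>>> 10 4 2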

This method works, but it involves a lot of manual steps. As the function f becomes more complex, those steps become increasingly cumbersome. The good news is that the process can be automated via symbolic differentiation.

Symbolic Differentiation

In symbolic differentiation, the mathematical expression is parsed and converted into elementary nodes. These nodes correspond to basic functions whose differentiation is trivial (constants, polynomials, exponentials, logarithms, trigonometric functions, etc.).

The derivatives of these elementary nodes are then assembled using the rules for combining functions (linearity, the product rule, the quotient rule, and the chain rule) to obtain the final form of f’(x).

We can use the SymPy library to perform this procedure by first defining the x1 and x2 symbols.

import sympy

x1 = sympy.symbols('x1')
x2 = sympy.symbols('x2')

and then our function f

def f(x1, x2):
    return x1**2 + 2*x2

func = f(x1, x2)
print(f'f(x1, x2) = {func}')
>>> f(x1, x2) = x1**2 + 2*x2

Now, we perform the symbolic differentiation by calling the diff() method.

gradient_func = [func.diff(x1), func.diff(x2)]
print(f'gradient function = {gradient_func}')
>>> gradient function = [2*x1, 2]

We get the same result as with the manual differentiation method.

The final step is to substitute the values of x1 and x2 respectively.

gradient_val = [g.subs({x1: 2, x2: 3}) for g in gradient_func]
print(f'gradient_values = {gradient_val}')
>>> gradient_values = [4, 2]

Nicely done!

However, for large functions, these expression graphs can become extremely big. This can result in slow performance despite our best efforts at pruning. This is where automatic differentiation comes into the picture!

Automatic Differentiation

Autodiff is used by some of the more popular deep learning frameworks, such as TensorFlow and PyTorch, because of its simplicity and highly efficient way of computing derivatives.

There are two modes in automatic differentiation: forward mode and reverse mode.

  • Forward Mode: The goal of this mode is to create a computation graph, similar to our symbolic differentiation method above. We split the problem into elementary nodes consisting of arithmetic operations, unary operations, and trigonometric functions. The computation graph of our function is illustrated below:
Our Computation Graph

In this forward pass, we feed the values of x1 and x2 through the steps, as follows. Notice also that we define x3 and x4 as our intermediate steps.

Our forward pass
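
A minimal Python sketch of this forward pass, assuming the graph decomposes into the intermediate nodes x3 = x1**2 and x4 = 2*x2 with f = x3 + x4:

x1, x2 = 2, 3
x3 = x1**2      # elementary node: square
x4 = 2 * x2     # elementary node: multiply by a constant
f = x3 + x4     # elementary node: add
print(x3, x4, f)
>>> 4 6 10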

As expected, our f(2,3) = 10. Now let’s calculate the gradient.

  • Reverse Mode: This is where we calculate the derivative of the function at each of the steps. We then use the chain rule to combine them into the derivatives of f with respect to its inputs.

Let’s calculate the partial derivatives of each step with respect to its immediate input:

Partial derivatives for each intermediary step

Now for the hard part: calculating the partials of f w.r.t. x1 and x2. As hinted earlier, we are going to use the chain rule, traversing the graph from the terminal node back to the x1 node (red line), and likewise back to the x2 node (blue line), as follows:

Chain rule traversing: reverse mode
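
Here is a minimal Python sketch of this reverse pass, again assuming the decomposition x3 = x1**2, x4 = 2*x2, f = x3 + x4:

x1, x2 = 2, 3
# local partials of each step w.r.t. its immediate input
df_dx3 = 1          # f = x3 + x4
df_dx4 = 1          # f = x3 + x4
dx3_dx1 = 2 * x1    # x3 = x1**2
dx4_dx2 = 2         # x4 = 2*x2
# chain rule: multiply the local sensitivities along each path back from f
df_dx1 = df_dx3 * dx3_dx1
df_dx2 = df_dx4 * dx4_dx2
print(df_dx1, df_dx2)
>>> 4 2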

We get the same results as with manual and symbolic differentiation!

The chain rule has an intuitive effect: the sensitivity of f with respect to an input x1 is the product of the sensitivities of each intermediary step between x1 and the output: sensitivities “propagate” along the computation graph.
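
Since PyTorch implements reverse-mode autodiff, we can verify the same gradient with it; a minimal sketch, assuming PyTorch is installed:

import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
f = x1**2 + 2*x2             # forward pass builds the computation graph
f.backward()                 # reverse pass propagates sensitivities to the inputs
print(f.item(), x1.grad.item(), x2.grad.item())
>>> 10.0 4.0 2.0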

Parting Thoughts

This post introduced you to the basic concept of autodiff with a simple running example. I hope you enjoy learning this concept as much as I do!
