Deep Learning: From Fluids to the Schrödinger Equation




In this post I give a brief and hand-wavy summary of my latest research paper preprint [1] and attempt to explain the background topics in plain terms. The primary focus of my research is formulating deep learning as a hydrodynamics system; the motivation is to use tools from fluid dynamics to understand the behaviour of certain deep neural networks. One interesting finding is that a ResNet-type deep neural network, when trained using optimal control, is equivalent to the nonlinear Schrödinger equation. This connection is not straightforward and requires several intermediate transformations. There are three key topics covered in the paper: stochastic optimal control, geometric hydrodynamics and structure-preserving integrators. Here, I will only address the first topic, while the other two will be the subject of a future post.

A simple toy example of using a structure-preserving integrator to discretise a ResNet ODE and learn to classify data points from two spirals.

Residual Neural Network as an Ordinary Differential Equation

There are many types of neural networks, but this article's focus is only on residual networks, ResNet henceforth, which are built on a recurrence relation. In a standard form (the exact conventions vary between formulations), it reads

𝑥^(k+1) = 𝑥^(k) + α σ(𝜃^(k) 𝑥^(k) + b^(k)),

where 𝑥^(k) is the activation of layer k, 𝜃^(k) and b^(k) are the layer's weights and biases, σ is the activation function and α is a step size.

ResNets were invented to alleviate the problem of vanishing gradients by the use of skip connections, sometimes referred to as shortcut connections. A skip connection means that the activation of the previous layer is carried over to successive layers: in the mathematical expression above, the activation 𝑥^(k) appears unaltered inside 𝑥^(k+1). This becomes crucial at the early stages of training, when the network's weights and biases are close to zero, as it guarantees that each layer's activation stays close to the input. As a result, training picks up pace at the beginning of the training process.

Now, consider the function 𝑥(t) evaluated at t + α, where α is a small positive number. Taking its Taylor expansion around t, we get

𝑥(t + α) = 𝑥(t) + α d𝑥/dt + O(α²).

The value 𝑥^(k+1) can be viewed as the value of the function at t + α, where t = kα for an integer k. We substitute the expression above into the left-hand side of the ResNet recurrence relation:

𝑥(t) + α d𝑥/dt + O(α²) = 𝑥(t) + α σ(𝜃(t) 𝑥(t) + b(t)).

Cancelling the like terms, dividing both sides by α and taking the limit α → 0, meaning that the number of layers goes to infinity, we obtain the following ordinary differential equation

d𝑥/dt = σ(𝜃(t) 𝑥(t) + b(t)),

where σ is the activation function. One view is that ResNet is the Euler discretisation of this ODE. Unfortunately, that numerical method is known to have poor stability properties and can end up yielding the wrong solution.

Remark: using an ODE solver to implement a neural network is the core idea of Neural Ordinary Differential Equations.
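To make the correspondence concrete, here is a minimal sketch (my own illustration, not code from the paper), with tanh standing in for σ and randomly drawn weights standing in for trained ones; a forward pass through the layers is precisely an Euler integration of the ODE above.

```python
import numpy as np

def forward(x0, thetas, biases, T=1.0):
    """Run a ResNet forward pass, i.e. integrate the ODE
    dx/dt = tanh(theta(t) x + b(t)) with Euler's method,
    one layer per time step."""
    alpha = T / len(thetas)   # step size shrinks as depth grows
    x = x0
    for theta, b in zip(thetas, biases):
        x = x + alpha * np.tanh(theta @ x + b)  # one ResNet layer
    return x

rng = np.random.default_rng(0)
thetas = [0.1 * rng.standard_normal((2, 2)) for _ in range(20)]
biases = [np.zeros(2) for _ in range(20)]
print(forward(np.array([1.0, 0.0]), thetas, biases))
```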

What are ordinary differential equations? In a nutshell, an ordinary differential equation is a tool to describe how many systems in nature change with time, or with some other parameter. A system dependent on a single parameter t (a real number ℝ) is modelled by a vector 𝑥(t); however, it is often difficult to obtain an exact description of 𝑥(t), so its rate of change is used instead. Hence, the rate of change of 𝑥(t) with respect to t, together with the initial condition 𝑥(0), is used to describe the behaviour of the system. When the system depends on more than one parameter, the equations for the rate of change of the system include partial derivatives and are known as partial differential equations.
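As a concrete example (my own, not one from the paper), take exponential decay: d𝑥/dt = −𝑥(t) with 𝑥(0) = 1. The rate of change and the initial condition together pin down the whole behaviour of the system, namely 𝑥(t) = e^(−t).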

One advantage of the ODE formulation is that it gives us the freedom to choose different discretisation schemes, so we can implement a more stable and more accurate approximation of 𝑥(t). Now, the question is: how do we train such networks?

Optimal Control for Deep Learning

For the time being, we only consider supervised learning and classification problems. Say we have the dataset (y^(i), c^(i)), with i = 0…N, where y^(i) is the input data and c^(i) is its corresponding class. First, we define the projection map 𝜋(𝑥) as a map that takes the network's output 𝑥(T) and maps it to the space of classes; this is needed to define the loss function. Training a neural network seeks to find parameters 𝜃 that make it identify each input of the dataset, 𝑥^(i)(0) = y^(i), with the correct class c^(i), and make it able to predict the class of any input. This is done by minimising a cost functional of the form (schematically; conventions vary)

S(𝜃) = Σᵢ 𝚽(𝜋(𝑥^(i)(T)), c^(i)),

and the minimum is found using a gradient descent technique. Alternatively, the training can be treated as an optimal control problem.
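For reference, a plain gradient descent update, with a hypothetical learning rate η, reads

𝜃^(n+1) = 𝜃^(n) − η ∇S(𝜃^(n)).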

In brief, optimal control is a method that helps in designing a system's trajectories. Here, a trajectory is not given per se; instead, its initial and final positions are given, and the goal is to find a force, or a vector field, that makes the system move from the initial state to the final one. This post does no justice to a rich and important topic, but the curious reader is invited to check the books [2] and [3].

For a general nonlinear system of the form

d𝑥/dt = f(𝑥(t), 𝜃(t)),

with 𝑥(0) = y and t ∈ [0, T], the aim is to find network parameters 𝜃(t) that minimise a cost functional of the form

S(𝜃) = 𝚽(𝑥(T), c) + ∫₀ᵀ L(𝜃(t)) dt.

Here, L is a regularisation of the network's parameters, while 𝚽 measures the deviation from the actual class/category c^(i). At its core, this is a variational calculus problem, where the unknown is a function or a vector field rather than a variable. The minimisation also has to satisfy the constraint that relates 𝑥(t) to 𝜃(t), in this case the ODE for ResNet, and to ensure that, a Lagrange multiplier is added to the cost functional. That is,

S(𝜃) = 𝚽(𝑥(T), c) + ∫₀ᵀ [ L(𝜃) + 𝜆ᵀ( d𝑥/dt − f(𝑥, 𝜃) ) ] dt.

The function 𝜆 plays the role of the Lagrange multiplier. We can rewrite the cost functional as

S(𝜃) = 𝚽(𝑥(T), c) − ∫₀ᵀ [ H(𝑥, 𝜆, 𝜃) − 𝜆ᵀ d𝑥/dt ] dt,

where the function

H(𝑥, 𝜆, 𝜃) = 𝜆ᵀ f(𝑥, 𝜃) − L(𝜃)

is the Hamiltonian function; it depends on the network's parameters (the control) 𝜃, the state 𝑥, and the Lagrange multiplier (referred to as the costate, henceforth) 𝜆.

You might wonder what a Hamiltonian function is and why it appears here. In many cases, it coincides with a system's total energy, i.e. the sum of potential and kinetic energies. Basically, a Hamiltonian is a scalar function that carries all of the information regarding a set of coupled first-order differential equations. These equations are equivalent to the Euler-Lagrange equations which appear in the principle of least action. The solution of the Euler-Lagrange equations is the curve along which a cost functional is at its minimum, hence they often appear in minimisation/maximisation problems.
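A standard example (not one from the paper) is the harmonic oscillator, with H(𝑥, p) = p²/2 + 𝑥²/2. Hamilton's equations, d𝑥/dt = 𝝏H/𝝏p = p and dp/dt = −𝝏H/𝝏𝑥 = −𝑥, recover Newton's law for a unit mass on a spring.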

The function H is required to apply Pontryagin's Maximum Principle to solve the optimal control problem. That said, the Maximum Principle only gives a necessary condition for an optimal solution. Pontryagin's Maximum Principle is very technical and out of the scope of this article; what one needs to know about it is that the optimal control 𝜃 that minimises the cost functional S(𝜃) solves the following Hamilton's equations (in a standard form, together with a terminal condition on 𝜆 determined by 𝚽):

d𝑥/dt = 𝝏H/𝝏𝜆,   d𝜆/dt = −𝝏H/𝝏𝑥,   𝝏H/𝝏𝜃 = 0.

Solving this system of equations analytically is challenging, and for this reason a numerical method is needed. In general, optimal control problems can be solved in a variety of ways. One approach is to use variational integrators, that is, discretising the cost functional and then applying Hamilton's principle, which yields a system of nonlinear equations.

The system of equations can be stated as the vector equation

F(z) = 0,

where z is an unknown vector, and we seek the value of z that is a root of F.

This root-finding is a nonlinear programming problem; it can be solved using Newton's iteration. The second approach is to choose initial conditions for the costate, 𝜆^(i), and integrate the Hamiltonian system; the error between the final condition and the solution is then used to adjust the initial guess, and the process is repeated until the solution is close to the final condition. In my paper, I opted for a structure-preserving scheme to discretise the optimal control problem. What is meant by structure-preserving integrators is that the numerical solution preserves the properties of Hamilton's equations.
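As an aside, Newton's iteration itself fits in a few lines. The sketch below (my own toy illustration, not the paper's solver) approximates the Jacobian with finite differences and solves a small test system.

```python
import numpy as np

def newton(F, z0, tol=1e-10, max_iter=50):
    """Newton's iteration for F(z) = 0 with a
    finite-difference approximation of the Jacobian."""
    z = z0.astype(float)
    for _ in range(max_iter):
        Fz = F(z)
        if np.linalg.norm(Fz) < tol:
            break
        n, eps = z.size, 1e-7
        J = np.empty((Fz.size, n))
        for j in range(n):           # Jacobian, column by column
            dz = np.zeros(n)
            dz[j] = eps
            J[:, j] = (F(z + dz) - Fz) / eps
        z = z - np.linalg.solve(J, Fz)  # Newton step
    return z

# Toy system: intersect the unit circle with the line y = x
F = lambda z: np.array([z[0]**2 + z[1]**2 - 1.0, z[1] - z[0]])
print(newton(F, np.array([1.0, 0.0])))  # ~ [0.7071, 0.7071]
```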

Stochastic ResNets

Instead of working with the deterministic ODE

d𝑥/dt = σ(𝜃(t) 𝑥(t) + b(t)),

we consider the case when the neural network has a degree of uncertainty. This requires the dynamical model to change into a stochastic differential equation (SDE)

dX = σ(X, 𝜃) dt + 𝜈 dW,

where dW is the increment of Brownian motion, playing the role of white noise, and 𝜈 is the strength of the noise. The main difference between an SDE and an ODE is that we have dX instead of d𝑥/dt, because the trajectories of Brownian motion are not differentiable. Thus, it is not a differential equation, per se, but an equation for small increments. The solution of this SDE is a stochastic process X, meaning that X is a random variable that is part of a sequence whose values all come from the same space. An important feature is that X has a density 𝜌(𝑥, t) with the property that

∫ 𝜌(𝑥, t) d𝑥 = 1

for all t ≥ 0. At the initial state, X(0) is known and deterministic, and the probability density function is usually chosen as a Dirac delta 𝛿(𝑥 − y). Imagine we have an infinite number of particles at the point y at time t = 0 and then release them; the particles would spread in all directions. If we trace the trajectory of any one particle we see that it is stochastic, while on a macroscopic scale, the particles' distribution is described by the density 𝜌(𝑥, t).
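This picture is easy to reproduce numerically. The sketch below (my own illustration; the drift and noise level are arbitrary choices) releases many particles from one point and integrates the SDE with the Euler-Maruyama method; the empirical spread of the particles is what 𝜌(𝑥, t) describes.

```python
import numpy as np

rng = np.random.default_rng(1)

def euler_maruyama(drift, x0, nu=0.5, dt=1e-3, steps=1000, n=10_000):
    """Integrate dX = drift(X) dt + nu dW for n particles
    all started at the same point x0."""
    x = np.full(n, x0, dtype=float)
    for _ in range(steps):
        dW = np.sqrt(dt) * rng.standard_normal(n)  # Brownian increment
        x += drift(x) * dt + nu * dW
    return x

# Zero drift: pure Brownian spreading from a Dirac delta at 0
samples = euler_maruyama(lambda x: 0.0 * x, x0=0.0)
print(samples.mean(), samples.std())  # mean ~ 0, std ~ nu * sqrt(T) = 0.5
```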

Probability density functions are not constant in general; they evolve in time. However, their dynamic behaviour is not governed by an ODE but by a PDE called the Kolmogorov forward equation, also known as the Fokker-Planck equation, and in one dimension it is given by

𝝏𝜌/𝝏t = −𝝏(σ(𝑥, 𝜃)𝜌)/𝝏𝑥 + (𝜈²/2) 𝝏²𝜌/𝝏𝑥².

To understand the Fokker-Planck equation, we look at the terms of the partial differential equation separately. The term

𝝏(σ(𝑥, 𝜃)𝜌)/𝝏𝑥,

or

∇·(σ(𝑥, 𝜃)𝜌)

for higher dimensions, is the advection (convection) term: it takes the density and transports it along the vector field σ(𝑥, 𝜃). For example, if we set σ(𝑥, 𝜃) to be a constant vector field c, then the density would move in a straight line as shown in the following figure.

This is what the solution looks like when there is only an advection term in the Fokker-Planck equation.

Meanwhile, the term

(𝜈²/2) 𝝏²𝜌/𝝏𝑥²,

or

(𝜈²/2) Δ𝜌

for higher dimensions, is the diffusion term: it moves density away from areas of high concentration, in all directions, towards areas where the concentration is low.

When the Fokker-Planck equation consists only of the partial time derivative and the diffusion term, it reduces to the heat equation, whose solution smooths out and spreads the initial density over time.
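Both behaviours can be seen with a few lines of numerics. The sketch below (my own illustration, not the paper's solver) integrates a 1-D Fokker-Planck equation with constant drift c and diffusion coefficient D = 𝜈²/2 using explicit finite differences: the advection term transports the bump to the right while the diffusion term flattens it.

```python
import numpy as np

c, D = 1.0, 0.05
nx, dx, dt = 200, 0.05, 0.01           # dt respects the stability limits
x = np.arange(nx) * dx
rho = np.exp(-((x - 2.0) ** 2) / 0.1)  # initial bump centred at x = 2
rho /= rho.sum() * dx                  # normalise to a probability density

for _ in range(300):
    # upwind difference for the advection term (valid for c > 0)
    adv = -c * (rho - np.roll(rho, 1)) / dx
    # centred second difference for the diffusion term
    diff = D * (np.roll(rho, -1) - 2 * rho + np.roll(rho, 1)) / dx**2
    rho = rho + dt * (adv + diff)

print(x[np.argmax(rho)])  # the peak has been advected to ~ x = 5
```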

For example, consider a particle whose motion is described by a stochastic differential equation with a linear restoring drift, the Ornstein-Uhlenbeck process (I use it here as a standard illustration; conventions vary):

dX = −X dt + 𝜈 dW,

whose Fokker-Planck equation is given by

𝝏𝜌/𝝏t = 𝝏(𝑥𝜌)/𝝏𝑥 + (𝜈²/2) 𝝏²𝜌/𝝏𝑥².

This equation has a stationary solution, and it is given analytically by the Gaussian function

𝜌(𝑥) = (1/Z) exp(−𝑥²/𝜈²),

where Z is a normalising factor.

In Deep Waters: Deep Learning as Hydrodynamics Problem

Optimal control for training, as described previously, does not work here, and we need to use stochastic optimal control instead. One key idea is that we now minimise the expectation value of the cost functional. Similar to the deterministic problem, there are a number of ways to solve the stochastic optimal control problem, but sadly, we will not discuss them here. Instead, we focus on a method that uses mean-field games to solve the stochastic optimal control problem. It utilises the same Maximum Principle as for ODEs, but instead of the dynamics of the ResNet being described by SDEs, the Fokker-Planck equation is used.

Mean-field games is part optimisation and part game theory; it aims at finding the optimal control that drives a large number of systems. In other words, the optimal control is determined by how the whole population reacts to change rather than how each individual system reacts. The optimal control problem for training now becomes (schematically, with conventions that may differ from the paper)

min over 𝜃 of S(𝜃) = ∫ 𝚽(𝑥, c) 𝜌(𝑥, T) d𝑥 + ∫₀ᵀ ∫ L(𝜃) 𝜌(𝑥, t) d𝑥 dt,

subject to the Fokker-Planck equation for 𝜌.

As a result, Hamilton's equations are now a system of partial differential equations: a forward equation for the density 𝜌, which is the Fokker-Planck equation, coupled to a backward equation for the costate. Schematically (at the optimal 𝜃, with signs depending on conventions), the pair reads

𝝏𝜌/𝝏t = −∇·(σ(𝑥, 𝜃)𝜌) + (𝜈²/2) Δ𝜌,
𝝏𝜆/𝝏t + σ(𝑥, 𝜃)·∇𝜆 + (𝜈²/2) Δ𝜆 + L(𝜃) = 0,

where the equation for the costate, 𝜆, is a nonlinear backward partial differential equation known as the stochastic Hamilton-Jacobi-Bellman equation.

The next step is to use a change of variables, where the network's drift is decomposed into (schematically; the signs are chosen so that the diffusion cancels)

σ(𝑥, 𝜃) = 𝜔 + (𝜈²/2) ∇log 𝜌,

and when substituting this into the forward Kolmogorov equation, it becomes the continuity equation. In fluids, this equation is the conservation law for mass:

𝝏𝜌/𝝏t + ∇·(𝜌𝜔) = 0,

and the cost functional becomes (again schematically)

S(𝜃) = ∫ 𝚽(𝑥, c) 𝜌(𝑥, T) d𝑥 + ∫₀ᵀ ∫ [ ½|𝜔|² + (𝜈⁴/8)|∇log 𝜌|² ] 𝜌 d𝑥 dt.

Here, the regularising function, L, is chosen to be the L² norm. The term whose coefficient is 𝜈⁴ is the Fisher information metric, and it acts as an indicator for the quality of a measurement. With integration by parts we can obtain from it the so-called quantum potential: using ∫ |∇log 𝜌|² 𝜌 d𝑥 = 4 ∫ |∇√𝜌|² d𝑥, varying this term with respect to 𝜌 produces the potential Q = −(𝜈⁴/2) Δ√𝜌 / √𝜌. Also, an important assumption made here is that the divergence of 𝜔 is zero; this is what makes the cross term of the decomposition vanish. With the change of variables, Hamilton's equations are now (schematically):

𝝏𝜌/𝝏t + ∇·(𝜌𝜔) = 0,
𝝏𝜆/𝝏t + ½|∇𝜆|² = (𝜈⁴/2) Δ√𝜌 / √𝜌.

Notice that the diffusion part has disappeared from both equations; however, we are left with a potential force term in the costate equation. The next step is known as the elimination theorem, and it seeks to eliminate the costate from the system of equations. First, we take a time-independent function 𝜂, take its inner product with the optimality condition, and differentiate in time; then, using the forward Kolmogorov and costate equations together with integration by parts (we omit the calculation here), the result of the elimination is an equivalent system of partial differential equations, which schematically takes the form

𝝏𝜔/𝝏t + (𝜔·∇)𝜔 = (𝜈⁴/2) ∇( Δ√𝜌 / √𝜌 ),
𝝏𝜌/𝝏t + ∇·(𝜌𝜔) = 0.

The right-hand side term of the first equation, involving the partial derivatives of the probability density function, is known as the quantum force. It usually appears in the de Broglie–Bohm formulation of quantum mechanics. This partial differential equation also appears in fluid mechanics, where it is related to the Madelung equation, or the quantum Euler fluid equation. The solution of this fluid problem gives us the ResNet's parameters. In other words, the amount by which the ResNet's parameters change from one layer to the next is similar to how a fluid flows when a quantum force is applied to it.

From Water Waves to Wave Packets: Schrödinger Equation and Deep Learning

The Madelung equation was introduced by Erwin Madelung as a hydrodynamics formulation of the Schrödinger equation. The nonlinear Schrödinger equation reads (in a standard form)

iħ 𝝏ψ/𝝏t = −(ħ²/2) Δψ + f(|ψ|²) ψ,

and this is a Hamiltonian partial differential equation, with the Hamiltonian

H(ψ) = ∫ [ (ħ²/2)|∇ψ|² + F(|ψ|²) ] d𝑥,

where F′ = f.

Indeed, we can use the same change of variables to map deep learning optimal control to the nonlinear Schrödinger equation. The wave function ψ of the nonlinear Schrödinger equation can be decomposed into

ψ = √𝜌 exp(i𝜆/ħ).

This decomposition is known as the Madelung transform. Setting f = 0 and plugging the Madelung transform into the Hamiltonian function for the nonlinear Schrödinger equation, we get

H = ∫ [ (ħ²/2)|∇√𝜌|² + ½ 𝜌|∇𝜆|² ] d𝑥,

and, using |∇√𝜌|² = ¼ 𝜌|∇log 𝜌|², it simplifies to

H = ∫ [ ½|∇𝜆|² + (ħ²/8)|∇log 𝜌|² ] 𝜌 d𝑥.

Since 𝝏𝜆/𝝏𝑥 = 𝜔 and ħ = 𝜈⁴​, we recover the Hamiltonian for mean-field optimal control.

Conclusion

So far I have only talked about how one can start from the training of a ResNet and end up with a fluid dynamics system. However, there are a few things I did not discuss or mention, such as the properties of the ResNet's weights derived from the fluid equations. The link between deep learning and fluids is indeed interesting and, despite the heavy use of mathematics, it opens many avenues for exploration. The primary aim of this post is not to give a crash course on all of the theory, but instead to encourage interest in optimal control for deep learning.

If you have questions regarding anything mentioned here, you can contact me on LinkedIn. Thanks for reading!

References

[1] Ganaba, N. (2021). Deep learning: Hydrodynamics, and Lie-Poisson Hamilton-Jacobi theory. arXiv:2105.09542 [math]. http://arxiv.org/abs/2105.09542

[2] Kirk, D. E. (2004). Optimal control theory: an introduction. Courier Corporation.

[3] Jurdjevic, V. (1999). Optimal control, geometry, and mechanics. In Mathematical control theory (pp. 227–267). Springer, New York, NY.
