


  • The basic building block of machine learning is optimizing some objective function, such as a mean squared loss or a log-likelihood. Minimize the loss iteratively, optimizing the weights and biases in the process, and you will get a well-performing model.
  • The optimization of weights can be done in two ways: (i) the line search method and (ii) the trust region method. In line search, we select a step-size length (the learning rate) and update our weights along a chosen direction. In trust region, we first fix a region around the current point (bounded by a maximum step size) and within that region find a point of improvement. Within each trusted region you choose a local point of improvement, whereas in line search you fix a direction toward a distant target and take steps along it; region-wise improvement tends to be more stable.
  • Adaptive learning rates are a good alternative, but they are regulated by some schedule, e.g. after a certain number of iterations, decrease your learning rate. This brings its own problem of choosing the proper point at which to reduce the learning rate, and RL algorithms are sensitive to hyperparameters.
  • Drawing a parallel to real life, imagine moving through an open area with a fixed-range flashlight. You trust your senses within the lit region and only take steps where you can see you are moving in the right direction, giving you stable updates in your movement.
  • Now to the limitations of line search methods. Suppose you are a mountain climber. Under line search you move with a fixed step-size length, so there is a high chance of over-stepping the peak and falling off a cliff. In the trust region method you fix a maximum step size and analyze, within that region, where to go. This improves stability, and you don’t have to fiddle with a sensitive hyperparameter.
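The contrast above can be sketched numerically. Below is a minimal toy comparison, with made-up constants, between a fixed-step update (line-search flavour) and a step whose size is capped by a maximum (trust-region flavour) on f(x) = x⁴, whose gradient grows quickly away from the minimum at 0:

```python
import numpy as np

def grad(x):
    # gradient of the toy objective f(x) = x**4
    return 4 * x ** 3

def fixed_step(x, lr=0.1):
    # line-search flavour: fixed step size, no safeguard
    return x - lr * grad(x)

def trusted_step(x, lr=0.1, max_step=0.5):
    # trust-region flavour: cap the update magnitude
    step = np.clip(lr * grad(x), -max_step, max_step)
    return x - step

x = 3.0
# fixed step:  3 - 0.1 * 108 = -7.8  (overshoots past the minimum)
# capped step: 3 - 0.5       =  2.5  (stable progress toward 0)
```

The fixed step overshoots wildly where the gradient is large, exactly the “falling off a cliff” failure described above, while the capped step stays inside its trusted region.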


  • In the previous article on PG methods, we jotted down a few limitations of policy gradient. These methods are data-inefficient: they throw away the previously learnt policy and generate a new set of transitions in each iteration. Off-policy algorithms like DDPG solve this with a replay buffer that stores previous transitions; we then sample a random mini-batch to calculate the loss function and optimize our network parameters.
  • But this gives an unstable update. Initially our estimates are bad, so the generated policy is sub-optimal → as we reuse this old bad policy to generate new transitions, we collect bad samples → which worsens our policy further. With a large step size a bad policy is generated in each iteration; with a small step size, convergence is slow.
  • Also, it has been shown empirically that the critic network, which estimates the value function of a state, IS MORE SENSITIVE TO NETWORK WEIGHTS THAN THE ACTOR’S POLICY.
Taken from the video; π_θ is the actor’s policy
  • So to achieve a stable update, we update the weights of the actor network directly, i.e. update our policy based on the previously learnt policy. Using the old policy as the base for the new one serves as a proxy for a replay buffer, so the algorithm is ON-POLICY. It is not truly on-policy in the strict sense, where a completely new trajectory is generated and the data is thrown away after one gradient update; here we keep the previously learnt policy and build the update on top of it.
  • Why does this help? It reduces variance. Before seeing how, we need to understand “importance sampling”: estimating an expectation under one distribution by sampling from another distribution.
Taken from this video
  • Here we have two distributions p(x) and q(x). We want to calculate the expectation of a function f(x) under the distribution p(x), where p(x) is the new policy and q(x) is the old policy. Playing in our reinforcement learning environment, we sample a trajectory given our current policy and want to estimate its total reward f(x). We could sample directly from the current policy p(x), but that is noisy while we are still learning it, so we instead sample from the old policy q(x) and reweight. Comparing the variance before and after importance sampling, we get
The variance of a distribution
Taken from the video
  • The LHS is before importance sampling and the RHS is after; they are nearly identical except for the first term on each side. If p(x)/q(x) ≫ 1 we get a larger variance, meaning our current policy is very different from the old policy. But if p(x)/q(x) is close to or below 1, we get lower variance (inherent regularization). SO IT IS SUITABLE TO UPDATE OUR POLICY BY UTILIZING THE PREVIOUSLY LEARNT POLICY, AND WE HAVE CONTROL OVER THE VARIANCE OF OUR POLICY.
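The mechanics of importance sampling can be shown in a few lines. This is a minimal Monte-Carlo sketch, using two illustrative Gaussians rather than actual policies: we estimate E_p[f(x)] while drawing samples only from q, weighting each sample by the ratio p(x)/q(x):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_under_p_via_q(f, n=100_000):
    """Estimate E_p[f(x)] using samples drawn from q only.

    q is N(0, 1) (stand-in for the old policy) and
    p is N(0.5, 1) (stand-in for the new policy); both illustrative.
    """
    x = rng.normal(0.0, 1.0, size=n)            # sample from q
    log_w = -0.5 * ((x - 0.5) ** 2 - x ** 2)    # log p(x) - log q(x)
    w = np.exp(log_w)                           # importance ratio p/q
    return np.mean(w * f(x))

# E_p[x] is the mean of p, i.e. 0.5 — the weighted estimate recovers it
est = estimate_under_p_via_q(lambda x: x)
```

Because p and q here are close, the ratios stay near 1 and the estimator is well-behaved; if p drifted far from q, the ratios would explode and so would the variance, which is exactly the point made above.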


  • The gradient of a function is only accurate close to the point where it is computed; here, the gradient of our objective function is only accurate close to the current policy. This is optimization within the constraints of a trusted region, so our new policy must not move too far away from the current policy.
  • This can be seen as a gradient update with a constraint: we don’t want our updated policy to move far away from the current policy. This brings us to KL (Kullback-Leibler) divergence.
  • KL divergence measures how different two distributions are — a notion of distance between them, though not a symmetric one.
KL divergence is a divergence of p(x) [old policy] from q(x) [new policy]
  • The intuition for the KL divergence score is that when the probability for an event from P is large, but the probability for the same event in Q is small, there is a large divergence. When the probability from P is small and the probability from Q is large, there is also a large divergence, but not as large as the first case. (Note here P and Q are the old and new policy probability distribution respectively).
  • It conveys that if the old policy P assigns a high probability to an action a, the new policy Q should assign a probability close to P’s. But if P assigns a low probability to the same action a, Q can assign a higher probability without increasing the divergence much. So it encourages exploration, under constraints, for actions that had low probability in the old policy.
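The KL divergence for discrete action distributions is a one-liner, and computing it both ways shows the asymmetry just described. The two policy distributions below are made up for illustration:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p_old = [0.7, 0.2, 0.1]   # old policy: confident in action 0
q_new = [0.4, 0.3, 0.3]   # new policy: more spread out

kl_pq = kl_divergence(p_old, q_new)   # divergence of Q from P
kl_qp = kl_divergence(q_new, p_old)   # the reverse direction differs
```

Because KL(P‖Q) ≠ KL(Q‖P), shifting probability away from an action the old policy strongly favours is penalized differently from adding probability to an action it mostly ignored, which is what makes it a useful constraint on policy updates.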
  • So TRPO with KL penalty as a constraint is given by
TRPO natural gradient update.
  • In the gradient, the old policy in the denominator is a constant, so it does not affect the gradient. The advantage function is calculated from the previous iteration of the critic network; if we are not using a replay buffer, we can use the transitions collected by the previous policy to estimate it. The advantage function can be estimated from an empirical estimate of the value function V(s).
TRPO estimates the value function with a Monte-Carlo regression approach, but constrains the update (as it does for the policy)
  • But most of our deep learning frameworks, like TensorFlow and PyTorch, cannot directly optimize under a constraint, so this natural-gradient update cannot be readily implemented. Hence PPO Algorithm — (I) is suggested, which looks like the following
The TRPO objective with the hard KL constraint moved into the loss as a penalty. It looks a lot like L2 regularization, hence it has inherent regularization
  • The hyperparameter β is made adaptive by setting a target distance d_{targ}; the above equation then serves as the loss function, written L_{KPEN}.
Adaptive KL penalty
  • If the measured KL divergence falls well below d_{targ}, the penalty is constraining the update too heavily and progress is slow, as the new policy barely differs from the current one. To remedy this, β is decreased, weakening the penalty and allowing larger gradient updates. Similarly, if the measured KL rises well above d_{targ}, β is increased to pull the new policy back toward the old one.
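The adaptive-β rule is only a few lines. The thresholds 1.5 and the doubling/halving factor 2 are the heuristic constants used in the PPO paper’s KL-penalty variant; d_targ remains a tunable target:

```python
def update_beta(beta, measured_kl, d_targ):
    """Adapt the KL penalty coefficient after each policy update.

    If the policy moved too little (KL well below target), loosen the
    penalty; if it moved too much (KL well above target), tighten it.
    """
    if measured_kl < d_targ / 1.5:
        beta /= 2.0        # constraint too tight -> allow bigger steps
    elif measured_kl > d_targ * 1.5:
        beta *= 2.0        # constraint too loose -> rein the policy in
    return beta
```

The doubling/halving is deliberately coarse: the algorithm is not very sensitive to β’s exact value, only to keeping the measured KL in the right ballpark.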


  • This algorithm has a simpler objective than the above two. Its loss function is derived from clipping the importance-sampling ratio
Importance sampling ratio
  • The final loss function is derived by multiplying the above ratio with the Advantage function.
The CLIP loss, together with the MSE loss between the value estimates of the target critic and the critic network, forms the combined loss function backpropagated through the policy network
  • It is a much simpler implementation, yet its performance is superior to the others. ϵ is a hyperparameter; the paper found ϵ = 0.2 to give the best results. The probability ratio is clamped between 1 − ϵ and 1 + ϵ before being multiplied by the advantage function. So, without deviating too much from your old policy, you update your current policy inside the trusted region.
  • For advantage A > 0, the new policy’s gain is capped at (1 + ϵ) times the old probability; for A < 0, the benefit of shrinking the ratio is capped at (1 − ϵ) times the old probability. In effect, the probability ratio is clipped at 1 + ϵ or 1 − ϵ depending on whether the advantage is positive or negative.
  • Without the minimum operator, this could lead to overestimation; taking the minimum gives a pessimistic bound. Hence, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.
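The clipped surrogate objective L^CLIP from the PPO paper can be written directly; here is a NumPy sketch for a single (ratio, advantage) pair, where `ratio` denotes π_new(a|s) / π_old(a|s):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """L^CLIP = min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # the min makes the bound pessimistic: extra gain from a large
    # ratio change is ignored, but extra loss is always kept
    return np.minimum(unclipped, clipped)

# A > 0, ratio above 1+eps: the gain is capped at (1+eps) * A
# A < 0, ratio below 1-eps: the benefit is capped at (1-eps) * A
# A < 0, ratio above 1+eps: the full (worse) unclipped term is kept
```

In training this would be negated and averaged over a mini-batch to form the loss that gradient descent minimizes.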


  • In machine learning, we continuously learn against an objective function. Our estimates of the true values get better over iterations, which is called reducing bias. As you play more episodes, your estimates of the value function of states get better, thus reducing bias.
  • But as you continuously learn, your estimates fluctuate, which increases the variance of your approximation: your estimate for the same state keeps changing as you learn. If your variance increases, you may not be able to accurately predict the output for unseen data. So if you encounter a previously unseen state that is nearly similar to a state you estimated earlier, your estimate may be wrong because of high variance (too many candidate value functions to choose from).
  • So you need to balance bias against variance (the bias-variance trade-off). If your model overfits the training data, it has low bias (low training error) but high variance (high test error). Likewise, if we sample a whole episode to estimate our value function, the agent gets good at the current trajectory, which decreases bias, but variance increases because it may fail on another trajectory.
  • SO TD-LEARNING HAS LOW VARIANCE AND HIGH BIAS, as it makes only one-step updates, which change slowly. SIMILARLY, MONTE-CARLO HAS HIGH VARIANCE AND LOW BIAS, as it updates using the whole episode for a given state. With a one-step update you may not approximate the value function of a state well, but the state is affected only by the next state’s approximation, hence less noise (low variance). In Monte-Carlo, a state is updated based on information from the whole episode, which exerts too much influence on the current state, hence high variance.
  • To strike a proper balance, we use Generalized Advantage Estimation (GAE). It discounts the advantage terms by γ and weights the multi-step TD residuals by the GAE factor λ, which interpolates between one-step TD (λ = 0: low variance, high bias) and Monte-Carlo (λ = 1: high variance, low bias).
Implementation Source
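GAE reduces to a short backward recursion over the TD residuals δ_t = r_t + γ·V(s_{t+1}) − V(s_t); a minimal sketch, with illustrative default values for γ and λ:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via backward recursion.

    `values` must contain one extra entry, V(s_T), for bootstrapping.
    lam=0 recovers the one-step TD advantage (low variance, high bias);
    lam=1 recovers the Monte-Carlo advantage (high variance, low bias).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Sweeping λ between 0 and 1 is exactly the bias-variance dial described in the bullets above: in practice values like 0.95 keep most of the Monte-Carlo signal while damping the noisiest long-horizon terms.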


  • PPO Algorithm with a clipped objective performs better than other algorithms. It also has a simple implementation.
  • PPO use cases exist in finance: initialize your policy from an existing trading strategy and then iterate within a trusted region to improve it. This leads to faster convergence in back-testing, since the trading strategy is sufficient but does not generalize; using the generalizability of neural networks, you can generalize your trading strategy.
  • PPO does not need replay-buffer storage, as it utilizes the already-learnt old policy to update the next policy, so the old and new policies are never far apart. And unlike in policy gradient, where the old policy is thrown away after each iteration, here we reuse the old policy, making PPO more sample-efficient.
  • The trusted-region update keeps the previously learnt policy as the baseline, which somewhat guarantees that your policy always improves. GAE then helps you balance bias and variance. PPO is a primary algorithm at OpenAI — do check it out.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
