Safe Reinforcement Learning — Part II
In the previous post, we discussed constrained MDPs, which we used to formulate safe reinforcement learning problems. We then presented the Lagrangian multiplier formalism and demonstrated how the resulting min-max problems could be solved. In this part, we extend our analysis and present safety-augmented MDPs, which come with favourable properties, e.g., they work plug-n-play on top of standard reinforcement learning solvers.
It is worth noting that with the Lagrangian formalism, it is not clear whether we can find the optimal policy using the commonly used representation (i.e., state-conditioned policies), since it is not clear what the equivalent of the Bellman equation is in the constrained case. Hence, the standard policy representation can be limiting. Intuitively, the actions should depend on the safety constraint, but under this representation they do not.
To tackle such limitations and condition action-selection rules on safety features while ensuring Markovian transitions, we will change the state of the MDP to incorporate safety and shape the costs. If successful, we obtain at least the following “cool” properties:
- Enable Markovian transitions, so that the Bellman equations readily hold;
- Enable policies which explicitly consider safety representations and features when selecting actions — since our policy will condition on our safety augmented state;
- Enable a plug-n-play approach since our changes are more on the MDP than on the RL algorithm’s side.
What to Augment:
To understand which variable we should augment the state space with, let us analyse the evolution of the safety constraint. Recall that in our previous post, we defined the constrained MDP optimisation problem as follows:
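For reference, one common way to write such a problem is sketched below. Only the safety cost Z is named in the text; the task cost c, the discount factor \gamma and the budget d are symbols we assume for this sketch. The constraint is written along a trajectory; taking an expectation on its left-hand side gives the usual constrained MDP form.

```latex
\min_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right]
\quad \text{s.t.} \quad
\sum_{t=0}^{\infty} \gamma^{t}\, Z(s_t, a_t) \le d
```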
In the above equation, the part to the left is the standard MDP problem, i.e., find a policy that minimises the task’s cost (negative reward). On the other hand, the component on the right is the budget constraint defined through an additional cost function Z that dictates safety specifications.
We will take a step back and examine the constraint in the above equation in more detail. First, it is worth noting that the above constraint is equivalent to enforcing the following infinite family of constraints, one for each time step:
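Written out with the safety cost Z, and with a budget d and discount factor \gamma (both symbols are assumptions of this sketch), the family of constraints can be expressed as:

```latex
\sum_{k=0}^{t} \gamma^{k}\, Z(s_k, a_k) \le d
\qquad \text{for all } t \ge 0
```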
This is true because we assumed that the instantaneous safety cost is nonnegative, so the accumulated safety cost cannot decrease. Therefore, if the constraint is violated at some time, it remains violated at all later times. It may seem counterintuitive to transform a problem with a single constraint into a problem with an infinite number of constraints. However, as noted in our paper, our goal is to incorporate the constraints into the MDP's state and instantaneous task cost, thus taking safety into account while solving the task. This is easier to do when the constraint is considered at all times. To do so, we begin by tracking a scaled version of the remaining safety budget at a specific time step, which we define as:
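A sketch of this definition is given below, using \lambda_t for the tracker and Z for the safety cost as in the text, together with an assumed budget d and discount factor \gamma. The remaining budget is divided by \gamma^t, which is the scaling referred to above.

```latex
\lambda_t \;=\; \frac{1}{\gamma^{t}} \left( d - \sum_{k=0}^{t-1} \gamma^{k}\, Z(s_k, a_k) \right),
\qquad \lambda_0 = d
```

Under this definition, requiring the accumulated safety cost to stay within the budget at every time step is the same as requiring \lambda_t \ge 0 for all t.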
Interestingly, our scaled version of the remaining safety budget (the equation defined by \lambda_t above) has a simple and Markovian time-independent update which we write as:
Don’t freak out! There is an easy way to derive the above equation. The only missing component is a recursive formula of the constraint:
Upon substituting the above recursive formula in the definition of the “scaled safety tracker”, we arrive at the time evolution equation we presented above and repeat it here for ease of understanding:
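The derivation can be sketched as follows, under assumed notation: w_t denotes the unscaled remaining budget, d the budget and \gamma the discount factor, none of which are named in the text.

```latex
\begin{aligned}
w_t &:= d - \sum_{k=0}^{t-1} \gamma^{k}\, Z(s_k, a_k),
  \qquad \lambda_t = \frac{w_t}{\gamma^{t}}, \\
w_{t+1} &= w_t - \gamma^{t}\, Z(s_t, a_t)
  \qquad \text{(recursive form of the constraint)}, \\
\lambda_{t+1} &= \frac{w_{t+1}}{\gamma^{t+1}}
  = \frac{w_t - \gamma^{t}\, Z(s_t, a_t)}{\gamma^{t+1}}
  = \frac{\lambda_t - Z(s_t, a_t)}{\gamma},
  \qquad \lambda_0 = d.
\end{aligned}
```

The update depends only on the current tracker value and the current safety cost, not on t or on the history.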
It is interesting to notice that this time-evolving equation is Markovian, making it easy to augment as an additional state variable. Of course, we would need to change the transition dynamics of the MDP to accommodate this new state component. We’ll do it later!
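As a quick numerical sanity check, the sketch below compares the recursive update with the closed-form definition of the tracker on a made-up sequence of safety costs. The function names, the budget and the discount value are all hypothetical choices for illustration.

```python
def closed_form_tracker(costs, budget, gamma, t):
    """Tracker at time t: (budget - discounted cost spent before t) / gamma**t."""
    spent = sum(gamma**k * costs[k] for k in range(t))
    return (budget - spent) / gamma**t


def recursive_tracker(costs, budget, gamma):
    """Trackers for t = 0..len(costs) via next = (current - cost) / gamma."""
    tracker = budget
    trajectory = [tracker]
    for cost in costs:
        tracker = (tracker - cost) / gamma
        trajectory.append(tracker)
    return trajectory


costs = [0.5, 0.0, 1.0, 0.25]    # made-up nonnegative safety costs
budget, gamma = 2.0, 0.9         # assumed budget and discount factor

recursive = recursive_tracker(costs, budget, gamma)
closed = [closed_form_tracker(costs, budget, gamma, t) for t in range(len(costs) + 1)]

# The two computations agree up to floating-point error.
assert all(abs(r - c) < 1e-9 for r, c in zip(recursive, closed))
```

Note that the recursive version needs no knowledge of the time index, which is what makes it convenient as an extra state variable.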
For now, we discuss the changes needed to the cost function to transform our problem into a constraint-free form. Since we enforce the constraint at every time step, we can reshape the instantaneous task cost to account for the safety constraint and write a constraint-free problem:
In the above equation, we shaped the cost function such that if the budget remains, the focus is on the task’s cost, while if nothing remains, we highly punish the agent. This, in turn, allows us to write a constraint-free problem acting in the augmented state space.
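In symbols, such a shaping can be sketched as below. The tilde notation, the task cost c and the discount factor \gamma are assumptions of this sketch, and treating \lambda_t = 0 as "budget remains" is a boundary convention we pick here.

```latex
\tilde{c}(s_t, \lambda_t, a_t) =
\begin{cases}
c(s_t, a_t) & \text{if } \lambda_t \ge 0, \\
+\infty & \text{if } \lambda_t < 0,
\end{cases}
\qquad
\min_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \tilde{c}(s_t, \lambda_t, a_t)\right]
```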
Now, we are ready to introduce a new form of MDPs that we dub safety-augmented Markov decision processes, or Saute MDPs. Saute MDPs are derived from constrained MDPs by augmenting the state space with the “scaled safety tracker” and shaping the costs as detailed above.
As is clear from the figure above, a Saute MDP is an MDP with its state space augmented, its transitions modified to accommodate the augmentation, and its costs shaped according to the constraints. Namely:
Notice that all we have done is add the “safety tracker” we derived above to the state space and then shape the task costs. The difference from the shaping we wrote above is that we use a finite value n instead of infinity when no budget remains. The reason is to keep the shaped costs computationally friendly: we penalise the agent with a large positive number instead of infinity.
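A minimal sketch of this finite shaping is given below. The function name, the default value of n and the choice to treat a tracker of exactly zero as "budget remains" are assumptions made for illustration.

```python
def shaped_cost(task_cost, tracker, n=1000.0):
    """Return the task cost while budget remains, else the large penalty n."""
    return task_cost if tracker >= 0 else n


# While the tracker is nonnegative, the agent sees the ordinary task cost.
assert shaped_cost(task_cost=0.3, tracker=0.7) == 0.3
# Once the budget is exhausted, the task cost is replaced by the penalty.
assert shaped_cost(task_cost=0.3, tracker=-0.1) == 1000.0
```

In practice n should dominate the task costs the agent can accumulate; otherwise violating the constraint may still look attractive.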
The remaining ingredient needed in finalising the definition of Saute MDPs is the augmented transition model. This is easy to define since we have already found a Markovian transition rule for the safety tracker:
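One way to write the augmented transition is sketched below, where P denotes the original transition kernel, d the budget and \gamma the discount factor (all three symbols are assumptions of this sketch). The original state moves as before, and the tracker follows its deterministic update.

```latex
s_{t+1} \sim P(\cdot \mid s_t, a_t),
\qquad
\lambda_{t+1} = \frac{\lambda_t - Z(s_t, a_t)}{\gamma},
\qquad
\lambda_0 = d
```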
How to Implement:
Before we go over a specific implementation, it is worth noting that our derivations above are independent of any deep RL algorithm. They are primarily environmental, in the sense that they augment the state space of, for example, an OpenAI Gym environment and then change the cost/reward function via shaping. Once that is done, you can use any deep RL algorithm you'd like!
Our changes allow you to plug in any deep RL algorithm you like to solve the unconstrained problem; hence the plug-n-play nature.
In other words, the main benefit of our approach to safe RL is the ability to extend it to any critic-based RL algorithm. This is because we do not need to change the algorithm (besides some cosmetic changes) but only create a wrapper around the environment. The implementation is relatively straightforward, and the only “trick” we had to resort to is normalising the safety state by dividing it by the safety budget.
As you can see from the above figure, all we need to do is write a safety step function and override the step and reset functions from OpenAI Gym. Pretty simple! The code is also available for you to experiment with. Please make sure to star our library if you find it useful!
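To make the idea concrete, here is a minimal sketch of such a wrapper. It is not the library's actual code: it mimics the classic Gym `reset`/`step` interface without importing Gym, and the class names, the `"cost"` key in `info`, the toy environment and all default values are assumptions for illustration. Since Gym works with rewards rather than costs, the penalty appears as a large negative reward.

```python
class ToyEnv:
    """Hypothetical 1-D environment with a classic Gym-style reset/step API."""

    def reset(self):
        self.pos, self.t = 0.0, 0
        return [self.pos]

    def step(self, action):
        self.pos += action
        self.t += 1
        reward = -abs(self.pos - 3.0)            # task: get close to position 3
        cost = 1.0 if self.pos > 2.0 else 0.0    # safety cost: beyond 2 is unsafe
        done = self.t >= 10
        return [self.pos], reward, done, {"cost": cost}


class SauteWrapper:
    """Sketch of a safety-augmenting environment wrapper."""

    def __init__(self, env, safety_budget, safety_discount=0.99, unsafe_reward=-1000.0):
        self.env = env
        self.safety_budget = safety_budget
        self.safety_discount = safety_discount
        self.unsafe_reward = unsafe_reward       # reward counterpart of the penalty n
        self.safety_state = 1.0

    def safety_step(self, cost):
        # Normalised tracker update: subtract cost / budget, then divide by the discount.
        self.safety_state = (
            self.safety_state - cost / self.safety_budget
        ) / self.safety_discount
        return self.safety_state

    def reset(self):
        obs = self.env.reset()
        self.safety_state = 1.0                  # full budget, normalised to 1
        return list(obs) + [self.safety_state]

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        safety_state = self.safety_step(info["cost"])
        info["true_reward"] = reward             # keep the unshaped reward for logging
        if safety_state < 0:                     # budget exhausted: apply the penalty
            reward = self.unsafe_reward
        return list(obs) + [safety_state], reward, done, info


env = SauteWrapper(ToyEnv(), safety_budget=2.0)
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(1.0)      # naive policy: always move right
```

Any agent that consumes the augmented observation and the shaped reward can then be trained on `env` without further changes.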
In our paper, we experimented with many algorithms, including model-free and model-based ones. In those examples, we showed that Saute RL yields various improvements over Lagrangian methods and CPO. We report some of those results in the figure below:
In this blog, I didn’t detail the CVaR case. In our paper, we do so and elaborate on some theoretical guarantees of Saute MDPs, which I also omitted here. Please make sure to consult our ICML paper for all those details and more.
That’s it! I hope you find this blog helpful for doing safe reinforcement learning. If you find any typos, please let me know, and I will fix them.
Above all, I would like to thank my co-authors on Saute RL, especially Aivar Sootla, who made this work even possible.