Hold Your Horses, My Dear Reinforcement Learning Agent



One of the biggest challenges for reinforcement learning experts is that they can’t directly control the agent’s action behavior. Sometimes making the agent converge and maximize its rewards is not enough; you also want your agent to take smooth actions. Let’s look at a few examples:


Trading

In trading, the biggest force working against traders is transaction cost. The more often you change your portfolio, the more transaction costs you incur. My RL trading agent used to go fully long one day and fully short the next. These doubled transaction costs ate away all my returns. Sad, right?

Heat Exchange

Imagine you are controlling the temperature of steam in a thermal power plant, or of a building. You built an agent that works very well at maintaining a near-constant temperature, but the problem is that the agent jumps from +1 (full heat in) to -1 (full heat out) in just 1 second, or 1 time step. This erratic jump in heat flow could damage your heat flow control system, or might not even be supported by it.

Autonomous cars

Reinforcement learning agents are widely applied to autonomous driving. Imagine the RL agent steering the vehicle from ‘full left’ to ‘full right’ in a single step. You don’t want to find yourself in such an autonomous car.

This is a huge issue when deploying reinforcement learning agents for real-life business use cases. I hope the solution presented in this blog helps with this problem and brings RL closer to business use cases.


Honestly, it’s a research topic, but there are some very reasonable shortcuts. We can address this problem by tweaking the reward function a bit. This blog mainly discusses how to do this simple tweak. It is the first thing I try when I observe this erratic action jump issue.

Add Transaction to Rewards

The idea is simple and elegant: add the ‘change in action’, or action delta, to your reward function as a penalty. Simple, right? Let’s discuss an example to make this concrete.

Heat Exchange Problem

The aim of this blog is to explain a method for dealing with erratic actions through the reward function, so the environment is not described in detail. The environment is simple:

  • Action: continuous action in (-1, 1); +1: max heat in, -1: max heat out
  • Reward: the closer to the target temperature, the higher the reward
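
To make the action and reward description above concrete, here is a minimal sketch of such an environment. It is not the environment used in this blog: the class name `ToyHeatExchangeEnv`, the `gain` parameter and the linear temperature dynamics are all assumptions made for illustration. Only the action range and the shape of the reward follow the description.

```python
import math


class ToyHeatExchangeEnv:
    """Toy stand-in for the heat exchange environment (dynamics are assumed)."""

    def __init__(self, gain=0.1, start_deviation=0.5, max_steps=200):
        self.gain = gain  # assumed: temperature change per unit of action
        self.start_deviation = start_deviation
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # Observation: deviation of the current temperature from the target.
        self.observation = self.start_deviation
        self.steps = 0
        return self.observation

    def step(self, action):
        # Clip to the action range: +1 = max heat in, -1 = max heat out.
        action = max(-1.0, min(1.0, float(action)))
        # Assumed linear dynamics: heat in raises the temperature, heat out lowers it.
        self.observation += self.gain * action
        self.steps += 1
        # Reward is highest (1.0) when the deviation from the target is zero.
        reward = math.exp(-abs(10 * self.observation))
        done = self.steps >= self.max_steps
        return self.observation, reward, done, {}


env = ToyHeatExchangeEnv()
obs = env.reset()
obs, reward, done, info = env.step(-1.0)  # full heat out for one time step
print(obs, reward, done)
```

The `reset`/`step` layout mimics the usual Gym-style interface so the sketch is easy to adapt, but it deliberately has no dependencies beyond the standard library.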

Code for environment Without Reward Tweak:

Observe the reward function on line 68:
self.reward = np.exp(-(abs(10*self.observation)))[0]
It’s based on the deviation of the temperature the agent sets from the target temperature.
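
To get a feel for how sharply this reward falls off, the sketch below evaluates the same expression for a few deviations. It assumes the observation is a scalar deviation from the target temperature and uses `math.exp` instead of `np.exp` to stay dependency-free. The function name `temperature_reward` is made up for this example.

```python
import math


def temperature_reward(deviation):
    """exp(-|10 * deviation|): equals 1.0 at zero deviation and decays quickly."""
    return math.exp(-abs(10 * deviation))


for deviation in (0.0, 0.05, 0.1, 0.3):
    print(f"deviation={deviation:<5} reward={temperature_reward(deviation):.3f}")
# deviation=0.0   reward=1.000
# deviation=0.05  reward=0.607
# deviation=0.1   reward=0.368
# deviation=0.3   reward=0.050
```

Because the reward depends only on the deviation, nothing in it discourages the agent from swinging between extreme actions as long as the temperature stays close to the target.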

Train agent using Stable Baselines

Let’s use Stable Baselines to train the agent and plot the agent’s action profile.

Agent’s Performance Without Rewards Tweak

The agent can maintain the temperature near the set temperature. So the agent is successful.

Agent Performance

Agent’s Actions Without Rewards Tweak

Look at the actions: the agent keeps jumping from the +1 action to the -1 action. This makes the agent unusable in practice.
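
Besides eyeballing the plot, one simple way to put a number on this jumpiness is the average absolute change between consecutive actions. The helper below is a sketch, not something from the original code, and both action traces are made-up examples rather than recorded agent output.

```python
def mean_abs_action_delta(actions):
    """Average absolute change between consecutive actions (0.0 means steady)."""
    if len(actions) < 2:
        return 0.0
    deltas = [abs(b - a) for a, b in zip(actions, actions[1:])]
    return sum(deltas) / len(deltas)


erratic = [1.0, -1.0, 1.0, -1.0, 1.0]  # hypothetical trace jumping between extremes
smooth = [0.2, 0.1, 0.0, -0.1, 0.0]    # hypothetical smooth trace

print(mean_abs_action_delta(erratic))  # 2.0, the largest possible value for actions in (-1, 1)
print(mean_abs_action_delta(smooth))   # roughly 0.1
```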

Now let’s introduce the reward tweak and take another look at the agent’s performance.

Add Transaction to Rewards

I subtracted a transaction cost (delta: action - last_action) from the reward designed earlier. This delta component in the reward function penalizes the agent for taking erratic actions.
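
A minimal sketch of what such a reward could look like is shown below. Two details are assumptions rather than something stated here: the delta is taken as an absolute value so that changes in either direction are penalized, and a hypothetical `penalty_weight` parameter sets how strongly smoothness is traded off against temperature tracking.

```python
import math


def smoothed_reward(deviation, action, last_action, penalty_weight=1.0):
    """Temperature reward minus a penalty on the change in action."""
    base = math.exp(-abs(10 * deviation))  # original temperature-tracking reward
    delta = abs(action - last_action)      # assumed: absolute action delta
    return base - penalty_weight * delta   # penalty_weight is a hypothetical knob


# On target and holding the action steady: no penalty.
print(smoothed_reward(0.0, 1.0, 1.0))   # 1.0
# On target but flipping from +1 to -1: the penalty outweighs the base reward.
print(smoothed_reward(0.0, -1.0, 1.0))  # -1.0
```

With the penalty in place, a full swing from +1 to -1 costs more than the agent can earn from being exactly on target, so holding or gently adjusting the action becomes the better choice. A smaller `penalty_weight` would favor temperature tracking over smoothness.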

Agent’s Performance With Rewards Tweak

The agent is again successful in maintaining the temperature, though not quite as well as before. But the action profile is smoother now, and the action delta is largely contained.

Agent’s actions profile Comparison

As you can clearly observe, the transaction-based reward tweak added smoothness to the action profile.


We learned how to smooth the actions of an RL agent. I hope this blog will help bridge the gap between deployment and research within the scope of reinforcement learning.


