Why do we need adaptive policies for autonomous driving?
When humans drive, they solve many sub-tasks: think of merging in when entering the highway, then keeping a certain speed and distance to other cars while driving on the highway. If you look at the return, however, it usually depends on only one reward function:
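The original equation was lost in extraction; assuming the usual discounted-return definition, the single reward function enters the return as:

R = \sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t)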
We can, of course, extend the reward function to include a reward for each task i:
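One way to write such an extended reward (the weights w_i are an assumption here, not from the article) is a weighted sum over the per-task rewards:

r(s_t, a_t) = \sum_i w_i \, r_i(s_t, a_t), \qquad R = \sum_{t=0}^{T} \gamma^{t} \sum_i w_i \, r_i(s_t, a_t)

A single fixed weighting forces the policy to trade the tasks off against each other.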
But in that case, the policy trades off performance in one task against performance in the other. A better idea would be to switch reward functions between tasks. That, however, requires an adaptive policy: one that first identifies the current task and then solves it. A rollout (or episode) therefore consists of an identification phase followed by an exploitation phase.
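Such a rollout can be sketched in pure Python. The empirical-mean tracker below is a hypothetical stand-in for the adaptive policy (the article's actual agent is a recurrent network); the Beta parameters and epsilon value are illustrative assumptions:

```python
import random

def rollout(arm_means, t_max=100, eps=0.1, rng=None):
    """One episode on a fixed (hidden) task: pull arms, update beliefs, adapt."""
    rng = rng or random.Random(0)
    counts = [0, 0]     # pulls per arm
    means = [0.0, 0.0]  # running empirical payoff estimates
    R = 0.0
    for t in range(t_max):
        # explore with probability eps, otherwise exploit the current best estimate
        if rng.random() < eps or 0 in counts:
            a = rng.randrange(2)
        else:
            a = 0 if means[0] >= means[1] else 1
        r = rng.betavariate(*arm_means[a])  # Beta-distributed payoff
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        R += r
    return R, counts

# early pulls identify the better arm; later pulls exploit it
R, counts = rollout(arm_means=[(2, 8), (8, 2)])  # arm 1 pays ~0.8 on average
```

Within a single episode the agent concentrates its pulls on the better arm, which is exactly the identify-then-solve behaviour the recurrent policy has to learn.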
Let’s look at a simple multi-armed bandit task: a policy has to select one of two slot machines to maximize its return, but it does not know which machine has the higher expected payoff. Each machine’s rewards are drawn from its own Beta distribution.
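A task of this kind can be sketched with Python's standard library: sampling theta picks which arm is better, and each pull draws from that arm's Beta distribution. The specific Beta parameters below are illustrative assumptions, not the article's:

```python
import random

# Two candidate tasks theta: which arm has the higher expected payoff.
# Beta(a, b) has mean a / (a + b).
P_THETA = [
    [(8, 2), (2, 8)],  # theta = 0: arm 0 pays ~0.8, arm 1 ~0.2
    [(2, 8), (8, 2)],  # theta = 1: arm 0 pays ~0.2, arm 1 ~0.8
]

def get_arms(theta, rng=random):
    """Sample one payoff per arm for task theta."""
    return [rng.betavariate(a, b) for a, b in P_THETA[theta]]

rewards = get_arms(theta=1)  # one payoff per arm, each in (0, 1)
```

The name get_arms mirrors the helper used in the training loop below; there it would return a tensor rather than a list.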
Now, we can train our agent on sampled tasks:
import random
import numpy as np
import torch
from torch.nn import LSTM
from torch.nn.functional import softmax, gumbel_softmax
from torch.optim import Adam

rnn = LSTM(input_size=6, hidden_size=2)
optimizer_rnn = Adam(rnn.parameters(), lr=.001)
taus = np.linspace(4, 0.1, n_runs)  # annealed Gumbel-softmax temperature
for e in range(n_runs):
    theta = random.choice(p_theta)  # sample a task
    m_t = (torch.randn(1, 1, 2), torch.randn(1, 1, 2))  # initial LSTM state
    a_t = softmax(torch.randn(1, 2), 1)  # initial action
    R = 0
    for t in range(t_max):
        arms = get_arms(theta)  # observe the arms' payoffs
        o_t = arms.view(1, -1)
        pi_t, m_t = rnn(torch.cat((a_t, o_t), dim=1), m_t)  # feed previous action and observation
        pi_t = pi_t.view(1, 1, -1)
        a_t = gumbel_softmax(pi_t.view(1, -1), tau=taus[e], dim=1)  # differentiable action sample
        r = (a_t * arms).sum()
        R += r
    loss = -R  # maximize the return
    optimizer_rnn.zero_grad()
    loss.backward()
    optimizer_rnn.step()
If we compare the RNN agent with an agent trained on a single task only, we see higher returns at training time (left plot). The crucial difference, however, appears at test time (right plot).