Reinforcement Learning — Reward Controversy Issue

How To Make Autonomous Cars Trustworthy — IEEE Standards Association (07 Apr 2021)

Reinforcement learning (RL) has lately been a source of controversy over whether reward alone is enough to produce appropriate “intelligent” decisions. RL does not require historical or labelled data; it learns from the rewards it receives while interacting with its environment. The reward is computed by measuring the quality of the agent's performance. The aim is for the agent to act in its environment so as to maximize its rewards while taking into account the long-term results of its actions, such as a self-driving car decelerating ahead of a U-turn or an obstacle.
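One common way to make "long-term results" concrete is to sum future rewards with a discount factor. The article does not spell this out, so the following is a minimal sketch: the function name, the discount value, and the reward numbers are all illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, where the reward arriving k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted tail.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical car approaching a U-turn: braking costs a little now (-1.0)
# but avoids a large penalty (-100.0) two steps later.
brake_early = [-1.0, 0.0, 0.0]
keep_speed = [0.0, 0.0, -100.0]

better = max((brake_early, keep_speed), key=discounted_return)
```

Under these made-up numbers `better` is `brake_early`: the small immediate cost is outweighed by the discounted later penalty.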

The environment, in turn, is modelled as a stochastic finite state machine that takes the actions sent by the agent as inputs and sends observations and rewards back to the agent as outputs.
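As a rough illustration of an environment behaving like a stochastic finite state machine, the sketch below uses a hypothetical five-state corridor. The class name, the `slip` probability, and the reward of 1.0 at the goal are assumptions made for the example, not something taken from the article.

```python
import random


class SlipperyCorridor:
    """Toy environment: states 0..n_states-1 on a line, goal at the last state.

    Actions are -1 (left) or +1 (right). With probability `slip` the move is
    reversed, which is what makes the transitions stochastic.
    """

    def __init__(self, n_states=5, slip=0.1, seed=0):
        self.n_states = n_states
        self.slip = slip
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        """Put the machine back in its start state and return the observation."""
        self.state = 0
        return self.state

    def step(self, action):
        """Input: an action. Output: (observation, reward, done)."""
        if self.rng.random() < self.slip:
            action = -action  # the environment, not the agent, decides this
        self.state = min(max(self.state + action, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done
```

The agent only ever sees what `step` returns; the slip probability and the internal state update stay hidden inside the environment.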

Figure 1: Actions taken by the agent (inputs to the environment) as it explores, and the observations and rewards it receives in return (outputs).

Reinforcement learning is currently being applied in a growing number of areas as researchers work out how to get a computer to calculate the value that should be assigned to each task, even when the number of values is very large. In the basic approach each value is stored in a large table, and the computer updates these values as it learns. Today deep learning, a subset of ML, is used to cope with the very large data sets generated, for example, by the positions on a Go board or the pixels on a screen during a computer game.
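A rough sense of why a plain table stops scaling can be had by counting entries. The sketch below uses a simple upper bound in which every board cell is empty, black, or white and legality is ignored; the function name is illustrative.

```python
def table_size_upper_bound(cells, values_per_cell):
    """Upper bound on the number of distinct board configurations."""
    return values_per_cell ** cells


tic_tac_toe = table_size_upper_bound(cells=9, values_per_cell=3)
go_board = table_size_upper_bound(cells=19 * 19, values_per_cell=3)

# Number of decimal digits in the Go bound, to show the scale.
go_digits = len(str(go_board))
```

A tic-tac-toe table fits in memory easily (19,683 entries at most), while the Go bound is a number with 173 digits, far beyond anything that can be stored entry by entry.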

RL was used in AlphaGo, a computer program developed by DeepMind, a subsidiary of Alphabet. It is now used to improve self-driving cars, to help robots grasp objects they have never seen before, and to find the optimal configuration for the equipment in a data center.

Overall, the RL approach addresses two types of problems:

– Prediction: how much reward can be expected from each possible future state.

– Control: how to find, by exploring the state space through interaction with the environment, a combination of actions that maximizes reward, such as how to steer an autonomous vehicle.
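The difference between prediction and control can be sketched on a tiny hypothetical road with four states. For brevity the sketch assumes the transition model is known, which RL proper does not assume; all names and numbers (`N_STATES`, `GAMMA`, the "stay"/"go" actions) are illustrative.

```python
# Hypothetical road: states 0..3, state 3 is the destination (terminal).
N_STATES = 4
ACTIONS = ("stay", "go")
GAMMA = 0.9  # discount factor


def transition(state, action):
    """Deterministic model: returns (next_state, reward)."""
    if action == "go":
        nxt = state + 1
        return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)
    return state, 0.0


def action_value(state, action, values):
    """Reward for one step plus the discounted value of where it leads."""
    nxt, reward = transition(state, action)
    return reward + GAMMA * values[nxt]


def evaluate_policy(policy, sweeps=100):
    """Prediction: expected discounted reward from each state under a fixed policy."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # the terminal state keeps value 0
            values[s] = action_value(s, policy[s], values)
    return values


def find_policy(sweeps=100):
    """Control: value iteration, then the greedy action in each state."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            values[s] = max(action_value(s, a, values) for a in ACTIONS)
    policy = [
        max(ACTIONS, key=lambda a: action_value(s, a, values))
        for s in range(N_STATES - 1)
    ]
    return policy, values
```

`evaluate_policy` answers "how good is this behaviour?", while `find_policy` answers "which behaviour is best?". In this toy case the best policy is to choose "go" in every non-terminal state.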

When applied to a self-driving car, an internal map allows the car to place itself in the envisaged space (environment). To determine the best route through the map, a route-finding method is used along with an obstacle-avoidance system.

Reward Controversy Issue

A recent paper by the DeepMind team argues that rewards are enough to drive actions that exhibit most, if not all, attributes of intelligence. The argument put forward is that “rewards are all that is needed for agents in rich environments to develop multi-attribute intelligence of the sort needed to achieve artificial general intelligence”. This reminds me of an article in Wired magazine arguing that Big Data is enough.

However, by analogy with evolutionary processes, in which genes and the environment interact, reward is the equivalent of the selection process in evolution, and on its own it is not sufficient.

RL Basic Concepts

RL is a system in which success is not given in advance but learned by interacting with the environment through trial and error. The ‘Environment’ is a matrix of all possible alternative values or steps that can be taken, such as all possible moves of all pieces in a game of checkers. The Agent begins by randomly exploring alternative actions in the Environment, and successful moves are reinforced so that the Agent learns to exploit them. The ‘State’ is the current set of moves or values; it is modified after each try, and the feedback loop is used to optimize the reward. The Agent must thus learn from its experience of the Environment as it explores the full range of possible States.

Unlike supervised or unsupervised learning, RL does not start with a data set to learn from; it has to generate its own data from scratch through trial and error.

Q-Learning in RL

To evaluate the actions taken by the agent, RL uses a Q-learning function: an action-value function that gives the value of being in a certain state and taking a certain action in that state. Q stands for the “quality” of an action taken by the agent in a given state. Q-Learning is by far the best-known general-purpose algorithm in RL.

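Q-Learning is thus based on a combination of a state, denoted by $s$, and an action, denoted by $a$, at time $t$. The value of each state–action pair is updated with the standard Q-learning rule:

$$Q^{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $r_t$ is the reward “observed” for the current state $s_t$, $\alpha$ is the learning rate, with values between 0 and 1, and $\gamma$ is the discount factor, which controls the importance of subsequent rewards from the current state onwards. $Q^{new}(s_t, a_t)$ on the left is the new value, $Q(s_t, a_t)$ on the right is the current value, and $\max_{a} Q(s_{t+1}, a)$ is the future value estimate.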
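The update described above can be sketched as a single function over a table of Q-values. This is a minimal sketch of the standard tabular Q-learning rule: the function and argument names are illustrative, `alpha` stands for the learning rate and `gamma` for the discount factor.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    old_value = q_table[(state, action)]
    # Estimate of the best value reachable from the next state.
    future_estimate = max(q_table[(next_state, a)] for a in actions)
    new_value = old_value + alpha * (reward + gamma * future_estimate - old_value)
    q_table[(state, action)] = new_value
    return new_value


# Unseen state-action pairs default to a Q-value of 0.0.
q_table = defaultdict(float)
```

For example, with `q_table[("s1", "right")] = 2.0`, a reward of 1.0 for moving right from "s0" to "s1" gives a new value of 0 + 0.5 × (1.0 + 0.9 × 2.0 − 0) = 1.4 for the pair ("s0", "right").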

Another function works on top of the Q-learning function to let the agent choose which action to perform. This is known as the ‘agent policy function’ and is denoted by π. It takes the current environment state and returns an action.

Figure 2: Action taken by the agent exploring its environment involving state feedback.

As the agent explores the state space, Q-values for state–action pairs are built up over episodes. The policy function selects the next action for the agent based on these Q-values, either to explore or to exploit the state space. Under an exploit policy, the function identifies the action with the largest Q-value and returns it. Under an explore approach, the action is chosen probabilistically, with each action's probability given by its Q-value divided by the sum of the Q-values for the state.
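The two policy modes described above can be sketched as follows. The function names are illustrative, and the explore variant assumes all Q-values for the state are non-negative (it falls back to a uniform choice when they sum to zero), which is an assumption added here to keep the probabilities well defined.

```python
import random


def exploit(q_values):
    """Return the action with the largest Q-value.

    `q_values` maps each available action to its Q-value in the current state.
    """
    return max(q_values, key=q_values.get)


def explore(q_values, rng=random):
    """Pick an action with probability Q(s, a) / sum of Q-values for the state.

    Assumes non-negative Q-values; falls back to a uniform choice when the
    Q-values sum to zero (for example, before anything has been learned).
    """
    actions = list(q_values)
    total = sum(q_values.values())
    if total <= 0:
        return rng.choice(actions)
    weights = [q_values[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

With Q-values of 1.0 for "left" and 3.0 for "right", `exploit` always returns "right", while `explore` returns "right" about three times out of four.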

Figure 3: Example of an environment or state space being explored

RL systems do not require neural nets, but increasingly the most interesting problems, such as self-driving cars, involve state spaces so large and complex that the direct-observation policy gradient approach is not practical.

These Q-Learning situations are also frequently defined by their use of images, in particular pixel fields, as unlabeled inputs which are classified using a convolutional neural net (CNN) with some differences from standard image classification.

Reinforcement Learning — Algorithm

Under an RL algorithm, the machine is exposed to an environment (env_sa) in which it trains itself by trial and error. It learns by taking actions (a) under continuously changing conditions, or states (s). It is thus trained to learn from past experience and tries to capture the best possible knowledge to make accurate decisions.
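The trial-and-error loop over states (s) and actions (a) can be sketched end to end with tabular Q-learning on a hypothetical five-state corridor. Everything here is an illustrative assumption rather than the article's own setup: the corridor, the epsilon-greedy exploration rule, and the parameter values.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q-values by repeatedly acting in the environment."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit, breaking ties between equal Q-values at random.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            # No future value is added once the episode has ended.
            future = 0.0 if done else gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + future - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """Best learned action for each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

After training, `greedy_policy(train())` should move right in every state. The agent was never told where the goal is; it infers the route from the rewards alone.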

RL problems are commonly formalized as a Markov Decision Process (MDP). There is a package for applying MDP. Other algorithms and packages are also under development, such as the “ReinforcementLearning” package for R, which is intended to partially close the gap in available tooling and offers the ability to perform model-free reinforcement learning in a highly customizable framework.


1. William Vorhies (September 13, 2016). Reinforcement Learning and AI.

2. Gregory Piatetsky (December 2017). Exclusive: Interview with Rich Sutton, the Father of Reinforcement Learning. KDnuggets.

3. Kevin Murphy (1998). A brief introduction to reinforcement learning. UBC, Canada.

4. Will Knight (March/April 2017). Reinforcement Learning By experimenting… MIT Technology Review.

5. M. Tim Jones (2017). Train a software agent to behave rationally with reinforcement learning. Cognitive computing, IBM.

6. Nicolas Pröllochs & Stefan Feuerriegel (2017).

7. Herbert Roitblat (July 10, 2021). Training AI: Reward is not enough. VentureBeat.

