The Math of AI Alignment 101: Markov Decision Processes



A Series on AI Alignment Basics


This is the third article in a series introducing the basic mathematical concepts that go into properly aligning AGI (artificial general intelligence).

Each article is roughly self-contained, so they can be read independently. But I won't be re-motivating or reintroducing concepts, so they will probably read more easily in order.

The previous one is here:

Motivating a New Model

In the past few articles, we took a very basic approach to modeling the goals of an AGI.

We used a single utility function over an attribute space. This model said that the AGI would attempt to maximize the function.

As a model, this is a great starting place. It lets us do multivariable, single-function optimization with constraints.

This is a well-known branch of math. If we can’t solve the problem in this simple model, we have no hope for more complicated models.

That being said, there is a more complicated way to model the situation called a Markov Decision Process.

The advantage of this model is that it is the one actually used in real-world AI trained using Reinforcement Learning.

So the alignment problem isn’t just a toy model. If we solve the alignment problem in this model, then we will have practically solved it for a whole class of machine learning algorithms.

(This is Alignment 101, so I’m fudging some of the fine details here).

Markov Decision Process Example

If I give you the formal definition first, you’re all going to leave confused. So I’m going to start with an example, which might also be confusing.

Hang in there. If you take the time to fully understand this basic example, you will have a very good understanding of every relevant concept.

Let's break down the example (originally shown as a diagram of two states connected by labeled arrows).

There are two states: S₁ and S₂.

There are two actions: a₁ and a₂.

Starting at the top and going down:

  • From S₁ we could take action 1, a₁. This sends us to state S₂ with probability 1.
  • From S₂ we could take action 1, a₁. This sends us to state S₁ with probability 1.
  • From S₂ we could take action 2, a₂. This sends us to state S₁ with probability 0.5 and stays at S₂ with probability 0.5.
  • From S₁ we could take action 2, a₂. This keeps us at state S₁ with probability 0.5 and sends us to S₂ with probability 0.5.

That's a lot of words and arrows, but hopefully this all makes sense now.

In addition to the above information, we also have a reward function for each action and pair of states.

To motivate this, say we’re training some AI and we want it to always change states and never stay in the same state.

To get this into the Markov Decision Process, we would use a reward function that gives a positive reward for switching states, say 1, and zero reward for staying in the same state:

Rₐ(S₁, S₁)=0
Rₐ(S₁, S₂)=1
Rₐ(S₂, S₁)=1
Rₐ(S₂, S₂)=0
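To make this concrete, here's one way to encode the example in Python (a sketch; the dictionary layout is just one convenient choice, not a standard API):

```python
# One way to encode the two-state example: P maps a (state, action) pair
# to a probability distribution over next states, and R maps a
# (state, next_state) pair to the reward for that transition.
P = {
    ("S1", "a1"): {"S2": 1.0},             # a1 from S1 always switches
    ("S2", "a1"): {"S1": 1.0},             # a1 from S2 always switches
    ("S1", "a2"): {"S1": 0.5, "S2": 0.5},  # a2 is a fair coin flip
    ("S2", "a2"): {"S1": 0.5, "S2": 0.5},
}

# Reward 1 for switching states, 0 for staying put.
R = {
    ("S1", "S1"): 0, ("S1", "S2"): 1,
    ("S2", "S1"): 1, ("S2", "S2"): 0,
}

# Sanity check: every transition distribution sums to 1.
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```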

Markov Decision Process Definition

A Markov decision process is a 4-tuple (S, A, Pₐ, Rₐ) where:

  • S is a set of states called the state space,
  • A is a set of actions called the action space,
  • Pₐ(s, s’)=Pr(sₜ₊₁ = s’ | sₜ=s, aₜ=a) is the probability that action a in state s at time t will lead to state s’ at time t+1,
  • Rₐ(s, s’) is the reward received after transitioning from state s to state s’, due to action a.

This is just formalizing and extending the above example. We can have any number of states, including infinite. We can have any number of actions, including infinite.

The actions don’t have to be “symmetrical” or “the same” like in the above example. For example, a₁ from S₁ could have five different possible states it could transition to while a₁ from S₂ could only have S₁ as a possibility.

The number of actions available to each state does not have to be the same (we had two from each in the example).

In fact, in most real-world modeling, there will be terminal states: states with only one action, which stays at that state with probability 1.

In our example, the process just runs forever with no termination.

Policy Functions

Okay, so you’re probably wondering what any of this gets us.

The real power of this formalism comes from something called policy functions.

A policy function is a function π: S → A.

The way to think of this is as a “decision” or “rule” to take a certain action each time you’re at a particular state.

For example, π(S₁)=a₂ means take action 2 each time you land on state 1.

In these small, finite examples, we can actually write down every single possible policy. In fact, our above example only has four possible policy functions:

  • π₁(Sᵢ) = a₁
  • π₂(Sᵢ) = a₂
  • π₃(S₁)=a₂, π₃(S₂)=a₁
  • π₄(S₁)=a₁, π₄(S₂)=a₂

In words, the first one chooses action 1 always.
The second policy chooses action 2 always.
The third chooses opposite actions.
The fourth chooses the action with the same subindex.
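Enumerating every deterministic policy is easy to sketch in Python (assuming the same label-based encoding as before; the variable names are my own):

```python
from itertools import product

states = ["S1", "S2"]
actions = ["a1", "a2"]

# A deterministic policy assigns one action to each state,
# so there are |A|^|S| = 2^2 = 4 of them.
policies = [
    dict(zip(states, choice))
    for choice in product(actions, repeat=len(states))
]

for pi in policies:
    print(pi)
# e.g. {'S1': 'a1', 'S2': 'a1'}, {'S1': 'a1', 'S2': 'a2'}, ...
```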

Reward Optimization

So why is this important?

Well, the way an AI is trained is for it to figure out which policy optimizes its reward.

We’re dealing with probabilities, so this is done by expected value.

What’s the expected value of policy 1 starting at S₁?

Unfortunately, we already discussed that the process will run forever, and so the expected value is infinite (it has an expected reward of 1 at each time step).

So, the real definition of the value of a policy is a little more complicated to define. It’s recursive and requires the choice of a discount factor.

You can look it up if you want. We can use a simplified version that will get us the correct result for the purposes of finding the optimal policy.
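For the curious, here's a minimal sketch of that standard discounted definition, solved by fixed-point iteration on the Bellman equation (the discount factor γ=0.9 and the helper name `policy_value` are my own choices, not from the article):

```python
# Sketch of the discounted value of a policy pi: V satisfies the Bellman
# equation V(s) = sum_{s'} P(s' | s, pi(s)) * (R(s, s') + gamma * V(s')),
# which we solve by repeatedly applying the right-hand side.
P = {
    ("S1", "a1"): {"S2": 1.0},
    ("S2", "a1"): {"S1": 1.0},
    ("S1", "a2"): {"S1": 0.5, "S2": 0.5},
    ("S2", "a2"): {"S1": 0.5, "S2": 0.5},
}
R = {("S1", "S1"): 0, ("S1", "S2"): 1, ("S2", "S1"): 1, ("S2", "S2"): 0}

def policy_value(pi, gamma=0.9, iters=1000):
    V = {s: 0.0 for s in pi}
    for _ in range(iters):
        V = {
            s: sum(p * (R[(s, s2)] + gamma * V[s2])
                   for s2, p in P[(s, pi[s])].items())
            for s in pi
        }
    return V

# Policy 1 always switches, earning reward 1 at every step, so its
# discounted value is 1 / (1 - gamma) = 10 from either state.
print(policy_value({"S1": "a1", "S2": "a1"}))
```

Note that the discount factor is exactly what tames the "infinite expected value" problem: the geometric series 1 + γ + γ² + ... converges.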

For us, we'll define 𝔼 of a policy to be the sum, over each state, of the expected one-step reward from that state.

Let's work them all out for our example.

Remember, our rewards were 1 for switching states and 0 for staying in the same state:

Rₐ(S₁, S₁)=0
Rₐ(S₁, S₂)=1
Rₐ(S₂, S₁)=1
Rₐ(S₂, S₂)=0

Policy 1: π₁(Sᵢ) = a₁. This says take action 1 always.

Action 1 gives a probability 1 that we switch states. Our reward for switching states is 1.

Therefore, the sum of the expected value for each state is 1+1=2.

Thus, 𝔼(π₁)=2.

Policy 2: π₂(Sᵢ) = a₂. This says take action 2 always.

The expected value of action 2 from state 1 is 0.5: half the time we get a reward of 1 and half the time we get a reward of 0. Likewise for state 2.

Thus, 𝔼(π₂)=(0.5)(1)+(0.5)(0)+(0.5)(1)+(0.5)(0)=1.

And then similarly, 𝔼(π₃)=(0.5)(0)+(0.5)(1)+1(1)=1.5.

And 𝔼(π₄)=1.5.

Which policy maximizes the reward function?
It’s policy 1.

This should have been our intuitive guess because our reward function rewards moving and doesn’t reward staying in place.

But policy 1 is the only one that guarantees we always move. The rest have some probability of staying in place.
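You can verify all four expected values with a short script (a sketch reusing the earlier dictionary encoding; `one_step_value` is my own helper name):

```python
# The simplified value: sum over each state of the expected one-step reward.
P = {
    ("S1", "a1"): {"S2": 1.0},
    ("S2", "a1"): {"S1": 1.0},
    ("S1", "a2"): {"S1": 0.5, "S2": 0.5},
    ("S2", "a2"): {"S1": 0.5, "S2": 0.5},
}
R = {("S1", "S1"): 0, ("S1", "S2"): 1, ("S2", "S1"): 1, ("S2", "S2"): 0}

def one_step_value(pi):
    return sum(
        p * R[(s, s2)]
        for s in pi
        for s2, p in P[(s, pi[s])].items()
    )

policies = {
    "pi1": {"S1": "a1", "S2": "a1"},
    "pi2": {"S1": "a2", "S2": "a2"},
    "pi3": {"S1": "a2", "S2": "a1"},
    "pi4": {"S1": "a1", "S2": "a2"},
}

for name, pi in policies.items():
    print(name, one_step_value(pi))
# pi1 2.0, pi2 1.0, pi3 1.5, pi4 1.5 -- policy 1 wins
```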

Some Missing Details

If you’re at all confused about any aspect of this, I highly recommend generating your own two state example.

Make the numbers weirder, use different numbers of actions, and maybe throw in a third, terminal state for fun. Then completely work out the optimal policy for your example (remember that the more actions you have, the more policies you'll have to consider).

I guarantee you can get a good handle on all of the concepts presented with a small example like that if you go through it yourself (following along can sometimes make it more confusing).

I’ll just remind you that I was a bit vague about the expected value step, and this was on purpose.

One detail is that you might start your process at state 1 every time. If a policy makes transitioning to state 2 impossible (probability 0), then you shouldn't include the expected reward from state 2 in your calculation.

I also didn’t formalize the concept because the state space could be infinite, which means you can’t just naively sum the expected rewards and get something that converges.

But if you’re at the point where these details matter, then you have progressed beyond this article.

Reward Hacking

What does any of this have to do with AI alignment?

Today’s paper is: Skalse, J., Howe, N. H. R., Krasheninnikov, D., & Krueger, D. (2022). Defining and Characterizing Reward Hacking.

In the previous article I wrote, we discussed that we will often want to use a proxy utility function for an AI. This is for all sorts of reasons: easier to define metrics, the impossibility of encoding everything you care about, etc.

In the above paper, they define a similar concept of a proxy reward function in the context of a Markov Decision Process.

Then they define a pair of reward functions (a true reward and a proxy reward) to be hackable if there is a pair of policies where switching from one to the other increases the expected value of the proxy reward while simultaneously decreasing the expected value of the true reward.

In other words, a proxy reward function is hackable if it is possible that the AI becomes misaligned by optimizing the proxy reward.
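To make the definition concrete, here's a brute-force check on our running example (an illustration of the idea, not the paper's construction): take the true reward from before, and a hypothetical proxy of my own invention that instead pays 1 for remaining in S₂.

```python
from itertools import product

# Running example again: the true reward pays 1 for switching states.
P = {
    ("S1", "a1"): {"S2": 1.0},
    ("S2", "a1"): {"S1": 1.0},
    ("S1", "a2"): {"S1": 0.5, "S2": 0.5},
    ("S2", "a2"): {"S1": 0.5, "S2": 0.5},
}
R_true = {("S1", "S1"): 0, ("S1", "S2"): 1, ("S2", "S1"): 1, ("S2", "S2"): 0}

# Hypothetical proxy (my invention, not from the paper): pays 1 only for
# remaining in S2 -- think of a badly chosen "stay near the goal" metric.
R_proxy = {("S1", "S1"): 0, ("S1", "S2"): 0, ("S2", "S1"): 0, ("S2", "S2"): 1}

def one_step_value(pi, R):
    return sum(p * R[(s, s2)] for s in pi for s2, p in P[(s, pi[s])].items())

policies = [dict(zip(["S1", "S2"], c)) for c in product(["a1", "a2"], repeat=2)]

# Hackable: some policy change raises the proxy value while lowering
# the true value.
hackable = any(
    one_step_value(q, R_proxy) > one_step_value(p, R_proxy)
    and one_step_value(q, R_true) < one_step_value(p, R_true)
    for p in policies
    for q in policies
)
print(hackable)  # True for this proxy
```

For instance, moving from "always take a₁" to "always take a₂" raises the proxy value from 0 to 0.5 while dropping the true value from 2 to 1, so this pair is hackable.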

If you recall the previous article, this always happens in the case of a proxy utility function, but maybe that was just a bad way to model the situation.


It turns out the results are just as scary here.

Under generic conditions, when the policy space is infinite, any pair of reward functions that is unhackable and non-trivial must be equivalent.

In other words, an AI can always hack the proxy reward you give it and become misaligned with respect to the true reward function.

It's kind of amazing that we get basically the exact same result as the previous paper with such a wildly different way of modeling alignment.

Under these conditions, misalignment of the AI is always possible no matter what your reward function was and what your proxy was.

Not all hope is lost, though.

The paper goes on to show that unhackable pairs always exist in the finite policy space case, and they even give a procedure for determining whether an unhackable proxy exists.

It’s a pretty interesting paper, and if you’ve followed this article so far, you should be able to get something from reading through it.


