
Note:#

You can view my previous notes on CS188: Lecture 8 - Markov Decision Processes (MDPs).

Also note that my notes are based on the Spring 2025 version of the course and on my own understanding of the material, so they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I only take notes on the parts of a lecture that I find interesting or confusing, so I will NOT be covering every single detail.

Markov Decision Processes (MDPs)#

After the previous lecture, I realized I had some misunderstandings about the Policy Iteration algorithm, especially when compared to Value Iteration. So here, I’ll clarify my understanding of these two core approaches for solving MDPs.

Why use a “fixed policy” in Policy Iteration?#

It can be confusing at first that Policy Iteration evaluates a fixed policy. You might ask: does using a fixed, possibly non-optimal policy ever lead to the optimal one?

The answer is that evaluating a fixed policy is an essential intermediate step toward finding the optimal policy. The policy we “evaluate” may not be optimal, but the evaluation yields valuable information about the expected future rewards of following that policy, and in the end what we act on is the optimal policy.

In Policy Iteration, we loop between two key phases:

Step 1: Policy Evaluation#

We begin with an initial policy $\pi$ (random, greedy, whatever). For this $\pi$, we compute the exact utility $V^{\pi}(s)$ for each state $s$ under the assumption that we always follow $\pi$. The Bellman equation for this is:

$$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') [ R(s, \pi(s), s') + \gamma V^{\pi}(s') ]$$

This evaluates the policy’s long-term value at every state, given that policy.
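
To make this concrete, here is a minimal sketch of iterative policy evaluation. The representation is my own assumption (not from the lecture): `T[s][a]` is a list of `(next_state, prob, reward)` triples, and `policy` maps each state to a single action.

```python
def policy_evaluation(T, policy, gamma=0.9, tol=1e-8):
    """Iterate the fixed-policy Bellman update until V^pi stops changing."""
    V = {s: 0.0 for s in T}                      # start from all-zero values
    while True:
        delta = 0.0
        for s in T:
            a = policy[s]                        # action is dictated by pi: no max over actions
            new_v = sum(p * (r + gamma * V[s2])  # expected reward + discounted future value
                        for s2, p, r in T[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                          # values have (numerically) converged to V^pi
            return V
```

Because the action in each state is fixed, each sweep is just a weighted average over successor states; there is no maximization anywhere in this step.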

Step 2: Policy Improvement#

Now that we have $V^{\pi}$, we look at each state $s$ and ask: “Is there an action $a$ that would improve my expected future rewards if I took it immediately, then continued with $\pi$?”

For each state, we consider:

$$Q^{\pi}(s, a) = \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V^{\pi}(s') ]$$

We then build a new policy by setting:

$$\pi_{\text{new}}(s) = \arg\max_a Q^{\pi}(s, a)$$

That is, for each state, choose the action that looks best based on the values under the old policy. This is the policy improvement step.
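
As a sketch, under the same assumed `T[s][a] -> [(next_state, prob, reward), ...]` representation as above (the helper names here are hypothetical), the improvement step is just a one-step greedy lookahead on top of the values we computed in Step 1:

```python
def q_value(T, V, s, a, gamma=0.9):
    """Q^pi(s, a): take action a now, then follow the old policy (whose values are V)."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])

def policy_improvement(T, V, gamma=0.9):
    """Greedy policy w.r.t. V: pick argmax_a Q^pi(s, a) in every state."""
    return {s: max(T[s], key=lambda a: q_value(T, V, s, a, gamma)) for s in T}
```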

Repeat: We now re-evaluate the new policy $\pi_{\text{new}}$, and the process continues until the policy stops changing. This guarantees convergence to the optimal policy $\pi^*$ and optimal value function $V^*$. Evaluating a fixed policy at each stage is essential for knowing both how good our current strategy is and how to improve it.
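
Putting the two phases together, here is a self-contained sketch of the full loop (same toy representation as above, with both phases inlined so it runs on its own; the structure is my own, not the course's reference implementation):

```python
def policy_iteration(T, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(T[s])) for s in T}        # arbitrary initial policy
    while True:
        # Step 1: evaluate the current fixed policy.
        V = {s: 0.0 for s in T}
        while True:
            delta = 0.0
            for s in T:
                new_v = sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][policy[s]])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                break
        # Step 2: improve greedily with respect to V^pi.
        new_policy = {
            s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a]))
            for s in T
        }
        if new_policy == policy:                     # stable policy => optimal
            return policy, V
        policy = new_policy
```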


What is the difference between Policy Iteration and Value Iteration?#

In short:

  • Value Iteration is always searching for the best action at each step, directly refining the estimate of the optimal value function.
  • Policy Evaluation (as used in Policy Iteration) simply calculates the consequences of following a predefined plan $\pi$, without improvement during evaluation itself. Policy improvement occurs as a separate step.

Let’s break down the differences in detail.

Value Iteration Equation:#

$$V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V_{k}(s') ]$$

  • Goal: Directly compute the optimal value function $V^*(s)$.
  • How: Each iteration, for each state $s$, considers all possible actions $a$. For each action, it calculates the expected value (reward + discounted future value), then takes the maximum over all actions.
  • Policy: Implicit. The $\max$ operation is finding the best action, and the final optimal policy $\pi^*$ is extracted after $V_k$ converges.
  • What it computes: Iteratively refines the best possible long-term value from each state. (A minimal code sketch follows this list.)
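
For comparison, here is a minimal Value Iteration sketch under the same assumed `T[s][a] -> [(next_state, prob, reward), ...]` representation; note the `max` over all actions inside every sweep, which is exactly what Policy Evaluation lacks:

```python
def value_iteration(T, gamma=0.9, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R + gamma * V_k(s')]."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            # The max over ALL actions is what makes this target V*, not V^pi.
            new_v = max(sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a]) for a in T[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # The optimal policy is implicit: extract it once V has converged.
    return V, {s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a]))
               for s in T}
```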

Policy Evaluation Equation (for a fixed policy $\pi$):#

$$V^{\pi}_{k+1}(s) = \sum_{s'} T(s, \pi(s), s') [ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') ]$$

  • Goal: Compute the value function $V^\pi(s)$ for the given, fixed policy $\pi$ (which may not be optimal).
  • How: Each iteration, for each state $s$, uses only the action prescribed by $\pi$: $a = \pi(s)$. It calculates the expected value (reward + discounted future value) of following this fixed action. There is no $\max$ because the action is predetermined by $\pi$.
  • Policy: Explicit and fixed throughout evaluation.

Comparison Table#

| Feature | Value Iteration (VI) | Policy Evaluation (PE for fixed $\pi$) |
| --- | --- | --- |
| Equation Core | $\max_a \sum T(s,a,s')[R + \gamma V_k(s')]$ | $\sum T(s, \pi(s), s')[R + \gamma V^\pi_k(s')]$ |
| $\max_a$ Present? | Yes | No |
| Action Choice | Considers all $a$, picks the best | Only the action $\pi(s)$ given by the policy |
| Policy Role | Policy is implicit (via $\max$) | Policy is explicit and fixed |
| Goal | Compute optimal value function $V^*$ | Compute value function $V^\pi$ for the given policy $\pi$ |
| Used Where? | Standalone algorithm to find $V^*$ | Subroutine within Policy Iteration |
| Convergence | $V_k$ converges to $V^*$ | $V^\pi_k$ converges to $V^\pi$ |

Does Policy Evaluation converge after more iterations than Value Iteration?#

It’s tempting to think that Policy Evaluation takes more iterations to converge, since it does not optimize at every step. In practice, however, Policy Iteration often converges in fewer outer iterations (policy updates) than Value Iteration, though each outer iteration is more expensive, because it contains a full policy evaluation (an inner loop, or equivalently the solution of a system of linear equations).

The real power of Policy Iteration comes after Policy Evaluation. Once we have VπV^\pi for our current policy, we can often make a large jump to a better policy by improving all states at once:

$$\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V^\pi(s') ]$$

We repeat this process until the policy stops changing, which often happens after only a few policy updates and therefore takes fewer outer iterations than Value Iteration.
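
To see this on a concrete (entirely made-up) two-state MDP, the snippet below counts Value Iteration sweeps versus Policy Iteration policy updates. The exact numbers depend on the discount and the tolerance, but Policy Iteration typically stabilizes after only a handful of updates, while Value Iteration needs many sweeps to push the value error below the tolerance.

```python
GAMMA, TOL = 0.9, 1e-8

# Hypothetical toy MDP: T[s][a] -> list of (next_state, prob, reward).
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "move": [("B", 0.9, 0.0), ("A", 0.1, 1.0)]},
    "B": {"stay": [("B", 1.0, 2.0)], "move": [("A", 0.9, 0.0), ("B", 0.1, 2.0)]},
}

def backup(V, s, a):
    """One Bellman backup: expected reward plus discounted value of the successor."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])

# Value Iteration: count full sweeps until the values stop moving.
V, vi_sweeps = {s: 0.0 for s in T}, 0
while True:
    vi_sweeps += 1
    new_V = {s: max(backup(V, s, a) for a in T[s]) for s in T}
    if max(abs(new_V[s] - V[s]) for s in T) < TOL:
        break
    V = new_V

# Policy Iteration: count policy updates (each wraps one full policy evaluation).
policy, pi_updates = {s: "move" for s in T}, 0
while True:
    pi_updates += 1
    Vp = {s: 0.0 for s in T}
    while True:                                  # evaluate the current fixed policy
        new_Vp = {s: backup(Vp, s, policy[s]) for s in T}
        if max(abs(new_Vp[s] - Vp[s]) for s in T) < TOL:
            break
        Vp = new_Vp
    new_policy = {s: max(T[s], key=lambda a: backup(Vp, s, a)) for s in T}
    if new_policy == policy:                     # greedy policy did not change: done
        break
    policy = new_policy

print(f"Value Iteration sweeps: {vi_sweeps}, Policy Iteration policy updates: {pi_updates}")
```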

CS188 Notes 3 - Markov Decision Processes (MDPs) II
https://start-co.de/blog/cs188-notes-3
Author TheUnknownThing
Published at April 29, 2025