
Note:#

You can view my previous notes on CS188: Lecture 8 - Markov Decision Processes (MDPs).

Also note that my notes are based on the Spring 2025 version of the course and on my own understanding of the material, so they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I only take notes on the parts of a lecture that I find interesting or confusing, so I will NOT be covering every single detail.

Markov Decision Processes (MDPs)#

After the previous lecture, I realized I had some misunderstandings about the Policy Iteration algorithm, especially when compared to Value Iteration. So here, I’ll clarify my understanding of these two core approaches for solving MDPs.

Why use a “fixed policy” in Policy Iteration?#

It can be confusing at first that Policy Iteration evaluates a fixed policy. You might ask: does using a fixed, possibly non-optimal policy ever lead to the optimal one?

The answer is that evaluating a fixed policy is an essential intermediate step toward finding the optimal policy. The policy we “evaluate” may not be optimal, but the evaluation yields valuable information about the expected future rewards of following that policy, and in the end what we act on is the optimal policy.

In Policy Iteration, we loop between two key phases:

Step 1: Policy Evaluation#

We begin with an initial policy $\pi$ (random, greedy, whatever). For this $\pi$, we compute the exact utility $V^{\pi}(s)$ for each state $s$ under the assumption that we always follow $\pi$. The Bellman equation for this is:

$$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') [ R(s, \pi(s), s') + \gamma V^{\pi}(s') ]$$

This evaluates the policy’s long-term value at every state, given that policy.
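
To make this concrete, here is a minimal sketch of iterative policy evaluation. The representation is my own assumption (not from the lecture): `T[s][a]` is a list of `(next_state, prob, reward)` triples, and `policy` maps each state to a single action.

```python
def policy_evaluation(T, policy, gamma=0.9, tol=1e-8):
    """Iterate the fixed-policy Bellman update until V^pi stops changing."""
    V = {s: 0.0 for s in T}                      # start from all-zero values
    while True:
        delta = 0.0
        for s in T:
            a = policy[s]                        # action is dictated by pi: no max over actions
            new_v = sum(p * (r + gamma * V[s2])  # expected reward + discounted future value
                        for s2, p, r in T[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                          # values have (numerically) converged to V^pi
            return V
```

Because the action in each state is fixed, each sweep is just a weighted average over successor states; there is no maximization anywhere in this step.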

Step 2: Policy Improvement#

Now that we have $V^{\pi}$, we look at each state $s$ and ask: “Is there an action $a$ that would improve my expected future rewards if I took it immediately, then continued with $\pi$?”

For each state, we consider:

$$Q^{\pi}(s, a) = \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V^{\pi}(s') ]$$

We then build a new policy by setting:

$$\pi_{\text{new}}(s) = \arg\max_a Q^{\pi}(s, a)$$

That is, for each state, choose the action that looks best based on the values under the old policy. This is the policy improvement step.
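
As a sketch, under the same assumed `T[s][a] -> [(next_state, prob, reward), ...]` representation as above (the helper names here are hypothetical), the improvement step is just a one-step greedy lookahead on top of the values we computed in Step 1:

```python
def q_value(T, V, s, a, gamma=0.9):
    """Q^pi(s, a): take action a now, then follow the old policy (whose values are V)."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])

def policy_improvement(T, V, gamma=0.9):
    """Greedy policy w.r.t. V: pick argmax_a Q^pi(s, a) in every state."""
    return {s: max(T[s], key=lambda a: q_value(T, V, s, a, gamma)) for s in T}
```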

Repeat: We now re-evaluate the new policy $\pi_{\text{new}}$, and the process continues until the policy stops changing. This guarantees convergence to the optimal policy $\pi^*$ and optimal value function $V^*$. Evaluating a fixed policy at each stage is essential for knowing both how good our current strategy is and how to improve it.
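
Putting the two phases together, here is a self-contained sketch of the full loop (same toy representation as above, with both phases inlined so it runs on its own; the structure is my own, not the course's reference implementation):

```python
def policy_iteration(T, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(T[s])) for s in T}        # arbitrary initial policy
    while True:
        # Step 1: evaluate the current fixed policy.
        V = {s: 0.0 for s in T}
        while True:
            delta = 0.0
            for s in T:
                new_v = sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][policy[s]])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                break
        # Step 2: improve greedily with respect to V^pi.
        new_policy = {
            s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a]))
            for s in T
        }
        if new_policy == policy:                     # stable policy => optimal
            return policy, V
        policy = new_policy
```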


What is the difference between Policy Iteration and Value Iteration?#

In short:

  • Value Iteration is always searching for the best action at each step, directly refining the estimate of the optimal value function.
  • Policy Evaluation (as used in Policy Iteration) simply calculates the consequences of following a predefined plan $\pi$, without improvement during evaluation itself. Policy improvement occurs as a separate step.

Let’s break down the differences in detail.

Value Iteration Equation:#

$$V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V_{k}(s') ]$$

  • Goal: Directly compute the optimal value function $V^*(s)$.
  • How: Each iteration, for each state $s$, considers all possible actions $a$. For each action, it calculates the expected value (reward + discounted future value), then takes the maximum over all actions.
  • Policy: Implicit. The $\max$ operation is finding the best action, and the final optimal policy $\pi^*$ is extracted after $V_k$ converges.
  • What it computes: Iteratively refines the best possible long-term value from each state. (A minimal code sketch follows this list.)
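
For comparison, here is a minimal Value Iteration sketch under the same assumed `T[s][a] -> [(next_state, prob, reward), ...]` representation; note the `max` over all actions inside every sweep, which is exactly what Policy Evaluation lacks:

```python
def value_iteration(T, gamma=0.9, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R + gamma * V_k(s')]."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            # The max over ALL actions is what makes this target V*, not V^pi.
            new_v = max(sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a]) for a in T[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # The optimal policy is implicit: extract it once V has converged.
    return V, {s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a]))
               for s in T}
```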

Policy Evaluation Equation (for a fixed policy $\pi$):#

$$V^{\pi}_{k+1}(s) = \sum_{s'} T(s, \pi(s), s') [ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') ]$$

  • Goal: Compute the value function $V^\pi(s)$ for the given, fixed policy $\pi$ (which may not be optimal).
  • How: Each iteration, for each state $s$, uses only the action prescribed by $\pi$: $a = \pi(s)$. It calculates the expected value (reward + discounted future value) of following this fixed action. There is no $\max$ because the action is predetermined by $\pi$.
  • Policy: Explicit and fixed throughout evaluation.

Comparison Table#

| Feature | Value Iteration (VI) | Policy Evaluation (PE for fixed $\pi$) |
| --- | --- | --- |
| Equation Core | $\max_a \sum T(s,a,s')[R + \gamma V_k(s')]$ | $\sum T(s, \pi(s), s')[R + \gamma V^\pi_k(s')]$ |
| $\max_a$ Present? | Yes | No |
| Action Choice | Considers all $a$, picks the best | Only the action $\pi(s)$ given by the policy |
| Policy Role | Policy is implicit (via $\max$) | Policy is explicit and fixed |
| Goal | Compute optimal value function $V^*$ | Compute value function $V^\pi$ for the given policy $\pi$ |
| Used Where? | Standalone algorithm to find $V^*$ | Subroutine within Policy Iteration |
| Convergence | $V_k$ converges to $V^*$ | $V^\pi_k$ converges to $V^\pi$ |

Does Policy Evaluation converge after more iterations than Value Iteration?#

It’s tempting to think that Policy Evaluation takes more iterations to converge, since it does not optimize at every step. In practice, however, Policy Iteration often converges in fewer outer iterations (policy updates) than Value Iteration, though each outer iteration is more expensive, because it contains a full policy evaluation (an inner loop, or equivalently the solution of a system of linear equations).

The real power of Policy Iteration comes after Policy Evaluation. Once we have VπV^\pi for our current policy, we can often make a large jump to a better policy by improving all states at once:

$$\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V^\pi(s') ]$$

We repeat this process until the policy stops changing, which often happens after only a few policy updates and therefore takes fewer outer iterations than Value Iteration.
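
To see this on a concrete (entirely made-up) two-state MDP, the snippet below counts Value Iteration sweeps versus Policy Iteration policy updates. The exact numbers depend on the discount and the tolerance, but Policy Iteration typically stabilizes after only a handful of updates, while Value Iteration needs many sweeps to push the value error below the tolerance.

```python
GAMMA, TOL = 0.9, 1e-8

# Hypothetical toy MDP: T[s][a] -> list of (next_state, prob, reward).
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "move": [("B", 0.9, 0.0), ("A", 0.1, 1.0)]},
    "B": {"stay": [("B", 1.0, 2.0)], "move": [("A", 0.9, 0.0), ("B", 0.1, 2.0)]},
}

def backup(V, s, a):
    """One Bellman backup: expected reward plus discounted value of the successor."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])

# Value Iteration: count full sweeps until the values stop moving.
V, vi_sweeps = {s: 0.0 for s in T}, 0
while True:
    vi_sweeps += 1
    new_V = {s: max(backup(V, s, a) for a in T[s]) for s in T}
    if max(abs(new_V[s] - V[s]) for s in T) < TOL:
        break
    V = new_V

# Policy Iteration: count policy updates (each wraps one full policy evaluation).
policy, pi_updates = {s: "move" for s in T}, 0
while True:
    pi_updates += 1
    Vp = {s: 0.0 for s in T}
    while True:                                  # evaluate the current fixed policy
        new_Vp = {s: backup(Vp, s, policy[s]) for s in T}
        if max(abs(new_Vp[s] - Vp[s]) for s in T) < TOL:
            break
        Vp = new_Vp
    new_policy = {s: max(T[s], key=lambda a: backup(Vp, s, a)) for s in T}
    if new_policy == policy:                     # greedy policy did not change: done
        break
    policy = new_policy

print(f"Value Iteration sweeps: {vi_sweeps}, Policy Iteration policy updates: {pi_updates}")
```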

CS188 Notes 3 - Markov Decision Processes (MDPs) II
https://start-co.de/blog/cs188-notes-3
Author TheUnknownThing
Published at April 29, 2025