CS188 Notes 4 - Reinforcement Learning
Notes from UC Berkeley's CS188 course on Artificial Intelligence.
Note:#
You can view my previous notes on CS188: Lecture 9 - Markov Decision Processes (MDPs).
Also note that my notes are based on the Spring 2025 version of the course, and my understanding of the material. So they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I would only take notes on parts of the lecture that I find interesting or confusing. I will NOT be taking notes on every single detail of the lecture.
Reinforcement Learning#
In this note I will go through the key concepts in the Reinforcement Learning (RL) lecture. I will also try to clarify my understanding of the Q-learning algorithm, which is a key concept in RL.
First, let's categorize the topics. I'll use the same categories as in the lecture slides, adding some of my own notes.
- Passive Learning
- Model-based
- Model-free
- Active Learning
- Approximate Q-learning
- Policy Gradient
ONE SENTENCE SUMMARY: Passive learning involves evaluating a fixed policy (the actions come from that fixed policy, e.g., a human or preset controller), while active learning seeks to improve the policy through exploration (the agent itself chooses how to act); model-based methods learn and use a model of the environment, model-free methods learn directly from experience, approximate Q-learning generalizes learning to large state spaces, and policy gradient methods optimize policies directly using gradient ascent.
I believe this is a good summary of the key concepts in RL. I will go through each of these categories in detail below. Also, I will use the structure of “HOW? -> WHY? -> PROBLEM” to explain each concept.
Passive RL#
Model-based#
How?
The agent learns a model of the environment (e.g., transition probabilities, rewards) and uses this model to evaluate the policy. This is done by estimating the expected value of each action in each state based on the model.
Then solve for values as if the learned model were correct (i.e., trust the model).
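As a concrete sketch, the model can be estimated by simple counting. The episode format and function name below are my own assumptions, not from the lecture:

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s, a, s') and R(s, a, s') from observed transitions.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, action, next_state, reward) tuples gathered while following
    the fixed policy.
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = visit count
    reward_sums = defaultdict(float)                 # reward_sums[(s, a, s')] = total reward seen

    for episode in episodes:
        for s, a, s_next, r in episode:
            counts[(s, a)][s_next] += 1
            reward_sums[(s, a, s_next)] += r

    T, R = {}, {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        for s_next, c in next_counts.items():
            T[(s, a, s_next)] = c / total                        # empirical transition probability
            R[(s, a, s_next)] = reward_sums[(s, a, s_next)] / c  # average observed reward
    return T, R
```

Once T and R are estimated, we can evaluate the policy with the same machinery from the MDP lecture, treating the learned model as if it were the true one.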
Why?
Answering “why” in this section is basically answering “why do we need a model?” The answer is that we do not have a model of the environment, so we need to learn it. This is done by estimating the transition probabilities and rewards based on the observed data. This is a key concept in RL, as it allows the agent to learn from its experiences and improve its policy over time.
Problem
The problem with this approach is that it requires a lot of data to learn the model accurately. If the model is not accurate, the agent may make suboptimal decisions based on the learned model.
Model-free#
How?
In this case there is no model to tell us "what to do"; we need to learn the value function directly from the data.
The simplest idea is to average together observed sample values: every time you visit a state, write down what the sum of discounted rewards turned out to be, and average those returns. What's bad about this is that it does not take the connections between states into account. For example, consider a chain A -> B -> C (end). How do we calculate the values of A and B? We would evaluate every starting state separately: when evaluating A, we would NOT take the previous evaluation of B into account; we only care about the final outcome and average it. This is not a good idea, because we are wasting a lot of data. We could use the data from state B to help us evaluate state A, so we need to take the connections between states into account.
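To make the problem concrete, here is a minimal sketch of this direct-evaluation idea; the episode format (a list of (state, reward) pairs) is my own assumption:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Estimate V(s) by averaging the discounted returns observed from each visit to s.

    Each episode is assumed to be a list of (state, reward) pairs, where `reward`
    is the reward received on leaving that state under the fixed policy.
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G is always the return from this state onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    # Each state's value is just the average of the returns observed from it,
    # ignoring how states are connected to each other.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```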
So an evolution of this is to use the Bellman equation. The idea is to use the value of the next state to help us evaluate the current state. This is done by using the Bellman equation similar to the one we used in the MDP lecture. However, we need modifications for this.
The ORIGINAL Bellman equation (for a fixed policy $\pi$) is:

$$V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]$$

And its ADAPTED, sample-based version (since we do not know $T$ and $R$) is:

$$V^\pi(s) \approx \frac{1}{n} \sum_{i=1}^{n} \text{sample}_i, \qquad \text{sample}_i = R(s, \pi(s), s'_i) + \gamma V^\pi(s'_i)$$
What's improved over the naive version is that we are utilizing the existing value estimates to evaluate. However, we notice there is still a problem: we are waiting until the end of an episode to update values, since we are averaging over all samples. We could update values more frequently.
So this is where the Temporal Difference (TD) Learning comes in. The idea is to update the value of the current state based on the value of the next state, without waiting for the end of the episode. Because updates happen after every transition, states and transitions that are experienced more frequently will have a greater influence on the learned values over time.
The specific type of TD learning shown here is for policy evaluation. This means we have a fixed policy $\pi$ (a fixed way of choosing actions in each state), and we want to figure out the value function $V^\pi(s)$ for that policy. We are not trying to find the best policy yet, just evaluating the current one.
In TD, we have samples and an update rule.
The sample (or TD target) is "the reward I just got, plus the discounted value of where I landed (according to my current beliefs)":

$$\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$$

The update rule is:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( \text{sample} - V^\pi(s) \right)$$

We calculate the TD error, $\text{sample} - V^\pi(s)$. This error represents the difference between our target (the sample) and our current estimate $V^\pi(s)$. We then adjust our current estimate $V^\pi(s)$ by moving it a small step ($\alpha$) in the direction of that error.
This shows that TD learning is essentially maintaining a running average of the TD targets it observes for each state.
It gradually “forgets” older, potentially less accurate, information because initial value estimates might be far off.
Using a decreasing learning rate $\alpha$ over time can help the value estimates converge more stably.
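Putting the TD update into code, here is a minimal sketch; the environment interface (reset()/step()) and the policy callback are my own assumptions, not from the lecture:

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0) evaluation of a fixed policy.

    Assumes env.reset() returns a start state, env.step(a) returns
    (next_state, reward, done), and policy(s) returns the action to take in s.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target: reward just received plus discounted value of the landing state.
            sample = r + (0.0 if done else gamma * V[s_next])
            # Nudge V(s) toward the target by a fraction alpha of the TD error.
            V[s] += alpha * (sample - V[s])
            s = s_next
    return V
```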
However, there are still problems. As mentioned in the previous lecture, what really GUIDES the agent is the Q-values. So we need to learn the Q-values instead of the V-values.
Q-Learning (Active RL)#
How?
Q-Learning is a model-free reinforcement learning algorithm used to learn the optimal action-value function (Q-values). Unlike TD learning which focuses on state values, Q-learning focuses on (state, action) pairs.
The Q-learning update rule is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Similar to the TD update above: $Q(s, a)$ is the current estimate of the Q-value for state $s$ and action $a$, $\alpha$ is the learning rate, and the bracketed term is the TD error. You might wonder "why do we need to use the max operator here?" The answer is that we are trying to learn the optimal Q-value for each state-action pair. The max operator selects the best action in the next state according to the current Q-values.
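Here is a minimal tabular sketch of this update; the environment interface (reset()/step()/actions()) and the $\epsilon$-greedy behavior policy are assumptions of mine (exploration is discussed under "Problem" below):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)   # Q[(s, a)] defaults to 0.0

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: explore with probability epsilon, otherwise act greedily.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)

            # Off-policy target: use the BEST next action, regardless of what we actually do next.
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            td_error = r + gamma * best_next - Q[(s, a)]
            Q[(s, a)] += alpha * td_error
            s = s_next
    return Q
```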
Why?
Q-learning allows us to select the best action in each state, unlike TD learning which only evaluates a fixed policy. It’s called “off-policy” because it learns the optimal policy regardless of how the agent is currently behaving (exploration). The agent can follow any exploratory policy during training while still learning the greedy optimal policy.
With Q-values, we can derive our policy directly:

$$\pi(s) = \arg\max_a Q(s, a)$$

This means choosing the action that maximizes the expected future rewards in each state.
Problem
The main challenge in Q-learning is balancing exploration and exploitation, i.e., balancing "trying new actions to discover potentially better rewards" against "using known Q-values to maximize rewards based on past experience".
This is typically addressed using an $\epsilon$-greedy policy or an exploration function:
The $\epsilon$-greedy policy works as follows:
- With probability $1 - \epsilon$, choose the best action (exploit)
- With probability $\epsilon$, choose a random action (explore)
- Gradually decrease $\epsilon$ over time to favor exploitation as learning progresses
And the exploration function works as follows:
- Define an "exploration bonus" based on the uncertainty of the Q-values. Let $N(s, a)$ be the number of times action $a$ has been taken in state $s$. The exploration bonus can be defined as $k / N(s, a)$ for some constant $k$.
- When choosing actions, add the exploration bonus to the Q-value: $f(s, a) = Q(s, a) + k / N(s, a)$.
- This encourages the agent to explore less frequently visited actions, balancing exploration and exploitation.
- The exploration bonus shrinks on its own as $N(s, a)$ grows, favoring exploitation as learning progresses.
This approach can be more efficient than $\epsilon$-greedy, as it focuses exploration on less certain actions rather than on uniformly random ones. So it IS used in practice.
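To make the contrast concrete, here is a small sketch of count-based action selection with an exploration function; the constant k and the bookkeeping details are my own illustrative choices:

```python
from collections import defaultdict

class ExplorationFunctionAgent:
    """Chooses actions by maximizing f(s, a) = Q(s, a) + k / N(s, a)."""

    def __init__(self, k=1.0):
        self.Q = defaultdict(float)   # Q-value estimates
        self.N = defaultdict(int)     # visit counts per (state, action)
        self.k = k

    def select_action(self, s, actions):
        def f(a):
            n = self.N[(s, a)]
            bonus = self.k / n if n > 0 else float("inf")   # untried actions look maximally attractive
            return self.Q[(s, a)] + bonus
        a = max(actions, key=f)
        self.N[(s, a)] += 1
        return a
```

As the visit count $N(s, a)$ grows, the bonus $k / N(s, a)$ shrinks, so the agent naturally shifts from exploration to exploitation without an explicit schedule.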
Experience Replay#
Experience replay is an optimization technique used in reinforcement learning, particularly in deep Q-learning, so I'll add it as a subtopic here.
How?
Experience replay enhances Q-learning by storing the agent's experiences (transitions) in a replay buffer. Instead of updating Q-values using only the most recent experience, the agent stores each experience in the buffer and randomly samples batches of past experiences from it for training.
Why?
In plain Q-learning, the agent learns from its experiences sequentially, and consecutive experiences are often highly similar. This correlation makes learning inefficient. By using experience replay, the agent breaks these correlations and learns from a more diverse set of experiences; random sampling creates more independent training examples.
This is especially important in deep reinforcement learning where neural networks are used to approximate Q-values.
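A minimal sketch of a replay buffer; the capacity and batch size below are illustrative, not from the lecture:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and hands back random mini-batches for training."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```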
With all the problems addressed above, we still cannot put the Q-learning algorithm into practice: the state space is too large, and we cannot store a Q-value for every single state-action pair. So we need function approximation to generalize across similar states, and this is where Approximate Q-Learning comes in.
Hold on a second#
But before that, I do believe I need to clarify some points here.
You might think: "Why is Q-learning discussed under active learning? Couldn't Q-learning be used in passive learning, and TD learning in active learning?"
Yes, you are right. In CS188 (and many RL courses), the algorithms are typically presented in this order:
- TD Learning is introduced first as a way to learn value functions for passive learning
- Q-Learning is introduced next as a way to extend these ideas to active learning
This pedagogical approach sometimes creates the impression that these algorithms are strictly tied to their respective learning categories, but they’re more flexible than that.
The main difference is that TD learning (as typically presented) learns state values V(s) while Q-learning learns state-action values Q(s,a). Q-values naturally lend themselves to policy improvement (just take argmax), which is why Q-learning is often presented in the active learning context.
Approximate Q-Learning#
How?
In environments with large or continuous state spaces, it’s impractical to maintain a separate Q-value for each state-action pair. Approximate Q-learning uses function approximation to generalize across similar states.
A simple solution: recall the "feature function" that we discussed in the game tree lecture. We describe a state using a vector of features (properties) and learn a linear function of these features:

$$Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \dots + w_n f_n(s, a)$$

where $w_1, \dots, w_n$ are weights that we learn through experience. This is linear function approximation. We can also use non-linear function approximators like neural networks, but the basic idea is the same: learn a function that maps states (and actions) to Q-values.
And you might wonder: "How do we learn the weights?" The answer is that we can use the same Q-learning update rule, but instead of updating the Q-value directly, we update the weights using a simple trick: "if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features".
So the update rule for the weights becomes:

$$w_i \leftarrow w_i + \alpha \cdot \text{difference} \cdot f_i(s, a)$$

where $f_i(s, a)$ is the value of the $i$-th feature for state $s$ and action $a$. This means we are updating the weights based on the features that were present in the current state-action pair.
The update rule for $Q(s, a)$ is still the same:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \text{difference}$$

where

$$\text{difference} = \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \right] - Q(s, a)$$
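Here is a minimal sketch of one such weight update with a linear feature representation; the features(s, a) signature (returning a dict of named feature values) is my own choice for illustration:

```python
def q_value(weights, features, s, a):
    """Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(weights.get(name, 0.0) * value for name, value in features(s, a).items())

def approx_q_update(weights, features, s, a, r, s_next, next_actions,
                    alpha=0.01, gamma=0.9):
    """One approximate Q-learning step: update the weights, not a Q-table."""
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    best_next = max((q_value(weights, features, s_next, a2) for a2 in next_actions),
                    default=0.0)
    difference = (r + gamma * best_next) - q_value(weights, features, s, a)

    # w_i <- w_i + alpha * difference * f_i(s, a):
    # every feature that was "on" gets credit (or blame) proportional to its value.
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```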
Why?
Approximate Q-learning allows reinforcement learning to scale to complex environments with huge state spaces (like Atari games, robotics, etc.) where tabular methods would be impossible.
It enables generalization across similar states, so learning in one state can improve performance in similar states, even those the agent hasn’t encountered yet.
Problem
Approximate Q-learning faces these two challenges:
- Forgetting - learning in one region of the state space can undo learning in another region
- Feature selection - choosing the right representation for states is critical for good generalization
Policy Gradient Methods#
How?
Instead of learning a value function and deriving a policy from it, policy gradient methods directly parameterize the policy itself. That is, the agent's behavior is described by a function $\pi_\theta(a \mid s)$, where $\theta$ are the parameters (often the weights of a neural network). The goal is to adjust $\theta$ so that the expected return (the sum of rewards) is maximized.
The core idea is to use gradient ascent: we estimate how changing the parameters would affect the expected return, and then nudge the parameters in that direction. The update rule looks like this:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $J(\theta)$ is the expected return under the current policy.
But how do we compute this gradient? The answer is the policy gradient theorem, which tells us that the gradient of the expected return can be estimated using samples from the environment:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t$$

Here, $G_t$ is the return (sum of discounted rewards) following the action taken at time $t$ in state $s_t$. In practice, we run episodes, collect rewards, and use these samples to estimate the gradient.
This approach is called REINFORCE, the simplest policy gradient algorithm. Each time the agent takes an action, it computes the gradient of the log-probability of that action, multiplies it by the return, and uses that as the update direction.
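Here is a minimal REINFORCE sketch with a tabular softmax policy instead of a neural network; the env interface and integer state indexing are my own simplifying assumptions:

```python
import numpy as np

def reinforce(env, num_states, num_actions, num_episodes=1000, alpha=0.01, gamma=0.99):
    """REINFORCE: run an episode, then push up the log-probability of each
    action taken, weighted by the return that followed it."""
    theta = np.zeros((num_states, num_actions))   # policy parameters, one row per state

    def policy_probs(s):
        prefs = theta[s] - theta[s].max()          # subtract max for numerical stability
        exps = np.exp(prefs)
        return exps / exps.sum()

    for _ in range(num_episodes):
        # 1. Sample one full episode with the current policy.
        trajectory, s, done = [], env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=policy_probs(s))
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next

        # 2. Walk backwards to compute returns G_t and take a gradient ascent step.
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            probs = policy_probs(s)
            grad_log = -probs                      # gradient of log softmax w.r.t. theta[s]
            grad_log[a] += 1.0
            theta[s] += alpha * G * grad_log       # theta <- theta + alpha * G_t * grad log pi(a_t|s_t)
    return theta
```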
Why?
Policy gradient methods are powerful for several reasons. First, they allow us to optimize the policy directly, which is what we ultimately care about. This is especially useful in environments with continuous or high-dimensional action spaces, where value-based methods struggle. Policy gradients can also learn stochastic policies, which can be optimal in environments with inherent randomness or partial observability.
Another advantage is that policy gradient methods can be combined with function approximation (e.g., neural networks) to handle very large or continuous state spaces. This is the foundation of modern deep reinforcement learning algorithms.