$\begingroup$

To derive the policy gradient, we start by writing the equation for the probability of a certain trajectory (e.g. see the OpenAI Spinning Up tutorial):

$$ \begin{align} P_\theta(\tau) &= P_\theta(s_0, a_0, s_1, a_1, \dots, s_T, a_T) \\ & = p(s_0) \prod_{i=0}^T \pi_\theta(a_i | s_i) p(s_{i+1} | s_i, a_i) \end{align} $$

The expression is based on the chain rule for probability. My understanding is that applying the chain rule should give this expression:

$$ p(s_0)\prod_{i=0}^T \pi_\theta(a_i|s_i, a_{i-1}, s_{i-1}, a_{i-2}, \dots, s_0, a_0) p(s_{i+1} | s_i, a_i, s_{i-1}, a_{i-1}, \dots, a_0, s_0) $$

The Markov property should then be applicable, producing the desired equality: each factor ends up depending only on the latest state-action pair.
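To make the factorization concrete, here is a minimal sketch (a hypothetical random tabular MDP, not from any of the cited sources) that computes a trajectory's log-probability under the Markov assumption, where each factor conditions only on the latest state-action pair:

```python
import numpy as np

# Hypothetical tabular MDP, sampled at random for illustration.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

p0 = np.full(n_states, 1.0 / n_states)                 # initial distribution p(s_0)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # pi[s, a] = pi(a | s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']

def traj_log_prob(states, actions):
    """log P(tau) = log p(s_0) + sum_i [log pi(a_i|s_i) + log p(s_{i+1}|s_i, a_i)].

    Note that each factor conditions only on the latest (s, a), never on the
    full history -- this is exactly where the Markov property is used.
    """
    logp = np.log(p0[states[0]])
    for i, a in enumerate(actions):
        s, s_next = states[i], states[i + 1]
        logp += np.log(pi[s, a]) + np.log(P[s, a, s_next])
    return logp

states = [0, 2, 1]   # s_0, s_1, s_2
actions = [1, 0]     # a_0, a_1
print(traj_log_prob(states, actions))
```

Without the Markov assumption, `pi` and `P` would each need to be indexed by the entire history rather than by a single `(s, a)` pair.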

Here are my questions:

  1. Is this true?

  2. I watched a lecture about policy gradients, and at one point Sergey Levine says: "at no point did we use the Markov property when we derived the policy gradient", which left me confused, since I assumed that the initial step of writing down the trajectory probability already uses the Markov property.

$\endgroup$
  • $\begingroup$ In the first equation you show, you definitely are using the Markov property. When deriving the policy gradient you don't explicitly use the Markov property; if you refer to the Sutton and Barto derivation this is clear. However, I am still of the opinion that the Markov property is used, as the underlying assumption of the MDP is that the Markov property holds (e.g. our policy is conditioned only on the current state, not on the whole trajectory). $\endgroup$ Commented Dec 21, 2020 at 0:27
  • $\begingroup$ I watched the videos you’re watching and I would argue that the way he derives the policy gradient definitely does use the Markov property — he directly uses the probability of the trajectory that you have in your question and the LHS = RHS only if you assume the Markov property holds, otherwise as you say you would end up with a term like the second equation you have written. $\endgroup$ Commented Dec 21, 2020 at 0:58
  • $\begingroup$ I still have to read the policy gradient chapter in Sutton & Barto. Maybe comparing the two derivations can clear it up. I wonder if conditioning on the policy is the answer, so it should be $P(\tau | \pi)$. I just realized that's how it's written in the Spinning Up docs. $\endgroup$ Commented Dec 21, 2020 at 1:19
  • $\begingroup$ I don’t know what they mean by ‘conditioning on a policy’ — that is analogous to conditioning on a density function. $\endgroup$ Commented Dec 21, 2020 at 1:23

2 Answers

$\begingroup$

I think the equation doesn't quite check out: when $i$ runs to $T$, there is no $s_{T+1}$ to plug into the transition factor $p(s_{i+1} \mid s_i, a_i)$, since the trajectory ends at $a_T$. The product over transitions should stop at $i = T-1$. Sorry, this isn't a full answer.

$\endgroup$
$\begingroup$

Sergey Levine's comment does seem confusing; here is a clarification. The policy gradient derivation itself does not explicitly invoke the Markov property: it operates only on the probability distribution over trajectories induced by the parameterized policy $\pi_{\theta}$. The Markov assumption enters one step earlier, when that trajectory distribution is factored, since in most RL settings with fully observable states the trajectories are assumed to be generated by an MDP.
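To make this concrete, consider the gradient of the log trajectory probability. The initial-state and transition factors do not depend on $\theta$, so they vanish under $\nabla_\theta$, and the dynamics never appear in the gradient regardless of how the conditioning in those factors is written:

$$ \nabla_\theta \log P_\theta(\tau) = \underbrace{\nabla_\theta \log p(s_0)}_{=0} + \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) + \sum_i \underbrace{\nabla_\theta \log p(s_{i+1} \mid s_i, a_i)}_{=0} = \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) $$

This is why the derivation can proceed without ever appealing to the Markov property: the only $\theta$-dependent terms are the policy factors.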

$\endgroup$
