Questions tagged [policy-gradients]
For questions related to reinforcement learning algorithms often referred to as "policy gradients" (or "policy gradient algorithms"), which attempt to directly optimise a parameterised policy (without first attempting to estimate value functions) using gradients of an objective function with respect to the policy's parameters.
206 questions
3
votes
0
answers
75
views
Is it incorrect to drop the gradient operator in the PPO derivation?
The Proximal Policy Optimization Algorithms paper by Schulman et al. says the following.
The most commonly used gradient estimator has the form
$$\widehat{g} = \widehat{\mathbb{E}}_{t}\left[\nabla_{\...
4
votes
1
answer
112
views
Why Are the Standard and Markov Chain Derivations of the Policy Gradient Theorem Equivalent?
While studying the proof of the Policy Gradient Theorem, I have come across two different approaches.
The first seems to be a more standard approach involving "unrolling" across every time ...
1
vote
1
answer
76
views
The scoring function of the policy
I was reading the book and saw the formula for optimizing $\theta$:
$$
\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta) \\
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}...
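The update rule quoted above can be illustrated with a minimal, hypothetical REINFORCE sketch: a two-action softmax policy on a bandit, updated with $\theta \leftarrow \theta + \alpha R \nabla_\theta \log \pi_\theta(a)$. All names and constants here are illustrative, not taken from the question.

```python
import math
import random

def softmax(theta):
    # Numerically stable softmax over the parameter vector.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    # For a softmax policy: d/d theta_i log pi(a) = 1{i == a} - pi(i).
    p = softmax(theta)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(theta))]

def reinforce_step(theta, a, ret, alpha=0.1):
    # theta <- theta + alpha * R * grad log pi(a), as in the quoted update.
    g = grad_log_pi(theta, a)
    return [th + alpha * ret * gi for th, gi in zip(theta, g)]

random.seed(0)
theta = [0.0, 0.0]
# Toy bandit: action 1 pays reward 1, action 0 pays 0.
for _ in range(2000):
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if a == 1 else 0.0
    theta = reinforce_step(theta, a, r)

print(softmax(theta)[1] > 0.9)  # True: the policy learns to prefer action 1
```

The score-function gradient raises the log-probability of actions in proportion to the return they earned, which is exactly what the expectation in the quoted formula averages over trajectories.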
0
votes
0
answers
42
views
Model the Policy for policy gradient for the 2D cutting stock problem
I need to implement a policy gradient algorithm (actor-critic) for the 2D cutting stock problem with varied-size stocks. However, I'm new to machine learning, so I still have no clue how to design the ...
1
vote
1
answer
114
views
Expected return formula for deterministic policy
I have a question regarding how the expected return of a deterministic policy is written. I have seen that in some cases they use the Q-function, as shown in the Objective function part ...
1
vote
1
answer
157
views
Deep RL problem: Loss decreases but agent doesn't learn
I'm implementing a basic Vanilla Policy Gradient algorithm for the CartPole-v1 gymnasium environment, and I don't know what I'm doing wrong. No matter what I try, during the training loop the loss ...
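A general diagnostic for this kind of situation (hedged advice, not a fix for this specific code): the policy-gradient "loss" is only a surrogate, so a falling loss does not imply the agent is learning; mean episode return is the metric to watch. Return-computation helpers are a frequent bug site, so here is a sketch of a reward-to-go helper with illustrative names:

```python
def rewards_to_go(rewards, gamma=0.99):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for each timestep t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate discounted returns backwards through the episode.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Weighting each log-probability by the reward-to-go from that step (rather than the full episode return) is a standard variance reduction and a common source of sign/indexing bugs in vanilla policy gradient implementations.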
0
votes
3
answers
137
views
Why does TD3/DDPG use −E[Q(s, π(s))] as the policy loss without causing Q-values to go to infinity?
I tried to understand why TD3/DDPG use a policy loss of −E[Q(s,π(s))], which should make the policy maximize Q-values. I expected this to push Q-values to infinity over time, as there’s no explicit ...
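One way to see why this does not diverge (a simplified sketch under standard assumptions, not the papers' exact algorithm): the critic is regressed onto Bellman targets built from real, bounded environment rewards, so with |r| ≤ r_max and discount γ the targets, and hence Q, are bounded by r_max / (1 − γ). The actor only climbs this bounded critic; it never trains Q itself.

```python
def td_target(reward, q_next, gamma=0.99, done=False):
    # Critic regression target: r + gamma * Q(s', pi(s')); zero bootstrap
    # at terminal states. This is what anchors Q to environment rewards.
    return reward + (0.0 if done else gamma * q_next)

# With rewards in [0, 1] and gamma = 0.9, the fixed point of repeated
# bootstrapping is at most 1 / (1 - 0.9) = 10, so Q cannot run away.
print(td_target(1.0, 10.0, gamma=0.9))  # 10.0
print(td_target(1.0, 10.0, gamma=0.9, done=True))  # 1.0
```

The illustrative names (`td_target`, `q_next`) are mine, not from the TD3/DDPG papers; the point is only that the critic's training signal, unlike the actor's, is tied to bounded rewards.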
0
votes
1
answer
118
views
Proof of representation as a trajectory is the same as a representation in terms of state-action pairs
I would like to prove the following equation, which pops up everywhere in RL, and I would like to have a clean proof of it. It kind of gives a representation as a trajectory vs. a representation in ...
1
vote
1
answer
59
views
Interpretation of changing action probability based on policy gradient expression
Please see slide 78 in https://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
$$\nabla_{\theta} J\left(\theta\right) \approx\sum_{t\ge 0}r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t}...
2
votes
1
answer
63
views
Question on "Implementing the Simplest Policy Gradient" from Spinning Up
In Part 3: Intro to Policy Optimization from spinningup documentation, there is a formula to compute the estimate of the policy gradient:
This is an expectation, which means that we can estimate ...
1
vote
0
answers
88
views
Proof for Using Q-Function in Policy Gradient Formula
Currently I am reading the OpenAI Spinning Up documentation about policy gradients and the actor-critic method. On the webpage, in the part that replaces the return with the action value, I think they are trying to prove that the ...
4
votes
1
answer
468
views
Can we simply remove the log term for loss in policy gradient methods?
If I understand correctly, the goal of vanilla policy gradients is maximizing $E[r(s_t,a_t);\pi_\theta]$; in deriving the gradient of this objective as an explicit function of $\theta$, we get $\sum_{t=0}...
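For reference, the log term comes from the standard log-derivative trick (stated here from the usual derivation, not specific to this question): the log is not an arbitrary choice but what lets the gradient be written as an expectation under $\pi_\theta$, which is what makes it estimable from samples.

```latex
\nabla_\theta \pi_\theta(\tau)
  = \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}
  = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau),
\qquad\text{hence}\qquad
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[r(\tau)\right]
  = \int r(\tau)\,\nabla_\theta \pi_\theta(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right].
```

Dropping the log would leave $\int r(\tau)\,\nabla_\theta \pi_\theta(\tau)\, d\tau$, which is no longer an expectation under $\pi_\theta$ and so cannot be approximated by averaging over sampled trajectories.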
0
votes
1
answer
112
views
I have a few doubts understanding and implementing Proximal Policy Optimisation Algorithm [closed]
What is the difference between a rollout buffer and a replay buffer (as used in DQNs). Why can't they be used interchangeably?
Why is the trajectory sampling parallelized? Is it just for making data ...
1
vote
2
answers
157
views
How are these two terms equivalent in Sutton and Barto's derivation of the REINFORCE algorithm?
After reading Sutton and Barto, I was able to understand the derivation of this theorem. The only thing I don't get is the following part from REINFORCE algorithm:
How are these terms equivalent, and ...
3
votes
1
answer
189
views
Why do big policy updates cause performance drop in deep RL?
In the TRPO and PPO papers, it is mentioned that large policy updates often lead to performance drops in policy gradient methods.
By "large policy updates," they mean a significant KL ...
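PPO's remedy for this is to remove the incentive for moving the probability ratio far from 1 within a single update. A minimal, hypothetical per-sample sketch of the clipped surrogate (illustrative names, not the papers' code):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); clip it into [1 - eps, 1 + eps]
    # and take the pessimistic minimum, as in the PPO clipped objective.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: a ratio of 2.0 earns no more than a ratio of 1.2,
# so the gradient incentive to push the policy further vanishes.
print(ppo_clip_objective(2.0, 1.0))
# Negative advantage: the minimum keeps the pessimistic (clipped) value.
print(ppo_clip_objective(0.5, -1.0))
```

Because the objective flattens once the ratio leaves the clip interval, gradient ascent has no incentive to take the kind of large policy step that, per the TRPO/PPO papers, tends to cause performance collapse.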