Questions tagged [policy-gradients]
For questions related to reinforcement learning algorithms often referred to as "policy gradients" (or "policy gradient algorithms"), which attempt to directly optimise a parameterised policy (without first attempting to estimate value functions) using gradients of an objective function with respect to the policy's parameters.
206 questions
3
votes
0
answers
75
views
Is it incorrect to drop the gradient operator in the PPO derivation?
The Proximal Policy Optimization Algorithms paper by Schulman et al. says the following.
The most commonly used gradient estimator has the form
$$\widehat{g} = \widehat{\mathbb{E}}_{t}\left[\nabla_{\...
4
votes
1
answer
112
views
Why Are the Standard and Markov Chain Derivations of the Policy Gradient Theorem Equivalent?
While studying the proof of the Policy Gradient Theorem, I have come across two different approaches.
The first seems to be a more standard approach involving "unrolling" across every time ...
1
vote
1
answer
76
views
The scoring function of the policy
I was reading the book and saw the formula for optimizing $\theta$:
$$
\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta) \\
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}...
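The update rule quoted above can be illustrated with a minimal, hypothetical REINFORCE sketch: a two-action softmax policy on a bandit, updated with $\theta \leftarrow \theta + \alpha R \nabla_\theta \log \pi_\theta(a)$. All names and constants here are illustrative, not taken from the question.

```python
import math
import random

def softmax(theta):
    # Numerically stable softmax over the parameter vector.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    # For a softmax policy: d/d theta_i log pi(a) = 1{i == a} - pi(i).
    p = softmax(theta)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(theta))]

def reinforce_step(theta, a, ret, alpha=0.1):
    # theta <- theta + alpha * R * grad log pi(a), as in the quoted update.
    g = grad_log_pi(theta, a)
    return [th + alpha * ret * gi for th, gi in zip(theta, g)]

random.seed(0)
theta = [0.0, 0.0]
# Toy bandit: action 1 pays reward 1, action 0 pays 0.
for _ in range(2000):
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if a == 1 else 0.0
    theta = reinforce_step(theta, a, r)

print(softmax(theta)[1] > 0.9)  # True: the policy learns to prefer action 1
```

The score-function gradient raises the log-probability of actions in proportion to the return they earned, which is exactly what the expectation in the quoted formula averages over trajectories.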
0
votes
0
answers
42
views
Model the Policy for policy gradient for the 2D cutting stock problem
I need to implement a policy gradient algorithm (actor-critic) for the 2D cutting stock problem with varied-size stocks. However, I'm new to machine learning, so I still have no clue how to design the ...
1
vote
1
answer
114
views
Expected return formula for deterministic policy
I have a question regarding how the expected return of a deterministic policy is written. I have seen that in some cases they use the Q-function, as shown in the Objective function part ...
1
vote
1
answer
157
views
Deep RL problem: Loss decreases but agent doesn't learn
I'm implementing a basic Vanilla Policy Gradient algorithm for the CartPole-v1 gymnasium environment, and I don't know what I'm doing wrong. No matter what I try, during the training loop the loss ...
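A general diagnostic for this kind of situation (hedged advice, not a fix for this specific code): the policy-gradient "loss" is only a surrogate, so a falling loss does not imply the agent is learning; mean episode return is the metric to watch. Return-computation helpers are a frequent bug site, so here is a sketch of a reward-to-go helper with illustrative names:

```python
def rewards_to_go(rewards, gamma=0.99):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for each timestep t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate discounted returns backwards through the episode.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Weighting each log-probability by the reward-to-go from that step (rather than the full episode return) is a standard variance reduction and a common source of sign/indexing bugs in vanilla policy gradient implementations.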
0
votes
3
answers
137
views
Why does TD3/DDPG use −E[Q(s, π(s))] as the policy loss without causing Q-values to go to infinity?
I tried to understand why TD3/DDPG use a policy loss of −E[Q(s,π(s))], which should make the policy maximize Q-values. I expected this to push Q-values to infinity over time, as there’s no explicit ...
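One way to see why this does not diverge (a simplified sketch under standard assumptions, not the papers' exact algorithm): the critic is regressed onto Bellman targets built from real, bounded environment rewards, so with |r| ≤ r_max and discount γ the targets, and hence Q, are bounded by r_max / (1 − γ). The actor only climbs this bounded critic; it never trains Q itself.

```python
def td_target(reward, q_next, gamma=0.99, done=False):
    # Critic regression target: r + gamma * Q(s', pi(s')); zero bootstrap
    # at terminal states. This is what anchors Q to environment rewards.
    return reward + (0.0 if done else gamma * q_next)

# With rewards in [0, 1] and gamma = 0.9, the fixed point of repeated
# bootstrapping is at most 1 / (1 - 0.9) = 10, so Q cannot run away.
print(td_target(1.0, 10.0, gamma=0.9))  # 10.0
print(td_target(1.0, 10.0, gamma=0.9, done=True))  # 1.0
```

The illustrative names (`td_target`, `q_next`) are mine, not from the TD3/DDPG papers; the point is only that the critic's training signal, unlike the actor's, is tied to bounded rewards.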
0
votes
1
answer
118
views
Proof of representation as a trajectory is the same as a representation in terms of state-action pairs
I would like to prove the following equation, which pops up everywhere in RL, and I would like to have a clean proof of it. It kind of gives a representation as a trajectory vs. a representation in ...
1
vote
1
answer
59
views
Interpretation of changing action probability based on policy gradient expression
Please see slide 78 in https://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
$$\nabla_{\theta} J\left(\theta\right) \approx\sum_{t\ge 0}r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t}...
2
votes
1
answer
63
views
Question on "Implementing the Simplest Policy Gradient" from Spinning Up
In Part 3: Intro to Policy Optimization from spinningup documentation, there is a formula to compute the estimate of the policy gradient:
This is an expectation, which means that we can estimate ...
1
vote
0
answers
88
views
Proof for Using Q-Function in Policy Gradient Formula
Currently I am reading the OpenAI Spinning Up documentation about policy gradients and the actor-critic method. On the webpage, in the part that replaces the return with the action value, I think they are trying to prove that the ...
4
votes
1
answer
468
views
Can we simply remove the log term for loss in policy gradient methods?
If I understand correctly, the goal of vanilla policy gradients is maximizing $E[r(s_t,a_t);\pi_\theta]$; in deriving the gradient of this objective as an explicit function of $\theta$, we get $\sum_{t=0}...
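For reference, the log term comes from the standard log-derivative trick (stated here from the usual derivation, not specific to this question): the log is not an arbitrary choice but what lets the gradient be written as an expectation under $\pi_\theta$, which is what makes it estimable from samples.

```latex
\nabla_\theta \pi_\theta(\tau)
  = \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}
  = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau),
\qquad\text{hence}\qquad
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[r(\tau)\right]
  = \int r(\tau)\,\nabla_\theta \pi_\theta(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right].
```

Dropping the log would leave $\int r(\tau)\,\nabla_\theta \pi_\theta(\tau)\, d\tau$, which is no longer an expectation under $\pi_\theta$ and so cannot be approximated by averaging over sampled trajectories.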
0
votes
1
answer
112
views
I have a few doubts understanding and implementing Proximal Policy Optimisation Algorithm [closed]
What is the difference between a rollout buffer and a replay buffer (as used in DQNs). Why can't they be used interchangeably?
Why is the trajectory sampling parallelized? Is it just for making data ...
1
vote
2
answers
157
views
How are these two terms equivalent in Sutton and Barto's derivation of the REINFORCE algorithm?
After reading Sutton and Barto, I was able to understand the derivation of this theorem. The only thing I don't get is the following part from REINFORCE algorithm:
How are these terms equivalent, and ...
3
votes
1
answer
189
views
Why do big policy updates cause performance drop in deep RL?
In the TRPO and PPO papers, it is mentioned that large policy updates often lead to performance drops in policy gradient methods.
By "large policy updates," they mean a significant KL ...
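PPO's remedy for this is to remove the incentive for moving the probability ratio far from 1 within a single update. A minimal, hypothetical per-sample sketch of the clipped surrogate (illustrative names, not the papers' code):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); clip it into [1 - eps, 1 + eps]
    # and take the pessimistic minimum, as in the PPO clipped objective.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: a ratio of 2.0 earns no more than a ratio of 1.2,
# so the gradient incentive to push the policy further vanishes.
print(ppo_clip_objective(2.0, 1.0))
# Negative advantage: the minimum keeps the pessimistic (clipped) value.
print(ppo_clip_objective(0.5, -1.0))
```

Because the objective flattens once the ratio leaves the clip interval, gradient ascent has no incentive to take the kind of large policy step that, per the TRPO/PPO papers, tends to cause performance collapse.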