
Questions tagged [policy-gradients]

For questions related to reinforcement learning algorithms often referred to as "policy gradients" (or "policy gradient algorithms"), which attempt to directly optimise a parameterised policy (without first attempting to estimate value functions) using gradients of an objective function with respect to the policy's parameters.
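As a minimal illustration of the idea in the tag description (a sketch, not taken from any question below): a REINFORCE-style update on a toy two-armed bandit, where the softmax policy's parameters are nudged along the score-function gradient scaled by the reward. The bandit setup is hypothetical.

```python
# Minimal REINFORCE-style policy gradient on a toy 2-armed bandit.
# Hypothetical setup for illustration: arm 1 pays +1, arm 0 pays 0.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)              # policy parameters: one logit per arm

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

alpha = 0.1                      # learning rate
for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = 1.0 if a == 1 else 0.0           # observed reward
    grad_log = -probs                    # grad of log pi(a|theta) ...
    grad_log[a] += 1.0                   # ... for a softmax policy
    theta += alpha * r * grad_log        # gradient ascent on E[r]

print(softmax(theta))            # probability mass shifts to the better arm
```

No value function is estimated anywhere: the policy is optimised directly from sampled returns, which is the defining property of this tag.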

3 votes · 0 answers · 75 views

The Proximal Policy Optimization Algorithms paper by Schulman et al. says the following. The most commonly used gradient estimator has the form $$\widehat{g} = \widehat{\mathbb{E}}_{t}\left[\nabla_{\...
efthimio • 131
4 votes · 1 answer · 112 views

While studying the proof of the Policy Gradient Theorem, I have come across two different approaches. The first seems to be a more standard approach involving "unrolling" across every time ...
Jamie Stephenson
1 vote · 1 answer · 76 views

I was reading a book and came across the formula for optimising $\theta$: $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta), \qquad \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}...
Vietnamese IPhO Competitant -
0 votes · 0 answers · 42 views

I need to implement a policy gradient algorithm (actor-critic) for the 2D cutting stock problem with varied-size stocks. However, I'm new to machine learning, so I still have no clue how to design the ...
Phạm Trần Minh Trí
1 vote · 1 answer · 114 views

I have a question regarding how the expected return of a deterministic policy is written. I have seen that in some cases they use the Q-function, as shown in the part Objective function ...
marc_spector
1 vote · 1 answer · 157 views

I'm implementing a basic Vanilla Policy Gradient algorithm for the CartPole-v1 gymnasium environment, and I don't know what I'm doing wrong. No matter what I try, during the training loop the loss ...
wildBass
0 votes · 3 answers · 137 views

I tried to understand why TD3/DDPG use a policy loss of −E[Q(s,π(s))], which should make the policy maximize Q-values. I expected this to push Q-values to infinity over time, as there’s no explicit ...
Omar • 19
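For the TD3/DDPG question above, a toy numpy sketch of the actor objective $-\mathbb{E}[Q(s,\pi(s))]$ may help: with a linear actor and a fixed illustrative critic (hypothetical stand-ins for the actual TD3 networks), gradient descent on this loss drives the policy only toward the critic's maximiser, not to infinity, because the critic itself is anchored to rewards.

```python
# Sketch of the DDPG/TD3 actor update: gradient ascent on Q(s, pi(s)).
# Toy linear actor and a fixed toy critic, purely for illustration.
import numpy as np

w_actor = np.array([0.0])                  # actor: a = w_actor * s

def critic(s, a):
    return -(a - s) ** 2                   # toy Q: maximised at a = s

def actor_loss(w, s):
    a = w * s
    return -critic(s, a).mean()            # minimise -E[Q(s, pi(s))]

s = np.array([1.0, 2.0, -1.5])             # a small batch of states
eps, lr = 1e-5, 0.05
for _ in range(200):
    # finite-difference gradient of the actor loss w.r.t. w_actor
    g = (actor_loss(w_actor + eps, s) - actor_loss(w_actor - eps, s)) / (2 * eps)
    w_actor -= lr * g

print(w_actor)                             # converges near 1.0, the argmax of Q
```

The Q-values the actor sees cannot run off to infinity here: the actor only climbs the critic's surface, and the critic is (in real TD3/DDPG) regressed toward bootstrapped environment rewards.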
0 votes · 1 answer · 118 views

I would like to prove the following equation, which pops up everywhere in RL, and I would like to have a clean proof of it. It kind of gives a representation as a trajectory vs. a representation in ...
craaaft • 139
1 vote · 1 answer · 59 views

Please see slide 78 in https://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf $$\nabla_{\theta} J\left(\theta\right) \approx\sum_{t\ge 0}r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t}...
DSPinfinity • 1,273
2 votes · 1 answer · 63 views

In Part 3: Intro to Policy Optimization of the Spinning Up documentation, there is a formula to compute an estimate of the policy gradient: "This is an expectation, which means that we can estimate ...
Charles Ju
1 vote · 0 answers · 88 views

Currently I am reading the OpenAI Spinning Up document about policy gradients and the actor-critic method. In this webpage, in "replace Return with Action value", I think they are trying to prove that the ...
jim1124 • 13
4 votes · 1 answer · 468 views

If I understand correctly, the goal of vanilla policy gradients is maximizing $E[r(s_t,a_t);\pi_\theta]$; in deriving the gradient of this function as a clearer function on $\theta$, we get $\sum_{t=0}...
User • 225
0 votes · 1 answer · 112 views

What is the difference between a rollout buffer and a replay buffer (as used in DQNs)? Why can't they be used interchangeably? Why is the trajectory sampling parallelized? Is it just for making data ...
DeadAsDuck
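For the rollout-vs-replay question above, a minimal sketch of the usual distinction (class and method names here are illustrative, not from Stable-Baselines or any specific library): a rollout buffer holds one batch of fresh on-policy transitions and is cleared after each update, while a replay buffer keeps old off-policy data and resamples it many times.

```python
# Illustrative contrast between the two buffer types.
import random
from collections import deque

class RolloutBuffer:
    """On-policy: filled by the current policy, consumed once, then cleared."""
    def __init__(self):
        self.data = []
    def add(self, transition):
        self.data.append(transition)
    def get_and_clear(self):
        batch, self.data = self.data, []   # data is stale after one update
        return batch

class ReplayBuffer:
    """Off-policy (DQN-style): old transitions are kept and reused."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest entries evicted
    def add(self, transition):
        self.data.append(transition)
    def sample(self, k):
        return random.sample(list(self.data), k)
```

This is why they are not interchangeable for policy gradient methods: the gradient estimator is an expectation under the *current* policy, so replayed transitions from an old policy bias it (absent importance-weighting corrections).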
1 vote · 2 answers · 157 views

After reading Sutton and Barto, I was able to understand the derivation of this theorem. The only thing I don't get is the following part from REINFORCE algorithm: How are these terms equivalent, and ...
DeadAsDuck
3 votes · 1 answer · 189 views

In the TRPO and PPO papers, it is mentioned that large policy updates often lead to performance drops in policy gradient methods. By "large policy updates," they mean a significant KL ...
Druudik • 191
