In OpenAI's actor-critic and in OpenAI's REINFORCE, the rewards are normalized like so

rewards = (rewards - rewards.mean()) / (rewards.std() + eps)

on every episode individually.

This is probably the baseline reduction, but I'm not entirely sure why they divide by the standard deviation of the rewards.

Assuming this is the baseline reduction, why is this done per episode?

What if one episode yields rewards in the (absolute, not normalized) range of $[0, 1]$, and the next episode yields rewards in the range of $[100, 200]$?

This method seems to ignore the absolute difference between the episodes' rewards.
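To make the concern concrete, here is a minimal, self-contained sketch of that normalization step (plain Python with a population standard deviation for simplicity; note that PyTorch's `.std()` defaults to the unbiased estimator, so the exact values differ slightly):

```python
import math

EPS = 1e-8  # matches the `eps` in the snippet above; avoids division by zero

def normalize(rewards):
    """Standardize one episode's returns to zero mean and (roughly) unit std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + EPS) for r in rewards]

# Two episodes with very different absolute reward scales:
low = normalize([0.0, 0.5, 1.0])
high = normalize([100.0, 150.0, 200.0])

# After per-episode normalization the two episodes are (numerically)
# indistinguishable, despite the 100x difference in absolute rewards --
# which is exactly the behaviour the question is asking about.
print(low)
print(high)
```

Running this shows both episodes mapping to the same standardized values, illustrating how per-episode normalization discards the absolute scale difference between episodes.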

Post Migrated Here from stats.stackexchange.com (revisions)
Source Link
Gulzar