In OpenAI's actor-critic and REINFORCE examples, the rewards are normalized like so
rewards = (rewards - rewards.mean()) / (rewards.std() + eps)
on every episode individually.
This is probably the baseline reduction, but I'm not entirely sure why they also divide by the standard deviation of the rewards.
Assuming this is the baseline reduction, why is it done per episode?
What if one episode yields rewards in the (absolute, not normalized) range of $[0, 1]$, and the next episode yields rewards in the range of $[100, 200]$?
This method seems to ignore the absolute difference between the episodes' rewards.
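To make the concern concrete, here is a minimal sketch (using plain NumPy rather than the original PyTorch tensors; the episode rewards are made up) showing that two episodes on very different absolute scales become indistinguishable after this per-episode normalization:

```python
import numpy as np

eps = np.finfo(np.float32).eps.item()

def normalize(rewards):
    # Per-episode normalization, as in the quoted line:
    # subtract the episode mean and divide by the episode std.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Two hypothetical episodes on very different absolute scales.
small = [0.0, 0.5, 1.0]        # rewards in [0, 1]
large = [100.0, 150.0, 200.0]  # rewards in [100, 200]

# Both normalize to roughly [-1.22, 0.0, 1.22]:
print(normalize(small))
print(normalize(large))  # identical -- the absolute scale is gone
```

After normalization the two episodes produce exactly the same update signal, which is precisely the information loss the question is asking about.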