In OpenAI's actor-critic and REINFORCE examples, the rewards are normalized like so
rewards = (rewards - rewards.mean()) / (rewards.std() + eps)
on every episode individually.
This is probably the baseline reduction, but I'm not entirely sure why they also divide by the standard deviation of the rewards.
Assuming this is the baseline reduction, why is it done per episode?
What if one episode yields rewards in the (absolute, not normalized) range of $[0, 1]$, and the next episode yields rewards in the range of $[100, 200]$?
This method seems to ignore the absolute difference between the episodes' rewards.
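To make the concern concrete, here is a minimal sketch (using plain NumPy rather than the original PyTorch tensors; the episode rewards are made up) showing that two episodes on very different absolute scales become indistinguishable after this per-episode normalization:

```python
import numpy as np

eps = np.finfo(np.float32).eps.item()

def normalize(rewards):
    # Per-episode normalization, as in the quoted line:
    # subtract the episode mean and divide by the episode std.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Two hypothetical episodes on very different absolute scales.
small = [0.0, 0.5, 1.0]        # rewards in [0, 1]
large = [100.0, 150.0, 200.0]  # rewards in [100, 200]

# Both normalize to roughly [-1.22, 0.0, 1.22]:
print(normalize(small))
print(normalize(large))  # identical -- the absolute scale is gone
```

After normalization the two episodes produce exactly the same update signal, which is precisely the information loss the question is asking about.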