"""Example of using a count-based curiosity mechanism to learn in sparse-rewards envs.
This example:
- demonstrates how to define your own count-based curiosity ConnectorV2 piece
that computes intrinsic rewards based on simple observation counts and adds these
intrinsic rewards to the "main" (extrinsic) rewards.
- shows how this connector piece overrides the main (extrinsic) rewards in the
episode and thus demonstrates how to do reward shaping in general with RLlib.
- shows how to plug this connector piece into your algorithm's config.
- uses Tune and RLlib to learn the env described above and compares 2
algorithms, one that does use curiosity vs one that does not.
We use a FrozenLake (sparse reward) environment with a map size of 8x8 and a time step
limit of 14 to make it almost impossible for a non-curiosity based policy to learn.
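
For reference, a count-based curiosity Learner connector piece roughly follows the
pattern sketched below. This is a simplified, illustrative sketch only, NOT the
actual `CountBasedCuriosity` class imported by this script (whose exact signatures
and episode accessors may differ slightly across RLlib versions); the class name
`NaiveCountBasedCuriosity` is made up for this sketch:

    from collections import Counter

    from ray.rllib.connectors.connector_v2 import ConnectorV2

    class NaiveCountBasedCuriosity(ConnectorV2):
        # Adds `intrinsic_reward_coeff / N(obs)` to each reward, where N(obs)
        # is the number of times `obs` has been seen so far.
        def __init__(
            self,
            input_observation_space=None,
            input_action_space=None,
            *,
            intrinsic_reward_coeff=1.0,
            **kwargs,
        ):
            super().__init__(input_observation_space, input_action_space)
            # Naive (tabular) observation counter.
            self._counts = Counter()
            self.intrinsic_reward_coeff = intrinsic_reward_coeff

        def __call__(self, *, rl_module, batch, episodes, **kwargs):
            # Loop through all single-agent episodes in the incoming train data.
            for sa_episode in self.single_agent_episode_iterator(
                episodes, agents_that_stepped_only=False
            ):
                # All observations except the last one (no reward follows it yet)
                # and their corresponding (extrinsic) rewards.
                observations = sa_episode.get_observations(slice(None, -1))
                rewards = sa_episode.get_rewards()
                for i, (obs, rew) in enumerate(zip(observations, rewards)):
                    # Assumes hashable observations (e.g. FrozenLake's discrete
                    # states); increment the count for this observation.
                    self._counts[obs] += 1
                    # The intrinsic bonus decays the more often `obs` was seen.
                    rew += self.intrinsic_reward_coeff / self._counts[obs]
                    # Write the shaped reward back into the episode.
                    sa_episode.set_rewards(new_data=rew, at_indices=i)
            return batch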

How to run this script
----------------------
`python [script file name].py --enable-new-api-stack`

Use the `--no-curiosity` flag to disable curiosity learning and force your policy
to be trained on the task without the use of intrinsic rewards. With this option,
the algorithm should NOT succeed.

For debugging, use the following additional command line options
`--no-tune --num-env-runners=0`
which should allow you to set breakpoints anywhere in the RLlib code and
have the execution stop there for inspection and debugging.

For logging to your WandB account, use:
`--wandb-key=[your WandB API key] --wandb-project=[some project name]
--wandb-run-name=[optional: WandB run name (within the defined project)]`

Results to expect
-----------------
In the console output, you can see that only a PPO policy that uses curiosity can
actually learn.

Policy using count-based curiosity:
+-------------------------------+------------+--------+------------------+
| Trial name | status | iter | total time (s) |
| | | | |
|-------------------------------+------------+--------+------------------+
| PPO_FrozenLake-v1_109de_00000 | TERMINATED | 48 | 44.46 |
+-------------------------------+------------+--------+------------------+
+------------------------+-------------------------+------------------------+
| episode_return_mean | num_episodes_lifetime | num_env_steps_traine |
| | | d_lifetime |
|------------------------+-------------------------+------------------------|
| 0.99 | 12960 | 194000 |
+------------------------+-------------------------+------------------------+

Policy NOT using curiosity:
[DOES NOT LEARN AT ALL]
"""

from ray.rllib.connectors.env_to_module import FlattenObservations
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig
from ray.rllib.examples.connectors.classes.count_based_curiosity import (
CountBasedCuriosity,
)
from ray.rllib.utils.test_utils import (
add_rllib_example_script_args,
run_rllib_example_script_experiment,
)
from ray.tune.registry import get_trainable_cls

parser = add_rllib_example_script_args(
default_reward=0.99, default_iters=200, default_timesteps=1000000
)
parser.set_defaults(enable_new_api_stack=True)
parser.add_argument(
"--intrinsic-reward-coeff",
type=float,
default=1.0,
help="The weight with which to multiply intrinsic rewards before adding them to "
"the extrinsic ones (default is 1.0).",
)
parser.add_argument(
"--no-curiosity",
action="store_true",
help="Whether to NOT use count-based curiosity.",
)

ENV_OPTIONS = {
"is_slippery": False,
# Use this hard-to-solve 8x8 map with lots of holes (H) to fall into and only very
# few valid paths from the starting state (S) to the goal state (G).
"desc": [
"SFFHFFFH",
"FFFHFFFF",
"FFFHHFFF",
"FFFFFFFH",
"HFFHFFFF",
"HHFHFFHF",
"FFFHFHHF",
"FHFFFFFG",
],
# Limit the number of steps the agent is allowed to make in the env to
# make it almost impossible to learn without (count-based) curiosity.
"max_episode_steps": 14,
}
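
# Note (illustration only): for an env given as a registered gymnasium ID string,
# RLlib forwards `env_config` as keyword arguments to `gym.make()`, so the exact
# same environment can presumably be reproduced standalone for a quick sanity
# check:
#
#   import gymnasium as gym
#   env = gym.make("FrozenLake-v1", **ENV_OPTIONS)
#   obs, info = env.reset()
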
if __name__ == "__main__":
args = parser.parse_args()

    base_config = (
get_trainable_cls(args.algo)
.get_default_config()
.environment(
"FrozenLake-v1",
env_config=ENV_OPTIONS,
)
.env_runners(
num_envs_per_env_runner=5,
# Flatten discrete observations (into one-hot vectors).
env_to_module_connector=lambda env: FlattenObservations(),
)
.training(
# The main code in this example: We add the `CountBasedCuriosity` connector
# piece to our Learner connector pipeline.
# This pipeline is fed with collected episodes (either directly from the
# EnvRunners in on-policy fashion or from a replay buffer) and converts
# these episodes into the final train batch. The added piece computes
            # intrinsic rewards based on simple observation counts and adds them to
            # the "main" (extrinsic) rewards. The CLI-provided
            # `--intrinsic-reward-coeff` is forwarded to the connector piece
            # (assuming it accepts an `intrinsic_reward_coeff` kwarg).
            learner_connector=(
                None
                if args.no_curiosity
                else lambda *ags, **kw: CountBasedCuriosity(
                    intrinsic_reward_coeff=args.intrinsic_reward_coeff
                )
            ),
num_epochs=10,
vf_loss_coeff=0.01,
)
.rl_module(model_config=DefaultModelConfig(vf_share_layers=True))
)

    run_rllib_example_script_experiment(base_config, args)