Reinforcement Learning Final Project
| | Ant-v4 | HalfCheetah-v4 |
|---|---|---|
| Action Space | Box(-1.0, 1.0, (8,), float32) | Box(-1.0, 1.0, (6,), float32) |
| Observation Space | Box(-inf, inf, (27,), float64) | Box(-inf, inf, (17,), float64) |
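These spaces can be confirmed directly from Gymnasium; a minimal check, assuming `gymnasium[mujoco]` is installed:

```python
import gymnasium as gym

# Print the action and observation spaces of both environments.
for env_id in ("Ant-v4", "HalfCheetah-v4"):
    env = gym.make(env_id)
    print(env_id, env.action_space, env.observation_space)
    env.close()
```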
- WIDER value network than policy network
- Depth of 1 or 2 hidden layers
- Activation function: Tanh (ReLU performed WORST)
- Orthogonal initialization
- Normalization on every layer of the value network
- Advantage normalization (see the sketch after this list)
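A minimal PyTorch sketch of these tricks. The hidden widths (64 for the policy, 256 for the value network), the initialization gains, and the use of LayerNorm as the per-layer normalization are illustrative assumptions, not the project's exact configuration:

```python
import torch.nn as nn

def ortho_init(layer, gain=1.0):
    # Orthogonal initialization for weights, zeros for biases.
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

def make_policy(obs_dim, act_dim, hidden=64, depth=2):
    # Narrower policy network, Tanh activations, small gain on the output layer.
    layers, d = [], obs_dim
    for _ in range(depth):
        layers += [ortho_init(nn.Linear(d, hidden), gain=2**0.5), nn.Tanh()]
        d = hidden
    layers.append(ortho_init(nn.Linear(d, act_dim), gain=0.01))
    return nn.Sequential(*layers)

def make_value(obs_dim, hidden=256, depth=2):
    # WIDER value network, with LayerNorm (assumed) after every hidden layer.
    layers, d = [], obs_dim
    for _ in range(depth):
        layers += [ortho_init(nn.Linear(d, hidden), gain=2**0.5),
                   nn.LayerNorm(hidden), nn.Tanh()]
        d = hidden
    layers.append(ortho_init(nn.Linear(d, 1)))
    return nn.Sequential(*layers)

def normalize_advantages(adv, eps=1e-8):
    # Advantage normalization: zero mean, unit variance per batch.
    return (adv - adv.mean()) / (adv.std() + eps)

# Example: Ant-v4 has 27-dim observations and 8-dim actions.
policy, value = make_policy(27, 8), make_value(27)
```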
Test Video
- Ant-v4: ant-video-episode-0.mp4
- HalfCheetah-v4: halfcheetah-video-episode-8.mp4
- Return oscillates more widely in Ant-v4
→ The larger the observation space, the wider the oscillation
- Observation normalization improved the performance of Ant-v4 but reduced the performance of HalfCheetah-v4
→ Observation normalization can reduce performance when the observation space is small
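One way to apply observation normalization per environment, assuming Gymnasium's built-in running mean/std wrapper (the environment choice here simply mirrors the finding above):

```python
import gymnasium as gym
from gymnasium.wrappers import NormalizeObservation

# Wrap Ant-v4, where normalization helped; leave HalfCheetah-v4 unwrapped.
env = NormalizeObservation(gym.make("Ant-v4"))  # running mean/std of observations
```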