SAC jax #300

araffin · 2022-10-23T10:51:39Z

Description

Missing: benchmark and doc

Adapted from https://github.com/araffin/sbx
Report (3 seeds on 3 MuJoCo envs): https://wandb.ai/openrlbenchmark/cleanrl/reports/SAC-jax---VmlldzoyODM4MjU0

Types of changes

Bug fix
New feature
New algorithm
Documentation

Checklist:

I've read the CONTRIBUTION guide (required).
I have ensured pre-commit run --all-files passes (required).
I have updated the documentation and previewed the changes via mkdocs serve.
I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

vercel · 2022-10-23T10:51:44Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
cleanrl	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 15, 2023 6:05pm

araffin · 2022-10-23T12:15:08Z

@vwxyzjn the tests fail with ModuleNotFoundError: No module named 'pygame', not sure why it worked before...

vwxyzjn · 2022-10-24T00:23:53Z

ModuleNotFoundError: No module named 'pygame' looks really weird... so I investigated a bit further. Instead of running poetry lock, I ran poetry add tensorflow-probability and poetry update flax, and that seems to make things work.

It turns out the culprit is the following change in the lock file:

-classic_control = ["pygame (==2.1.0)"]
+classic-control = ["pygame (==2.1.0)"]

We install pygame via pip install gym[classic_control] under the hood with poetry, but for some reason the key of the extra was changed 😓
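A possible explanation, sketched below under the assumption that the tooling applies PEP 685 style normalization to extra names: runs of `-`, `_` and `.` collapse to a single `-`, so both spellings denote the same extra, while a literal string comparison treats them as different keys. The helper name `normalize_extra` is hypothetical, and this thread does not confirm that Poetry does exactly this.

```python
import re


def normalize_extra(name: str) -> str:
    """Normalize an extra name as described in PEP 685 (hypothetical helper)."""
    return re.sub(r"[-_.]+", "-", name).lower()


old_key = "classic_control"
new_key = "classic-control"

# The raw strings differ, which is enough to break a literal lookup...
raw_match = old_key == new_key
# ...but after normalization both names refer to the same extra.
normalized_match = normalize_extra(old_key) == normalize_extra(new_key)

print(raw_match, normalized_match)
```

If this is the cause, any lookup that uses the un-normalized key would miss the pygame extra even though the dependency is still declared.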

araffin · 2022-10-24T14:03:05Z

@vwxyzjn I think I'm done with the implementation. I added support for a constant entropy coefficient and for deterministic evaluation.
I would be happy to receive help with the documentation ;)

vwxyzjn · 2022-11-21T01:39:50Z

Perhaps it's because I implemented my own normal distribution in #217, and I am trying to do the same for SAC...

However, if I replace

def actor_loss(params):
    dist = TanhTransformedDistribution(
        tfd.MultivariateNormalDiag(loc=action_mean, scale_diag=jnp.exp(action_logstd)),
    )
    actor_actions = dist.sample(seed=subkey)
    log_prob = dist.log_prob(actor_actions).reshape(-1, 1)

with the log probability taken from https://github.com/openai/baselines/blob/9b68103b737ac46bc201dfb3121cfa5df2127e53/baselines/common/distributions.py#L238-L241

def actor_loss(params):
    action_mean, action_logstd = actor.apply(params, observations[0:1])
    action_std = jnp.exp(action_logstd)
    actor_actions = action_mean + action_std * jax.random.normal(subkey, shape=action_mean.shape)
    log_prob = -0.5 * ((actor_actions - action_mean) / action_std) ** 2 - 0.5 * jnp.log(2.0 * jnp.pi) - action_logstd
    log_prob = log_prob.sum(axis=1, keepdims=True)
    actor_actions = jnp.tanh(actor_actions)

things fail catastrophically... I felt that implementing our own distribution would bring greater transparency, but maybe it isn't necessary...

vwxyzjn · 2022-11-21T02:36:14Z

Aha! I got it, it's supposed to be the following

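    # Gaussian parameters for the whole batch
    action_mean, action_logstd = actor.apply(params, observations)
    action_std = jnp.exp(action_logstd)
    # Reparameterized sample before squashing
    actor_actions = action_mean + action_std * jax.random.normal(subkey, shape=action_mean.shape)
    # Per-dimension Gaussian log density of the pre-tanh sample
    log_prob = -0.5 * ((actor_actions - action_mean) / action_std) ** 2 - 0.5 * jnp.log(2.0 * jnp.pi) - action_logstd
    # Squash to (-1, 1), then apply the per-dimension tanh change-of-variables correction
    actor_actions = jnp.tanh(actor_actions)
    log_prob -= jnp.log((1 - jnp.power(actor_actions, 2)) + 1e-6)
    # Sum over action dimensions: one log probability per sample
    log_prob = log_prob.sum(axis=1, keepdims=True)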

Interestingly, the paper seems to say our implementation should have been the following (with the summation)

log_prob -= jnp.log((1 - jnp.power(actor_actions, 2)) + 1e-6).sum(axis=-1).reshape(-1, 1)

but empirically, it doesn't perform as well... @dosssman any thoughts?
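One reading of the difference, sketched in NumPy rather than JAX and with made-up shapes and random inputs: if the Gaussian term is summed over action dimensions before the summed correction is subtracted, the two forms agree. If the summed correction (shape `(batch, 1)`) is instead subtracted from the still per-dimension Gaussian term and the result is summed afterwards, broadcasting applies the correction once per action dimension. Whether that is what happened in the experiment above is an assumption; the thread does not show where the line was placed.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, act_dim = 4, 3  # illustrative sizes, not taken from the PR

mean = rng.normal(size=(batch, act_dim))
log_std = rng.normal(scale=0.1, size=(batch, act_dim))
std = np.exp(log_std)
pre_tanh = mean + std * rng.normal(size=(batch, act_dim))
actions = np.tanh(pre_tanh)

# Per-dimension Gaussian log density and tanh correction, both (batch, act_dim)
gauss = -0.5 * ((pre_tanh - mean) / std) ** 2 - 0.5 * np.log(2.0 * np.pi) - log_std
corr = np.log(1.0 - actions ** 2 + 1e-6)

# A: subtract per dimension, then sum (the version that works in the thread)
lp_a = (gauss - corr).sum(axis=1, keepdims=True)
# B: sum both terms first, then subtract (the "paper" form)
lp_b = gauss.sum(axis=1, keepdims=True) - corr.sum(axis=-1).reshape(-1, 1)
# C: summed correction broadcast against the unsummed Gaussian term, then summed
lp_c = (gauss - corr.sum(axis=-1).reshape(-1, 1)).sum(axis=1, keepdims=True)

print(np.allclose(lp_a, lp_b), np.allclose(lp_a, lp_c))
```

In variant C the correction ends up counted `act_dim` times, which changes the entropy term the actor is optimizing.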

araffin · 2022-11-21T10:03:35Z

> Interestingly, the paper seems to say our implementation should have been the following (with the summation)

I'm not sure I follow the difference...

You can take a look at how we do it in SB3, I think it is what is described:
https://github.com/DLR-RM/stable-baselines3/blob/c4f54fcf047d7bf425fb6b88a3c8ed23fe375f9b/stable_baselines3/common/distributions.py#L222-L226

vwxyzjn · 2022-11-22T21:58:12Z

I tried to implement the probability distribution ourselves (0cf0e9e), but hit a performance regression.

Looking into the issue deeper, I couldn't quite understand how TanhTransformedDistribution works. Could someone take a look at https://gist.github.com/vwxyzjn/331f896b79d3f829fdfa575be666d2d8, which generates

manually sample actions, manually calculate log prob
  action=2.561650514602661, logprob=55.152984619140625
manually sample actions, calculate log prob from TanhTransformedDistribution
  action=2.561650514602661, logprob=nan
sample actions from `TanhTransformedDistribution`, calculate log prob from TanhTransformedDistribution
  action=2.7475833892822266, logprob=66.45195770263672
sample actions from `TanhTransformedDistribution`, manually calculate log prob
  action=2.7475833892822266, logprob=-inf

I am quite puzzled. TanhTransformedDistribution seems like quite a black box to me. Because tensorflow_probability is written in TensorFlow, there is no meaningful code trace in the IDE to understand what's happening inside... And tfp's docs seem to have some issues (e.g., the "view source code on Github" button in https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MultivariateNormalDiag is broken). Maybe we shouldn't use anything from tfp?

araffin · 2022-11-24T10:56:50Z

@vwxyzjn run the code with JAX_ENABLE_X64=True and it will solve your issue ;) (results are still slightly different, but that's probably expected, try with different random seeds)
JAX_DISABLE_JIT=1 already partially solves the issue.

I guess the answer to your question is called numerical precision ;).
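To illustrate the precision argument, here is a minimal NumPy sketch (not JAX, and not a reproduction of the gist). It assumes that computing a log probability for an already squashed action requires inverting the tanh, and compares that round trip in single and double precision. The function name `tanh_roundtrip` and the input value 12.0 are chosen for illustration only.

```python
import numpy as np


def tanh_roundtrip(x, dtype):
    """Squash x with tanh in the given dtype, then invert it with arctanh."""
    x = np.asarray(x, dtype=dtype)
    y = np.tanh(x)
    # arctanh diverges as y approaches 1, so silence the floating-point warnings
    with np.errstate(divide="ignore", invalid="ignore"):
        x_back = np.arctanh(y)
    return float(y), float(x_back)


y32, back32 = tanh_roundtrip(12.0, np.float32)
y64, back64 = tanh_roundtrip(12.0, np.float64)

print("float32:", y32, back32)
print("float64:", y64, back64)
```

In single precision the squashed value typically rounds to exactly 1.0, so the inverse is no longer recoverable, while double precision still returns a value close to the original input.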

EDIT: the code from tf distribution is here: https://github.com/tensorflow/probability/blob/bcdf53024ef9f35d81be063093ccfb3a762dab3f/tensorflow_probability/python/bijectors/tanh.py#L70-L81

  # We implicitly rely on _forward_log_det_jacobian rather than explicitly
  # implement _inverse_log_det_jacobian since directly using
  # `-tf.math.log1p(-tf.square(y))` has lower numerical precision.

  def _forward_log_det_jacobian(self, x):
    #  This formula is mathematically equivalent to
    #  `tf.log1p(-tf.square(tf.tanh(x)))`, however this code is more numerically
    #  stable.
    #  Derivation:
    #    log(1 - tanh(x)^2)
    #    = log(sech(x)^2)
    #    = 2 * log(sech(x))
    #    = 2 * log(2e^-x / (e^-2x + 1))
    #    = 2 * (log(2) - x - log(e^-2x + 1))
    #    = 2 * (log(2) - x - softplus(-2x))
    return 2. * (np.log(2.) - x - tf.math.softplus(-2. * x))
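As a rough check of the derivation quoted above, the following NumPy sketch compares the manual `log(1 - tanh(x)^2 + 1e-6)` form with the softplus form in float32, using a float64 evaluation of the softplus form as the baseline. The function names and test inputs are illustrative assumptions, not code from tfp or from this PR.

```python
import numpy as np


def log_det_naive(x, eps=1e-6):
    """log(1 - tanh(x)^2 + eps), the form used in the manual log-prob."""
    return np.log(1.0 - np.tanh(x) ** 2 + eps)


def log_det_stable(x):
    """2 * (log(2) - x - softplus(-2x)), following the quoted derivation."""
    two = x.dtype.type(2.0)
    zero = x.dtype.type(0.0)
    # logaddexp(0, z) is log(1 + exp(z)) evaluated without overflow
    softplus = np.logaddexp(zero, -two * x)
    return two * (np.log(two) - x - softplus)


x32 = np.array([0.5, 3.0, 10.0], dtype=np.float32)
reference = log_det_stable(x32.astype(np.float64))  # float64 baseline

naive_err = np.abs(log_det_naive(x32) - reference)
stable_err = np.abs(log_det_stable(x32) - reference)

print("naive error: ", naive_err)
print("stable error:", stable_err)
```

For large pre-tanh values the naive form is capped near log(1e-6) by the epsilon term, which may be part of why removing the `+ 1e-6` changes the results discussed in this thread.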

araffin · 2022-11-28T15:08:07Z

> run the code with JAX_ENABLE_X64=True and it will solve your issue ;) (results are still slightly different, but that's probably expected, try with different random seeds)

@vwxyzjn as a follow-up, if you remove the + 1e-6 in your code, you get the same results. Btw, why did you use 1e-6 and not a smaller value?

EDIT: I don't know why pre-commit fails, it works locally

Howuhh · 2022-11-28T15:18:10Z

@araffin 1e-6 is used in most popular SAC PyTorch implementations; I also use it in my research for some reason (and in CORL). I think it's more a matter of reproducibility.

ffelten · 2023-04-24T12:38:23Z

Hi, is there any update on this, or anything blocking it?

araffin · 2023-06-15T18:19:31Z

@vwxyzjn I would need your help again to update the lockfile. I tried to do it locally and poetry destroyed my conda env...

araffin added 7 commits October 23, 2022 11:19

Remove unused params

08cdd3c

Add SAC Jax version

9c8e642

Upgrade SB3

9c5da79

Revert SB3 upgrade (gym incompatible)

d0bf6a8

Add progress bar

2ace8b5

Fix import

1d7e36d

Remove unused code

799eadb

vercel bot deployed to Preview October 23, 2022 10:52 View deployment

araffin added 3 commits October 23, 2022 13:14

Update requirements

61be9db

Update lock file

8fce3a8

Add test for jax

631d90a

vercel bot deployed to Preview October 23, 2022 11:30 View deployment

Display FPS only when needed

b2481a4

vercel bot deployed to Preview October 23, 2022 15:17 View deployment

update lock files

1048689

vercel bot deployed to Preview October 24, 2022 00:14 View deployment

fix test cases, use the same naming convention

39a20d0

vercel bot deployed to Preview October 24, 2022 00:20 View deployment

Add constant ent coef support and improve types

ecd66c8

vercel bot deployed to Preview October 24, 2022 13:07 View deployment

Add deterministic evaluation

e6958a2

vercel bot deployed to Preview October 24, 2022 13:38 View deployment

Use deterministic eval

101089c

vercel bot deployed to Preview October 24, 2022 13:50 View deployment

Rescale actions

fe2295f

vercel bot deployed to Preview October 24, 2022 14:08 View deployment

update reference

668ea1d

vercel bot deployed to Preview November 21, 2022 00:45 View deployment

remove tensorflow_probability

0cf0e9e

vercel bot deployed to Preview November 21, 2022 03:04 View deployment

add benchmark script

9082f57

vercel bot deployed to Preview November 21, 2022 03:05 View deployment

properly implement log probability

9f2fd81

vercel bot deployed to Preview November 22, 2022 16:33 View deployment

add unit test and fix log prob calc

be38473

vercel bot deployed to Preview November 22, 2022 19:16 View deployment

Howuhh reviewed Nov 24, 2022

cleanrl/sac_continuous_action_jax.py (outdated, resolved)

Fix critic loss

15c30c8

vercel bot deployed to Preview November 28, 2022 14:58 View deployment

vwxyzjn mentioned this pull request Dec 4, 2022

nan in MultivariateNormalDiag log prob google-deepmind/distrax#216

Open

Merge branch 'master' into feat/sac-jax

623da0f

vercel bot deployed to Preview June 15, 2023 17:17 View deployment

Revert poetry lock to master

f0ee601

vercel bot deployed to Preview June 15, 2023 17:34 View deployment

Re-add tf proba

bbec22d

vercel bot deployed to Preview June 15, 2023 18:05 View deployment

asmith26 mentioned this pull request Oct 8, 2023

Example using vmap/pmap from Jax? keras-team/keras#18570

Closed
