Chapter 19, Reinforcement Learning, p. 691 #189

Open

Maryisme opened this issue Jul 31, 2024 · 2 comments

Comments
Maryisme commented Jul 31, 2024

I would like to get a slightly better understanding of the difference between on-policy and off-policy learning, as well as some clarification of the formulas used to apply them. In particular, I am also interested in the difference between "A" and "a" used in these formulas.

Maryisme (Author) commented:

[Screenshots attached showing the two formulas from p. 691 (left: SARSA, right: Q-Learning).]

d-kleine commented Aug 8, 2024

I am not involved with the book, but I will try to answer your questions:

  • Methods:

    • On-policy: The agent learns the value of the policy it is actually carrying out, including any exploration steps; the policy used to select actions is the same policy that is being evaluated and improved. So the same policy is used for acting and for learning (e.g. SARSA).
    • Off-policy: The agent learns the value of the optimal policy independently of the actions it actually takes; the learning is decoupled from the exploratory behavior. So different policies are used for acting and for learning (e.g. Q-Learning).
      → you can think of these as two different ways of learning a policy
  • Formulas:

    • In SARSA (left formula), the algorithm updates the action-value function based on the action actually taken, making it an on-policy method (it updates the Q-value using the action the policy actually selects at the next time step, hence the $t-1$ index in the formula).
    • In Q-Learning (right formula), the update is based on the maximum possible action value in the next state, regardless of the action actually taken, making it an off-policy method (it updates the Q-value using the best possible action value in the next state rather than the current policy's action, hence the $t$ index). Both updates are written out below.
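For reference, in standard Sutton & Barto style notation (the book's exact indexing may differ slightly) the two updates are usually written as:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] \qquad \text{(SARSA)}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right] \qquad \text{(Q-Learning)}$$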

→ SARSA uses the action actually taken ($A$, chosen by the current policy), while Q-Learning considers the best possible action ($a$) in the next state. So the capital $A$ denotes the specific action the policy actually selected, whereas the lowercase $a$ under the max ranges over every possible action, reflecting Q-Learning's off-policy nature: the update is based on the optimal future action rather than the one actually taken. Afaik the capitalization is just a notational convention (capital letters for the actions actually experienced, lowercase for a generic action from the action set); in other sources you will often see $a$ used in both the SARSA and the Q-Learning update, which can be confusing here.
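To make the difference concrete, here is a minimal tabular sketch (not from the book; the names `q_values`, `alpha`, `gamma` and the function signatures are just illustrative assumptions):

```python
from collections import defaultdict

# Tabular action-value estimates: (state, action) -> Q-value, defaulting to 0.0
q_values = defaultdict(float)
alpha, gamma = 0.1, 0.99  # learning rate and discount factor


def sarsa_update(state, action, reward, next_state, next_action):
    """On-policy: bootstrap from the action the policy actually took in next_state."""
    td_target = reward + gamma * q_values[(next_state, next_action)]
    q_values[(state, action)] += alpha * (td_target - q_values[(state, action)])


def q_learning_update(state, action, reward, next_state, action_space):
    """Off-policy: bootstrap from the best action in next_state, regardless of what the behavior policy does."""
    best_next = max(q_values[(next_state, a)] for a in action_space)
    td_target = reward + gamma * best_next
    q_values[(state, action)] += alpha * (td_target - q_values[(state, action)])
```

The only term that differs is the bootstrap target: SARSA plugs in the value of the `next_action` the policy actually selected ($A$), while Q-Learning maximizes over every action $a$ in the action space.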

https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html
