Reinforcement Learning Bots #89

Open
garbear opened this issue Dec 7, 2017 · 2 comments

garbear commented Dec 7, 2017

Introduction

Here, I introduce the Emulation Equation, a game-theoretic equation for emulation.

The equation represents all emulation, enables powerful features, and turns all gameplay (human or otherwise) into training data for an artificially intelligent game-playing agent.

For demonstration purposes, I've split the explanation into two theories:

  • The Special Theory of Emulation explains the fundamentals, and presents a simplified emulation equation. Using this, all emulation and many powerful features are possible.

  • The General Theory of Emulation uses the same fundamentals, and presents an extended emulation equation. Using this, all gameplay (human or otherwise) becomes data for training a reinforcement learner.

For the physics nerds, this is analogous to how Einstein introduced relativity:

  • In 1905, the Special Theory of Relativity explained moving bodies without gravity
  • In 1915, the General Theory of Relativity explained moving bodies in the presence of gravity

Background

Reinforcement learning is popular in game-playing AI because the reward signal is often sparse (not evident every frame) and depends on actions taken much earlier in the game.

The Q-learning algorithm is well-suited for teaching reinforcement learners how to play a video game because it does not require a model of the environment, which would be difficult to create for even the most basic computing machine.

Recall the algorithm from Q-learning:

[Image: the Q-learning algorithm]

Understanding it in depth is beyond the scope of this documentation. Just know that it learns over "emulation equations", which describe a series of frames in the discrete time domain.
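
For reference, the heart of that algorithm is the standard one-step Q-learning update:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```

where alpha is the learning rate and gamma is the discount factor. The emulation equations below define exactly the S, A, and R time series that this update consumes.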

Here I present two theories of emulation. The Special Theory of Emulation is the smallest equation needed for emulation. The General Theory of Emulation expands on this equation, allowing it to be used for Q-learning.

Special Theory of Emulation

The Special Theory of Emulation presents the smallest equation (the emulation equation) needed to represent all emulation.

Emulation variables

Game-theoretic emulation uses two variables:

State: S

  • State consists of video, audio, and memory regions (RAM, SRAM, Real-time clock).

Action: A

  • Action is the combined state of all input devices.
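
As a minimal sketch of how these two variables might be carried around in code, assuming nothing about any particular emulator API (the field names below are illustrative only):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class State:
    """S: what the emulator exposes for one frame."""
    video: bytes = b""                    # framebuffer contents
    audio: bytes = b""                    # audio samples for the frame
    memory: Dict[str, bytes] = field(default_factory=dict)  # "RAM", "SRAM", "RTC", ...

@dataclass
class Action:
    """A: the combined state of all input devices."""
    digital: Dict[str, bool] = field(default_factory=dict)  # button name -> pressed?
    analog: Dict[str, float] = field(default_factory=dict)  # axis name -> position
```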

Time series

Emulation occurs at discrete time steps, so every time step has its own instance of these variables:

S == S_t
A == A_t

The emulation history is therefore a time series of tuples containing these two variables:

(S_0, A_0, S_1, A_1, ...)

Emulation model

Time steps occur by applying a set of functions to the emulation variables:

PlayFrame()

  • The PlayFrame() function takes the previous state, along with the most recent action, and produces a new state

GetInput()

  • The GetInput() function takes the previous action, along with the most recent state, and produces a new action
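
In code, the special theory's model is just two callables of the following shapes. This is a sketch reusing the illustrative State and Action classes above, not a binding to any real emulator API:

```python
from typing import Callable

# S_{t+1} = PlayFrame(S_t, A_t): the emulator core advances by one frame.
PlayFrameFn = Callable[[State, Action], State]

# A_{t+1} = GetInput(S_{t+1}, A_t): the player (human or agent) reacts to the new state.
GetInputFn = Callable[[State, Action], Action]
```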

Emulation equation

The emulation equation is a time series model consisting of the initial conditions, as well as the model used for each time step.

The initial condition of all variables is the empty set:

S_0 = ∅
A_0 = ∅

The variables then evolve by applying the functions in sequence:

S_t+1 = PlayFrame(S_t, A_t)
A_t+1 = GetInput(S_t+1, A_t)
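
Written out as a loop, the equation might look like the following sketch, built on the illustrative State and Action classes above; play_frame and get_input stand in for the emulator core and the input source:

```python
def run_emulation(play_frame, get_input, num_frames):
    """Iterate the special-theory emulation equation and return the history."""
    S, A = State(), Action()            # S_0 and A_0 start "empty"
    history = [(S, A)]
    for _ in range(num_frames):
        S = play_frame(S, A)            # S_{t+1} = PlayFrame(S_t, A_t)
        A = get_input(S, A)             # A_{t+1} = GetInput(S_{t+1}, A_t)
        history.append((S, A))
    return history                      # (S_0, A_0), (S_1, A_1), ...
```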

Summary

The emulation equation describes something fundamental in every emulator: play a frame, get input, repeat.

Interestingly, this fundamental concept was not assumed a priori. It emerged as a model while working out the algorithm.

Surprising facts also appear. A_0 is empty: the first frame is played with all buttons unpressed. On deeper inspection, this is because Q-learners get no value from an Action without a prior State observation.

Next, we present the General Theory of Emulation, which expands on these fundamentals to include two new concepts needed in the Q-learning algorithm.

The General Theory of Emulation

The general theory of emulation extends the emulation equation so that it can be used for Q-learning.

Note: I also wanted to choose strategies for my Q-learners, such as "walk up" or "reach level 2". I extended Q-learning to depend on a Policy variable in the time series. When the strategy is the identity function (no strategy), this extended learning algorithm reduces to plain Q-learning.

Emulation variables

Reinforcement learning adds two variables:

Reward: R

  • Reward is used to train the function approximator that infers an action from the observed state. The reward can come from sniffing RAM (for example, the achievements at http://retroachievements.org) or from reading a value out of video memory with OCR.

Policy: pi

  • Policy is used by the agent to choose its next move. The goal of reinforcement learning is to find the policy that maximizes the expected cumulative reward.
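
Stated precisely, the learner seeks the policy with the greatest expected discounted return:

```latex
\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \right]
```

with the same discount factor gamma as in the Q-learning update above.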

Time series

Emulation occurs at discrete time steps, so every time step has its own instance of these variables:

S == S_t
R == R_t
pi == pi_t
A == A_t

The emulation history is therefore a time series of tuples containing these four variables:

(S_0, R_0, pi_0, A_0, S_1, R_1, pi_1, A_1, ...)

Emulation model

Reinforcement learning also needs two more functions:

GetReward()

  • The GetReward() function takes the previous reward, along with the most recent values of the other variables, and produces a new reward

Strategize()

  • The Strategize() function takes the previous policy, along with the most recent values of the other variables, and produces a new policy
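
A sketch of what these two functions could look like in practice, reusing the illustrative State class above. The RAM address, the idea of reading a score, and the epsilon value are placeholders rather than details from any real game or library:

```python
import random

SCORE_ADDRESS = 0x0042   # hypothetical location of a score byte in RAM

def get_reward(S, R_prev, pi_prev, A_prev):
    """R_{t+1} = GetReward(S_{t+1}, R_t, pi_t, A_t): sniff a reward out of RAM."""
    ram = S.memory.get("RAM", b"")
    score = ram[SCORE_ADDRESS] if len(ram) > SCORE_ADDRESS else 0
    return float(score)   # a real signal might instead be the change in score

def strategize(S, R, pi_prev, A_prev, epsilon=0.1):
    """pi_{t+1} = Strategize(S_{t+1}, R_{t+1}, pi_t, A_t): here, an epsilon-greedy policy."""
    def policy(q_values):
        # q_values: mapping from candidate action (e.g. a button name) to its estimated value
        if random.random() < epsilon:
            return random.choice(list(q_values))   # explore
        return max(q_values, key=q_values.get)     # exploit
    return policy
```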

Emulation equation

The emulation equation is a time series model consisting of the initial conditions, as well as the model used for each time step.

The initial condition of all variables is the empty set:

S_0 = ∅
R_0 = ∅
pi_0 = ∅
A_0 = ∅

The variables then evolve by applying the functions in sequence:

S_t+1 = PlayFrame(S_t, R_t, pi_t, A_t)
R_t+1 = GetReward(S_t+1, R_t, pi_t, A_t)
pi_t+1 = Strategize(S_t+1, R_t+1, pi_t, A_t)
A_t+1 = GetInput(S_t+1, R_t+1, pi_t+1, A_t)
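
Tying the four functions together, the extended loop has the same shape as the special-theory loop, and every step yields exactly the (S_t, A_t, R_{t+1}, S_{t+1}) transition that the Q-learning update in the Background section consumes. Again, this is a sketch built on the placeholder classes above:

```python
def run_rl_emulation(play_frame, get_reward, strategize, get_input, num_frames):
    """Iterate the general-theory emulation equation and collect Q-learning transitions."""
    S, R, pi, A = State(), 0.0, None, Action()          # S_0, R_0, pi_0, A_0 start "empty"
    transitions = []                                    # (S_t, A_t, R_{t+1}, S_{t+1}) tuples
    for _ in range(num_frames):
        S_next = play_frame(S, R, pi, A)                # S_{t+1} = PlayFrame(S_t, R_t, pi_t, A_t)
        R_next = get_reward(S_next, R, pi, A)           # R_{t+1} = GetReward(S_{t+1}, R_t, pi_t, A_t)
        pi_next = strategize(S_next, R_next, pi, A)     # pi_{t+1} = Strategize(S_{t+1}, R_{t+1}, pi_t, A_t)
        A_next = get_input(S_next, R_next, pi_next, A)  # A_{t+1} = GetInput(S_{t+1}, R_{t+1}, pi_{t+1}, A_t)
        transitions.append((S, A, R_next, S_next))
        S, R, pi, A = S_next, R_next, pi_next, A_next
    return transitions
```

Each tuple in transitions is one training sample for the Q-learning update, whether it was generated by a human player or by the agent itself.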

garbear added this to the M 19.0 milestone Dec 7, 2017

LipkeGu commented Dec 30, 2017

As far as I understand, you mean to develop a "virtual" player which is available in each game / ROM?


garbear commented Dec 30, 2017

Right. So far the math in this issue just describes the data we need to gather to make this happen. Then it can be uploaded to the cloud for training, and depending on the state of embedded TensorFlow, inference can be run locally, or in the cloud if we get netplay support.
