This contains the core reinforcement learning logic that drives the entire project. It includes scripts for orchestrating the training process, evaluating the environment, and serving models through a socket-based API.
This requires having conda (or some variant of it) installed, so that the `conda` command is available.
- Create the environment: `conda env create -p ./env -f environment.yml`
- Activate the environment: `conda activate ./env`
For CPU-only training, uncomment `cpuonly` in the conda environment file before creating the environment. By default, training uses the GPU if available.
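To confirm which device will actually be used after creating the environment, a quick check with PyTorch (assuming it is installed by the environment file) looks like this:

```python
# Quick sanity check: does this environment see a GPU? Falls back to CPU otherwise.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training will run on: {device}")
```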
This runs an agent on a simulated server for you to fight against.
- Run the evaluation script with a model: `eval --model-path <model-path-here>`
- Log in to the simulated server and play against the agent!
This serves models from the `models` directory via a socket-based API for fast predictions.
- Start the API: `serve-api`
- Connect using a client (example: `PvpClient`). By default, it only accepts connections on `127.0.0.1`, configurable with `--host`.
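The wire format is whatever `serve-api` and `PvpClient` agree on; purely as an illustration, a minimal client that assumes a newline-delimited JSON protocol on a hypothetical port 9999 could look like this:

```python
# Hypothetical client sketch; the real protocol and port are defined by serve-api/PvpClient.
import json
import socket

HOST, PORT = "127.0.0.1", 9999  # assumed port; use whatever serve-api actually listens on

with socket.create_connection((HOST, PORT)) as sock:
    # Send an observation and read back a predicted action (newline-delimited JSON assumed).
    request = {"obs": [0.0] * 10}
    sock.sendall((json.dumps(request) + "\n").encode("utf-8"))
    response = sock.makefile().readline()
    print(json.loads(response))
```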
- Configure the job in `./config`, or use an existing config such as `PastSelfPlay`.
- Start the job: `train --preset PastSelfPlay --name <name-your-experiment>`
- Stop the job: `train cleanup --name <your-experiment-name>`, or `train cleanup --name all` to terminate all jobs.

Note: Training logs are stored in `./logs`, and experiment data, including model versions, is stored in `./experiments`.
- Tensorboard automatically launches with training jobs, or run `train tensorboard` to start it manually. Access it at http://127.0.0.1:6006/.
- Tensorboard logs are stored in `./tensorboard` under the experiment name.
- Generalized PvP environment setup.
- Model evaluation support.
- Model serving through a socket-based API.
- Distributed rollout collection.
- Parameterized and masked actions, including autoregressive actions (with normalization).
- TorchScript-compatible models for efficient evaluation.
- Self-play strategies, including prioritized past-self play (based on the OpenAI Five paper; see the sampling sketch after this list).
- Adversarial training (based on DeepMind's SC2 paper).
- Reward normalization and observation normalization.
- Novelty rewards.
- Distributed model processing via various `RemoteProcessor` implementations.
- Noise generation.
- Flexible parameter annealing through comprehensive scheduling.
- Asynchronous training job management.
- Comprehensive metric recording (Tensorboard).
- Scripted plugins for evaluation and API.
- PPO implementation.
- Async vectorized environment.
- Customizable model architectures.
- Gradient accumulation.
- Detailed configuration via YAML.
- PvP Environment implementation with configurable rewards.
- Full game state visibility for the critic.
- Frame stacking.
- Comprehensive callback system.
- Environment randomization for generalization.
- Elo-based ranking and rating generation for benchmarking.
- Supplementary model for episode outcome prediction.
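As one example from the list above, prioritized past-self play (in the spirit of the OpenAI Five paper) samples older checkpoints as opponents in proportion to a quality score that decays whenever the current agent beats them. The sketch below is illustrative only; the class name, scores, and update rule are assumptions, not this repo's implementation.

```python
# Illustrative sketch of prioritized past-self play; not this repo's actual code.
import random


class OpponentPool:
    """Pool of past policy snapshots, sampled in proportion to a quality score."""

    def __init__(self):
        self.snapshots = []  # past model checkpoints (paths, state dicts, ...)
        self.qualities = []  # one positive score per snapshot

    def add(self, snapshot, quality=1.0):
        self.snapshots.append(snapshot)
        self.qualities.append(quality)

    def sample(self):
        # Strong past selves (high quality) are chosen more often as opponents.
        index = random.choices(range(len(self.snapshots)), weights=self.qualities, k=1)[0]
        return index, self.snapshots[index]

    def report_result(self, index, current_agent_won, decay=0.9):
        # Decay the score of snapshots the current agent already beats,
        # shifting training time toward opponents that are still challenging.
        if current_agent_won:
            self.qualities[index] *= decay


# Usage: add checkpoints during training and sample an opponent per episode.
pool = OpponentPool()
pool.add("checkpoint_000")
pool.add("checkpoint_100")
idx, opponent = pool.sample()
pool.report_result(idx, current_agent_won=True)
```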
- Supports Ray for distributed rollouts on a cluster or across multiple CPU cores.
- Train with distribution: `train --preset <preset> --distribute <parallel-rollout-count>`. Omit `<parallel-rollout-count>` to use all available CPU cores.
- Scale up a cluster: `ray up cluster.yml`
- Scale down a cluster: `ray down cluster.yml`
- View the cluster: `ray attach cluster.yml --port-forward=8265` to open the dashboard.
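The general shape of distributed rollout collection with Ray looks like the sketch below. It uses a stand-in Gymnasium environment (`CartPole-v1`) and a random policy purely for illustration; the project's own environment and worker classes differ.

```python
# Minimal sketch of distributed rollout collection with Ray.
# CartPole-v1 and the random policy stand in for the project's real environment and model.
import gymnasium as gym
import ray

ray.init()  # or ray.init(address="auto") when attached to a cluster


@ray.remote
class RolloutWorker:
    """Collects environment transitions in its own process."""

    def __init__(self, env_id: str):
        self.env = gym.make(env_id)

    def collect(self, num_steps: int):
        obs, _ = self.env.reset()
        transitions = []
        for _ in range(num_steps):
            action = self.env.action_space.sample()  # a trained policy would act here
            obs, reward, terminated, truncated, _ = self.env.step(action)
            transitions.append((obs, reward))
            if terminated or truncated:
                obs, _ = self.env.reset()
        return transitions


# Spin up several workers and gather their rollouts in parallel.
workers = [RolloutWorker.remote("CartPole-v1") for _ in range(4)]
rollouts = ray.get([worker.collect.remote(128) for worker in workers])
print(sum(len(r) for r in rollouts), "transitions collected")
```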
- Focuses on 1v1 NH fights.
- MultiDiscrete action space with 11 action heads (see the sketch below).
- Extensive observation space; see the environment contract for details.
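To make the action space concrete, here is a sketch of an 11-head MultiDiscrete space using Gymnasium. The head sizes and the mask are invented for illustration; the real definitions live in the environment contract.

```python
# Illustrative only: an 11-head MultiDiscrete action space with invented head sizes.
import numpy as np
from gymnasium.spaces import MultiDiscrete

action_space = MultiDiscrete([4, 3, 5, 2, 2, 3, 2, 2, 4, 2, 9])  # hypothetical sizes
assert len(action_space.nvec) == 11

print(action_space.sample())  # one discrete choice per head

# Masking sketch for a single head: zero out invalid options, renormalize, sample.
mask = np.array([1.0, 0.0, 1.0, 1.0])  # hypothetical: option 1 is currently invalid
probs = mask / mask.sum()
head_0_choice = np.random.choice(len(mask), p=probs)
print(head_0_choice)
```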
- Available in `models`.
- Trained for PvP Arena/LMS for various builds and gear setups.
- Includes `GeneralizedNh` (self-play) and `FineTunedNh` (`GeneralizedNh` fine-tuned against human approximations).
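Because the shipped models are TorchScript-compatible, they can be loaded for inference outside the training stack with `torch.jit.load`. The file name and observation shape below are placeholders; check the model's actual contract before use.

```python
# Load a TorchScript model for inference; file name and input shape are placeholders.
import torch

model = torch.jit.load("models/GeneralizedNh.pt", map_location="cpu")
model.eval()

with torch.no_grad():
    dummy_obs = torch.zeros(1, 128)  # placeholder observation shape
    output = model(dummy_obs)
print(output)
```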
- Investigate bootstrapping from human replays for improved human-like behavior.
- Consider blending behavior cloning with self-play.
- Experiment with LSTM or transformer architectures for episode recall and strategy adaptation.
Note: Some experimentation was done with transformers (with frame stacking), but simple feed-forward networks learned faster and outperformed the more complex networks.
- Explore rollouts on the live game for enhanced realism and human player adaptation.
These are some resources that helped the most when working on this project.