This repository accompanies the paper "Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence".
- TODO: add `minicons`, `ipykernel`, and `statsmodels` to the environment files.
All models trained in the paper are available at: https://huggingface.co/collections/CLMBR/filtered-corpus-training-6650b10280b530dba75208d3
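For example, with the `huggingface_hub` CLI installed, one of these models can be downloaded like so (the repo id below is illustrative; see the collection page for the exact model names):

```
huggingface-cli download CLMBR/some-model-name
```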
After cloning this repo, if you want to do any sort of model training, you will need to run the commands below from within the root directory of this repo. (If, however, you only want to filter corpora, i.e. use the `corpus_filtering` module and its children, there is no need to run these commands.)
```
git submodule init
git submodule update
```
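Alternatively, if you have not cloned the repo yet, you can clone it and pull the submodule in one step (the URL is shown as a placeholder):

```
git clone --recurse-submodules <repo-url>
```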
This will pull the contents of the `lm-training` submodule into the `lm-training/` folder, which is otherwise empty.
You will only need to make sure you have a recent version of Anaconda. All other requirements are listed in `environment.yml`/`gpu_environment.yml` and installed/managed by conda.
On Patas, we have already created two environments for use with this project. One is for use with GPU nodes, and the other with CPU nodes (including the head node that you would normally ssh into).
Instructions on setting up Conda on Patas can be found here. n.b.: you will have to go to the Anaconda website and find the link to the most recent version, as the link in this PDF is out of date.
After installing conda as above, you may wish to test small changes while working on your own account on the head node. To do so, you will want to first activate the CPU environment like so:
```
conda activate /projects/assigned/lm-inductive/envs/corpus-filtering
```
As always, please abide by general Patas etiquette and avoid running jobs on the head node that require non-trivial amounts of CPU or memory.
There are two ways to tell Condor to use the environment when running a job. The first works for both CPU and GPU nodes, while the second works only for CPU nodes.
- In your Condor submit file, add a line saying `getenv = False` (or edit the line if `getenv` is already there).
- Add these two lines near/at the top of the shell script (executable) that you are submitting to Condor, adjusting the first line if your conda installation is elsewhere:
For CPU nodes:

```
source ~/anaconda3/etc/profile.d/conda.sh
conda activate /projects/assigned/lm-inductive/envs/corpus-filtering
```
For GPU nodes:

```
source ~/anaconda3/etc/profile.d/conda.sh
conda activate /projects/assigned/lm-inductive/envs/gpu-corpus-filtering
```
Note that you will also have to edit your Condor submit file to request GPU nodes; for instructions on how to do that, see the document linked near the top of this README that also contains the instructions for installing conda on Patas.
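For concreteness, a minimal sketch of such a wrapper script for a CPU job might look like the following (the final command is a placeholder; substitute the job you actually want to run):

```
#!/bin/bash
# Activate the shared CPU environment; adjust the first line if your
# conda installation lives elsewhere.
source ~/anaconda3/etc/profile.d/conda.sh
conda activate /projects/assigned/lm-inductive/envs/corpus-filtering

# Placeholder job command; replace with your own.
python your_script.py "$@"
```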
n.b.: This second method only works for CPU nodes.
- While logged into your Patas account (on the head node), run `conda activate /projects/assigned/lm-inductive/envs/corpus-filtering` (unless you are already working within this environment).
- Add `getenv = True` to your Condor submit file.
- Call `condor_submit` with the submit file as per usual (a minimal submit file is sketched below).
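A minimal submit file for this second approach might look like the following (file names are illustrative):

```
executable = your_script.sh
getenv     = True
output     = job.out
error      = job.err
log        = job.log
queue
```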
- Create a fresh conda environment using `environment.yml`, if you haven't done so for this project previously: `conda env create -f environment.yml`. By default this will create a conda env whose name is indicated on the first line of the `environment.yml` file (presently, `corpus-filtering`). You can change this by adding the `-n` flag followed by the desired name of your environment.
- After the environment is created, whenever you want to work on this project, first activate the environment: `conda activate corpus-filtering`
- When you are done, you can exit the environment with `conda deactivate`.
- If you pull code from the repo and the `environment.yml` file has changed, update your environment by running the following (after activating the environment): `conda env update -f environment.yml --prune` (see the combined example below).
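Putting these steps together, a typical first-time setup and subsequent update looks like this:

```
# One-time setup (uses the env name from environment.yml, i.e. corpus-filtering)
conda env create -f environment.yml

# Each time you work on the project
conda activate corpus-filtering
# ... do your work ...
conda deactivate

# After pulling a change to environment.yml
conda activate corpus-filtering
conda env update -f environment.yml --prune
```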
For any non-trivial changes, please work on your own branch rather than on `main`, and submit a PR when you are ready to merge your changes.
If you need any new packages, install them with `conda install PACKAGE_NAME`. Then, before committing, run:

```
conda env export --from-history | grep -vE "^(name|prefix):" > environment.yml
```

(Replace `environment.yml` with `gpu_environment.yml` as appropriate.)
This makes sure the `name:` and `prefix:` lines automatically created by Conda's `export` command are not included, since these values can vary by platform/machine.
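For reference, the stripped lines look like this in the raw `conda env export` output (values shown are illustrative):

```
name: corpus-filtering
prefix: /home/your-username/anaconda3/envs/corpus-filtering
```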
Then make sure the updated `(gpu_)environment.yml` file is included with your commit. Note: if you did not install the package with `conda install`, the above command will not work properly, due to the `--from-history` flag. However, using this flag is necessary to ensure the `(gpu_)environment.yml` files are platform-agnostic. Therefore, please only install packages via `conda install` (or by manually adding requirements to the YAML files).
Optional, but recommended: before running `conda install` for new packages, run:

```
conda config --set channel_priority strict
```
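A full package-addition workflow, using a hypothetical package name, might therefore look like:

```
conda config --set channel_priority strict      # optional, but recommended
conda install some-package                      # hypothetical package name
conda env export --from-history | grep -vE "^(name|prefix):" > environment.yml
# Then commit the updated environment.yml along with your changes.
```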
Example run command (execute from the root `corpus-filtering/` directory):

```
sbatch --job-name=fict-full-lstm-1 scripts/run_with_conda.slurm gpu-lm-training python lm-training/src/train_lm.py --config-path $PWD"/config" --config-name train '+arch=lstm' '+corpus=full' 'seed=1'
```
Update `arch`, `corpus`, and `seed` as desired.
Important: When you first use a dataset with HuggingFace, by default it will cache that dataset, along with any transformations applied to it (e.g. tokenization). This process is not parallelism-safe. Therefore, if training multiple models (e.g. across an HPC cluster like Hyak), you will first want to train just one model, and only start training new models once the first model has finished the dataset-processing portion of the training script (typically, and in our case, tokenization).
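A safe launch pattern, sketched below using the example command above, is to run one job until its tokenization step completes before submitting the rest (how you check that tokenization has finished, e.g. by watching the job's logs, depends on your setup):

```
# Launch one job first so it builds the HuggingFace dataset cache.
sbatch --job-name=fict-full-lstm-1 scripts/run_with_conda.slurm gpu-lm-training \
  python lm-training/src/train_lm.py --config-path $PWD"/config" --config-name train \
  '+arch=lstm' '+corpus=full' 'seed=1'

# Once that job has finished tokenizing (check its logs), launch the rest.
for seed in 2 3; do
  sbatch --job-name="fict-full-lstm-${seed}" scripts/run_with_conda.slurm gpu-lm-training \
    python lm-training/src/train_lm.py --config-path $PWD"/config" --config-name train \
    '+arch=lstm' '+corpus=full' "seed=${seed}"
done
```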