Adversarial training is designed to make models robust to adversarially-selected inputs.
- Adversarial examples - A famous result showing that image classifiers are vulnerable to adversarial attacks, in which an image can be imperceptibly perturbed to cause the classifier to select an incorrect target class with high confidence. Training on adversarial examples makes more models robust to them, and is an example of adversarial training.
- Adversarial Examples Are Not Bugs, They Are Features - A proposed explanation for adversarial examples, that they are the result of so-called "non-robust" features. Further discussion of these results can be found here.
- Transfer of Adversarial Robustness Between Perturbation Types - There are different ways of bounding adversarial perturbations so that they remain imperceptible, and this paper studies how well adversarial training transfers between them.
- Red Teaming Language Models with Language Models - In the context of language models, an analogous method can be used to make models robust to adversarial inputs designed to cause the model to output harmful or unwanted content.
- Adversarial Training for High-Stakes Reliability - A study of adversarial training for language models using a sequence of increasingly powerful adversaries. Further discussion motivating this work can be found here. One possible extension that could help with inner alignment is relaxed adversarial training, in which the adversary's task is "relaxed" to allow "pseudo-inputs".
Red team a GPT-2 chatbot to find inputs where it generates offensive language, reproducing the experimental setup in the red teaming paper. We recommend using all models via HuggingFace Transformers in an environment with GPUs available (Google Colab provides GPUs for free).
- Choose a language model (LM) for red teaming. We recommend GPT-2 (or larger) as the LM.
- Choose a chatbot-like model to red team. We recommend using the prompt from the last page of the red teaming paper to prompt GPT-2 (or larger) to generate chatbot-like text.
- Use an offensive or toxic language detection model of your choice. We recommend Unitary’s BERT-based model or a similar toxicity classifier available on HuggingFace.
- Use the zero-shot approach described in the red teaming paper (section 3.1) to generate inputs that elicit offensive language from the chatbot language model. Look for patterns in the failed test cases, to better understand what kinds of inputs the chatbot fails on.
- (Optional) Use the few-shot, supervised learning, and reinforcement learning approaches (in that order) to generate even harder test cases for the chatbot. How do the test cases generated by each method differ from each other?
- (Optional) Reproduce various analyses in the red teaming paper, e.g., clustering the test cases, to help find common patterns in the chatbot failures.
Credit to Ethan Perez for this suggested exercise.