Jeremy Bernstein* · Chris Mingard* · Kevin Huang · Navid Azizan · Yisong Yue
Install PyTorch and a GPU, and run:
python main.py
Command line arguments are:
--arch # options: fcn, vgg, resnet18, resnet50
--dataset # options: cifar10, cifar100, mnist, imagenet
--train_bs # training batch size
--test_bs # testing batch size
--epochs # number of training epochs
--depth # number of layers for fcn
--width # hidden layer width for fcn
--distribute # train over multiple gpus (for imagenet)
--gain # experimental acceleration of training
No training hyperparameters are necessary. Optionally, you can try `--gain 10.0`, which we have found can accelerate training. Chris maintains a separate repository with some more experimental features.
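For example, to train a ResNet-18 on CIFAR-10 with the optional gain, one might run (the batch sizes and epoch count here are illustrative choices, not recommended settings):
python main.py --arch resnet18 --dataset cifar10 --train_bs 128 --test_bs 128 --epochs 100 --gain 10.0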
.
├── latex/            # source code for the paper
├── supercloud/       # mit supercloud run files
├── util/
│   ├── data.py       # datasets and preprocessing
│   └── models.py     # architecture definitions
├── agd.py            # automatic gradient descent
└── main.py           # entrypoint to training
For the $k$th weight matrix $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, where $k \in \{1, \dots, L\}$:

- initial weights are drawn from the uniform measure over orthogonal matrices, and then scaled by $\sqrt{d_k / d_{k-1}}$;
- weights are updated according to:

$$W_k \gets W_k - \frac{\eta}{L} \cdot \sqrt{\tfrac{d_k}{d_{k-1}}} \cdot \frac{\nabla_{W_k} \mathcal{L}}{\Vert \nabla_{W_k} \mathcal{L} \Vert_F},$$

where the automatic learning rate $\eta$ is set via:

- $G \gets \frac{1}{L} \sum_{k\in\{1...L\}} \sqrt{\tfrac{d_k}{d_{k-1}}}\cdot \Vert\nabla_{W_k} \mathcal{L}\Vert_F$;
- $\eta \gets \log\Big( \tfrac{1+\sqrt{1+4G}}{2}\Big)$.
This procedure is slightly modified for convolutional layers.
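As a concrete illustration, here is a minimal PyTorch sketch of these two steps for a plain list of fully-connected weight matrices. The function names `agd_init_` and `agd_step` are illustrative, not the repository's API; the actual implementation, including the convolutional modification, lives in agd.py.

```python
import math
import torch


@torch.no_grad()
def agd_init_(weights):
    # Draw each W_k (shape (d_k, d_{k-1})) from the uniform measure over
    # (semi-)orthogonal matrices, then rescale by sqrt(d_k / d_{k-1}).
    for W in weights:
        torch.nn.init.orthogonal_(W)
        W.mul_(math.sqrt(W.shape[0] / W.shape[1]))


@torch.no_grad()
def agd_step(weights):
    L = len(weights)

    # G: dimension-rescaled gradient norms, averaged over the L layers.
    G = sum(math.sqrt(W.shape[0] / W.shape[1]) * W.grad.norm().item()
            for W in weights) / L

    # Automatic learning rate.
    eta = math.log((1 + math.sqrt(1 + 4 * G)) / 2)

    # Per-layer normalised gradient update.
    for W in weights:
        W.sub_((eta / L) * math.sqrt(W.shape[0] / W.shape[1])
               * W.grad / W.grad.norm())
```

As a quick sanity check on the learning rate rule: when $G = 2$, we get $\eta = \log\big(\tfrac{1+\sqrt{9}}{2}\big) = \log 2 \approx 0.69$.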
If you find AGD helpful and you'd like to cite the paper, we'd appreciate it:
@article{agd-2023,
  author  = {Jeremy Bernstein and Chris Mingard and Kevin Huang and Navid Azizan and Yisong Yue},
  title   = {{A}utomatic {G}radient {D}escent: {D}eep {L}earning without {H}yperparameters},
  journal = {arXiv:2304.05187},
  year    = 2023
}
Our paper, Automatic Gradient Descent: Deep Learning without Hyperparameters, is available at https://arxiv.org/abs/2304.05187. The derivation of AGD is a refined version of the majorise-minimise analysis given in my PhD thesis, Optimisation & Generalisation in Networks of Neurons, and was worked out in close collaboration with Chris and Kevin. In turn, this develops the perturbation analysis from our earlier paper On the Distance between two Neural Networks and the Stability of Learning, with a couple of insights from Greg Yang and Edward Hu's Feature Learning in Infinite-Width Neural Networks thrown in for good measure.
Some architecture definitions were adapted from kuangliu/pytorch-cifar.
We are making AGD available under the CC BY-NC-SA 4.0 license.