Satyaog/feature/covalent #217

satyaog · 2024-05-22T19:38:46Z

`milabench cloud --setup`

Creates a system config file for the target cloud platform, which is selected with `--run-on`.

This starts a local covalent server, which manages the Python code that will be executed on the remote. For now this is only marginally useful: milabench mostly uses ssh commands anyway, and refactoring the pipeline to run code through the covalent interface would take some time. I think it could be an interesting approach, but it is a nice-to-have for now.

So `milabench cloud --setup` sets up the remote and installs the basics on it: the correct Python version (needed for reliable serialization/deserialization of Python objects between the local and remote machines), pip and venv. venv is used to separate the covalent env from the milabench env, whose package version requirements are incompatible (sqlalchemy caused problems). Once this is done, the covalent server is no longer needed.
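As a rough illustration of the Python version requirement, here is a minimal sketch of a check that two interpreters share the same major.minor version before objects are serialized between them. The function names and the major.minor rule are assumptions made for this example; how milabench or covalent actually enforce this is not shown in this PR.

```python
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as 'Python 3.10.12' or '3.10.12'."""
    number = text.strip().split()[-1]
    major, minor = number.split(".")[:2]
    return int(major), int(minor)


def same_python(local: str, remote: str) -> bool:
    """Return True when both version strings share the same major.minor."""
    return parse_version(local) == parse_version(remote)


# Version string of the interpreter running this script.
local_version = "Python {}.{}.{}".format(*sys.version_info[:3])

# Hypothetical value: in practice this would come from the remote machine,
# for example from the output of `python3 --version` run over ssh.
remote_version = "Python 3.10.4"

if not same_python(local_version, remote_version):
    print(f"version mismatch: local {local_version}, remote {remote_version}")
```

The patch version is ignored on purpose here, on the assumption that only minor releases change the interpreter enough to matter for serialized objects.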

The system config file should then be used in the install, prepare and run commands. Each of those commands creates a new standalone config for the tests that will be executed and copies it to the remote before the rest of the pipeline runs.

At the end of the run command, the results are copied to the local machine so that a report can be generated.

At the very end, `milabench cloud --teardown` should be used to release the cloud resources. The `--all` argument releases all resources of the target cloud platform specified with `--run-on`.

Check docs/usage.rst for more info

`milabench` with slurm

`milabench cloud --setup` also works with a slurm system configuration, but `milabench cloud --teardown` does not support the `--all` argument in that case.

Check docs/usage.rst for more info

`milabench report --push`

Pushes the results to a reports branch, which also stores the status SVG and summary.

Example reports: #210
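The commit notes in this PR mention using the branch and commit name to version the report directories. A minimal sketch of what such a naming scheme could look like follows; the `reports` root, the separator and the 7-character short hash are assumptions for illustration, not the layout milabench actually uses.

```python
import re
from pathlib import PurePosixPath


def report_dir(branch: str, commit: str, root: str = "reports") -> PurePosixPath:
    """Build a per-branch, per-commit directory name (hypothetical layout)."""
    # Branch names may contain '/', which would create nested directories,
    # so every run of unsafe characters is collapsed into one underscore.
    safe_branch = re.sub(r"[^A-Za-z0-9._-]+", "_", branch).strip("_")
    # Use the short commit hash to keep the directory name readable.
    return PurePosixPath(root) / f"{safe_branch}_{commit[:7]}"


print(report_dir("satyaog/feature/covalent", "f75e3a5"))
```

Keying the directory on both values means that reruns of the same commit overwrite each other, while different commits on the same branch stay side by side.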

covalent is not compatible with milabench as it requires sqlalchemy<2.0.0

Update .github/workflows/cloud-ci.yml

Apply suggestions from code review

Update .github/workflows/cloud-ci.yml

Add azure covalent cloud infra

Add multi-node on cloud

* VM on the cloud might not have enough space on all partitions. Add a workaround which should cover most cases
* Use branch and commit name to versionize reports directories
* Fix parsing error when temperature is not available in nvidia-smi outputs
* export MILABENCH_* env vars to remote

Add docs

Fix cloud instance name conflict

This would prevent the CI or multiple contributors to run tests with the same config

Fix github push in CI

* Copy ssh key to allow connections from master to workers
* Use local ip for manager's ip such that workers can find it and connect to it
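One of the commit notes above mentions a parsing error when nvidia-smi does not report a temperature. The sketch below shows one defensive way to handle that case. It assumes CSV rows such as those produced by `nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv,noheader,nounits`; the column choice, the helper names and the set of placeholder strings are assumptions, and milabench's own parser may read a different nvidia-smi output format.

```python
from typing import Dict, Optional

# Placeholder strings treated as "value not available" (assumed set).
MISSING = {"", "N/A", "[N/A]", "[Not Supported]"}


def parse_metric(raw: str) -> Optional[float]:
    """Convert one CSV field to a float, or None when it is unavailable."""
    value = raw.strip()
    if value in MISSING:
        return None
    try:
        # Keep only the leading number in case a unit follows it.
        return float(value.split()[0])
    except ValueError:
        return None


def parse_row(line: str) -> Dict[str, Optional[float]]:
    """Parse a 'temperature, utilization, memory' row (hypothetical layout)."""
    temperature, utilization, memory = line.split(",")
    return {
        "temperature": parse_metric(temperature),
        "utilization": parse_metric(utilization),
        "memory": parse_metric(memory),
    }


print(parse_row("[N/A], 87, 53815"))
```

Returning `None` instead of raising lets the rest of the metrics from the same row be recorded even when one sensor is missing.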

satyaog · 2024-09-20T15:51:48Z

Added the slurm covalent plugin to help debug the cloud setups

satyaog · 2024-09-24T13:43:10Z

Tested slurm with:

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    1
memory:   81920.0

Breakdown
---------
bench                    | fail |   n | ngpu |           perf |   sem% |   std% | peak_memory |          score | weight
diffusion-single         |    0 |   1 |    1 |          28.13 |   0.1% |   0.9% |       53815 |          28.13 |   1.00
dimenet                  |    0 |   1 |    1 |         482.46 |   1.8% |   5.4% |         nan |         482.46 |   1.00
dinov2-giant-single      |    0 |   1 |    1 |          54.12 |   0.6% |   2.1% |       69569 |          54.12 |   1.00
dqn                      |    0 |   1 |    1 | 22934535905.03 |   3.3% |  91.1% |         nan | 22934535905.03 |   1.00
bf16                     |    0 |   1 |    1 |         296.65 |   0.0% |   0.2% |        1609 |         296.65 |   0.00
fp16                     |    0 |   1 |    1 |         295.35 |   0.0% |   0.3% |        1609 |         295.35 |   0.00
fp32                     |    0 |   1 |    1 |          19.17 |   0.0% |   0.0% |        1987 |          19.17 |   0.00
tf32                     |    0 |   1 |    1 |         148.64 |   0.0% |   0.1% |        1987 |         148.64 |   0.00
bert-fp16                |    0 |   1 |    1 |         275.25 |   0.0% |   0.2% |         nan |         275.25 |   0.00
bert-fp32                |    0 |   1 |    1 |          45.64 |   0.0% |   0.1% |       20991 |          45.64 |   0.00
bert-tf32                |    0 |   1 |    1 |         147.32 |   0.1% |   0.4% |         nan |         147.32 |   0.00
bert-tf32-fp16           |    0 |   1 |    1 |         274.37 |   0.2% |   1.3% |         nan |         274.37 |   3.00
reformer                 |    0 |   1 |    1 |          62.86 |   0.1% |   0.4% |         nan |          62.86 |   1.00
t5                       |    0 |   1 |    1 |          52.16 |   0.3% |   0.8% |         nan |          52.16 |   2.00
whisper                  |    0 |   1 |    1 |         520.24 |   1.0% |   3.0% |         nan |         520.24 |   1.00
lightning                |    0 |   1 |    1 |         712.70 |   0.5% |   5.0% |       27183 |         712.70 |   1.00
llava-single             |    0 |   1 |    1 |           2.39 |   0.2% |   1.6% |       72377 |           2.39 |   1.00
llama                    |    0 |   1 |    1 |         466.14 |  11.5% |  72.0% |       27641 |         466.14 |   1.00
llm-lora-single          |    0 |   1 |    1 |        3517.85 |   0.1% |   0.7% |       32995 |        3517.85 |   1.00
pna                      |    0 |   1 |    1 |        5079.10 |   1.9% |   5.6% |       39543 |        5079.10 |   1.00
ppo                      |    0 |   1 |    1 |    32372024.27 |   1.5% |  57.6% |       62159 |    32372024.27 |   1.00
recursiongfn             |    0 |   1 |    1 |        9035.14 |   3.5% |  10.5% |        6935 |        9035.14 |   1.00
rlhf-single              |    0 |   1 |    1 |        2573.66 |   0.3% |   2.8% |       19181 |        2573.66 |   1.00
focalnet                 |    0 |   1 |    1 |         389.95 |   0.7% |   2.3% |        3847 |         389.95 |   2.00
torchatari               |    0 |   1 |    1 |        3592.50 |   1.4% |   5.0% |        3655 |        3592.50 |   1.00
convnext_large-fp16      |    0 |   1 |    1 |         354.76 |   0.5% |   2.6% |         nan |         354.76 |   0.00
convnext_large-fp32      |    0 |   1 |    1 |          60.63 |   0.1% |   0.3% |       55771 |          60.63 |   0.00
convnext_large-tf32      |    0 |   1 |    1 |         160.49 |   0.0% |   0.1% |       49471 |         160.49 |   0.00
convnext_large-tf32-fp16 |    0 |   1 |    1 |         357.23 |   0.2% |   1.2% |         nan |         357.23 |   3.00
regnet_y_128gf           |    0 |   1 |    1 |         123.15 |   0.3% |   0.9% |         nan |         123.15 |   2.00
resnet50                 |    0 |   1 |    1 |        1199.53 |   2.4% |   7.3% |         nan |        1199.53 |   1.00
resnet50-noio            |    0 |   1 |    1 |        1177.09 |   0.0% |   0.2% |       27301 |        1177.09 |   0.00
vjepa-single             |    0 |   1 |    1 |          22.22 |   1.8% |  14.0% |       56005 |          22.22 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:             821.42

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    4
memory:   81920.0

Breakdown
---------
bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
brax               |    0 |   1 |    4 |  636209.06 |   0.3% |   0.8% |        2609 |  636209.06 |   1.00
diffusion-gpus     |    0 |   1 |    4 |     109.52 |   0.1% |   0.5% |       58283 |     109.52 |   1.00
dinov2-giant-gpus  |    0 |   1 |    4 |     229.23 |   0.3% |   0.9% |       70961 |     229.23 |   1.00
lightning-gpus     |    0 |   1 |    4 |    2898.55 |   0.3% |   2.6% |       28055 |    2898.55 |   1.00
llm-lora-ddp-gpus  |    0 |   1 |    4 |   10472.82 |   0.6% |   3.1% |       36227 |   10472.82 |   1.00
rlhf-gpus          |    0 |   1 |    4 |    7560.51 |   0.3% |   2.4% |       21489 |    7560.51 |   1.00
resnet152-ddp-gpus |    0 |   1 |    4 |    2438.15 |   0.0% |   0.4% |       27849 |    2438.15 |   0.00
vjepa-gpus         |    0 |   1 |    4 |      78.81 |   3.6% |  28.9% |       63831 |      78.81 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:            2246.64

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    2
memory:   81920.0

Breakdown
---------
bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
diffusion-nodes    |    0 |   2 |    4 |      23.50 |   0.5% |   3.7% |       57299 |      23.50 |   1.00
llm-lora-ddp-nodes |    0 |   2 |    4 |    1043.47 |   0.6% |   3.4% |       35199 |    1043.47 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:             156.58
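For reading the `Score` lines, the aggregate of the two-node table above is consistent with a weighted geometric mean of the per-benchmark scores. This is inferred from the numbers in the table, not from the milabench source, so treat the sketch below as an assumption about how the value could be reproduced.

```python
import math
from typing import Iterable, Tuple


def weighted_geometric_mean(rows: Iterable[Tuple[float, float]]) -> float:
    """Weighted geometric mean of (score, weight) pairs; zero weights are skipped."""
    kept = [(score, weight) for score, weight in rows if weight > 0]
    total_weight = sum(weight for _, weight in kept)
    log_sum = sum(weight * math.log(score) for score, weight in kept)
    return math.exp(log_sum / total_weight)


# (score, weight) pairs copied from the two-node breakdown table.
multi_node = [
    (23.50, 1.0),    # diffusion-nodes
    (1043.47, 1.0),  # llm-lora-ddp-nodes
]

score = weighted_geometric_mean(multi_node)
print(f"{score:.2f}")
```

The result lands close to the reported 156.58; the small gap is expected because the table shows scores rounded to two decimals. Benchmarks with weight 0.00 do not contribute under this reading.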

Large LLM models (llama3 70B) have been excluded as I don't have the resources to test them yet.

It should also work on azure, which I'll test next week.

satyaog had a problem deploying to cloud-ci May 22, 2024 19:38 — with GitHub Actions Failure

satyaog had a problem deploying to cloud-ci May 22, 2024 19:40 — with GitHub Actions Failure

satyaog mentioned this pull request May 22, 2024

Feature/covalent #196

Closed

satyaog had a problem deploying to test-cloud-ci May 23, 2024 13:48 — with GitHub Actions Failure

satyaog had a problem deploying to test-cloud-ci May 23, 2024 13:51 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 29f573e to 3bfe690 Compare May 23, 2024 13:53

satyaog had a problem deploying to cloud-ci May 23, 2024 13:53 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 3bfe690 to 65ca09a Compare May 23, 2024 14:02

satyaog had a problem deploying to cloud-ci May 23, 2024 14:02 — with GitHub Actions Error

satyaog force-pushed the satyaog/feature/covalent branch from 65ca09a to 978e16d Compare May 23, 2024 14:28

satyaog had a problem deploying to cloud-ci May 23, 2024 14:29 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch 3 times, most recently from f9a8c6e to 89898ee Compare May 23, 2024 15:11

satyaog had a problem deploying to cloud-ci May 23, 2024 15:11 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 89898ee to 14ffdf1 Compare May 24, 2024 13:03

satyaog had a problem deploying to cloud-ci May 24, 2024 13:04 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 14ffdf1 to 7d15073 Compare May 24, 2024 15:40

satyaog had a problem deploying to cloud-ci May 24, 2024 15:40 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 7d15073 to 052d2b9 Compare May 27, 2024 13:49

satyaog had a problem deploying to cloud-ci May 27, 2024 13:49 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 052d2b9 to 11dd515 Compare May 27, 2024 18:06

satyaog had a problem deploying to cloud-ci May 27, 2024 18:06 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 11dd515 to 3683cb7 Compare May 27, 2024 23:50

satyaog temporarily deployed to cloud-ci May 27, 2024 23:50 — with GitHub Actions Inactive

satyaog force-pushed the satyaog/feature/covalent branch from 3683cb7 to fa32dde Compare August 8, 2024 14:24

satyaog had a problem deploying to cloud-ci August 8, 2024 14:24 — with GitHub Actions Error

satyaog force-pushed the satyaog/feature/covalent branch from fa32dde to 172c90f Compare August 8, 2024 14:26

satyaog had a problem deploying to cloud-ci August 8, 2024 14:26 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 172c90f to 2f5f981 Compare August 8, 2024 14:53

satyaog force-pushed the satyaog/feature/covalent branch from e9a129b to 558a31d Compare August 21, 2024 06:23

satyaog temporarily deployed to cloud-ci August 21, 2024 06:23 — with GitHub Actions Inactive

satyaog had a problem deploying to cloud-ci August 21, 2024 07:12 — with GitHub Actions Failure

satyaog requested a deployment to cloud-ci August 21, 2024 07:34 — with GitHub Actions Abandoned

satyaog temporarily deployed to cloud-ci August 21, 2024 11:09 — with GitHub Actions Inactive

satyaog had a problem deploying to cloud-ci August 21, 2024 11:34 — with GitHub Actions Failure

satyaog force-pushed the satyaog/feature/covalent branch from 558a31d to 9e394be Compare August 22, 2024 04:13

satyaog temporarily deployed to cloud-ci August 22, 2024 04:13 — with GitHub Actions Inactive

satyaog temporarily deployed to cloud-ci August 22, 2024 05:01 — with GitHub Actions Inactive

satyaog temporarily deployed to cloud-ci August 22, 2024 05:28 — with GitHub Actions Inactive

satyaog force-pushed the satyaog/feature/covalent branch from 9e394be to fdd5270 Compare September 6, 2024 03:24

satyaog had a problem deploying to cloud-ci September 6, 2024 03:24 — with GitHub Actions Failure

satyaog changed the base branch from master to staging September 6, 2024 03:28

satyaog requested a deployment to cloud-ci September 6, 2024 04:48 — with GitHub Actions Abandoned

satyaog requested a deployment to cloud-ci September 6, 2024 08:08 — with GitHub Actions Abandoned

Delaunay force-pushed the staging branch from e5505ee to 13b24e3 Compare September 6, 2024 13:59

satyaog force-pushed the satyaog/feature/covalent branch from fdd5270 to b03a424 Compare September 11, 2024 05:24

satyaog added 2 commits September 20, 2024 11:37

Fix llama3 generation

3e45407

satyaog force-pushed the satyaog/feature/covalent branch from b03a424 to b591c23 Compare September 20, 2024 15:48

satyaog force-pushed the satyaog/feature/covalent branch from b591c23 to 3b207f8 Compare September 23, 2024 21:42

Base automatically changed from staging to master October 2, 2024 17:00

Add slurm system setup

f75e3a5

satyaog force-pushed the satyaog/feature/covalent branch from 3b207f8 to f75e3a5 Compare October 3, 2024 19:11

satyaog had a problem deploying to cloud-ci October 3, 2024 19:11 — with GitHub Actions Failure

satyaog requested a deployment to cloud-ci October 3, 2024 20:47 — with GitHub Actions Abandoned
