Open-sourced PipeDLRM #122

YanzhaoWu · 2020-08-14T18:29:13Z

The open-sourced version of PipeDLRM, consisting of five functional components: profiler, optimizer, runtime implementation, modeling, and visualizer. PipeDLRM is built on top of DLRM with some components from PipeDream (https://github.com/msr-fiddle/pipedream).

facebook-github-bot · 2020-08-14T21:51:53Z

Hi @YanzhaoWu!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

ConnollyLeon · 2020-09-09T00:55:33Z

Hi Yanzhao,

I am very curious about your work. Could you please add more instructions on how to run it to your GitHub repository? It would help me and others a lot.

Thanks!

facebook-github-bot · 2020-09-09T01:29:08Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

dmudiger · 2020-09-09T06:24:18Z

Can we remove the empty "__init__.py" files?

dmudiger · 2020-09-09T06:35:06Z

> Hi Yanzhao,
>
> I am very curious about your work. Could you please add more instructions on how to run it to your GitHub repository? It would help me and others a lot.
>
> Thanks!

Thank you for your interest in this work. We are actively reviewing this PR so that it can be merged. In the meantime, please feel free to try it out. You can find detailed instructions here: https://github.com/facebookresearch/dlrm/pull/122/files#diff-22b1984e9055744bcb6b52260dfdfb71

dmudiger · 2020-09-11T15:37:33Z

Bringing the discussion from the email thread back here: perhaps we can include some of the PipeDream components as a linked submodule rather than copying them over?

YanzhaoWu · 2020-09-13T06:59:31Z

> Hi Yanzhao,
>
> I am very curious about your work. Could you please add more instructions on how to run it to your GitHub repository? It would help me and others a lot.
>
> Thanks!

Thank you very much for your interest in our project. In addition, you may check the script (https://github.com/facebookresearch/dlrm/pull/122/files#diff-bc0c739ba93024f3443445a48fd0319b) for running PipeDLRM on the Kaggle DAC dataset with a 3-stage pipeline. I hope it helps.

YanzhaoWu · 2020-09-13T07:05:40Z

> Can we remove the empty "__init__.py" files?

Sure. Currently, the empty __init__.py files mark the directories that contain them as Python packages, which PipeDLRM uses. We may remove them when we reorganize the codebase.
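
As a minimal illustration of what the empty files do (this is not taken from the PR; the package name demo_runtime and the module stage.py are made up for the sketch), the following builds a throwaway directory containing an empty __init__.py and imports a module from it as a regular package:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_demo_package(root: Path, name: str = "demo_runtime") -> Path:
    """Create a directory with an empty __init__.py and one module (hypothetical names)."""
    pkg_dir = root / name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")  # the empty marker file
    (pkg_dir / "stage.py").write_text(
        "def describe():\n    return 'stage module imported'\n"
    )
    return pkg_dir


with tempfile.TemporaryDirectory() as tmp:
    make_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        stage = importlib.import_module("demo_runtime.stage")
        message = stage.describe()
        # A regular package is backed by its __init__.py file.
        package_file = Path(sys.modules["demo_runtime"].__file__).name
    finally:
        sys.path.remove(tmp)

print(message)
print(package_file)
```

Since Python 3.3, a directory without __init__.py can still be imported as a namespace package, so whether the files can be dropped depends on how the codebase resolves its imports.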

TimJZ · 2020-09-19T23:06:29Z

Hi Yanzhao,
I'm having some trouble running the code and I'm wondering if you could provide some help. I'm mainly confused about the meaning of several variables in the shell script.

I currently have one node with 4 GPUs. What values should I set for num_input_rank, nrank, and ngpus?

From my understanding, nrank represents the number of GPUs on one machine, so I set it to 4. I've tried several values for num_input_rank, and so far all of them have given me errors such as:

  File "../communication.py", line 42, in __init__
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 434, in init_process_group
    timeout=timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 505, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:898] Connect timeout [172.17.0.4]:26516

Could you please give me some recommendations on how to set these numbers correctly? Thank you so much!

YanzhaoWu · 2020-10-01T18:05:08Z

Thank you very much for your interest in our project.
Sure. num_input_rank is the number of replicas of stage 0, which also handles the input data loader. In your case, you can set it to 1. nrank is the number of ranks (GPUs) used to run PipeDLRM. With num_input_rank=1, you can set nrank=3 (3 GPUs <-> 3 stages, no replication). These settings should work for this script.

However, the model configuration file (models/dlrm/gpus=3/$conf_file) must still be modified to match.
For the settings above, you can try conf_file=mp_conf.json in this script.
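
To make the relationship between these numbers concrete, here is a small sketch. It is not PipeDLRM code: the helper stage_to_ranks and the num_stages argument are hypothetical, and it assumes that only stage 0 is replicated, that every later stage runs on exactly one rank, and that ranks are assigned contiguously. The actual mapping comes from the model configuration file.

```python
from typing import Dict, List


def stage_to_ranks(nrank: int, num_input_rank: int, num_stages: int) -> Dict[int, List[int]]:
    """Map each pipeline stage to its ranks (hypothetical helper).

    Assumption: stage 0 gets num_input_rank replicas and every later
    stage gets exactly one rank.
    """
    if num_input_rank < 1 or num_stages < 1:
        raise ValueError("num_input_rank and num_stages must be at least 1")
    needed = num_input_rank + (num_stages - 1)
    if nrank != needed:
        raise ValueError(
            f"nrank={nrank} does not match the {needed} ranks implied by "
            f"num_input_rank={num_input_rank} and num_stages={num_stages}"
        )
    layout = {0: list(range(num_input_rank))}
    for stage in range(1, num_stages):
        layout[stage] = [num_input_rank + stage - 1]
    return layout


# The setting suggested above: 3 GPUs <-> 3 stages, no replication.
layout = stage_to_ranks(nrank=3, num_input_rank=1, num_stages=3)
print(layout)  # {0: [0], 1: [1], 2: [2]}
```

Under the same assumptions, replicating stage 0 three times in a 3-stage pipeline would need nrank=5, giving {0: [0, 1, 2], 1: [3], 2: [4]}.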

deepakn94 · 2020-10-22T20:43:46Z

This looks cool!

Agree with @dmudiger that the PipeDream parts of the code can probably be removed from this codebase, especially if you haven't made any changes -- will make the diff easier to look at. If you have some changes to PipeDream that you think would be broadly useful, I am happy to upstream them to PipeDream if you send me a PR.

sanjay-k-mukherjee · 2020-10-29T07:03:54Z

We are running PipeDLRM with num_input_rank=3 and nrank=4, using the default script "../../exp/pipeline/dlrm_dac_pytorch.sh".
I am currently observing the following issue:

  File "main_with_runtime.py", line 627, in <module>
    num_versions=num_versions, lr=args.learning_rate)
  File "../sgd.py", line 23, in __init__
    macrobatch=macrobatch,
  File "../optimizer.py", line 41, in __init__
    master_parameters, **optimizer_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/sgd.py", line 68, in __init__
    super(SGD, self).__init__(params, defaults)
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 47, in __init__
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list

With nrank=6, I observe the following failure:

Traceback (most recent call last):
  File "main_with_runtime.py", line 585, in <module>
    dp.make_random_loader_with_sampler(args, train_data, test_data, num_ranks_in_first_stage)
TypeError: 'NoneType' object is not iterable

YanzhaoWu · 2020-12-10T05:33:28Z

Thank you very much for your interest in our project.
For the first problem, it seems that the PyTorch model was not initialized correctly, so the optimizer cannot obtain the trainable model parameters. You may need to compile PyTorch with the corresponding patches under the pytorch_patches folder.

For the second issue, it seems that train_data or test_data is None. The input ranks load the actual training data, while the other ranks generate random data so that the number of iterations stays consistent across ranks. Please check the configuration to make sure num_batches is correct. I also suggest trying num_input_rank=1 first.
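
As a rough sketch of the consistency requirement described above (not PipeDLRM code: both helper names are hypothetical, and the even split of samples across input ranks with a partial final batch kept is an assumption), one could check a configured num_batches like this:

```python
import math


def iterations_per_input_rank(num_samples: int, batch_size: int, num_input_rank: int) -> int:
    """Batches each input rank produces, assuming an even split of samples."""
    if min(num_samples, batch_size, num_input_rank) < 1:
        raise ValueError("all arguments must be positive")
    samples_per_rank = math.ceil(num_samples / num_input_rank)
    return math.ceil(samples_per_rank / batch_size)


def check_num_batches(num_batches: int, num_samples: int, batch_size: int, num_input_rank: int) -> None:
    """Raise if num_batches differs from what the input ranks will iterate over."""
    expected = iterations_per_input_rank(num_samples, batch_size, num_input_rank)
    if num_batches != expected:
        raise ValueError(
            f"num_batches={num_batches}, but input ranks will run {expected} iterations"
        )


print(iterations_per_input_rank(1000, 128, 1))  # 8
print(iterations_per_input_rank(1000, 128, 3))  # 3
check_num_batches(8, 1000, 128, 1)  # passes silently
```

The point of the sketch is only that the ranks generating random data must run the same number of iterations as the ranks reading real data; the sample counts and batch size are made-up values.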

Open-sourced PipeDLRM

b304311

facebook-github-bot added the CLA Signed label on Aug 14, 2020.

add the running scripts

9d15146
