Torchdynamo with aot_autograd_speedup_strategy has increased memory usage and long overhead on ResNet50 model #93751
cc @anijain2305, who was looking into a similar issue.
Looking into this. I have seen that AOT Autograd can increase the memory footprint while tracing (maybe we are duplicating the memory somewhere; I have to check). But failing at a batch size of 256 sounds bad. Thanks for pointing it out. The performance drop at the working batch sizes is also unexpected. I will update within a couple of days.
Currently, NVFuser isn't enabled by default with that strategy. Turning it on results in significant speedups over eager (on my machine, an A100 40 GB). As for the overhead of the first compilation, that's not something we've substantially investigated or measured in the past; we'll look into that. Memory overhead could also come from a couple of different places, which is also something we need to look into.
Updated script with NVFuser
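The collapsed script itself isn't shown above. As a minimal sketch of what enabling NVFuser together with this strategy might look like, assuming the torchdynamo import paths of that era and the standard torch.jit.fuser("fuser2") switch for NVFuser:

```python
# Sketch only (not the attached script); the torchdynamo import paths and the
# fuser context are assumptions based on how the strategy is used in this thread.
import torch
import torchvision
import torchdynamo
from torchdynamo.optimizations.training import aot_autograd_speedup_strategy

model = torchvision.models.resnet50().cuda()
x = torch.randn(256, 3, 224, 224, device="cuda")

# "fuser2" selects NVFuser for the TorchScript graphs produced downstream.
with torch.jit.fuser("fuser2"):
    with torchdynamo.optimize(aot_autograd_speedup_strategy):
        for _ in range(10):
            model.zero_grad(set_to_none=True)
            model(x).sum().backward()
```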
Thanks for the update @Chillee! Just want to make sure: does this strategy rely on NVFuser? If so, do users need to add an nvfuser context manager to enable it?
Yes, it uses a rematerialization algorithm that needs to know what things are fusible (and thus, is currently tuned against nvfuser).
I think it totally makes sense to turn it on automatically; @anijain2305 is looking into that.
FYI: It still uses significantly more memory on networks with nvFuser. We're seeing a situation where 3/4 of eager mode's batch size still doesn't fit, but 1/2 does.
@csarofeen Yep, neither the memory overhead nor the investigation into where the first-compilation overhead comes from has been resolved yet. There are some cases where we're careless with memory and keep references around longer than they should be. Working on those now.
Some investigations. The experiments here are all run with a batch size of 256, and I'll primarily be reporting peak memory usage (as measured by …). As a baseline, eager mode (no torchdynamo, etc.) reaches 22 GB of peak memory usage. On the other hand, torchdynamo + AOTAutograd + nvfuser runs at 39 GB of peak memory usage - not good!
Which I've done here: pytorch/functorch#779. Unfortunately, this doesn't seem to reduce memory with TorchScript - I need to disable the lowering to TorchScript (and run it as an FX graph) in order to reduce memory. However, doing so reduces the peak memory usage to 22 GB, same as eager. Next steps:
cc: @csarofeen @xwang233 For now, if you want to keep benchmarking while reducing memory usage (and still using NVFuser), the easiest thing to try would be AOTAutograd without Dynamo.
Updated script with AOTAutograd
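The collapsed script isn't shown here either. A minimal sketch of AOTAutograd without Dynamo, assuming functorch's compile entry points (aot_module, ts_compile, nop) from around the functorch commit referenced in this issue:

```python
# Sketch (assumed API): compile ResNet50 with AOTAutograd directly, no dynamo.
# ts_compile lowers the extracted forward/backward graphs to TorchScript so
# NVFuser can fuse them; swapping in `nop` (which returns the FX graph
# unchanged) would skip the TorchScript lowering mentioned earlier.
import torch
import torchvision
from functorch.compile import aot_module, ts_compile

model = torchvision.models.resnet50().cuda()
compiled = aot_module(model, fw_compiler=ts_compile, bw_compiler=ts_compile)

x = torch.randn(256, 3, 224, 224, device="cuda")
with torch.jit.fuser("fuser2"):   # select NVFuser for the scripted graphs
    compiled(x).sum().backward()
```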
One obvious source of memory overhead from TorchDynamo is config.dynamic_propagation=True. With this mode, TorchDynamo will create an example value for every intermediate tensor in the graph by actually running the ops on example inputs. This approach is nice in that it is highly accurate and trivial to implement -- however, it is very wasteful in the memory department. We should rewrite it to use fake/meta tensors instead. It is very possible there are other sources of memory overhead as well; I think @anijain2305 is looking into one. Most things should work if you disable dynamic_propagation. The exceptions are that it allows constant inlining of tensor properties (dtype/device/ndim/shape/contiguous/layout/etc.) and handling of ops that return lists/tuples/etc.
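To make the trade-off concrete: the flag named above can be turned off, and meta tensors show what the proposed rewrite would buy, propagating shapes and dtypes with no real allocation. A sketch (the torchdynamo.config access pattern is an assumption; the meta-tensor behavior is stock PyTorch):

```python
import torch
import torchdynamo

# The flag named in the comment above; turning it off avoids materializing
# example values for every intermediate, at the cost of the features listed.
torchdynamo.config.dynamic_propagation = False

# What fake/meta tensors buy you: shape/dtype propagation with no real memory.
a = torch.empty(256, 2048, device="meta")
b = torch.empty(2048, 1000, device="meta")
out = a @ b                    # shape inference only, nothing is allocated
print(out.shape, out.device)   # torch.Size([256, 1000]) meta
```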
@jansel After my quick investigation, I saw two sources of memory increase.
You already covered the first one in detail. I was thinking about using the CPU device to store the cloned tensors and free up GPU memory, but fake/meta tensors sound like a better long-term solution. The second one is specific to AOT Autograd; hopefully this is temporary, because we are trying to move to functionalization at the dispatcher level. Apart from this, I have a couple of small examples where Dynamo is not releasing/deleting a tensor when it goes out of scope. At the moment, it is unclear whether these are real issues or just badly set-up tests.
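For the out-of-scope examples, a check along these lines (my illustration, not the actual repro) tells you whether a tensor really gets freed once the last visible reference is dropped:

```python
import gc
import weakref
import torch

def is_released_after_del(make_tensor):
    """Return True if the tensor produced by make_tensor() is actually
    freed once the local reference is dropped."""
    t = make_tensor()
    ref = weakref.ref(t)
    del t
    gc.collect()
    return ref() is None

# Expected True in plain eager mode; a guard or cache holding the tensor
# alive (e.g. inside a compilation framework) would make this False.
print(is_released_after_del(lambda: torch.randn(1024, 1024, device="cuda")))
```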
@Chillee I tried AOTAutograd without Dynamo, but it seems there's an error that's likely associated with AMP. I manually disabled the TorchScript AMP pass but am still hitting this error.
The good news is that in FP32 it doesn't OOM the way it does with Dynamo.
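For context, the TorchScript AMP pass is usually toggled through a private JIT flag; since the script isn't shown, the exact switch used here is an assumption:

```python
import torch

# Private JIT switch for the TorchScript autocast/AMP pass (assumed to be the
# knob referred to above); False disables the pass so autocast regions are
# left to eager semantics instead of being handled inside the scripted graph.
torch._C._jit_set_autocast_mode(False)
```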
Also, with just AOTAutograd I'm seeing the following on ResNet50 with a V100 in FP32:
python script (please correct me if I'm using the torchdynamo API wrong)
dynamo-test.py
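The attached dynamo-test.py isn't reproduced in this thread. A minimal sketch consistent with the flags used in the bash script below (-b for batch size, --use_dynamo to toggle torchdynamo); the torchdynamo imports are assumptions:

```python
# Sketch only, not the attached script: ResNet50 training loop that can be
# toggled between eager mode and torchdynamo + aot_autograd_speedup_strategy.
import argparse
import contextlib
import torch
import torchvision
import torchdynamo
from torchdynamo.optimizations.training import aot_autograd_speedup_strategy

parser = argparse.ArgumentParser()
parser.add_argument("-b", "--batch_size", type=int, default=64)
parser.add_argument("--use_dynamo", action="store_true")
args = parser.parse_args()

model = torchvision.models.resnet50().cuda()
x = torch.randn(args.batch_size, 3, 224, 224, device="cuda")

ctx = (torchdynamo.optimize(aot_autograd_speedup_strategy)
       if args.use_dynamo else contextlib.nullcontext())

with ctx:
    for _ in range(10):
        model.zero_grad(set_to_none=True)
        model(x).sum().backward()

print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```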
bash script
#!/bin/bash
python dynamo-test.py -b 16
python dynamo-test.py -b 16 --use_dynamo
python dynamo-test.py -b 64
python dynamo-test.py -b 64 --use_dynamo
python dynamo-test.py -b 128
python dynamo-test.py -b 128 --use_dynamo
Tested with V100 16GB on pytorch @ f6bbecf, pytorch/vision@104073c, torchdynamo @ 0d59ce9, pytorch/functorch@ac0fdf1, CUDA 11.6 Update 1, cuDNN 8.3.3.40.
results
Torchdynamo with aot_autograd_speedup_strategy shows increased memory usage and longer overhead on the ResNet50 model compared to eager mode.
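For reference, a rough way to reproduce the memory and overhead comparison with standard PyTorch utilities; this is my sketch of a measurement harness, not necessarily how the reported numbers were collected:

```python
import time
import torch

def measure(step, warmup_iters=3, iters=10):
    """Return (first-iteration latency, steady-state latency, peak GiB)."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    start = time.time()
    step()                      # first call includes any compilation overhead
    torch.cuda.synchronize()
    first = time.time() - start

    for _ in range(warmup_iters):
        step()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        step()
    torch.cuda.synchronize()
    steady = (time.time() - start) / iters

    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return first, steady, peak_gib
```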
cc @ezyang @soumith @msaroufim @wconstab @ngimel @bdhirsh @csarofeen @ptrblck @jjsjann123 @kevinstephano @jansel