CPUOffloadOptimizer issues #1209

Open
felipemello1 opened this issue Nov 1, 2024 · 4 comments
Labels: bug (Something isn't working), optimizer

@felipemello1

Hi all, I was giving the CPUOffloadOptimizer a try and found two issues when using it with QLoRA single device in torchtune:

  1. When using an LR scheduler I got the error below. Maybe there is a way to inherit the optimizer class?
File "/data/users/felipemello/torchtune/torchtune/training/lr_schedulers.py", line 58, in get_cosine_schedule_with_warmup
    return LambdaLR(optimizer, lr_lambda, last_epoch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 336, in __init__
    super().__init__(optimizer, last_epoch, verbose)
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 99, in __init__
    raise TypeError(f"{type(optimizer).__name__} is not an Optimizer")
TypeError: CPUOffloadOptimizer is not an Optimizer
  2. When passing model.parameters() I got the error below (a minimal repro sketch follows the traceback). I imagine a simple fix is to keep only the params that require grad, like the AdamW implementation does.
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torchao/prototype/low_bit_optim/cpu_offload.py", line 76, in __init__
    p_cuda.register_post_accumulate_grad_hook(backward_hook)
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/_tensor.py", line 678, in register_post_accumulate_grad_hook
    raise RuntimeError(
RuntimeError: cannot register a hook on a tensor that doesn't require gradient
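
A minimal sketch that should reproduce this second error (toy model; the layer and lr are illustrative):

import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(8, 8).cuda()
model.weight.requires_grad_(False)  # frozen, like a quantized base weight in QLoRA

# CPUOffloadOptimizer registers a post-accumulate-grad hook on every param,
# so the frozen weight raises:
# RuntimeError: cannot register a hook on a tensor that doesn't require gradient
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, lr=1e-4)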

cc: @gau-nernst

@gau-nernst
Collaborator

1 is a known issue. You can see my view here: #959 (comment). I will look into the torch.optim.Optimizer base class to see what could go wrong if I make CPUOffloadOptimizer inherit from it. For example, off the top of my head, CPUOffloadOptimizer will not have self.state.

In the meantime, CPUOffloadOptimizer requires setting the LR manually; see #584 (comment).
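
Something along these lines should work (a sketch; compute_lr is a placeholder for whatever schedule you want):

def set_lr(offload_optim, lr: float) -> None:
    # CPUOffloadOptimizer keeps one small optimizer per parameter in its
    # optim_dict (see the code excerpt below), so set the LR on each of them
    for optim in offload_optim.optim_dict.values():
        for group in optim.param_groups:
            group["lr"] = lr

# in the training loop, instead of scheduler.step():
#   set_lr(optimizer, compute_lr(step))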

For 2, it's an oversight on my part. We can simply add a requires_grad check here; will push a fix:

for p_cuda in params:
    # pre-allocate CPU params and grads
    p_cpu = torch.empty_like(p_cuda, device="cpu", pin_memory=True)
    p_cpu.grad = torch.empty_like(p_cpu, pin_memory=True)
    p_cpu.copy_(p_cuda.detach(), non_blocking=True)
    self.param_cuda2cpu_map[p_cuda] = p_cpu
    p_cuda.register_post_accumulate_grad_hook(backward_hook)
    self.optim_dict[p_cuda] = optimizer_class([{"params": p_cpu, **param_group}], **kwargs)
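
i.e. something along these lines (a sketch of the proposed check, not the final fix):

for p_cuda in params:
    # skip frozen params (e.g. quantized base weights in QLoRA): they have
    # no grads to offload and cannot take a post-accumulate-grad hook
    if not p_cuda.requires_grad:
        continue
    ...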

@fzyzcjy

fzyzcjy commented Nov 18, 2024

Hi, are there any updates? Thanks! It would be great if it could be plugged directly into Hugging Face transformers, but right now it fails with the scheduler issue above:

[10:19:58.912]:     self.trainer.inner.train()
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
[10:19:58.912]:     output = super().train(*args, **kwargs)
[10:19:58.912]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
[10:19:58.912]:     return inner_training_loop(
[10:19:58.912]:            ^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2224, in _inner_training_loop
[10:19:58.912]:     self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1130, in create_optimizer_and_scheduler
[10:19:58.912]:     self.create_scheduler(num_training_steps=num_training_steps, optimizer=optimizer)
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1632, in create_scheduler
[10:19:58.912]:     self.lr_scheduler = get_scheduler(
[10:19:58.912]:                         ^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 550, in get_scheduler
[10:19:58.913]:     return schedule_func(
[10:19:58.913]:            ^^^^^^^^^^^^^^
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 132, in get_linear_schedule_with_warmup
[10:19:58.913]:     return LambdaLR(optimizer, lr_lambda, last_epoch)
[10:19:58.913]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 336, in __init__
[10:19:58.913]:     super().__init__(optimizer, last_epoch, verbose)
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 99, in __init__
[10:19:58.913]:     raise TypeError(f"{type(optimizer).__name__} is not an Optimizer")
[10:19:58.913]: TypeError: CPUOffloadOptimizer is not an Optimizer

@gau-nernst
Collaborator

@fzyzcjy To unblock your case, you can try making CPUOffloadOptimizer a subclass of torch.optim.Optimizer, i.e. change the following line

class CPUOffloadOptimizer:

to class CPUOffloadOptimizer(Optimizer):. Make sure not to call super().__init__(); this is just a workaround to pass the class check in the PyTorch LR scheduler. I will investigate whether this causes other issues before merging the fix.
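
Concretely, the edit would look roughly like this (in torchao/prototype/low_bit_optim/cpu_offload.py; signature abbreviated):

from torch.optim import Optimizer

class CPUOffloadOptimizer(Optimizer):  # was: class CPUOffloadOptimizer:
    def __init__(self, params, optimizer_class, **kwargs):  # signature abbreviated
        # deliberately no super().__init__() call: subclassing here only
        # exists to pass the isinstance(optimizer, Optimizer) check in the
        # PyTorch LR scheduler
        ...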

IMO, since Python uses duck typing, the PyTorch LR scheduler should not explicitly check for the optimizer class.

@fzyzcjy

fzyzcjy commented Nov 19, 2024

Thank you!
