[PROPOSAL] Lazy initialization of model #3124
ver217 started this conversation in Development | Core
What are LazyTensor and LazyInit
LazyTensor allows a DL framework (PyTorch) to execute operations lazily: it records all operations applied to it and reruns them when the tensor needs to be materialized.
LazyInit defers model initialization and is built on top of LazyTensor.
Why we need to implement a new LazyInit
ColossalAI already has a similar feature: ColoInitContext. It hijacks the __init__() method of each module and shards each layer with a preset sharding strategy. This works well when the sharding strategy is known before model initialization. However, if the sharding strategy is generated from static analysis of the model, this approach no longer works.
So we need to initialize model tensors as meta tensors, run static analysis on the model to derive the sharding strategy, and then materialize each tensor and apply the strategy. The static analysis can be omitted if the sharding strategy is known in advance.
A possible initialization process:
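For illustration, here is a minimal sketch of that flow using plain PyTorch meta tensors. It only mimics the idea: the proposed LazyTensor would replay the recorded initialization OPs instead of re-initializing by hand as done here.

```python
import torch
import torch.nn as nn

# Step 1: build the model on the meta device. No memory is allocated and
# no real initialization runs; parameters only carry shape/dtype metadata.
with torch.device("meta"):
    model = nn.Linear(1024, 1024)

# Step 2: static analysis of the meta model would run here to derive a
# sharding strategy (omitted when the strategy is already known).

# Step 3: materialize each tensor and apply the sharding strategy.
# to_empty() allocates real (uninitialized) storage; the proposed LazyTensor
# would instead rerun the recorded initialization OPs to restore the values.
model = model.to_empty(device="cpu")
for param in model.parameters():
    nn.init.normal_(param)  # stand-in for replaying the original init OPs
```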
Actually, torchdistx has already implemented similar features, named FakeTensor and deferred_init. Why do we need to implement a new one? We want lazy initialization to work together with DTensor, and torchdistx is a kind of black box to us.
Method
We already have experimental code for lazy init. Thanks to @super-dainiu.
We can start this work from this file. We implement a LazyTensor class which records all OPs and reruns them when materializing.
A possible class definition of LazyTensor:
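As a rough illustration of what such a class could look like (this is not the actual implementation in the experimental code), here is a minimal LazyTensor that records its creation OP plus every subsequent method call and replays them on materialization:

```python
import torch

class LazyTensor:
    """Minimal sketch: record the creation OP and every subsequent method
    call, and rerun them when the tensor is materialized."""

    def __init__(self, func, *args, **kwargs):
        self._ops = [(func, args, kwargs)]  # recorded creation OP
        self._concrete = None               # cached materialized tensor

    def __getattr__(self, name):
        # Record tensor method calls (e.g. .normal_(), .mul_()) instead of
        # executing them eagerly.
        def record(*args, **kwargs):
            self._ops.append((name, args, kwargs))
            return self
        return record

    def materialize(self) -> torch.Tensor:
        """Replay all recorded OPs to produce a concrete tensor."""
        if self._concrete is None:
            func, args, kwargs = self._ops[0]
            tensor = func(*args, **kwargs)
            for name, args, kwargs in self._ops[1:]:
                tensor = getattr(tensor, name)(*args, **kwargs)
            self._concrete = tensor
        return self._concrete


# Creation and initialization are deferred until materialize() is called.
weight = LazyTensor(torch.empty, 4, 4)
weight.uniform_(-0.1, 0.1)          # recorded, not executed
print(weight.materialize().shape)   # OPs are replayed here -> torch.Size([4, 4])
```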
A possible usage of LazyInit:
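Roughly, usage could look like the following. LazyInitContext, materialize() and distribute() are the names used in this proposal, but the exact signatures shown here are assumptions, and the snippet will not run until the class is implemented:

```python
import torch.nn as nn

# Hypothetical usage; LazyInitContext does not exist yet and the argument
# list of distribute() is an assumption.
with LazyInitContext():
    # Parameters are created as LazyTensors; no real memory is allocated
    # and no initialization OP is executed yet.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 10))

# Materialize all parameters on the current device (single-device case) ...
LazyInitContext.materialize(model)

# ... or materialize and shard them according to a sharding strategy.
# LazyInitContext.distribute(model, sharding_strategy)
```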
LazyInitContext.materialize() and LazyInitContext.distribute() are static methods and may be replaced with downstream model wrappers.
Limitations
To keep the implementation simple, we make some trade-offs in the design.
We cannot ensure that a lazily initialized model is identical to a normally initialized one, but we can ensure that its parameters are initialized from the same distribution.
There are some cases where we cannot ensure correctness. In addition, as we don't track tensors' slice relationships, there are also cases where lazy execution won't work and tensors may be materialized early.
How to verify
To verify correctness, we have to control the random seed. We can implement a utility tensor class:
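A possible sketch of such a class (the name SeededTensor and the fixed seed are illustrative): a torch.Tensor subclass that resets the RNG via __torch_function__ before every OP.

```python
import torch

class SeededTensor(torch.Tensor):
    """Verification helper (sketch): reset the RNG to a fixed seed before
    every OP so that random OPs give reproducible results."""

    _seed = 42

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        torch.manual_seed(cls._seed)  # same random state before each OP
        return super().__torch_function__(func, types, args, kwargs or {})


a = torch.empty(4, 4).as_subclass(SeededTensor)
b = torch.empty(4, 4).as_subclass(SeededTensor)
a.normal_()
b.normal_()
assert torch.equal(a, b)  # identical because the seed is reset before each OP
```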
Thus, the random state is the same before each OP.
For LazyTensor, we reserve a hook to control the random seed before executing each OP. By doing this, we can simply compare the state dict or the forward result to verify the correctness of initialization.
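For instance, the end-to-end check could look like the snippet below. Since LazyInit is not implemented yet, a second eagerly built model stands in for the lazily initialized one; the real test would compare a lazily initialized model against an eagerly initialized one.

```python
import torch
import torch.nn as nn

def build_model() -> nn.Module:
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# With the random state controlled before initialization, the two models
# should end up with identical parameters.
torch.manual_seed(0)
eager = build_model()
torch.manual_seed(0)
lazy_like = build_model()  # placeholder for the lazily initialized model

# Compare state dicts key by key.
for (k1, v1), (k2, v2) in zip(eager.state_dict().items(),
                              lazy_like.state_dict().items()):
    assert k1 == k2 and torch.equal(v1, v2)

# The forward results should also match.
x = torch.randn(2, 16)
assert torch.equal(eager(x), lazy_like(x))
```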
Possible Roadmap