Allow allocation to be a split and different from loop. #3479

Open
wujingyue opened this issue Nov 26, 2024 · 9 comments

@wujingyue (Collaborator) commented Nov 26, 2024

This is a spin-off from #3458 (comment).

For a multi-GPU fusion to take a sharded input tensor, the allocation domain has to be a split of the logical domain. For example,

logical: iM, iN
allocation: iDIDx{D}, iM/D, iN
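
For illustration only (not part of the original report), a rough sketch of how such a sharded allocation might be expressed with the existing API, assuming integer extents M and N, a device count D, and a Fusion/FusionGuard set up as in the test below:

// Sketch: split iM by the device count D (outer split), mark the outer ID as
// device-parallel, and declare {iDIDx{D}, iM/D, iN} as the allocation domain.
// The loop domain is left untouched.
TensorView* in = makeContigConcreteTensor({M, N}); // logical: iM, iN
fusion->addInput(in);
auto [outer, inner] = IterDomain::split(
    in->axis(0), IrBuilder::create<Val>(D, DataType::Index), /*inner_split=*/false);
outer->parallelize(ParallelType::DIDx);
in->setAllocationDomain({outer, inner, in->axis(1)}, true);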

The loop domain, ideally, shouldn't be set or used because a fusion/segment input comes from outside and is not generated by a loop. Below is an ideal usage, which I also committed to https://github.com/NVIDIA/Fuser/tree/bug3479.

TEST_F(AllocationDomainTest, InputAllocationIsSplit_Concrete) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  TensorView* in = makeContigConcreteTensor({6});
  TensorView* out = set(in);
  fusion->addInput(in);
  fusion->addOutput(out);

  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(2, DataType::Index), true);
  in->setAllocationDomain({outer, inner}, true);

  FusionExecutorCache executor_cache(std::move(fusion));
  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
  at::Tensor in_tensor = at::randn({6}, options);
  auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});

  testValidate(
      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
}

This repro currently fails with the following error:

C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/ir/utils.cpp":965, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. dom0 has unreachable IDs. dom0: iS7{6}. dom1:

Code around

auto replay_CasP = BestEffortReplay(
    new_IDs,
    producer->getLoopDomain(),
    logical_map.mapProducerToConsumer(producer->domain(), replayed));

tries to propagate allocation from the producer to the consumer. However, the BestEffortReplay it creates fails to map the producer's allocation domain. As a result,

if (auto it = p2c_map.find(id); it != p2c_map.end()) {

fails to add anything to the consumer's allocation domain, leading to an empty allocation domain that doesn't post-dominate logical.

While I've worked around this problem by setting the loop domain to be the same as the allocation domain, @naoyam and I discussed some potential solutions in the original thread. One line of thought is to improve replayCasP; another is to propagate the allocation domain within a kernel through a different mechanism, since this is currently only needed for matmul. cc @zasdfgbnm

wujingyue added a commit that referenced this issue Nov 26, 2024
wujingyue changed the title from "Allocation can't be a split unless it's same as loop." to "Allow allocation to be a split and different from loop." Nov 27, 2024
@naoyam (Collaborator) commented Dec 6, 2024

Let me see if I understand the issue correctly.

  • replayCasP has an option to propagate the allocation domain, but it assumes all IDs of the allocation domain are found between the root and loop domains. If not, the propagation of the allocation domain fails.
  • We could extend replayCasP to properly support cases like the above.
  • We could also just disable the propagation; however, the matmul scheduler relies on it.

It seems to me that we should change the matmul scheduler rather than replayCasP. I generally believe each scheduling primitive should do one task only. If we need to schedule the allocation domain, it should probably be done separately from the propagation of the loop domain.

Since this is only a problem with the matmul scheduler, I feel that we should extend the matmul scheduler to do scheduling of the allocation domain more explicitly. What do you think? @jacobhinkle @rdspring1 @protonu @zasdfgbnm

@jacobhinkle (Collaborator)

My $.02: when I encountered this in the matmul scheduler, I was surprised it propagated the allocation domain. I think we should change that behavior, since explicit is better than implicit.

@zasdfgbnm (Collaborator)

Isn't this guarded by a flag?

if (!opt.replay_allocation) {

Would a quick unblock be to just set this flag to false wherever it is used? I believe it was used in cacheBefore/cacheAfter.

But on the other hand, I do think replaying the allocation domain (or at least the stride order) is what we want. For example, if you have a fusion input with NCHW logical order but NHWC stride order, and you cacheAfter this fusion input, you do want the cache to have NHWC as its allocation domain, because otherwise you will not be able to vectorize. I think it is not a matter of matmul vs. other schedulers; it is only used by matmul because matmul is the only scheduler that is stride-order-aware today. Other schedulers just schedule a bad kernel with no vectorization, which is something we should change in the future.
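
To make that NCHW/NHWC scenario concrete, here is a minimal sketch (mine, not from the thread), assuming concrete extents and a Fusion/FusionGuard already in place:

// Fusion input with logical order NCHW but NHWC memory layout.
TensorView* in = makeContigConcreteTensor({2, 8, 4, 4}); // N, C, H, W
fusion->addInput(in);
in->setAllocationDomain(
    {in->axis(0), in->axis(2), in->axis(3), in->axis(1)}, // N, H, W, C
    /*new_contiguity=*/true);

// The cache created by cacheAfter should also be laid out NHWC; otherwise
// vectorized loads along the innermost dimension (C) are not possible.
// Whether that layout is replayed automatically or set explicitly by the
// scheduler is exactly what is being discussed here.
TensorView* in_cache = in->cacheAfter();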

@naoyam (Collaborator) commented Dec 6, 2024

@zasdfgbnm For cacheAfter, I understand that we need to have a proper allocation order, but more strictly speaking, the allocation domain propagated from the input isn't actually the real allocation domain, right? The cache tensor should have some splits for parallelization that don't exist for the input tensor. So, the automatic propagation done by replayCasP isn't actually enough.

Yes, setting false to the flag of replayCasP would unblock the issue for now, but I think the allocation domain really needs actual scheduling rather than just being done automatically behind the loop scheduling.

@wujingyue (Collaborator, Author)

Yes, setting false to the flag of replayCasP would unblock the issue for now

We have to do that conditionally -- propagate only for the matmul scheduler

@zasdfgbnm (Collaborator) commented Dec 7, 2024

but more strictly speaking, the allocation domain propagated from the input isn't actually the real allocation domain, right? The cache tensor should have some splits for parallelization that don't exist for the input tensor. So, the automatic propagation done by replayCasP isn't actually enough.

It depends on how we define what a correct allocation domain is, which we haven't settled yet. For example, if a register tensor is:

Logical domain: [I0, I1]
Allocation domain: [I0, I1]
Loop domain: [BIDx{I0*I1/1024}, TIDx{256}, Vectorize{4}]
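
For concreteness, a sketch (mine, not from the thread) of a schedule that could produce such a loop domain while leaving the allocation domain untouched:

// tv is a register tensor with logical (and allocation) domain [I0, I1].
tv->merge(0);      // [I0*I1]
tv->split(0, 4);   // [I0*I1/4, 4]
tv->split(0, 256); // [I0*I1/1024, 256, 4]
tv->axis(0)->parallelize(ParallelType::BIDx);
tv->axis(1)->parallelize(ParallelType::TIDx);
tv->axis(2)->parallelize(ParallelType::Vectorize);
// The allocation domain is still [I0, I1], even though each thread only ever
// holds 4 elements at a time.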

Then do we consider the allocation of this tensor correct? The answer could be:

  1. Yes. Although I0 and I1 are partially allocated, during lowering we have the capability of inferring their actual extents, so it is totally valid.
  2. No. The allocation domain should always faithfully represent how a tensor is allocated. Because we are not allocating I0 × I1 elements but only 4, this allocation domain is wrong.

1 is what we currently have in the legacy indexer. If we decide to keep 1 in the new indexer, then I believe we should also allow having an allocation of [I1, I0]. If this is the design decision, then the allocation domain propagated from the input should be pretty much sufficient for scheduling.

If our design decision is 2, then of course we should schedule the allocation domain separately, instead of propagating it along with the loop domain.

I personally don't have a preference between these two options, because to me both have pros and cons. I like 1 because it makes writing a scheduler easier; under 2, the scheduler needs to bookkeep the allocation domain. I dislike 1 because, what is the point of having an allocation domain if we are not faithfully respecting it? I like 2 because it makes the concept of "allocation domain" mentally easier to think about, and I dislike it because it is more difficult to write a scheduler. For example, if all I want is to say "[I1, I0] is the correct stride order", I have to do all the heavy work of analyzing the loop domain and figuring out the correct allocation domain, instead of just naively setting it to [I1, I0] as in 1.

@naoyam (Collaborator) commented Dec 10, 2024

I see your point. From the scheduler's point of view, what matters is usually just the ordering, so #1 should be good enough for the scheduler. We would need to make sure that lowering can infer the true allocation domain without any ambiguity, but I think that would be the case.

In any case, I still prefer this approach:

Would a quick unblock be to just set this flag to false wherever it is used? I believe it was used in cacheBefore/cacheAfter.

And have the matmul scheduler set the required ordering of the allocation domain explicitly.

If we all agree on this, can someone volunteer to change the matmul scheduler?

@wujingyue (Collaborator, Author)

If we all agree on this, can someone volunteer to change the matmul scheduler?

Add a knob like replay_allocation to cacheInputs and set that to true only in the matmul scheduler? If yes, I'm happy to take that.
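
Purely as an illustration of that proposal (the parameter name, its position, and the assumption that cacheInputs keeps its current shape are all hypothetical, not the actual API), the call sites might end up looking like:

// Hypothetical knob, off by default.
// matmul scheduler opts in to replaying the allocation domain:
auto matmul_caches =
    scheduler_utils::cacheInputs(fusion, /*unroll=*/true, /*replay_allocation=*/true);

// every other scheduler keeps the current default:
auto other_caches = scheduler_utils::cacheInputs(fusion, /*unroll=*/true);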

@naoyam (Collaborator) commented Dec 10, 2024

@wujingyue Yes, that should be good enough, although I still think we should do something with the matmul scheduler, since even @jacobhinkle was surprised by the automatic reordering.

wujingyue self-assigned this issue and unassigned naoyam Dec 10, 2024