Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify behavior around context loss and error reporting. #744

Draft
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

inexorabletash
Copy link
Member

@inexorabletash inexorabletash commented Jul 29, 2024

Based on @mingmingtasd's work in the Chromium prototype implementation.

A lost promise attribute is added to MLContext, which resolves when the context is lost, and provides an implementation-defined message explaining the reason. Synchronous and asynchronous actions depending on the context will fail if the promise is settled.

This also modifies the omnipresent "has built" tests on MLGraphBuilder methods to be a "can built" test which also checks that the builder's context is not lost.

For #477


Preview | Diff

Based on @mingmingtasd's work in the Chromium prototype
implementation.

For webmachinelearning#477
@inexorabletash
Copy link
Member Author

Some points for discussion:

  • This uses the internal slot [[lost]] as both (1) the promise value and (2) how to check if a context "is lost". Maybe an explicit boolean internal slot would be clearer, albeit more verbose?
  • In the implementation, the "lost" state is equivalent to the context's mojo remote being connected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present. I did add an explicit check in compute(); I think that'll happen in the implementation indirectly because the dispatch would fail, but an explicit check seemed like a good idea?
  • The MLContext section of the spec could be reorganized a bit. Separate PR?

@mingmingtasd and @reillyeon could you do an initial review?

index.bs Outdated Show resolved Hide resolved
index.bs Show resolved Hide resolved
@mingmingtasd
Copy link
Contributor

mingmingtasd commented Jul 30, 2024

Thanks! @inexorabletash @reillyeon
I am also working on a chromium CL to expose MLContext::destroy to make a context lost.

In the implementation, the "lost" state is equivalent to the context's mojo remote being connected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present.

I will check context lost for all of the synchronous and asynchronous actions depending on the context in my chromium CL.

@reillyeon
Copy link
Contributor

Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

@inexorabletash
Copy link
Member Author

Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

Thoughts on how to do that and to what level of detail?

  • in compute(), is that handled by execute graph returning error, or do we need more steps?
  • in build() it looks like failure other than "does not support a requested feature" isn't covered.

One generic approach for both of those would be to replace: "Queue an ML task with global to resolve promise with ..." with:

  1. Queue an ML task with global to run these steps:
    1. If context is lost, then reject promise with an InvalidStateError.
    2. Resolve promise with ...

... which I think covers the script-observable behavior, but not that the async steps internally should fail.

@fdwr
Copy link
Collaborator

fdwr commented Aug 1, 2024

@RafaelCintron

@reillyeon
Copy link
Contributor

There are two separate considerations: what happens to the promise returned by a method and what happens to an asynchronous operation itself. Operations like build(), compute() and dispatch() aren't abortable (or at least, the specification should not require that they are aborted, only that whether or not they are aborted is not visible to script). We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost. We could either specify this exactly the way the Chromium implementation works, by having each object hold a list of all promises and reject them, or we could add a note to the "in parallel" steps which says "when the context is lost or this is destroyed, reject promise and potentially abort these steps". The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.

@huningxin
Copy link
Contributor

@inexorabletash

  • in compute(), is that handled by execute graph returning error, or do we need more steps?

I suppose the following step of execute graph could handle the device lost error:

Issue a compute request to graph.[[implementation]] given name and inputResources and wait for completion.

If that returns an error, then return an "OperationError" DOMException.

A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

  • in build() it looks like failure other than "does not support a requested feature" isn't covered.

Or we could just say "If that returns an error, then queue an ML task with global to reject promise with an 'OperationError' DOMException" similar to compute()?

One generic approach for both of those would be to replace: "Queue an ML task with global to resolve promise with ..." with:

  1. Queue an ML task with global to run these steps:

    1. If context is lost, then reject promise with an InvalidStateError.
    2. Resolve promise with ...

The new steps may not run because these steps may already be aborted (by "abort these steps") if previous steps fail .

@bbernhar
Copy link

bbernhar commented Aug 6, 2024

We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost.

+1. MLContext.destroy() steps could be spec to the following:

Script timeline:

  1. Immediately reject all pending promises made off |this| context.
  2. Issue steps for |this| context on the device/queue timeline.

Device/queue timeline:

  1. Wait for async operations on the device to complete.
  2. Then Lose |this| context.

Note: impl. is always free to abort pending async ops immediately (and release buffers).

@inexorabletash inexorabletash marked this pull request as draft August 6, 2024 19:29
@inexorabletash
Copy link
Member Author

A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

I think this would be script observable - i.e. what order do the lost and compute() promises settle. So we should spec it explicitly.

@mingmingtasd
Copy link
Contributor

The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.

@reillyeon @huningxin @bbernhar I think the DirectML backend here has considered this, MLBuffer can be destroyed prior to Compute() and Dispatch but its resource will be kept alive anyway until all the pending GPU work done. For example, if you destroy a MLBuffer used by Dispatch before the Dispatch is done, Dispatch can still complete but you can't continue to call ReadBuffer to read back the results. It seems OK and as expected?

@reillyeon
Copy link
Contributor

I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.

Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.

@mingmingtasd
Copy link
Contributor

I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.

Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.

I have submitted a CL to expose MLGraph::destroy : https://chromium-review.googlesource.com/c/chromium/src/+/5799069

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants