-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
c: define async version of ArrowArrayStream #811
Comments
Presumably we want more than ArrayStream to be async. In principle, any API that might require a database round trip should support it though in practice perhaps that might be unwieldy. Of course, every language has different idioms and capabilities when it comes to async and maybe it's not realistic to expose this in the C API and expect to be able to interact with it from all languages. The simplest thing is a callback which says "I'm done" and both C# and Rust can be made to work with that. I have no idea about Go. Java might need to spawn a thread to make that work (but with green threads on the horizon that could be okay). The "straw person" would be something like -- and this is really rough --
The caller populates ArrowAsyncInfo with caller_data and completion handler and initializes private_data to nullptr. It then calls ConnectionGetTableSchemaAsync, which can return an error, success indicating synchronous completion or success indicating initiation of an asynchronous operation. In the latter case, it also populates the private_data field. When the operation succeeds or fails, the driver cleans up private_data before calling the completion function with the status of the operation. The consumer may call AdbcCancel in order to terminate the operation. This should either fail or return asynchronously, and cancellation isn't complete until the completion function is called with a cancelled status. Note that an operation may still succeed after it has been cancelled due to timing issues. The driver may have to play games with sentinel values in the private_data field and interlocked exchanges in order to make concurrency work out with cancellation. |
Yup, this issue was just to sketch out the basic C Data Interface changes. Though, given the hoops we had to jump through to try to propagate ADBC error metadata through that boundary, I wonder if that's worth it and if we shouldn't instead use entirely ADBC interfaces (with ADBC error handling) instead. |
Thanks for the sketch! I was imagining something like that, but this is actually concrete. CC @zeroshade |
Thanks for the sketch @CurtHagenlocher I plan on starting to tackle this more formally in the next couple weeks. |
I've been starting to play around with this, and the management of the data structures is a bit annoying. It also isn't very consistent with other parts of the Arrow C API where the receiver of a call is free to move structures around in memory as long as it preserves the bits. Here's an alternative whose memory management is more like the synchronous API:
In this variation, the caller allocates (and is free to move) The primary drawback of this variation is that it doesn't cleanly support synchronous completion of the call. On the other hand, there was nothing in the initial sketch which would have prevented the completion callback from happening before the original async call returned, and that fact is a bit more obvious with this approach. Edited to add a synchronous error to the function signature so that e.g. parameter validation could return an error immediately instead of having to go through the callback. |
@CurtHagenlocher @lidavidm What do you two think about the following idea: struct AsyncArrowStream {
int (*on_schema)(struct AsyncArrowStream* self, struct ArrowSchema* out,
AdbcStatusCode status, struct AdbcError* error);
int (*on_next)(struct AsyncArrowStream* self, struct ArrowDeviceArray* out,
AdbcStatusCode status, struct AdbcError* error);
void (*release)(struct AsyncArrowSTream* self);
void* private_data;
}; Which would be used like: AdbcStatusCode AdbcStatementExecuteQuery(struct AdbcStatement* statement,
struct AsyncArrowStream* stream_handler,
int64_t* rows_affected, struct AdbcError* sync_error); The caller would populate the
The following rules should be observed by drivers:
This would work for any and all of the cases that work with ArrowArrayStreams. For the other scenarios, ( |
IMO this gets annoying, as you have to handle every error in two locations
int being basically bool here? Other questions
|
I gather that Any contract around releasing the
I think this would be a bit of a pain to manage. Why not add
What if the receiver e.g. can't allocate memory for an async task? I don't see how you get out of having to check the error in two locations in general, and it feels nicer for things like parameter validation failures to fail immediately.
Realistically, I think we'd need to push this API into Arrow itself :/. |
A minor modification of Matt's proposal for the sake of argument: struct ArrowArrayStreamHandler {
// For parity with the device array stream?
int32_t device_type;
// return codes would be errno, but no need to pass a message since it would just be
// passed to the handler anyway via on_error. If an error code is returned here the
// producer must pass it to on_error()
int (*on_schema)(struct AsyncArrowStream* self, struct ArrowSchema* out);
// Always called at least once with a released array to indicate the end except if on_error
// is called, in which case it would be the last call.
int (*on_next)(struct AsyncArrowStream* self, struct ArrowDeviceArray* out);
// For R I'd need this so that I can call some promise$reject() method, and I think it's needed
// Metadata here would be identical to ArrowSchema::metadata but the handler method
// would need to copy it (and message) if it needs to live after the handler function returns.
void (*on_error(struct AsyncArrowStream* self, int code, const char* message, const char* metadata);
void (*release)(struct AsyncArrowStream* self);
void* private_data;
}; ...which has a few features:
...and drawbacks:
|
+1
Hmm. For instance, in Rust, is it typical to have re:
Maybe we could add a |
I'm at best an enthusiastic dabbler in Rust, but I think you'd generally have |
Like @paleolimbot said, it would be errno like the C Data Interface callbacks
Yup. The callback can then be called using
Is the inclusion of a device type there for the caller to tell the producer what device it wants the results to be allocated on? If the producer can't comply with that, does it then call In your suggested interface, are the semantics for
Are there other scenarios I'm missing?
That works. I'm fine with that. Though we'd have to go the route that @lidavidm suggested if we're pushing this upstream into the C Data Interface ABI directly of the
The ADBC Error struct also contains other information such as an optional opaque binary blob for metadata values and such, which would not be able to be passed through this interface. So we'd probably need to expand this I agree that it might be easier in general if we push this upstream to the Arrow C ABI, but there are also benefits to being able to use the ADBC specific objects in the async interface too. If we do push upstream with this, would we want Async versions of both |
But the struct is caller-allocated, right? So the driver can immediately invoke On the other hand I agree it would be a better developer experience, and probably the caller already has to handle errors for other things, so I suppose I'm not entirely against it, so long as it's stipulated that it should really only be used for immediate validation errors (e.g. an IO error should basically never be returned through the immediate error). |
I vote no. |
I suspect I'm subconsciously shying away from that possibility because I expect it to increase the risk of accidental deadlocks. But writing concurrent code is hard, and if the contract allows it then the implementations had better act accordingly. If the async stream is not part of the Arrow C ABI, then there's no reason to conform to it at all. We could just unroll it into multiple callbacks on the primary function. Starting from my previous strawperson: struct ArrowAsyncInfo {
void* caller_data;
void* private_data;
void (*release)(struct ArrowAsyncInfo*);
};
AdbcStatusCode AdbcStatementExecuteQueryAsync(
struct AdbcStatement* statement,
struct ArrowAsyncInfo* asyncInfo,
void (*on_schema)(struct ArrowAsyncInfo*, AdbcStatus, long*, struct AdbcSchema*, struct AdbcError*),
void (*on_next)(struct ArrowAsyncInfo*, AdbcStatus, struct ArrowArray*, struct AdbcError*)
); |
Hmm, what's the purpose though? If you return an error to the driver...it just turns around and returns it back to you? |
(I mean purpose vs just a plain bool, or gRPC's approach where you explicitly call cancel) |
This admittedly gets quite awkward when trying to bind an array stream as a parameter, so ... probably not. |
One could also just state that if the consumer returns something non-zero that no other callbacks will be called (I don't have strong feelings about any of this). Having it be errno (or adbc status) just minimizes the number of types of things that get returned (makes it easier to reuse the return not OK macros).
I believe that the |
This was my original intention. Returning non-zero indicates that no other callbacks should be called. Only
Personally, I like putting more of the burden on the producer than the consumer. I'd prefer that the producer react to a non-zero return value by cancelling and then calling release than requiring the consumer to explicitly call cancel. That's just my personal opinion though, @CurtHagenlocher might have a better opinion on this given that he's likely to be an actual user of this API.
That's a good point. That could work pretty well then.
Are we intending that the binding should be asynchronous too? I was thinking that the binding API would still be synchronous and binding a normal It seems like we're approaching a small consensus on this, so if everyone is okay with it, i'll take the suggestions we've all discussed so far and try to come up with a potential proposal to upstream to the Arrow repo. Anyone object / think there's more we need to discuss before I try? |
…3632) ### Rationale for this change See apache/arrow-adbc#811 and #43631 ### What changes are included in this PR? Definition of `ArrowAsyncDeviceStreamHandler` and addition of it to the docs. I've sent an [email to the mailing list](https://lists.apache.org/thread/yfokmfkrmmp7tqvq0m3rshcvloq278cq) to start a discussion on this topic, so this may change over time due to those discussions. * GitHub Issue: #43631 Lead-authored-by: Matt Topol <[email protected]> Co-authored-by: Felipe Oliveira Carvalho <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Raúl Cumplido <[email protected]> Co-authored-by: Dane Pitkin <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: David Li <[email protected]> Co-authored-by: Ian Cook <[email protected]> Signed-off-by: Matt Topol <[email protected]>
We should make sure ArrowAsyncStream also includes the ArrowDeviceStream changes.
The async version of the ADBC API could be the only device-aware one. (The sync version would have to play ABI shenanigans, or duplicate every call...)
The text was updated successfully, but these errors were encountered: