pyarrow/adbc: no support for binding DECIMAL in Snowflake driver #2084
Comments
CC @zeroshade |
@pkit Currently, decimal is not supported when using Arrow record batch data to insert into a query via bind parameters. If you're doing a bulk insert, you should utilize bulk ingestion (`adbc_ingest`) instead.
By the same token, by default I believe we return all Snowflake NUMBER columns as decimals. Could you share the code you were getting that error from, if this doesn't answer your issue sufficiently? |
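For reference, a minimal sketch of the bulk-ingestion path described above, assuming the `adbc_driver_snowflake` package; the DSN, table name, and column are illustrative, not taken from this thread:

```python
from decimal import Decimal

import pyarrow as pa
from adbc_driver_snowflake import dbapi

# A decimal column -- exactly the type that cannot be used as a bind parameter.
table = pa.table({"amount": pa.array([Decimal("1.00"), Decimal("2.50")],
                                     type=pa.decimal128(38, 2))})

with dbapi.connect("user:password@account/database") as conn:
    with conn.cursor() as cur:
        # Bulk ingestion stages the Arrow data and COPYs it in, so it never
        # goes through the bind-parameter path.
        cur.adbc_ingest("MY_TABLE", table, mode="append")
    conn.commit()
```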
@zeroshade It would also be nice to know where the limitation comes from. Is it the Snowflake ADBC server implementation? |
That's precisely what the
The Snowflake ADBC driver utilizes Snowflake's Go client for communication, which does not support any decimal type as a bind parameter, and ADBC currently makes no attempt to cast when binding (we can optionally perform casting on receiving). Snowflake's server implementation currently does not accept Arrow data directly for bind parameter input. That's part of why, if at all possible, my recommendation here would be to use bulk ingestion (`adbc_ingest`) rather than binding decimal parameters. |
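A client-side complement to the receive-side casting mentioned above (plain pyarrow, not a driver feature; the helper name is made up, and casting to float64 trades exactness for bindability):

```python
import pyarrow as pa

def cast_decimals_to_float(table: pa.Table) -> pa.Table:
    """Cast every decimal column of `table` to float64, leaving the rest alone."""
    fields = [
        pa.field(f.name, pa.float64()) if pa.types.is_decimal(f.type) else f
        for f in table.schema
    ]
    return table.cast(pa.schema(fields))

# e.g. table = cast_decimals_to_float(cursor_read.fetch_arrow_table())
```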
@zeroshade Thanks, it all makes sense. Some questions: at which point is the data considered sent to SF when calling `adbc_ingest`? |
Backpressure and concurrency are handled in two ways:
My personal recommendation would be to consolidate batches into fewer streams and call `adbc_ingest` once with the whole stream rather than once per batch. |
Nice!

```python
from adbc_driver_snowflake.dbapi import connect  # import not shown in the original snippet

def main():
    # user/password/account/database are assumed to be defined elsewhere
    conn_read = connect(f"{user}:{password}@{account}/{database}", db_kwargs={
        "adbc.snowflake.sql.schema": "PUBLIC",
    })
    conn_write = connect(f"{user}:{password}@{account}/{database}", db_kwargs={
        "adbc.snowflake.sql.schema": "PUBLIC",
    })
    with conn_read.cursor() as cursor_read:
        with conn_write.cursor() as cursor_write:
            cursor_read.adbc_statement.set_options(**{"adbc.snowflake.rpc.prefetch_concurrency": 2, "adbc.rpc.result_queue_size": 10})
            cursor_read.execute("SELECT * FROM T1")
            for batch in cursor_read.fetch_record_batch():
                print(batch)
                cursor_write.adbc_ingest("T2", batch, mode="append")
```

The failure is:
And that's it. |
try doing:

```python
cursor_read.execute("SELECT * FROM T1")
cursor_write.adbc_ingest("T2", cursor_read.fetch_record_batch(), mode="append")
```

Also: we have an upstream PR waiting to be merged that addresses that specific issue: snowflakedb/gosnowflake#1196 |
@zeroshade can we get some of these recommendations documented? |
@lidavidm Does our documentation not already recommend using `adbc_ingest` for bulk inserts? |
Also any clarifications about backpressure or data types |
@zeroshade |
Okay, while the multiple calls to `adbc_ingest` will work, you can avoid paying the per-call overhead by wrapping your per-batch processing in a generator and passing a single record batch reader, i.e. something like the following:

```python
import pyarrow  # import not shown in the original snippet

def process_record_batches(input):
    for batch in input:
        # whatever pre-processing you want to perform on the batch
        print(batch)
        yield batch

with conn_read.cursor() as cursor_read:
    with conn_write.cursor() as cursor_write:
        cursor_read.adbc_statement.set_options(**{"adbc.snowflake.rpc.prefetch_concurrency": 2, "adbc.rpc.result_queue_size": 10})
        cursor_read.execute("SELECT * FROM T1")
        input = cursor_read.fetch_record_batch()
        reader = pyarrow.RecordBatchReader.from_batches(input.schema, process_record_batches(input))
        cursor_write.adbc_ingest("T2", reader, mode="append")
```

This way you don't have to pay the overhead multiple times; you pay it only once, and it effectively creates a full push/pull pipeline that handles backpressure, as each stage waits for the previous one (governed by the buffer and queue sizes set via the options I mentioned earlier). |
@zeroshade Ok, makes sense. Although I'm here at the mercy of |
@zeroshade Just FYI: calling `adbc_ingest` still produces the context error. |
@joellubi is there a workaround for that context error until your upstream change is merged? |
@zeroshade The simplest change we could make ourselves would be to set

@pkit Is the data still ingested when the context error is produced? In all our reproductions the error comes from a log rather than an exception, and the ingestion itself is still successful. |
By "front-load the pre-processing" do you mean that you would like to process the next batch while the current batch is being uploaded by |
Upstream PR (snowflakedb/gosnowflake#1196) was just merged. |
@joellubi Although I see "Success" for the COPY operation from the stage in the SF log, the data is not ingested.
Yes. But in this case I just send it for preprocessing somewhere else. So it's purely I/O wait. |
@joellubi let's bump our version of gosnowflake to pull in the fix so we can see if this fixes @pkit's issue.
We are in the process of working out an async interface that we can implement to allow for a non-blocking `adbc_ingest`. |
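Until such an interface exists, one possible client-side workaround (a hedged sketch, not part of ADBC; the helper name and queue size are made up, and it assumes the preprocessing preserves the schema) is to run the preprocessing in a background thread and feed `adbc_ingest` through a bounded queue, so the upload of one batch overlaps with the preparation of the next:

```python
import queue
import threading

import pyarrow as pa

def overlapped(input_reader, preprocess, maxsize=4):
    """Wrap a RecordBatchReader so preprocessing runs ahead of the consumer."""
    q = queue.Queue(maxsize=maxsize)  # bounded queue provides backpressure
    sentinel = object()

    def producer():
        # Error handling omitted for brevity.
        for batch in input_reader:
            q.put(preprocess(batch))
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    def batches():
        while (item := q.get()) is not sentinel:
            yield item

    return pa.RecordBatchReader.from_batches(input_reader.schema, batches())

# e.g. cursor_write.adbc_ingest("T2", overlapped(cursor_read.fetch_record_batch(), my_preprocess), mode="append")
```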
@zeroshade I've opened PR: #2091 |
@zeroshade I can confirm that the fix makes the error go away (I built the Python module and friends from PR #2091).
OK, it works. I noticed that a COMMIT was not sent; sending a commit explicitly worked. |
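For anyone hitting the same thing: the DBAPI layer follows PEP 249, so on a non-autocommit connection the ingest has to be committed explicitly, e.g. (a sketch reusing the names from the snippets above):

```python
with conn_write.cursor() as cursor_write:
    cursor_write.adbc_ingest("T2", reader, mode="append")
conn_write.commit()  # without this the ingested rows were not visible, per the comment above
```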
The conversation here has strayed quite a bit from the original problem report. It's good that the user's needs (may) have been solved, but the original problem still exists: you can only bind a few types to a Snowflake statement. This seems to be because the driver wants to convert them to types in Go's database/sql package, and that package has a decidedly minimal level of type support. I don't really know Go, but my read of gosnowflake suggests that there's no extension mechanism by which it would support types not in the sql package, which means we're stuck until gosnowflake is updated. Does that look correct? And should we close this issue and open a new one instead? |
What feature or improvement would you like to see?
I'm not sure the driver is even usable in its current state. Almost all numeric types are DECIMALs in Snowflake.
I've tried binding them directly from a RecordBatch and through the DBAPI "interface", but neither works.
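For context, a minimal sketch of the kind of bind attempt that fails (assuming a connection like the ones above; the table name and values are illustrative, and the exact error text is not reproduced here):

```python
from decimal import Decimal

# Binding decimal parameters through the DBAPI layer -- this is the path that
# currently fails, since the Snowflake driver cannot bind decimal values.
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO MY_TABLE (amount) VALUES (?)",
        [(Decimal("1.00"),), (Decimal("2.50"),)],
    )
```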