adbc_ingest is working better, but still has some COPY INTO issues after adbc 1.2 release #2128
Comments
Thanks for the quick feedback @davlee1972. Can you check whether the copy history view is consistent with 901 files missing from the stage? The error is reporting that after 5 retries with backoff, Snowflake's COPY INTO command still hasn't "noticed" all the files that were uploaded to the stage. Either that or the file count is off.
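A rough sketch of what that consistency check could look like through the same Python connection, using Snowflake's `INFORMATION_SCHEMA.COPY_HISTORY` table function. The table name, six-hour window, and connection URI are placeholders, not values from this thread:

```python
# Illustrative only: see which staged files COPY INTO actually reported loading.
# MY_TABLE, the time window, and the URI are placeholders.
import adbc_driver_snowflake.dbapi

uri = "user:password@account/database/schema?warehouse=wh"  # placeholder

with adbc_driver_snowflake.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT file_name, status, row_count, first_error_message
            FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
                TABLE_NAME => 'MY_TABLE',
                START_TIME => DATEADD(hour, -6, CURRENT_TIMESTAMP())))
            """
        )
        rows = cur.fetchall()

# Compare this count against the number of files the driver PUT to the stage.
print(f"files seen by COPY INTO: {len(rows)}")
```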
Thanks. It does seem from the log you shared that there may be an issue with their file tracking for COPY. I wonder if they changed the logic or system at any point in the last few months, as I would've expected this issue to have been reported in an earlier release otherwise.
I know the latest gosnowflake release included some improvements to context propagation/cancellation. I wonder if one of those changes could be causing this error. I'll try to reproduce.
I think there is a bug with the context manager in the latest Snowflake Go driver. The error I'm seeing is a disconnect while COPY INTO is running. I haven't been able to figure out a workaround except to revert back to the ADBC 1.1.0 drivers. Unfortunately, other errors are now cropping up with pyarrow 17 when using the ADBC 1.1.0 drivers. That always happens when loading a 2nd file with a fresh connection; the first file always works. Downgrading to pyarrow 16.1.0 does not uninstall Arrow Go version 17, so I'm digging around trying to figure out how to clean up my environment.
I reinstalled the ADBC 1.2.0 drivers and ran them in debug mode. Here's a successful COPY INTO, which looks like some internal connection using the same session ID for all PUTs and COPY INTOs:
Then stuff starts getting weird. An internal connection is made to run what I'm assuming is the last "select count(*)", but: A. This select count(*) SQL statement tries to open a new connection with a different session ID and closes the old one??
For this run there were four COPY INTOs, which all failed when the select count(*) statement ran.
I found a temporary workaround by turning ingest_copy_concurrency off. All 12 parquet files (24 gigs) loaded fine, but I had to call adbc_ingest() for each file instead of stacking up 47 PUTs per file. I'll test this overnight on 150 gigs of parquet in ~700 daily parquet files.
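For reference, a minimal sketch of that workaround. It assumes the statement option is named `adbc.snowflake.statement.ingest_copy_concurrency` and that the DB-API cursor exposes the underlying statement as `adbc_statement`, per the ADBC Snowflake driver docs; the URI, directory, and table name are placeholders rather than the actual job:

```python
# Sketch of the workaround: disable concurrent COPY and ingest one file at a
# time. Option name and adbc_statement attribute are assumptions from the docs.
from pathlib import Path

import pyarrow.parquet as pq
import adbc_driver_snowflake.dbapi

uri = "user:password@account/database/schema?warehouse=wh"  # placeholder

with adbc_driver_snowflake.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        # Assumed option: 0 turns off concurrent COPY INTO, so the COPY runs
        # only after all PUTs for the current adbc_ingest() call have finished.
        cur.adbc_statement.set_options(
            **{"adbc.snowflake.statement.ingest_copy_concurrency": "0"}
        )
        # One adbc_ingest() call per parquet file instead of one giant stream.
        for path in sorted(Path("daily_parquet").glob("*.parquet")):
            cur.adbc_ingest("TARGET_TABLE", pq.read_table(path), mode="append")
    conn.commit()
```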
Thanks for all the details, very helpful for creating a repro. Curious to see how the overnight job goes.
I ran multiple tests, and setting the ingest COPY INTO concurrency to 0 worked across all of them. The problem looks like COPY INTO completions are not being tracked, and when the SELECT COUNT(*) runs at the end of the process it runs with a new session ID, which kills the old session ID the COPY INTOs are running under. This looks like a new bug which results in a partial COPY INTO result.
Thanks for bumping this @davlee1972. I'm expecting to have some bandwidth open up soon to help with this. Generally the process should cancel all contexts once it believes that it is done. There are two possible changes I can think of that may be causing this: either the updates to context cancellation in recent releases of gosnowflake are changing this behavior, and/or something changed with how Snowflake's API handles COPY acknowledgement, which might cause API calls to unblock before the COPY is actually done. Do you know if the earlier ADBC driver versions you tested still work? If so, that may help rule out the possibility of this being an API-only change on Snowflake's side.
I tried to downgrade, but that introduced other issues with the older ADBC packages using unsupported newer versions of Apache Arrow Go. |
@joellubi I can confirm that ADBC Snowflake 1.0.0 does not have this issue. Given that the problem persists after upgrading the Snowflake Go driver while ADBC Snowflake 1.0.0 works well, I tend to believe the root cause is in ADBC. But, like @davlee1972, we find that older ADBC versions have other issues, such as OOMs or problems caused by older Arrow builds, so downgrading is really not an option.

Another observation to help you debug: there is randomness in this issue. Different retries leave different numbers of files remaining, and the load can eventually succeed for the same dataset if we retry enough times. I would appreciate it if you could raise the priority and find a fix soon. Since ADBC Snowflake does not support ingestion with temp tables, this ingestion instability risks corrupting target Snowflake tables when it happens, which makes it unsuitable for production ETL pipelines.
What happened?
I removed all of my previous workarounds and tried sending 24 gigs across 12 parquet files, containing 370 million rows, via adbc_ingest().
From the logs I can see 1666 parquet files being generated (assuming these are 10 MB in size by default) and PUT.
But ultimately it fails with errors after 5 attempts to run the COPY INTOs.
I'm going to try my old workarounds, using adbc_ingest() with one parquet file at a time, etc.
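For context, the call pattern being described is roughly the following: a single adbc_ingest() call streaming the whole dataset, which the driver then splits into small parquet files, PUTs to a stage, and loads with COPY INTO. This is a sketch with placeholder paths, table name, and URI, not the actual script:

```python
# Rough sketch of the failing scenario: one adbc_ingest() call streaming ~24 GB
# of parquet. All names, paths, and the URI are placeholders.
import pyarrow.dataset as ds
import adbc_driver_snowflake.dbapi

uri = "user:password@account/database/schema?warehouse=wh"  # placeholder

# Stream record batches rather than materializing 370M rows in memory.
reader = ds.dataset("parquet_dir/", format="parquet").scanner().to_reader()

with adbc_driver_snowflake.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        # The driver re-chunks the stream into small parquet files, PUTs them
        # to a stage, and issues COPY INTO; the retry errors show up here.
        cur.adbc_ingest("MY_TABLE", reader, mode="append")
    conn.commit()
```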
Stack Trace
No response
How can we reproduce the bug?
No response
Environment/Setup
No response