
adbc_ingest is working better, but still has some COPY INTO issues after adbc 1.2 release #2128

Open
davlee1972 opened this issue Sep 5, 2024 · 15 comments
Labels
Type: bug Something isn't working

Comments

@davlee1972

What happened?

I removed all of my previous workarounds and tried sending 24 gigs across 12 parquet files, which contain 370 million rows, via adbc_ingest().

From the logs I can see 1666 parquet files being generated (assuming these are 10 MB each by default) and PUT to the stage..

But ultimately it fails with errors after 5 attempts to run COPY INTO(s).

[screenshot]

  File "../miniconda3/lib/python3.9/site-packages/adbc_driver_manager/dbapi.py", line 937, in adbc_ingest
    return _blocking_call(self._stmt.execute_update, (), {}, self._stmt.cancel)
  File "adbc_driver_manager/_lib.pyx", line 1569, in adbc_driver_manager._lib._blocking_call_impl
  File "adbc_driver_manager/_lib.pyx", line 1562, in adbc_driver_manager._lib._blocking_call_impl
  File "adbc_driver_manager/_lib.pyx", line 1295, in adbc_driver_manager._lib.AdbcStatement.execute_update
  File "adbc_driver_manager/_lib.pyx", line 260, in adbc_driver_manager._lib.check_error
adbc_driver_manager.InternalError: INTERNAL: some files not loaded by COPY command, 901 files remain after 5 retries
ERRO[0619]connection.go:410 gosnowflake.(*snowflakeConn).queryContextInternal error: context canceled
ERRO[0619]connection.go:410 gosnowflake.(*snowflakeConn).queryContextInternal error: context canceled

I'm going to try my old workarounds using adbc_ingest() with one parquet file at a time, etc..
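
For reference, here is a minimal sketch of the kind of call involved (the DSN, path, and warehouse are placeholders, not the exact script used here; the table name is taken from the logs further down):

import adbc_driver_snowflake.dbapi
import pyarrow.dataset as ds

URI = "user:password@account/database/schema?warehouse=wh"  # placeholder DSN

with adbc_driver_snowflake.dbapi.connect(URI) as conn:
    cur = conn.cursor()
    # Stream all 12 parquet files as a single Arrow record batch reader.
    # The driver re-chunks the stream into parquet files (~10 MB each by default),
    # PUTs them to an internal stage, and then runs COPY INTO against the stage.
    reader = ds.dataset("/data/risk_curve_hist", format="parquet").scanner().to_reader()
    cur.adbc_ingest("RISK_CURVE_HIST", reader, mode="append")
    conn.commit()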

Stack Trace

No response

How can we reproduce the bug?

No response

Environment/Setup

No response

@davlee1972 davlee1972 added the Type: bug Something isn't working label Sep 5, 2024
@joellubi
Member

joellubi commented Sep 5, 2024

Thanks for the quick feedback @davlee1972. Can you check whether the copy history view is consistent with 901 files missing from the stage?

The error is reporting that after 5 retries with backoff, Snowflake's COPY INTO command still hasn't "noticed" all the files that were uploaded to the stage. Either that or the file count is off.
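
One way to run that check from the same session (a hedged sketch: `conn` is assumed to be the dbapi connection used for the ingest, the table name is just an example, and COPY_HISTORY is Snowflake's information-schema table function):

cur = conn.cursor()
cur.execute(
    """
    SELECT file_name, status, last_load_time
    FROM TABLE(information_schema.copy_history(
        table_name => 'RISK_CURVE_HIST',
        start_time => DATEADD(hour, -2, CURRENT_TIMESTAMP())))
    ORDER BY last_load_time
    """
)
# Compare this list against the files that were PUT to the stage.
for file_name, status, last_load_time in cur.fetchall():
    print(file_name, status, last_load_time)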

@davlee1972
Author

This is what I'm seeing when trying one file at a time.. The error is happening at random..
I won't be able to do any further investigation until next week..

adbc_driver_manager.InternalError: INTERNAL: some files not loaded by COPY command, 37 files remain after 5 retries

A COPY INTO command partially inserted files..

Sorted by start time:
[screenshot]

The 37 files PUT before this COPY INTO command failure match the 37 files reported in the error message.
The 5 retries at the end didn't pick them up either, if the messaging is right..
[screenshot]

@davlee1972
Author

davlee1972 commented Sep 5, 2024

I sent the issue to our Snowflake vendor support to see if they can look at the server..

[screenshot]

@joellubi
Member

joellubi commented Sep 5, 2024

I sent the issue to our Snowflake vendor support to see if they can look at the server..

Thanks. It does seem from the log you shared that there may be an issue with their file tracking for COPY. I wonder if they changed the logic or system any time in the last few months, as I would've expected this issue to have been reported in an earlier release otherwise.

@joellubi
Member

joellubi commented Sep 5, 2024

I know the latest gosnowflake release included some improvements to context propagation/cancelation. I wonder if one of those changes could be causing this error. I'll try to reproduce.

@lidavidm lidavidm added this to the ADBC Libraries 15 milestone Sep 6, 2024
@davlee1972
Author

I think there is a bug with the context handling in the latest Snowflake Go driver.. The error I'm seeing is a disconnect while COPY INTO is running.

I haven't been able to figure out a workaround except to revert back to the ADBC 1.1.0 drivers..

Unfortunately there are other errors cropping up now with pyarrow 17 when using the ADBC 1.1.0 drivers..

This always happens when loading a 2nd file with a fresh connection.. The first file always works..

Downgrading to pyarrow 16.1.0 does not uninstall Arrow Go version 17, so I'm digging around trying to figure out how to clean up my environment..

panic: cannot allocate memory

goroutine 764 [running]:
github.com/apache/arrow/go/v17/arrow/memory/mallocator.(*Mallocator).Reallocate(0xc0009e8008, 0x400000, {0x7f59765129b0?, 0x200000, 0x7f5e17d1565e?})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/arrow/memory/mallocator/mallocator.go:94 +0xe5
github.com/apache/arrow/go/v17/arrow/memory.(*Buffer).Reserve(0xc000fe5cc0, 0x20?)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/arrow/memory/buffer.go:124 +0x89
github.com/apache/arrow/go/v17/parquet/internal/encoding.(*BufferWriter).Reserve(0x7f5e1979c160?, 0x7f5e195433c0?)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/internal/encoding/types.go:387 +0xfb
github.com/apache/arrow/go/v17/parquet/internal/encoding.(*BufferWriter).Write(0xc000fdd260, {0xc1b734c000, 0xdc6d7, 0x12c251})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/internal/encoding/types.go:420 +0x6e
github.com/apache/arrow/go/v17/parquet/file.(*serializedPageWriter).WriteDataPage(0xc000fd5900, {0x7f5e197c7090, 0xc001a7a0a0})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/page_writer.go:375 +0x516
github.com/apache/arrow/go/v17/parquet/file.(*bufferedPageWriter).WriteDataPage(0xc013a9c7d0?, {0x7f5e197c7090?, 0xc001a7a0a0?})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/page_writer.go:459 +0x25
github.com/apache/arrow/go/v17/parquet/file.(*columnWriter).WriteDataPage(...)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_writer.go:214
github.com/apache/arrow/go/v17/parquet/file.(*columnWriter).buildDataPageV1(0xc000fe3980, 0x13ee2fa0?, 0xc1?, 0x10142a, {0xc1b63b0000, 0x100ba8, 0x200000})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_writer.go:345 +0x76c
github.com/apache/arrow/go/v17/parquet/file.(*columnWriter).FlushCurrentPage(0xc000fe3980)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_writer.go:302 +0x26f
github.com/apache/arrow/go/v17/parquet/file.(*columnWriter).commitWriteAndCheckPageLimit(0xc000fe3980, 0xc000fe3980?, 0x7f5a20706bc0?)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_writer.go:262 +0x4f
github.com/apache/arrow/go/v17/parquet/file.(*Float64ColumnChunkWriter).WriteBatchSpaced.func1(0xc00, 0x400)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_writer_types.gen.go:897 +0x373
github.com/apache/arrow/go/v17/parquet/file.doBatches(0x7f5e17666576?, 0x400, 0xc113ee3160)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_writer.go:634 +0x56
github.com/apache/arrow/go/v17/parquet/file.(*Float64ColumnChunkWriter).WriteBatchSpaced(0x7f5e17658cc5?, {0x7f5a20700bc0?, 0x50?, 0x48?}, {0xc1a8070000?, 0x1005e17660ee5?, 0x7f5d5866f278?}, {0x0, 0x0, 0x0}, ...)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_writer_types.gen.go:883 +0x105
github.com/apache/arrow/go/v17/parquet/pqarrow.writeDenseArrow(0xc013a9c230, {0x7f5e197e1ca8?, 0xc000fe3980}, {0x7f5e197db0f8, 0xc000e3f8c0?}, {0xc1a8070000, 0x20000, 0x100000}, {0x0, 0x0, ...}, ...)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:472 +0x2145
github.com/apache/arrow/go/v17/parquet/pqarrow.WriteArrowToColumn({0x7f5e197b8480, 0xc001888120}, {0x7f5e197e1ca8, 0xc000fe3980}, {0x7f5e197db0f8?, 0xc000e3f8c0?}, {0xc1a8070000, 0x20000, 0x100000}, {0x0, ...}, ...)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:232 +0x32f
github.com/apache/arrow/go/v17/parquet/pqarrow.(*arrowColumnWriter).Write(0xc113ee3b08, {0x7f5e197b8480, 0xc001888120})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:193 +0x4f3
github.com/apache/arrow/go/v17/parquet/pqarrow.(*FileWriter).WriteColumnChunked(0xc0018b4000, 0x7f5e1a9b5020?, 0xc113ee3bd8?, 0x1?)
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:330 +0xe5
github.com/apache/arrow/go/v17/parquet/pqarrow.(*FileWriter).WriteColumnData(0xc000dc2380?, {0x7f5e197db0f8, 0xc0011eedc0})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:339 +0xc9
github.com/apache/arrow/go/v17/parquet/pqarrow.(*FileWriter).WriteBuffered(0xc0018b4000, {0x7f5e197d8ee0, 0xc001066360})
        /root/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:203 +0x3e5
github.com/apache/arrow-adbc/go/adbc/driver/snowflake.writeParquet(0x2fc6c655b1344a00?, {0x7f5e1979aa88?, 0xc0019a42a0}, 0x74852a1848ec0d8c?, 0xa00000, 0xad8b19d8d12c4235?, {{0x7f5e197ae3e0, 0xc0009e8008}, 0x0, 0x0, ...})
        /adbc/go/adbc/driver/snowflake/bulk_ingestion.go:345 +0x1cb
github.com/apache/arrow-adbc/go/adbc/driver/snowflake.runParallelParquetWriters.func3()
        /adbc/go/adbc/driver/snowflake/bulk_ingestion.go:409 +0x185
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 638
        /root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96
Aborted (core dumped)

@davlee1972
Author

davlee1972 commented Sep 10, 2024

I reinstalled the ADBC 1.2.0 drivers and ran them in debug mode..

Here's a successful COPY INTO, which looks like some internal connection using the same session id for all PUTs and COPY INTO(s):

DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: TIMESTAMP_OUTPUT_FORMAT, value: YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_PREFETCH_THREADS, value: 4  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: TIME_OUTPUT_FORMAT, value: HH24:MI:SS  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_RESULT_CHUNK_SIZE, value: 160  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: TIMESTAMP_TZ_OUTPUT_FORMAT, value:   LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_SESSION_KEEP_ALIVE, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_OUT_OF_BAND_TELEMETRY_ENABLED, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: QUERY_CONTEXT_CACHE_SIZE, value: 5  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_METADATA_USE_SESSION_DATABASE, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: ENABLE_STAGE_S3_PRIVATELINK_FOR_US_EAST_1, value: true  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_RESULT_PREFETCH_THREADS, value: 1  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: TIMESTAMP_NTZ_OUTPUT_FORMAT, value: YYYY-MM-DD HH24:MI:SS.FF3  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_METADATA_REQUEST_USE_CONNECTION_CTX, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_HONOR_CLIENT_TZ_FOR_TIMESTAMP_NTZ, value: true  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_MEMORY_LIMIT, value: 1536  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_TIMESTAMP_TYPE_MAPPING, value: TIMESTAMP_LTZ  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: TIMEZONE, value: America/New_York  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_RESULT_PREFETCH_SLOTS, value: 2  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_TELEMETRY_ENABLED, value: true  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_DISABLE_INCIDENTS, value: true  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_USE_V1_QUERY_API, value: true  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_RESULT_COLUMN_CASE_INSENSITIVE, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: BINARY_OUTPUT_FORMAT, value: HEX  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CSV_TIMESTAMP_FORMAT, value:   LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_ENABLE_LOG_INFO_STATEMENT_PARAMETERS, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_TELEMETRY_SESSIONLESS_ENABLED, value: true  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: DATE_OUTPUT_FORMAT, value: YYYY-MM-DD  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_FORCE_PROTECT_ID_TOKEN, value: true  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_CONSENT_CACHE_ID_TOKEN, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_STAGE_ARRAY_BINDING_THRESHOLD, value: 65280  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_SESSION_KEEP_ALIVE_HEARTBEAT_FREQUENCY, value: 3600  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: AUTOCOMMIT, value: true      LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_SESSION_CLONE, value: false  LOG_SESSION_ID=341236648727106
DEBU[0698]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: TIMESTAMP_LTZ_OUTPUT_FORMAT, value:   LOG_SESSION_ID=341236648727106
DEBU[0698]rows.go:140 gosnowflake.(*snowflakeRows).Columns Rows.Columns
DEBU[0698]rows.go:80 gosnowflake.(*snowflakeRows).Close Rows.Close            LOG_SESSION_ID=341236648727106
INFO[0699]connection.go:399 gosnowflake.(*snowflakeConn).queryContextInternal Query: "COPY INTO IDENTIFIER(?) FROM @ADBC$BIND MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE", [{ 0 "RISK_CURVE_HIST"}]
INFO[0699]connection.go:117 gosnowflake.(*snowflakeConn).exec parameters: map[]
DEBU[0699]log.go:176 gosnowflake.(*defaultLogger).Debugf TYPE: string, "RISK_CURVE_HIST"
INFO[0699]connection.go:126 gosnowflake.(*snowflakeConn).exec bindings: map[1:{TEXT 0xc022594600  <nil>}]
INFO[0699]restful.go:236 gosnowflake.postRestfulQueryHelper params: &map[]
INFO[0699]retry.go:299 gosnowflake.(*retryHTTP).execute retryHTTP.totalTimeout: 0s
DEBU[0699]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 0
INFO[0700]restful.go:253 gosnowflake.postRestfulQueryHelper postQuery: resp: &{200 OK 200 HTTP/1.1 1 1 map[Cache-Control:[no-cache, no-store] Connection:[keep-alive] Content-Type:[application/json] Date:[Mon, 09 Sep 2024 23:25:40 GMT] Expect-Ct:[enforce, max-age=3600] Keep-Alive:[timeout=175] Vary:[Accept-Encoding, User-Agent] X-Content-Type-Options:[nosniff] X-Frame-Options:[deny] X-Xss-Protection:[1; mode=block]] 0xc028ca4fc0 -1 [chunked] false true map[] 0xc02262f300 0xc0226b64d0}
INFO[0700]connection.go:156 gosnowflake.(*snowflakeConn).exec Success: true, Code: -1
DEBU[0700]log.go:176 gosnowflake.(*defaultLogger).Debugf adding query context: {0 1725924340335596 0 }
INFO[0700]connection.go:190 gosnowflake.(*snowflakeConn).exec Exec/Query SUCCESS

Then stuff starts getting weird.. An internal connection is made to run what I'm assuming is the final "SELECT COUNT(*)", but:

A. This SELECT COUNT(*) SQL statement tries to open a new connection with a different session id and closes the old one??
DEBU[0700]rows.go:140 gosnowflake.(*snowflakeRows).Columns Rows.Columns
DEBU[0700]rows.go:80 gosnowflake.(*snowflakeRows).Close Rows.Close LOG_SESSION_ID=341236648727106
B. But it looks like COPY INTO(s) are still running.
C. It looks like the COPY INTO connections die with WARNings, which orphans the COPY INTO(s).
D. Then we start getting context canceled ERRORs.

[screenshot]

DEBU[0700]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_STAGE_ARRAY_BINDING_THRESHOLD, value: 65280  LOG_SESSION_ID=341236648727106
DEBU[0700]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_SESSION_KEEP_ALIVE_HEARTBEAT_FREQUENCY, value: 3600  LOG_SESSION_ID=341236648727106
DEBU[0700]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: AUTOCOMMIT, value: true      LOG_SESSION_ID=341236648727106
DEBU[0700]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: CLIENT_SESSION_CLONE, value: false  LOG_SESSION_ID=341236648727106
DEBU[0700]connection_util.go:185 gosnowflake.(*snowflakeConn).populateSessionParameters parameter. name: TIMESTAMP_LTZ_OUTPUT_FORMAT, value:   LOG_SESSION_ID=341236648727106
DEBU[0700]rows.go:140 gosnowflake.(*snowflakeRows).Columns Rows.Columns
DEBU[0700]rows.go:80 gosnowflake.(*snowflakeRows).Close Rows.Close            LOG_SESSION_ID=341236648727106
INFO[0700]connection.go:399 gosnowflake.(*snowflakeConn).queryContextInternal Query: "SELECT COUNT(*) FROM IDENTIFIER(?)", [{ 1 "RISK_CURVE_HIST"}]
INFO[0700]connection.go:117 gosnowflake.(*snowflakeConn).exec parameters: map[]
DEBU[0700]log.go:176 gosnowflake.(*defaultLogger).Debugf TYPE: string, "RISK_CURVE_HIST"
INFO[0700]connection.go:126 gosnowflake.(*snowflakeConn).exec bindings: map[1:{TEXT 0xc077c3f910  <nil>}]
WARN[0700]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Get "https://?????.privatelink.snowflakecomputing.com:443/queries/01b6ecdd-0b07-38b5-0001-365ae554753a/result": context canceled                        . retrying...
WARN[0700]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Get "https://?????.privatelink.snowflakecomputing.com:443/queries/01b6ecdd-0b07-38b3-0001-365ae5546612/result": context canceled                        . retrying...
INFO[0700]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 2.022s. to timeout: 0s. retrying
INFO[0700]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 1.385s. to timeout: 0s. retrying
WARN[0700]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Get "https://?????.privatelink.snowflakecomputing.com:443/queries/01b6ecdd-0b07-38b3-0001-365ae554656e/result": context canceled                        . retrying...
INFO[0700]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 2.645s. to timeout: 0s. retrying
INFO[0700]retry.go:379 gosnowflake.(*retryHTTP).execute retry count: 1, retry reason: 0
INFO[0700]retry.go:379 gosnowflake.(*retryHTTP).execute retry count: 1, retry reason: 0
ERRO[0700]connection.go:410 gosnowflake.(*snowflakeConn).queryContextInternal error: context canceled      
INFO[0700]connection.go:399 gosnowflake.(*snowflakeConn).queryContextInternal Query: "COPY INTO IDENTIFIER(?) FROM @ADBC$BIND MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE", [{ 0 "RISK_CURVE_HIST"}]

For this run there were four COPY INTO(s), which all failed when the SELECT COUNT(*) statement ran..

[screenshot]

@davlee1972
Author

I found a temporary workaround by turning ingest_copy_concurrency off.. All 12 parquet files (24 gigs) loaded fine, but I had to call adbc_ingest() for each file instead of stacking up 47 PUTs per file..

I'll test this overnight on 150 gigs of parquet in ~700 daily parquet files..

        # temporary fix for snowflake losing track of PUT files when running COPY INTO
        if driver == "snowflake":
            if not self.statement_kwargs:
                self.statement_kwargs = {}
            self.statement_kwargs["adbc.snowflake.statement.ingest_copy_concurrency"] = "0"
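
For anyone applying the same workaround without a wrapper class, here is a hedged sketch of setting the option directly on the dbapi cursor's underlying statement (this assumes the `adbc_statement` attribute and `set_options` call exposed by recent adbc-driver-manager releases; URI and `reader` are placeholders as in the earlier sketch):

import adbc_driver_snowflake.dbapi

with adbc_driver_snowflake.dbapi.connect(URI) as conn:
    cur = conn.cursor()
    # "0" turns the concurrent COPY scheduling off (the workaround described above).
    cur.adbc_statement.set_options(
        **{"adbc.snowflake.statement.ingest_copy_concurrency": "0"}
    )
    cur.adbc_ingest("RISK_CURVE_HIST", reader, mode="append")
    conn.commit()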

@joellubi
Member

Thanks for all the details, very helpful for creating a repro. Curious to see how the overnight job goes.

@Niivii

Niivii commented Sep 11, 2024

Same issue here:
[screenshot]

  File "/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 3964, in write_database
    n_rows = cursor.adbc_ingest(
             ^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/adbc_driver_manager/dbapi.py", line 937, in adbc_ingest
    return _blocking_call(self._stmt.execute_update, (), {}, self._stmt.cancel)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "adbc_driver_manager/_lib.pyx", line 1569, in adbc_driver_manager._lib._blocking_call_impl
  File "adbc_driver_manager/_lib.pyx", line 1562, in adbc_driver_manager._lib._blocking_call_impl
  File "adbc_driver_manager/_lib.pyx", line 1295, in adbc_driver_manager._lib.AdbcStatement.execute_update
  File "adbc_driver_manager/_lib.pyx", line 260, in adbc_driver_manager._lib.check_error
adbc_driver_manager.InternalError: INTERNAL: some files not loaded by COPY command, 1 files remain after 5 retries

@davlee1972
Author

I ran multiple tests and setting ingest_copy_concurrency to "0" worked across all of them..

The problem looks like COPY INTO completions are not being tracked, and when the SELECT COUNT(*) runs at the end of the process it runs with a new session id, which kills the old session id the COPY INTOs are running under..

This looks like a new bug which results in a partial COPY INTO result..

@davlee1972
Author

Just to confirm.. This is still a bug with ADBC 15 and the updated Go 1.12.0 driver. The final SELECT COUNT(*) is terminating any COPY INTO operations still in progress..

[screenshot]

@joellubi
Member

Thanks for bumping this @davlee1972. I'm expecting to have some bandwidth open up soon to help with this.

Generally the process should cancel all contexts once it believes that it is done. There are two possible changes I can think of that may be causing this: either the updates to context cancellation in recent releases of gosnowflake are changing this behavior, and/or something changed with how Snowflake's API handles COPY acknowledgement, which might cause API calls to unblock before the COPY is actually done.

Do you know if the earlier ADBC driver versions you tested still work? If so, that may help rule out the possibility of this being an API-only change on Snowflake's side.

@davlee1972
Author

davlee1972 commented Nov 13, 2024

I tried to downgrade, but that introduced other issues with the older ADBC packages running against unsupported newer versions of Apache Arrow Go.

@Zan-L

Zan-L commented Nov 14, 2024

@joellubi I can confirm that ADBC Snowflake 1.0.0 does not have this issue. Based on the fact that this problem persists after upgrading the Snowflake Go driver, while ADBC Snowflake 1.0.0 works well, I tend to believe the root cause comes from ADBC. But as @davlee1972 found, older ADBC versions have other issues such as OOM errors or those caused by the older Arrow build, so downgrading is really not an option.

Another observation to help you debug: there is randomness in this issue. Different retries result in different numbers of files remaining, and the same dataset can eventually succeed if we retry enough times.

I would appreciate it if you could raise the priority and find a fix soon. Currently, since ADBC Snowflake does not support ingestion with temp tables, this ingestion instability risks corrupting target Snowflake tables when it happens, which makes it unsuitable for production ETL pipelines.
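
In the meantime, one mitigation we can apply on the client side (a hedged sketch, not driver functionality; the helper name, staging-table naming, and row-count check are all illustrative) is to ingest into a scratch table and only append to the real target once the load is verified:

def safe_ingest(conn, target: str, reader, expected_rows: int) -> None:
    """Ingest into a scratch table, verify the row count, then append to the target."""
    staging = f"{target}__ADBC_STAGING"  # hypothetical naming convention
    cur = conn.cursor()
    # mode="replace" drops and recreates the scratch table on every run.
    cur.adbc_ingest(staging, reader, mode="replace")
    cur.execute(f"SELECT COUNT(*) FROM {staging}")
    (loaded,) = cur.fetchone()
    if loaded != expected_rows:
        raise RuntimeError(f"expected {expected_rows} rows, staged {loaded}")
    # The real target is only touched after the staged data has been verified.
    cur.execute(f"INSERT INTO {target} SELECT * FROM {staging}")
    cur.execute(f"DROP TABLE {staging}")
    conn.commit()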
