c/driver/postgresql: adbc_ingest fails for unsigned int and some temporal dtypes #1950
How do we think we should handle unsigned integers? Should they map to types that are equal in width and just raise on a possible overflow? Or should we inspect the data and maybe upscale to a larger width when it could hold the value losslessly?
I would recommend upcasting to the smallest integer type that can hold all of the values:

uint8 -> int16
uint16 -> int32
uint32 -> int64
uint64 -> ???
If I'm understanding your suggestion, you think we should do that across the board? i.e. uint16 will always be upcast to a 32-bit integer even if none of its values need that much width? I assume from that, in the uint64 case, we would still choose a BIGINT and just raise on overflow.
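For illustration only, here is a rough Python/pyarrow sketch of the across-the-board mapping being discussed, applied on the user side rather than inside the driver; the helper name `upcast_unsigned` is hypothetical and not part of any ADBC API. Each unsigned type is widened to the next signed type, and uint64 falls back to int64 so that out-of-range values surface as a cast error.

```python
import pyarrow as pa

# Next-wider signed type per unsigned bit width; uint64 has no lossless signed
# counterpart, so it maps to int64 and relies on the (safe) cast raising on overflow.
_WIDER_SIGNED = {8: pa.int16(), 16: pa.int32(), 32: pa.int64(), 64: pa.int64()}


def upcast_unsigned(table: pa.Table) -> pa.Table:
    """Return `table` with unsigned integer columns cast to wider signed types."""
    fields = []
    for f in table.schema:
        if pa.types.is_unsigned_integer(f.type):
            fields.append(pa.field(f.name, _WIDER_SIGNED[f.type.bit_width], nullable=f.nullable))
        else:
            fields.append(f)
    # Safe casting (the default) raises if a uint64 value exceeds the int64 range.
    return table.cast(pa.schema(fields))
```

Whether the driver should apply such a mapping implicitly, and what to do about uint64, is exactly the question being discussed here.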
I'm also a little unsure about date64 support - date32 just aligns perfectly with the postgres storage. I'm not sure what value there would be in trying to convert date64 to date32 in the driver versus having the user/an upstream tool take care of that.
Mostly for convenience, I think for date64 we can just divide and treat it as date32. (So it won't/can't roundtrip.) Unsigned types are a little iffier to me.
So date64 is always expected to be evenly divisible into a date, right? Likely just my lack of understanding of that type, but I am wary given that it counts milliseconds.
Yup. It's weird because it was meant to be 1:1 with Java's old Date (AFAIK), which uses the same representation. IMO, it probably shouldn't have made it, but so it goes.
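To make the divide-and-treat-as-date32 idea concrete, a small pyarrow sketch (my own illustration, not driver code): date64 stores milliseconds since the UNIX epoch while date32 stores days, so the conversion is a division by 86,400,000, which is what Arrow's built-in cast performs.

```python
import datetime
import pyarrow as pa
import pyarrow.compute as pc

MS_PER_DAY = 86_400_000  # date64 counts milliseconds since 1970-01-01; date32 counts days

arr64 = pa.array([datetime.date(1970, 1, 1), datetime.date(2024, 5, 1)], type=pa.date64())
arr32 = pc.cast(arr64, pa.date32())  # internally divides the stored millis by MS_PER_DAY

print(arr32.to_pylist())
# [datetime.date(1970, 1, 1), datetime.date(2024, 5, 1)]
```

A date64 value that is not a whole multiple of MS_PER_DAY is where the "won't/can't roundtrip" caveat above comes from.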
@WillAyd postgres doesn't really have a natural mapping for the standard u/ints used in most languages (why it, and seemingly every other DBMS, didn't choose this design I'll never know). So the user cannot expect their source data type to be respected when there's no guarantee that a commensurate type even exists. On top of that, if a user calls
One step towards #1950. Just doing microsecond support for now; nanosecond support and time32 can be added as a follow-up. Reading was already implemented but untested. Added some shared test data to go with the writer.
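As a hedged sketch of what that writer support enables (assuming a driver build that includes the change; the URI, data, and table name below are placeholders), a time64 column with microsecond resolution can then be ingested, presumably landing in a microsecond-resolution postgres TIME column:

```python
import datetime
import pyarrow as pa
import adbc_driver_postgresql.dbapi

# One time64["us"] column with placeholder data.
table = pa.table(
    {"t_us": pa.array([datetime.time(12, 34, 56, 789123)], type=pa.time64("us"))}
)

uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.adbc_ingest("time_test", table, mode="create")
    conn.commit()
```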
> uint64 -> ???

Maybe raising an error if the value is too large? As far as I know, values > i64.max cannot be ingested into postgres. Edit: there is a variable-sized decimal type that allows for up to 131,072 digits before the decimal point, but that feels too wonky. The max i64 value is 9,223,372,036,854,775,807, which is pretty darn big. If people are using values larger than that, they are probably dealing with something not well suited for basic int types.
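To illustrate the proposed failure mode (again a user-level pyarrow sketch, not driver code): a safe cast from uint64 to int64 already raises once a value exceeds i64.max, which is roughly the error a driver-side mapping to BIGINT would surface.

```python
import pyarrow as pa
import pyarrow.compute as pc

fits = pa.array([2**63 - 1], type=pa.uint64())  # 9,223,372,036,854,775,807
too_big = pa.array([2**63], type=pa.uint64())   # one past i64.max

pc.cast(fits, pa.int64())     # succeeds
pc.cast(too_big, pa.int64())  # raises an ArrowInvalid error (value out of int64 range)
```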
This is definitely the safest way to go, and had already been baked in:

```r
library(adbcdrivermanager)
library(nanoarrow)
con <- adbc_database_init(
adbcpostgresql::adbcpostgresql(),
uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
) |>
adbc_connection_init()
df <- tibble::tibble(
uint8_col = 246:255,
uint16_col = 65526:65535,
uint32_col = (.Machine$integer.max + 1):(.Machine$integer.max + 10)
)
array <- df |>
nanoarrow::as_nanoarrow_array(
schema = na_struct(
list(
uint8_col = na_uint8(),
uint16_col = na_uint16(),
uint32_col = na_uint32()
)
)
)
con |>
execute_adbc("DROP TABLE IF EXISTS adbc_test")
array |>
write_adbc(con, "adbc_test")
con |>
read_adbc("select * from adbc_test") |>
tibble::as_tibble()
#> # A tibble: 10 × 3
#>    uint8_col uint16_col uint32_col
#>        <int>      <int>      <dbl>
#>  1       246      65526 2147483648
#>  2       247      65527 2147483649
#>  3       248      65528 2147483650
#>  4       249      65529 2147483651
#>  5       250      65530 2147483652
#>  6       251      65531 2147483653
#>  7       252      65532 2147483654
#>  8       253      65533 2147483655
#>  9       254      65534 2147483656
#> 10       255      65535 2147483657
```

Unfortunately, the method we're using to efficiently insert (generate COPY data) requires that the types match exactly, so appending Arrow data that happens to have an unsigned integer column to an existing table will fail (a possible user-side workaround in Python is sketched after this comment):

```r
con |>
execute_adbc("DROP TABLE IF EXISTS adbc_test")
con |>
execute_adbc("CREATE TABLE adbc_test (uint8_col int2, uint16_col int2, uint32_col int4)")
array |>
write_adbc(con, "adbc_test", mode = "append")
#> Error in adbc_statement_execute_query(stmt): INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR: incorrect binary data format
#> CONTEXT: COPY adbc_test, line 1, column uint16_col
```

For a fresh insert of Arrow data (i.e., when we are forced to generate a CREATE TABLE statement), this should probably be inferred as
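A hedged Python counterpart of the append case above, reusing the hypothetical `upcast_unsigned()` helper sketched earlier (the URI, data, and target column types are placeholders): because the COPY path needs an exact type match, the append goes through once the Arrow columns are cast to the signed types the target table actually uses, assumed here to be int2/int4/int8.

```python
import pyarrow as pa
import adbc_driver_postgresql.dbapi

table = pa.table({
    "uint8_col": pa.array(range(246, 256), type=pa.uint8()),
    "uint16_col": pa.array(range(65526, 65536), type=pa.uint16()),
    "uint32_col": pa.array(range(2**31, 2**31 + 10), type=pa.uint32()),
})

uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.execute("DROP TABLE IF EXISTS adbc_test")
        cur.execute("CREATE TABLE adbc_test (uint8_col int2, uint16_col int4, uint32_col int8)")
        # uint8/uint16/uint32 -> int16/int32/int64, matching int2/int4/int8 exactly.
        cur.adbc_ingest("adbc_test", upcast_unsigned(table), mode="append")
    conn.commit()
```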
What happened?

adbc_ingest fails for any dataframe that contains an unsigned int dtype and some (but not all) temporal dtypes.

Environment/Setup

adbc_driver_postgresql==1.0.0
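For reference, a minimal Python sketch of the reported failure (the connection URI, data, and table name are placeholders, not taken from the original report):

```python
import pyarrow as pa
import adbc_driver_postgresql.dbapi

# Any dataframe/table with an unsigned integer column triggers the failure.
table = pa.table({"uint16_col": pa.array([1, 2, 3], type=pa.uint16())})

uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.adbc_ingest("uint_test", table, mode="create")  # reported to fail on 1.0.0
    conn.commit()
```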