c/driver/postgresql: adbc_ingest fails for unsigned int and some temporal dtypes #1950

Open · mcrumiller opened this issue Jun 30, 2024 · 11 comments
Labels: Type: bug (Something isn't working)

mcrumiller commented Jun 30, 2024

What happened?

adbc_ingest fails for any dataframe that contains an unsigned integer dtype, and for some (but not all) temporal dtypes.

import pyarrow as pa
from adbc_driver_postgresql import dbapi

uri = "postgresql://mcrumiller@winhost/mydb"


def try_ingest(data, dtype):
    """Attempt to write a single-column table using ADBC."""
    table = pa.Table.from_arrays([data], schema=pa.schema([("a", dtype)]))
    conn = dbapi.connect(uri)
    cur = conn.cursor()
    try:
        cur.adbc_ingest("test_dtype", table, mode="replace")
        print(f"{dtype}: OK")
    except Exception as e:
        print(f"{dtype}: {e}")
    finally:
        cur.close()
        conn.close()


# these all fail
try_ingest([0], pa.uint8())
try_ingest([0], pa.uint16())
try_ingest([0], pa.uint32())
try_ingest([0], pa.uint64())
try_ingest([0], dtype=pa.time32("s"))
try_ingest([0], dtype=pa.time32("ms"))
try_ingest([0], dtype=pa.time64("us"))
try_ingest([0], dtype=pa.time64("ns"))
try_ingest([0], dtype=pa.date64())

# these all pass
try_ingest([True, True, False], pa.bool_())
try_ingest([0], dtype=pa.int8())
try_ingest([0], dtype=pa.int16())
try_ingest([0], dtype=pa.int32())
try_ingest([0], dtype=pa.int64())
try_ingest([0], dtype=pa.float32())
try_ingest([0], dtype=pa.float64())
try_ingest([0], dtype=pa.timestamp("s"))
try_ingest([0], dtype=pa.timestamp("ms"))
try_ingest([0], dtype=pa.timestamp("us"))
try_ingest([0], dtype=pa.timestamp("ns"))
try_ingest([0], dtype=pa.duration("s"))
try_ingest([0], dtype=pa.duration("ms"))
try_ingest([0], dtype=pa.duration("us"))
try_ingest([0], dtype=pa.duration("ns"))
try_ingest([0], dtype=pa.date32())
try_ingest(["a", "b", "c"], pa.string())
try_ingest(["a", "b", "c"], pa.binary())
Output:

uint8: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion uint8
uint16: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion uint16
uint32: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion uint32
uint64: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion uint64
time32[s]: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion time32
time32[ms]: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion time32
time64[us]: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion time64
time64[ns]: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion time64
date64[ms]: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion date64
bool: OK
int8: OK
int16: OK
int32: OK
int64: OK
float: OK
double: OK
timestamp[s]: OK
timestamp[ms]: OK
timestamp[us]: OK
timestamp[ns]: OK
duration[s]: OK
duration[ms]: OK
duration[us]: OK
duration[ns]: OK
date32[day]: OK
string: OK
binary: OK

Environment/Setup

adbc_driver_postgresql==1.0.0

mcrumiller added the "Type: bug" label on Jun 30, 2024
mcrumiller changed the title from "adbc_ingest fails for unsigned int (and inherited) dtypes" to "adbc_ingest fails for unsigned int and some temporal dtypes" on Jun 30, 2024
lidavidm changed the title from "adbc_ingest fails for unsigned int and some temporal dtypes" to "c/driver/postgresql: adbc_ingest fails for unsigned int and some temporal dtypes" on Jun 30, 2024
WillAyd (Contributor) commented Jul 1, 2024

How do we think we should handle unsigned integers? Should they map to signed types of equal width and just raise on a possible overflow? Or should we inspect the data and maybe upcast to a larger width that can hold the values losslessly?

mcrumiller (Author) commented:

I would recommend upcasting to the smallest integer type that can hold all of the values:

  • uint8 -> int16
  • uint16 -> int32
  • uint32 -> int64
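
For illustration, a user-side cast along those lines can be done in pyarrow before calling adbc_ingest. This is only a sketch, not driver behaviour; the helper name upcast_unsigned is hypothetical, and it assumes the full range of each unsigned type should be preserved:

import pyarrow as pa

# Widen unsigned columns to the smallest signed type that covers their full
# range, per the mapping suggested above. uint64 has no lossless signed
# counterpart, so int64 is used and a checked cast raises on overflow.
_UNSIGNED_TO_SIGNED = {
    pa.uint8(): pa.int16(),
    pa.uint16(): pa.int32(),
    pa.uint32(): pa.int64(),
    pa.uint64(): pa.int64(),
}


def upcast_unsigned(table: pa.Table) -> pa.Table:
    """Return a copy of `table` with unsigned integer columns widened."""
    fields = [
        pa.field(f.name, _UNSIGNED_TO_SIGNED.get(f.type, f.type), nullable=f.nullable)
        for f in table.schema
    ]
    return table.cast(pa.schema(fields))


# Example: a uint8 column becomes int16 before ingestion.
t = pa.table({"a": pa.array([0, 255], type=pa.uint8())})
print(upcast_unsigned(t).schema)  # a: int16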

WillAyd (Contributor) commented Jul 1, 2024

If I'm understanding your suggestion, you think we should do that across the board, i.e. uint16 would always be upcast to a 32-bit integer even if none of its values need that much width?

I assume, then, that for the uint64 case we would still choose a BIGINT and just raise on overflow.

WillAyd (Contributor) commented Jul 1, 2024

I'm also a little unsure about date64 support; date32 aligns perfectly with the postgres storage. I'm not sure what value there would be in the driver converting date64 to date32 versus having the user or an upstream tool take care of that.

lidavidm (Member) commented Jul 1, 2024

Mostly for convenience; I think for date64 we can just divide by the milliseconds in a day and treat it as date32. (So it won't/can't round-trip.)
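
For what it's worth, a user can already do that conversion on the Arrow side before ingesting. A minimal sketch, assuming a pyarrow version that supports the date64 → date32 cast:

import datetime
import pyarrow as pa

# date64 stores milliseconds since the UNIX epoch (midnight-aligned), so
# casting to date32 divides by the 86,400,000 ms in a day.
arr = pa.array([datetime.date(2024, 6, 30)], type=pa.date64())
print(arr.cast(pa.date32()))  # date32 already ingests fine with the driver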

Unsigned types are a little iffier to me

WillAyd (Contributor) commented Jul 1, 2024

So date64 is always expected to be evenly divisible into a whole date, right? Likely just my lack of understanding of that type, but I am wary given that it counts milliseconds.

lidavidm (Member) commented Jul 1, 2024

Yup

https://github.com/apache/arrow/blob/edfa343eeca008513f0300924380e1b187cc976b/format/Schema.fbs#L248-L249

It's weird because it was meant to be 1:1 with Java's old Date (AFAIK), which uses the same representation. IMO it probably shouldn't have made it into the format, but so it goes.
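
Concretely, a spec-conforming date64 value is just the date32 day count scaled by the milliseconds in a day, so the division in question is exact. A quick sanity check (the specific day number is only an example):

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000

date64_value = 19_904 * MS_PER_DAY   # 2024-06-30 expressed as milliseconds
assert date64_value % MS_PER_DAY == 0
print(date64_value // MS_PER_DAY)    # 19904, the equivalent date32 value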

mcrumiller (Author) commented Jul 2, 2024

If I'm understanding your suggestion you just think we should do that across the board?

@WillAyd postgres doesn't have unsigned integer types, so there is no natural mapping for the standard unsigned ints used in most languages (why it, and seemingly every other DBMS, skipped them I'll never know). So the user cannot expect their source data type to be respected when there is no guarantee that a commensurate type even exists.

On top of that, if a user calls adbc_ingest, there is no guarantee that they won't later insert data that doesn't fit. As an example, suppose a user has a uint8 column with values that go up to 120. If adbc casts it to int8 because the current data "fits" and the user later tries to insert a 200, they get an unpleasant surprise, which should not happen, because they are, after all, inserting data of the same dtype they originally ingested.
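
That overflow scenario is easy to demonstrate on the Arrow side. In this sketch, pyarrow's default checked cast rejects the value an int8 column could not hold, while int16 covers the full uint8 range:

import pyarrow as pa

values = pa.array([120, 200], type=pa.uint8())

# Equal-width mapping: 200 does not fit in int8, so the checked cast raises.
try:
    values.cast(pa.int8())
except pa.ArrowInvalid as e:
    print(e)

# Widened mapping: int16 covers all of uint8, so this always succeeds.
print(values.cast(pa.int16()))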

lidavidm pushed a commit that referenced this issue Jul 2, 2024
lidavidm pushed a commit that referenced this issue Jul 3, 2024
One step towards #1950

Just doing microsecond support for now. Can add nanosecond support and
time32 as a follow up

Reading was already implemented but untested. Added some shared test
data to go with the writer
scottywakefield commented:

I would recommend upcasting to the smallest integer type that can hold all of the values:

  • uint8 -> int16
  • uint16 -> int32
  • uint32 -> int64

uint64 -> ???

mcrumiller (Author) commented Sep 10, 2024

uint64 -> ???

Maybe raising an error if the value is too large? As far as I know values > i64.max cannot be ingested into postgres.

Edit: there is a variable-sized decimal type that allows for up to 131,072 digits prior to the decimal point, but that feels too wonky.

The max i64 value is 9,223,372,036,854,775,807, which is pretty darn big. If people are using values larger than that, they are probably dealing with something not well suited to basic int types.
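
On the Arrow side, "BIGINT plus an error for values that are too large" amounts to a checked uint64 → int64 cast. A small sketch of that behaviour (not driver code):

import pyarrow as pa

ok = pa.array([0, 2**63 - 1], type=pa.uint64())
print(ok.cast(pa.int64()))       # fits in the signed 64-bit / BIGINT range

too_big = pa.array([2**63], type=pa.uint64())
try:
    too_big.cast(pa.int64())     # > i64.max, so the checked cast raises
except pa.ArrowInvalid as e:
    print(e)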

paleolimbot (Member) commented:

I would recommend upcasting to the smallest integer type that can hold all of the values:

  • uint8 -> int16
  • uint16 -> int32
  • uint32 -> int64

This is definitely the safest way to go, and it had already been baked into PostgresType::FromSchema(). In #2153 we updated the ingestion behaviour to actually use PostgresType::FromSchema(), so for a "create" (or an append to a table that ADBC created) this should now work:

library(adbcdrivermanager)
library(nanoarrow)

con <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
) |> 
  adbc_connection_init()

df <- tibble::tibble(
  uint8_col = 246:255,
  uint16_col = 65526:65535,
  uint32_col = (.Machine$integer.max + 1):(.Machine$integer.max + 10)
)

array <- df |> 
  nanoarrow::as_nanoarrow_array(
    schema = na_struct(
      list(
        uint8_col = na_uint8(),
        uint16_col = na_uint16(),
        uint32_col = na_uint32()
      )
    )
  )

con |> 
  execute_adbc("DROP TABLE IF EXISTS adbc_test")

array |> 
  write_adbc(con, "adbc_test")

con |> 
  read_adbc("select * from adbc_test") |> 
  tibble::as_tibble()
#> # A tibble: 10 × 3
#>    uint8_col uint16_col uint32_col
#>        <int>      <int>      <dbl>
#>  1       246      65526 2147483648
#>  2       247      65527 2147483649
#>  3       248      65528 2147483650
#>  4       249      65529 2147483651
#>  5       250      65530 2147483652
#>  6       251      65531 2147483653
#>  7       252      65532 2147483654
#>  8       253      65533 2147483655
#>  9       254      65534 2147483656
#> 10       255      65535 2147483657
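
For readers following along from the Python reproduction at the top, a rough equivalent of the R example above is sketched below. The URI, credentials, and table name are placeholders, and it assumes an adbc_driver_postgresql release that includes #2153:

import pyarrow as pa
from adbc_driver_postgresql import dbapi

uri = "postgresql://postgres:password@localhost:5432/postgres"  # placeholder

table = pa.table({
    "uint8_col": pa.array([246, 255], type=pa.uint8()),
    "uint16_col": pa.array([65526, 65535], type=pa.uint16()),
    "uint32_col": pa.array([2**31, 2**31 + 9], type=pa.uint32()),
})

with dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        # "replace" drops and recreates the table, so the driver picks the
        # widened postgres column types via PostgresType::FromSchema().
        cur.adbc_ingest("adbc_test", table, mode="replace")
        cur.execute("SELECT * FROM adbc_test")
        print(cur.fetch_arrow_table())
    conn.commit()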

Unfortunately, the method we use to insert efficiently (generating COPY data) requires that the types match exactly, so appending Arrow data that happens to have an unsigned integer column to an existing table will fail:

con |> 
  execute_adbc("DROP TABLE IF EXISTS adbc_test")
con |> 
  execute_adbc("CREATE TABLE adbc_test (uint8_col int2, uint16_col int2, uint32_col int4)")
array |> 
  write_adbc(con, "adbc_test", mode = "append")
#> Error in adbc_statement_execute_query(stmt): INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR:  incorrect binary data format
#> CONTEXT:  COPY adbc_test, line 1, column uint16_col
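
Until the COPY writer learns such conversions, a user-side workaround is to make the Arrow types line up exactly with the table, e.g. declare widened postgres columns and cast before appending. A sketch continuing from the Python example above (`table` and `cur` come from that sketch):

import pyarrow as pa

# Widened postgres columns that can hold the full unsigned ranges.
cur.execute("DROP TABLE IF EXISTS adbc_test")
cur.execute(
    "CREATE TABLE adbc_test (uint8_col int2, uint16_col int4, uint32_col int8)"
)

# Cast the Arrow data to the matching signed Arrow types so the generated
# COPY stream lines up 1:1 with the table definition.
signed = table.cast(pa.schema([
    ("uint8_col", pa.int16()),
    ("uint16_col", pa.int32()),
    ("uint32_col", pa.int64()),
]))
cur.adbc_ingest("adbc_test", signed, mode="append")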

uint64 -> ???

For a fresh insert of Arrow data (i.e., when we are forced to generate a CREATE TABLE statement), uint64 should probably be inferred as bigint because that preserves the "integerness", even though it will error at runtime if passed very large values. The workaround here would be to issue your own CREATE TABLE (bigint_col BYTEA), and we will need to support generating COPY for various postgres types outside the 1:1 mapping that the COPY writer currently uses.
