c/driver/postgresql: adbc_ingest fails for unsigned int and some temporal dtypes #1950
How do we think we should handle unsigned integers? Should they map to types that are equal in width and just raise on a possible overflow? Or should we inspect the data and maybe upscale to a larger width when it could hold the value losslessly?
I would recommend upcasting to the smallest integer type that can hold all of the values:

uint8 -> int16
uint16 -> int32
uint32 -> int64
uint64 -> ???
If I'm understanding your suggestion, you think we should do that across the board? i.e. uint16 will always be upcast to a 32-bit integer even if none of its values need that much width? I assume from that, in the uint64 case, we would still choose a BIGINT and just raise on overflow.
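For illustration only, here is a rough Python/pyarrow sketch of the across-the-board mapping being discussed, applied on the user side rather than inside the driver; the helper name `upcast_unsigned` is hypothetical and not part of any ADBC API. Each unsigned type is widened to the next signed type, and uint64 falls back to int64 so that out-of-range values surface as a cast error.

```python
import pyarrow as pa

# Next-wider signed type per unsigned bit width; uint64 has no lossless signed
# counterpart, so it maps to int64 and relies on the (safe) cast raising on overflow.
_WIDER_SIGNED = {8: pa.int16(), 16: pa.int32(), 32: pa.int64(), 64: pa.int64()}


def upcast_unsigned(table: pa.Table) -> pa.Table:
    """Return `table` with unsigned integer columns cast to wider signed types."""
    fields = []
    for f in table.schema:
        if pa.types.is_unsigned_integer(f.type):
            fields.append(pa.field(f.name, _WIDER_SIGNED[f.type.bit_width], nullable=f.nullable))
        else:
            fields.append(f)
    # Safe casting (the default) raises if a uint64 value exceeds the int64 range.
    return table.cast(pa.schema(fields))
```

Whether the driver should apply such a mapping implicitly, and what to do about uint64, is exactly the question being discussed here.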
I'm also a little unsure about date64 support - date32 just aligns perfectly with the postgres storage. I'm not sure what value there would be in trying to convert date64 to date32 in the driver versus having the user/an upstream tool take care of that.
Mostly for convenience, I think for date64 we can just divide and treat it as date32. (So it won't/can't roundtrip.) Unsigned types are a little iffier to me.
So date64 is always expected to be evenly divisible into a date, right? Likely just my lack of understanding of that type, but I am wary given that it counts milliseconds.
Yup. It's weird because it was meant to be 1:1 with Java's old Date (AFAIK), which uses the same representation. IMO, it probably shouldn't have made it, but so it goes.
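To make the divide-and-treat-as-date32 idea concrete, a small pyarrow sketch (my own illustration, not driver code): date64 stores milliseconds since the UNIX epoch while date32 stores days, so the conversion is a division by 86,400,000, which is what Arrow's built-in cast performs.

```python
import datetime
import pyarrow as pa
import pyarrow.compute as pc

MS_PER_DAY = 86_400_000  # date64 counts milliseconds since 1970-01-01; date32 counts days

arr64 = pa.array([datetime.date(1970, 1, 1), datetime.date(2024, 5, 1)], type=pa.date64())
arr32 = pc.cast(arr64, pa.date32())  # internally divides the stored millis by MS_PER_DAY

print(arr32.to_pylist())
# [datetime.date(1970, 1, 1), datetime.date(2024, 5, 1)]
```

A date64 value that is not a whole multiple of MS_PER_DAY is where the "won't/can't roundtrip" caveat above comes from.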
@WillAyd postgres doesn't really have a natural mapping for the standard u/ints used in most languages (why it, and seemingly every other DBMS, didn't choose this design I'll never know). So the user cannot expect their source data type to be respected when there's no guarantee that a commensurate type even exists. On top of that, if a user calls
One step towards #1950. Just doing microsecond support for now; nanosecond support and time32 can be added as a follow-up. Reading was already implemented but untested. Added some shared test data to go with the writer.
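As a hedged sketch of what that writer support enables (assuming a driver build that includes the change; the URI, data, and table name below are placeholders), a time64 column with microsecond resolution can then be ingested, presumably landing in a microsecond-resolution postgres TIME column:

```python
import datetime
import pyarrow as pa
import adbc_driver_postgresql.dbapi

# One time64["us"] column with placeholder data.
table = pa.table(
    {"t_us": pa.array([datetime.time(12, 34, 56, 789123)], type=pa.time64("us"))}
)

uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.adbc_ingest("time_test", table, mode="create")
    conn.commit()
```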
> uint64 -> ???

Maybe raising an error if the value is too large? As far as I know, values > i64.max cannot be ingested into postgres. Edit: there is a variable-sized decimal type that allows for up to 131,072 digits before the decimal point, but that feels too wonky. The max i64 value is 9,223,372,036,854,775,807, which is pretty darn big. If people are using values larger than that, they are probably dealing with something not well suited for basic int types.
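To illustrate the proposed failure mode (again a user-level pyarrow sketch, not driver code): a safe cast from uint64 to int64 already raises once a value exceeds i64.max, which is roughly the error a driver-side mapping to BIGINT would surface.

```python
import pyarrow as pa
import pyarrow.compute as pc

fits = pa.array([2**63 - 1], type=pa.uint64())  # 9,223,372,036,854,775,807
too_big = pa.array([2**63], type=pa.uint64())   # one past i64.max

pc.cast(fits, pa.int64())     # succeeds
pc.cast(too_big, pa.int64())  # raises an ArrowInvalid error (value out of int64 range)
```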
This is definitely the safest way to go, and had already been baked in:

```r
library(adbcdrivermanager)
library(nanoarrow)
con <- adbc_database_init(
adbcpostgresql::adbcpostgresql(),
uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
) |>
adbc_connection_init()
df <- tibble::tibble(
uint8_col = 246:255,
uint16_col = 65526:65535,
uint32_col = (.Machine$integer.max + 1):(.Machine$integer.max + 10)
)
array <- df |>
nanoarrow::as_nanoarrow_array(
schema = na_struct(
list(
uint8_col = na_uint8(),
uint16_col = na_uint16(),
uint32_col = na_uint32()
)
)
)
con |>
execute_adbc("DROP TABLE IF EXISTS adbc_test")
array |>
write_adbc(con, "adbc_test")
con |>
read_adbc("select * from adbc_test") |>
tibble::as_tibble()
#> # A tibble: 10 × 3
#>    uint8_col uint16_col uint32_col
#>        <int>      <int>      <dbl>
#>  1       246      65526 2147483648
#>  2       247      65527 2147483649
#>  3       248      65528 2147483650
#>  4       249      65529 2147483651
#>  5       250      65530 2147483652
#>  6       251      65531 2147483653
#>  7       252      65532 2147483654
#>  8       253      65533 2147483655
#>  9       254      65534 2147483656
#> 10       255      65535 2147483657
```

Unfortunately, the method we're using to efficiently insert (generate COPY data) requires that the types match exactly, so appending Arrow data that happens to have an unsigned integer column to an existing table will fail (a possible user-side workaround in Python is sketched after this comment):

```r
con |>
execute_adbc("DROP TABLE IF EXISTS adbc_test")
con |>
execute_adbc("CREATE TABLE adbc_test (uint8_col int2, uint16_col int2, uint32_col int4)")
array |>
write_adbc(con, "adbc_test", mode = "append")
#> Error in adbc_statement_execute_query(stmt): INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR: incorrect binary data format
#> CONTEXT: COPY adbc_test, line 1, column uint16_col
```

For a fresh insert of Arrow data (i.e., when we are forced to generate a CREATE TABLE statement), this should probably be inferred as
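A hedged Python counterpart of the append case above, reusing the hypothetical `upcast_unsigned()` helper sketched earlier (the URI, data, and target column types are placeholders): because the COPY path needs an exact type match, the append goes through once the Arrow columns are cast to the signed types the target table actually uses, assumed here to be int2/int4/int8.

```python
import pyarrow as pa
import adbc_driver_postgresql.dbapi

table = pa.table({
    "uint8_col": pa.array(range(246, 256), type=pa.uint8()),
    "uint16_col": pa.array(range(65526, 65536), type=pa.uint16()),
    "uint32_col": pa.array(range(2**31, 2**31 + 10), type=pa.uint32()),
})

uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.execute("DROP TABLE IF EXISTS adbc_test")
        cur.execute("CREATE TABLE adbc_test (uint8_col int2, uint16_col int4, uint32_col int8)")
        # uint8/uint16/uint32 -> int16/int32/int64, matching int2/int4/int8 exactly.
        cur.adbc_ingest("adbc_test", upcast_unsigned(table), mode="append")
    conn.commit()
```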
What happened?

adbc_ingest fails for any dataframe that contains an unsigned int dtype and some (but not all) temporal dtypes.

Environment/Setup

adbc_driver_postgresql==1.0.0
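For reference, a minimal Python sketch of the reported failure (the connection URI, data, and table name are placeholders, not taken from the original report):

```python
import pyarrow as pa
import adbc_driver_postgresql.dbapi

# Any dataframe/table with an unsigned integer column triggers the failure.
table = pa.table({"uint16_col": pa.array([1, 2, 3], type=pa.uint16())})

uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.adbc_ingest("uint_test", table, mode="create")  # reported to fail on 1.0.0
    conn.commit()
```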