-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
r: Support bool
and dictionary<string>
in AdbcStatementBind()
#1008
Comments
I think it seems reasonable that a driver R package should have the opportunity to have some control over how a given Perhaps: as_adbc_bind_array_stream <- function(x) {
UseMethod("as_adbc_bind_array_stream")
}
as_adbc_bind_array_stream.default <- function(x) {
nanoarrow::as_nanoarrow_array_stream(x)
} The SQLite driver could then do: as_adbc_bind_array_stream.adbcsqlite_statement <- function(x) {
if (is.data.frame(x)) {
# loop and reassign columns with unsupported types, possibly warning for
# assignments that will loose information
}
nanoarrow::as_nanoarrow_array_stream(x)
} |
@krlmlr You brought up the concept of roundtripping in a conversation offline: I think that may be more of an issue for conversion from the other side. In general, conversions to arrow (via nanoarrow or arrow) only happen without error if the vector (and any associated metadata - like timezone or unit) can be represented in Arrow losslessly. If you know the ptype that you want back, |
@paleolimbot Yeah, I think something like this goes a long way for solving this type of issues. As a more general remark: I don't think the current behavior is bad. It's just that I feel we need to give the user an option to opt in to (or out of) a more forgiving mechanism. We could keep the current behavior as "strict mode" (this might even be the default) and if strict mode not enabled (or disabled) something like the above kicks in where factors can end up as TEXT columns (in the case of SQLite). The down side of not being in "strict mode" is that you can't always round-trip: The TEXT column from before will yield a character vector if read back. Maybe keeping around "strict mode" addresses @krlmlr's comment as well: when enabled you will always be able to round trip (or error). @ianmcook Is this the type of discussion you wanted us to have? |
It should also be said that we could also:
Example of how you could avoid dictionary encoding without changes to nanoarrow or driver R packages: library(nanoarrow)
without_dictionary_encoding <- function(x) {
x$children <- lapply(x$children, without_dictionary_encoding)
if (is.null(x$dictionary)) {
x
} else {
x$dictionary
}
}
df <- palmerpenguins::penguins[1]
(schema <- infer_nanoarrow_schema(df))
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 1
#> ..$ species:<nanoarrow_schema dictionary(int32)<string>>
#> .. ..$ format : chr "i"
#> .. ..$ name : chr "species"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary:<nanoarrow_schema string>
#> .. .. ..$ format : chr "u"
#> .. .. ..$ name : chr ""
#> .. .. ..$ metadata : list()
#> .. .. ..$ flags : int 2
#> .. .. ..$ children : list()
#> .. .. ..$ dictionary: NULL
#> $ dictionary: NULL
as_nanoarrow_array(df)
#> <nanoarrow_array struct[344]>
#> $ length : int 344
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ species:<nanoarrow_array dictionary(int32)<string>[344]>
#> .. ..$ length : int 344
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<int32>[344][1376 b]> `0 0 0 0 0 0 0 0 0 ...`
#> .. ..$ dictionary:<nanoarrow_array string[3]>
#> .. .. ..$ length : int 3
#> .. .. ..$ null_count: int 0
#> .. .. ..$ offset : int 0
#> .. .. ..$ buffers :List of 3
#> .. .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. .. ..$ :<nanoarrow_buffer data_offset<int32>[4][16 b]> `0 6 15 21`
#> .. .. .. ..$ :<nanoarrow_buffer data<string>[21 b]> `AdelieChinstrapGentoo`
#> .. .. ..$ dictionary: NULL
#> .. .. ..$ children : list()
#> .. ..$ children : list()
#> $ dictionary: NULL
as_nanoarrow_array(df, schema = without_dictionary_encoding(schema))
#> <nanoarrow_array struct[344]>
#> $ length : int 344
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ species:<nanoarrow_array string[344]>
#> .. ..$ length : int 344
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 3
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data_offset<int32>[345][1380 b]> `0 6 12 18 2...`
#> .. .. ..$ :<nanoarrow_buffer data<string>[2268 b]> `AdelieAdelieAdelieAdel...`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL Created on 2023-09-01 with reprex v2.0.2 |
bool
and dictionary<string>
in AdbcStatementBind()
We had a somewhat extensive offline conversation about this, and I think the action item we decided on was to add support for |
And for reference |
Bool seems to be already supported; dictionary is implemented in #1224; blobs were already implemented but there was no way to roundtrip them until #1223. Bool example: library(adbcdrivermanager)
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
con <- adbc_connection_init(db)
df <- data.frame(x = c(TRUE, FALSE, NA))
write_adbc(df, con, "tbl")
read_adbc(con, "SELECT * from tbl") |>
as.data.frame()
#> x
#> 1 1
#> 2 0
#> 3 NA Created on 2023-10-25 with reprex v2.0.2 |
…binary types (#1224) This PR adds the ability to ingest dictionary-encoded string and binary columns. Part of addressing #1008. From the R bindings: ``` r library(adbcdrivermanager) db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:") con <- adbc_connection_init(db) df <- data.frame(x = factor(letters[1:10])) write_adbc(df, con, "tbl") read_adbc(con, "SELECT * from tbl") |> as.data.frame() #> x #> 1 a #> 2 b #> 3 c #> 4 d #> 5 e #> 6 f #> 7 g #> 8 h #> 9 i #> 10 j ``` <sup>Created on 2023-10-25 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
Some examples:
While the errors are correct in the sense that SQLite does not offer proper support for any of these types, DBI/RSQLite is more lenient:
The text was updated successfully, but these errors were encountered: