feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513

lupko · 2024-02-05T16:56:23Z

Hi all,

I am currently evaluating/trying out ABDC (Python) in the following context:

Have a Flight RPC data service - let's call it 'connector' - that is responsible for handling arbitrary SQL queries to PostgreSQL and producing results as stream of Arrow data
The result provided by 'connector' is then used by other services that do additional 'post-SQL' processing; the result of 'connector' is further 'massaged' using different algorithms (typically aggregations or ML algorithms).
The PostgreSQL data stored in NUMERIC columns typically represents some measurable value - which is used in the different algorithms applied by the other services.
It is a hard constraint that the results produced by the 'connector' can be used as-is in the subsequent steps.

In this context, the current ADBC PostgreSQL behavior where all NUMERIC columns are converted to string is not ideal. I understand the motivation - ADBC does not want to do lossy conversion by default and string is the only safe bet at the moment; I need to find a way out though :)

I was entertaining the idea that the 'connector' talking to PostgreSQL would perform the conversion on the fly so that it produces data where NUMERICs are in the end doubles (e.g. pipe the record batches produced by ADBC via Acero to perform str->double conversion on the way out of the service).

The tricky part here is that i'm unable to make this work in a simple deterministic way with just the SQL 'in hand': ADBC result gives no indication (or I was unable to find it? please let me know) that a string column in the result was in fact created from NUMERIC data.

The ideal solution in my context would be for ADBC PostgreSQL to return NUMERIC columns as double. I read through the issues and found that you have been considering this as optional behavior.

I dug into this a little bit and found in PostgreSQL source that it does numeric -> double conversion by calling strtod on the string representation (e.g. the algorithm that is already ported to ADBC's PostgreSQL driver). So if ADBC was to follow the suit, it could reuse the existing numeric -> string conversion and then do strtod on top of it - it could be quite simple.

For reference, here is a working prototype: lupko@6d52952

refactored reader so that the numeric -> str code & friends from PostgreSQL are in common base class
there is one class for Numeric to String; as before, the code populates Arrow buffers directly
there is another class for Numeric to Double; string representation is written into temporary location from where code does strtod and populates Arrow buffer with the double value

What is missing in the prototype is some logic that decides whether to produce string or double - code in prototype always produces double.

If you find this general direction acceptable, I could take it further and complete it. I just need a little extra guidance about making this optional. At what level should the setting be available? What should be the name? What is the best way to 'tunnel' the option down to the code that decides that Arrow type to use? (e.g. PostgresType::SetSchema seems like a good place - a simple conditional here would be enough; no need to add extra conditions into adbcpq::MakeCopyFieldReader)

The text was updated successfully, but these errors were encountered:

lidavidm · 2024-02-05T17:17:10Z

A statement-level option seems good. Maybe @paleolimbot or @WillAyd would have more of an opinion on how to structure the code. We may want to consider doing the double conversion directly at some point instead of bouncing through a string.

We could consider a separate issue to add the postgres type specifier to the result schema so that you could identify cases like this?

paleolimbot · 2024-02-05T17:51:40Z

Implementation wise, the shortest path is probably to parse the decimal text with from_chars() (which I think is locale-independent). The representation we get from COPY is also not bad to convert (it's groups of four digits from 0 to 9999); however, one would have to be careful about and test cases of over/underflow.

For types that we don't know how to deal with we already attach the database name as field metadata...we could do that for all types (or types like decimal and array where we know in advance that what we return is ambiguous). As you noted, then the client could handle that conversion as a workaround.

Interface wise, I hope there is a way we can avoid AdbcStatementSetOption("adbc.statement.XXX_as_XXX", 1). There are a lot of options regarding how to calculate an Arrow type from a database type and it's easy to accumulate many of these. I wish there was a AdbcStatementRequestSchema() to handle the case where the client already knows what they want (perhaps after inspecting AdbcStatementExecuteSchema()); however, I don't think there's a clean way to handle that with the existing API. It would fit nicely with things like the Python capsule protocol (___arrow_c_array_stream__(requested_schema=xxx)).

lupko · 2024-02-05T18:02:17Z

indeed having the postgres type specifier also occurred to me but I did not notice it is already done for some other types so went with the conversion first. I'll create separate issue & PR for that.

imho in the end it could make sense to have this extra source-type info every time when a concrete type is converted to a more generic one or there is loss of semantic. like numeric to string is one instance. but there is for instance OID -> int32 and few others. wdyt?

lupko · 2024-02-05T18:18:27Z

@paleolimbot good point on the from_chars(), i was not aware that strtod() uses current locale; since from_chars() is available from C++ 17 onward, i guess this would have to wait for #1431?

WillAyd · 2024-02-05T18:19:35Z

@lupko you can already start using C++17

lidavidm · 2024-02-05T18:26:43Z

@paleolimbot want to add something to this milestone? https://github.com/apache/arrow-adbc/milestone/8

lupko · 2024-02-06T18:36:46Z

I have been poking into this some more and explored the optional nature of the conversion. Alas, not exactly as described in #1514 because that is just too advanced for me to do in reasonable time :(.

Anyway, I have created a draft PR for you guys to consider. In this draft, there is a new statement-level option adbc.postgresql.numeric_conversion that can be of value to_string, to_double with the idea that if one day ADBC PostgreSQL supports (limited) conversion to decimal128/256, then the value of the option could be to_decimal or whatnot. That way the number of possible options is limited.

I think an option like this still makes sense regardless of what is proposed in #1514. My (twisted :)) reasoning goes as follows: PostgreSQL NUMERIC cannot be always 1-1 mapped to Arrow types. The driver thus needs to compromise and fall back to some kind of conversion when creating Arrow data. There can be multiple different strategies for this fall back. Driver allows for a simple configuration of that strategy. E.g. this is not about configuring type mapping but instead configuring fallback when 'native' mapping for some type is just not possible.

The approach with the option would not conflict with the fancier stuff done in #1514 - schema with desired types always wins, without it driver falls back to use the option.

Anyway, here is the PR draft: #1521

I understand if this is not the way you want to go so no hard feelings can just close this here ticket in favor of #1514 :). With the extra field-level metadata that is already in place, I can create workaround in client code already (but was hoping I could avoid it).

paleolimbot · 2024-02-08T17:20:28Z

I can definitely see how an (or options) could be a workaround...the solution I suggested currently has an indeterminate timeline (requires an new ABI version) and I'm definitely not opposed to solutions that are possible today. Would the metadata solution (i.e., push the workaround into non-ADBC client code) also work or does that have performance/workflow implications that would make it unusable?

lidavidm · 2024-02-13T18:45:27Z

@lupko is the current approach with the type metadata sufficient?

lidavidm added this to the ADBC Libraries 0.10.0 milestone Feb 5, 2024

paleolimbot mentioned this issue Feb 5, 2024

feat: Add the ability to request a schema from a statement #1514

Open

lupko changed the title ~~feature(c/driver/postgresql): optionally return NUMERIC columns as double~~ feat(c/driver/postgresql): optionally return NUMERIC columns as double Feb 5, 2024

lidavidm modified the milestones: ADBC Libraries 0.10.0, ADBC Libraries 0.11.0 Feb 14, 2024

lidavidm removed this from the ADBC Libraries 0.11.0 milestone Mar 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513

feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513

lupko commented Feb 5, 2024 •

edited

Loading

lidavidm commented Feb 5, 2024

paleolimbot commented Feb 5, 2024

lupko commented Feb 5, 2024

lupko commented Feb 5, 2024

WillAyd commented Feb 5, 2024

lidavidm commented Feb 5, 2024

lupko commented Feb 6, 2024 •

edited

Loading

paleolimbot commented Feb 8, 2024

lidavidm commented Feb 13, 2024

feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513

feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513

Comments

lupko commented Feb 5, 2024 • edited Loading

lidavidm commented Feb 5, 2024

paleolimbot commented Feb 5, 2024

lupko commented Feb 5, 2024

lupko commented Feb 5, 2024

WillAyd commented Feb 5, 2024

lidavidm commented Feb 5, 2024

lupko commented Feb 6, 2024 • edited Loading

paleolimbot commented Feb 8, 2024

lidavidm commented Feb 13, 2024

lupko commented Feb 5, 2024 •

edited

Loading

lupko commented Feb 6, 2024 •

edited

Loading