-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513
Comments
A statement-level option seems good. Maybe @paleolimbot or @WillAyd would have more of an opinion on how to structure the code. We may want to consider doing the double conversion directly at some point instead of bouncing through a string. We could consider a separate issue to add the postgres type specifier to the result schema so that you could identify cases like this? |
Implementation wise, the shortest path is probably to parse the decimal text with For types that we don't know how to deal with we already attach the database name as field metadata...we could do that for all types (or types like decimal and array where we know in advance that what we return is ambiguous). As you noted, then the client could handle that conversion as a workaround. Interface wise, I hope there is a way we can avoid |
indeed having the postgres type specifier also occurred to me but I did not notice it is already done for some other types so went with the conversion first. I'll create separate issue & PR for that. imho in the end it could make sense to have this extra source-type info every time when a concrete type is converted to a more generic one or there is loss of semantic. like numeric to string is one instance. but there is for instance OID -> int32 and few others. wdyt? |
@paleolimbot good point on the |
@lupko you can already start using C++17 |
@paleolimbot want to add something to this milestone? https://github.com/apache/arrow-adbc/milestone/8 |
I have been poking into this some more and explored the optional nature of the conversion. Alas, not exactly as described in #1514 because that is just too advanced for me to do in reasonable time :(. Anyway, I have created a draft PR for you guys to consider. In this draft, there is a new statement-level option I think an option like this still makes sense regardless of what is proposed in #1514. My (twisted :)) reasoning goes as follows: PostgreSQL NUMERIC cannot be always 1-1 mapped to Arrow types. The driver thus needs to compromise and fall back to some kind of conversion when creating Arrow data. There can be multiple different strategies for this fall back. Driver allows for a simple configuration of that strategy. E.g. this is not about configuring type mapping but instead configuring fallback when 'native' mapping for some type is just not possible. The approach with the option would not conflict with the fancier stuff done in #1514 - schema with desired types always wins, without it driver falls back to use the option. Anyway, here is the PR draft: #1521 I understand if this is not the way you want to go so no hard feelings can just close this here ticket in favor of #1514 :). With the extra field-level metadata that is already in place, I can create workaround in client code already (but was hoping I could avoid it). |
I can definitely see how an (or options) could be a workaround...the solution I suggested currently has an indeterminate timeline (requires an new ABI version) and I'm definitely not opposed to solutions that are possible today. Would the metadata solution (i.e., push the workaround into non-ADBC client code) also work or does that have performance/workflow implications that would make it unusable? |
@lupko is the current approach with the type metadata sufficient? |
Hi all,
I am currently evaluating/trying out ABDC (Python) in the following context:
In this context, the current ADBC PostgreSQL behavior where all NUMERIC columns are converted to string is not ideal. I understand the motivation - ADBC does not want to do lossy conversion by default and string is the only safe bet at the moment; I need to find a way out though :)
I was entertaining the idea that the 'connector' talking to PostgreSQL would perform the conversion on the fly so that it produces data where NUMERICs are in the end doubles (e.g. pipe the record batches produced by ADBC via Acero to perform str->double conversion on the way out of the service).
The tricky part here is that i'm unable to make this work in a simple deterministic way with just the SQL 'in hand': ADBC result gives no indication (or I was unable to find it? please let me know) that a string column in the result was in fact created from NUMERIC data.
The ideal solution in my context would be for ADBC PostgreSQL to return NUMERIC columns as double. I read through the issues and found that you have been considering this as optional behavior.
I dug into this a little bit and found in PostgreSQL source that it does numeric -> double conversion by calling
strtod
on the string representation (e.g. the algorithm that is already ported to ADBC's PostgreSQL driver). So if ADBC was to follow the suit, it could reuse the existing numeric -> string conversion and then dostrtod
on top of it - it could be quite simple.For reference, here is a working prototype: lupko@6d52952
double
valueWhat is missing in the prototype is some logic that decides whether to produce string or double - code in prototype always produces double.
If you find this general direction acceptable, I could take it further and complete it. I just need a little extra guidance about making this optional. At what level should the setting be available? What should be the name? What is the best way to 'tunnel' the option down to the code that decides that Arrow type to use? (e.g. PostgresType::SetSchema seems like a good place - a simple conditional here would be enough; no need to add extra conditions into adbcpq::MakeCopyFieldReader)
The text was updated successfully, but these errors were encountered: