Expose upstream schema #18

Open
kylebarron opened this issue Mar 14, 2024 · 6 comments

@kylebarron

👋 I saw you were looking for feedback, and I was curious whether you have, or are interested in exposing, support for user-defined data types? In particular, I'm working on building out support for geospatial data in arrow in https://github.com/geoarrow/geoarrow-rs. We have a working but limited implementation that reads from PostGIS databases. It connects directly to sqlx, and it would be great to use a library like yours that focuses on converting database tables -> arrow. But we need to be able to access the type name, which is "geometry" or "geography" for a PostGIS column.

I haven't looked through your API yet, but I think one way this could work is if you exposed both the Postgres schema and the inferred Arrow schema. Then geoarrow could access the upstream schema and know that a BinaryArray column actually represents geospatial data.

@aljazerzen
Owner

Oh, interesting. So geoarrow-rs defines new types of array for each of the geometry types?

To answer your question: yes it is totally doable to expose the original type of a column in a query response.

A sidenote: for Postgres, response types are returned just as a u32 oid, so we'd need to make additional queries to the database to figure out the actual name.

In this scenario, connector_arrow would use a Postgres Client to make a request, then convert PostGIS data into plain arrow types and then pass that to geoarrow-rs, right?
So POINT would probably be converted into a List of Float64?
And POLYGON would probably be converted into a List of List of Float64?

This does add another step into the data pipeline, but it would be the easiest to implement.


The title of the post suggests something different: to dynamically specify how certain database types have to be converted into arrow types.

For the record, this is much harder to do, since it needs to plug into my machinery of mapping types and providing data converters.

@kylebarron
Author

Oh, interesting. So geoarrow-rs defines new types of array for each of the geometry types?

Yeah, geoarrow-rs follows the GeoArrow specification. GeoArrow follows the same "Simple Features" spec as PostGIS, so GeoArrow also has Point, LineString, Polygon, etc. types like PostGIS, but they're stored differently. PostGIS uses a custom internal format IIRC, while in GeoArrow a two-dimensional PointArray is physically a FixedSizeList[2, Float64]. (A Polygon array has two levels of nesting above the coordinates, so it's List[List[FixedSizeList[2, Float64]]].)
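
To make those layouts concrete, here's a minimal sketch in terms of arrow-rs DataTypes (the child field names are just illustrative, not necessarily the ones the GeoArrow spec mandates):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

fn main() {
    // Two-dimensional point: FixedSizeList[2, Float64]
    let point = DataType::FixedSizeList(
        Arc::new(Field::new("xy", DataType::Float64, false)),
        2,
    );

    // Polygon: two levels of list nesting above the coordinates,
    // i.e. List[List[FixedSizeList[2, Float64]]]
    let ring = DataType::List(Arc::new(Field::new("vertices", point, false)));
    let polygon = DataType::List(Arc::new(Field::new("rings", ring, false)));

    println!("{polygon:?}");
}
```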

In this scenario, connector_arrow would use a Postgres Client to make a request, then convert PostGIS data into plain arrow types and then pass that to geoarrow-rs, right?

PostGIS data comes over the wire as Well-Known Binary, which packs each row into a well-specified binary format. This would come across as an Arrow BinaryArray. I'm not suggesting that connector_arrow manage conversion from the binary format to the arrow-native PointArray etc; that's all done in geoarrow-rs. Rather, we just need to know the original Postgres schema so that we know a given binary column actually represents geometries.

A sidenote: for Postgres, response types are returned just as a u32 oid, so we'd need to make additional queries to the database to figure out the actual name

Yeah... I'm a little hazy on those details. I think sqlx might sometimes make those additional queries transparently to the user.

The title of the post suggests something different: to dynamically specify how certain database types have to be converted into arrow types.

I don't think I need anything truly dynamic.

@kylebarron changed the title from "Customize type conversion?" to "Expose upstream schema" on Mar 15, 2024
@kylebarron
Author

One potential way to do this, though it would have less type safety, would be to include upstream metadata on the Arrow field metadata. Then you presumably wouldn't need to change the API.
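
For illustration, a rough sketch of what that might look like with the arrow crate; the "upstream_type" key below is purely hypothetical, not an actual connector_arrow key:

```rust
use std::collections::HashMap;
use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // Hypothetical: a PostGIS geometry column surfaced as Binary,
    // with the original Postgres type name attached as field metadata.
    let geom = Field::new("geom", DataType::Binary, true).with_metadata(HashMap::from([(
        "upstream_type".to_string(),
        "geometry".to_string(),
    )]));

    let schema = Schema::new(vec![geom]);
    println!("{schema:?}");
}
```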

@aljazerzen
Owner

... include upstream metadata on the Arrow field metadata

PostGIS data comes over the wire as Well-Known Binary

Oh this is great! So connector_arrow will just be converting the binary buffers from the Postgres wire protocol into a BinaryArray and attaching the name of the original type to the field metadata.

I wouldn't say that this is less type-safe.

I like this approach because it also works with any other Postgres extension or built-in non-trivial type. It makes connector_arrow deal only with converting rows into RecordBatches and leaves custom type deserialization to downstream crates. Do one thing and do it well :D
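
On the downstream side (same hypothetical "upstream_type" key as above), a consumer like geoarrow-rs could then pick out geometry columns from the returned schema with something like:

```rust
use arrow::datatypes::Schema;

/// Collect the names of columns whose upstream Postgres type was
/// `geometry` or `geography`. The "upstream_type" key is the same
/// hypothetical key as above, not connector_arrow's actual key.
fn geometry_columns(schema: &Schema) -> Vec<String> {
    schema
        .fields()
        .iter()
        .filter(|f| {
            matches!(
                f.metadata().get("upstream_type").map(String::as_str),
                Some("geometry") | Some("geography")
            )
        })
        .map(|f| f.name().clone())
        .collect()
}
```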

@aljazerzen
Owner

A sidenote: for Postgres, response types are returned just as a u32 oid, so we'd need to make additional queries to the database to figure out the actual name

Yeah... I'm a little hazy on those details. I think sqlx might sometimes make those additional queries transparently to the user.

Ignore this, I was wrong. This is handled by the postgres crate (or rather tokio-postgres), as it should be:

https://github.com/sfackler/rust-postgres/blob/master/tokio-postgres/src/prepare.rs#L134-L186
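
For context, roughly how that surfaces through the tokio-postgres API (the connection string, table, and column names below are made up): preparing a statement resolves the column type oids, so the type name is available directly on the statement's columns.

```rust
use tokio_postgres::{Error, NoTls};

#[tokio::main]
async fn main() -> Result<(), Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres", NoTls).await?;
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // `prepare` looks up unknown type oids behind the scenes, so a PostGIS
    // column reports its type by name rather than as a bare oid.
    let stmt = client.prepare("SELECT geom FROM my_table").await?;
    for col in stmt.columns() {
        println!("{}: {}", col.name(), col.type_().name()); // e.g. "geom: geometry"
    }
    Ok(())
}
```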

@aljazerzen
Owner

aljazerzen commented Mar 20, 2024

Implemented in e4fd3ca

@kylebarron, please take a look and verify that this covers your use case.
