-
Notifications
You must be signed in to change notification settings - Fork 337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve and document our GeoArrow compatibility story #8123
Comments
👋 I've vaguely followed rerun for a while and it's really cool to see you adding spatial support! The general structure of your data does indeed match the structure of GeoArrow points, but your data does not currently match the GeoArrow specification because the GeoArrow specification explicitly disallows
Confusion around coordinate axis order is almost a trope in the community. (I'd encourage you to switch to lon-lat if at all possible) Separately, technically GeoArrow currently requires float64 coordinates, and you use float32. The GeoArrow spec is pretty complex as it is, and it's unclear whether we'd want to include float32 as part of the spec, but I don't think it would be controversial to recommend that tools support both float32 or float64 input, and then float32 and float64 would at least be interoperable. (For context, I'm developing the Rust GeoArrow implementation, which also has Python bindings. Your use of |
Hi @kylebarron -- excited to see you here and thanks for bringing this to our attention. Hopefully we can get ahead of this both on the format and documentation front and save our users some pain and confusion. For what it's worth I actually tried to look this up in the https://geoarrow.org/format.html spec and not finding any mention of latitude or longitude, but knowing that geoarrow allowed providing the CRS, I naively jumped to the conclusion that the CRS-specified order would be respected. I had not even imagined that a spec might both provide a means of including CRS data and also explicitly disregard the axis-order the CRS specifies. Though the fact that the chain of transitively-cited links ultimately terminates in the gem of OGC 08-038r7 Case 4 and the final clause "The OGC does not recommend implementing this use case" at least makes me feel better about my incorrect assumption 😅 Although I can very much relate to the pragmatic reasons geoarrow went down this path, I'm somewhat reluctant to fully propagate these Case-4 style clauses myself in our own final spec, and so I suspect that once we include support for explicit CRS we will generally respect it (as suggested by Case 1 in the linked recommendation doc), but this whole topic has made me keenly aware that specifying a CRS alone is insufficient for all use-cases and we will likely need to also allow for an additional axis-order metadata that overrides the CRS (as suggested by Case 2). Regardless, in any context where we find ourselves reading / providing data where we find or apply the geoarrow extension metadata, we'll definitely make sure to track the swapped order and/or shuffle the data if necessary, as required per the geoarrow spec. Also, until we start allowing CRS data to be specified, I'm generally open to swapping the default order on our stored component data to align with geoarrow since this is likely to be the more common representation among folks already working with arrow natively. Needless to say, the choice shall be thoroughly documented and specification-linked to make sure there's no ambiguity. |
geoarrow-c/-python developer here...very cool! Just a note that our axis order assumption policy is a purely pragmatic move...this is ultimately something that needs to be solved at the CRS level (although unfortunately, the people developing CRS specs are the ones pointing at implementations saying "you're doing it wrong!", and implementors mostly keep ignoring them because following their advice leads to angry users whose data is constantly misinterpreted). I've seen this handled as a boolean ("authority_compliant: true") and as an integer array ("permutation: [0, 1]"). If you're using the "separated"/"struct" coordinate representation, it is almost instantaneous to reorder your axes on import/export (if you're using an interleaved representation, you'll have to make a copy). We might allow something like that in the GeoArrow specification, although I would prefer that the CRS specification people got it together and realized that requiring every single real-world implementation to have two fields to encode a CRS is not realistic (they either need to provide a CRS database that has CRSes with axis orders that match real-world data OR allow the axis order as part of some official CRS spec definition). |
And then also a GeoPandas developer chiming in .. :)
For supporting geopandas input, you could actually also reuse your general geoarrow support (at least when requiring geopandas >= 1.0, and you are fine with requiring pyarrow for that use case), because you can ask the geopandas object to convert itself to geoarrow using the new |
I was already kind of mentally weighing these two options. Do you recall which other systems/libraries do this. I would love to point at greater ecosystem alignment if possible when making a decision.
This is a great summary. I'm still quite new to this geo community -- what is the mechanism of communication with the "CRS specification people"? I'm assuming this is more involved than upvoting a GitHub issue somewhere or it would have been solved years ago, but is there some individual/committee we can email and complain at? |
This is not accurate. Although we do use |
Ah, apologies; I must've been looking at the wrong place in the flatbuffers. Aside from the CRS interoperability discussion above (I generally defer to @paleolimbot and @jorisvandenbossche on that front), I'd also encourage you to implement support for the Arrow PyCapsule Interface in your Python bindings. It enables safe Arrow data interchange without requiring the use of pyarrow. Any library implementing this protocol can exchange data via PyCapsules (safe wrappers of C pointers), where data producer and consumer don't need prior knowledge of each other. There are a growing number of Python libraries that support this protocol. So for example, you could exchange data zero-copy with Polars or DuckDB without needing to go through pyarrow. This applies to all Arrow data, not just GeoArrow data. Perhaps a separate issue would be better for this if you're interested in discussing it more? |
Yes -- I was looking into this recently and it looked very promising. Glad to hear an endorsement. |
The "global boolean flag" mechanism is used by sf: https://r-spatial.github.io/sf/reference/st_crs.html#arg-authority-compliant . This basically works by invoking I checked for any existing usage of
I'm actually not sure...probably "the OGC" but I don't know which mailing list would be relevant. I take the emperical/pragmatic view that there is no "right" or "wrong", just users with data that it's my job not to misinterpret. |
Oh, just kidding, I just found: https://gdal.org/en/latest/doxygen/classOGRSpatialReference.html#a1db6f8f544502740b218ba4526cebfd4 which implements the "permutation"). (Sorry for the overabundance of detail here...I'm writing about this in an unrelated document and so I've been collecting these links 🙂 ) |
@paleolimbot no apology necessary. Thank you for the detailed pointers. Knowledge about what the community expects for things like this is so valuable to us. |
Fundamentally, the new map-related archetypes are interoperable with GeoArrow in an efficient way. The way to achieve this should be documented, and helpers should be introduced.
Here is a mostly self-explanatory script showing ways to efficiently log geoarrow data, interoperate with GeoPandas, and a hint of what future helpers could look like:
The text was updated successfully, but these errors were encountered: