Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize GeoPandas support #200

Open
robfitzgerald opened this issue May 9, 2024 · 5 comments
Open

Normalize GeoPandas support #200

robfitzgerald opened this issue May 9, 2024 · 5 comments
Labels
python Applies to the python code
Milestone

Comments

@robfitzgerald
Copy link
Collaborator

robfitzgerald commented May 9, 2024

The result of a CompassApp::run is a Python Object which has some standard keys and structures that are known/specified internally in the rust codebase. Handling this monolithic object in Python is counter-intuitive since we are not leveraging Pandas/GeoPandas, which are the standard format for analysis of columnar (geo) data.

The CompassApp::run method could include a geopandas: bool = false (or gdf if that's more Pythonic) argument which will assume the user wants GeoDataFrame outputs for route and tree data when it is true, but output the monolithic Object when false (by default).

@robfitzgerald robfitzgerald added the python Applies to the python code label May 9, 2024
@robfitzgerald
Copy link
Collaborator Author

robfitzgerald commented May 9, 2024

here's some thoughts on how we could do this.

I'd propose the signature for run gets expanded to support a few different combinations.

class CompassApp:

  def run(self, queries, gdf: bool = false, index_col: Optional[str]=None)

Top-level structure

In my mind a route is mapped to a single row of a dataframe. I think for a tree, a link is mapped to a row. Because their schemas are different, I think we want the output of a gpd=true run to return:

  • GeoDataFrame when either routes or trees are found (but not both)
  • Tuple[GeoDataFrame, GeoDataFrame] when both are present

This is similar to osmnx.to_gdfs().

Returning Routes

Summary row data

Any row in a route GeoDataFrame should have all traversal and cost columns that may be relevant to post-processing:

column mapping
_index (user-defined index_col path, or, by default, the top-level index value in the Compass output)
distance traversal_summary.distance
time traversal_summary.time
...
distance_cost cost.distance
time_cost cost.time
total_cost cost.total_cost

All of these should be set only if they exist (optional semantics).

Geo row identifiers

The route geometries are parsed as geometry objects. A _path column is added to the dataframe in the case that there is more than one path. The value is the route.path index.

run() result description route column?
one route {} route.path has one entry no
n routes [] route.path has i entries append _path column with value i

As a result, a row is uniquely indexed by the _index column when there is one route per row, and a combination of _index and _path when route.path has more than one entry. Put another way: we end up with k * q rows for q queries and k paths.

Geometry data

We need to cover the possible geometry output types. We can know in advance what the geometry type is by inspecting the configuration of the CompassApp (via #201). We can then switch our geometry parsing method accordingly:

Compass traversal.route config value parse method for string s
json (no geometry)
wkt shapely.wkt.loads(s)
geojson shapely.ops.shape(s)

Returning trees

Trees are similar but different. Each Compass output row may have an entire tree, but, we do not want a GDF of trees (plural), we want instead to plot a single tree. Following the logic above, we can add one more index column so that each row is a link in a tree in a result:

for each Compass run result row _i_:
  for each tree _j_:
    for each tree branch _k_:
      create row for `_index` _i_, `_tree` _k_,  `_edge_id` edgeid(_k_)

@robfitzgerald
Copy link
Collaborator Author

btw, if we do the proposed above, then i feel like all of our plotting functions can disappear and get replaced by GeoDataFrame plotting methods, which for example supports the feature described in #199, but also can swap-in replace the existing to_plot_folium methods in compass-app-py.

@robfitzgerald
Copy link
Collaborator Author

Just thinking about it. Maybe instead of gdf: bool we instead provide output_format: str where it can be "json" or "geopandas"?

@nreinicke nreinicke added this to the PyCon 2024 milestone May 9, 2024
This was referenced May 14, 2024
@robfitzgerald
Copy link
Collaborator Author

robfitzgerald commented Oct 24, 2024

working with Compass right now in a python environment, and i want my run result as a dataframe. my result has thousands of rows, so, avoiding a copy would be ideal. that's where we should be using polars.

two thoughts:

  1. expand on our CSV output policy so that the end result is a polars dataframe
    - nice time to add JSONPath support to replace the current mapper operations
  2. revise ResponsePersistencePolicy::PersistResponseInMemory so it has an optional output format
  3. revise the run response to be either JSON or a DataTable
  4. revise the Python run method so it returns either a PyObject Dict or a DataTable (via polars/Apache Arrow -> pandas.DataFrame method which depends on pandas and pyarrow)

edit: then, to make this result a geopandas.GeoDataFrame could be simple with the user providing a JSONPath to the geometry object

@nreinicke
Copy link
Collaborator

Yeah I really like the idea of using polars here for zero copy operations and I like that this will still allow support of geopandas if we use the pyarrow backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Applies to the python code
Projects
None yet
Development

No branches or pull requests

2 participants