Support copying postgres binary data to an Arrow RecordBatch #35
This library does the opposite: it converts arrow tables to postgres binary data. I don't intend to implement the opposite anytime soon. Are you asking for that as a feature? |
@adriangb , Yes, I'm asking for that as a feature. It would be very convenient to have the two opposite methods in one Python library: one that converts Arrow tables to Postgres binary data, and one that does the reverse, e.g. a PostgresBinaryToArrowEncoder for converting Postgres binary data to an Arrow table. |
Ok I've renamed the issue and will leave it open as a feature request |
Hey @Sergfalt , DuckDB already does this fantastically, and is Arrow compatible. After attaching Postgres to DuckDB, you can do:

```python
arrow_reader = duckdb.sql(
    f"""
    SELECT *
    FROM pg.{schema_name}.{table_name}
    """
).fetch_arrow_reader()
```

And then do whatever you'd like to with this standard Arrow batch reader. DuckDB's Postgres scanner also uses Postgres's binary format, and is very fast and well engineered.
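The reader is a standard pyarrow.RecordBatchReader, so one natural pattern is to stream it batch by batch without materializing the whole table. A minimal sketch (handle_batch is a hypothetical placeholder):

```python
# arrow_reader is a pyarrow.RecordBatchReader; iterating it yields
# pyarrow.RecordBatch objects one at a time, keeping memory bounded
for batch in arrow_reader:
    handle_batch(batch)  # hypothetical per-batch processing
```
|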
Hm, I also just realized that DuckDB's Postgres inserter appears to be doing exactly the same thing as this library (going from Arrow -> Postgres binary format). It would be interesting to benchmark the two.
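For reference, a rough sketch of what driving DuckDB's inserter from Python might look like; the connection string and table names are assumptions for the example, and DuckDB's replacement scans let the pyarrow table be referenced by its Python variable name:

```python
import duckdb
import pyarrow as pa

# illustrative Arrow table to insert
tbl = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
# connection string is an assumption for the example
con.execute("ATTACH 'dbname=mydb host=localhost' AS pg (TYPE postgres)")
# DuckDB's replacement scan resolves `tbl` to the pyarrow table above
con.execute("INSERT INTO pg.public.my_table SELECT * FROM tbl")
```
|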
Do you know what types they support? Part of the impetus for this library was to support some more exotic types. I guess I can test it next week. |
One difference is that this library is sans-io, which can help or hurt depending on your use case. |
The type support is pretty significant:
In DuckDB: [example snippet not preserved]
Result in PG: [example output not preserved]
|
Could you say a bit more about what that means for the end-users among us? Do you mean that it runs entirely within the thread / process of the caller, and is guaranteed not to spill to disk or another thread, etc.? If so, that definitely is a difference and is really useful! Even though DuckDB can run in memory and runs in your process, it's more difficult to write multi-threaded Python with since there's a whole other thing going on. |
Sans-IO means it doesn't make any IO calls; you have control over, and responsibility for, all of the IO. So this library doesn't create threads, event loops, make HTTP requests, etc. That can make it easier to integrate with asyncio or your own thread handling, but it is also a lot more work if you don't need that control.
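Concretely, the encoder only produces bytes, and you feed them into whatever IO layer you choose. A minimal sketch using psycopg's COPY support, assuming pgpq's ArrowToPostgresBinaryEncoder API (write_header/write_batch/finish); the connection string and table name are placeholders:

```python
import psycopg
import pyarrow as pa
from pgpq import ArrowToPostgresBinaryEncoder

# illustrative data; in practice this could be a streamed dataset
table = pa.table({"id": [1, 2, 3]})
encoder = ArrowToPostgresBinaryEncoder(table.schema)

# pgpq only produces bytes; psycopg (or asyncpg, a file, etc.) does the IO
with psycopg.connect("dbname=mydb") as conn, conn.cursor() as cur:
    with cur.copy("COPY my_table FROM STDIN WITH (FORMAT BINARY)") as copy:
        copy.write(encoder.write_header())
        for batch in table.to_batches():
            copy.write(encoder.write_batch(batch))
        copy.write(encoder.finish())
```
|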
I added some benchmarks against DuckDB in #37. Given that DuckDB has robust support for this use case, I'm going to close this as wontfix. I'll also say that when I first built this library, I believe DuckDB didn't support this workflow as well as it does now (if it did, I must have missed that update). Now that it does, I would personally pick DuckDB over this workflow, unless you have issues with the IO (e.g. you want to control it or work very closely with Arrow) or with memory (DuckDB does an amazing job of streaming without loading everything into memory, but I haven't measured how it does for this use case; this library, on the other hand, gives you complete control by operating at the record batch level). |
@adriangb, Hi! Hope you are doing well. Could you please clarify whether it is possible to convert Postgres binary data to a PyArrow Table using pgpq, or whether you have any work in progress on this topic.