
Support copying postgres binary data to an Arrow RecordBatch #35

Closed
Sergfalt opened this issue Feb 2, 2024 · 11 comments

Sergfalt commented Feb 2, 2024

Hi @adriangb! Hope you are doing well. Could you please clarify whether it is possible to convert Postgres binary data to a PyArrow Table using pgpq? Maybe you already have some work on this topic.

adriangb (Owner) commented Feb 2, 2024

This library does the opposite: it converts Arrow tables to Postgres binary data. I don't intend to implement the reverse anytime soon. Are you asking for that as a feature?

Sergfalt (Author) commented Feb 2, 2024

@adriangb, yes, I'm asking for that as a feature. It would be very convenient to have the two opposite methods in one Python library: the first converts Arrow tables to Postgres binary data, and the second does the reverse. For example, a PostgresBinaryToArrowEncoder for converting Postgres binary data to an Arrow table.
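
For illustration, a purely hypothetical sketch of the requested API, mirroring the existing ArrowToPostgresBinaryEncoder (nothing like this exists in pgpq today; the class and method names below are made up):

    # Hypothetical API: PostgresBinaryToArrowEncoder does NOT exist in pgpq.
    # Shown only to illustrate the shape of the feature request.
    from pgpq import PostgresBinaryToArrowEncoder  # hypothetical import

    decoder = PostgresBinaryToArrowEncoder(arrow_schema)  # hypothetical: expected Arrow schema
    with open("table.copy", "rb") as f:  # bytes from COPY ... TO STDOUT (FORMAT BINARY)
        record_batch = decoder.decode(f.read())  # hypothetical: bytes -> pyarrow.RecordBatch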

adriangb (Owner) commented Feb 2, 2024

OK, I've renamed the issue and will leave it open as a feature request.

adriangb changed the title from PostgresBinaryToArrow to Support copying postgres binary data to an Arrow RecordBatch on Feb 2, 2024
ben-pr-p commented Mar 8, 2024

Hey @Sergfalt, DuckDB already does this fantastically, and is Arrow-compatible.

After attaching Postgres to DuckDB, you can do:

    import duckdb

    # schema_name and table_name are placeholders for your Postgres schema and table.
    arrow_reader = duckdb.sql(
        f"""
        SELECT *
        FROM pg.{schema_name}.{table_name}
        """
    ).fetch_arrow_reader()

And then do whatever you'd like with this standard Arrow batch reader. DuckDB's Postgres scanner also uses Postgres's binary format, and is very fast and well engineered.
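
For completeness, attaching looks roughly like this (a sketch using DuckDB's postgres extension; the connection string is a placeholder):

    import duckdb

    # Install and load DuckDB's postgres extension, then attach a Postgres
    # database under the alias "pg". Adjust the connection string for your setup.
    duckdb.sql("INSTALL postgres")
    duckdb.sql("LOAD postgres")
    duckdb.sql("ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE postgres)")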

ben-pr-p commented Mar 8, 2024

Hm, I also just realized that DuckDB's Postgres inserter appears to be doing exactly the same thing as this library (going from Arrow -> Postgres binary format). I'd be curious to benchmark the two.

adriangb (Owner) commented Mar 8, 2024

Do you know what types they support? Part of the impetus for this library was to support some more exotic types. I guess I can test it next week.

adriangb (Owner) commented Mar 8, 2024

One difference is that this library is sans-io, which can help or hurt depending on your use case.

ben-pr-p commented Mar 8, 2024

The type support is pretty significant.
In psql:

\d numbers
                 Table "imports.numbers"
┌───────────┬───────────┬───────────┬──────────┬─────────┐
│  Column   │   Type    │ Collation │ Nullable │ Default │
╞═══════════╪═══════════╪═══════════╪══════════╪═════════╡
│ id        │ bigint    │           │          │         │
│ custom    │ jsonb     │           │          │         │
│ my_struct │ my_struct │           │          │         │
└───────────┴───────────┴───────────┴──────────┴─────────┘

In DuckDB:

insert into pg.imports.numbers (id, custom, my_struct) select i, json_object('n', i), { "first": i, "last": i } from range(1000000) tbl (i);

Result in PG:

> select * from numbers limit 10;
┌────┬──────────┬───────────┐
│ id │  custom  │ my_struct │
╞════╪══════════╪═══════════╡
│  0 │ {"n": 0} │ (0,0)     │
│  1 │ {"n": 1} │ (1,1)     │
│  2 │ {"n": 2} │ (2,2)     │
│  3 │ {"n": 3} │ (3,3)     │
│  4 │ {"n": 4} │ (4,4)     │
│  5 │ {"n": 5} │ (5,5)     │
│  6 │ {"n": 6} │ (6,6)     │
│  7 │ {"n": 7} │ (7,7)     │
│  8 │ {"n": 8} │ (8,8)     │
│  9 │ {"n": 9} │ (9,9)     │
└────┴──────────┴───────────┘
(10 rows)

> select sum(id) from numbers;
┌──────────────┐
│     sum      │
╞══════════════╡
│ 499999500000 │
└──────────────┘
(1 row)

ben-pr-p commented Mar 8, 2024

> One difference is that this library is sans-io, which can help or hurt depending on your use case.

Could you say a bit more about what that means for the end-users among us? Do you mean that it runs entirely within the thread / process of the caller, and is guaranteed not to spill to disk or another thread, etc.?

If so, that definitely is a difference and is really useful! Even though DuckDB can run in memory and in your process, it's more difficult to write multi-threaded Python with, since there's a whole other thing going on.

adriangb (Owner) commented Mar 8, 2024

Sans-IO means it doesn't make any IO calls itself: you have control over, and responsibility for, all of the IO. This library doesn't create threads or event loops, make network requests, etc. That can make it easier to integrate with asyncio or your own thread-based processing, but it's also a lot more work if you don't need that control.
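
To make that concrete, here is a minimal sketch of what driving pgpq yourself looks like, assuming psycopg for the actual COPY (the DSN, table name, and file path are placeholders):

    import psycopg
    import pyarrow.dataset as ds
    from pgpq import ArrowToPostgresBinaryEncoder

    # The encoder only turns Arrow record batches into Postgres COPY BINARY
    # bytes; all of the actual IO below (connection, COPY) is ours to manage.
    dataset = ds.dataset("data.parquet")  # placeholder file
    encoder = ArrowToPostgresBinaryEncoder(dataset.schema)

    with psycopg.connect("postgres://...") as conn:  # placeholder DSN
        with conn.cursor() as cursor:
            with cursor.copy("COPY my_table FROM STDIN WITH (FORMAT BINARY)") as copy:
                copy.write(encoder.write_header())
                for batch in dataset.to_batches():
                    copy.write(encoder.write_batch(batch))
                copy.write(encoder.finish())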

adriangb (Owner) commented

I added some benchmarks against DuckDB in #37.

Given that DuckDB seems to have robust support for this use case, I'm going to close this as wontfix.

I'll also say that I believe when I first built this library, DuckDB didn't have as good support for this workflow as it does now (if it did, I must have missed that update). Now that it does, I would personally pick DuckDB over this workflow, unless there are issues with the IO (e.g. you want to control it or work very closely with Arrow) or with memory (DuckDB does an amazing job of streaming instead of loading everything into memory, but I haven't measured how it does for this use case; this library, on the other hand, gives you complete control by operating at the record-batch level).

adriangb closed this as not planned on Mar 12, 2024