
Support copying postgres binary data to an Arrow RecordBatch #35

Closed
Sergfalt opened this issue Feb 2, 2024 · 11 comments

Sergfalt commented Feb 2, 2024

Hi @adriangb! Hope you are doing well. Could you please clarify whether it is possible to convert Postgres binary data to a PyArrow Table using pgpq? Maybe you already have some work on this topic.

adriangb (Owner) commented Feb 2, 2024

This library does the opposite: it converts Arrow tables to Postgres binary data. I don't intend to implement the reverse anytime soon. Are you asking for that as a feature?

Sergfalt (Author) commented Feb 2, 2024

@adriangb, yes, I'm asking for that as a feature. It would be very convenient to have the two opposite methods in one Python library: the first converts Arrow tables to Postgres binary data, and the second does the reverse. For example, a PostgresBinaryToArrowEncoder for converting Postgres binary data to an Arrow table.
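
For illustration, a purely hypothetical sketch of the requested API, mirroring the existing ArrowToPostgresBinaryEncoder (nothing like this exists in pgpq today; the class and method names below are made up):

    # Hypothetical API: PostgresBinaryToArrowEncoder does NOT exist in pgpq.
    # Shown only to illustrate the shape of the feature request.
    from pgpq import PostgresBinaryToArrowEncoder  # hypothetical import

    decoder = PostgresBinaryToArrowEncoder(arrow_schema)  # hypothetical: expected Arrow schema
    with open("table.copy", "rb") as f:  # bytes from COPY ... TO STDOUT (FORMAT BINARY)
        record_batch = decoder.decode(f.read())  # hypothetical: bytes -> pyarrow.RecordBatch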

adriangb (Owner) commented Feb 2, 2024

OK, I've renamed the issue and will leave it open as a feature request.

adriangb changed the title from PostgresBinaryToArrow to Support copying postgres binary data to an Arrow RecordBatch on Feb 2, 2024
ben-pr-p commented Mar 8, 2024

Hey @Sergfalt, DuckDB already does this fantastically, and is Arrow-compatible.

After attaching Postgres to DuckDB, you can do:

    import duckdb

    # schema_name and table_name are placeholders for your Postgres schema and table.
    arrow_reader = duckdb.sql(
        f"""
        SELECT *
        FROM pg.{schema_name}.{table_name}
        """
    ).fetch_arrow_reader()

And then do whatever you'd like with this standard Arrow batch reader. DuckDB's Postgres scanner also uses Postgres's binary format, and is very fast and well engineered.
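
For completeness, attaching looks roughly like this (a sketch using DuckDB's postgres extension; the connection string is a placeholder):

    import duckdb

    # Install and load DuckDB's postgres extension, then attach a Postgres
    # database under the alias "pg". Adjust the connection string for your setup.
    duckdb.sql("INSTALL postgres")
    duckdb.sql("LOAD postgres")
    duckdb.sql("ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE postgres)")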

ben-pr-p commented Mar 8, 2024

Hm, I also just realized that DuckDB's Postgres inserter appears to be doing exactly the same thing as this library (going from Arrow -> Postgres binary format). I'd be curious to benchmark the two.

adriangb (Owner) commented Mar 8, 2024

Do you know what types they support? Part of the impetus for this library was to support some more exotic types. I guess I can test it next week.

adriangb (Owner) commented Mar 8, 2024

One difference is that this library is sans-io, which can help or hurt depending on your use case.

ben-pr-p commented Mar 8, 2024

The type support is pretty significant.
In psql:

\d numbers
                 Table "imports.numbers"
┌───────────┬───────────┬───────────┬──────────┬─────────┐
│  Column   │   Type    │ Collation │ Nullable │ Default │
╞═══════════╪═══════════╪═══════════╪══════════╪═════════╡
│ id        │ bigint    │           │          │         │
│ custom    │ jsonb     │           │          │         │
│ my_struct │ my_struct │           │          │         │
└───────────┴───────────┴───────────┴──────────┴─────────┘

In DuckDB:

insert into pg.imports.numbers (id, custom, my_struct) select i, json_object('n', i), { "first": i, "last": i } from range(1000000) tbl (i);

Result in PG:

> select * from numbers limit 10;
┌────┬──────────┬───────────┐
│ id │  custom  │ my_struct │
╞════╪══════════╪═══════════╡
│  0 │ {"n": 0} │ (0,0)     │
│  1 │ {"n": 1} │ (1,1)     │
│  2 │ {"n": 2} │ (2,2)     │
│  3 │ {"n": 3} │ (3,3)     │
│  4 │ {"n": 4} │ (4,4)     │
│  5 │ {"n": 5} │ (5,5)     │
│  6 │ {"n": 6} │ (6,6)     │
│  7 │ {"n": 7} │ (7,7)     │
│  8 │ {"n": 8} │ (8,8)     │
│  9 │ {"n": 9} │ (9,9)     │
└────┴──────────┴───────────┘
(10 rows)

> select sum(id) from numbers;
┌──────────────┐
│     sum      │
╞══════════════╡
│ 499999500000 │
└──────────────┘
(1 row)

ben-pr-p commented Mar 8, 2024

> One difference is that this library is sans-io, which can help or hurt depending on your use case.

Could you say a bit more about what that means for the end-users among us? Do you mean that it runs entirely within the thread / process of the caller, and is guaranteed not to spill to disk or another thread, etc.?

If so, that definitely is a difference and is really useful! Even though DuckDB can run in memory and in your process, it's more difficult to write multi-threaded Python with, since there's a whole other thing going on.

adriangb (Owner) commented Mar 8, 2024

Sans-IO means it doesn't make any IO calls itself: you have control over, and responsibility for, all of the IO. This library doesn't create threads or event loops, make network requests, etc. That can make it easier to integrate with asyncio or your own thread-based processing, but it's also a lot more work if you don't need that control.
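
To make that concrete, here is a minimal sketch of what driving pgpq yourself looks like, assuming psycopg for the actual COPY (the DSN, table name, and file path are placeholders):

    import psycopg
    import pyarrow.dataset as ds
    from pgpq import ArrowToPostgresBinaryEncoder

    # The encoder only turns Arrow record batches into Postgres COPY BINARY
    # bytes; all of the actual IO below (connection, COPY) is ours to manage.
    dataset = ds.dataset("data.parquet")  # placeholder file
    encoder = ArrowToPostgresBinaryEncoder(dataset.schema)

    with psycopg.connect("postgres://...") as conn:  # placeholder DSN
        with conn.cursor() as cursor:
            with cursor.copy("COPY my_table FROM STDIN WITH (FORMAT BINARY)") as copy:
                copy.write(encoder.write_header())
                for batch in dataset.to_batches():
                    copy.write(encoder.write_batch(batch))
                copy.write(encoder.finish())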

adriangb (Owner) commented

I added some benchmarks against DuckDB in #37.

Given that DuckDB seems to have robust support for this use case, I'm going to close this as wontfix.

I'll also say that I believe when I first built this library, DuckDB didn't have as good support for this workflow as it does now (if it did, I must have missed that update). Now that it does, I would personally pick DuckDB over this workflow, unless there are issues with the IO (e.g. you want to control it or work very closely with Arrow) or with memory (DuckDB does an amazing job of streaming instead of loading everything into memory, but I haven't measured how it does for this use case; this library, on the other hand, gives you complete control by operating at the record-batch level).

adriangb closed this as not planned on Mar 12, 2024