Initial connection to postgres very slow when number of OIDs is large #1755

Open
mcrumiller opened this issue Apr 24, 2024 · 13 comments
Labels: Type: bug (Something isn't working)

mcrumiller commented Apr 24, 2024

What happened?

When I perform a dbapi.connect(uri) call to a Postgres 9.4.26 database (Greenplum 6.24.3), it takes upwards of two minutes to connect. Compare this to SQLAlchemy, which connects in about one second:

from time import perf_counter

import sqlalchemy as sa
from adbc_driver_postgresql import dbapi

uri = "..."

# using SQLAlchemy
t = perf_counter()
conn = sa.create_engine(uri).connect()
print(f"SQLAlchemy connected in {perf_counter() - t:0.2f}s.")

# verify result
print(conn.execute(sa.text("SELECT 1")).fetchall())
conn.close()

# using adbc
t = perf_counter()
conn = dbapi.connect(uri)  # super slow
print(f"ADBC connected in {perf_counter() - t:0.2f}s.")

# verify result
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchall())

Output:

SQLAlchemy connected in 1.31s.
[(1,)]
ADBC connected in 153.66s.
[(1,)]

My uri is of the form postgresql://username:password@server:port/database. Is there another connection setting I'm missing that's causing the huge connection delay?

Environment/Setup

Item             Version
OS               Windows 10 Enterprise
postgres driver  0.11.0
Postgres server  9.4.26
Greenplum        6.24.3
mcrumiller added the Type: bug label on Apr 24, 2024
@lidavidm (Member)

Hmm, the main thing I would suspect is that we have to read all the type OIDs on first connect so that we know what Arrow type to map them to later. I don't think this has ever come up as taking a long time, but perhaps this deployment or Greenplum makes this slower somehow? (Or the type list is simply very long?)

@lidavidm (Member)

Is there a way to spin up Greenplum/does this happen on a fresh install that we could poke at?

@mcrumiller (Author)

Hi @lidavidm, thanks for your help. I'm new to ADBC, so apologies that I'm not very well-versed here. I'm in a corporate environment, so unfortunately I don't have any admin access to the Greenplum server. We did just upgrade from 5.x to 6.x last night, so this is a pretty "fresh install"; however, I saw the same thing a few months ago on 5.x.

Is there anything I can do to help diagnose, perhaps export some logging or increase verbosity?

@lidavidm (Member)

How long do these queries take for you?

// We need a few queries to build the resolver. The current strategy might
// fail for some recursive definitions (e.g., arrays of records of arrays).
// First, one on the pg_attribute table to resolve column names/oids for
// record types.
const std::string kColumnsQuery = R"(
SELECT
    attrelid,
    attname,
    atttypid
FROM
    pg_catalog.pg_attribute
ORDER BY
    attrelid, attnum
)";
// Second, a query of the pg_type table. This query may need a few attempts to handle
// recursive definitions (e.g., record types with array column). This currently won't
// handle range types because those rows don't have child OID information. Arrays types
// are inserted after a successful insert of the element type.
const std::string kTypeQuery = R"(
SELECT
    oid,
    typname,
    typreceive,
    typbasetype,
    typarray,
    typrelid
FROM
    pg_catalog.pg_type
WHERE
    (typreceive != 0 OR typname = 'aclitem') AND typtype != 'r' AND typreceive::TEXT != 'array_recv'
ORDER BY
    oid
)";


mcrumiller commented Apr 24, 2024

First query: 6,667,495 rows in 97.72s.
Second query: 141,849 rows in 1.81s.

@mcrumiller (Author)

@lidavidm it appears that ADBC downloads the data type for every single column in the database, even when there are thousands of tables. Is there a way to defer this until query time and only fetch the data types of the needed columns?

@lidavidm (Member)

Yikes! Well, that would explain it. Thanks for the numbers. I'll think about whether we can avoid this query at all...

@lidavidm (Member)

@lidavidm it appears that ADBC downloads the data type for every single column in the database, even when there are thousands of tables. Is there a way to defer this until query time and only fetch the data types of the needed columns?

Well, that would add additional latency in front of each query. We could possibly do it lazily and cache the results, though.


davlee1972 commented Apr 25, 2024

Hmm, I got different results for my Greenplum test: 6 seconds for the first query.
I'm calling cursor.fetch_arrow_table() to pull results.

a.num_rows
3028922

total
datetime.timedelta(seconds=6, microseconds=607360)

b.num_rows
120072

total2
datetime.timedelta(microseconds=469822)

from datetime import datetime

from adbc_driver_postgresql import dbapi

uri = "..."  # same style of connection URI as in the original report
conn = dbapi.connect(uri)
c = conn.cursor()

start = datetime.now()
c.execute("""
SELECT 
     attrelid, 
     attname, 
     atttypid 
 FROM 
     pg_catalog.pg_attribute 
 ORDER BY 
     attrelid, attnum 
""")

a = c.fetch_arrow_table()

total = datetime.now() - start

start = datetime.now()
c.execute("""
SELECT 
     oid, 
     typname, 
     typreceive, 
     typbasetype, 
     typarray, 
     typrelid 
 FROM 
     pg_catalog.pg_type 
 WHERE 
     (typreceive != 0 OR typname = 'aclitem') AND typtype != 'r' AND typreceive::TEXT != 'array_recv' 
 ORDER BY 
     oid 
""")

b = c.fetch_arrow_table()

total2 = datetime.now() - start

@mcrumiller (Author)

@davlee1972 well sure, but the issue isn't that it's Greenplum (I included that information in case it was important); it's how many objects are defined in the database. The DB I use at work has a lot of tables and columns, and as a result the initial connection takes a really long time and probably consumes quite a bit of memory as well.

@lidavidm (Member)

Yeah, I'm currently thinking we can cache things up front when there aren't many types, and otherwise perhaps speculatively cache part of the info and load the rest on demand as queries are run. (And possibly provide a "please cache them all right now" control.) I'm not happy about the potential variance, but I suppose that's the downside of needing to know the types to parse the binary copy format.

As for memory usage, I think we can explore switching to an LRU cache or something along those lines later if it's a concern.
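
Not speaking for an eventual implementation, but to make the "lazy with a cache" idea concrete, a minimal sketch might look like this (PostgresType and the per-OID loader below are simplified stand-ins, not the driver's actual API):

#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Simplified stand-in for the driver's PostgresType.
struct PostgresType {
  uint32_t oid;
  std::string typname;
};

// Lazy, thread-safe OID -> type cache: nothing is fetched at connect time and
// each OID is resolved against pg_catalog at most once.
class LazyTypeCache {
 public:
  // loader runs the catalog lookup for a single OID (hypothetical helper that
  // would issue an OID-filtered version of the queries above).
  using Loader = std::function<std::optional<PostgresType>(uint32_t)>;

  explicit LazyTypeCache(Loader loader) : loader_(std::move(loader)) {}

  std::optional<PostgresType> Find(uint32_t oid) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto it = cache_.find(oid);
      if (it != cache_.end()) return it->second;
    }
    // Cache miss: hit the database outside the lock, then remember the result.
    std::optional<PostgresType> resolved = loader_(oid);
    if (resolved) {
      std::lock_guard<std::mutex> lock(mutex_);
      cache_.emplace(oid, *resolved);
    }
    return resolved;
  }

 private:
  Loader loader_;
  std::mutex mutex_;
  std::unordered_map<uint32_t, PostgresType> cache_;
};

An LRU eviction policy could later be layered on top of cache_ if memory usage becomes a concern, and a "resolve everything now" method could back the eager option mentioned above.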

lidavidm added this to the ADBC Libraries 13 milestone on May 10, 2024
mcrumiller changed the title from "Connecting to postgres (Greenplum) database very slow" to "Initial connection to postgres very slow when number of OIDs is large" on Jun 28, 2024
lidavidm removed this from the ADBC Libraries 15 milestone on Nov 5, 2024

lidavidm commented Nov 5, 2024

This is probably not being worked on anytime soon, unfortunately (at least from my end); contributions are welcome.

@paleolimbot (Member)

I am not sure I'll get to this soon, but I think some reasonable steps might be:

Add a virtual ArrowErrorCode Find(std::vector<uint32_t> oids, std::vector<PostgresType>* out, ArrowError* error) overload to look up more than one OID at a time:

ArrowErrorCode Find(uint32_t oid, PostgresType* type_out, ArrowError* error) const {
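
As a rough sketch only (the default-implementation idea and the use of nanoarrow's error macros here are assumptions, not existing driver code), the bulk overload could fall back to looping over the single-OID Find(), leaving a database-backed subclass free to override it with one catalog query:

// Hypothetical fallback implementation inside PostgresTypeResolver.
virtual ArrowErrorCode Find(std::vector<uint32_t> oids, std::vector<PostgresType>* out,
                            ArrowError* error) const {
  out->clear();
  out->reserve(oids.size());
  for (uint32_t oid : oids) {
    PostgresType type;
    NANOARROW_RETURN_NOT_OK(Find(oid, &type, error));
    out->push_back(std::move(type));
  }
  return NANOARROW_OK;
}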

...and wire up existing uses of Find() to use the "bulk" approach where possible. Then, somewhere around here:

// Helpers for building the type resolver from queries
static inline int32_t InsertPgAttributeResult(
    PGresult* result, const std::shared_ptr<PostgresTypeResolver>& resolver);
static inline int32_t InsertPgTypeResult(
    PGresult* result, const std::shared_ptr<PostgresTypeResolver>& resolver);

...make a concrete subclass of the PostgresTypeResolver that implements the "bulk" and non-bulk finds by connecting to the database and issuing a query (if needed). I think the query would be much like the existing one except filtering on OID and possibly issuing an additional query(queries?) if one of those types is a "record" type (because the record type won't insert unless its column types are also available in the cache). There's an example of issuing a query with a parameterized "any of these integer things is equal to" in the GetObjects helper:

// Parameterized on schema_name, relkind
// Note that when binding relkind as a string it must look like {"r", "v", ...}
// (i.e., double quotes). Binding a binary list<string> element also works.
static const char* kTablesQueryAll =
"SELECT c.relname, CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' "
"WHEN 'm' THEN 'materialized view' WHEN 't' THEN 'TOAST table' "
"WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' END "
"AS reltype FROM pg_catalog.pg_class c "
"LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace "
"WHERE pg_catalog.pg_table_is_visible(c.oid) AND n.nspname = $1 AND c.relkind = "
"ANY($2)";

Connecting to a database, issuing one or more queries, then disconnecting is obviously slow, so if we did this we might need to stay connected for some amount of time before disconnecting. Also, to avoid doing this at all when a user issues a query, we probably want to cache all the types in PostgresTypeIdAll(/*nested*/false).

static inline std::vector<PostgresTypeId> PostgresTypeIdAll(bool nested) {
std::vector<PostgresTypeId> base = {PostgresTypeId::kAclitem,
PostgresTypeId::kAnyarray,
PostgresTypeId::kAnycompatiblearray,
PostgresTypeId::kBit,
PostgresTypeId::kBool,
PostgresTypeId::kBox,
PostgresTypeId::kBpchar,
PostgresTypeId::kBrinBloomSummary,
PostgresTypeId::kBrinMinmaxMultiSummary,
PostgresTypeId::kBytea,
PostgresTypeId::kCash,
PostgresTypeId::kChar,
PostgresTypeId::kCidr,
PostgresTypeId::kCid,
PostgresTypeId::kCircle,
PostgresTypeId::kCstring,
PostgresTypeId::kDate,
PostgresTypeId::kEnum,
PostgresTypeId::kFloat4,
PostgresTypeId::kFloat8,
PostgresTypeId::kInet,
PostgresTypeId::kInt2,
PostgresTypeId::kInt2vector,
PostgresTypeId::kInt4,
PostgresTypeId::kInt8,
PostgresTypeId::kInterval,
PostgresTypeId::kJson,
PostgresTypeId::kJsonb,
PostgresTypeId::kJsonpath,
PostgresTypeId::kLine,
PostgresTypeId::kLseg,
PostgresTypeId::kMacaddr,
PostgresTypeId::kMacaddr8,
PostgresTypeId::kMultirange,
PostgresTypeId::kName,
PostgresTypeId::kNumeric,
PostgresTypeId::kOid,
PostgresTypeId::kOidvector,
PostgresTypeId::kPath,
PostgresTypeId::kPgNodeTree,
PostgresTypeId::kPgNdistinct,
PostgresTypeId::kPgDependencies,
PostgresTypeId::kPgLsn,
PostgresTypeId::kPgMcvList,
PostgresTypeId::kPgDdlCommand,
PostgresTypeId::kPgSnapshot,
PostgresTypeId::kPoint,
PostgresTypeId::kPoly,
PostgresTypeId::kRegclass,
PostgresTypeId::kRegcollation,
PostgresTypeId::kRegconfig,
PostgresTypeId::kRegdictionary,
PostgresTypeId::kRegnamespace,
PostgresTypeId::kRegoperator,
PostgresTypeId::kRegoper,
PostgresTypeId::kRegprocedure,
PostgresTypeId::kRegproc,
PostgresTypeId::kRegrole,
PostgresTypeId::kRegtype,
PostgresTypeId::kText,
PostgresTypeId::kTid,
PostgresTypeId::kTime,
PostgresTypeId::kTimestamp,
PostgresTypeId::kTimestamptz,
PostgresTypeId::kTimetz,
PostgresTypeId::kTsquery,
PostgresTypeId::kTsvector,
PostgresTypeId::kTxidSnapshot,
PostgresTypeId::kUnknown,
PostgresTypeId::kUuid,
PostgresTypeId::kVarbit,
PostgresTypeId::kVarchar,
PostgresTypeId::kVoid,
PostgresTypeId::kXid8,
PostgresTypeId::kXid,
PostgresTypeId::kXml};
if (nested) {
base.push_back(PostgresTypeId::kArray);
base.push_back(PostgresTypeId::kRecord);
base.push_back(PostgresTypeId::kRange);
base.push_back(PostgresTypeId::kDomain);
}
return base;
}
