Initial connection to postgres very slow when number of OIDs is large #1755

Open
mcrumiller opened this issue Apr 24, 2024 · 13 comments
Labels: Type: bug (Something isn't working)

mcrumiller commented Apr 24, 2024

What happened?

When I perform a dbapi.connect(uri) call to a Postgres 9.4.26 database (Greenplum 6.24.3), it takes upwards of two minutes to connect. Compare this to SQLAlchemy, which connects in about one second:

from time import perf_counter

import sqlalchemy as sa
from adbc_driver_postgresql import dbapi

uri = "..."

# using SQLAlchemy
t = perf_counter()
conn = sa.create_engine(uri).connect()
print(f"SQLAlchemy connected in {perf_counter() - t:0.2f}s.")

# verify result
print(conn.execute(sa.text("SELECT 1")).fetchall())
conn.close()

# using adbc
t = perf_counter()
conn = dbapi.connect(uri)  # super slow
print(f"ADBC connected in {perf_counter() - t:0.2f}s.")

# verify result
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchall())

Output:

SQLAlchemy connected in 1.31s.
[(1,)]
ADBC connected in 153.66s.
[(1,)]

My uri is of the form postgresql://username:password@server:port/database. Is there another connection setting I'm missing that's causing the huge connection delay?

Environment/Setup

Item             Version
OS               Windows 10 Enterprise
postgres driver  0.11.0
Postgres server  9.4.26
Greenplum        6.24.3
mcrumiller added the Type: bug label on Apr 24, 2024
@lidavidm (Member)

Hmm, the main thing I would suspect is that we have to read all the type OIDs on first connect so that we know what Arrow type to map them to later. I don't think this has ever come up as taking a long time, but perhaps this deployment or Greenplum makes this slower somehow? (Or the type list is simply very long?)

@lidavidm (Member)

Is there a way to spin up Greenplum/does this happen on a fresh install that we could poke at?

@mcrumiller (Author)

Hi @lidavidm, thanks for your help. I'm new to ADBC, so apologies that I'm not very well-versed here. I'm in a corporate environment, so unfortunately I don't have any admin access to the Greenplum server. We did just upgrade from 5.x to 6.x last night, so this is a pretty "fresh install"; however, I saw the same thing a few months ago on 5.x.

Is there anything I can do to help diagnose, perhaps export some logging or increase verbosity?

@lidavidm (Member)

How long do these queries take for you?

// We need a few queries to build the resolver. The current strategy might
// fail for some recursive definitions (e.g., arrays of records of arrays).
// First, one on the pg_attribute table to resolve column names/oids for
// record types.
const std::string kColumnsQuery = R"(
SELECT
    attrelid,
    attname,
    atttypid
FROM
    pg_catalog.pg_attribute
ORDER BY
    attrelid, attnum
)";
// Second, a query of the pg_type table. This query may need a few attempts to handle
// recursive definitions (e.g., record types with array column). This currently won't
// handle range types because those rows don't have child OID information. Arrays types
// are inserted after a successful insert of the element type.
const std::string kTypeQuery = R"(
SELECT
    oid,
    typname,
    typreceive,
    typbasetype,
    typarray,
    typrelid
FROM
    pg_catalog.pg_type
WHERE
    (typreceive != 0 OR typname = 'aclitem') AND typtype != 'r' AND typreceive::TEXT != 'array_recv'
ORDER BY
    oid
)";


mcrumiller commented Apr 24, 2024

First query: 6,667,495 rows in 97.72s.
Second query: 141,849 rows in 1.81s.

@mcrumiller (Author)

@lidavidm it appears that ADBC downloads the data type for every single column in the database, even when there are thousands of tables. Is there a way to defer this until query time and only fetch the data types of the needed columns?

@lidavidm (Member)

Yikes! Well, that would explain it. Thanks for the numbers. I'll think about whether we can avoid this query at all...

@lidavidm (Member)

@lidavidm it appears that ADBC downloads the data type for every single column in the database, even when there are thousands of tables. Is there a way to defer this until query time and only fetch the data types of the needed columns?

Well, that would add additional latency in front of each query. We could possibly do it lazily and cache the results, though.


davlee1972 commented Apr 25, 2024

Hmm, I got different results for my Greenplum test: 6 seconds for the first query.
I'm calling cursor.fetch_arrow_table() to pull results.

a.num_rows
3028922

total
datetime.timedelta(seconds=6, microseconds=607360)

b.num_rows
120072

total2
datetime.timedelta(microseconds=469822)

from datetime import datetime

from adbc_driver_postgresql import dbapi

uri = "..."  # same style of connection URI as in the original report
conn = dbapi.connect(uri)
c = conn.cursor()

start = datetime.now()
c.execute("""
SELECT 
     attrelid, 
     attname, 
     atttypid 
 FROM 
     pg_catalog.pg_attribute 
 ORDER BY 
     attrelid, attnum 
""")

a = c.fetch_arrow_table()

total = datetime.now() - start

start = datetime.now()
c.execute("""
SELECT 
     oid, 
     typname, 
     typreceive, 
     typbasetype, 
     typarray, 
     typrelid 
 FROM 
     pg_catalog.pg_type 
 WHERE 
     (typreceive != 0 OR typname = 'aclitem') AND typtype != 'r' AND typreceive::TEXT != 'array_recv' 
 ORDER BY 
     oid 
""")

b = c.fetch_arrow_table()

total2 = datetime.now() - start

@mcrumiller (Author)

@davlee1972 well sure, but the issue isn't that it's Greenplum (I included that information in case it was important); it's how many objects are defined in the database. The DB I use at work has a lot of tables and columns, and as a result the initial connection takes a really long time and probably consumes quite a bit of memory as well.

@lidavidm (Member)

Yeah, I'm currently thinking we can cache things up front when there aren't many types, and otherwise perhaps speculatively cache part of the info and load the rest on demand as queries are run. (And possibly provide a "please cache them all right now" control.) I'm not happy about the potential variance, but I suppose that's the downside of needing to know the types to parse the binary copy format.

As for memory usage, I think we can explore switching to an LRU cache or something along those lines later if it's a concern.
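
Not speaking for an eventual implementation, but to make the "lazy with a cache" idea concrete, a minimal sketch might look like this (PostgresType and the per-OID loader below are simplified stand-ins, not the driver's actual API):

#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Simplified stand-in for the driver's PostgresType.
struct PostgresType {
  uint32_t oid;
  std::string typname;
};

// Lazy, thread-safe OID -> type cache: nothing is fetched at connect time and
// each OID is resolved against pg_catalog at most once.
class LazyTypeCache {
 public:
  // loader runs the catalog lookup for a single OID (hypothetical helper that
  // would issue an OID-filtered version of the queries above).
  using Loader = std::function<std::optional<PostgresType>(uint32_t)>;

  explicit LazyTypeCache(Loader loader) : loader_(std::move(loader)) {}

  std::optional<PostgresType> Find(uint32_t oid) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto it = cache_.find(oid);
      if (it != cache_.end()) return it->second;
    }
    // Cache miss: hit the database outside the lock, then remember the result.
    std::optional<PostgresType> resolved = loader_(oid);
    if (resolved) {
      std::lock_guard<std::mutex> lock(mutex_);
      cache_.emplace(oid, *resolved);
    }
    return resolved;
  }

 private:
  Loader loader_;
  std::mutex mutex_;
  std::unordered_map<uint32_t, PostgresType> cache_;
};

An LRU eviction policy could later be layered on top of cache_ if memory usage becomes a concern, and a "resolve everything now" method could back the eager option mentioned above.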

lidavidm added this to the ADBC Libraries 13 milestone on May 10, 2024
mcrumiller changed the title from "Connecting to postgres (Greenplum) database very slow" to "Initial connection to postgres very slow when number of OIDs is large" on Jun 28, 2024
lidavidm removed this from the ADBC Libraries 15 milestone on Nov 5, 2024

lidavidm commented Nov 5, 2024

This is probably not being worked on anytime soon, unfortunately (at least from my end); contributions are welcome.

@paleolimbot (Member)

I am not sure I'll get to this soon, but I think some reasonable steps might be:

Add a virtual ArrowErrorCode Find(std::vector<uint32_t> oids, std::vector<PostgresType>* out, ArrowError* error) overload to look up more than one OID at a time:

ArrowErrorCode Find(uint32_t oid, PostgresType* type_out, ArrowError* error) const {
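
As a rough sketch only (the default-implementation idea and the use of nanoarrow's error macros here are assumptions, not existing driver code), the bulk overload could fall back to looping over the single-OID Find(), leaving a database-backed subclass free to override it with one catalog query:

// Hypothetical fallback implementation inside PostgresTypeResolver.
virtual ArrowErrorCode Find(std::vector<uint32_t> oids, std::vector<PostgresType>* out,
                            ArrowError* error) const {
  out->clear();
  out->reserve(oids.size());
  for (uint32_t oid : oids) {
    PostgresType type;
    NANOARROW_RETURN_NOT_OK(Find(oid, &type, error));
    out->push_back(std::move(type));
  }
  return NANOARROW_OK;
}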

...and wire up existing uses of Find() to use the "bulk" approach where possible. Then, somewhere around here:

// Helpers for building the type resolver from queries
static inline int32_t InsertPgAttributeResult(
    PGresult* result, const std::shared_ptr<PostgresTypeResolver>& resolver);
static inline int32_t InsertPgTypeResult(
    PGresult* result, const std::shared_ptr<PostgresTypeResolver>& resolver);

...make a concrete subclass of the PostgresTypeResolver that implements the "bulk" and non-bulk finds by connecting to the database and issuing a query (if needed). I think the query would be much like the existing one except filtering on OID and possibly issuing an additional query(queries?) if one of those types is a "record" type (because the record type won't insert unless its column types are also available in the cache). There's an example of issuing a query with a parameterized "any of these integer things is equal to" in the GetObjects helper:

// Parameterized on schema_name, relkind
// Note that when binding relkind as a string it must look like {"r", "v", ...}
// (i.e., double quotes). Binding a binary list<string> element also works.
static const char* kTablesQueryAll =
"SELECT c.relname, CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' "
"WHEN 'm' THEN 'materialized view' WHEN 't' THEN 'TOAST table' "
"WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' END "
"AS reltype FROM pg_catalog.pg_class c "
"LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace "
"WHERE pg_catalog.pg_table_is_visible(c.oid) AND n.nspname = $1 AND c.relkind = "
"ANY($2)";

Connecting to a database, issuing one or more queries, then disconnecting is obviously slow, so if we did this we might need to stay connected for some amount of time before disconnecting. Also, to avoid doing this at all when a user issues a query, we probably want to cache all the types in PostgresTypeIdAll(/*nested*/false).

static inline std::vector<PostgresTypeId> PostgresTypeIdAll(bool nested) {
std::vector<PostgresTypeId> base = {PostgresTypeId::kAclitem,
PostgresTypeId::kAnyarray,
PostgresTypeId::kAnycompatiblearray,
PostgresTypeId::kBit,
PostgresTypeId::kBool,
PostgresTypeId::kBox,
PostgresTypeId::kBpchar,
PostgresTypeId::kBrinBloomSummary,
PostgresTypeId::kBrinMinmaxMultiSummary,
PostgresTypeId::kBytea,
PostgresTypeId::kCash,
PostgresTypeId::kChar,
PostgresTypeId::kCidr,
PostgresTypeId::kCid,
PostgresTypeId::kCircle,
PostgresTypeId::kCstring,
PostgresTypeId::kDate,
PostgresTypeId::kEnum,
PostgresTypeId::kFloat4,
PostgresTypeId::kFloat8,
PostgresTypeId::kInet,
PostgresTypeId::kInt2,
PostgresTypeId::kInt2vector,
PostgresTypeId::kInt4,
PostgresTypeId::kInt8,
PostgresTypeId::kInterval,
PostgresTypeId::kJson,
PostgresTypeId::kJsonb,
PostgresTypeId::kJsonpath,
PostgresTypeId::kLine,
PostgresTypeId::kLseg,
PostgresTypeId::kMacaddr,
PostgresTypeId::kMacaddr8,
PostgresTypeId::kMultirange,
PostgresTypeId::kName,
PostgresTypeId::kNumeric,
PostgresTypeId::kOid,
PostgresTypeId::kOidvector,
PostgresTypeId::kPath,
PostgresTypeId::kPgNodeTree,
PostgresTypeId::kPgNdistinct,
PostgresTypeId::kPgDependencies,
PostgresTypeId::kPgLsn,
PostgresTypeId::kPgMcvList,
PostgresTypeId::kPgDdlCommand,
PostgresTypeId::kPgSnapshot,
PostgresTypeId::kPoint,
PostgresTypeId::kPoly,
PostgresTypeId::kRegclass,
PostgresTypeId::kRegcollation,
PostgresTypeId::kRegconfig,
PostgresTypeId::kRegdictionary,
PostgresTypeId::kRegnamespace,
PostgresTypeId::kRegoperator,
PostgresTypeId::kRegoper,
PostgresTypeId::kRegprocedure,
PostgresTypeId::kRegproc,
PostgresTypeId::kRegrole,
PostgresTypeId::kRegtype,
PostgresTypeId::kText,
PostgresTypeId::kTid,
PostgresTypeId::kTime,
PostgresTypeId::kTimestamp,
PostgresTypeId::kTimestamptz,
PostgresTypeId::kTimetz,
PostgresTypeId::kTsquery,
PostgresTypeId::kTsvector,
PostgresTypeId::kTxidSnapshot,
PostgresTypeId::kUnknown,
PostgresTypeId::kUuid,
PostgresTypeId::kVarbit,
PostgresTypeId::kVarchar,
PostgresTypeId::kVoid,
PostgresTypeId::kXid8,
PostgresTypeId::kXid,
PostgresTypeId::kXml};
if (nested) {
base.push_back(PostgresTypeId::kArray);
base.push_back(PostgresTypeId::kRecord);
base.push_back(PostgresTypeId::kRange);
base.push_back(PostgresTypeId::kDomain);
}
return base;
}
