Initial connection to postgres very slow when number of OIDs is large #1755
Comments
Hmm, the main thing I would suspect is that we have to read all the type OIDs on first connect so that we know what Arrow type to map them to later. I don't think this has ever come up as taking a long time, but perhaps this deployment or Greenplum makes this slower somehow? (Or the type list is simply very long?)
Is there a way to spin up Greenplum / does this happen on a fresh install that we could poke at?
Hi @lidavidm, thanks for your help. I'm new to ADBC, so apologies that I'm not very well-versed here. I'm in a corporate environment, so unfortunately I don't have any admin access to the Greenplum server. We did just upgrade from 5.x to 6.x last night, so this is a pretty "fresh install"; however, I saw the same thing a few months ago on 5.x. Is there anything I can do to help diagnose, perhaps export some logging or increase verbosity?
How long do these queries take for you? (See arrow-adbc/c/driver/postgresql/database.cc, lines 140 to 173 in 1129acc.)
First query: 6,667,495 rows in 97.72s.
@lidavidm it appears that ADBC downloads the data type for every single column in the database, even when there are thousands of tables. Is there a way to defer this until query time and fetch only the data types of the needed columns?
Yikes! Well, that would explain why. Thanks for the numbers. I'll think about whether we can avoid this query at all...
Well, that would add additional latency in front of each query. We could possibly do it lazily and cache, though.
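The lazy-plus-cache idea could be sketched like this (the actual driver is C++; the class and function names here are hypothetical, chosen only to show the shape of the approach):

```python
class LazyTypeCache:
    """Resolve type OIDs on demand and memoize the results.

    `lookup_fn` stands in for a round trip to the pg_type catalog; in
    the real driver this would be a SQL query filtered to the
    requested OIDs rather than a full scan of every type up front.
    """

    def __init__(self, lookup_fn):
        self._lookup = lookup_fn
        self._cache = {}  # oid -> type info

    def resolve(self, oid):
        if oid not in self._cache:
            # Only hit the database for OIDs we have not seen yet.
            self._cache[oid] = self._lookup(oid)
        return self._cache[oid]
```

With this shape, the first query that references a given type pays the lookup cost, and every later query reuses the cached entry, so the two-minute up-front scan disappears at the price of some per-query variance.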
Hmm, I got different results for my Greenplum test: 6 seconds for the first query.
@davlee1972 well sure, the issue isn't that it's Greenplum (I included that information in case it was important); it's how many objects are defined in the database. The DB I use at work has a lot of tables and columns, and as a result the initial connection takes a really long time, and probably consumes quite a bit of memory as well.
Yeah, I'm currently thinking we can cache things up front when there aren't many types, and otherwise perhaps speculatively cache part of the info and load the rest on demand as queries are run. (And possibly provide a "please cache them all right now" control.) I'm not happy about the potential variance, but I suppose that's the downside of needing to know the types to parse the binary copy format. As for memory usage, I think we can explore switching to an LRU cache or something along those lines later if it's a concern.
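If memory usage does become a concern, the cache could be bounded with LRU eviction. A minimal sketch using `OrderedDict` (again a hypothetical Python illustration of the idea, not the driver's implementation; the capacity value is arbitrary):

```python
from collections import OrderedDict

class LruTypeCache:
    """OID -> type-info cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, oid):
        if oid not in self._entries:
            return None
        self._entries.move_to_end(oid)  # mark as recently used
        return self._entries[oid]

    def put(self, oid, type_info):
        if oid in self._entries:
            self._entries.move_to_end(oid)
        self._entries[oid] = type_info
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the oldest entry
```

The trade-off is that an evicted type must be re-fetched on its next use, so capacity would want to comfortably exceed the number of distinct types a typical workload touches.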
This is probably not going to be worked on anytime soon, unfortunately (at least from my end); contributions are welcome.
I am not sure I'll get to this soon, but I think some reasonable steps might be: Add a
...and wire up existing uses of arrow-adbc/c/driver/postgresql/database.cc, lines 125 to 132 (in 9ac8f6c)
...make a concrete subclass of the PostgresTypeResolver that implements the "bulk" and non-bulk finds by connecting to the database and issuing a query (if needed). I think the query would be much like the existing one, except filtering on OID, and possibly issuing an additional query (or queries) if one of those types is a "record" type (because the record type won't insert unless its column types are also available in the cache). There's an example of issuing a query with a parameterized "any of these integer things is equal to" in the GetObjects helper (arrow-adbc/c/driver/postgresql/connection.cc, lines 67 to 77 in 9ac8f6c).
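The OID-filtered lookup described above could follow that same pattern: one parameterized query using PostgreSQL's `= ANY` array comparison so the whole OID list binds as a single parameter. A sketch of building such a query (the column list here is abbreviated and illustrative; the actual query in database.cc selects more fields):

```python
def build_type_lookup_query(oids):
    """Build a parameterized pg_type lookup restricted to specific OIDs.

    `= ANY($1)` lets the entire OID list bind as one array parameter
    instead of generating N separate placeholders.
    """
    query = (
        "SELECT t.oid, t.typname, t.typreceive "
        "FROM pg_catalog.pg_type AS t "
        "WHERE t.oid = ANY($1)"
    )
    params = [list(oids)]
    return query, params
```

For a "record" (composite) type, the resolver would then issue a follow-up query of the same shape for the member columns' OIDs before inserting the record type into the cache.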
Connecting to a database, issuing one or more queries, then disconnecting is obviously slow, so if we did this we might need to stay connected for some amount of time before disconnecting. Also, to avoid doing this at all when a user issues a query, we probably want to cache all the types in arrow-adbc/c/driver/postgresql/postgres_type.h (lines 948 to 1034 in 9ac8f6c).
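Pre-seeding the cache works because PostgreSQL's built-in type OIDs are fixed across installations, so they never need a round trip. A small illustrative subset (the driver's full table covers many more entries; `seed_cache` is a hypothetical helper name):

```python
# A few of PostgreSQL's fixed built-in type OIDs (defined in pg_type.dat).
BUILTIN_TYPE_OIDS = {
    16: "bool",
    20: "int8",
    23: "int4",
    25: "text",
    701: "float8",
    1082: "date",
    1114: "timestamp",
}

def seed_cache(cache):
    """Insert the built-in types so only user-defined OIDs need a query."""
    cache.update(BUILTIN_TYPE_OIDS)
```

With the built-ins seeded up front, the on-demand path only fires for user-defined types (enums, composites, domains), which is where the per-database variance actually lives.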
What happened?
When I perform a dbapi.connect(uri) call to a postgres 9.4.26 database (Greenplum 6.24.3), it takes upwards of two minutes to connect. Compare this to SQLAlchemy, which connects in about one second. My uri is of the form postgresql://username:password@server:port/database. Is there another connection setting I'm missing that's causing the huge connection delay?
Environment/Setup