Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADBC Python Postgres - Stuck connections to the database #1881

Open
gaspardc-met opened this issue May 22, 2024 · 11 comments
Open

ADBC Python Postgres - Stuck connections to the database #1881

gaspardc-met opened this issue May 22, 2024 · 11 comments
Labels
Type: bug Something isn't working

Comments

@gaspardc-met
Copy link

gaspardc-met commented May 22, 2024

What happened?

Context before the bug (working):

  • Postgres database on Kubernetes with several tables
  • 4 services (webapp, machine learning inference, and FastAPI backend APIs) deployed on kubernetes and fetching data from postgres
  • 1 service, data orchestrator, writing data to Postgres
  • Fetching data from PG with pd.read_sql from Pandas and a SQLalchemy engine
  • Been doing this for 1+ year without any Postgres issues

Switching to ADBC:

  • Following my upgrade to pandas >2.0.0 I wanted to switch to adbc_driver_postgresql's dbapi connection with pd.read_sql
  • Initial tests were great, it was faster than before
  • Deployed this to production on all aforementioned services twice (initially with connection caching, then no caching and properly closing each and every connection
  • Once again initially smooth, everything worked and was fast

Problem:

  • In both instances of the deployment, within ~12 hours, the connections would be stuck
  • Webapp or another service would create an ADBC connection, and run the sql query with pd.read_sql (know this through caching) and then wait indefinitely.
  • Reloading the webapp, clearing webapp cache, recreating the connection would do nothing at all
  • The log on the Postgres pod indicated a password issue with the current database/user, which never happened before
  • Both SQLalchemy and ADBC get the same postgres URI to create the engine/connection with
  • Reverting to SQLalchemy solved the problem, and the error has not been seen again

How can we reproduce the bug?

  • The given URI was "postgresql://{user}:{password}@{host}:{port}/{db}" formatted with the proper values
  • The function was used to create the ADBC connection:
def create_adbc_conn() -> Connection:
    logger_stdout.info(f"Creating a new ADBC connection at {pd.Timestamp.now()}.")
    uri = get_default_uri() # URI shown above, formatted
    connection = dbapi.connect(uri=uri)
    logger_stdout.info("ADBC connection created")
    return connection
  • The function to execute the SQL query was:
def handle_sql_query(
    sql: str,
    index_col: Optional[str] = None,
    connection: Optional[Connection] = None,
    need_to_close: bool = False,
) -> pd.DataFrame:
    if connection is None:
        logger_stdout.info(f"Connection is None, creating a new ADBC connection at {pd.Timestamp.now()}.")
        connection= create_adbc_conn()
        need_to_close = True
    try:
        logger_stdout.info("Executing SQL query with connection")
        return pd.read_sql_query(sql=sql, con=connection, index_col=index_col, parse_dates=[index_col])
    finally:
        if need_to_close:
            logger_stdout.info("Closing the ADBC connection.")
            connection.close()
  • The SQL queries ranged from select * from TABLE_NAME to selecting specific columns on a range of specific dates

Environment/Setup

python 3.11
pandas == 2.2.2
adbc_driver_postgresql==0.11.0
adbc-driver-manager==0.11.0

@gaspardc-met gaspardc-met added the Type: bug Something isn't working label May 22, 2024
@zeroshade
Copy link
Member

@kou @lidavidm any ideas on this one? I'm not familiar enough with the postgres driver (or postgres in general) to have an idea where the problem is

@lidavidm
Copy link
Member

Is it possible to share the log?

I think we'll have to set up a pod for 12 hours and see if we can reproduce at all...

@kou
Copy link
Member

kou commented May 23, 2024

Do you set timeout related parameters such as idle_session_timeout?
See https://www.postgresql.org/docs/current/runtime-config-client.html for other timeout related parameters?

Could you show SELECT * FROM pg_stat_activity when this problem is happen?
See also: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW

Do you pass connection to handle_sql_query()? Or is connection=None used?

@lidavidm
Copy link
Member

One other thing, if we want to try and reproduce this, what was the PostgreSQL version?

@gaspardc-met
Copy link
Author

Hello all,
thanks for the quick response.

@lidavidm :
Log entries

2024-05-20 14:05:17.660 GMT [521061] LOG:  could not receive data from client: Connection reset by peer
2024-05-20 14:13:31.910 GMT [521817] FATAL:  unsupported frontend protocol 0.0: server supports 3.0 to 3.0
2024-05-20 14:13:32.214 GMT [521818] FATAL:  unsupported frontend protocol 255.255: server supports 3.0 to 3.0
2024-05-20 14:13:32.518 GMT [521819] FATAL:  no PostgreSQL user name specified in startup packet
2024-05-20 14:15:07.995 GMT [521500] LOG:  could not receive data from client: Connection reset by peer
2024-05-20 15:26:08.819 GMT [528434] LOG:  invalid length of startup packet

The could not receive data from client: Connection reset by peer happened a lot and is standard, I think it's a connection not properly closed by the webapp (now I close every connection) and happen with SQLalchemy too.
The others are "new" and happened when the connection was stuck and not before

PG version:

  • psql --version: psql (PostgreSQL) 14.5
  • Docker image: docker.io/bitnami/postgresql:14.5.0-debian-11-r35

@kou :

  • I did not create the postgres pod, and I couldn't see any environment variables on this, so it might be just "vanilla" bitnami/postgresql
  • The problem happened both with passing a cached connection and creating a new connection. I tried both.

@gaspardc-met
Copy link
Author

@kou :

  • I will try to recreate the issue next week locally so I can share the SELECT * FROM pg_stat_activity then
  • However I did run the command right now and noticed these irregular entries from my first attempt at deploying with ADBC (17th of mai then 20th of mai)
    image

@lidavidm
Copy link
Member

Thanks. Interesting, there's occasional reference to errors like this elsewhere 1, but usually the suggestion is that something is port-scanning Postgres. That doesn't seem likely here. On the other hand, if the client were doing something wrong after a long time, restarting the client completely should presumably reset that. So instead it seems like something borks the server.

Just to clarify, though:

(1) Did you try restarting the Postgres server, too?
(2) Did you restart the Postgres server and then try with SQLAlchemy?
(3) If not, then it sounds like: using ADBC for a while borks the server, but using SQLAlchemy (without restarting the server) works, and it's unknown whether switching back to ADBC would fail or work?

@gaspardc-met
Copy link
Author

  1. Never touched the postgres server pod before, during or after the issue.
  2. Did not try restarting the server with ADBC
  3. Using SQLalchemy seems to have fixed the issue between and after 1st and 2nd attempt with ADBC
  4. I could probably just resume using ADBC right now without changing the previous code until the next issue

@lidavidm
Copy link
Member

lidavidm commented May 29, 2024

Thanks. I'll try to find time to set up a container and reproduce a setup like this.

@lidavidm
Copy link
Member

Sorry, it's looking like I won't have time to investigate this anytime soon. This is on my backlog and I do hope to get to it but any help here is welcome

@WillAyd
Copy link
Contributor

WillAyd commented Jul 3, 2024

Does it matter at all if you remove pandas and use the connection directly to parse the results? Its possible there is also something with pandas that is causing the problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants