
fix: Ensure connection pool metrics stay consistent #99

Merged 2 commits into importcjj:main on Oct 15, 2024

Conversation

Collaborator

SevInf commented Oct 8, 2024

This started as a bugfix for several Prisma issues regarding incorrect metrics:

prisma/prisma#25177
prisma/prisma#23525

And a couple more were discovered during testing, but not reported in the issues.
There were several causes for this:

  1. The following pattern appears quite a lot in the mobc code:

```rust
gauge!("something").increment(1.0);
do_a_thing_that_could_fail()?;
gauge!("something").decrement(1.0);
```

So, in case `do_a_thing_that_could_fail` actually fails, the gauge will get incremented but never decremented.

  2. A couple of metrics relied on `Conn::close` being manually called, and every once in a while that was not the case.

To prevent both of those problems, I rewrote the internals of the library to rely on RAII rather than manual counters and resource management.

The `Conn` struct is now split into two:

  • `ActiveConn` - represents a currently checked out connection that is being actively used by the client. It holds onto a semaphore permit and can be converted into `IdleConn`; doing so frees the permit.
  • `IdleConn` - represents an idle connection, currently checked into the pool. It can be converted to `ActiveConn` by providing a valid permit.

`ConnState` represents the shared state of the connection that is retained between the different activity states.

Both `IdleConn` and `ActiveConn` manage their corresponding gauges - incrementing them on creation and decrementing them on drop. `ConnState` manages the `CONNECTIONS_OPEN` gauge and the `CONNECTIONS_TOTAL` and `CLOSED_TOTAL` counters in the same way.

This system ensures that the metrics stay consistent: since they are automatically incremented and decremented on state conversions, we can always be sure that:

  • A connection is always either idle or active; there is no in-between state.
  • The idle connections and active connections gauges always add up to the currently open connections gauge.
  • The total connections opened counter minus the total connections closed counter is always equal to the number of currently open connections.

Since resources are now managed by `Drop::drop` implementations, there is no need for a manual `close` method, which simplifies the code in quite a few places and also makes it safer against future changes.
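
To make the split concrete, here is a minimal sketch of the two-state design, assuming the pool hands out tokio's `OwnedSemaphorePermit` and that the raw connection and field names look roughly like this (the gauge bookkeeping done by the real `Drop` impls is omitted):

```rust
use tokio::sync::OwnedSemaphorePermit;

// Hypothetical stand-ins for the real raw connection and shared state types.
struct RawConn;
struct ConnState;

/// A connection currently checked out by a client. Holding the permit keeps
/// one slot of the pool's capacity reserved for this connection.
struct ActiveConn {
    raw: RawConn,
    state: ConnState,
    _permit: OwnedSemaphorePermit,
}

/// A connection checked into the pool. It holds no permit.
struct IdleConn {
    raw: RawConn,
    state: ConnState,
}

impl ActiveConn {
    /// Checking the connection back in drops the permit, freeing a pool slot.
    fn into_idle(self) -> IdleConn {
        IdleConn { raw: self.raw, state: self.state }
    }
}

impl IdleConn {
    /// Checking a connection out is impossible without a permit.
    fn into_active(self, permit: OwnedSemaphorePermit) -> ActiveConn {
        ActiveConn { raw: self.raw, state: self.state, _permit: permit }
    }
}
```

In the PR, these conversions are also where the idle/active gauges get incremented and decremented, which is what keeps them consistent by construction.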

@@ -706,14 +706,6 @@ fn test_max_idle_lifetime() {
drop(v);
delay_for(Duration::from_millis(800)).await;

let mut v = vec![];
Collaborator Author

This assertion was incorrect - the connection lifetime is 1 second; by this point 1800ms have passed, but the connections were not freed for some reason.

SevInf added a commit to prisma/prisma-engines that referenced this pull request Oct 8, 2024
mobc update includes fixes from importcjj/mobc#99
The rest of the PR updates to the new API of the `metrics` crate.

Close prisma/team-orm#1317
  max_lifetime_closed: AtomicU64,
- max_idle_closed: AtomicU64,
+ max_idle_closed: Arc<AtomicU64>,

Why are atomics wrapped in an Arc?
`Arc<T>` is `Send + Sync` as long as `T: Send + Sync`, and `AtomicU64` is already `Send + Sync`.
What do we need the Arc for?


I think I was wrong.

From https://doc.rust-lang.org/std/sync/atomic/:

Atomic variables are safe to share between threads (they implement Sync) but they do not themselves provide the mechanism for sharing and follow the threading model of Rust. The most common way to share an atomic variable is to put it into an Arc (an atomically-reference-counted shared pointer).

Collaborator Author

Yep, exactly. Those two need to be shared because `drop` for `ConnState` needs to increment one and decrement the other.

Collaborator

@jkomyno atomics are safely shareable (and you can even safely mutate them concurrently through a shared reference), but they are not magically shared just by cloning them (that would violate the idea of ownership); you need some kind of reference or pointer to be able to access the same location in memory from multiple places. If you don't want an `Arc` here, the only other option is for `ConnState` to borrow from the `Pool`, and that is not necessarily practical.
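
A tiny illustration of that point (not code from this PR): two `Arc` handles observe the same atomic, which is exactly what the shared counters need.

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

fn main() {
    let counter = Arc::new(AtomicU64::new(0));
    let handle = Arc::clone(&counter);

    // Both handles point at the same memory location, so an update made
    // through one is visible through the other.
    handle.fetch_add(1, Ordering::Relaxed);
    assert_eq!(counter.load(Ordering::Relaxed), 1);
}
```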

Collaborator

aqrln left a comment

beautiful

Comment on lines +445 to +446
internals.wait_duration += wait_guard.elapsed();
drop(wait_guard);
Collaborator

aqrln commented Oct 10, 2024

A bit of an edge case and I'm not sure how critical it is here, but if the thread is preempted by the kernel between these two statements, the counters in `internals` and the histogram may diverge a bit. It could be especially sensitive when running on a cloud VM with something like a 100m CPU quota.

Collaborator Author

True. I think instead of a manual drop we can add a method to `HistogramGuard` that takes ownership of `self` and returns the elapsed time.

Collaborator

Yep, `internals.wait_duration += wait_guard.into_elapsed()` sounds great.
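
A minimal sketch of what `into_elapsed` could look like, assuming the guard holds a `metrics` histogram handle and a start time (the real fields in mobc may differ):

```rust
use std::time::{Duration, Instant};

use metrics::Histogram;

// Illustrative guard; the field layout is an assumption, not mobc's code.
struct HistogramGuard {
    histogram: Histogram,
    start: Instant,
}

impl HistogramGuard {
    fn start(histogram: Histogram) -> Self {
        Self { histogram, start: Instant::now() }
    }

    /// Measures the elapsed time exactly once, records it to the histogram
    /// and returns the same value, so the histogram and any counter fed from
    /// the return value cannot diverge.
    fn into_elapsed(self) -> Duration {
        let elapsed = self.start.elapsed();
        self.histogram.record(elapsed.as_secs_f64());
        elapsed
    }
}
```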


impl Drop for ConnState {
    fn drop(&mut self) {
        self.total_connections_open.fetch_sub(1, Ordering::Relaxed);
Collaborator

Is Relaxed sound here and no happens-before relationship is necessary, or should the loads of PoolState::num_open form an acquire-release pair with this store?

Collaborator Author

I've just moved the existing code around, but I agree with you here; acquire-release makes more sense in this case.
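
For illustration, the pairing under discussion looks roughly like this (the static and function names are invented for the example; the real counter lives in the pool's shared state):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static TOTAL_CONNECTIONS_OPEN: AtomicU64 = AtomicU64::new(0);

// The decrement performed when a connection's state is dropped uses Release...
fn on_conn_state_drop() {
    TOTAL_CONNECTIONS_OPEN.fetch_sub(1, Ordering::Release);
}

// ...so a reader that observes the decremented value with an Acquire load also
// sees every write that happened before the connection was torn down.
fn num_open() -> u64 {
    TOTAL_CONNECTIONS_OPEN.load(Ordering::Acquire)
}
```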

Co-authored-by: Alexey Orlenko <[email protected]>
SevInf added a commit to prisma/prisma-engines that referenced this pull request Oct 10, 2024
mobc update includes fixes from importcjj/mobc#99
The rest of the PR updates to the new API of the `metrics` crate.

Close prisma/team-orm#1317
Collaborator

garrensmith left a comment

Hey @SevInf

Thanks for the PR. There are a lot of changes here, and I haven't looked at this code for a while, so I'm not entirely sure about a few things.

So, in case `do_a_thing_that_could_fail` actually fails, the gauge will get incremented but never decremented.

Can you give an example of this? Why is adding the `GaugeGuard` your solution to this?

A couple of metrics relied on `Conn::close` being manually called, and every once in a while that was not the case.

What causes `Conn::close` not to be called?

I don't understand why there are two connection states; I can't see what the difference is between them.

The errors in metrics you are trying to fix make sense. Are you 100% certain this fixes them?
Have you measured the performance impact this has on the library?
I know you say it makes the code simpler, but I don't see how; it seems to add a lot more to it.

Sorry for the many questions, I'm just trying to get some context here.

};

let shared = Arc::new(SharedPool {
    config: share_config,
    manager,
    internals,
-   semaphore: Semaphore::new(max_open),
+   semaphore: Arc::new(Semaphore::new(max_open)),
Collaborator

Why do you need the `Arc` around the `Semaphore`?
What are the performance implications of this?

Collaborator Author

`Arc` is needed to get an owned permit; that method exists only on `Arc<Self>`.
No noticeable negative performance implications; if anything, mobc's own test suite consistently runs 10-15s faster on my machine after my changes.
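
For reference, this is the shape of the tokio API being relied on (a sketch, not mobc's code):

```rust
use std::sync::Arc;

use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// `acquire_owned` takes `self: Arc<Semaphore>`; the returned permit keeps the
// semaphore alive and is 'static, so a struct such as ActiveConn can store it
// without borrowing from the pool.
async fn check_out_permit(semaphore: Arc<Semaphore>) -> OwnedSemaphorePermit {
    semaphore
        .acquire_owned()
        .await
        .expect("the pool semaphore should not be closed")
}
```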

Self {
    inner,
    state,
    _permit: permit,
Collaborator

Why do you keep the permit instead of forgetting it? I can't see you using it anywhere.

Collaborator Author

If I forget it, I will then need to add it back manually when the connection drops. If I keep it on the struct, it automatically gets returned when the struct drops: we don't need to think about doing manual management, and we can't get it wrong in case the struct drops earlier than we expected.

Collaborator Author

SevInf commented Oct 14, 2024

Can you give an example of this? Why is adding the `GaugeGuard` your solution to this?

For example:

  1. an increment
  2. a decrement
  3. a `?` between the two that returns early and skips the decrement in case of an error

What causes `Conn::close` not to be called?

Generally, `Conn::close` is called only when a connection is closed "normally" after a timeout or a failed healthcheck; it might not get called in case of an "abnormal" error.

Why is adding the `GaugeGuard` your solution to this?

`GaugeGuard` decrements the gauge on drop, and drop is automatically called in case of an early return like in the example above. This way it is guaranteed that every increment has exactly one corresponding decrement, and that the decrement will always be called.
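
A minimal sketch of that idea (illustrative only; the constructor name and the fallible helper are made up, not the exact code from this PR):

```rust
use metrics::{gauge, Gauge};

// Increments a gauge when created and decrements it again when dropped,
// including on early returns via `?` and on panics.
struct GaugeGuard {
    gauge: Gauge,
}

impl GaugeGuard {
    fn increment(gauge: Gauge) -> Self {
        gauge.increment(1.0);
        Self { gauge }
    }
}

impl Drop for GaugeGuard {
    fn drop(&mut self) {
        self.gauge.decrement(1.0);
    }
}

// Stand-in for whatever fallible operation sits between the increment and
// the decrement in the real code.
fn do_a_thing_that_could_fail() -> Result<(), std::io::Error> {
    Ok(())
}

fn do_something() -> Result<(), std::io::Error> {
    let _guard = GaugeGuard::increment(gauge!("something"));
    do_a_thing_that_could_fail()?; // the guard still decrements on this early return
    Ok(())
}
```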

I don't understand why there are two connection states; I can't see what the difference is between them.

It's a safety mechanism to ensure connections are used correctly when checked in and out of the pool. For example, the type system now guarantees that:

  • it is impossible to check a connection out of the pool without a semaphore permit
  • it is impossible to check a connection in and out and mess up the gauges of currently open/idle connections

The errors in metrics you are trying to fix make sense. Are you 100% certain this fixes them?

Not 100% certain, but I went from being able to reproduce them consistently to not being able to reproduce them at all (autocannon with 10k connections against a test service).

Have you measured the performance impact this has on the library?

mobc's own test suite consistently runs faster for me after my changes.

I know you say it makes the code simpler, but I don't see how; it seems to add a lot more to it.

Sure, it's more code, but resource management no longer relies on us calling the right cleanup method at the right time; Rust does it for us in a safer and less brittle way.

garrensmith merged commit 04eb7b2 into importcjj:main on Oct 15, 2024
1 check passed
@garrensmith
Collaborator

Awesome, thanks for the work and explanations.

@garrensmith
Collaborator

@SevInf you should have commit access and be able to push a release.

Collaborator Author

SevInf commented Oct 15, 2024

Thank you @garrensmith!
Could you maybe also add @aqrln? This is my last week at Prisma, and I don't think I'll be working on mobc that actively in the future.

Collaborator Author

SevInf commented Oct 15, 2024

Also, @garrensmith, I do have access on GitHub but not on crates.io, so I can't publish a release right now.

SevInf added a commit to prisma/prisma-engines that referenced this pull request Oct 16, 2024
mobc update includes fixes from importcjj/mobc#99
The rest of the PR updates to the new API of the `metrics` crate.

Close prisma/team-orm#1317
SevInf added a commit to prisma/prisma-engines that referenced this pull request Oct 16, 2024
mobc update includes fixes from importcjj/mobc#99
The rest of the PR updates to the new API of the `metrics` crate.

Close prisma/team-orm#1317
aqrln pushed a commit to prisma/prisma-engines that referenced this pull request Oct 21, 2024
mobc update includes fixes from importcjj/mobc#99
The rest of the PR updates to the new API of the `metrics` crate.

Close prisma/team-orm#1317
aqrln pushed a commit to prisma/prisma-engines that referenced this pull request Oct 23, 2024
mobc update includes fixes from importcjj/mobc#99
The rest of the PR updates to the new API of the `metrics` crate.

Close prisma/team-orm#1317
tmm1 added a commit to anysphere/prisma-engines that referenced this pull request Oct 26, 2024
Contributor

tmm1 commented Oct 27, 2024

FYI, when building prisma-engines 5.14.0 with mobc 0.8.5 I'm not getting any values for most metrics.

Collaborator

aqrln commented Oct 27, 2024

@tmm1 yeah, that's a known issue in Prisma (one problem was masking another); it will be fixed in 5.22.0.

Collaborator

aqrln commented Oct 29, 2024

@tmm1 actually, on second thought, it wouldn't have led to not getting any values at all. In your case it's probably just that the version of the `metrics` crate in prisma-engines 5.14.0 is much older than the one used in mobc, so you now have two versions of `metrics` in the dependency tree, and the one metrics are written to is not the same as the one metrics are read from.

Contributor

tmm1 commented Oct 29, 2024

Good catch! Can confirm things work as expected with 5.14.0 + prisma/prisma-engines#5015
