fix: Ensure connection pool metrics stay consistent #99
Conversation
This started as a bugfix for several Prisma issues regarding incorrect metrics:

- prisma/prisma#25177
- prisma/prisma#23525

And a couple more, discovered during testing but not reported in the issues.

There were several causes for this:

1. The following pattern appears quite a lot in mobc's code:

   ```rust
   gauge!("something").increment(1.0);
   do_a_thing_that_could_fail()?;
   gauge!("something").decrement(1.0);
   ```

   So, in case `do_a_thing_that_could_fail` actually fails, the gauge gets incremented but never decremented.

2. A couple of metrics were relying on `Conn::close` being called manually, and that was not the case every once in a while.

To prevent both of those problems, I rewrote the internals of the library to rely on RAII rather than manual counters and resource management. The `Conn` struct is now split in two:

- `ActiveConn` - represents a currently checked out connection that is being actively used by the client. Holds onto a semaphore permit and can be converted into `IdleConn`; doing so frees the permit.
- `IdleConn` - represents an idle connection, currently checked into the pool. Can be converted to `ActiveConn` by providing a valid permit.

`ConnState` represents the shared state of the connection that is retained between the different activity states.

Both `IdleConn` and `ActiveConn` manage their corresponding gauges: they increment them on creation and decrement them on drop. `ConnState` manages the `CONNECTIONS_OPEN` gauge and the `CONNECTIONS_TOTAL` and `CLOSED_TOTAL` counters in the same way.

This system ensures that metrics stay consistent: since metrics are automatically incremented and decremented on state conversions, we can always be sure that:

- A connection is always either idle or active; there is no in-between state.
- The idle connections and active connections gauges always add up to the currently open connections gauge.
- The total connections opened counter minus the total connections closed counter always equals the number of currently open connections.

Since resources are now managed by `Drop::drop` implementations, this removes the need for a manual `close` method and simplifies the code in quite a few places, also making it safer against future changes.
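The review below mentions a `GaugeGuard` type as the fix for the increment/decrement pattern above. Its implementation isn't shown in this thread, so here is a minimal sketch of the RAII idea, assuming the `metrics` crate's handle API (0.22+); `checkout` and `do_a_thing_that_could_fail` are placeholder names:

```rust
use metrics::{gauge, Gauge};

/// Sketch of an RAII gauge guard (illustrative, not mobc's actual code):
/// the gauge is incremented on construction and decremented on drop, so
/// an early `?` return can no longer leave it permanently inflated.
struct GaugeGuard {
    gauge: Gauge,
}

impl GaugeGuard {
    fn increment(gauge: Gauge) -> Self {
        gauge.increment(1.0);
        Self { gauge }
    }
}

impl Drop for GaugeGuard {
    fn drop(&mut self) {
        self.gauge.decrement(1.0);
    }
}

fn checkout() -> Result<(), std::io::Error> {
    let _guard = GaugeGuard::increment(gauge!("something"));
    do_a_thing_that_could_fail()?; // hypothetical fallible step
    Ok(())
} // `_guard` drops here on the success *and* the error path alike

fn do_a_thing_that_could_fail() -> Result<(), std::io::Error> {
    Ok(())
}
```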
```rust
@@ -706,14 +706,6 @@ fn test_max_idle_lifetime() {
    drop(v);
    delay_for(Duration::from_millis(800)).await;

    let mut v = vec![];
```
This assertion was incorrect: the connection lifetime is 1 sec, and by that point 1800ms had passed, but the connections were not freed for some reason.
mobc update includes fixes from importcjj/mobc#99. The rest of the PR is an update to the new API of the `metrics` crate. Close prisma/team-orm#1317
```diff
  max_lifetime_closed: AtomicU64,
- max_idle_closed: AtomicU64,
+ max_idle_closed: Arc<AtomicU64>,
```
Why are the atomics wrapped in an `Arc`? `Arc<T>` is `Send + Sync` as long as `T: Send + Sync`, and `AtomicU64` is already `Send + Sync`. What do we need the `Arc` for?
I think I was wrong. From https://doc.rust-lang.org/std/sync/atomic/:

> Atomic variables are safe to share between threads (they implement Sync) but they do not themselves provide the mechanism for sharing and follow the threading model of Rust. The most common way to share an atomic variable is to put it into an Arc (an atomically-reference-counted shared pointer).
Yep, exactly. Those two need to be shared because `drop` for `ConnState` needs to increment one and decrement another.
@jkomyno atomics are safely shareable (and you can even safely concurrently mutate them through a shared reference), but they are not magically shared just by cloning them (that would violate the idea of ownership); you need some kind of a reference or a pointer to be able to access the same location in memory from multiple places. If you don't want an `Arc` here, the only other option is for `ConnState` to borrow from the `Pool`, and that is not necessarily practical.
beautiful
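To make the ownership point concrete, here is a minimal, self-contained illustration (not mobc's actual code) of why a shared counter needs an `Arc`: the atomic can be mutated through a shared reference, but two independent owners each need their own handle to the same allocation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // One counter, two owners: the clone of the Arc is what ConnState
    // would hold, while the pool keeps the original.
    let open = Arc::new(AtomicU64::new(0));
    let for_state = Arc::clone(&open);

    let handle = thread::spawn(move || {
        // Safe concurrent mutation through a shared reference.
        for_state.fetch_add(1, Ordering::Relaxed);
    });

    handle.join().unwrap();
    assert_eq!(open.load(Ordering::Relaxed), 1);
}
```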
```rust
internals.wait_duration += wait_guard.elapsed();
drop(wait_guard);
```
A bit of an edge case and I'm not sure how critical it is here, but if the thread is preempted by the kernel between these two statements, the counters in `internals` and the histogram may diverge a bit. Could be especially sensitive to running on a cloud VM with something like 100m CPU quota.
True. I think instead of a manual drop we can add a method to `HistogramGuard` that takes ownership of `self` and returns the elapsed time.
Yep, `internals.wait_duration += wait_guard.into_elapsed()` sounds great.
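A hypothetical shape of that guard, as a sketch only (`HistogramGuard`'s internals and `into_elapsed` are not shown in this thread; the `metrics::Histogram` handle API is assumed):

```rust
use std::time::{Duration, Instant};
use metrics::Histogram;

/// Sketch: `into_elapsed` consumes the guard and records exactly once,
/// so the counter and the histogram are fed the same measurement.
struct HistogramGuard {
    start: Instant,
    histogram: Option<Histogram>, // `take`n when we record
}

impl HistogramGuard {
    fn new(histogram: Histogram) -> Self {
        Self { start: Instant::now(), histogram: Some(histogram) }
    }

    /// Consume the guard, record the elapsed time, and hand the same
    /// value back to the caller (`internals.wait_duration += ...`).
    fn into_elapsed(mut self) -> Duration {
        let elapsed = self.start.elapsed();
        if let Some(h) = self.histogram.take() {
            h.record(elapsed.as_secs_f64());
        }
        elapsed
    }
}

impl Drop for HistogramGuard {
    fn drop(&mut self) {
        // Fallback for guards dropped without `into_elapsed`.
        if let Some(h) = self.histogram.take() {
            h.record(self.start.elapsed().as_secs_f64());
        }
    }
}
```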
```rust
impl Drop for ConnState {
    fn drop(&mut self) {
        self.total_connections_open.fetch_sub(1, Ordering::Relaxed);
```
Is `Relaxed` sound here, and no happens-before relationship is necessary, or should the loads of `PoolState::num_open` form an acquire-release pair with this store?
I've just moved existing code around, but I agree with you here: acquire-release makes more sense in this case.
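For illustration, a sketch of the acquire-release pairing being discussed, with hypothetical free functions standing in for the pool code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// `Release` on the store side and `Acquire` on the load side establish
// a happens-before edge: a reader that sees the decremented count also
// sees every write the dropping thread made before it. With `Relaxed`
// the counter value itself would still be correct, but no ordering of
// the surrounding memory would be guaranteed.
fn on_conn_drop(open: &AtomicU64) {
    open.fetch_sub(1, Ordering::Release);
}

fn num_open(open: &AtomicU64) -> u64 {
    open.load(Ordering::Acquire)
}
```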
Co-authored-by: Alexey Orlenko <[email protected]>
Hey @SevInf, thanks for the PR. There are a lot of changes here and I haven't looked at this code for a while, so I'm not entirely sure about a few things.

> So, in case `do_a_thing_that_could_fail` actually fails, the gauge gets incremented but never decremented.

Can you give an example of this? Why is adding the `GaugeGuard` your solution to it?

> A couple of metrics were relying on `Conn::close` being called manually, and that was not the case every once in a while.

What causes `Conn::close` not to be called?

I don't understand why there are two connection states; I can't see what the difference is between them.

The error in the metrics you are trying to fix makes sense. Are you 100% certain this fixes them?

Have you measured what performance impact this has on the library?

I know you say it makes the code simpler, but I don't see how; it seems to add a lot more to it.

Sorry for all the questions, I'm just trying to get some context here.
```diff
  };

  let shared = Arc::new(SharedPool {
      config: share_config,
      manager,
      internals,
-     semaphore: Semaphore::new(max_open),
+     semaphore: Arc::new(Semaphore::new(max_open)),
```
Why do you need the `Arc` around the `Semaphore`? What are the performance implications of this?
`Arc` is needed to get an owned permit; that method exists only on `Arc<Self>`.

No noticeable negative performance implications; if anything, mobc's own test suite consistently runs 10-15s faster on my machine after my changes.
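For context, this matches tokio's semaphore API, where `acquire_owned` is defined on `Arc<Semaphore>` and yields a permit that owns its slot (a sketch, assuming mobc uses tokio's `Semaphore` as the diff suggests):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(10));

    // `acquire_owned` takes `self: Arc<Semaphore>`, so it only exists
    // behind an Arc. The returned OwnedSemaphorePermit is 'static and
    // can be stored in a struct instead of borrowing from the pool.
    let permit = semaphore.clone().acquire_owned().await.unwrap();

    // ... the connection is checked out while the permit is alive ...

    drop(permit); // capacity returns to the semaphore here
}
```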
```rust
Self {
    inner,
    state,
    _permit: permit,
```
Why do you keep the permit instead of forgetting it? I can't see you using it anywhere.
If I forget it, I will then need to add it back manually when the connection drops. If I keep it on the struct, it automatically gets returned when the struct drops: we don't need to think about manual management, and we can't get it wrong in case the struct drops earlier than we thought (see the sketch after this reply).

Generally, it is just a safety mechanism to ensure connections are used correctly when checked in and out of the pool. For example, the type system now guarantees that a connection can only be checked out of the pool while holding a valid semaphore permit.

Not 100% certain, but I went from being able to reproduce them consistently to not being able to reproduce them at all (autocannon with 10k connections on a test service).

mobc's own test suite consistently runs faster for me after my changes.

Sure, it's more code, but resource management no longer relies on us calling the right cleanup method at the right time; Rust does it for us in a safer and less brittle way.
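A compressed sketch of that pattern (field names follow the diff above; the tokio permit type is an assumption):

```rust
use tokio::sync::OwnedSemaphorePermit;

// The permit is never read; it exists purely so its Drop impl returns
// capacity to the semaphore at exactly the moment the connection is
// dropped - no manual bookkeeping to forget.
struct ActiveConn<C> {
    inner: C,
    _permit: OwnedSemaphorePermit,
}
```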
Awesome, thanks for the work and explanations.

@SevInf you should have commit access and be able to push a release.

Thank you @garrensmith!

Also @garrensmith, I do have access on GitHub but not on crates.io, so I can't publish a release right now.
FYI when building prisma-engines 5.14.0 w/ mobc 0.8.5 I'm not getting any values for most metrics.

@tmm1 yeah, that's a known issue in Prisma (one problem was masking another); it will be fixed in 5.22.0.

@tmm1 actually, on second thought, it wouldn't have led to not getting any values at all. In your case it's probably just because the version of the

Good catch! Can confirm things work as expected with 5.14.0 + prisma/prisma-engines#5015.