-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Record Metrics for Reminder #4831
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Vyom Yadav <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bunch of comments, mostly aiming to help simplify your life. 😁
// For tracking average calculation | ||
// TODO: consider persisting this to avoid reset on restart (maybe) | ||
totalBatches int64 | ||
totalReminders int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than storing these as int64s, why not use an Int64Counter
? That will automatically record the total, but also that the metric is cumulative over time.
// Current number of reminders in the batch | ||
BatchSize metric.Int64Gauge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use an Int64Histogram
for this, which will also give you the Average metric you're tracking next (approximately, but with more-detailed frequency information, which is probably interesting).
0, // immediate | ||
10, // 10 seconds | ||
20, // 20 seconds | ||
40, // 40 seconds | ||
80, // 1m 20s | ||
160, // 2m 40s | ||
320, // 5m 20s | ||
640, // 10m 40s | ||
1280, // 21m 20s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While powers of 2 are generally handy, I'd tend to align this more with human units, e.g.:
0, 10, 30, 60, 120, 300, 600, 900, 1800, 3600, 7200
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what frequency we'll run reminder
at, but 2 hours seems reasonably pessimistic for now -- it's really annoying when you run into problems and your frequency distribution has always been "it's not so good".
Given that our expected value for latency is 1/2 scan interval, I could even be convinced that < 1 minute is not particularly interesting (which could trim a couple buckets if you're worried about it -- I'm not!).
// Update running average | ||
m.totalBatches++ | ||
m.totalReminders += size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should aim to get the metrics code to do this for you automatically -- in particular, if we end up with two reminder
processes running, getting a global average from reminder_avg_batch_size
is impossible without knowing the numerator and denominator for each process. Exporting them as two cumulative totals allows aggregation across multiple processes.
remMetrics.SendDelay.Record(ctx, (time.Now().Sub(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds()) | ||
} else { | ||
// TODO: Should the send delay be zero if the reminder has never been sent? | ||
remMetrics.SendDelay.Record(ctx, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will produce a bunch of +Inf
values in the histogram bucket, which will disturb any sort of Nth percentile.
It's probably better to either export nothing here, or export it under some different counter metric ("new reminders").
otel.SetMeterProvider(mp) | ||
|
||
// Create metrics | ||
meter := mp.Meter("reminder-service") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meter := mp.Meter("reminder-service") | |
meter := mp.Meter("reminder") |
} | ||
|
||
// Shutdown gracefully shuts down the metrics server | ||
func (p *Provider) Shutdown(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make this method private, then cancelling the context to Start
or NewProvider
becomes the only way to reach this code, which simplifies your lifecycle over having multiple callers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(among other things, you can guarantee that p
is not nil
, because you don't call the shutdown
method unless p
is non-nil.)
func (p *Provider) Metrics() *Metrics { | ||
return p.metrics | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main suggestion is to remove the Provider
object in favor of a static setup function with cancellation, but if you do keep it, I'd suggesting making Metrics
exported, rather than needing to add an accessor function.
@@ -143,6 +163,11 @@ func (r *reminder) sendReminders(ctx context.Context) error { | |||
return fmt.Errorf("error creating reminder messages: %w", err) | |||
} | |||
|
|||
remMetrics := r.metricsProvider.Metrics() | |||
if remMetrics != nil { | |||
remMetrics.RecordBatch(ctx, int64(len(repos))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make batches a histogram, this can just be a Record()
on that histogram.
func (m *Metrics) RecordSendDelay(ctx context.Context, delaySeconds float64) { | ||
m.SendDelay.Record(ctx, delaySeconds) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this function isn't doing anything -- you're already exposing SendDelay
, and there's not much semantic benefit to foo.RecordSendDelay(...)
over foo.SendDelay.Record(...)
. That also saves you from needing to write a comment.
If you transform BatchSize
into a histogram as well, then you might be able to get rid of all the methods on Metrics
, which would be nice.
Summary
Provide a brief overview of the changes and the issue being addressed.
Explain the rationale and any background necessary for understanding the changes.
List dependencies required by this change, if any.
Fixes #(related issue)
Change Type
Mark the type of change your PR introduces:
Testing
Outline how the changes were tested, including steps to reproduce and any relevant configurations.
Attach screenshots if helpful.
Review Checklist: