
Uneven job scheduling in rate-limited grouped queues #32

Open
x4lldux opened this issue Aug 17, 2019 · 3 comments

x4lldux commented Aug 17, 2019

Queues that are individually rate-limited and also share a rate-limited group are not scheduled evenly. The first queue always suffers.

The expected behaviour is for the queues to be scheduled fairly in a group rate limit scenario.

Details

I've created a demo project showing this behaviour: the x4lldux/jobs_group_queues repo (an Elixir project, sorry).
Running Aggregator.test(n_queues, n_procs) creates a group rate limit (hardcoded to 5 req/s in this demo), n_queues rate-limited queues (hardcoded to 3 req/s each) and n_procs processes per queue. Queues are numbered 0..n_queues-1. Each process asks its respective queue for permission and, when granted, sends a message to the Aggregator gen_server.
After all processes have sent their messages (this takes approximately n_procs seconds), a summary list is returned, where each element is a tuple containing a timestamp, a tuple with the number of grants/messages sent for each queue up to that point in time, and a list of the queues that were granted a slot during that second.
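
For reference, a rough sketch of what an equivalent setup might look like directly against the jobs API, in Erlang. The names, limits and exact option syntax here are my own guesses and are not copied from the demo repo:

    %% One shared group rate of 5 req/s, plus five queues each limited to 3 req/s.
    ok = jobs:add_group_rate(group_limit, [{limit, 5}]),
    lists:foreach(
      fun(N) ->
              QName = list_to_atom("q" ++ integer_to_list(N)),
              ok = jobs:add_queue(QName,
                                  [{regulators, [{rate, [{limit, 3}]},
                                                 {group_rate, group_limit}]}])
      end, lists:seq(0, 4)).
    %% Each worker process then calls jobs:ask(QName) and, once granted,
    %% reports to the aggregator.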

This creates 5 queues and 30 processes per queue:

iex(1)> Aggregator.test(5, 30)
[
  {{10, 47, 40}, {1, 0, 1, 1, 1}, [2, 4, 3, 0]},
  {{10, 47, 41}, {1, 1, 2, 3, 2}, [3, 1, 2, 3, 4]},
  {{10, 47, 42}, {1, 3, 4, 3, 3}, [2, 4, 1, 2, 1]},
  {{10, 47, 43}, {1, 3, 6, 5, 4}, [2, 3, 2, 4, 3]},
  {{10, 47, 44}, {1, 5, 7, 6, 5}, [4, 2, 1, 3, 1]},
  {{10, 47, 45}, {1, 6, 9, 7, 6}, [3, 1, 2, 4, 2]},
  {{10, 47, 46}, {1, 7, 11, 8, 7}, [3, 2, 4, 1, 2]},
  {{10, 47, 47}, {1, 8, 13, 9, 8}, [4, 2, 1, 3, 2]},
  {{10, 47, 48}, {1, 8, 15, 11, 9}, [4, 2, 3, 2, 3]},
  {{10, 47, 49}, {1, 9, 16, 13, 10}, [4, 1, 3, 2, 3]},
  {{10, 47, 50}, {1, 10, 17, 14, 12}, [4, 3, 4, 2, 1]},
  {{10, 47, 51}, {1, 11, 18, 16, 13}, [3, 1, 4, 2, 3]},
  {{10, 47, 52}, {2, 12, 20, 17, 13}, [1, 2, 3, 0, 2]},
  {{10, 47, 53}, {2, 12, 21, 20, 14}, [3, 2, 3, 4, 3]},
  {{10, 47, 54}, {2, 13, 21, 22, 16}, [3, 4, 1, 3, 4]},
  {{10, 47, 55}, {2, 14, 23, 22, 18}, [4, 2, 4, 2, 1]},
  {{10, 47, 56}, {2, 15, 25, 24, 18}, [3, 2, 3, 2, 1]},
  {{10, 47, 57}, {2, 16, 27, 25, 19}, [2, 3, 1, 4, 2]},
  {{10, 47, 58}, {2, 19, 27, 26, 20}, [1, 4, 1, 3, 1]},
  {{10, 47, 59}, {2, 20, 29, 27, 21}, [2, 4, 1, 3, 2]},
  {{10, 48, 0}, {2, 21, 30, 30, 21}, [3, 1, 3, 2, 3]},
  {{10, 48, 1}, {3, 23, 30, 30, 23}, [1, 4, 1, 4, 0]},
  {{10, 48, 2}, {3, 25, 30, 30, 26}, [4, 1, 4, 1, 4]},
  {{10, 48, 3}, {3, 28, 30, 30, 28}, [1, 4, 1, 4, 1]},
  {{10, 48, 4}, {4, 30, 30, 30, 30}, [0, 1, 4, 1, 4]},
  {{10, 48, 5}, {7, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 6}, {10, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 7}, {13, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 8}, {16, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 9}, {19, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 10}, {22, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 11}, {25, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 12}, {28, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 13}, {30, 30, 30, 30, 30}, [0, 0]}
]

In this example, the first queue is rarely scheduled until most of the other queues are done. The last 10 seconds are spent solely scheduling the first queue because all the others have already finished.

uwiger (Owner) commented Aug 19, 2019

Interesting. One thing to try might be to randomize each timer interval. Jobs doesn't depend on the timers firing exactly on time; it calculates, for each actual interval, how many jobs to allow.

I don't have time to try this right now, but if you feel up to it, please feel free.

x4lldux (Author) commented Aug 19, 2019

Any idea where to start?

uwiger (Owner) commented Aug 19, 2019

Well, perhaps the simplest way to start would be to write a callback for check_interval.

https://github.com/uwiger/jobs/blob/master/src/jobs_server.erl#L1745

This would also give an indication of where to try to fix the jobs_server code. While the code is a bit hairy, it basically uses a fixed interval for the next timer and adjusts for timer drift. It also allows the max_time value to override (e.g. long) intervals to ensure that timeout-based rejections are delivered promptly.

But if you create your queues with {check_interval, {M, F, Args}}, the jobs_server will call M:F(CurrentTimestamp, LastDispatchTimestamp, Args), which is expected to return an integer. Note that this function will run in the jobs_server process, so taking a long time or crashing will be very bad.
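
As a concrete, hypothetical example, a minimal callback module along those lines might look like the sketch below; the module name jitter_interval and its arguments are made up, and the exact units of the timestamps and of the returned integer should be verified against jobs_server.erl:

    -module(jitter_interval).
    -export([next/3]).

    %% Called by jobs_server as M:F(CurrentTimestamp, LastDispatchTimestamp, Args).
    %% Runs inside the jobs_server process, so it must be fast and must not crash.
    next(_Now, _LastDispatch, {BaseMs, JitterMs}) ->
        %% Base interval plus 1..JitterMs of random jitter.
        BaseMs + rand:uniform(JitterMs).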

Things to consider: if the interval is too short compared to the expected rate, precision may suffer. If the interval is too long, jobs may end up waiting unnecessarily before being dispatched. But basically, if you introduce sufficient jitter in the intervals, it might be possible to avoid systematic starvation of a particular queue (assuming the starvation is caused by too much determinism in the dispatch).
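
Plugging such a callback into a queue definition could then look roughly like this (again only a sketch, assuming the queue-level {check_interval, {M, F, Args}} option described above and the hypothetical jitter_interval module from the previous comment):

    jobs:add_queue(q0,
                   [{regulators, [{rate, [{limit, 3}]},
                                  {group_rate, group_limit}]},
                    {check_interval, {jitter_interval, next, {333, 50}}}]).
    %% ~333 ms roughly matches a 3 req/s rate; the 50 ms jitter amount is
    %% arbitrary and would need tuning against the precision/latency
    %% trade-off mentioned above.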
