
Uneven job scheduling in rate-limited grouped queues #32

Open
x4lldux opened this issue Aug 17, 2019 · 3 comments

x4lldux commented Aug 17, 2019

Queues that are individually rate-limited and also share a rate-limited group are not scheduled evenly. The first queue always suffers.

The expected behaviour is for the queues to be scheduled fairly in a group rate limit scenario.

Details

I've created a demo project showing this behaviour: the x4lldux/jobs_group_queues repo (an Elixir project, sorry).
Running Aggregator.test(n_queues, n_procs) creates a group rate limit (hardcoded to 5 req/s in this demo), n_queues rate-limited queues (hardcoded to 3 req/s each) and n_procs processes per queue. Queues are numbered 0..n_queues-1. Each process asks its respective queue for permission and, when granted, sends a message to the Aggregator gen_server.
After all processes have sent their messages (this takes approximately n_procs seconds), a summary list is returned, where each element is a tuple containing a timestamp, a tuple with the number of grants/messages sent for each queue up to that point in time, and a list of the queues that were granted a slot during that second.
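
For reference, a rough sketch of what an equivalent setup might look like directly against the jobs API, in Erlang. The names, limits and exact option syntax here are my own guesses and are not copied from the demo repo:

    %% One shared group rate of 5 req/s, plus five queues each limited to 3 req/s.
    ok = jobs:add_group_rate(group_limit, [{limit, 5}]),
    lists:foreach(
      fun(N) ->
              QName = list_to_atom("q" ++ integer_to_list(N)),
              ok = jobs:add_queue(QName,
                                  [{regulators, [{rate, [{limit, 3}]},
                                                 {group_rate, group_limit}]}])
      end, lists:seq(0, 4)).
    %% Each worker process then calls jobs:ask(QName) and, once granted,
    %% reports to the aggregator.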

This creates 5 queues and 30 processes per queue:

iex(1)> Aggregator.test(5, 30)
[
  {{10, 47, 40}, {1, 0, 1, 1, 1}, [2, 4, 3, 0]},
  {{10, 47, 41}, {1, 1, 2, 3, 2}, [3, 1, 2, 3, 4]},
  {{10, 47, 42}, {1, 3, 4, 3, 3}, [2, 4, 1, 2, 1]},
  {{10, 47, 43}, {1, 3, 6, 5, 4}, [2, 3, 2, 4, 3]},
  {{10, 47, 44}, {1, 5, 7, 6, 5}, [4, 2, 1, 3, 1]},
  {{10, 47, 45}, {1, 6, 9, 7, 6}, [3, 1, 2, 4, 2]},
  {{10, 47, 46}, {1, 7, 11, 8, 7}, [3, 2, 4, 1, 2]},
  {{10, 47, 47}, {1, 8, 13, 9, 8}, [4, 2, 1, 3, 2]},
  {{10, 47, 48}, {1, 8, 15, 11, 9}, [4, 2, 3, 2, 3]},
  {{10, 47, 49}, {1, 9, 16, 13, 10}, [4, 1, 3, 2, 3]},
  {{10, 47, 50}, {1, 10, 17, 14, 12}, [4, 3, 4, 2, 1]},
  {{10, 47, 51}, {1, 11, 18, 16, 13}, [3, 1, 4, 2, 3]},
  {{10, 47, 52}, {2, 12, 20, 17, 13}, [1, 2, 3, 0, 2]},
  {{10, 47, 53}, {2, 12, 21, 20, 14}, [3, 2, 3, 4, 3]},
  {{10, 47, 54}, {2, 13, 21, 22, 16}, [3, 4, 1, 3, 4]},
  {{10, 47, 55}, {2, 14, 23, 22, 18}, [4, 2, 4, 2, 1]},
  {{10, 47, 56}, {2, 15, 25, 24, 18}, [3, 2, 3, 2, 1]},
  {{10, 47, 57}, {2, 16, 27, 25, 19}, [2, 3, 1, 4, 2]},
  {{10, 47, 58}, {2, 19, 27, 26, 20}, [1, 4, 1, 3, 1]},
  {{10, 47, 59}, {2, 20, 29, 27, 21}, [2, 4, 1, 3, 2]},
  {{10, 48, 0}, {2, 21, 30, 30, 21}, [3, 1, 3, 2, 3]},
  {{10, 48, 1}, {3, 23, 30, 30, 23}, [1, 4, 1, 4, 0]},
  {{10, 48, 2}, {3, 25, 30, 30, 26}, [4, 1, 4, 1, 4]},
  {{10, 48, 3}, {3, 28, 30, 30, 28}, [1, 4, 1, 4, 1]},
  {{10, 48, 4}, {4, 30, 30, 30, 30}, [0, 1, 4, 1, 4]},
  {{10, 48, 5}, {7, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 6}, {10, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 7}, {13, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 8}, {16, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 9}, {19, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 10}, {22, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 11}, {25, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 12}, {28, 30, 30, 30, 30}, [0, 0, 0]},
  {{10, 48, 13}, {30, 30, 30, 30, 30}, [0, 0]}
]

In this example, the first queue is rarely scheduled until most of the other queues are done. The last 10 seconds are spent solely scheduling the first queue because all the others have already finished.

uwiger (Owner) commented Aug 19, 2019

Interesting. One thing to try might be to randomize each timer interval. Jobs doesn't depend on the timers firing exactly on time; it calculates, for each actual interval, how many jobs to allow.

I don't have time to try this right now, but if you feel up to it, please feel free.

x4lldux (Author) commented Aug 19, 2019

Any idea where to start?

uwiger (Owner) commented Aug 19, 2019

Well, perhaps the simplest way to start would be to write a callback for check_interval.

https://github.com/uwiger/jobs/blob/master/src/jobs_server.erl#L1745

This would also give an indication of where to try to fix the jobs_server code. While the code is a bit hairy, it basically uses a fixed interval for the next timer and adjusts for timer drift. It also allows the max_time value to override (e.g. long) intervals to ensure that timeout-based rejections are delivered promptly.

But if you create your queues with {check_interval, {M, F, Args}}, the jobs_server will call M:F(CurrentTimestamp, LastDispatchTimestamp, Args), which is expected to return an integer. Note that this function will run in the jobs_server process, so taking a long time or crashing will be very bad.
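
As a concrete, hypothetical example, a minimal callback module along those lines might look like the sketch below; the module name jitter_interval and its arguments are made up, and the exact units of the timestamps and of the returned integer should be verified against jobs_server.erl:

    -module(jitter_interval).
    -export([next/3]).

    %% Called by jobs_server as M:F(CurrentTimestamp, LastDispatchTimestamp, Args).
    %% Runs inside the jobs_server process, so it must be fast and must not crash.
    next(_Now, _LastDispatch, {BaseMs, JitterMs}) ->
        %% Base interval plus 1..JitterMs of random jitter.
        BaseMs + rand:uniform(JitterMs).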

Things to consider: if the interval is too short compared to the expected rate, precision may suffer. If the interval is too long, jobs may end up waiting unnecessarily before being dispatched. But basically, if you introduce sufficient jitter in the intervals, it might be possible to avoid systematic starvation of a particular queue (assuming the starvation is caused by too much determinism in the dispatch).
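
Plugging such a callback into a queue definition could then look roughly like this (again only a sketch, assuming the queue-level {check_interval, {M, F, Args}} option described above and the hypothetical jitter_interval module from the previous comment):

    jobs:add_queue(q0,
                   [{regulators, [{rate, [{limit, 3}]},
                                  {group_rate, group_limit}]},
                    {check_interval, {jitter_interval, next, {333, 50}}}]).
    %% ~333 ms roughly matches a 3 req/s rate; the 50 ms jitter amount is
    %% arbitrary and would need tuning against the precision/latency
    %% trade-off mentioned above.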
