Deadlock detection #182

Open
njsmith opened this issue May 30, 2017 · 3 comments

njsmith commented May 30, 2017

Our Lock objects enforce the constraint that they can only be released by the same task that acquired them. We might (probably should) add a version of Semaphore that enforces the same constraint (see discussion in #156). CapacityLimit objects are like Semaphores that enforce the same constraint. We should take advantage of these constraints to provide better error detection. In particular:

Detecting never-released locks

This is pretty trivial, yet useful: if a Task exits while holding a Lock, then this is a bug (that Lock can never be released), and we should alert the user. (I guess technically it's not a bug if this is the only Task that knows about the Lock, but I don't think requiring people to issue a spurious release in this corner case is too onerous if it helps catch errors in general.) This would be pretty straightforward – just keep a record of the lock-or-lock-like objects held by each Task, and then when a task exits, if this set is non-empty, whine loudly.
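
A minimal sketch of that bookkeeping (the names here are made up, not trio's actual internals):

held_locks = {}  # task -> set of lock-or-lock-like objects it currently holds

def note_acquired(task, lock):
    held_locks.setdefault(task, set()).add(lock)

def note_released(task, lock):
    held_locks.get(task, set()).discard(lock)

def check_on_task_exit(task):
    leaked = held_locks.pop(task, set())
    if leaked:
        # "whine loudly" -- could equally be an exception, see below
        import warnings
        warnings.warn(f"task {task!r} exited while holding: {leaked!r}")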

Not entirely sure if this means "print a warning" or "raise an exception". There's also the option of forcibly releasing the lock, though I'm dubious about the wisdom of carrying on once your distributed system has lost internal consistency... maybe poison the lock so anyone who waits on it gets an exception, and if people are following good Erlang style then they'll all get rebooted and all may be well?

Full-fledged deadlock detection

We can also detect classic deadlocks by tracing the wait-for graph.

This scope would be limited, e.g. we can't detect deadlocks caused by two Tasks that are using two Queues for communication and are both blocked, because we have no way to know that the task blocked over there is the only one that can send on this Queue. Maybe we should! (E.g., by having the concept of binding a Queue to a Task so that only that Task can send on it.) But even if all we can detect are mutex-based deadlocks, that's still potentially very useful!

There are a lot of possible variants on this. We don't want to add too much overhead; I don't know if tracing the wait-for graph on every contended acquire is fast enough to be acceptable. (It worked for these people, but that's a pretty different context.) It's probably a reasonable thing to try first, though. (One can also do stuff like trigger a trace only after a task has been blocked for N seconds. Though in practice keeping track of this on every contended acquire might be just as expensive as tracing the graph immediately, if the graph trace usually finishes almost immediately.)
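
For concreteness, here's roughly what the graph walk could look like, assuming (hypothetically) that we can ask each blocked task which lock it's waiting on and ask each lock which task holds it:

def find_cycle(start_task, waiting_on, holder_of):
    # Follow task -> lock it's waiting on -> task holding that lock -> ...
    # If we come back to a task we've already seen, that's a deadlock cycle.
    seen = []
    task = start_task
    while task is not None and task not in seen:
        seen.append(task)
        lock = waiting_on(task)   # None if the task isn't blocked on a lock
        if lock is None:
            return None
        task = holder_of(lock)    # None if the lock is (somehow) unheld
    return None if task is None else seen[seen.index(task):]

Run from the acquiring task on a contended acquire, the walk is usually only a hop or two, which is part of why doing it eagerly might be affordable.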

Another thing that can be useful to detect is when one Task acquires A-then-B while another Task acquires B-then-A. This by itself doesn't prove that there's a deadlock, but it's certainly suspicious, and can help catch latent bugs before they hit.
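
Detecting that inversion is mostly bookkeeping too; a sketch (again with invented names), recording every "held A, then acquired B" pair and complaining if the reverse pair ever shows up:

observed_order = set()  # (lock we already held, lock we then acquired) pairs

def note_nested_acquire(already_held, new_lock):
    import warnings
    for held in already_held:
        if (new_lock, held) in observed_order:
            warnings.warn(f"inconsistent lock order: {held!r} vs {new_lock!r}")
        observed_order.add((held, new_lock))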

Again, the basic data needed for all of these is the ability to look at a Task and get a list of all the lock and lock-like objects it's waiting on.


njsmith commented May 30, 2017

This library also has an extremely simple feature: print a stacktrace whenever taking a lock takes too long, period. Which is surprisingly reasonable...
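
In trio terms, that feature could be approximated with a wrapper like this (just a sketch, not anything trio provides):

import traceback
import trio

async def acquire_or_complain(lock, warn_after=10):
    # Like lock.acquire(), but dump a stack trace if the acquire stalls.
    with trio.move_on_after(warn_after) as scope:
        await lock.acquire()
    if scope.cancelled_caught:
        print(f"still waiting for {lock!r} after {warn_after} seconds, at:")
        traceback.print_stack()
        await lock.acquire()  # go back to waiting now that we've complained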


njsmith commented Jun 18, 2017

In working on #181 (and #156, sorta), I realized that threads make this a bit more complicated. In particular, here's the problem I ran into: in run_in_worker_thread, the task that's about to start the thread needs to acquire a token, and then the token has to be released after the thread finishes... which might not be until after the original task has abandoned this and gone off to do something else! So the task releasing the token may not be the same task that acquired it.

Conceptually, I think the way we would want to model this for deadlock-detection purposes is:

  • tasks and threads are both first-class schedulable entities in the waits-for graph
  • a task executing run_in_worker_thread is waiting-for the thread
  • the thread owns the capacity token
  • a thread executing await_in_trio_thread is waiting-for the task that it spawned

So... I was initially thinking in terms of complicated APIs that let you transfer "ownership" of a lock/token to another task, by materializing it in an object and then passing that object over etc., but this suggests something much simpler. Maybe all we need is like limit.acquire_on_behalf_of(thread_id), limit.release_on_behalf_of(thread_id), where thread_id is an opaque object that will eventually be hooked into the deadlock detection system (or a Task of course). So run_in_worker_thread would do something like:

thread_id = new_thread_id()
# While blocked here, this task is *waiting-for* the limit
# But afterwards, the thread_id *holds* the limit
await limit.acquire_on_behalf_of(thread_id)
spawn_thread_with_id(...)

(If we start re-using threads, #6, then the "thread id" here would become more like a "job id", i.e. it refers to a particular chunk of work done by a thread corresponding to a single call to run_in_worker_thread, not that thread over all time.)

And eventually we'd want to stash the thread_id in a thread-local inside the worker thread, so that {run,await}_in_trio_thread can get at it for deadlock detection.
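
Concretely, the worker side of that might look something like the following sketch, where worker, limiter, and run_in_trio_thread are placeholders for whatever the real plumbing ends up being:

import threading

_local = threading.local()

def worker(thread_id, limiter, run_in_trio_thread, fn, *args):
    _local.thread_id = thread_id  # so {run,await}_in_trio_thread can find it
    try:
        return fn(*args)
    finally:
        # the thread (not the original task) hands the token back
        run_in_trio_thread(limiter.release_on_behalf_of, thread_id)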

njsmith added a commit to njsmith/trio that referenced this issue Jun 18, 2017
- New synchronization primitive: CapacityLimiter. Like a Semaphore but
  more specialized. See python-triogh-182 for rationale.

- Add limiter= argument to run_in_worker_thread, that allows one to
  correctly (modulo
  python-trio#6 (comment))
  control the number of active threads.

- Added new function current_default_worker_thread_limiter(), which
  creates or returns a run-local CapacityLimiter, and made
  run_in_worker_thread use it by default when no other limiter= is
  given.

Closes: python-triogh-10, python-triogh-57, python-triogh-156

smurfix commented Jun 9, 2020

While #1085 will take care of global deadlock detection, real-life code probably needs some more intelligence than that; otherwise no server will ever register as deadlocked (its accept call is not going to be affected by the deadlock, and there's also the systemd keepalive task).

I don't think that any timeout-based solution is sufficient, but some "this lock stays locked for more than N seconds" instrumentation (restarting the timer when it's immediately re-acquired by a task that's waiting on it) on selected Locks might be a good first step that should be implementable without much overhead.
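
Something along these lines might be enough for a first cut (a polling sketch; hooking the lock itself would be needed to restart the timer on an immediate re-acquire):

import trio

async def watch_lock(lock, max_held=30, poll=1):
    # Background task: complain whenever `lock` has stayed locked for more
    # than `max_held` seconds.  Run it in the same nursery as the real work.
    locked_since = None
    while True:
        if lock.locked():
            if locked_since is None:
                locked_since = trio.current_time()
            elif trio.current_time() - locked_since > max_held:
                print(f"{lock!r} has been locked for over {max_held} seconds")
                locked_since = trio.current_time()  # don't repeat every poll
        else:
            locked_since = None
        await trio.sleep(poll)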
