
Workers can die when processing large tasks #60

Open · Simon-van-Diepen opened this issue Jul 28, 2023 · 0 comments

When a task consists of more than 50,000 jobs, the scheduler can lose its connection to some of its workers, after which those workers are cancelled. The relevant part of the Slurm output:

2023-07-28 12:11:08,636 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:11:08,744 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.26s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:11:38,205 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:11:38,582 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.61s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:12:07,805 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.90s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:12:07,833 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:12:38,548 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:12:38,736 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.34s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:13:08,342 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:13:08,463 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:13:57,666 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:13:57,985 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.19s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:16:10,940 - distributed.core - INFO - Removing comms to tcp://10.0.3.104:33549
2023-07-28 12:16:44,880 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)
2023-07-28 12:16:44,891 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.92s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:17:14,247 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)
2023-07-28 12:17:14,319 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.42s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:17:39,094 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)
2023-07-28 12:17:39,192 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.26s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
slurmstepd: error: *** JOB 5888554 ON wn-ca-02 CANCELLED AT 2023-07-28T12:25:56 ***
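
The log above suggests that long garbage-collection pauses are blocking each worker's event loop for several seconds at a time, long enough for the scheduler to consider the worker unresponsive. Below is a minimal sketch of one possible mitigation, assuming the work is submitted through dask.distributed: relax the comm timeouts and the scheduler's worker TTL so that a worker stalled by GC for a few seconds is not dropped. The config keys are standard dask.distributed options; the values are illustrative assumptions, not settings verified against this issue.

```python
import dask

# Sketch of a possible mitigation (assumed values, not verified against this issue):
# give workers more slack before the scheduler treats them as unreachable.
dask.config.set({
    # Time allowed to establish a connection, and time a TCP comm may stay
    # silent before it is considered failed (defaults are typically 30s).
    "distributed.comm.timeouts.connect": "90s",
    "distributed.comm.timeouts.tcp": "90s",
    # How long the scheduler tolerates a worker without a heartbeat before
    # removing it.
    "distributed.scheduler.worker-ttl": "10 minutes",
})

# The SLURM cluster / Client would then be created as usual (e.g. via
# dask_jobqueue.SLURMCluster) and inherit these settings.
```

Whether this helps depends on how severe the GC pauses are; reducing per-task memory pressure (for example by batching the 50,000+ jobs into fewer, larger tasks) may be the more robust fix.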
Simon-van-Diepen changed the title from "Workers die when processing large tasks" to "Workers can die when processing large tasks" on Jul 28, 2023