When the task graph consists of more than 50,000 jobs, the scheduler can lose its connection to some of its workers, after which those workers are cancelled. The relevant part of the SLURM output:
2023-07-28 12:11:08,636 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:11:08,744 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.26s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:11:38,205 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:11:38,582 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.61s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:12:07,805 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.90s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:12:07,833 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:12:38,548 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:12:38,736 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.34s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:13:08,342 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:13:08,463 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.04s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:13:57,666 - distributed.utils_perf - WARNING - full garbage collections took 16% CPU time recently (threshold: 10%)
2023-07-28 12:13:57,985 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.19s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:16:10,940 - distributed.core - INFO - Removing comms to tcp://10.0.3.104:33549
2023-07-28 12:16:44,880 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)
2023-07-28 12:16:44,891 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.92s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:17:14,247 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)
2023-07-28 12:17:14,319 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.42s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2023-07-28 12:17:39,094 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)
2023-07-28 12:17:39,192 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.26s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
slurmstepd: error: *** JOB 5888554 ON wn-ca-02 CANCELLED AT 2023-07-28T12:25:56 ***
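A minimal sketch of the pattern that triggers this, assuming a dask-jobqueue SLURMCluster; the cluster parameters and the squaring task are illustrative placeholders, not the actual configuration or workload used here:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Hypothetical cluster settings for illustration only.
cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=10)
client = Client(cluster)

# Submitting >50,000 small tasks; with graphs this large the scheduler
# sometimes drops its connection to workers, which SLURM then cancels.
futures = client.map(lambda x: x ** 2, range(60_000))
results = client.gather(futures)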
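The log suggests the event loop is blocked for roughly 4-6 s at a time by garbage collection, which may be what makes the workers look dead to the scheduler. One possible mitigation (an assumption, not a confirmed fix for this issue) is to relax the distributed timeouts and worker TTL so that pauses of that length are tolerated:

import dask

# Assumption: GC pauses of ~5 s are tripping liveness checks; these are
# standard distributed config keys, with values chosen speculatively.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
    "distributed.scheduler.worker-ttl": "5 minutes",
})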
Simon-van-Diepen changed the title from "Workers die when processing large tasks" to "Workers can die when processing large tasks" on Jul 28, 2023.