Delay in recording a task as succeeded #5144
-
Thank you so much for such a mature product. We have been using Nextflow recently on GCP and getting more familiar with it. On an nf-core/cutandrun pipeline run that we did on a GCP Cloud Workstation, we noticed that there were many delays between the Google Cloud Batch job succeeding and the task being registered as succeeded by Nextflow. For example, the two consecutive lines below from the log:
Events like this resulted in our pipeline taking ~13 hours. When I ran the same pipeline on a regular GCE VM instead of a Cloud Workstation, it took 4 hours. At first glance, monitoring logs show that CPU and memory utilization were fairly low on the machine during this time and no one was using the machine at all. What could be causing this on our run? Just looking for hints so I can debug this further on my own.
-
It is hard to say without more in-depth details. How many jobs is your pipeline running? How many CPUs and how much memory does the instance running Nextflow have? Is the pipeline uploading/downloading large data? The full log would also help.
-
Thank you for your help. Some information is attached below. I don't have the resource monitoring graphs anymore, but the controller machine was an n2-standard-16 with nothing else running on it; CPU, memory, and network utilization were all low throughout the run.

[1] Log file.
[2] Screenshot of GetTask method traffic during the job run.
[3] Zoomed-in view of GetTask method traffic for the time window covered by the two log lines posted in the original post.

If it is not obvious, then please leave it. It may be that this machine was going to sleep or something similar; I am not quite sure how the autosleep behavior works in Google Cloud Workstations. I can bring this up again if it happens more regularly; I just wanted to check if someone knew something already.
-
Any chance you could also upload the log for one execution not showing the problem?
-
Hi, yes, thank you for looking into this. Attached are logs from two runs, both run at the same time on the same VM (different directories, but same workflow and data). The only difference between them is that one was run with Fusion and one without.
-
Thanks for the suggestion. I tried it and launched two jobs, one with and one without Java virtual threads; the same thing happened in both. These problematic jobs are run from terminals inside OSS Code (the open-source build of VS Code). I have now verified that the VM is not sleeping; rather, the jobs themselves stall and don't resume unless the terminal window gains focus again. I think this problem is specific to the terminal in which I am launching my jobs. I will investigate further and report here for posterity, but it looks like it is my problem to solve. My observation:

```
% date
Wed Jul 17 21:46:29 UTC 2024
% tail -n1 .nextflow.log
Jul-17 21:08:20.691 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `NFCORE_CUTANDRUN:CUTANDRUN:PREPARE_GENOME:TABIX_BGZIPTABIX (gencode.v45.annotation.bed)` - terminated job=nf-6e4f30a7-1721250381925; task=0; state=SUCCEEDED
% ps 10104
  PID TTY      STAT   TIME COMMAND
10104 pts/0    Sl+    1:59 [REDACTED]/jvm/bin/java ...
% sudo strace -p 10104
strace: Process 10104 attached
futex(0x7ee399d16910, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 10150, NULL, FUTEX_BITSET_MATCH_ANY
```
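One thing I want to rule in or out while investigating is whether a stalled reader by itself can park a writer thread indefinitely. Below is a minimal, self-contained Java sketch I put together (nothing Nextflow-specific; the class name, the 8 KiB buffer size, and the use of a `PipedOutputStream` as a stand-in for the PTS are all just for illustration). The writer makes progress until the pipe's buffer fills and then blocks inside the write, which looks a lot like the indefinite wait strace shows above:

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class FullBufferDemo {
    public static void main(String[] args) throws Exception {
        // An 8 KiB pipe whose read end is never drained stands in for a PTS
        // whose reader (the disconnected browser window) has stopped consuming output.
        PipedInputStream unreadEnd = new PipedInputStream(8 * 1024);
        PipedOutputStream term = new PipedOutputStream(unreadEnd);

        Thread writer = new Thread(() -> {
            byte[] chunk = new byte[1024];
            try {
                for (int i = 0; ; i++) {
                    term.write(chunk);                 // blocks once the 8 KiB buffer is full
                    System.err.println("wrote chunk " + i);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, "progress-writer");
        writer.setDaemon(true);   // let the JVM exit once main has printed the stack
        writer.start();

        Thread.sleep(2_000);
        // After ~8 chunks the writer stops making progress; its stack shows it
        // waiting inside the blocking write, analogous to the futex wait seen in strace.
        System.out.println("writer state: " + writer.getState());
        for (StackTraceElement frame : writer.getStackTrace()) {
            System.out.println("  at " + frame);
        }
    }
}
```

After roughly eight chunks the writer stops logging, and its stack trace shows it stuck in the blocking write while the rest of the process sits idle.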
I think I have the answer.
In my setup, the terminal is an emulated terminal (a PTS rather than a TTY) run by the VS Code browser window. When I close the browser window, VS Code starts buffering the data sent to the PTS, to display it when the user opens a new browser window and reconnects to the session. I am not sure of the details, but I think that because of the nature of Nextflow's progress rendering, the PTS buffers fill up and any further write operations to them block. Which means the `renderProgress` call in Nextflow will block. However, the various functions in `AnsiLogObserver` are all synchronized under one lock, instead of using more granular synchronization on sources and sinks. So because `renderProgress` blocks while holding that lock, I believe the observer callbacks that record task events also block waiting for the same lock, which is what delays the task from being recorded as succeeded until the terminal is drained again.
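To illustrate the pattern I mean, here is a toy sketch (the class and method names, e.g. `CoarseLockObserver` and `onProcessComplete`, are made up to mirror the idea; this is not Nextflow's actual code). One method renders progress to a terminal stream and another records task completions, and both synchronize on the same instance lock. If the render call blocks on an output stream that has stopped accepting bytes, the completion handler cannot run until the write returns:

```java
import java.io.OutputStream;

// Toy model of a log observer whose methods all synchronize on the same
// instance lock (hypothetical names; not Nextflow's real classes).
class CoarseLockObserver {
    private final OutputStream term;
    private String progressLine = "";

    CoarseLockObserver(OutputStream term) { this.term = term; }

    // Called periodically by a render thread. If the terminal stops
    // accepting bytes, this write blocks while still holding the lock.
    synchronized void renderProgress() throws Exception {
        term.write((progressLine + "\r").getBytes());
        term.flush();
    }

    // Called when a job finishes. It only updates in-memory state,
    // yet it must wait for renderProgress to release the lock.
    synchronized void onProcessComplete(String name) {
        progressLine = "completed: " + name;
        System.err.println("recorded completion of " + name);
    }
}

public class CoarseLockDemo {
    public static void main(String[] args) throws Exception {
        // An OutputStream that blocks forever, standing in for a PTS whose
        // buffer is full because nothing is reading it.
        OutputStream stalledTerminal = new OutputStream() {
            @Override public void write(int b) throws java.io.IOException {
                try { Thread.sleep(Long.MAX_VALUE); }   // never returns
                catch (InterruptedException e) { throw new java.io.IOException(e); }
            }
        };

        CoarseLockObserver observer = new CoarseLockObserver(stalledTerminal);

        Thread render = new Thread(() -> {
            try { observer.renderProgress(); } catch (Exception ignored) {}
        }, "render");
        render.setDaemon(true);
        render.start();

        Thread.sleep(500);  // let the render thread grab the lock and stall

        // This call never prints: it is stuck waiting for the instance lock,
        // even though the terminal is irrelevant to recording the completion.
        observer.onProcessComplete("TABIX_BGZIPTABIX");
    }
}
```

Running this, the program simply hangs and the completion is never recorded, which matches what I saw: the task only got marked as succeeded once the terminal started being read again.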