Delay in recording a task as succeeded #5144

siddharthab · 2024-07-13T02:18:20Z

siddharthab
Jul 13, 2024

Thank you so much for such a mature product.

We have been using Nextflow recently on GCP and getting more familiar with it.

On a nf-core/cutandrun pipeline run that we did on a GCP Cloud Workstation, we noticed that there were many delays after the Google Cloud Batch job succeeded and before the task was registered as succeeded by Nextflow. For example, the below two consecutive lines from the log:

Jul-12 08:38:08.229 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `NFCORE_CUTANDRUN:CUTANDRUN:PREPARE_PEAKCALLING:UCSC_BEDCLIP (IgG_R2)` - terminated job=nf-c7227eb0-1720772308641; task=0; state=SUCCEEDED
Jul-12 09:06:57.188 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `NFCORE_CUTANDRUN:CUTANDRUN:PREPARE_PEAKCALLING:UCSC_BEDCLIP (IgG_R2)` - last event: description: "Task state is updated from RUNNING to SUCCEEDED on zones/us-central1-c/instances/6311147577739662809"

Events like this resulted in our pipeline taking ~13 hours. When I ran the same pipeline on a regular GCE VM instead of a Cloud Workstation, the same pipeline took 4 hours.

On a first glance, monitoring logs show that CPU and memory utilization was fairly low on the machine during this time and no one was using the machine at all.

What could be a reason this could be happening on our run? Just looking for hints so I can debug this further on my own.

Answered by siddharthab

Jul 18, 2024

I think I have the answer.

In my setup, the terminal is an emulated terminal (PTS instead of TTY) run by the VS Code browser window. When I close the browser window, VS Code starts buffering the data sent to the PTS, to display when the user opens a new browser window to reconnect to the session. I am not sure of the details, but I think because of the nature of Nextflow progress rendering, the PTS buffers become full and any further write operations to them will block. Which means, the renderProgress call in Nextflow will block. However, the various functions in AnsiLogObserver are all synchronized under one lock, instead of more granular synchronization on sources and sinks. So because r…

View full answer

pditommaso · 2024-07-13T14:32:32Z

pditommaso
Jul 13, 2024
Maintainer

It is hard to say without more in-depth details. How many jobs is running your pipeline? how many CPUs and memory has the instance running Nextflow? Is the pipeline uploading/downloading large data?

The full .nextlow.log may help troubleshooting the problem ?

0 replies

siddharthab · 2024-07-14T03:38:38Z

siddharthab
Jul 14, 2024
Author

Thank you for your help. Some information attached below. I don't have the resource monitoring graphs anymore but the controller machine was an n2-standard-16 with nothing else running on it. CPU, memory and network utilization were all low throughout the run.

[1] Log file.
nextflow.log

[2] Screenshot for GetTask method traffic during the job run.

[3] Zoomed version of GetTask method traffic for the time in the 2 log lines posted in the original post.

If it is not obvious, then please leave it. It may be that this machine was going to sleep or something similar. I am not quite sure how the autosleep behavior works in Google Cloud Workstations. I can bring this up again if it happens more regularly; I just wanted to check if someone knew something already.

0 replies

pditommaso · 2024-07-16T15:46:55Z

pditommaso
Jul 16, 2024
Maintainer

Any chance to upload also the for one execution not showing the problem?

0 replies

siddharthab · 2024-07-16T19:17:39Z

siddharthab
Jul 16, 2024
Author

Hi yes, thank you for looking into this.

Attached are logs from two runs, both run at the same time on the same VM (different directories, but same workflow and data). The only difference between them is that one was run with Fusion and one without.

with_fusion.log
without_fusion.log

Screenshot for GetTask method traffic during the job run:

1 reply

pditommaso Jul 17, 2024
Maintainer

That's weird, also because the first double of memory and CPUs. Please try settings export NXF_ENABLE_VIRTUAL_THREADS=false in the running environment

siddharthab · 2024-07-17T21:55:55Z

siddharthab
Jul 17, 2024
Author

Thanks for the suggestion. I tried your suggestion, and launched two jobs, one with and one without Java virtual threads. Same thing happened in both jobs. These problematic jobs are running from terminals inside OSS Code (open source version of VS Code). I have now verified that the VM is not sleeping but rather the jobs go to sleep and don't wake up unless the Terminal window gains focus again.

I think this problem is something specific to the terminal in which I am launching my jobs. I will investigate further and report here for posterity, but it looks like my problem to solve.

My observation:

% date              
Wed Jul 17 21:46:29 UTC 2024
% tail -n1 .nextflow.log
Jul-17 21:08:20.691 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `NFCORE_CUTANDRUN:CUTANDRUN:PREPARE_GENOME:TABIX_BGZIPTABIX (gencode.v45.annotation.bed)` - terminated job=nf-6e4f30a7-1721250381925; task=0; state=SUCCEEDED
% ps 10104              
    PID TTY      STAT   TIME COMMAND
  10104 pts/0    Sl+    1:59 [REDACTED]/jvm/bin/java ...
% sudo strace -p 10104
strace: Process 10104 attached
futex(0x7ee399d16910, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 10150, NULL, FUTEX_BITSET_MATCH_ANY

5 replies

pditommaso Jul 18, 2024
Maintainer

Are you launching nextflow in the background using the -bg option e.g. nextflow -bg run .. ?

siddharthab Jul 18, 2024
Author

I think I have the answer.

In my setup, the terminal is an emulated terminal (PTS instead of TTY) run by the VS Code browser window. When I close the browser window, VS Code starts buffering the data sent to the PTS, to display when the user opens a new browser window to reconnect to the session. I am not sure of the details, but I think because of the nature of Nextflow progress rendering, the PTS buffers become full and any further write operations to them will block. Which means, the renderProgress call in Nextflow will block. However, the various functions in AnsiLogObserver are all synchronized under one lock, instead of more granular synchronization on sources and sinks. So because renderProgress is blocked, functions appending to the log file are also blocked. And so the Task Monitor and Task Submitter threads get blocked when trying to write to the log file. Until I reconnect to the VS Code session, which will recreate a PTS and consume the full buffer, unblocking renderProgress, and in turn unblocking all other threads with log statements.

I suppose a fix here would be to improve the locking in AnsiLogObserver, but this is a rather esoteric case, so I am happy to change my usage pattern instead of expecting a fix. I suppose another fix could be that VS Code implement a fixed size FIFO buffer instead of blocking.

Thread dump attached.
threads.txt

For my record: VS Code behavior can be reproduced by a program writing a large amount of data to the PTS while the session is inactive (no open browser window). This can be done with a simple while true; do date; done. After a minute or so, the PTS buffer will be full and the date command will block. One can verify that the PID seen in ps aux | grep date will not change. A further test will be to verify echo "hello" >/proc/$pid/fd/1 will also block. The exact PTS can be seen by ls -l /proc/$pid/fd/1.

Answer selected by siddharthab

siddharthab Jul 18, 2024
Author

Also filed microsoft/vscode#222062.

pditommaso Jul 18, 2024
Maintainer

You should used -bg also for that, because it disables the use of ansi logger in favour of closing output.

siddharthab Jul 18, 2024
Author

Yes, that's good advice. I will change our usage to either run within screen or tux, or run with -bg. I can confirm that -bg works for us.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delay in recording a task as succeeded #5144

{{title}}

Replies: 5 comments 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Delay in recording a task as succeeded #5144

siddharthab Jul 13, 2024

Replies: 5 comments · 6 replies

pditommaso Jul 13, 2024 Maintainer

siddharthab Jul 14, 2024 Author

pditommaso Jul 16, 2024 Maintainer

siddharthab Jul 16, 2024 Author

pditommaso Jul 17, 2024 Maintainer

siddharthab Jul 17, 2024 Author

pditommaso Jul 18, 2024 Maintainer

siddharthab Jul 18, 2024 Author

siddharthab Jul 18, 2024 Author

pditommaso Jul 18, 2024 Maintainer

siddharthab Jul 18, 2024 Author

siddharthab
Jul 13, 2024

Replies: 5 comments 6 replies

pditommaso
Jul 13, 2024
Maintainer

siddharthab
Jul 14, 2024
Author

pditommaso
Jul 16, 2024
Maintainer

siddharthab
Jul 16, 2024
Author

pditommaso Jul 17, 2024
Maintainer

siddharthab
Jul 17, 2024
Author

pditommaso Jul 18, 2024
Maintainer

siddharthab Jul 18, 2024
Author

siddharthab Jul 18, 2024
Author

pditommaso Jul 18, 2024
Maintainer

siddharthab Jul 18, 2024
Author