Hang with exec when using bash script (maybe any forked process) #1243
Comments
Add EXIT to the trap call.
It doesn't help. Note that the script above has no way to exit without calling cleanup anyway, but adding EXIT does not change the behavior.
What is the purpose of such an implementation of bigbuckbunny.sh? What problem are you trying to solve?

```sh
#!/bin/sh
output=${1} # URL for publishing RTSP stream
echo "Starting ffmpeg..." >&2
ffmpeg -re -i /media/BigBuckBunny.mp4 -c copy -rtsp_transport tcp -f rtsp ${output} 2> /dev/null
```
The purpose of bigbuckbunny.sh is just to provide a simple way to demonstrate the problem that can be easily set up anywhere with no dependencies. In the actual use case for my project I don't run ffmpeg directly like this; rather, it is a small set of scripts which communicate with an upstream process that starts and stops the RTSP producer stream.

Also, with your example, ffmpeg is not explicitly killed by a signal, since go2rtc only signals the directly spawned parent process. I don't really think this is an issue, as I'm fine with managing my own child processes (some other projects set the pgid attribute and then send the signal to the group pid so that all processes receive it, but I actually think that is more incorrect). However, with your script, bash receives the signal and just exits, so ffmpeg doesn't receive a signal at all and depends on the stream being stopped to kill it off.

Even so, your script will still reproduce the issue if you start and stop the stream enough times. I used your script to create a new "stream3" which ran a script named "bigbuckbunny-simple.sh", an exact copy of your implementation. I then started and stopped VLC player about 20 times and eventually got the same problem:
Notice that on the final stop at 10:07:46.961 the kill signal was sent and the bash script exited as normal (I checked the process list with ps and double-checked for the pid in /proc; it was gone), but there was never a "[streams] stop producer" line, because c.cmd.Wait() never returns. I know it is c.cmd.Wait() because I have a custom version that wraps it in a context with a timeout, and on the runs where it hangs, the context timeout kicks in and returns anyway, so the problem never occurs, although I know this isn't the correct fix.

Overall, I'm confident the issue could be fixed by calling cmd.Wait() only once; I just don't (yet) have the skills to code it correctly. As far as I can see, calling cmd.Wait() twice on the same process should return an error to the second caller, but here it doesn't, which I think is a race condition: the second wait is issued immediately after the kill signal is sent, while the first is still blocked. If the timing is lucky it works; if it's unlucky, the second Wait() never returns.
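For reference, the timeout workaround described above looks roughly like the sketch below (illustrative only, assuming Go's os/exec and context packages; this is not the actual patched code). It only prevents the caller from blocking forever; it does not address the underlying double Wait().

```go
package example

import (
	"context"
	"os/exec"
)

// waitWithTimeout is a hypothetical sketch of the diagnostic workaround
// described above: run cmd.Wait() in a goroutine and stop waiting for it
// if the context expires. It masks the symptom rather than fixing the
// double-Wait() problem.
func waitWithTimeout(ctx context.Context, cmd *exec.Cmd) error {
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err // Wait() returned normally
	case <-ctx.Done():
		// Give up; the goroutine above keeps waiting in the background.
		return ctx.Err()
	}
}
```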
Submitted PR #1246 for this. The PR simply passes the existing channel from the first wait to the closer so that it can wait on that instead of calling Wait() a second time. For the pipe case there is no initial wait, so the code behaves the same as before. I'm still running a gamut of tests, but so far, after 100+ cycles, I've not been able to produce a hang in any case, while previously it would take on average 10-12 cycles for the hang to happen and, with some test cases, it could happen in just 2-3 attempts.
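As a rough sketch of the pattern the PR describes, with hypothetical names rather than go2rtc's actual identifiers: Wait() is called exactly once, and the closer blocks on that call's result channel instead of calling Wait() again.

```go
package example

import (
	"os/exec"
	"syscall"
)

// startWithSingleWait starts cmd, calls Wait() exactly once in a background
// goroutine, and hands the resulting channel to the stop function so the
// stop path never calls Wait() a second time.
func startWithSingleWait(cmd *exec.Cmd) (func() error, error) {
	if err := cmd.Start(); err != nil {
		return nil, err
	}

	waitDone := make(chan error, 1)
	go func() {
		waitDone <- cmd.Wait() // the only Wait() on this process
	}()

	stop := func() error {
		_ = cmd.Process.Signal(syscall.SIGTERM)
		return <-waitDone // reuse the first (and only) Wait()'s result
	}
	return stop, nil
}
```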
Hi @AlexxIT,
I'm back with an attempt to provide a more complete description of this issue, as well as a simple configuration to reproduce it. I'm still not 100% sure of the cause or why it is more pronounced when using bash, but I'm starting to believe it could happen with any command, given enough cycles. It's definitely more pronounced with anything that spawns extra processes, because the default behavior is to use SIGKILL, which will leave child processes running since the signal is only sent to the parent.
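For illustration, here is a minimal Go sketch of the two signalling strategies being contrasted, assuming Linux and Go's os/exec package; the helper names are hypothetical and this is not go2rtc's actual code (go2rtc signals only the parent, while the process-group variant is the approach some other projects take, as noted elsewhere in this thread).

```go
package example

import (
	"os/exec"
	"syscall"
)

// startInOwnGroup starts the command in its own process group (Unix-only),
// so that a later group signal can reach children such as a backgrounded ffmpeg.
func startInOwnGroup(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	return cmd, cmd.Start()
}

// stopParentOnly mirrors the default behavior described above: only the
// directly spawned process receives the signal, so grandchildren keep running.
func stopParentOnly(cmd *exec.Cmd) error {
	return cmd.Process.Signal(syscall.SIGKILL)
}

// stopWholeGroup signals the entire process group (negative PID), so children
// created by the script receive the signal too.
func stopWholeGroup(cmd *exec.Cmd) error {
	return syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM)
}
```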
However, even when the primary exec process properly handles signals and kills children prior to exiting, the problem still occurs. The simplest way to reproduce the issue is with the following go2rtc config:
Obviously stream1 just runs ffmpeg directly, but stream2 instead uses the following simple bash script:
Technically both stream1 and stream2 should behave the same, and mostly they do. The only difference is that stream2 uses a bash script to spawn ffmpeg in the background, then traps the TERM signal and kills ffmpeg itself prior to exiting. You'll still see in the debug output that ffmpeg gets the kill signal and exits, as does the script, but about 1 time out of 5 the cmd.Wait() call will not return, and a new connection will then not cause the script to be run. Here's what a successful exit looks like:
And here is a failed one:
Notice that there's never a "disconnect stream" message in the second one, and this is because c.cmd.Wait() never returns. I think there's another subtle hint here: in the first example, the last log messages from ffmpeg both come through go2rtc's debug output, but in the second one, the second line of output from ffmpeg goes straight to stdout. I think this indicates that Go has already closed stdout for this process.
I'm not 100% sure, but I think this is caused by the fact that there is already a wait() on this process. The Go implementation of process.Wait() on Linux appears to use the waitid syscall, which is not really designed to have multiple waiters on the same process, and multiple waiters also seem to be considered bad practice in Go in general. I think this somehow causes a very subtle race condition. The strace output between a working and a non-working case is also subtly different; I can post it if anyone is interested.
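In Go terms, the hazardous shape being described is roughly the following (a simplified sketch, not go2rtc's actual code): one Wait() started when the producer launches, and a second Wait() issued from the stop path while the first is still blocked.

```go
package example

import "os/exec"

// doubleWait shows the problematic pattern in simplified form: two callers
// waiting on the same process. Depending on timing, the second Wait() may
// return "exec: Wait was already called", or, as described above, it may
// block forever because the exit status was consumed by the first waiter.
func doubleWait(cmd *exec.Cmd) error {
	if err := cmd.Start(); err != nil {
		return err
	}

	go func() { _ = cmd.Wait() }() // waiter #1, started at launch time

	// ... later, on stop:
	_ = cmd.Process.Kill()
	return cmd.Wait() // waiter #2: races with waiter #1
}
```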
Now, I'm not sure what the correct fix is here. It seems a little complex, because there is no second wait for the pipe case, only for the pure exec case. I thought about having completely separate functions for pipe vs exec so that each could use a single wait, but I'm quite weak with golang, and a few attempts at asking AI didn't quite get me there. Since there's a simple reproducer, though, I thought I would post here, as you might have a better understanding, and it's certainly possible my analysis is just wrong.
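For comparison, the pipe case can keep a single-Wait() shape naturally, because os/exec requires all reads from StdoutPipe to complete before Wait() is called. A minimal sketch with hypothetical names:

```go
package example

import (
	"io"
	"os/exec"
)

// runPipe is an illustrative sketch of the pipe path: output is consumed
// from stdout, and Wait() is called exactly once, only after the reader
// has drained the pipe (as the os/exec documentation requires).
func runPipe(cmd *exec.Cmd, consume func(io.Reader) error) error {
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	readErr := consume(stdout) // blocks until the process closes stdout
	waitErr := cmd.Wait()      // the single Wait() for this path
	if readErr != nil {
		return readErr
	}
	return waitErr
}
```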