[Filebeat] Journald causes Filebeat to crash #34077
Comments
We at Siemens are also experiencing exactly this error and have tracked it to the journald file rotation. In our logs, we can correlate the two events:
The Filebeat crash does not always happen when journald rotation is triggered; rotation is not a sufficient condition, but it is a necessary one. On busy hosts where journald rotates faster, the correlation is almost 100%. I have also found this related issue in systemd: systemd/systemd#24320. So it's unclear to me at the moment whether this is Filebeat not properly handling the expected […]
Bumping the issue in the hope of getting a reaction from the maintainers, or a quick assessment; this is literally crashing dozens of times per day on busy hosts 🙇 I did not find a policy for tagging / mentioning; could you perhaps help with an assessment, @ph? Thanks in advance for any input!
I'm seeing the same issue on Debian 12 (not technically supported yet) and Filebeat 8.10.1.
Checking upstream: in theory systemd/systemd#29456 solves this, but I don't even see it added to systemd v255-rc2, so it will probably be a while until we can verify this, or know whether they'll backport it to previous releases. It'd be great if somebody running a cutting-edge setup could confirm 😇
I've been trying to reproduce this issue today and I can't get it to happen. Following the linked issues, I ended up using systemd/systemd#24320 (comment) to try reproducing it. I've tried two distros so far:
I'll look more into it, probably trying with Fedora as well.
@belimawr We get the crashes regularly (verified in our logs again just now) on Fedora 38 & Amazon Linux 2023 hosts. We only see it on busy hosts, which makes sense as this is some kind of race condition.
Thanks for the quick reply @dlouzan ! I'll try those distros and see if I can reproduce it. Do you have any idea of the throughput of messages in the journald logs? Currently I'm working with about 20k ~ 30k events per minute on the systems I mentioned before. Which version of Filebeat are you currently using? |
@belimawr Both kinds of hosts are using the latest stable dnf packages. For reference, a periodic metrics snapshot from one of them:
{
"message": "Non-zero metrics in the last 30s",
"service.name": "filebeat",
"monitoring": {
"metrics": {
"beat": {
"cgroup": {
"memory": {
"mem": {
"usage": {
"bytes": 91836416
}
}
}
},
...
"handles": {
"limit": {
"hard": 65535,
"soft": 65535
},
"open": 116
},
"info": {
"ephemeral_id": "1fe23dc6-4d58-47c3-8c77-fdfdaaa8c143",
"uptime": {
"ms": 6810097
},
"version": "8.12.2"
},
...
},
"filebeat": {
"events": {
"active": 775,
"added": 10007,
"done": 9601
},
"harvester": {
"open_files": 11,
"running": 11
}
},
"libbeat": {
...
"output": {
"events": {
"acked": 9600,
"active": 0,
"batches": 6,
"total": 9600
},
"read": {
"bytes": 440
},
"write": {
"bytes": 1841900
}
},
"pipeline": {
"clients": 17,
"events": {
"active": 775,
"filtered": 1,
"published": 10006,
"total": 10007
},
"queue": {
"acked": 9600
}
}
},
"registrar": {
"states": {
"current": 20,
"update": 8524
},
"writes": {
"success": 6,
"total": 6
}
},
"system": {
"load": {
"1": 5.98,
"15": 5.45,
"5": 5.33,
"norm": {
"1": 0.3738,
"15": 0.3406,
"5": 0.3331
}
}
}
},
"ecs.version": "1.6.0"
}
}
Thanks!
I can confirm I can reproduce the crash, and the time I noticed it happening was when journald was rotating its logs, interestingly enough. Versions (IP redacted):
How to reproduce
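The exact steps aren't shown here, but the general approach discussed in this thread is to flood journald so it rotates its files quickly while Filebeat follows them. A minimal Go sketch of that idea (writing to the local syslog socket, the `journald-flood` tag, and the ~1 KiB payload are assumptions for illustration, not the author's actual method):

```go
// Flood journald with messages so it rotates its journal files quickly.
// On a systemd host, journald owns /dev/log, so log/syslog messages end up
// in the journal.
package main

import (
	"fmt"
	"log"
	"log/syslog"
	"strings"
)

func main() {
	w, err := syslog.New(syslog.LOG_INFO, "journald-flood")
	if err != nil {
		log.Fatalf("cannot connect to the local syslog socket: %v", err)
	}
	defer w.Close()

	payload := strings.Repeat("x", 1024) // ~1 KiB per message to fill journal files faster
	for i := 0; ; i++ {
		if err := w.Info(fmt.Sprintf("flood message %d %s", i, payload)); err != nil {
			log.Fatalf("write failed: %v", err)
		}
	}
}
```

Combined with a small `SystemMaxFileSize` in `/etc/systemd/journald.conf`, this should force a rotation every few seconds, which is the window in which the crash was observed.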
I've also tried to reproduce it in a VM running Arch Linux and the crash does not happen; that setup uses a newer version of Journald/Systemd (255):
It really looks like the crash is not caused by Filebeat, but by Journald/go-systemd.
@belimawr Perhaps the efforts should go into supporting the backport of the supposed fix into systemd v252, which is the stable version in multiple distributions: systemd/systemd-stable#356 🙇 |
I was looking at this issue again in a more structured way, and this time I can confirm that the crash is not related to Filebeat: even a Filebeat build with a newer version of Journald still experiences the same crash. I also experienced the crashes described by:
They happen intermittently with the SIGBUS error, all while flooding Journald with logs and thus forcing quick log rotation.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
I've tried backporting systemd/systemd-stable#356 to v252 […]. Anyway, here is my attempt: https://github.com/belimawr/systemd-stable/tree/v252-stable
The AmazonLinux issue about it also seems pretty stale. I added a comment here but I don't have high hopes. amazonlinux/amazon-linux-2023#608 (comment) |
I did some investigation trying to recover from the panic caused by systemd and, unfortunately, it's not possible to recover from it :/ When a SIGBUS is sent due to an error in program execution, the Go runtime converts it into a run-time panic that we cannot recover from in our code. From the Go docs:
@belimawr I am fine closing it with a "won't fix" status then. |
Yes, we need something like this; otherwise this error will lead to support cases for us. I mean, it effectively already is one: it's in the issue tracker, and it isn't GA yet. Can we detect the systemd version at runtime and refuse to run with a detailed error if it is a version with this bug? That seems preferable to letting us be killed by SIGBUS.
I'll look into that. Worst-case scenario, we can […]
PR adding validation of the systemd version to prevent Filebeat from crashing: #39605
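For illustration only, and not the code from #39605: a minimal sketch of the kind of runtime guard discussed above, assuming the version is read from `systemctl --version` (whose first line looks like `systemd 252 (252.23-…)`) and that anything below 255 is treated as affected; both the command and the threshold are assumptions here.

```go
// Sketch of a runtime guard: refuse to start the journald input on a
// systemd version assumed to be affected by the rotation bug.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// systemdVersion parses the major version from `systemctl --version`,
// whose first line looks like "systemd 252 (252.23-1)".
func systemdVersion() (int, error) {
	out, err := exec.Command("systemctl", "--version").Output()
	if err != nil {
		return 0, fmt.Errorf("running systemctl --version: %w", err)
	}
	fields := strings.Fields(strings.SplitN(string(out), "\n", 2)[0])
	if len(fields) < 2 || fields[0] != "systemd" {
		return 0, fmt.Errorf("unexpected systemctl output: %q", out)
	}
	return strconv.Atoi(fields[1])
}

func main() {
	v, err := systemdVersion()
	if err != nil {
		fmt.Println("cannot determine systemd version:", err)
		return
	}
	// Hypothetical affected range; a real check should follow the upstream fix.
	if v < 255 {
		fmt.Printf("refusing to start the journald input: systemd %d is affected "+
			"by the journal rotation bug (systemd/systemd#24320)\n", v)
		return
	}
	fmt.Printf("systemd %d looks fine, starting the journald input\n", v)
}
```

In a real input the check would run before opening the journal and surface a clear configuration error rather than printing to stdout.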
I performed 3 tests reproducing this crash to better understand whether there is data loss due to the crash itself. Out of the 3 runs, 2 ingested all data and 1 duplicated some events (about 3.2% duplication).
#### Description

According to the community, there are bugs in systemd that could corrupt the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen some issues reported to the elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, the otelcol is not immune from these issues. When the journalctl process exits for any reason, log consumption from journald just stops. We've experienced this on some machines that have a high log volume. Currently we monitor the journalctl processes started by otelcol and restart otelcol when one of them is missing. IMO, the journald receiver itself should monitor the journalctl process it starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits unexpectedly. As long as the journalctl cmd can be started (via `Cmd.Start()`) successfully, the journald_input will always try to restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR, `operator.Start` waits up to 1 sec to capture any immediate error returned from journalctl. After the PR, the error won't be reported back even if journalctl exits immediately after start; instead, the error will be logged and the process will be restarted.

The fix is largely inspired by elastic/beats#40558.

#### Testing

Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:

```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:

```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:

```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe otelcol's behaviour. The journalctl process will be restarted after the backoff period (hardcoded to 2 sec):

```
2024-10-06T14:32:33.755Z  info  LogsExporter  {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z  error  journald/input.go:98  journalctl command exited  {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
    github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z  debug  journald/input.go:94  Starting the journalctl command  {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z  info  LogsExporter  {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```

---------

Signed-off-by: Mengnan Gong <[email protected]>
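The restart behaviour described in that PR can be sketched roughly as follows. This is a simplified illustration rather than the receiver's actual code; the `journalctl -f -o json` invocation is an assumption, while the fixed 2-second backoff comes from the description above.

```go
// Simplified supervision loop: keep a journalctl process alive and hand its
// stdout lines to a consumer, restarting it with a fixed backoff when it exits.
package main

import (
	"bufio"
	"context"
	"log"
	"os/exec"
	"time"
)

const restartBackoff = 2 * time.Second // the PR describes a hardcoded 2s backoff

func runJournalctl(ctx context.Context, consume func(line string)) {
	for ctx.Err() == nil {
		cmd := exec.CommandContext(ctx, "journalctl", "-f", "-o", "json")
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			log.Printf("cannot create stdout pipe: %v", err)
			return
		}
		if err := cmd.Start(); err != nil {
			log.Printf("cannot start journalctl: %v", err)
			return // mirroring the PR: only a failed Start() is treated as fatal
		}
		// Note: real code would raise the Scanner buffer limit for large entries.
		scanner := bufio.NewScanner(stdout)
		for scanner.Scan() {
			consume(scanner.Text())
		}
		// journalctl exited (crash, SIGBUS, manual kill, ...): log it and
		// restart after the backoff instead of giving up.
		err = cmd.Wait()
		log.Printf("journalctl exited: %v; restarting in %s", err, restartBackoff)
		select {
		case <-ctx.Done():
			return
		case <-time.After(restartBackoff):
		}
	}
}

func main() {
	runJournalctl(context.Background(), func(line string) { log.Println(line) })
}
```

The key design point matches the PR: only a failed `Cmd.Start()` aborts the input; any later exit is logged and retried after the backoff.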
Crash logs: filebeat.log