[Filebeat] Journald causes Filebeat to crash #34077

Closed
Tracked by #37086
Aqualie opened this issue Dec 20, 2022 · 23 comments · Fixed by #40061
Labels: bug, Team:Elastic-Agent, Team:Elastic-Agent-Data-Plane

Comments

@Aqualie

Aqualie commented Dec 20, 2022

  • Version: 8.5.3
  • Operating System: 6.0.12-arch1-1
  • Steps to Reproduce: Enable the IPTables (input: journald) plugin in the Elastic Agent, wait a few minutes, and it crashes
    Crash logs: filebeat.log
@botelastic botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on Dec 20, 2022
@dlouzan

dlouzan commented Jan 24, 2023

We at Siemens are also experiencing exactly this error and have tracked it down to journald file rotation. In our logs, we can correlate the two events:

$ cat /etc/fedora-release
Fedora release 36 (Thirty Six)

$ dnf list --installed filebeat
Installed Packages
filebeat.x86_64                                         8.5.1-1                                          @@commandline

$ journalctl -u filebeat -u systemd-journald

...

systemd-journald[538]: Field hash table of /var/log/journal/ec2a83e928b0ba8fda4772c954d0d37d/system.journal has a fill level at 78.4 (261 of 333 items), suggesting rotation.
systemd-journald[538]: /var/log/journal/ec2a83e928b0ba8fda4772c954d0d37d/system.journal: Journal header limits reached or header out-of-date, rotating.

...

filebeat[1323847]: fatal error: unexpected signal during runtime execution
filebeat[1323847]: [signal SIGBUS: bus error code=0x2 addr=0x79572ec00098 pc=0x795778ce39c4]
filebeat[1323847]: runtime stack:
filebeat[1323847]: runtime.throw({0x61cde86dd0cb?, 0x760d102f0c38?})
filebeat[1323847]:         /usr/local/go/src/runtime/panic.go:992 +0x71
filebeat[1323847]: runtime.sigpanic()
filebeat[1323847]:         /usr/local/go/src/runtime/signal_unix.go:802 +0x225
filebeat[1323847]: goroutine 30 [syscall]:
filebeat[1323847]: runtime.cgocall(0x61cde8683bf0, 0xc000cc1058)
filebeat[1323847]:         /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc000cc1010 sp=0xc000cc0fd8 pc=0x61cde5f6045c
filebeat[1323847]: github.com/coreos/go-systemd/v22/sdjournal._Cfunc_my_sd_journal_next(0x760d102f1900, 0x760cec005530)
filebeat[1323847]:         _cgo_gotypes.go:351 +0x50 fp=0xc000cc1058 sp=0xc000cc1010 pc=0x61cde8660e30
filebeat[1323847]: github.com/coreos/go-systemd/v22/sdjournal.(*Journal).Next.func1(0x760d102f1900, 0xc000cb0080)
filebeat[1323847]:         /go/pkg/mod/github.com/coreos/go-systemd/[email protected]/sdjournal/journal.go:579 +0xa8 fp=0xc000cc10c0 sp=0xc000cc1058 pc=0x61cde86646e8
filebeat[1323847]: github.com/coreos/go-systemd/v22/sdjournal.(*Journal).Next(0xc000cb0080)
filebeat[1323847]:         /go/pkg/mod/github.com/coreos/go-systemd/[email protected]/sdjournal/journal.go:579 +0x10b fp=0xc000cc11b8 sp=0xc000cc10c0 pc=0x61cde86644ab
filebeat[1323847]: github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalread.(*Reader).Next(0xc000be63c0, {0x760d12f73708, 0xc0004467c0})

...

The Filebeat crash does not always happen when journald rotation is triggered; rotation is not a sufficient condition, but it is a necessary one. On busy hosts where journald rotates faster, the correlation is almost 100%.

I have also found this related issue in systemd: systemd/systemd#24320. So it's unclear to me at the moment whether this is Filebeat not properly handling the expected SIGBUS signal that can be delivered by the go-systemd library, the library itself misbehaving, or a fundamental issue with how systemd works.

@dlouzan

dlouzan commented Feb 10, 2023

Bumping the issue in the hope of a reaction from the maintainers, or at least a quick assessment; this is literally crashing dozens of times per day on busy hosts 🙇

I did not find a policy for tagging / mentioning; could you perhaps help with an assessment, @ph?

Thanks in advance for any input!

@utilitynerd

I'm seeing the same issue on Debian 12 (not technically supported yet) and Filebeat 8.10.1.

@dlouzan

dlouzan commented Nov 17, 2023

Checking upstream, in theory systemd/systemd#29456 solves this, but I do not even see it included in systemd v255-rc2, so it will probably be a while before we can verify it or find out whether they'll backport it to previous releases.

It'd be great if somebody who runs a cutting-edge setup could confirm 😇

@belimawr belimawr self-assigned this Mar 25, 2024
@belimawr
Contributor

I've been trying to reproduce this issue today and I can't get it to happen.

Following the linked issues, I ended up using systemd/systemd#24320 (comment) to try reproducing it.

I've tried two distros so far:

  • Arch Linux, journald/systemd 255 (255.4-2-arch)
  • Ubuntu 20.04, journald/systemd 245 (245.4-4ubuntu3.23)

I'll look more into it, probably trying with Fedora as well.

@dlouzan

dlouzan commented Mar 25, 2024

@belimawr We get the crashes regularly (I verified our logs again just now) on Fedora 38 & Amazon Linux 2023 hosts. We only see it on busy hosts, which makes sense, as this appears to be some kind of race condition.

@belimawr
Contributor

belimawr commented Mar 25, 2024

Thanks for the quick reply @dlouzan ! I'll try those distros and see if I can reproduce it.

Do you have any idea of the throughput of messages in the journald logs? Currently I'm working with about 20k ~ 30k events per minute on the systems I mentioned before.

Which version of Filebeat are you currently using?

@dlouzan

dlouzan commented Mar 25, 2024

@belimawr Both kinds of hosts are using the latest stable dnf packages (8.12.2-1); at the time of the last crash that I can see, the node was delivering ~10k events from 11 log files every 30s. There is not much of a difference in the delivered stats after the restart following the crash:

{
    "message": "Non-zero metrics in the last 30s",
    "service.name": "filebeat",
    "monitoring": {
        "metrics": {
            "beat": {
                "cgroup": {
                    "memory": {
                        "mem": {
                            "usage": {
                                "bytes": 91836416
                            }
                        }
                    }
                },
...
                "handles": {
                    "limit": {
                        "hard": 65535,
                        "soft": 65535
                    },
                    "open": 116
                },
                "info": {
                    "ephemeral_id": "1fe23dc6-4d58-47c3-8c77-fdfdaaa8c143",
                    "uptime": {
                        "ms": 6810097
                    },
                    "version": "8.12.2"
                },
...
            },
            "filebeat": {
                "events": {
                    "active": 775,
                    "added": 10007,
                    "done": 9601
                },
                "harvester": {
                    "open_files": 11,
                    "running": 11
                }
            },
            "libbeat": {
...
                "output": {
                    "events": {
                        "acked": 9600,
                        "active": 0,
                        "batches": 6,
                        "total": 9600
                    },
                    "read": {
                        "bytes": 440
                    },
                    "write": {
                        "bytes": 1841900
                    }
                },
                "pipeline": {
                    "clients": 17,
                    "events": {
                        "active": 775,
                        "filtered": 1,
                        "published": 10006,
                        "total": 10007
                    },
                    "queue": {
                        "acked": 9600
                    }
                }
            },
            "registrar": {
                "states": {
                    "current": 20,
                    "update": 8524
                },
                "writes": {
                    "success": 6,
                    "total": 6
                }
            },
            "system": {
                "load": {
                    "1": 5.98,
                    "15": 5.45,
                    "5": 5.33,
                    "norm": {
                        "1": 0.3738,
                        "15": 0.3406,
                        "5": 0.3331
                    }
                }
            }
        },
        "ecs.version": "1.6.0"
    }
}

@belimawr
Contributor

Thanks!

@belimawr
Contributor

I can confirm that I can reproduce the crash, and the times I noticed it happening were when journald was rotating its logs.

Interestingly enough, journalctl --rotate does not trigger it. There might be something more than just a rotation playing a role here.

Versions (IP redacted):

[root@ip-redacted ec2-user]# filebeat version
filebeat version 8.12.2 (amd64), libbeat 8.12.2 [0b71acf2d6b4cb6617bff980ed6caf0477905efa built 2024-02-15 13:39:15 +0000 UTC]
[root@ip-redacted ec2-user]# uname -a
Linux ip-42.42.42.42.eu-central-1.compute.internal 6.1.79-99.167.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 12 18:15:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-redacted ec2-user]# journalctl --version
systemd 252 (252.16-1.amzn2023.0.2)
+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 -BZIP2 -LZ4 +XZ -ZLIB -ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified
[root@ip-redacted ec2-user]#

How to reproduce

  1. Get an Amazon 2023 Linux host

  2. Install Filebeat using yum as described in our documentation

  3. Use the following configuration (/etc/filebeat/filebeat.yml)


    filebeat.inputs:
    - type: journald
      id: journald-input
    
    output.elasticsearch:
      hosts:
        - https://my-awesome-deployment.elastic-cloud.com:443
      username: elastic
      password: a-very-secret-password

  4. Start the service: systemctl enable --now filebeat

  5. Start watching the Filebeat logs: journalctl -fu filebeat

  6. Generate a high load of logs sent to journald so rotation is triggered often. The following script is from systemd/systemd#24320 (comment) ("Journal-signal SIGBUS error"). I ran it as root.

    crash.py

    #!/usr/bin/env python3

    # Flood the local syslog socket (and therefore journald) with messages so
    # that journal rotation is triggered as often as possible.
    import logging
    import logging.handlers
    import random
    import string

    my_logger = logging.getLogger('MyLogger')
    my_logger.setLevel(logging.DEBUG)

    # /dev/log is the syslog socket that journald listens on.
    handler = logging.handlers.SysLogHandler(address='/dev/log')

    my_logger.addHandler(handler)

    animals = ["cat", "dog", "crab"]
    padding = ''.join(random.choices(string.ascii_lowercase, k=1024))

    # No sleep on purpose: log as fast as possible to force frequent rotation.
    while True:
        for animal in animals:
            my_logger.critical(f"Rubbish - {animal} food is running short. Rubbish: " + padding)

  7. Wait until the crash happens. In my tests it seemed to happen only after the maximum number of rotated files was reached. Once that point was reached, every rotation triggered the crash.

@belimawr
Contributor

I've also tried to reproduce it in a VM running Arch Linux, and the crash does not occur there; that distro uses a newer version of journald/systemd (255):

systemd 255 (255.4-2-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

It really looks like the crash is not caused by Filebeat, but by Journald/go-systemd.

@dlouzan

dlouzan commented Mar 28, 2024

@belimawr Perhaps the efforts should go into supporting the backport of the supposed fix into systemd v252, which is the stable version in multiple distributions: systemd/systemd-stable#356 🙇

@belimawr
Contributor

belimawr commented May 1, 2024

I was looking at this issue again in a more structured way, and this time I can confirm that the crash is not related to Filebeat; even a Filebeat build with a newer version of journald still experiences the same crash.

I also experienced the crashes described by:

They happen intermittently with the SIGBUS error, all while flooding journald with logs and thus forcing quick log rotation.

@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr
Contributor

belimawr commented May 6, 2024

I've tried backporting systemd/systemd-stable#356 to v252, but the patch from the original PR alone is not enough because other parts of the code have diverged. I tried to fix some obvious issues, but my C is pretty rusty and my knowledge of the systemd codebase is zero :/

Anyway, here is my attempt: https://github.com/belimawr/systemd-stable/tree/v252-stable

@belimawr
Contributor

belimawr commented May 6, 2024

The Amazon Linux issue about it also seems pretty stale. I added a comment there, but I don't have high hopes: amazonlinux/amazon-linux-2023#608 (comment)

@belimawr
Contributor

I did some investigation into trying to recover from the panic caused by systemd and, unfortunately, it's not possible to recover from it :/

When a SIGBUS is delivered due to an error in program execution, the Go runtime converts it into a run-time panic that we cannot recover from in our code.

From the Go docs:

Synchronous signals are signals triggered by errors in program execution: SIGBUS, SIGFPE, and SIGSEGV. These are only considered synchronous when caused by program execution, not when sent using os.Process.Kill or the kill program or some similar mechanism. In general, except as discussed below, Go programs will convert a synchronous signal into a run-time panic.
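
For illustration, here is a minimal sketch (hypothetical code, not Filebeat's actual implementation) of why a deferred recover() around the journal read cannot help here: the SIGBUS is raised inside the cgo call into libsystemd, and the runtime aborts with the fatal error shown in the stack trace above before any deferred function runs.

```go
// Minimal sketch (hypothetical code, not Filebeat's implementation) showing
// that a deferred recover() around the journal read cannot catch this crash:
// the SIGBUS is raised while executing C code inside sd_journal_next, and the
// runtime aborts with "fatal error: unexpected signal during runtime
// execution" (as in the stack trace above) before any deferred function runs.
package main

import (
	"fmt"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

func readNext(j *sdjournal.Journal) (n uint64, err error) {
	defer func() {
		// Never reached for a SIGBUS raised inside the cgo call below.
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from: %v", r)
		}
	}()
	return j.Next() // cgo call into libsystemd; a fault here is fatal
}

func main() {
	j, err := sdjournal.NewJournal()
	if err != nil {
		fmt.Println("opening journal:", err)
		return
	}
	defer j.Close()

	if _, err := readNext(j); err != nil {
		fmt.Println("reading journal:", err)
	}
}
```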

@belimawr
Contributor

@cmacknz, @jlind23 I think we should close this task, as there isn't anything else we can do on our side, at least at the code level.

We could add something about compatible versions of systemd to our documentation, but aside from that I can't see anything else we could do at the moment.

@jlind23
Collaborator

jlind23 commented May 13, 2024

@belimawr I am fine closing it with a "won't fix" status then.

@cmacknz
Member

cmacknz commented May 13, 2024

We could add something about compatible versions of systemd to our documentation, but aside from that I can't see anything else we could do at the moment.

Yes, we need something like this; otherwise this error will lead to support cases for us. It effectively already does in the issue tracker, and the journald input isn't GA yet.

Can we detect the systemd version at runtime and refuse to run with a detailed error if it is the version with this bug? That seems preferable to letting us be killed by SIGBUS.

@belimawr
Contributor

Can we detect the systemd version at runtime and refuse to run with a detailed error if it is the version with this bug? That seems preferable to letting us be killed by SIGBUS.

I'll look into that. Worst case, we can run exec.Command("journalctl", "--version") and parse the output.
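
For illustration, a minimal sketch of that fallback (hypothetical helper names, not the check that was eventually implemented), assuming the first line of `journalctl --version` looks like `systemd 252 (252.16-1.amzn2023.0.2)`:

```go
// Minimal sketch (hypothetical helper, not the check that was implemented) of
// the exec.Command fallback: run `journalctl --version` and parse the major
// version from the first output line, e.g. "systemd 252 (252.16-1.amzn2023.0.2)".
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

func systemdVersion() (int, error) {
	out, err := exec.Command("journalctl", "--version").Output()
	if err != nil {
		return 0, fmt.Errorf("running journalctl --version: %w", err)
	}
	firstLine := strings.SplitN(string(out), "\n", 2)[0]
	fields := strings.Fields(firstLine) // ["systemd", "252", "(252.16-1.amzn2023.0.2)"]
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected journalctl --version output: %q", firstLine)
	}
	return strconv.Atoi(fields[1])
}

func main() {
	v, err := systemdVersion()
	if err != nil {
		fmt.Println("could not detect systemd version:", err)
		return
	}
	// 255 is only an illustrative threshold (the crash did not reproduce on
	// systemd 255 in the tests above); the real validation lives in the PR below.
	if v < 255 {
		fmt.Printf("systemd %d may crash the journald input, refusing to start it\n", v)
		return
	}
	fmt.Printf("systemd %d detected, starting the journald input\n", v)
}
```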

@belimawr
Contributor

PR adding validation of the systemd version to prevent Filebeat from crashing: #39605

@belimawr
Contributor

belimawr commented Jun 4, 2024

I performed 3 tests reproducing this crash to better understand whether there is data loss due to the crash itself. Out of 3 runs, 2 ingested all data and 1 duplicated some events (about 3.2% duplication).

djaglowski pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Oct 28, 2024
#### Description
According to the community, there are bugs in systemd that could corrupt
the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen some issues reported to Elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, the otelcol is not immune to these issues. When the
journalctl process exits for any reason, log consumption from journald
just stops. We've experienced this on some machines that have a high log
volume. Currently we monitor the journalctl processes started by otelcol
and restart the otelcol when some of them are missing. IMO, the journald
receiver itself should monitor the journalctl process it starts and do
its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits
unexpectedly. As long as the journalctl cmd can be started (via
`Cmd.Start()`) successfully, the journald_input will always try to
restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR,
`operator.Start` waits up to 1 sec to capture any immediate error
returned from journalctl. After the PR, the error won't be reported back
even if journalctl exits immediately after start; instead, the error
will be logged and the process will be restarted.
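
For illustration, a minimal sketch of the restart loop described above (hypothetical code, not the actual journald receiver implementation):

```go
// Minimal sketch (hypothetical code, not the actual journald receiver) of the
// restart behaviour described above: keep journalctl running while the input
// is active, log unexpected exits, and restart after a short backoff instead
// of propagating the error.
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func runJournalctl(ctx context.Context) {
	const backoff = 2 * time.Second // the PR hardcodes a 2s backoff

	for ctx.Err() == nil {
		// A real receiver would consume cmd.StdoutPipe(); output is discarded here.
		cmd := exec.CommandContext(ctx, "journalctl", "--follow", "--output=json")
		if err := cmd.Start(); err != nil {
			log.Printf("starting journalctl: %v", err)
		} else if err := cmd.Wait(); err != nil && ctx.Err() == nil {
			// journalctl exited unexpectedly: log and restart after the backoff.
			log.Printf("journalctl command exited: %v", err)
		}

		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	runJournalctl(ctx)
}
```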

The fix is largely inspired by
elastic/beats#40558.

#### Testing
Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour. The journalctl process will be restarted after the backoff period (hardcoded to 2 sec):
```bash
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```


---------

Signed-off-by: Mengnan Gong <[email protected]>
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this issue Oct 29, 2024