[Journald] Restart journalctl if it exits unexpectedly #40558
Conversation
This pull request is now in conflicts. Could you fix it? 🙏

Force-pushed 93230ef to 6507efe

This pull request is now in conflicts. Could you fix it? 🙏

Force-pushed 6507efe to 7b93e74

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
logger.Infof("journalctl started with PID %d", cmd.Process.Pid) | ||
|
||
go func() { | ||
if err := cmd.Wait(); err != nil { |
I'm wondering if we should also watch for the `dataChan` and `errChan` to close. For example, if we failed to read stdout for some reason, that goroutine loop would exit.
This goroutine is here only to log any error returned by `cmd.Wait`. If the process exits unexpectedly, its stderr and stdout will be closed and the reader goroutines will get an EOF or error. At the moment I'm doing my best to read and ship/log all data without over-complicating the code and risking a deadlock.
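A minimal, self-contained sketch of the pattern under discussion (illustrative, not the PR's actual code): the `cmd.Wait` goroutine only logs the exit status, while the stdout reader goroutine terminates on its own once the pipe is closed.

```go
package main

import (
	"bufio"
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("journalctl", "--follow", "--output=json")

	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	log.Printf("journalctl started with PID %d", cmd.Process.Pid)

	done := make(chan struct{})

	// Reader goroutine: exits on its own once the process dies, because
	// the stdout pipe returns EOF (or an error) when it is closed.
	go func() {
		defer close(done)
		scanner := bufio.NewScanner(stdout)
		for scanner.Scan() {
			// In the real input this would be sent to a data channel.
			log.Printf("entry: %s", scanner.Text())
		}
	}()

	// Wait goroutine: its only job is to surface the exit status.
	go func() {
		if err := cmd.Wait(); err != nil {
			log.Printf("journalctl exited unexpectedly: %v", err)
		}
	}()

	<-done
}
```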
Force-pushed cc66c5d to 7037dc2
```go
func (j *journalctl) Next(cancel input.Canceler) ([]byte, error) {
	select {
	case <-cancel.Done():
```
If `cancel.Done()` fires, do we have to kill the process?
This `input.Canceler` is in the context of the `Next` call, so I don't think we should kill the journalctl process. If the journald input is stopped, it will call Close on the Reader, and the reader will kill the journalctl process.
I added some comments explaining when Kill needs to be called.
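A small sketch of the split responsibility described in this thread, with `context.Context` standing in for `input.Canceler` and all names assumed for illustration: `Next` only honours the per-call canceler, while `Close` is the one place where the journalctl process is killed.

```go
package journald

import (
	"context"
	"errors"
	"os/exec"
)

// journalctlProc is an illustrative stand-in for the PR's journalctl wrapper.
type journalctlProc struct {
	cmd      *exec.Cmd
	dataChan chan []byte
	errChan  chan error
}

// Next only honours the per-call canceler: returning early does not
// terminate journalctl, it just stops waiting for the next entry.
func (j *journalctlProc) Next(cancel context.Context) ([]byte, error) {
	select {
	case <-cancel.Done():
		return nil, cancel.Err()
	case d := <-j.dataChan:
		return d, nil
	case err := <-j.errChan:
		return nil, err
	}
}

// Close is called when the input itself shuts down; this is where the
// journalctl process is killed.
func (j *journalctlProc) Close() error {
	if j.cmd == nil || j.cmd.Process == nil {
		return errors.New("journalctl is not running")
	}
	return j.cmd.Process.Kill()
}
```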
This pull request is now in conflicts. Could you fix it? 🙏
Fully isolates the journalctl process into a new type called journalctl, so the reader does not know about it and can simply call a Next method on the new type.
If journalctl exits for any reason, the reader now restarts it.
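A hedged sketch of what that isolation could look like (all names are assumptions, not the PR's exact types): the reader only sees a small interface, so restarting journalctl amounts to closing the old instance and asking a factory for a new one that resumes from the stored cursor.

```go
package journald

import "context"

// journalctler is the small surface the reader depends on.
type journalctler interface {
	Next(cancel context.Context) ([]byte, error)
	Close() error
}

type reader struct {
	jctl    journalctler
	newJctl func(cursor string) (journalctler, error) // factory used to restart
	cursor  string
}

func (r *reader) Next(cancel context.Context) ([]byte, error) {
	data, err := r.jctl.Next(cancel)
	if err == nil {
		return data, nil
	}

	// journalctl exited (or its pipes were closed): restart it from the
	// last known cursor and surface the error so the caller can retry.
	_ = r.jctl.Close()
	jctl, restartErr := r.newJctl(r.cursor)
	if restartErr != nil {
		return nil, restartErr
	}
	r.jctl = jctl
	return nil, err
}
```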
This commit adds a test to the journald reader that ensures it can restart the journalctl process in case it exits. It is tested by mocking the type abstracting the calls to journalctl; if there is an error reading the next message, we ensure a new journalctl is created.
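Continuing the hypothetical sketch above, a test along those lines could mock the journalctl wrapper so its first `Next` call fails, then assert that the reader's factory created a replacement (test and type names are assumptions, not the PR's actual test code).

```go
package journald

import (
	"context"
	"errors"
	"testing"
)

// failingJctl pretends to be journalctl and fails on its first Next call.
type failingJctl struct {
	failed bool
}

func (m *failingJctl) Next(context.Context) ([]byte, error) {
	if !m.failed {
		m.failed = true
		return nil, errors.New("journalctl exited")
	}
	return []byte(`{"MESSAGE":"hello"}`), nil
}

func (m *failingJctl) Close() error { return nil }

func TestReaderRestartsJournalctl(t *testing.T) {
	restarts := 0
	r := &reader{
		jctl: &failingJctl{},
		newJctl: func(cursor string) (journalctler, error) {
			restarts++
			return &failingJctl{failed: true}, nil
		},
	}

	// The first call hits the simulated crash and must trigger a restart.
	if _, err := r.Next(context.Background()); err == nil {
		t.Fatal("expected an error signalling that journalctl was restarted")
	}
	if restarts != 1 {
		t.Fatalf("expected exactly 1 restart, got %d", restarts)
	}

	// After the restart the reader serves data again.
	if _, err := r.Next(context.Background()); err != nil {
		t.Fatalf("unexpected error after restart: %v", err)
	}
}
```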
This commit refactors TestEventWithNonStringData to use the new types and re-enables the test.
This commit adds an integration test for the journald input. The test ensures that if the journalctl process exits unexpectedly it is restarted.
Fix publishing an empty message when journalctl crashes and needs to be restarted. Now when the reader restarts journalctl it returns ErrRestarting and the input can correctly handle this situation. The tests are updated and the code formatted (mage fmt).
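A sketch of the sentinel-error idea (only the `ErrRestarting` name comes from the commit message; the surrounding loop is illustrative): the caller detects the sentinel and skips publishing instead of emitting an empty event.

```go
package journald

import (
	"context"
	"errors"
)

// ErrRestarting tells the caller that no event was produced because
// journalctl was just restarted, so nothing should be published.
var ErrRestarting = errors.New("journalctl is restarting")

func publishLoop(ctx context.Context, next func(context.Context) ([]byte, error), publish func([]byte)) error {
	for {
		data, err := next(ctx)
		switch {
		case err == nil:
			publish(data)
		case errors.Is(err, ErrRestarting):
			// journalctl was restarted; there is no event to publish,
			// just ask for the next one.
			continue
		default:
			return err
		}
	}
}
```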
Test the bytes written to systemd-cat on every call, properly handle errors when closing its stdin, and improve error reporting.
This commit removes two wait groups that were not being used correctly and improves error handling. Now if journalctl writes to stderr, the lines are read and logged directly. Lint warnings are also fixed.
Make sure the logger name is consistent everywhere in the reader and its subcomponents.
When restarting journalctl, an exponential backoff is used if the last restart was less than 5s ago. In the unlikely case journalctl crashes right after being started, the exponential backoff makes Filebeat restart journalctl at most once every two seconds.
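A self-contained sketch of that backoff policy (the real code uses Filebeat's backoff helpers; the constants and structure here are illustrative): crashes within 5s of the previous restart wait an exponentially growing delay, capped so journalctl is restarted at most once every two seconds.

```go
package journald

import "time"

type restartBackoff struct {
	lastRestart time.Time
	wait        time.Duration
}

func (b *restartBackoff) waitBeforeRestart() {
	const (
		healthyRun = 5 * time.Second        // longer runs reset the backoff
		initial    = 100 * time.Millisecond // first delay after a quick crash
		maxWait    = 2 * time.Second        // cap: at most one restart every 2s
	)

	if !b.lastRestart.IsZero() && time.Since(b.lastRestart) < healthyRun {
		if b.wait == 0 {
			b.wait = initial
		} else {
			b.wait *= 2
		}
		if b.wait > maxWait {
			b.wait = maxWait
		}
		time.Sleep(b.wait)
	} else {
		b.wait = 0
	}
	b.lastRestart = time.Now()
}
```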
Force-pushed 3a5aaec to e7d2260
If journalctl exits unexpectedly the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error. There is an exponential backoff that caps at one restart every 2s. (cherry picked from commit a9fb9fa)
If journalctl exits unexpectedly the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error. There is an exponential backoff that caps at one restart every 2s. (cherry picked from commit a9fb9fa) Co-authored-by: Tiago Queiroz <[email protected]>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description According to the community, there are bugs in systemd that could corrupt the journal files or crash the log receiver: systemd/systemd#24320 systemd/systemd#24150 We've seen some issues reported to Elastic/beats project: elastic/beats#39352 elastic/beats#32782 elastic/beats#34077 Unfortunately, the otelcol is not immune from these issues. When the journalctl process exits for any reason, the log consumption from journald just stops. We've experienced this on some machines that have high log volume. Currently we monitors the journalctl processes started by otelcol, and restart the otelcol when some of them is missing. IMO, The journald receiver itself should monitor the journalctl process it starts, and does its best to keep it alive. In this PR, we try to restart the journalctl process when it exits unexpectedly. As long as the journalctl cmd can be started (via `Cmd.Start()`) successfully, the journald_input will always try to restart the journalctl process if it exits. The error reporting behaviour changes a bit in this PR. Before the PR, the `operator.Start` waits up to 1 sec to capture any immediate error returned from journalctl. After the PR, the error won't be reported back even if the journalctl exits immediately after start, instead, the error will be logged, and the process will be restarted. The fix is largely inspired by elastic/beats#40558. <!--Describe what testing was performed and which tests were added.--> #### Testing Add a simple bash script that print a line every second, and load it to systemd. `log_every_second.sh`: ```bash #!/bin/bash while true; do echo "Log message: $(date)" sleep 1 done ``` `log.service`: ``` [Unit] Description=Print logs to journald every second After=network.target [Service] ExecStart=/usr/local/bin/log_every_second.sh Restart=always StandardOutput=journal StandardError=journal [Install] WantedBy=multi-user.target ``` Start the otelcol with the following config: ```yaml service: telemetry: logs: level: debug pipelines: logs: receivers: [journald] processors: [] exporters: [debug] receivers: journald: exporters: debug: verbosity: basic sampling_initial: 1 sampling_thereafter: 1 ``` Kill the journalctl process and observe the otelcol's behaviour. The journactl process will be restarted after the backoff period (hardcoded to 2 sec): ```bash 2024-10-06T14:32:33.755Z info LogsExporter {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1} 2024-10-06T14:32:34.709Z error journald/input.go:98 journalctl command exited {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"} github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:98 2024-10-06T14:32:36.712Z debug journald/input.go:94 Starting the journalctl command {"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"} 2024-10-06T14:32:36.756Z info LogsExporter {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10} ``` <!--Please delete paragraphs that you did not use before submitting.--> --------- Signed-off-by: Mengnan Gong <[email protected]>
Proposed commit message
If journalctl exits unexpectedly the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error.
Checklist

- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`

Disruptive User Impact
Author's Checklist
How to test this PR locally
filebeat.yml
potato
Related issues
Tests
I ran a test overnight where I used a mock to simulate constant failures and restarts of journalctl and monitored the host and the Filebeat process to ensure there weren't any goroutine/resource leaks. The screenshot below shows the CPU and memory usage from Filebeat as well as counters for the log entries stating journalctl crashed and was restarted.
Use cases

Screenshots

Logs