Confusing systemd behaviour with OnFailure= and Restart=

I'm using systemd 231 in an embedded system, and I'm trying to create a service that monitors a hardware component in the system. Here's a rough description of what I'm trying to do:

When the service, foo.service, is started, it launches an application, foo_app.

foo_app monitors the hardware component, running continuously.

If foo_app detects a hardware failure, it exits with a return code of 1. This should trigger a system reboot.

If foo_app crashes, systemd should relaunch foo_app.

If foo_app repeatedly crashes, systemd should reboot the system.

Here's my attempt at implementing this as a service:

[Unit]
Description=Foo Hardware Monitor

# If the application fails 3 times in 30 seconds, something has gone wrong,
# and the state of the hardware can't be guaranteed. Reboot the system here.
StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=reboot

# StartLimitAction=reboot will reboot the box if the app fails repeatedly,
# but if the app exits voluntarily, the reboot should trigger immediately
OnFailure=systemd-reboot.service

[Service]
ExecStart=/usr/bin/foo_app

# If the app fails from an abnormal condition (e.g. crash), try to
# restart it (within the limits of StartLimit*).
Restart=on-abnormal

From the documentation (systemd.unit and systemd.service), I'd expect that if I kill foo_app in a way that triggers Restart=on-abnormal (e.g. killall -9 foo_app), systemd should give priority to Restart=on-abnormal over OnFailure=systemd-reboot.service and not start systemd-reboot.service.

However, this isn't what I'm seeing. As soon as I kill foo_app once, the system immediately reboots.

Here are some relevant snippets from the docs:

OnFailure=

A space-separated list of one or more units that are activated when this unit enters the "failed" state. A service unit using Restart= enters the failed state only after the start limits are reached.

Restart=

[snip] Note that service restart is subject to unit start rate limiting configured with StartLimitIntervalSec= and StartLimitBurst=, see systemd.unit(5) for details. A restarted service enters the failed state only after the start limits are reached.

The documentation seems pretty clear:

Services specified in OnFailure should only run when a service enters the "failed" state.

A service should only enter the "failed" state after StartLimitIntervalSec and StartLimitBurst are satisfied.

This is not what I'm seeing.

To confirm this, I edited my service file to the following:

[Unit]
Description=Foo Hardware Monitor 

StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=none

[Service]
ExecStart=/usr/bin/foo_app
Restart=on-abnormal

By removing OnFailure and setting StartLimitAction=none, I was able to see how systemd is responding to foo_app dying. Here's a test where I repeatedly kill foo_app with SIGKILL.

[root@device ~]# systemctl start foo.service
[root@device ~]# journalctl -f -o cat -u foo.service &
[1] 2107
Started Foo Hardware Monitor.
[root@device ~]# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.

[root@device ~]# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.

[root@device ~]# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
foo.service: Start request repeated too quickly
Failed to start foo.
foo.service: Unit entered failed state.
foo.service: Failed with result 'start-limit-hit'

This makes sense for the most part. When foo_app is killed, systemd restarts it until StartLimitBurst is hit and then gives up. This is what I want, except with StartLimitAction=reboot.

What's unusual is that systemd prints "foo.service: Unit entered failed state." whenever foo_app is killed, even if it is about to be restarted through Restart=on-abnormal. This seems to directly contradict these lines from the docs quoted above:

A service unit using Restart= enters the failed state only after the start limits are reached.

A restarted service enters the failed state only after the start limits are reached.
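The mismatch can be made concrete by tallying the journal lines from the test above. This is only a rough sketch: the helper names journal_log and count are made up for illustration, and the log text is pasted in from the session rather than read from journalctl.

```shell
#!/bin/sh
# Journal lines copied from the test above (three SIGKILLs, StartLimitBurst=3).
journal_log() {
cat <<'EOF'
Started Foo Hardware Monitor.
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
foo.service: Start request repeated too quickly
Failed to start foo.
foo.service: Unit entered failed state.
foo.service: Failed with result 'start-limit-hit'
EOF
}

# Count the log lines that contain a fixed string (hypothetical helper).
count() {
    journal_log | grep -c -F "$1"
}

failed_entries=$(count "Unit entered failed state.")
restarts=$(count "scheduling restart")
limit_hits=$(count "start-limit-hit")

echo "entered failed state: $failed_entries"
echo "restarts scheduled:   $restarts"
echo "start limit hit:      $limit_hits"
```

For this log it reports four entries into the failed state, three scheduled restarts, and a single start-limit hit. If the docs were accurate, I'd expect the first number to be 1.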

All of this has left me pretty confused. Am I misunderstanding any of these systemd options? Is this a systemd bug? Any help is appreciated.

asked Feb 8 at 23:50

Matt K

1 Answer

accepted

TL;DR: This is a known documentation issue, and it is still outstanding for the systemd project.

Since you asked this question, this has been reported and identified as a discrepancy between systemd's documentation and its actual behavior. As I understand it (and as I read the GitHub issue), your expectation matches the documentation, so you are not crazy.

Currently systemd sets the state to failed each time the service's main process fails, regardless of whether the start limit has been reached. That is why OnFailure=systemd-reboot.service fires the first time foo_app is killed. In the issue, the OP wrote an amusing anecdote about learning to ride a bike that I highly suggest taking a gander at.
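Until that is resolved, one possible workaround is to drop OnFailure= and let a small wrapper decide what to do with foo_app's exit status. What follows is only a sketch under assumptions that are not in the question: the wrapper and its function names are hypothetical, the unit would point ExecStart= at the wrapper, and Restart=on-failure would replace Restart=on-abnormal, because a shell wrapper reports a child killed by signal N as exit status 128+N rather than as a signal. I have not tested it on systemd 231.

```shell
#!/bin/sh
# Hypothetical wrapper for foo_app (name and install path are assumptions).
# Assumed unit settings: ExecStart=<this script> run /usr/bin/foo_app,
# Restart=on-failure, StartLimitAction=reboot, and no OnFailure=.

# Map foo_app's exit status to the action the question asks for.
decide_action() {
    case "$1" in
        0) echo "none" ;;      # clean exit: nothing to do
        1) echo "reboot" ;;    # foo_app reported a hardware failure
        *) echo "restart" ;;   # crash; death by signal N shows up as 128+N
    esac
}

# Run the monitored program and act on its exit status.
run_wrapped() {
    "$@"
    status=$?
    case "$(decide_action "$status")" in
        reboot)  systemctl --no-block reboot; return 0 ;;
        restart) return "$status" ;;   # non-zero exit lets Restart=on-failure apply
        *)       return 0 ;;
    esac
}

# Only launch the real program when explicitly asked, so the mapping can be
# inspected without touching the system.
if [ "${1:-}" = "run" ]; then
    shift
    run_wrapped "$@"
    exit $?
fi
```

With this arrangement, repeated crashes would still be handled by StartLimitBurst=, StartLimitIntervalSec= and StartLimitAction=reboot, while the deliberate exit with return code 1 triggers the reboot from the wrapper instead of from OnFailure=.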

answered Mar 28 at 20:23

cunninghamp3
