Confusing systemd behaviour with OnFailure= and Restart=
I'm using systemd 231 in an embedded system, and I'm trying to create a service that monitors a hardware component in the system. Here's a rough description of what I'm trying to do:
- When the service, foo.service, is started, it launches an application, foo_app.
- foo_app monitors the hardware component, running continuously.
- If foo_app detects a hardware failure, it exits with a return code of 1. This should trigger a system reboot.
- If foo_app crashes, systemd should relaunch foo_app.
- If foo_app repeatedly crashes, systemd should reboot the system.
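To make the intended policy explicit, here is a minimal sketch of the decision I want systemd to make for each way foo_app can stop. The function name, the exit_kind labels, and the burst threshold are hypothetical and exist only for illustration; this is not systemd code.

```python
def desired_action(exit_kind: str, crashes_in_window: int, burst: int = 3) -> str:
    """Return the action wanted for one exit of foo_app (illustrative only).

    exit_kind is a hypothetical label:
      "hardware_failure" -> foo_app exited on its own with return code 1
      "crash"            -> foo_app died abnormally (e.g. killed by a signal)
    crashes_in_window counts crashes inside the rate-limit window,
    including the one being handled now.
    """
    if exit_kind == "hardware_failure":
        # A voluntary exit with code 1 should reboot immediately.
        return "reboot"
    if exit_kind == "crash":
        # Occasional crashes are restarted; repeated crashes reboot.
        return "reboot" if crashes_in_window >= burst else "restart"
    raise ValueError(f"unknown exit kind: {exit_kind!r}")
```

With the assumed threshold of 3, the first two crashes in the window map to "restart" and the third maps to "reboot".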
Here's my attempt at implementing this as a service:
[Unit]
Description=Foo Hardware Monitor
# If the application fails 3 times in 30 seconds, something has gone wrong,
# and the state of the hardware can't be guaranteed. Reboot the system here.
StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=reboot
# StartLimitAction=reboot will reboot the box if the app fails repeatedly,
# but if the app exits voluntarily, the reboot should trigger immediately
OnFailure=systemd-reboot.service
[Service]
ExecStart=/usr/bin/foo_app
# If the app fails from an abnormal condition (e.g. crash), try to
# restart it (within the limits of StartLimit*).
Restart=on-abnormal
From the documentation (systemd.unit(5) and systemd.service(5)), I'd expect that if I kill foo_app in a way such that Restart=on-abnormal is triggered (e.g. killall -9 foo_app), systemd should give priority to Restart=on-abnormal over OnFailure=systemd-reboot.service and not start systemd-reboot.service.
However, this isn't what I'm seeing. As soon as I kill foo_app once, the system immediately reboots.
Here are some relevant snippets from the docs:
OnFailure=
A space-separated list of one or more units that are activated when this unit enters the "failed" state. A service unit using Restart= enters the failed state only after the start limits are reached.
Restart=
[snip] Note that service restart is subject to unit start rate limiting configured with StartLimitIntervalSec= and StartLimitBurst=, see systemd.unit(5) for details. A restarted service enters the failed state only after the start limits are reached.
The documentation seems pretty clear:
- Services specified in OnFailure should only run when a service enters the "failed" state.
- A service should only enter the "failed" state after StartLimitIntervalSec and StartLimitBurst are satisfied.
This is not what I'm seeing.
To confirm this, I edited my service file to the following:
[Unit]
Description=Foo Hardware Monitor
StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=none
[Service]
ExecStart=/usr/bin/foo_app
Restart=on-abnormal
By removing OnFailure and setting StartLimitAction=none, I was able to see how systemd is responding to foo_app dying. Here's a test where I repeatedly kill foo_app with SIGKILL.
[root@device ~]# systemctl start foo.service
[root@device ~]# journalctl -f -o cat -u foo.service &
[1] 2107
Started Foo Hardware Monitor.
[root@device ~]# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.
[root@device ~]# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.
[root@device ~]# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
foo.service: Start request repeated too quickly
Failed to start foo.
foo.service: Unit entered failed state.
foo.service: Failed with result 'start-limit-hit'
This makes sense for the most part. When foo_app is killed, systemd restarts it until StartLimitBurst is hit and then gives up. This is what I want, except with StartLimitAction=reboot.
What's unusual is that systemd prints "foo.service: Unit entered failed state." whenever foo_app is killed, even if it is about to be restarted through Restart=on-abnormal. This seems to directly contradict these lines from the docs quoted above:
A service unit using Restart= enters the failed state only after the start limits are reached.
A restarted service enters the failed state only after the start limits are reached.
All of this has left me pretty confused. Am I misunderstanding any of these systemd options? Is this a systemd bug? Any help is appreciated.
systemd
asked Feb 8 at 23:50
Matt K
1 Answer
TL;DR - This is a known documentation issue, and it is currently still outstanding for the systemd project.
It turns out that, since you asked this question, this has been reported and identified as a discrepancy in systemd between the documentation and the actual behavior. In my understanding (and my reading of the GitHub issue), your expectation and the documentation match, so you are not crazy.
Currently, systemd sets the state to failed after every attempted start, regardless of whether the start limit has been reached. In the issue, the OP wrote an amusing anecdote about learning to ride a bike that I highly suggest taking a gander at.
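To illustrate the difference, here is a toy model (not systemd code, and not taken from the issue) that counts how often a service would enter the "failed" state under the documented rule versus the behavior seen in the question's log. The function and parameter names are made up for this sketch, and it assumes every kill happens inside the StartLimitIntervalSec window.

```python
def failed_state_entries(kills: int, start_limit_burst: int, documented: bool) -> int:
    """Count entries into the "failed" state in a toy model of one service.

    documented=True  models the man page wording: "failed" is entered only
                     once the start limit is reached.
    documented=False models the log in the question: every abnormal exit
                     logs "Unit entered failed state." before the restart.
    Assumption: all kills fall within the rate-limit interval.
    """
    starts = 1   # the initial `systemctl start`
    entries = 0  # each entry is a point where OnFailure= units would be activated
    for _ in range(kills):
        if not documented:
            entries += 1  # observed: failed state on every abnormal exit
        if starts >= start_limit_burst:
            entries += 1  # restart refused, result 'start-limit-hit'
            break
        starts += 1       # restart allowed by the start limit
    return entries
```

Under this model, a single kill with StartLimitBurst=3 gives 0 entries for the documented rule but 1 for the observed behavior, which is consistent with OnFailure=systemd-reboot.service rebooting the system on the very first kill. Three kills give 1 versus 4, matching the four "Unit entered failed state." lines in the log.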
answered Mar 28 at 20:23
cunninghamp3