Confusing systemd behaviour with OnFailure= and Restart=

I'm using systemd 231 in an embedded system, and I'm trying to create a service that monitors a hardware component in the system. Here's a rough description of what I'm trying to do:



  1. When the service, foo.service, is started, it launches an application, foo_app.


  2. foo_app monitors the hardware component, running continuously.

  3. If foo_app detects a hardware failure, it exits with a return code of 1. This should trigger a system reboot.

  4. If foo_app crashes, systemd should relaunch foo_app.

  5. If foo_app repeatedly crashes, systemd should reboot the system.

Here's my attempt at implementing this as a service:



[Unit]
Description=Foo Hardware Monitor

# If the application fails 3 times in 30 seconds, something has gone wrong,
# and the state of the hardware can't be guaranteed. Reboot the system here.
StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=reboot

# StartLimitAction=reboot will reboot the box if the app fails repeatedly,
# but if the app exits voluntarily, the reboot should trigger immediately
OnFailure=systemd-reboot.service

[Service]
ExecStart=/usr/bin/foo_app

# If the app fails from an abnormal condition (e.g. crash), try to
# restart it (within the limits of StartLimit*).
Restart=on-abnormal


From the documentation (systemd.unit and systemd.service), I'd expect that if I kill foo_app in a way that triggers Restart=on-abnormal (e.g. killall -9 foo_app), systemd should give priority to Restart=on-abnormal over OnFailure=systemd-reboot.service and not start systemd-reboot.service.



However, this isn't what I'm seeing. As soon as I kill foo_app once, the system immediately reboots.



Here are some relevant snippets from the docs:




OnFailure=



A space-separated list of one or more units that are activated when this unit enters the "failed" state. A service unit using Restart= enters the failed state only after the start limits are reached.



Restart=



[snip] Note that service restart is subject to unit start rate limiting configured with StartLimitIntervalSec= and StartLimitBurst=, see systemd.unit(5) for details. A restarted service enters the failed state only after the start limits are reached.




The documentation seems pretty clear:



  • Services specified in OnFailure should only run when a service enters the "failed" state

  • A service should only enter the "failed" state after StartLimitIntervalSec and StartLimitBurst are satisfied.

This is not what I'm seeing.



To confirm this, I edited my service file to the following:



[Unit]
Description=Foo Hardware Monitor

StartLimitBurst=3
StartLimitIntervalSec=30
StartLimitAction=none

[Service]
ExecStart=/usr/bin/foo_app
Restart=on-abnormal


By removing OnFailure and setting StartLimitAction=none, I was able to see how systemd is responding to foo_app dying. Here's a test where I repeatedly kill foo_app with SIGKILL.



[root@device ~]
# systemctl start foo.service
[root@device ~]
# journalctl -f -o cat -u foo.service &
[1] 2107
Started Foo Hardware Monitor.
[root@device ~]
# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.

[root@device ~]
# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
Started foo.

[root@device ~]
# killall -9 foo_app
foo.service: Main process exited, code=killed, status=9/KILL
foo.service: Unit entered failed state.
foo.service: Failed with result 'signal'
foo.service: Service hold-off time over, scheduling restart.
Stopped foo.
foo.service: Start request repeated too quickly
Failed to start foo.
foo.service: Unit entered failed state.
foo.service: Failed with result 'start-limit-hit'


This makes sense for the most part. When foo_app is killed, systemd restarts it until StartLimitBurst is hit and then gives up. This is what I want, except with StartLimitAction=reboot.
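The restart-then-give-up sequence in the log above can be reproduced with a simplified model of a burst/interval rate limiter. This is only a sketch, not systemd's actual code: the class name `StartLimiter` and the fixed-window behaviour are assumptions chosen to match the log, where three starts are accepted and the fourth is refused.

```python
class StartLimiter:
    """Simplified model of StartLimitBurst=/StartLimitIntervalSec= (hypothetical, not systemd code)."""

    def __init__(self, burst, interval_sec):
        self.burst = burst
        self.interval_sec = interval_sec
        self.window_start = None  # time of the first start in the current window
        self.count = 0            # starts accepted in the current window

    def try_start(self, now):
        """Return True if a start at time `now` (in seconds) is allowed."""
        # Open a new window on the first start, or once the old one has expired.
        if self.window_start is None or now - self.window_start > self.interval_sec:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # corresponds to 'start-limit-hit' in the log


limiter = StartLimiter(burst=3, interval_sec=30)
# One initial start, then a restart attempt after each of three quick kills.
results = [limiter.try_start(t) for t in (0, 5, 10, 15)]
print(results)  # [True, True, True, False]
```

With StartLimitBurst=3 and StartLimitIntervalSec=30, the model accepts the initial start and two restarts, then refuses the third restart, which lines up with the "Start request repeated too quickly" message.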



What's unusual is that systemd prints foo.service: Unit entered failed state. whenever foo_app is killed, even if it is about to be restarted through Restart=on-abnormal. This seems to directly contradict these lines from the docs quoted above:




A service unit using Restart= enters the failed state only after the start limits are reached.



A restarted service enters the failed state only after the start limits are reached.




All of this has left me pretty confused. Am I misunderstanding any of these systemd options? Is this a systemd bug? Any help is appreciated.







asked Feb 8 at 23:50 by Matt K




















1 Answer










TL;DR: This is a known discrepancy between the systemd documentation and its behavior, and it is still an outstanding issue for the systemd project.

Since you asked this question, this has been reported and identified as a discrepancy in systemd between the documentation and the actual behavior. In my understanding (and my reading of the GitHub issue), your expectation matches the documentation, so you are not crazy.

Currently systemd sets the unit state to failed after every attempted start that ends in failure, regardless of whether the start limit has been reached. That is why OnFailure= fires on the very first kill. In the issue, the OP wrote an amusing anecdote about learning to ride a bike that I highly suggest taking a gander at.
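To make the gap between the documented and the observed behaviour concrete, here is a small sketch that counts how often OnFailure= would be activated under each reading. The function name `onfailure_activations` and its parameters are hypothetical and only model the log from the question. It does not call systemd.

```python
def onfailure_activations(start_allowed, fail_state_on_every_exit):
    """Count OnFailure= activations for a service that dies after every start.

    start_allowed: one boolean per start attempt (True = start accepted).
    fail_state_on_every_exit: True models the observed behaviour,
    False models the documented behaviour.
    """
    activations = 0
    for allowed in start_allowed:
        if not allowed:
            # Start limit reached: both models enter the failed state here.
            activations += 1
            break
        if fail_state_on_every_exit:
            # Observed: each abnormal exit enters "failed" before the restart.
            activations += 1
    return activations


# Start attempts from the question's log: three accepted, the fourth refused.
attempts = [True, True, True, False]
documented = onfailure_activations(attempts, fail_state_on_every_exit=False)
observed = onfailure_activations(attempts, fail_state_on_every_exit=True)
print(documented, observed)  # 1 4
```

The four activations in the observed model correspond to the four "Unit entered failed state." lines in the log. With OnFailure=systemd-reboot.service, the first of them is already enough to reboot the system.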






answered Mar 28 at 20:23 by cunninghamp3






















                     
