Proper way to use OnFailure in systemd

I have a service running software that generates some configuration files if they don't exist, and reads them if they do. The problem I have been facing is that these files sometimes get corrupted, so the software cannot start and the service fails. In that case I would like to remove the files and restart the service.



I tried creating a service that should get executed in case of failure, by doing this:



[Service]
ExecStart=/bin/run_program
OnFailure=software-fail.service


where this service is:



[Service]
ExecStart=/bin/rm /file/to/delete
ExecStop=systemctl --user start software.service


The problem, however, is that software-fail.service doesn't start, even when software.service has failed.

I tried doing



systemctl --user enable software-fail.service


but then it starts every time the system starts, just like any other service.



My temporary solution is to use



ExecStopPost=/bin/rm /file/to/delete


but this is not a satisfying way of solving it, as it will always delete the file upon stopping the service, no matter if it was because of failure or not.



Output when failing:



● software.service - Software
Loaded: loaded (/home/trippelganger/.config/systemd/user/software.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2018-05-04 09:05:26 CEST; 5s ago
Process: 1839 ExecStart=/bin/run_program (code=exited, status=1/FAILURE)
Main PID: 1839 (code=exited, status=1/FAILURE)



May 04 09:05:26 trippelganger systemd[595]: software.service: Main process exited, code=exited, status=1/FAILURE
May 04 09:05:26 trippelganger systemd[595]: software.service: Unit entered failed state.
May 04 09:05:26 trippelganger systemd[595]: software.service: Failed with result 'exit-code'.


Output of systemctl --user status software-fail.service:



● software-fail.service - Delete corrupt files
Loaded: loaded (/home/trippelganger/.config/systemd/user/software-fail.service; disabled; vendor preset: enabled)
Active: inactive (dead)








asked May 3 at 15:07 by trippelganger
edited May 5 at 3:01











  • Is there any journal output from your first service file? How about the second service file? That is, can you tell if the first service reaches the failure condition at all? You should update the question with the (relevant) journal output.
    – Chiraag
    May 3 at 16:26










  • @Chiraag Updated.
    – trippelganger
    May 4 at 7:10










  • @trippelganger Please edit your question to make some corrections. start_if_fail.service should probably be software-fail.service (from your later systemctl status output) and ExecStop=systemctl ... fails because the path is not absolute, so I imagine what you really have is ExecStop=/bin/systemctl ... (though it's possible this might also be part of the problem you're having.)
    – Filipe Brandenburger
    May 4 at 16:05
















2 Answers
accepted






NOTE: You probably want to use ExecStopPost= instead of OnFailure= here (see my other answer), but this is trying to address why your OnFailure= setup is not working.



OnFailure= is probably not starting the unit because it is in the wrong section: it needs to be in the [Unit] section, not [Service].



You can try this instead:



# software.service
[Unit]
Description=Software
OnFailure=software-fail.service

[Service]
ExecStart=/bin/run_program


And:



# software-fail.service
[Unit]
Description=Delete corrupt files

[Service]
ExecStart=/bin/rm /file/to/delete
ExecStop=/bin/systemctl --user start software.service


I can make it work with this setup.



But note that using OnFailure= is not ideal here, since you can't really tell why the program failed, and chaining another start of it in ExecStop= by calling /bin/systemctl start directly is pretty hacky... The solution using ExecStopPost= and looking at the exit status is definitely superior.



If you define OnFailure= inside [Service], systemd (at least version 234 from Fedora 27) complains with:



software.service:6: Unknown lvalue 'OnFailure' in section 'Service'


Not sure if you're seeing that in your logs or not... (Maybe this was added in recent systemd?) That should be a hint of what is going on there... I hope this helps.
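If the warning does not show up in your logs, a quick way to see which section a setting sits in is to scan the unit file yourself. The following is only a minimal sketch: the function name onfailure_sections is made up for illustration, and it assumes a plain unit file with no drop-ins or line continuations. (systemd-analyze verify is the proper tool where it is available; this just illustrates the check.)

```shell
#!/bin/sh
# Sketch: report the section in which each OnFailure= line of a unit file appears.
# onfailure_sections is a hypothetical helper name, not part of systemd.
onfailure_sections() {
    awk '
        # Remember the most recent [Section] header
        /^\[.*\]/ { section = $0; next }
        # Print the section together with any OnFailure= assignment found in it
        /^[ \t]*OnFailure[ \t]*=/ { print section ": " $0 }
    ' "$1"
}

# When called with a unit file path, report on that file
if [ "$#" -gt 0 ]; then
    onfailure_sections "$1"
fi
```

Run against the unit file from the question, this would report the setting under [Service], which is the placement systemd ignores.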






answered May 4 at 16:01 by Filipe Brandenburger











  • Thank you, what a stupid mistake! I did not see the "Unknown lvalue" text in my logs, using the latest systemd in Debian Stretch.
    – trippelganger
    May 5 at 3:05





























In order to perform some cleanup if the service fails, you can use ExecStopPost=, which is executed whether the service succeeds or not.



In the code you run at ExecStopPost=, you can use one of $SERVICE_RESULT, $EXIT_CODE or $EXIT_STATUS to determine the failure condition and act accordingly. See the documentation on those environment variables to check which one is appropriate for you.
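As a rough illustration of how those three variables relate, here is a small sketch of a script that could be run from ExecStopPost=. The function name stop_reason and the wording of the messages are made up for this example, and it assumes a systemd version that sets these variables (the systemd.exec documentation lists the possible values).

```shell
#!/bin/sh
# Sketch: describe how the service stopped, using the variables systemd
# sets for ExecStop=/ExecStopPost= commands. stop_reason is a hypothetical name.
stop_reason() {
    case "${SERVICE_RESULT:-}" in
        success)
            echo "clean stop" ;;
        exit-code)
            # EXIT_CODE is "exited"; EXIT_STATUS holds the numeric exit status
            echo "main process exited with status ${EXIT_STATUS:-unknown}" ;;
        signal|core-dump)
            # EXIT_CODE is "killed" or "dumped"; EXIT_STATUS holds the signal name
            echo "main process terminated by signal ${EXIT_STATUS:-unknown}" ;;
        "")
            # Variables are unset, e.g. when the script is run by hand
            echo "not run by systemd" ;;
        *)
            echo "failed: $SERVICE_RESULT" ;;
    esac
}

stop_reason
```

For the failure shown in the question (Result: exit-code, status=1/FAILURE), this would take the exit-code branch with EXIT_STATUS set to 1.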



Then you can use Restart=on-failure so that systemd tries to restart your unit when it fails.



Putting it all together, this is what it would look like. Assuming that run_program will exit with status 2 whenever the files are corrupted (hopefully you can adapt this to other failure scenarios from the documentation above), this should work:



[Service]
ExecStart=/bin/run_program
ExecStopPost=/bin/sh -c 'if [ "$$EXIT_STATUS" = 2 ]; then rm /file/to/delete; fi'
Restart=on-failure


(NOTE: The double dollar sign $$ escapes the variable from systemd, so the shell sees $EXIT_STATUS and expands it itself. A single dollar sign would also work, but then systemd would perform the substitution and the shell would see [ "2" = 2 ], which arguably gives the same result. You can bypass most of that by putting this logic into a shell script and calling it by its full path in ExecStopPost=. That is probably better anyway, since you can easily add more commands to the script, such as logging the action taken to recover from the error condition.)
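The script mentioned in the note could look roughly like the sketch below. Everything specific in it is an assumption for illustration: the function name cleanup_on_failure, the CONFIG_FILE and CORRUPT_STATUS variables, and the choice of exit status 2 (taken from the example above, so adapt it to whatever your program really returns). /file/to/delete is the placeholder path from the question.

```shell
#!/bin/sh
# Sketch of an ExecStopPost= helper: delete the config file only when the
# main process exited with the status assumed to mean "files are corrupted".

# Placeholder path from the question; override via the environment if needed
CONFIG_FILE="${CONFIG_FILE:-/file/to/delete}"
# Assumed exit status for corruption (hypothetical; check your program)
CORRUPT_STATUS="${CORRUPT_STATUS:-2}"

cleanup_on_failure() {
    # EXIT_CODE and EXIT_STATUS are set by systemd for ExecStopPost= commands
    if [ "${EXIT_CODE:-}" = "exited" ] && [ "${EXIT_STATUS:-}" = "$CORRUPT_STATUS" ]; then
        # stderr normally ends up in the journal, so the recovery is logged
        echo "exit status $EXIT_STATUS: removing $CONFIG_FILE" >&2
        rm -f -- "$CONFIG_FILE"
    fi
}

cleanup_on_failure
```

Saved under a path of your choice and made executable, it would be referenced as ExecStopPost= followed by that full path. No $$ escaping is needed then, because the variables are expanded by the script's own shell.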



Hopefully this will give you enough pointers to figure out how to configure this correctly given your particular situation!






  • Thank you, this is a better way of doing my current solution. However, it does not explain how to properly use the OnFailure command in systemd, which was my original question. I find the systemd documentation can be a bit hard to interpret at times. Also, shouldn't I check for exit status 1, rather than 2?
    – trippelganger
    May 4 at 7:43










  • Checking exit status 2 was an example, in case your program does something specific in case of file corruption (some programs use different exit status to signal specific conditions.) A more general solution would be checking for "$$EXIT_STATUS" != 0, but the best is for you to look at your program's documentation (or source code) and look at the three variables systemd uses and try to figure out what's the best way to detect the error condition in your specific case.
    – Filipe Brandenburger
    May 4 at 15:41










  • Indeed I did not explain why OnFailure= was not working for you, I focused on the problem you were trying to solve and I think OnFailure= is not the best solution for that... I added another answer to address OnFailure=, it seems the problem is you have it in the wrong section ([Service], while it should be in [Unit]), I hope that's helpful to you as well. In any case, for this specific problem you described, ExecStopPost= and checking for exit status (or service result, etc.) is probably a better approach.
    – Filipe Brandenburger
    May 4 at 16:03










  • Isn't this kind of situation exactly what OnFailure is used for?
    – trippelganger
    May 5 at 3:04














up vote
1
down vote













In order to perform some cleanup if the service fails, you can use ExecStopPost=, which is executed whether the service succeeds or not.



In the code you run at ExecStopPost=, you can use one of $SERVICE_RESULT, $EXIT_CODE or $EXIT_STATUS to determine the failure condition and act accordingly. See the documentation on those environment variables to check which one is appropriate for you.



Then you can use Restart=on-failure so that systemd tries to restart your unit when it fails.



Putting it all together, this is what it would look like. Assuming that run_program will exit with status 2 whenever the files are corrupted (hopefully you can adapt this to other failure scenarios from the documentation above), this should work:



[Service]
ExecStart=/bin/run_program
ExecStopPost=/bin/sh -c 'if [ "$$EXIT_STATUS" = 2 ]; then rm /file/to/delete; fi'
Restart=on-failure


(NOTE: The double dollar-sign $$ is to escape this to systemd, so the shell sees $EXIT_STATUS and accesses that variable. Using a single dollar-sign would also work, but then systemd would do that replacement instead and the shell would see [ "2" = 2 ], which arguably also works... Anyways, you can bypass most of that by putting all this logic into a shell script and calling it by its full path in ExecStopPost=, that would be probably better and you could also easily add more commands to the script, such as logging the action taken to recover from the error condition.)



Hopefully this will give you enough pointers to figure out how to configure this correctly given your particular situation!






share|improve this answer

















  • 1




    Thank you, this is a better way of doing my current solution. However, it does not explain how to properly use the OnFailure command in systemd, which was my original question. I find the systemd documentation can be a bit hard to interpret at times. Also, shouldn't I check for exit status 1, rather than 2?
    – trippelganger
    May 4 at 7:43










  • Checking exit status 2 was an example, in case your program does something specific in case of file corruption (some programs use different exit status to signal specific conditions.) A more general solution would be checking for "$$EXIT_STATUS" != 0, but the best is for you to look at your program's documentation (or source code) and look at the three variables systemd uses and try to figure out what's the best way to detect the error condition in your specific case.
    – Filipe Brandenburger
    May 4 at 15:41










  • Indeed, I did not explain why OnFailure= was not working for you. I focused on the problem you were trying to solve, and I think OnFailure= is not the best solution for that. I added another answer to address OnFailure=: it seems the problem is that you have it in the wrong section ([Service], while it should be in [Unit]). I hope that's helpful to you as well. In any case, for the specific problem you described, ExecStopPost= with a check of the exit status (or service result, etc.) is probably a better approach.
    – Filipe Brandenburger
    May 4 at 16:03










  • Isn't this kind of situation exactly what OnFailure is used for?
    – trippelganger
    May 5 at 3:04
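
The [Unit] placement mentioned in the comments above could be tried out with something like the following sketch. The write_units helper is hypothetical, the unit contents reuse the placeholder paths from the question, and /bin/systemctl is an assumed location that may differ on your distribution. Note that software-fail.service has no [Install] section and is not meant to be enabled: OnFailure= starts it only when software.service enters the failed state.

```shell
#!/bin/sh
# Hypothetical helper: writes both user units into the directory given as $1,
# with OnFailure= placed under [Unit] rather than [Service].
write_units() {
    unit_dir=$1
    mkdir -p "$unit_dir" || return 1

    cat > "$unit_dir/software.service" <<'EOF'
[Unit]
Description=Software
OnFailure=software-fail.service

[Service]
ExecStart=/bin/run_program
EOF

    # Recovery unit: remove the placeholder file, then ask the user manager
    # to start the main service again without waiting for it.
    cat > "$unit_dir/software-fail.service" <<'EOF'
[Unit]
Description=Recover software.service after a failure

[Service]
Type=oneshot
ExecStart=/bin/rm -f /file/to/delete
ExecStart=/bin/systemctl --user --no-block start software.service
EOF
}
```

For example, write_units "$HOME/.config/systemd/user" followed by systemctl --user daemon-reload would install the two units for the current user.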












answered May 3 at 21:10









Filipe Brandenburger

 
