Proper way to use OnFailure in systemd
I have a service running software that generates some configuration files if they don't exist, and reads them if they do. The problem I have been facing is that these files sometimes get corrupted, making the software unable to start and thus making the service fail. In this case I would like to remove these files and restart the service.
I tried creating a service that should get executed in case of failure, by doing this:
[Service]
ExecStart=/bin/run_program
OnFailure=software-fail.service
where software-fail.service is:
[Service]
ExecStart=/bin/rm /file/to/delete
ExecStop=systemctl --user start software.service
The problem, however, is that software-fail.service doesn't start, even when software.service has failed.
I tried doing
systemctl --user enable software-fail.service
but then it starts every time the system starts, just like any other service.
My temporary solution is to use
ExecStopPost=/bin/rm /file/to/delete
but this is not a satisfying way of solving it, as it will always delete the file upon stopping the service, no matter if it was because of failure or not.
Output when failing:
● software.service - Software
Loaded: loaded (/home/trippelganger/.config/systemd/user/software.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2018-05-04 09:05:26 CEST; 5s ago
Process: 1839 ExecStart=/bin/run_program (code=exited, status=1/FAILURE)
Main PID: 1839 (code=exited, status=1/FAILURE)
May 04 09:05:26 trippelganger systemd[595]: software.service: Main process exited, code=exited, status=1/FAILURE
May 04 09:05:26 trippelganger systemd[595]: software.service: Unit entered failed state.
May 04 09:05:26 trippelganger systemd[595]: software.service: Failed with result 'exit-code'.
Output of systemctl --user status software-fail.service is:
● software-fail.service - Delete corrupt files
Loaded: loaded (/home/trippelganger/.config/systemd/user/software-fail.service; disabled; vendor preset: enabled)
Active: inactive (dead)
linux debian systemd
asked May 3 at 15:07 by trippelganger (edited May 5 at 3:01)
Is there any journal output from your first service file? How about the second service file? That is, can you tell if the first service reaches the failure condition at all? You should update the question with the (relevant) journal output.
– Chiraag, May 3 at 16:26
@Chiraag Updated.
– trippelganger, May 4 at 7:10
@trippelganger Please edit your question to make some corrections. start_if_fail.service should probably be software-fail.service (from your later systemctl status output), and ExecStop=systemctl ... fails because the path is not absolute, so I imagine what you really have is ExecStop=/bin/systemctl ... (though it's possible this might also be part of the problem you're having).
– Filipe Brandenburger, May 4 at 16:05
2 Answers
Answer (accepted, 0 votes)
NOTE: You probably want to use ExecStopPost= instead of OnFailure= here (see my other answer), but this is trying to address why your OnFailure= setup is not working.
The problem with OnFailure= not starting the unit might be that it's in the wrong section: it needs to be in the [Unit] section, not [Service].
You can try this instead:
# software.service
[Unit]
Description=Software
OnFailure=software-fail.service
[Service]
ExecStart=/bin/run_program
And:
# software-fail.service
[Unit]
Description=Delete corrupt files
[Service]
ExecStart=/bin/rm /file/to/delete
ExecStop=/bin/systemctl --user start software.service
I can make it work with this setup.
But note that using OnFailure= is not ideal here, since you can't really tell why the program failed, and chaining another start of it in ExecStop= by calling /bin/systemctl start directly is pretty hacky. The solution using ExecStopPost= and looking at the exit status is definitely superior.
If you define OnFailure= inside [Service], systemd (at least version 234 from Fedora 27) complains with:
software.service:6: Unknown lvalue 'OnFailure' in section 'Service'
Not sure if you're seeing that in your logs or not... (Maybe this was added in recent systemd?) That should be a hint of what is going on there... I hope this helps.
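One way to confirm that systemd actually registered OnFailure= after moving it to [Unit] is to query the property. This is only a minimal sketch, assuming the unit names from the question; check_onfailure is a hypothetical helper name, not a systemd command:

```shell
# Report whether systemd loaded any OnFailure= units for a user unit.
# check_onfailure is a hypothetical helper, not part of systemd.
check_onfailure() {
    unit=$1
    # "systemctl show --property=OnFailure" prints a line like OnFailure=<units>
    value=$(systemctl --user show "$unit" --property=OnFailure)
    # Strip the "OnFailure=" prefix so only the unit list (possibly empty) remains
    value=${value#OnFailure=}
    if [ -n "$value" ]; then
        echo "$unit triggers on failure: $value"
    else
        echo "$unit has no OnFailure= units loaded"
        return 1
    fi
}
```

Run systemctl --user daemon-reload first, then check_onfailure software.service. An empty result would suggest the setting was ignored, for example because it still sits in [Service].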
answered May 4 at 16:01 by Filipe Brandenburger
Thank you, what a stupid mistake! I did not see the "Unknown lvalue" text in my logs, using the latest systemd in Debian Stretch.
– trippelganger, May 5 at 3:05
Answer (1 vote)
In order to perform some cleanup if the service fails, you can use ExecStopPost=, which is executed whether the service succeeds or not.
In the code you run at ExecStopPost=, you can use one of $SERVICE_RESULT, $EXIT_CODE or $EXIT_STATUS to determine the failure condition and act accordingly. See the documentation on those environment variables to check which one is appropriate for you.
Then you can use Restart=on-failure so that systemd tries to restart your unit when it fails.
Putting it all together, this is what it would look like. Assuming that run_program will exit with status 2 whenever the files are corrupted (hopefully you can adapt this to other failure scenarios from the documentation above), this should work:
[Service]
ExecStart=/bin/run_program
ExecStopPost=/bin/sh -c 'if [ "$$EXIT_STATUS" = 2 ]; then rm /file/to/delete; fi'
Restart=on-failure
(NOTE: The double dollar-sign $$ is to escape this to systemd, so the shell sees $EXIT_STATUS and accesses that variable. Using a single dollar-sign would also work, but then systemd would do that replacement instead and the shell would see [ "2" = 2 ], which arguably also works. Anyway, you can bypass most of that by putting all this logic into a shell script and calling it by its full path in ExecStopPost=. That would probably be better, and you could also easily add more commands to the script, such as logging the action taken to recover from the error condition.)
Hopefully this will give you enough pointers to figure out how to configure this correctly given your particular situation!
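The shell-script variant mentioned in the note above might look like the following. It is only a sketch: the function name cleanup_on_failure and passing the file as an argument are assumptions, and it treats any non-zero exit status as a sign of corruption, which may be too broad for your program:

```shell
#!/bin/sh
# Hypothetical cleanup helper meant to be called from ExecStopPost=.
# systemd exports SERVICE_RESULT, EXIT_CODE and EXIT_STATUS to such commands.

cleanup_on_failure() {
    # $1: file to delete when the main process exited with a non-zero status
    target=$1
    # EXIT_CODE is "exited" for a normal exit; EXIT_STATUS is then the numeric code
    if [ "${EXIT_CODE:-}" = "exited" ] && [ "${EXIT_STATUS:-0}" != "0" ]; then
        rm -f -- "$target"
        # Log the recovery action; output ends up in the journal under systemd
        echo "removed $target after exit status $EXIT_STATUS"
    fi
}

# Only act when a file path was passed on the command line
if [ "$#" -gt 0 ]; then
    cleanup_on_failure "$1"
fi
```

Saved under a path of your choosing (for example a hypothetical /usr/local/bin/software-cleanup) and made executable, it would be called as ExecStopPost=/usr/local/bin/software-cleanup /file/to/delete. No $$ escaping is needed in that case, because the script reads the variables from its own environment.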
answered May 3 at 21:10
Filipe Brandenburger
Thank you, this is a better way of doing my current solution. However, it does not explain how to properly use the OnFailure command in systemd, which was my original question. I find the systemd documentation can be a bit hard to interpret at times. Also, shouldn't I check for exit status 1, rather than 2?
– trippelganger
May 4 at 7:43
Checking exit status 2 was an example, in case your program does something specific in case of file corruption (some programs use different exit statuses to signal specific conditions). A more general solution would be checking for "$$EXIT_STATUS" != 0, but the best is for you to look at your program's documentation (or source code), look at the three variables systemd uses, and try to figure out the best way to detect the error condition in your specific case.
– Filipe Brandenburger
May 4 at 15:41
Indeed, I did not explain why OnFailure= was not working for you. I focused on the problem you were trying to solve, and I think OnFailure= is not the best solution for that... I added another answer to address OnFailure=. It seems the problem is that you have it in the wrong section ([Service], while it should be in [Unit]). I hope that's helpful to you as well. In any case, for the specific problem you described, ExecStopPost= and checking for exit status (or service result, etc.) is probably a better approach.
– Filipe Brandenburger
May 4 at 16:03
Isn't this kind of situation exactly what OnFailure is used for?
– trippelganger
May 5 at 3:04
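For reference, a minimal sketch of what the comments above describe, with OnFailure= moved into the [Unit] section. The unit names, descriptions, and paths are the ones from the question; Type=oneshot, the -f flag, and the absolute /bin/systemctl path are assumptions added here, and the fragment has not been tested against the asker's setup.

```ini
# ~/.config/systemd/user/software.service
[Unit]
Description=Software
# OnFailure= is a [Unit] option; it is not recognised inside [Service].
OnFailure=software-fail.service

[Service]
ExecStart=/bin/run_program

# ~/.config/systemd/user/software-fail.service
# Started by systemd only when software.service enters the failed state,
# so it does not need to be enabled.
[Unit]
Description=Delete corrupt files

[Service]
Type=oneshot
ExecStart=/bin/rm -f /file/to/delete
ExecStart=/bin/systemctl --user start software.service
```

If the program keeps failing for a reason other than the deleted file, this arrangement would keep restarting it until systemd's start rate limit kicks in, which is one reason the ExecStopPost= approach with an exit-status check may be the safer choice.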
Is there any journal output from your first service file? How about the second service file? That is, can you tell if the first service reaches the failure condition at all? You should update the question with the (relevant) journal output.
– Chiraag
May 3 at 16:26
@Chiraag Updated.
– trippelganger
May 4 at 7:10
@trippelganger Please edit your question to make some corrections. start_if_fail.service should probably be software-fail.service (from your later systemctl status output) and ExecStop=systemctl ... fails because the path is not absolute, so I imagine what you really have is ExecStop=/bin/systemctl ... (though it's possible this might also be part of the problem you're having.)
– Filipe Brandenburger
May 4 at 16:05