dd hanging & uninterruptible sleep (kernel quirk?)

So, this has been annoying me for years.



This happens with more programs than just dd, but I find it comes up most often with programs that involve raw filesystem or device manipulation.



When I'm copying with dd -- e.g., making a bootable USB disk by doing sudo dd if=somelinuxdistro.iso of=/dev/sdb bs=64K status=progress -- it's as if all my signals are ignored by the application (or by the kernel, in the case of SIGKILL). htop shows status D, which apparently means "uninterruptible sleep". The process can stay in this state for ages if there's a hardware glitch, and in regular usage I can't find any way of detaching it from the terminal so I can keep working -- often I just end up switching to a different terminal to finish what I was doing.
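
For reference, a minimal way to confirm the state and see which kernel wait channel the process is stuck in (a sketch using standard procps tools; it assumes the stuck dd is the only dd running, and /proc/PID/stack needs root plus a kernel built with stack tracing):

    ps -o pid,stat,wchan:32,cmd -C dd        # STAT "D" = uninterruptible sleep; WCHAN = kernel wait channel
    sudo cat /proc/"$(pgrep -x dd)"/stack    # kernel stack of the stuck process, if available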



I've looked this up before, but I've never found an explanation of what this state is for, why the kernel refuses to kill a process in this state, or what the recommended course of action is when it happens.



In short:
I'd like a way of reliably force-killing processes in state D, or at least detaching them from the terminal. I'd also like an explanation of what's going on in the background that puts them in this state in the first place.
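
A workaround sketch rather than a fix: starting the copy in the background from the outset keeps the terminal usable even if the process later enters state D (paths are the example ones from above; this assumes sudo doesn't need to prompt for a password):

    sudo dd if=somelinuxdistro.iso of=/dev/sdb bs=64K status=progress &
    jobs -l                    # the shell prompt stays available; inspect the job later
    sudo pkill -USR1 -x dd     # GNU dd prints I/O statistics on SIGUSR1 -- though not while stuck in D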

filesystems kernel dd hang

asked Sep 10 at 19:59 by Alexandria P.
edited Sep 10 at 20:01 by Rui F Ribeiro

  • Have you tried Ctrl+Z?
    – G-Man
    Sep 10 at 20:23
  • You can't kill processes in state D, that's how it is. I have a similar problem on my machine, where apparently I/O gets stuck (kernel bug?). Dropping caches (echo 3 > /proc/sys/vm/drop_caches) helps, see if it helps in your case, too.
    – dirkt
    Sep 11 at 6:05
  • @G-Man Yes. And Ctrl+C, and Ctrl+U, and kill -9 [PID]. :P
    – Alexandria P.
    Sep 11 at 6:47

1 Answer

If you cannot interrupt an "uninterruptible read" and this is not related to a switched-off NFS server, you have discovered a driver bug.



I/O to local background storage should not have a timeout larger than 5-10 minutes. So if you type ^C or ^Z and nothing happens within 10 minutes, there is a driver bug.



The background is that UNIX defines so-called fast I/O as not interruptible by signals, because fast I/O is expected to complete within a foreseeable amount of time.
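
The contrast is visible with "slow" I/O, which is interruptible: a read from a pipe sleeps in state S and reacts to Ctrl+C immediately, while a read from a block device sleeps in state D. A harmless demonstration (the /tmp path is arbitrary):

    mkfifo /tmp/demo.fifo
    cat /tmp/demo.fifo     # blocks waiting for a writer: ps shows state S, and Ctrl+C works instantly
    rm /tmp/demo.fifo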



Making I/O interruptible by signals would cause high overhead, because there would need to be a way back to a clean state: everything that happened after starting the I/O would have to be unwound, and a return could only happen from where the I/O was initiated.



Even worse, if a background-storage driver did implement interruptible I/O, it would cause unmanageable problems in the filesystems layered above it. You are using a driver that is intended to serve as background storage for a filesystem...



You could run dmesg and check the kernel messages for your problem. If interrupting really does not work after 10 minutes (when one read or write system call is expected to time out and there is a chance to kill dd between two such syscalls), you need to reboot.
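
A sketch of what to look for: on Linux, the hung-task watchdog (khungtaskd) logs tasks stuck in state D, including a kernel stack trace that usually points at the driver involved:

    dmesg | grep -iA2 'blocked for more than'    # hung-task reports, stack traces follow each one
    sysctl kernel.hung_task_timeout_secs         # watchdog threshold in seconds (0 = disabled)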



If this is a USB device, you could try to pull the device before you reboot.
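
If pulling the drive physically is inconvenient, Linux can also detach the device from its driver via sysfs, which may or may not unwedge the pending I/O. A sketch; the bus ID 1-1 below is hypothetical and must be looked up first:

    lsusb -t                                                  # find the stuck device's bus/port ID
    echo '1-1' | sudo tee /sys/bus/usb/drivers/usb/unbind     # detach the (hypothetical) device 1-1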

answered Sep 10 at 20:51 by schily (edited Sep 11 at 7:32)

  • I understand that there may be problems with the driver or with the hardware -- but this doesn't address my question of how to deal with the workflow problem. Nor does it explain why fast IO is designed to be non-interruptable when it potentially causes problems by assuming the backend is working properly. What advantage is there to blocking the user from killing a process like this? If the user is trying to force kill it, they surely already understand that file corruption is likely.
    – Alexandria P.
    Sep 10 at 21:04

  • Can you clarify what you meant by "Making IO interruptable by signals causes a high overhead as there is a need to go back to a clean state"?
    – Alexandria P.
    Sep 11 at 3:33

  • Hmmm. I'm still a little confused. Why can't the kernel just drop all the IO stuff it was doing? I mean, that's seemingly what happens when the device gets physically unplugged -- so it can't be any worse than unplugging a device.
    – Alexandria P.
    Sep 12 at 7:25