Understanding smartctl and hard-drive errors

I have a raidz2 ZFS pool. Two of its disks started returning I/O errors, and ZFS then marked them as faulted. (The dmesg log is linked from the original post.)

I removed the disks and ran some tests on them. smartctl reports:

DISK 1 (full log linked from the original post): SMART Health Status: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]
DISK 2 (full log linked from the original post): SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
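Both health lines above carry additional sense code 5d, which the SCSI standard assigns to "failure prediction threshold exceeded"; the ascq value narrows down the reason. A minimal sketch for pulling those codes out of such a line in a script follows. The function names `parse_sense` and `classify_health` are made up for illustration, and the classification is deliberately crude (anything that is neither "OK" nor asc=5d is reported as unknown).

```shell
#!/bin/sh
# Extract asc and ascq from a "SMART Health Status" line as printed by
# smartctl for SCSI/SAS devices. Prints "asc ascq", or nothing if the line
# carries no sense codes. (parse_sense is a hypothetical helper name.)
parse_sense() {
    printf '%s\n' "$1" |
        sed -n 's/.*\[asc=\([0-9a-fA-F]*\), ascq=\([0-9a-fA-F]*\)\].*/\1 \2/p'
}

# Crude classification: "ok" when the drive reports OK, "failing" when
# asc=5d (failure prediction threshold exceeded) is present, else "unknown".
classify_health() {
    case "$1" in
        *"SMART Health Status: OK"*) echo ok ;;
        *"asc=5d"*) echo failing ;;
        *) echo unknown ;;
    esac
}

# Example with the DISK 2 line quoted in the question.
line='SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]'
parse_sense "$line"
classify_health "$line"
```

In practice the input line would come from something like `smartctl -H /dev/sdX`, where `/dev/sdX` stands in for the real device.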

I created a new pool on "DISK 1" and started a fio test, but I did not see any I/O errors on the disk, nothing like the earlier ones. The disk is working normally. I also created a pool with 4 disks, and disk utilization was normal there too.

I ran this test for 4 days without encountering an error. The disk currently behaves like the others.

fio --randrepeat=0 --ioengine=libaio --name=test --filename=/disktest/fiofile \
    --bs=1024k --iodepth=64 --size=5T --readwrite=readwrite --rwmixread=60 --numjobs=20

I have a few questions:

1- Why does the disk no longer give errors?

2- If the disk is working normally, why did it cause I/O errors in the first pool?

3- What is the best way to tell whether a hard drive is faulty?

4- How can we reset the hard drive's error counters?

5- Is the disk garbage or not?

The disks are attached as follows: Controller -> LSI3008HBA -> 2x SAS cable -> "SC946ED-R2KJBOD" 2x expander -> multipath SAS disks.

edited Aug 14 at 9:25

asked Aug 13 at 12:29

Morphinz

First thing I'd do is to look at the SMART attributes, but -d scsi prevents them from being shown. You didn't say how your disks are attached, so if possible, try again without -d scsi. VALUE is normalized to 100, lower is worse.
– dirkt
Aug 13 at 18:39

@dirkt Controller -> LSI3008HBA -> 2x SAS cable -> "SC946ED-R2KJBOD" 2x expander -> multipath SAS disks. I use -d scsi because I could not get it to work any other way. :) What is your advice?
– Morphinz
Aug 14 at 9:19

For LSI controllers, try -d megaraid,N with a suitable N, see here.
– dirkt
Aug 14 at 10:25

@dirkt It's an HBA, not a RAID card. Your method works on RAID cards.
– Morphinz
Sep 5 at 12:13

Even if it has "raid" in the name, it might also work on other LSI controllers, so it's worth a try. More specifically, it will work on any hardware that supports this particular access method. If your card doesn't support it, then it doesn't; in that case there will probably be no way to get at this information unless you dig up a datasheet that describes how to send SMART commands for your controller.
– dirkt
Sep 5 at 14:08
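The device types discussed in these comments can be tried one after another. The following is only a sketch: `try_smartctl` is a made-up helper name, `/dev/sdX` is a placeholder, and `megaraid,0` assumes N=0, which may well be wrong for a given setup (or not apply at all on a plain HBA). It relies only on smartctl's `-d TYPE` and `-x` options and skips the actual call when smartctl or the device is missing.

```shell
#!/bin/sh
# Try several smartctl device types against one device and show what each
# attempt returns. try_smartctl is a hypothetical helper; the list of types
# and the megaraid index 0 are assumptions to adjust per system.
try_smartctl() {
    dev=$1
    for type in auto scsi "megaraid,0"; do
        echo "== smartctl -d $type -x $dev"
        if command -v smartctl >/dev/null 2>&1 && [ -e "$dev" ]; then
            # -x prints all available information for the device.
            smartctl -d "$type" -x "$dev" || echo "(exit status $?)"
        else
            echo "(skipped: smartctl or $dev not available)"
        fi
    done
}

try_smartctl /dev/sdX
```

Running `smartctl --scan` first may help to see which device names and types smartctl detects on its own.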

linux hard-disk zfs smartctl

1 Answer

1- Some faults can come and go. Nothing guarantees that you will be warned before a disk dies, but once SMART starts reporting impending-failure errors it's better not to risk it: just replace the drive.

2- Errors can come and go because the disk sometimes keeps retrying a problem region until it succeeds (at which point it will generally try to avoid using that region again if it can).

3- You could run a long SMART self-test and/or read/write every LBA in use (a ZFS scrub reads and verifies every block in use in the pool; resilvering is the related rebuild that happens when a device is replaced). Watch out though: these might make the disk fail for good...

4- You can't.

5- Hard to say, but let's put it another way: is the money saved by not replacing it unnecessarily worth the risk of having it suddenly fail?
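The self-test and scrub suggestion in this answer might look like the sketch below. The device `/dev/sdX` and the pool name `tank` are placeholders, and `run` and `check_disk` are made-up helper names. By default the script only prints the commands; setting `RUN=1` executes them, which, as the answer warns, can push a marginal disk over the edge.

```shell
#!/bin/sh
# Dry-run wrapper: print the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Exercise one disk and the pool it belongs to.
check_disk() {
    dev=$1
    pool=$2
    run smartctl -t long "$dev"      # start the extended self-test (runs in the background on the drive)
    run smartctl -l selftest "$dev"  # check later: self-test log with the result
    run zpool scrub "$pool"          # read and verify every block in use in the pool
    run zpool status -v "$pool"      # scrub progress and per-device error counts
}

check_disk /dev/sdX tank
```

The self-test log and `zpool status` are meant to be checked again after the test and the scrub have finished, which can take many hours on large disks.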

answered Aug 14 at 6:46

Anon

@Morphinz did this answer your questions?
– Anon
Sep 6 at 3:06

Thank you for your answer, but what it tells me is that I won't have any control over my disks. My problem is getting information about my disks and being able to control them. Because of one faulted disk I'm facing a suspend issue. When the problem starts making noise I need to figure out which disk it is and stop it, or else the kernel, multipathd, or ZFS should do this for me. I don't want to reboot my server because of one disk. For example, I could watch dmesg, and when I see "attempting task abort" or I/O errors, set the disk offline. Or I could flush the multipath map and drop the disk via "/sys/block". That way the disk would no longer be a problem.
– Morphinz
Sep 11 at 8:39

Even when I try "zpool import mypool", the faulted disk causes an HBA reset while ZFS is searching the disks. That makes "zpool import" hang for 1-2 minutes, after which its output shows every disk as "Unavail, or faulted". This issue is really important and I can't understand why nobody cares about it. Maybe my Broadcom 3008 HBA (LSI) cards are causing this, and that's why no other users have the problem. I really, really need to stop this HBA reset issue. If I can disable the code in the kernel, I will... Or if I need to replace these HBA cards, I will...
– Morphinz
Sep 11 at 8:45

You might be better off asking a new ZFS-specific question, as your comments show you need a different answer from those to your original 5 questions...
– Anon
Sep 12 at 18:16

(FYI: grox.net/sysadm/unix/linux_disk_hotplug_helpful_commands talks about using echo offline > /sys/block/<blockdev>/device/state to offline a disk.)
– Anon
Sep 12 at 18:19
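The sysfs write mentioned in the last comment could be wrapped as in the sketch below. `offline_disk` is a made-up helper name, and `SYSFS_ROOT` is a hypothetical override added here only so the function can be tried against a scratch directory instead of the real `/sys`. On a live system the write needs root, and whether multipathd and ZFS react gracefully to the path going offline is not something this sketch checks.

```shell
#!/bin/sh
# Set a SCSI disk offline by writing "offline" to its sysfs state file.
# $1 is the block device name without /dev/, for example sdq.
# SYSFS_ROOT (default /sys) is a hypothetical override for dry experiments.
offline_disk() {
    name=$1
    state_file="${SYSFS_ROOT:-/sys}/block/$name/device/state"
    if [ ! -w "$state_file" ]; then
        echo "cannot write $state_file" >&2
        return 1
    fi
    echo offline > "$state_file"
    # Read the state back so the caller can confirm the change took effect.
    cat "$state_file"
}
```

A call would look like `offline_disk sdq`, where `sdq` stands in for the misbehaving disk's kernel name.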
