How do I work out what's trashing my RAID?
I have an x86_64 Ubuntu 17.10 install (stock 4.13 kernel) with an SSD and three 1TB WD HDDs, each with a 750GB partition that's used in a 1.45TB RAID5 array. The SSD has my / on it, and the RAID array holds LVM, which I use for /home. /proc/mdstat shows:
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdc1[3] sdd1[1] sdb1[0]
1572601856 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
[====>................] resync = 21.3% (168261416/786300928) finish=64.7min speed=159157K/sec
bitmap: 6/6 pages [24KB], 65536KB chunk
It worked fine until about Christmas time; since then I've repeatedly turned my computer on and found:
[ 2.334153] md/raid:md0: not clean -- starting background reconstruction
[ 2.334164] md/raid:md0: device sdc1 operational as raid disk 2
[ 2.334165] md/raid:md0: device sdd1 operational as raid disk 1
[ 2.334165] md/raid:md0: device sdb1 operational as raid disk 0
[ 2.334333] md/raid:md0: raid level 5 active with 3 out of 3 devices, algorithm 2
[ 2.334479] md0: bitmap file is out of date (39126 < 39127) -- forcing full recovery
[ 2.334493] md0: bitmap file is out of date, doing full recovery
[ 2.422418] md0: detected capacity change from 0 to 1610344300544
[ 2.422606] md: resync of RAID array md0
...
[ 9.537010] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
So to be clear: it's the bitmap itself that is out of date, and thus a full (slow) resync takes place. The filesystem itself comes up clean. I assume it's a timing problem on shutdown, and LVM is being unmounted but the RAID is not halted before poweroff? I can't see any odd behaviour when I turn the machine off; the syslogs show some things shutting down, and that's it.
Performing a halt instead of a poweroff drastically reduces the chances of this happening, but it still happened this morning, hence finally writing about it after being out of ideas for three months.
Detail of the RAID array:
/dev/md0:
Version : 1.2
Creation Time : Fri Sep 11 17:49:27 2015
Raid Level : raid5
Array Size : 1572601856 (1499.75 GiB 1610.34 GB)
Used Dev Size : 786300928 (749.88 GiB 805.17 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Mon Apr 2 08:38:28 2018
State : active, resyncing
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Resync Status : 52% complete
Name : underlay:0 (local to host underlay)
UUID : 520c8995:8d934562:0e2f5b8e:d460bfed
Events : 40381
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 49 1 active sync /dev/sdd1
3 8 33 2 active sync /dev/sdc1
I don't even know how to investigate this further. I've set GRUB to disable the splash screen so I can watch dmesg on screen, and I see nothing interesting. Sometimes services fail to exit and systemd waits 90s before killing them; I've not been able to work out which services they are, or whether they'd be the ones that cause a safe unmount but an unsafe RAID (turn off? disable? unmount?). I don't really understand how the kernel normally stops RAID arrays at shutdown, so I can't see what it's doing wrong here.
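One way to check whether the array is actually being stopped at shutdown is to read the kernel log from the *previous* boot, which journalctl can do if persistent journald storage is enabled (i.e. /var/log/journal exists). On a clean stop the md driver normally logs a "stopped" message for the array; if no such line appears before poweroff, something was still holding the array open. A rough sketch (the grep pattern is just an assumption about the message format):

```shell
# Kernel messages (-k) from the previous boot (-b -1), filtered to md lines.
# Requires Storage=persistent (or an existing /var/log/journal) in journald.
journalctl -k -b -1 --no-pager | grep -E 'md0|md:'

# Also show the tail of the previous boot's full journal, to see which
# units were still being stopped right before poweroff:
journalctl -b -1 --no-pager | tail -n 50
```

If the filter shows writes or bitmap updates after the filesystems were unmounted but never a stop message, that points at whatever unit is still using /dev/md0 at poweroff.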
Secondly, any tips on stopping a RAID resync from totally destroying the interactivity of my desktop would be appreciated. IO throttling via /proc/sys/dev/raid/speed_limit_max doesn't work the way I hoped: my computer syncs at full tilt for, say, 10s, then waits 3s, so it syncs more slowly overall but is still annoying to use for two hours.
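For what it's worth, both limits are ordinary sysctls (values in KB/s per device), so they can be lowered for the duration of a resync and restored afterwards. It is speed_limit_min, not just speed_limit_max, that governs how aggressively md resyncs in the face of competing IO. A sketch, run as root; the numbers are illustrative, not recommendations:

```shell
# speed_limit_min is the rate md tries to sustain even when other IO is
# pending; lowering it is what actually makes the resync yield to the desktop.
sysctl -w dev.raid.speed_limit_min=1000
# speed_limit_max caps the rate when the array is otherwise idle.
sysctl -w dev.raid.speed_limit_max=50000

# Watch the effect on the resync rate:
grep -A 2 md0 /proc/mdstat
```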
ubuntu raid software-raid raid5
migrated from serverfault.com Apr 2 at 21:24
This question came from our site for system and network administrators.
Please don't use RAID 5, it's been 'dead' for a decade, nobody uses it, it's dangerous and makes you lose data. – Chopper3, Apr 2 at 8:09
What do you recommend instead? RAID 1? – Widget, Apr 2 at 8:48
Move to ZFS raidz1 if you have the RAM for it. – Geoffrey, Apr 2 at 18:41
I don't fancy an out-of-kernel filesystem, and I'm not convinced there's anything actually wrong with RAID5. – Widget, Apr 2 at 20:54
RAID5 has challenges, but it is hyperbole to describe it as dead or to state that no-one uses it. The two challenges (off the top of my head) with RAID5 at high capacities are that 1) it can detect media errors (though it generally does not) but is generally unable to correct them, and 2) if a disk fails and must be replaced, the chances are good that you will experience another error during the rebuild, when you have no redundancy. RAID6 is better, but provides N-2 capacity instead of N-1. – Slartibartfast, Apr 3 at 2:40
asked Apr 2 at 7:46 by Widget
1 Answer
The problem turned out to be a network mount in my fstab that was sometimes hanging on shutdown. I'm not sure why, as the network mount wasn't on a mountpoint inside the RAID filesystem; both were mounted under /, which is on my SSD.
I only really spotted it because migrating to 18.04 didn't fix the problem, and I had delays on startup that turned out to be related to the network mount.
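If anyone hits the same symptom: systemd's mount options in fstab can bound how long a network mount is allowed to hang. The option names below are standard systemd.mount(5) options; the server and share paths are hypothetical, just to show the shape of the entry:

```
# /etc/fstab -- hypothetical NFS share.
# _netdev orders the mount against the network being up/down,
# x-systemd.automount defers mounting until first access,
# x-systemd.mount-timeout bounds how long systemd waits on it.
server:/export  /mnt/share  nfs  _netdev,x-systemd.automount,x-systemd.mount-timeout=30s  0  0
```

With a bounded timeout, a wedged network mount gets killed instead of delaying the rest of the shutdown sequence past the point where md can be stopped cleanly.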
answered Jul 29 at 11:50 by Widget