How do I work out what's trashing my RAID?
I have an x86_64 Ubuntu 17.10 install (stock 4.13 kernel) with an SSD and three 1TB WD HDDs, each with a 750GB partition that's used in a 1.45TB RAID5 array. The SSD has my / on it, and the RAID array holds LVM, which I use for /home. /proc/mdstat shows:
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdc1[3] sdd1[1] sdb1[0]
1572601856 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
[====>................] resync = 21.3% (168261416/786300928) finish=64.7min speed=159157K/sec
bitmap: 6/6 pages [24KB], 65536KB chunk
It worked fine until about Christmas time; since then I've repeatedly turned my computer on and found:
[ 2.334153] md/raid:md0: not clean -- starting background reconstruction
[ 2.334164] md/raid:md0: device sdc1 operational as raid disk 2
[ 2.334165] md/raid:md0: device sdd1 operational as raid disk 1
[ 2.334165] md/raid:md0: device sdb1 operational as raid disk 0
[ 2.334333] md/raid:md0: raid level 5 active with 3 out of 3 devices, algorithm 2
[ 2.334479] md0: bitmap file is out of date (39126 < 39127) -- forcing full recovery
[ 2.334493] md0: bitmap file is out of date, doing full recovery
[ 2.422418] md0: detected capacity change from 0 to 1610344300544
[ 2.422606] md: resync of RAID array md0
...
[ 9.537010] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
So to be clear: it's the bitmap itself that is out of date, and thus a full (slow) resync takes place. The filesystem itself comes up clean. I assume it's a timing problem on shutdown, and LVM is being unmounted but the RAID is not halted before poweroff? I can't see any odd behaviour when I turn the machine off; the syslogs show some things shutting down, and that's it.
Performing a halt instead of a poweroff drastically reduces the chances of this happening, but it still happened this morning, hence finally writing about it after being out of ideas for three months.
Detail of the RAID array:
/dev/md0:
Version : 1.2
Creation Time : Fri Sep 11 17:49:27 2015
Raid Level : raid5
Array Size : 1572601856 (1499.75 GiB 1610.34 GB)
Used Dev Size : 786300928 (749.88 GiB 805.17 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Mon Apr 2 08:38:28 2018
State : active, resyncing
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Resync Status : 52% complete
Name : underlay:0 (local to host underlay)
UUID : 520c8995:8d934562:0e2f5b8e:d460bfed
Events : 40381
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 49 1 active sync /dev/sdd1
3 8 33 2 active sync /dev/sdc1
I don't even know how to investigate this further. I've set GRUB to disable the splash screen so I can watch dmesg on screen, and I see nothing interesting. Sometimes services fail to exit and systemd waits 90s before killing them; I've not been able to work out which services they are, or whether they'd be the ones that cause a safe unmount but an unsafe RAID (turn off? disable? unmount?). I don't really understand how the kernel normally stops RAID arrays at shutdown, so I can't see what it's doing wrong here.
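One way to check whether the array is actually being stopped at shutdown is to read the kernel log from the *previous* boot, which journalctl can do if persistent journald storage is enabled (i.e. /var/log/journal exists). On a clean stop the md driver normally logs a "stopped" message for the array; if no such line appears before poweroff, something was still holding the array open. A rough sketch (the grep pattern is just an assumption about the message format):

```shell
# Kernel messages (-k) from the previous boot (-b -1), filtered to md lines.
# Requires Storage=persistent (or an existing /var/log/journal) in journald.
journalctl -k -b -1 --no-pager | grep -E 'md0|md:'

# Also show the tail of the previous boot's full journal, to see which
# units were still being stopped right before poweroff:
journalctl -b -1 --no-pager | tail -n 50
```

If the filter shows writes or bitmap updates after the filesystems were unmounted but never a stop message, that points at whatever unit is still using /dev/md0 at poweroff.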
Secondly, any tips on stopping a RAID resync from totally destroying the interactivity of my desktop would be appreciated. IO throttling via /proc/sys/dev/raid/speed_limit_max doesn't work the way I hoped: my computer syncs at full tilt for, say, 10s, then waits 3s, so it syncs more slowly overall but is still annoying to use for two hours.
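For what it's worth, both limits are ordinary sysctls (values in KB/s per device), so they can be lowered for the duration of a resync and restored afterwards. It is speed_limit_min, not just speed_limit_max, that governs how aggressively md resyncs in the face of competing IO. A sketch, run as root; the numbers are illustrative, not recommendations:

```shell
# speed_limit_min is the rate md tries to sustain even when other IO is
# pending; lowering it is what actually makes the resync yield to the desktop.
sysctl -w dev.raid.speed_limit_min=1000
# speed_limit_max caps the rate when the array is otherwise idle.
sysctl -w dev.raid.speed_limit_max=50000

# Watch the effect on the resync rate:
grep -A 2 md0 /proc/mdstat
```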
ubuntu raid software-raid raid5
migrated from serverfault.com Apr 2 at 21:24
This question came from our site for system and network administrators.
Please don't use RAID 5, it's been 'dead' for a decade, nobody uses it, it's dangerous and makes you lose data. – Chopper3, Apr 2 at 8:09
What do you recommend instead? RAID 1? – Widget, Apr 2 at 8:48
Move to ZFS raidz1 if you have the RAM for it. – Geoffrey, Apr 2 at 18:41
I don't fancy an out-of-kernel filesystem, and I'm not convinced there's anything actually wrong with RAID5. – Widget, Apr 2 at 20:54
RAID5 has challenges, but it is hyperbole to describe it as dead or to state that no-one uses it. The two challenges (off the top of my head) with RAID5 at high capacities are that 1) it can detect media errors (though it generally does not) but is generally unable to correct them, and 2) if a disk fails and must be replaced, the chances are good that you will experience another error during the rebuild, when you have no redundancy. RAID6 is better, but provides N-2 capacity instead of N-1. – Slartibartfast, Apr 3 at 2:40
asked Apr 2 at 7:46 by Widget
1 Answer
The problem turned out to be a network mount in my fstab that was sometimes hanging on shutdown. I'm not sure why, as the network mount wasn't on a mountpoint inside the RAID filesystem; both were mounted under /, which is on my SSD.
I only really spotted it because migrating to 18.04 didn't fix the problem, and I had delays on startup that turned out to be related to the network mount.
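If anyone hits the same symptom: systemd's mount options in fstab can bound how long a network mount is allowed to hang. The option names below are standard systemd.mount(5) options; the server and share paths are hypothetical, just to show the shape of the entry:

```
# /etc/fstab -- hypothetical NFS share.
# _netdev orders the mount against the network being up/down,
# x-systemd.automount defers mounting until first access,
# x-systemd.mount-timeout bounds how long systemd waits on it.
server:/export  /mnt/share  nfs  _netdev,x-systemd.automount,x-systemd.mount-timeout=30s  0  0
```

With a bounded timeout, a wedged network mount gets killed instead of delaying the rest of the shutdown sequence past the point where md can be stopped cleanly.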
answered Jul 29 at 11:50 by Widget