Optimize ext4 for always full operation
Our application writes data to disk as a huge ring buffer (30 to 150TB); writing new files while deleting old files. As such, by definition, the disk is always "near full".
The writer process creates various files at a net input speed of about 100-150 Mbit/s. Data files are a mixture of 1GB 'data' files and several smaller metadata files. (The input speed is constant, but note that new file sets are created only once every two minutes.)
There is a separate deleter process which deletes the "oldest" files every 30s. It keeps deleting until it reaches 15GB of free-space headroom on the disk.
So in stable operation, all data partitions have only 15GB free space.
On this SO question relating to file system slowdown, DepressedDaniel commented:
Sync hanging just means the filesystem is working hard to save the latest operations consistently. It is most certainly trying to shuffle data around on the disk in that time. I don't know the details, but I'm pretty sure if your filesystem is heavily fragmented, ext4 will try to do something about that. And that can't be good if the filesystem is nearly 100% full. The only reasonable way to utilize a filesystem at near 100% of capacity is to statically initialize it with some files and then overwrite those same files in place (to avoid fragmenting). Probably works best with ext2/3.
Is ext4 a bad choice for this application? Since we are running live, what tuning can be done to ext4 to avoid fragmentation, slowdowns, or other performance limitations? Changing from ext4 would be quite difficult...
(and re-writing statically created files means rewriting the entire application)
Thanks!
EDIT I
The server has 50 to 100 TB of disks attached (24 drives). The Areca RAID controller manages the 24 drives as a single RAID-6 set.
From there we divide into several partitions/volumes, with each volume being 5 to 10TB. So the size of any one volume is not huge.
The "writer" process finds the first volume with "enough" space and writes a file there. After the file is written the process is repeated.
For a brand new machine, the volumes are filled up in order. If all volumes are "full" then the "deleter" process starts deleting the oldest files until "enough" space is available.
Over a long time, because of the action of other processes, the time sequence of files becomes randomly distributed across all volumes.
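For concreteness, the volume-selection step can be pictured roughly as in the sketch below. This is illustrative only, not our actual code; the mount-point names and the free-space threshold are placeholders.

    /* Illustrative sketch of the writer's volume selection (not the real code):
     * walk the data volumes in order and pick the first one with enough free
     * space for the next ~1GB data file.  Mount points are hypothetical. */
    #include <stdio.h>
    #include <sys/statvfs.h>

    static const char *volumes[] = { "/data01", "/data02", "/data03" };  /* placeholder list */

    /* Returns an index into volumes[], or -1 if no volume currently has room. */
    static int pick_volume(unsigned long long needed_bytes)
    {
        for (unsigned i = 0; i < sizeof volumes / sizeof volumes[0]; i++) {
            struct statvfs sv;
            if (statvfs(volumes[i], &sv) != 0)
                continue;                        /* skip volumes we cannot stat */
            unsigned long long free_bytes =
                (unsigned long long)sv.f_bavail * sv.f_frsize;
            if (free_bytes >= needed_bytes)
                return (int)i;                   /* first volume with "enough" space */
        }
        return -1;                               /* all "full": the deleter must free space */
    }

    int main(void)
    {
        int v = pick_volume(2ULL * 1024 * 1024 * 1024);   /* 1GB file plus margin */
        if (v >= 0)
            printf("writing next file set to %s\n", volumes[v]);
        else
            printf("no volume has enough space yet\n");
        return 0;
    }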
filesystems ext4 ext3
asked Dec 5 '16 at 12:52 by Danny; edited Jan 1 at 13:22 by sourcejedi
When creating the large data files, do you already use fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, length) to allocate the disk space before writing to the file? Could you use a "fixed" allocation size for the large data files (assuming they don't have much variation in size)? This is a difficult case, because the smaller metadata files may cause fragmentation of the large files. Could you use different partitions for the large data files and small metadata files?
– Nominal Animal
Dec 5 '16 at 13:35
Do you have any reader processes? Do they read the oldest data files, or is it random?
– Mark Plotnick
Dec 5 '16 at 22:15
All files are opened with fopen() and no pre-allocation is done. Using different partitions would be difficult. For the large files I could pre-allocate using a heuristic guess of the size. But the final size could be different. Would the allocated space be returned to "free" after fclose() ?
– Danny
Dec 6 '16 at 5:32
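A minimal sketch of the preallocate-then-trim pattern being discussed (illustrative only, not the application's code): reserve a heuristic size with fallocate(), then ftruncate() to the bytes actually written before closing. Closing the file by itself does not return the unused allocation; the explicit truncate does. The file name and sizes are placeholders.

    /* Sketch only: preallocate a heuristic 1GB with fallocate(), write the
     * data, then trim the file to its real length so the unused tail goes
     * back to free space.  fclose()/close() alone would not release it. */
    #define _GNU_SOURCE
    #include <fcntl.h>      /* open(), fallocate(); older systems may also need <linux/falloc.h> */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t guess = 1024L * 1024L * 1024L;   /* heuristic size guess */
        int fd = open("data-000001.bin", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve the space in one go so the allocation stays as contiguous
         * as the free space allows; fall back to plain writes on failure. */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, guess) != 0)
            perror("fallocate (non-fatal)");

        const char payload[] = "stand-in for the real data stream\n";
        ssize_t written = write(fd, payload, strlen(payload));
        if (written < 0) { perror("write"); return 1; }

        /* The final size differs from the guess: give the surplus back. */
        if (ftruncate(fd, written) != 0) { perror("ftruncate"); return 1; }

        close(fd);
        return 0;
    }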
Mark, yes there are reader processes. The 'deleter' reads directory information and some of the metadata files. Also, the big data files could be read by the player app. (The application is similar to a video server, with constant bit rate in for the recorder and, if activated, constant bit rate out for the player.)
– Danny
Dec 6 '16 at 5:34
1) IMO it would be better if you could make this question self-sufficient. If you were asking a hypothetical question, one answer would be to test it. But you've tested it and found at least one BIG problem; that's the most important reason you're asking, right? 2) Secondly - I was modelling the algorithms you gave as the only significant IO load on this storage. I'm not sure exactly what I'm supposed to understand from the edit mentioning other processes which cause a different distribution of files.
– sourcejedi
Jan 1 at 11:36
2 Answers
In principle, I don't see why strict ring-buffer writes would pose any challenge regarding fragmentation. It seems like it would be straightforward. The quote sounds to me like it is based on advice from more general write workloads. But looking at the linked SO question I see you have a real problem...
Since you are concerned about fragmentation, you should consider how to measure it! e4defrag exists. It has only two options: -c only shows the current state and does not defrag, and -v shows per-file statistics. All combinations of options are valid (including no options). Although it does not provide any explicit method to limit the performance impact on a running system, e4defrag supports being run on individual files, so you can rate-limit it yourself. (XFS also has a defrag tool, though I haven't used it.)
e2freefrag can show free space fragmentation. If you use the CFQ IO scheduler, then you can run it with a reduced IO priority using ionice.
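As a rough illustration (a sketch, not part of the original answer): the number of extents backing a file, which is essentially the figure filefrag prints and the information the defrag tools examine, can also be read programmatically through the FIEMAP ioctl.

    /* Sketch: count the extents backing one file via the FIEMAP ioctl.
     * A 1GB file written sequentially into contiguous free space should
     * map to only a handful of extents. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>       /* FS_IOC_FIEMAP */
    #include <linux/fiemap.h>   /* struct fiemap, FIEMAP_FLAG_SYNC */

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 2; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct fiemap fm;
        memset(&fm, 0, sizeof fm);
        fm.fm_start = 0;
        fm.fm_length = ~0ULL;          /* map the whole file */
        fm.fm_flags = FIEMAP_FLAG_SYNC;
        fm.fm_extent_count = 0;        /* 0 = just report how many extents exist */

        if (ioctl(fd, FS_IOC_FIEMAP, &fm) != 0) { perror("FS_IOC_FIEMAP"); return 1; }
        printf("%s: %u extent(s)\n", argv[1], fm.fm_mapped_extents);
        close(fd);
        return 0;
    }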
The quote guesses wrong; the reply by Stephen Kitt is correct. ext4 does not perform any automatic defragmentation. It does not try to "shuffle around" data which has already been written.
Discarding this strange misconception leaves no reason to suggest "ext2/ext3". Apart from anything else, the ext3 code does not exist in current kernels; the ext4 code is used to mount ext3, since ext3 is a subset of ext4. In particular, when you are creating relatively large files, it just seems silly not to use extents, and those are an ext4-specific feature.
I believe "hanging" is more often associated with the journal. See e.g. comments from (the in-progress filesystem) bcachefs -
Tail latency has been the bane of ext4 users for many years - dependencies in the journalling code and elsewhere can lead to 30+ second latencies on simple operations (e.g. unlinks) on multithreaded workloads. No one seems to know how to fix them.
In bcachefs, the only reason a thread blocks on IO is because it explicitly asked to (an uncached read or an fsync operation), or resource exhaustion - full stop. Locks that would block foreground operations are never held while doing IO. While bcachefs isn't a realtime filesystem today (it lacks e.g. realtime scheduling for IO), it very conceivably could be one day.
Don't ask me to interpret the extent to which using XFS can avoid the above problem. I don't know. But if you were considering testing an alternative filesystem setup, XFS is the first thing I would try.
I'm struggling to find much information about the effects of disabling journalling on ext4. At least it doesn't seem to be one of the common options considered when tuning performance.
I'm not sure why you're using sys_sync(). It's usually better avoided (see e.g. here). I'm not sure that really explains your problem, but it seems an unfortunate thing to come across when trying to narrow this down.
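If the intent of that sync is only to flush one data volume rather than every mounted filesystem (an assumption on my part), a narrower alternative is Linux's syncfs(), which flushes just the filesystem containing a given file descriptor. A sketch:

    /* Sketch: flush only the filesystem holding the ring buffer, instead of a
     * global sync().  syncfs() is Linux-specific (glibc 2.14+); the mount
     * point below is a placeholder for one of the data volumes. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/data01", O_RDONLY | O_DIRECTORY);   /* hypothetical mount point */
        if (fd < 0) { perror("open"); return 1; }
        if (syncfs(fd) != 0) { perror("syncfs"); return 1; }
        close(fd);
        return 0;
    }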
answered Dec 31 '18 at 21:03 by sourcejedi; edited Jan 1 at 14:37
Here's an alternate approach; however, it's somewhat involved.
Create many smaller partitions, let's say 10 or 20 of them. LVM2 might come in handy in this scenario. Then use the partitions in a ring-buffer fashion as follows:
One of the partitions would always be the 'active' one, where new data gets written to until it is completely full or nearly so. You don't need to leave any headroom. When the active partition has become full or doesn't have enough free space to hold the next chunk of data, switch to the next partition which then becomes the active one.
Your deleter process will always make sure that there is at least one completely empty partition available. If there isn't one--and this is the crucial part--it will simply reformat the oldest partition, creating a fresh new file system. This new partition will later be able to receive new data with minimal to no fragmentation.
answered Dec 31 '18 at 20:27 by jlh
I didn't mention in the question, but that's actually what we do. See the edited question above.
– Danny
Jan 1 at 3:22
@Danny if "the time sequence of files becomes randomly distributed across all volumes", then surely you cannot actually do "and this is the crucial part--simply reformat the oldest partition, creating a fresh new file system. This new partition will later be able to receive new data with minimal to no fragmentation."
– sourcejedi
Jan 1 at 14:16
Sorry, my bad. Somehow didn't see/read your last two paragraphs. We have 10-12 smaller partitions, but the deleter removes only the oldest files (1GB each) until "enough" free space is available. Then it stops and waits for the disk to be "too full" again. "enough" and "too full" can be adjusted for tuning.
– Danny
Jan 2 at 7:36