Optimize ext4 for always full operation

Our application writes data to disk as a huge ring buffer (30 to 150 TB), writing new files while deleting old files. As such, by definition, the disk is always "near full".



The writer process creates various files at a net input rate of about 100-150 Mbit/s. The files are a mixture of 1 GB 'data' files and several smaller metadata files. (The input rate is constant, but new file sets are created only once every two minutes.)



There is a separate deleter process which deletes the "oldest" files every 30 s. It keeps deleting until it reaches 15 GB of free-space headroom on the disk.



So in stable operation, all data partitions have only 15GB free space.
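
For context, a minimal sketch of what such a deleter loop might look like (hypothetical code, not the author's; it assumes the data files sit in one flat directory and that the file with the oldest modification time is the one to drop):

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    #define HEADROOM (15ULL * 1024 * 1024 * 1024)   /* 15 GB of free space */

    /* Free space (in bytes) on the filesystem containing 'path'. */
    static unsigned long long free_bytes(const char *path)
    {
        struct statvfs vfs;
        if (statvfs(path, &vfs) != 0) { perror("statvfs"); exit(1); }
        return (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    }

    /* Delete the regular file in 'dir' with the oldest mtime; 0 on success. */
    static int delete_oldest(const char *dir)
    {
        DIR *d = opendir(dir);
        if (!d) return -1;

        char oldest[4096] = "";
        time_t oldest_mtime = 0;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            char path[4096];
            struct stat st;
            snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
            if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
                continue;
            if (oldest[0] == '\0' || st.st_mtime < oldest_mtime) {
                snprintf(oldest, sizeof oldest, "%s", path);
                oldest_mtime = st.st_mtime;
            }
        }
        closedir(d);
        return oldest[0] ? unlink(oldest) : -1;   /* -1: nothing left to delete */
    }

    int main(void)
    {
        const char *dir = "/data/vol0";           /* hypothetical mount point */
        for (;;) {
            while (free_bytes(dir) < HEADROOM)
                if (delete_oldest(dir) != 0)
                    break;
            sleep(30);                            /* the question's 30 s cadence */
        }
    }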



On this SO question relating to file system slowdown, DepressedDaniel commented:




Sync hanging just means the filesystem is working hard to save the
latest operations consistently. It is most certainly trying to shuffle
data around on the disk in that time. I don't know the details, but
I'm pretty sure if your filesystem is heavily fragmented, ext4 will
try to do something about that. And that can't be good if the
filesystem is nearly 100% full. The only reasonable way to utilize a
filesystem at near 100% of capacity is to statically initialize it
with some files and then overwrite those same files in place (to avoid
fragmenting). Probably works best with ext2/3.




Is ext4 a bad choice for this application? Since we are running live, what tuning can be done to ext4 to avoid fragmentation, slowdowns, or other performance limitations? Changing from ext4 would be quite difficult...



(and re-writing statically created files means rewriting the entire application)



Thanks!



EDIT I



The server has 50 to 100 TB of disks attached (24 drives). The Areca RAID controller manages the 24 drives as a RAID-6 set.



From there we divide into several partitions/volumes, with each volume being 5 to 10TB. So the size of any one volume is not huge.



The "writer" process finds the first volume with "enough" space and writes a file there. After the file is written the process is repeated.



For a brand new machine, the volumes are filled up in order. If all volumes are "full" then the "deleter" process starts deleting the oldest files until "enough" space is available.



Over a long time, because of the action of other processes, the time sequence of files becomes randomly distributed across all volumes.
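
A minimal sketch of the volume-selection step described above (hypothetical, not the application's code; it assumes the caller passes in the list of mount points):

    #include <stddef.h>
    #include <sys/statvfs.h>

    /* Return the index of the first volume with at least 'needed' free bytes,
     * or -1 if none qualifies (the cue for the deleter to start working). */
    int pick_volume(const char *const volumes[], size_t nvolumes,
                    unsigned long long needed)
    {
        for (size_t i = 0; i < nvolumes; i++) {
            struct statvfs vfs;
            if (statvfs(volumes[i], &vfs) != 0)
                continue;                 /* skip volumes we cannot stat */
            unsigned long long avail =
                (unsigned long long)vfs.f_bavail * vfs.f_frsize;
            if (avail >= needed)
                return (int)i;
        }
        return -1;
    }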










  • When creating the large data files, do you already use fallocate(fd,FALLOC_FL_ZERO_RANGE,0,length) to allocate the disk space before writing to the file? Could you use a "fixed" allocation size for the large data files (assuming they don't have much variation in size)? This is a difficult case, because the smaller metadata files may cause fragmentation of the large files. Could you use different partitions for the large data files and small metadata files?

    – Nominal Animal
    Dec 5 '16 at 13:35











  • Do you have any reader processes? Do they read the oldest data files, or is it random?

    – Mark Plotnick
    Dec 5 '16 at 22:15











  • All files are opened with fopen() and no pre-allocation is done. Using different partitions would be difficult. For the large files I could pre-allocate using a heuristic guess of the size. But the final size could be different. Would the allocated space be returned to "free" after fclose() ?

    – Danny
    Dec 6 '16 at 5:32











  • Mark, yes there are reader processes. The 'deleter' reads directory information and some of the meta-data files. Also, the big data files could be read by the player app. (The application is similar to a video server, with constant bit rate in for the recorder and (if activated) constant bit rate out for the player.)

    – Danny
    Dec 6 '16 at 5:34











  • 1) IMO it would be better if you could make this question self-sufficient. If you were asking a hypothetical question, one answer would be to test it. But you've tested it and found at least one BIG problem; that's the most important reason you're asking, right? 2) Secondly - I was modelling the algorithms you gave as the only significant IO load on this storage. I'm not sure exactly what I'm supposed to understand from the edit mentioning other processes which cause a different distribution of files.

    – sourcejedi
    Jan 1 at 11:36
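
A minimal sketch tying together the two points from the comments above: preallocating the large data files with fallocate() as Nominal Animal suggests, and trimming them back with ftruncate() before closing, which is how an over-estimated reservation is returned to free space (hypothetical code, not the application's; it keeps the existing fopen()/stdio usage via fileno()):

    #define _GNU_SOURCE                 /* for fallocate() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Open a new data file and reserve an estimated size up front, so ext4
     * can allocate a few large extents instead of growing the file piecemeal.
     * (FALLOC_FL_ZERO_RANGE, as in the comment, would also zero the range.) */
    FILE *open_preallocated(const char *path, off_t estimated_size)
    {
        FILE *fp = fopen(path, "w");
        if (!fp)
            return NULL;
        if (fallocate(fileno(fp), 0, 0, estimated_size) != 0)
            perror("fallocate");        /* preallocation is only an optimisation */
        return fp;
    }

    /* The reservation is NOT released just because the file is closed; the
     * file simply keeps its preallocated size. Trim it to what was written. */
    int close_trimmed(FILE *fp, off_t bytes_written)
    {
        if (fflush(fp) != 0)            /* push stdio buffers to the kernel first */
            return -1;
        if (ftruncate(fileno(fp), bytes_written) != 0)
            return -1;
        return fclose(fp);
    }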
















filesystems ext4 ext3






asked Dec 5 '16 at 12:52 by Danny, edited Jan 1 at 13:22 by sourcejedi







2 Answers
In principle, I don't see why strict ring-buffer writes would pose any challenge regarding fragmentation. It seems like it would be straightforward. The quote sounds to me like it is based on advice from more general write workloads. But looking at the linked SO question I see you have a real problem...



Since you are concerned about fragmentation, you should consider how to measure it! e4defrag exists. It has only two options. -c only shows the current state and does not defrag. -v shows per-file statistics. All combinations of options are valid (including no options). Although it does not provide any explicit method to limit the performance impact on a running system, e4defrag supports being run on individual files, so you can rate-limit it yourself.



(XFS also has a defrag tool, though I haven't used it.)



e2freefrag can show free space fragmentation. If you use the CFQ IO scheduler, then you can run it with a reduced IO priority using ionice.
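
If running e4defrag -c over terabytes is too heavy, individual files can also be spot-checked from inside the application. A minimal sketch using the FIEMAP ioctl (the same interface the filefrag tool uses) to count a file's extents; the more extents a 1 GB data file has, the more fragmented it is:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <unistd.h>

    /* Count the extents of one file, roughly what `filefrag` reports. */
    int count_extents(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); return -1; }

        struct fiemap probe;
        memset(&probe, 0, sizeof probe);
        probe.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
        probe.fm_extent_count = 0;             /* 0: just report how many extents exist */

        if (ioctl(fd, FS_IOC_FIEMAP, &probe) != 0) {
            perror("FIEMAP");
            close(fd);
            return -1;
        }
        close(fd);
        return (int)probe.fm_mapped_extents;
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%s: %d extents\n", argv[i], count_extents(argv[i]));
        return 0;
    }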



The quote guesses wrong; the reply by Stephen Kitt is correct. ext4 does not perform any automatic defragmentation. It does not try to "shuffle around" data which has already been written.



Discarding this strange misconception leaves no reason to suggest "ext2/ext3". Apart from anything else, the ext3 code does not exist in current kernels. The ext4 code is used to mount ext3. ext3 is a subset of ext4. In particular when you are creating relatively large files, it just seems silly not to use extents, and those are an ext4-specific feature.
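
As a quick sanity check that a given data file really is extent-mapped (the 'e' attribute that lsattr shows), a hypothetical helper reading the inode flags:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <unistd.h>

    /* Return 1 if the file uses extents, 0 if not, -1 on error. */
    int uses_extents(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        int flags = 0;
        int rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);
        close(fd);
        if (rc != 0)
            return -1;
        return (flags & FS_EXTENT_FL) ? 1 : 0;
    }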



I believe "hanging" is more often associated with the journal. See e.g. comments from (the in-progress filesystem) bcachefs -




Tail latency has been the bane of ext4 users for many years - dependencies in the journalling code and elsewhere can lead to 30+ second latencies on simple operations (e.g. unlinks) on multithreaded workloads. No one seems to know how to fix them.



In bcachefs, the only reason a thread blocks on IO is because it explicitly asked to (an uncached read or an fsync operation), or resource exhaustion - full stop. Locks that would block foreground operations are never held while doing IO. While bcachefs isn't a realtime filesystem today (it lacks e.g. realtime scheduling for IO), it very conceivably could be one day.




Don't ask me to interpret the extent to which using XFS can avoid the above problem. I don't know. But if you were considering testing an alternative filesystem setup, XFS is the first thing I would try.



I'm struggling to find much information about the effects of disabling journalling on ext4. At least it doesn't seem to be one of the common options considered when tuning performance.



I'm not sure why you're using sys_sync(). It's usually better avoided (see e.g. here). I'm not sure that really explains your problem, but it seems an unfortunate thing to come across when trying to narrow this down.
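
If the goal is only to make the files the application just wrote durable, a narrower alternative (a sketch, assuming you can get at the file descriptor and its directory) is to fsync() the specific file and its directory entry instead of a global sync(); on Linux, syncfs() is another middle ground that flushes just the one filesystem rather than every mounted one:

    #define _GNU_SOURCE             /* for O_DIRECTORY on older glibc */
    #include <fcntl.h>
    #include <unistd.h>

    /* Flush one finished data file, and only that file, to stable storage. */
    int flush_one_file(int data_fd, const char *containing_dir)
    {
        if (fsync(data_fd) != 0)            /* file contents and metadata */
            return -1;

        /* fsync the directory too, so the new directory entry is durable. */
        int dir_fd = open(containing_dir, O_RDONLY | O_DIRECTORY);
        if (dir_fd < 0)
            return -1;
        int rc = fsync(dir_fd);
        close(dir_fd);
        return rc;
    }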






answered Dec 31 '18 at 21:03 by sourcejedi, edited Jan 1 at 14:37


    Here's an alternate approach; however, it's somewhat involved.



    Create many smaller partitions, let's say 10 or 20 of them. LVM2 might come in handy in this scenario. Then use the partitions in a ring-buffer fashion as follows:



    One of the partitions would always be the 'active' one, where new data gets written until it is completely full or nearly so. You don't need to leave any headroom. When the active partition has become full or doesn't have enough free space to hold the next chunk of data, switch to the next partition, which then becomes the active one.



    Your deleter process will always make sure that there is at least one completely empty partition available. If there isn't one--and this is the crucial part--it will simply reformat the oldest partition, creating a fresh new file system. This new partition will later be able to receive new data with minimal to no fragmentation.
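
A hypothetical sketch of that rotation scheme (not from the answer; the mount points, device names and mkfs invocation are all illustrative assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/statvfs.h>

    #define NVOL 12                         /* hypothetical: 12 ring volumes */

    /* Hypothetical naming scheme for the mount points and LVM devices. */
    static void vol_paths(int i, char *mnt, size_t mlen, char *dev, size_t dlen)
    {
        snprintf(mnt, mlen, "/data/vol%02d", i);
        snprintf(dev, dlen, "/dev/vg_data/vol%02d", i);
    }

    static unsigned long long free_bytes(const char *mnt)
    {
        struct statvfs vfs;
        if (statvfs(mnt, &vfs) != 0)
            return 0;
        return (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    }

    /* If the active volume cannot take the next chunk, move on to the next
     * volume and reformat the now-oldest one so it is empty when needed. */
    int rotate_if_needed(int active, unsigned long long next_chunk_bytes)
    {
        char mnt[64], dev[64], cmd[256];

        vol_paths(active, mnt, sizeof mnt, dev, sizeof dev);
        if (free_bytes(mnt) >= next_chunk_bytes)
            return active;                  /* keep writing where we are */

        int next = (active + 1) % NVOL;     /* becomes the new active volume */
        int oldest = (next + 1) % NVOL;     /* must be empty before 'next' fills */

        vol_paths(oldest, mnt, sizeof mnt, dev, sizeof dev);
        snprintf(cmd, sizeof cmd, "umount %s && mkfs.ext4 -q %s && mount %s %s",
                 mnt, dev, dev, mnt);
        if (system(cmd) != 0)               /* reformat: an instant, fragmentation-free purge */
            fprintf(stderr, "reformat of %s failed\n", mnt);
        return next;
    }

The attraction of the reformat step is that it replaces thousands of unlink() calls with a single mkfs and hands back a completely unfragmented volume; the obvious cost is that data is dropped a whole volume at a time rather than one file at a time.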






    answered Dec 31 '18 at 20:27 by jlh


    • I didn't mention in the question, but that's actually what we do. See the edited question above.

      – Danny
      Jan 1 at 3:22











    • @Danny if "the time sequence of files becomes randomly distributed across all volumes", then surely you cannot actually do "and this is the crucial part--simply reformat the oldest partition, creating a fresh new file system. This new partition will later be able to receive new data with minimal to no fragmentation."

      – sourcejedi
      Jan 1 at 14:16











    • Sorry, my bad. Somehow didn't see/read your last two paragraphs. We have 10-12 smaller partitions, but the deleter removes only the oldest files (1GB each) until "enough" free space is available. Then it stops and waits for the disk to be "too full" again. "enough" and "too full" can be adjusted for tuning.

      – Danny
      Jan 2 at 7:36









