What could explain this strange sparse file handling of/in tmpfs?
On my ext4
filesystem partition I can run the following code:
fs="/mnt/ext4"
#create sparse 100M file on $fs
dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2> /dev/null
#show its actual used size before
echo "Before:"
ls $fs/sparse100M -s
#setting the sparse file up as loopback and run md5sum on loopback
losetup /dev/loop0 $fs/sparse100M
md5sum /dev/loop0
#show its actual used size afterwards
echo "After:"
ls $fs/sparse100M -s
#release loopback and remove file
losetup -d /dev/loop0
rm $fs/sparse100M
which yields
Before:
0 sparse100M
2f282b84e7e608d5852449ed940bfc51 /dev/loop0
After:
0 sparse100M
Doing the very same thing on tmpfs, with:
fs="/tmp"
yields
Before:
0 /tmp/sparse100M
2f282b84e7e608d5852449ed940bfc51 /dev/loop0
After:
102400 /tmp/sparse100M
which basically means that something I expected to merely read the data caused the sparse file to "blow up like a balloon".
I suspect this is because of less complete support for sparse files in the tmpfs filesystem, in particular the missing FIEMAP ioctl, but I am not sure what actually causes this behaviour. Can you tell me?
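For reference, a quick way to check whether a file is actually sparse (a sketch; the path is just an example) is to compare its apparent size with its allocated blocks:

```shell
f=/tmp/sparse100M
# create a 100 MiB sparse file without writing any data blocks
truncate -s 100M "$f"
# apparent size in bytes vs. allocated 512-byte blocks
stat -c 'apparent=%s bytes allocated=%b blocks' "$f"
du -h --apparent-size "$f"   # reports the logical size (100M)
du -h "$f"                   # reports the space actually allocated
rm "$f"
```

On both ext4 and tmpfs a freshly truncated file should show 0 allocated blocks.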
ext4 tmpfs sparse-files
hum. There is a shared (copy-on-write) zero page, that could be used when a sparse page needed to be mmap()ed, for example. So I'm not sure why any type of read from a sparse tmpfs file would require allocating real memory. lwn.net/Articles/517465 . I wondered if this was some side effect of the conversion of loop to use direct io, but it seems there should not be any difference when you try to use the new type of loop on tmpfs. spinics.net/lists/linux-fsdevel/msg60337.html
– sourcejedi
Sep 15 at 19:26
maybe this might get an answer if it were on SO? just a thought
– Marcus Linsner
Sep 15 at 22:22
The output of /tmp has different files Before/After. Is that a typo? Before: 0 /tmp/sparse100 (without M at the end) After: 102400 /tmp/sparse100M (with the trailing M).
– YoMismo
Sep 19 at 13:16
@YoMismo, yes, it was only a little typo
– humanityANDpeace
Sep 21 at 8:13
1 Answer
First off you're not alone in puzzling about these sorts of issues.
This is not just limited to tmpfs
but has been a concern cited with
NFSv4.
If an application reads 'holes' in a sparse file, the file system converts empty blocks into "real" blocks filled with zeros, and returns them to the application.
When md5sum scans a file it explicitly chooses to read it in
sequential order, which makes a lot of sense given what md5sum is
trying to do.
As there are fundamentally "holes" in the file, this sequential reading is going
(in some situations) to trigger a copy-on-write-like operation that fills out the file. This then gets
into a deeper question of whether fallocate()
as implemented in the
filesystem supports FALLOC_FL_PUNCH_HOLE.
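Hole-punching can be sketched from the shell (assuming util-linux's fallocate is available; the temp file is just an illustration):

```shell
f=$(mktemp)
# write 1 MiB of real data so blocks actually get allocated
dd if=/dev/urandom of="$f" bs=1M count=1 status=none
stat -c 'allocated=%b blocks' "$f"   # typically 2048 (512-byte blocks)
# punch the written range back out; --keep-size preserves the file length
fallocate --punch-hole --keep-size --offset 0 --length 1M "$f"
stat -c 'allocated=%b blocks' "$f"   # drops back toward 0 where PUNCH_HOLE is supported
rm "$f"
```

The apparent size stays at 1 MiB throughout; only the allocated blocks change.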
Fortunately, not only does tmpfs
support this, but there is also a mechanism to
"dig" the holes back out.
Using the CLI utility fallocate
we can successfully detect and re-dig these
holes.
As per man 1 fallocate
:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place, without
using extra disk space. The minimum size of the hole depends on
filesystem I/O block size (usually 4096 bytes). Also, when using
this option, --keep-size is implied. If no range is specified by
--offset and --length, then the entire file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the need for
extra disk space.
See --punch-hole for a list of supported filesystems.
fallocate
operates on the file level, though, and when you run md5sum
against a block device (requesting sequential reads) you're tripping over the
exact gap in how the fallocate()
syscall operates. Using your example, we see the following:
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ONTGAS8L06
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ONTGAS8L06/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ sudo md5sum /dev/loop0
2f282b84e7e608d5852449ed940bfc51 /dev/loop0
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 102400 /tmp/tmp.ONTGAS8L06/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ONTGAS8L06/sparse100M
Now... that answers your basic question. My general motto is "get weird" so I
dug in further...
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ZcAxvW32GY
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 516 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 512 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
You can see that the mere act of running losetup
changes the allocated size of
the sparse file. So this becomes an interesting intersection of tmpfs,
the HOLE_PUNCH mechanism, fallocate,
and block devices.
Thanks for your answer. I'm aware tmpfs supports sparse files and punch_hole. That's what makes it so confusing – tmpfs supports this, so why go and fill the sparse holes when reading through a loop device? losetup doesn't change the file size, but it creates a block device, which on most systems is then scanned for content like: is there a partition table? is there a filesystem with UUID? should I create a /dev/disk/by-uuid/ symlink then? And those reads already cause parts of the sparse file to be allocated, because for some mysterious reason, tmpfs fills holes on (some) reads.
– frostschutz
Sep 20 at 18:12
Can you clarify "sequential reading is going to (in some situations) cause a copy on write like operation", please? I'm curious to understand how a read operation would trigger a copy on write action. Thanks!
– roaima
Sep 20 at 18:12
This is odd. On my system I followed the same steps, though manually and not in a script. First I did a 100M file just like the OP. Then I repeated the steps with only a 10MB file. First result: ls -s sparse100M was 102400. But ls -s on the 10MB file was only 328 blocks. ??
– Patrick Taylor
Sep 21 at 5:16
@PatrickTaylor ~328K is about what's used after the UUID scanners came by, but you didn't cat / md5sum the loop device for a full read.
– frostschutz
Sep 21 at 10:53
I was digging through the source for the loop kernel module (in loop.c) and saw that there are two relevant functions: lo_read_simple & lo_read_transfer. There are some minor differences in how they do low level memory allocation... lo_read_transfer is actually requesting non-blocking io from slab.h (GFP_NOIO) while performing an alloc_page() call. lo_read_simple() on the other hand is not performing alloc_page().
– Brian Redbeard
Sep 21 at 19:23
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
4
down vote
First off you're not alone in puzzling about these sorts of issues.
This is not just limited to tmpfs
but has been a concern cited with
NFSv4.
If an application reads 'holes' in a sparse file, the file system converts empty blocks into "real" blocks filled with zeros, and returns them to the application.
When md5sum
is attempting to scan a file it explicitly chooses to do this in
sequential order, which makes a lot of sense based on what md5sum is
attempting to do.
As there are fundamentally "holes" in the file, this sequential reading is going
to (in some situations) cause a copy on write like operation to fill out the file. This then gets
into a deeper issue around whether or not fallocate()
as implemented in the
filesystem supports FALLOC_FL_PUNCH_HOLE
.
Fortunately, not only does tmpfs
support this but there is a mechanism to
"dig" the holes back out.
Using the CLI utility fallocate
we can successfuly detect and re-dig these
holes.
As per man 1 fallocate
:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place, without
using extra disk space. The minimum size of the hole depends on
filesystem I/O block size (usually 4096 bytes). Also, when using
this option, --keep-size is implied. If no range is specified by
--offset and --length, then the entire file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the need for
extra disk space.
See --punch-hole for a list of supported filesystems.
fallocate
operates on the file level though and when you are running md5sum
against a block device (requesting sequential reads) you're tripping up on the
exact gap between how the fallocate()
syscall should operate. We can see this
in action:
In action, using your example we see the following:
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ONTGAS8L06
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ONTGAS8L06/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ sudo md5sum /dev/loop0
2f282b84e7e608d5852449ed940bfc51 /dev/loop0
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 102400 /tmp/tmp.ONTGAS8L06/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ONTGAS8L06/sparse100M
Now... that answers your basic question. My general motto is "get weird" so I
dug in further...
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ZcAxvW32GY
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 516 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 512 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
You see that merely the act of performing the losetup
changes the size of
the sparse file. So this becomes an interesting combination of where tmpfs
,
the HOLE_PUNCH mechanism, fallocate
, and block devices intersect.
Thanks for your answer. I'm awaretmpfs
supports sparse files and punch_hole. That's what makes it so confusing -tmpfs
supports this, so why go and fill the sparse holes when reading through a loop device?losetup
doesn't change the file size, but it creates a block device, which on most systems is then scanned for content like: is there a partition table? is there a filesystem with UUID? should I create a /dev/disk/by-uuid/ symlink then? And those reads already cause parts of the sparse file to be allocated, because for some mysterious reason, tmpfs fills holes on (some) reads.
â frostschutz
Sep 20 at 18:12
1
Can you clarify "sequential reading is going to (in some situations) cause a copy on write like operation", please? I'm curious to understand how a read operation would trigger a copy on write action. Thanks!
â roaima
Sep 20 at 18:12
This is odd. On my system I followed the same steps, though manually and not in a script. First I did a 100M file just like the OP. Then I repeated the steps with only a 10MB file. First result : ls -s sparse100M was 102400. But ls -s on the 10MB file was only 328 blocks. ??
â Patrick Taylor
Sep 21 at 5:16
1
@PatrickTaylor ~328K is about what's used after the UUID scanners came by, but you didn't cat / md5sum the loop device for a full read.
â frostschutz
Sep 21 at 10:53
1
I was digging through the source for the loop kernel module (inloop.c
) and saw that there are two relevant functions:lo_read_simple
&lo_read_transfer
. There are some minor differences in how they do low level memory allocation...lo_read_transfer
is actually requesting non-blocking io fromslab.h
(GFP_NOIO
) while performing aalloc_page()
call.lo_read_simple()
on the other hand is not performingalloc_page()
.
â Brian Redbeard
Sep 21 at 19:23
 |Â
show 4 more comments
up vote
4
down vote
First off you're not alone in puzzling about these sorts of issues.
This is not just limited to tmpfs
but has been a concern cited with
NFSv4.
If an application reads 'holes' in a sparse file, the file system converts empty blocks into "real" blocks filled with zeros, and returns them to the application.
When md5sum
is attempting to scan a file it explicitly chooses to do this in
sequential order, which makes a lot of sense based on what md5sum is
attempting to do.
As there are fundamentally "holes" in the file, this sequential reading is going
to (in some situations) cause a copy on write like operation to fill out the file. This then gets
into a deeper issue around whether or not fallocate()
as implemented in the
filesystem supports FALLOC_FL_PUNCH_HOLE
.
Fortunately, not only does tmpfs
support this but there is a mechanism to
"dig" the holes back out.
Using the CLI utility fallocate
we can successfuly detect and re-dig these
holes.
As per man 1 fallocate
:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place, without
using extra disk space. The minimum size of the hole depends on
filesystem I/O block size (usually 4096 bytes). Also, when using
this option, --keep-size is implied. If no range is specified by
--offset and --length, then the entire file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the need for
extra disk space.
See --punch-hole for a list of supported filesystems.
fallocate
operates on the file level though and when you are running md5sum
against a block device (requesting sequential reads) you're tripping up on the
exact gap between how the fallocate()
syscall should operate. We can see this
in action:
In action, using your example we see the following:
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ONTGAS8L06
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ONTGAS8L06/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ sudo md5sum /dev/loop0
2f282b84e7e608d5852449ed940bfc51 /dev/loop0
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 102400 /tmp/tmp.ONTGAS8L06/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ONTGAS8L06/sparse100M
Now... that answers your basic question. My general motto is "get weird" so I
dug in further...
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ZcAxvW32GY
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 516 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 512 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
You see that merely the act of performing the losetup
changes the size of
the sparse file. So this becomes an interesting combination of where tmpfs
,
the HOLE_PUNCH mechanism, fallocate
, and block devices intersect.
Thanks for your answer. I'm awaretmpfs
supports sparse files and punch_hole. That's what makes it so confusing -tmpfs
supports this, so why go and fill the sparse holes when reading through a loop device?losetup
doesn't change the file size, but it creates a block device, which on most systems is then scanned for content like: is there a partition table? is there a filesystem with UUID? should I create a /dev/disk/by-uuid/ symlink then? And those reads already cause parts of the sparse file to be allocated, because for some mysterious reason, tmpfs fills holes on (some) reads.
â frostschutz
Sep 20 at 18:12
1
Can you clarify "sequential reading is going to (in some situations) cause a copy on write like operation", please? I'm curious to understand how a read operation would trigger a copy on write action. Thanks!
â roaima
Sep 20 at 18:12
This is odd. On my system I followed the same steps, though manually and not in a script. First I did a 100M file just like the OP. Then I repeated the steps with only a 10MB file. First result : ls -s sparse100M was 102400. But ls -s on the 10MB file was only 328 blocks. ??
â Patrick Taylor
Sep 21 at 5:16
1
@PatrickTaylor ~328K is about what's used after the UUID scanners came by, but you didn't cat / md5sum the loop device for a full read.
â frostschutz
Sep 21 at 10:53
1
I was digging through the source for the loop kernel module (inloop.c
) and saw that there are two relevant functions:lo_read_simple
&lo_read_transfer
. There are some minor differences in how they do low level memory allocation...lo_read_transfer
is actually requesting non-blocking io fromslab.h
(GFP_NOIO
) while performing aalloc_page()
call.lo_read_simple()
on the other hand is not performingalloc_page()
.
â Brian Redbeard
Sep 21 at 19:23
 |Â
show 4 more comments
up vote
4
down vote
up vote
4
down vote
First off you're not alone in puzzling about these sorts of issues.
This is not just limited to tmpfs
but has been a concern cited with
NFSv4.
If an application reads 'holes' in a sparse file, the file system converts empty blocks into "real" blocks filled with zeros, and returns them to the application.
When md5sum
is attempting to scan a file it explicitly chooses to do this in
sequential order, which makes a lot of sense based on what md5sum is
attempting to do.
As there are fundamentally "holes" in the file, this sequential reading is going
to (in some situations) cause a copy on write like operation to fill out the file. This then gets
into a deeper issue around whether or not fallocate()
as implemented in the
filesystem supports FALLOC_FL_PUNCH_HOLE
.
Fortunately, not only does tmpfs
support this but there is a mechanism to
"dig" the holes back out.
Using the CLI utility fallocate
we can successfuly detect and re-dig these
holes.
As per man 1 fallocate
:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place, without
using extra disk space. The minimum size of the hole depends on
filesystem I/O block size (usually 4096 bytes). Also, when using
this option, --keep-size is implied. If no range is specified by
--offset and --length, then the entire file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the need for
extra disk space.
See --punch-hole for a list of supported filesystems.
fallocate
operates on the file level though and when you are running md5sum
against a block device (requesting sequential reads) you're tripping up on the
exact gap between how the fallocate()
syscall should operate. We can see this
in action:
In action, using your example we see the following:
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ONTGAS8L06
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ONTGAS8L06/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ sudo md5sum /dev/loop0
2f282b84e7e608d5852449ed940bfc51 /dev/loop0
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 102400 /tmp/tmp.ONTGAS8L06/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ONTGAS8L06/sparse100M
Now... that answers your basic question. My general motto is "get weird" so I
dug in further...
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ZcAxvW32GY
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 516 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 512 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
You see that merely the act of performing the losetup
changes the size of
the sparse file. So this becomes an interesting combination of where tmpfs
,
the HOLE_PUNCH mechanism, fallocate
, and block devices intersect.
First off you're not alone in puzzling about these sorts of issues.
This is not just limited to tmpfs
but has been a concern cited with
NFSv4.
If an application reads 'holes' in a sparse file, the file system converts empty blocks into "real" blocks filled with zeros, and returns them to the application.
When md5sum scans a file, it explicitly chooses to read it in sequential order, which makes a lot of sense given what md5sum is trying to do.
As there are fundamentally "holes" in the file, this sequential reading is going to (in some situations) trigger a copy-on-write-like operation that fills out the file. This then gets into a deeper issue: whether or not fallocate(), as implemented in the filesystem, supports FALLOC_FL_PUNCH_HOLE.
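That flag is exposed at the command line through the fallocate(1) utility, so you can punch a hole by hand. A minimal sketch (assuming util-linux fallocate and a filesystem with punch-hole support, such as ext4, xfs, or tmpfs):

```shell
scratch=$(mktemp -d)
# Write 1MiB of real (allocated) data.
dd if=/dev/zero of="$scratch/f" bs=1M count=1 2>/dev/null

size_before=$(stat -c %s "$scratch/f")    # apparent size in bytes
blocks_before=$(stat -c %b "$scratch/f")  # allocated 512-byte blocks

# Punch a hole over the first 512KiB. --keep-size leaves the apparent
# size untouched while deallocating the underlying blocks.
fallocate --punch-hole --keep-size --offset 0 --length $((512*1024)) "$scratch/f"

size_after=$(stat -c %s "$scratch/f")
blocks_after=$(stat -c %b "$scratch/f")
echo "size: $size_before -> $size_after  blocks: $blocks_before -> $blocks_after"
rm -r "$scratch"
```

The apparent size is unchanged, but the allocation count drops for the punched range; reads of that range now return zeros again.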
Fortunately, not only does tmpfs support this, but there is also a mechanism to "dig" the holes back out. Using the CLI utility fallocate we can successfully detect and re-dig these holes. As per man 1 fallocate:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place, without
using extra disk space. The minimum size of the hole depends on
filesystem I/O block size (usually 4096 bytes). Also, when using
this option, --keep-size is implied. If no range is specified by
--offset and --length, then the entire file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the need for
extra disk space.
See --punch-hole for a list of supported filesystems.
fallocate operates at the file level, though, and when you run md5sum against a block device (requesting sequential reads) you trip over the exact gap in how the fallocate() syscall operates there. We can see this in action using your example:
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ONTGAS8L06
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ONTGAS8L06/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ sudo md5sum /dev/loop0
2f282b84e7e608d5852449ed940bfc51 /dev/loop0
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 102400 /tmp/tmp.ONTGAS8L06/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ONTGAS8L06/sparse100M
Now... that answers your basic question. My general motto is "get weird", so I dug in further...
$ fs=$(mktemp -d)
$ echo $fs
/tmp/tmp.ZcAxvW32GY
$ dd if=/dev/zero of=$fs/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls $fs/sparse100M -s)"
Before: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo losetup /dev/loop0 $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 516 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 512 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d $fs/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum $fs/sparse100M
2f282b84e7e608d5852449ed940bfc51 /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls $fs/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
You see that merely the act of performing the losetup changes the size of the sparse file. So this becomes an interesting combination of where tmpfs, the HOLE_PUNCH mechanism, fallocate, and block devices intersect.
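As a file-level sanity check, the same round trip can be scripted to confirm that fallocate -d frees blocks without altering content, which is what the matching md5sums above demonstrate. A sketch assuming GNU coreutils and util-linux:

```shell
scratch=$(mktemp -d)
# Build a file with one real data block followed by a long run of zeros
# that is actually allocated (a plain write, no conv=sparse).
dd if=/dev/urandom of="$scratch/f" bs=4096 count=1 2>/dev/null
dd if=/dev/zero of="$scratch/f" bs=4096 seek=1 count=255 conv=notrunc 2>/dev/null

sum_before=$(md5sum "$scratch/f" | awk '{print $1}')
blocks_before=$(stat -c %b "$scratch/f")

# Detect the zero runs and dig them back out as holes.
fallocate -d "$scratch/f"

sum_after=$(md5sum "$scratch/f" | awk '{print $1}')
blocks_after=$(stat -c %b "$scratch/f")
echo "md5 unchanged: $([ "$sum_before" = "$sum_after" ] && echo yes || echo no)"
echo "blocks: $blocks_before -> $blocks_after"
rm -r "$scratch"
```

The checksum is identical before and after, while the allocation count shrinks back toward just the non-zero block.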
answered Sep 20 at 18:00
Brian Redbeard
1,578827
Thanks for your answer. I'm aware tmpfs supports sparse files and punch_hole. That's what makes it so confusing - tmpfs supports this, so why go and fill the sparse holes when reading through a loop device? losetup doesn't change the file size, but it creates a block device, which on most systems is then scanned for content: is there a partition table? is there a filesystem with a UUID? should I create a /dev/disk/by-uuid/ symlink? And those reads already cause parts of the sparse file to be allocated, because for some mysterious reason, tmpfs fills holes on (some) reads.
– frostschutz Sep 20 at 18:12
Can you clarify "sequential reading is going to (in some situations) cause a copy on write like operation", please? I'm curious to understand how a read operation would trigger a copy-on-write action. Thanks!
– roaima Sep 20 at 18:12
This is odd. On my system I followed the same steps, though manually and not in a script. First I did a 100M file just like the OP. Then I repeated the steps with only a 10MB file. First result: ls -s sparse100M was 102400. But ls -s on the 10MB file was only 328 blocks. ??
– Patrick Taylor Sep 21 at 5:16
@PatrickTaylor ~328K is about what's used after the UUID scanners came by, but you didn't cat / md5sum the loop device for a full read.
– frostschutz Sep 21 at 10:53
I was digging through the source for the loop kernel module (in loop.c) and saw that there are two relevant functions: lo_read_simple and lo_read_transfer. There are some minor differences in how they do low-level memory allocation... lo_read_transfer is actually requesting non-blocking IO from slab.h (GFP_NOIO) while performing an alloc_page() call. lo_read_simple(), on the other hand, is not performing alloc_page().
– Brian Redbeard Sep 21 at 19:23
hum. There is a shared (copy-on-write) zero page that could be used when a sparse page needs to be mmap()ed, for example. So I'm not sure why any type of read from a sparse tmpfs file would require allocating real memory. lwn.net/Articles/517465 . I wondered if this was some side effect of the conversion of loop to use direct IO, but it seems there should not be any difference when you try to use the new type of loop on tmpfs. spinics.net/lists/linux-fsdevel/msg60337.html
– sourcejedi Sep 15 at 19:26
maybe this might get an answer if it were on SO? just a thought
– Marcus Linsner Sep 15 at 22:22
The output of /tmp has different files Before/After. Is that a typo? Before: 0 /tmp/sparse100 (without M at the end); After: 102400 /tmp/sparse100M (with the trailing M).
– YoMismo Sep 19 at 13:16
@YoMismo, yes, that was only a little typo.
– humanityANDpeace Sep 21 at 8:13