cat on big files does not work
I'm trying to concatenate four big files into two. Each *_1P.gz file contains the same number of lines as the corresponding *_2P.gz file.
The files A_1P.gz and A_2P.gz both contain 1104507560 lines.
The files B_1P.gz and B_2P.gz both contain 1182136972 lines.
However, cat A_1P.gz B_1P.gz > C_1P.gz | wc -l returns 186974687 lines, and cat A_2P.gz B_2P.gz > C_2P.gz | wc -l returns 182952523 lines. So both outputs are not only far smaller than the two input files combined (they should be more than 2 billion lines long, and instead they have fewer than 200 million), but they also differ from each other in line count. The command ran without showing any errors.
I can't understand what's happening; I generated those four big files with cat as well, and it worked properly.
- What could the problem be?
- What other options do I have to concatenate gzipped files without using cat?
I'm working on a CentOS server. I still have 197G of free space, so that shouldn't be the issue (or it should at least show an error).
shell cat compression
14
When you're counting lines, don't you want to count the uncompressed lines? Also, the pipeline cat file1 file2 >file3 | wc -l does not make sense, as wc would get no data. What's the command that you are actually using?
– Kusalananda, Jul 19 at 13:28
What command(s) did you use to count the lines in the original files? It's possible that you unintentionally used some wrapper that silently decompressed them first. Try showing the size in bytes (using wc -c) instead of lines.
– JigglyNaga, Jul 19 at 13:28
@Kusalananda I obtained the line count of the four big files by doing zcat *P.gz | wc -l. The actual command was cat file1 file2 > file3; wc -l file3, but I didn't precede it with zcat, and that might be the root of my problem. If that's so, I'll feel really stupid...
– LinuxBlanket, Jul 19 at 13:31
@LinuxBlanket yes, you need to count the uncompressed lines, since lines are defined by \n and there is no reason to expect any specific number of \n characters in the compressed file.
– terdon♦, Jul 19 at 13:34
1
Obviously a cat eating big files is not a healthy diet. :)
– Rui F Ribeiro, Jul 19 at 13:58
edited Jul 19 at 14:14 by Kusalananda
asked Jul 19 at 13:24 by LinuxBlanket
1 Answer
Accepted answer (10 votes)
Note that the files are compressed. You therefore can't use wc -l on the files directly to count the original number of lines in them; they have to be decompressed first.
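To make the difference concrete, here is a minimal sketch on throwaway data. The sample file name and its contents are made up for illustration, and mktemp -d and seq are assumed to be available (they are on a typical CentOS install). It counts newline bytes in the compressed file and then counts lines after decompression.

```shell
#!/bin/sh
# Hypothetical sample: 1000 short lines, compressed with gzip.
tmp=$(mktemp -d) || exit 1
seq 1 1000 | gzip -c > "$tmp/sample.gz"

# Newline bytes that happen to occur inside the compressed data.
# This number carries no meaning for the original text.
raw=$(wc -l < "$tmp/sample.gz" | tr -d ' ')

# Actual number of lines, counted after decompression.
real=$(gzip -dc "$tmp/sample.gz" | wc -l | tr -d ' ')

echo "wc -l on compressed bytes: $raw"
echo "wc -l after decompression: $real"

rm -rf "$tmp"
```

The first number depends on the compressor's output and will generally differ between files, which would be consistent with the two mismatched counts reported in the question.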
It's OK to use cat to concatenate gzip-compressed files like these, as the resulting file is itself a valid compressed file. Decompressing it later yields the concatenation of the uncompressed data from the two files.
cat A_1P.gz B_1P.gz >C_1P.gz
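The same behaviour can be checked on small files before trusting it with large ones. This is only a sketch under stated assumptions: the file names and contents below are hypothetical stand-ins, not the files from the question, and mktemp -d is assumed to exist.

```shell
#!/bin/sh
# Hypothetical demo data; these are not the files from the question.
tmp=$(mktemp -d) || exit 1
printf 'a1\na2\na3\n' > "$tmp/a.txt"   # 3 lines
printf 'b1\nb2\n'     > "$tmp/b.txt"   # 2 lines
gzip -c "$tmp/a.txt" > "$tmp/a.gz"
gzip -c "$tmp/b.txt" > "$tmp/b.gz"

# Join the compressed files byte for byte.
cat "$tmp/a.gz" "$tmp/b.gz" > "$tmp/c.gz"

# The result passes gzip's integrity test.
if gzip -t "$tmp/c.gz"; then valid=yes; else valid=no; fi

# It decompresses to a.txt followed by b.txt.
cat "$tmp/a.txt" "$tmp/b.txt" > "$tmp/expected.txt"
if gzip -dc "$tmp/c.gz" | cmp -s - "$tmp/expected.txt"; then
    match=yes
else
    match=no
fi

# The compressed size is exactly the sum of the two input sizes.
size_a=$(wc -c < "$tmp/a.gz" | tr -d ' ')
size_b=$(wc -c < "$tmp/b.gz" | tr -d ' ')
size_c=$(wc -c < "$tmp/c.gz" | tr -d ' ')

echo "valid gzip: $valid"
echo "contents match: $match"
echo "sizes: $size_a + $size_b = $size_c"

rm -rf "$tmp"
```

Comparing byte sizes with wc -c, as suggested in the comments on the question, is a cheap sanity check for the real files too, since it needs no decompression.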
To count the number of lines in C_1P.gz:
zcat C_1P.gz | wc -l
or
gunzip -c C_1P.gz | wc -l
or
gzip -dc C_1P.gz | wc -l
Note that the file has to be decompressed before the lines are counted. Otherwise, wc -l counts the "random" newline bytes that the compression algorithm happens to produce as part of the compressed data, which have nothing to do with the lines in your uncompressed file.
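On the question's second point (concatenating without cat), one possible alternative is to decompress both inputs in order and recompress them as a single stream. The sketch below uses made-up sample files and assumes mktemp -d is available. It is expected to be much slower than cat on files of the size described, because everything is recompressed, so plain cat remains the simpler choice.

```shell
#!/bin/sh
# Hypothetical demo data; these are not the files from the question.
tmp=$(mktemp -d) || exit 1
printf 'a1\na2\na3\n' | gzip -c > "$tmp/a.gz"   # 3 lines
printf 'b1\nb2\n'     | gzip -c > "$tmp/b.gz"   # 2 lines

# Decompress both inputs in order and recompress them as one gzip stream.
gzip -dc "$tmp/a.gz" "$tmp/b.gz" | gzip -c > "$tmp/c.gz"

# Check the result and count its lines after decompression.
if gzip -t "$tmp/c.gz"; then valid=yes; else valid=no; fi
lines=$(gzip -dc "$tmp/c.gz" | wc -l | tr -d ' ')

echo "valid gzip: $valid"
echo "lines: $lines"

rm -rf "$tmp"
```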
...yes, I realized thanks to your comment that, unlike for the A_1P.gz and B_1P.gz line counts, I didn't uncompress the file before counting lines, and doing zcat file | wc -l yielded the correct line number. I'm sorry for the silly question, I don't know how I didn't see it before...
– LinuxBlanket, Jul 19 at 13:54
@LinuxBlanket It's an easy mistake to make.
– Kusalananda, Jul 19 at 14:00
edited Jul 19 at 14:15
answered Jul 19 at 13:31 by Kusalananda