cat on big files does not work

I'm trying to concatenate four big files into two. The files *_1P.gz contain the same number of lines as the corresponding *_2P.gz.

The files A_1P.gz and A_2P.gz both contain 1104507560 lines.

The files B_1P.gz and B_2P.gz both contain 1182136972 lines.

However, cat A_1P.gz B_1P.gz > C_1P.gz| wc -l returns 186974687 lines, and cat A_2P.gz B_2P.gz > C_2P.gz| wc -l returns 182952523 lines. So both results are not only far smaller than expected (they should be more than 2B lines long and they're less than 200M instead), but they also have different numbers of lines. The commands ran without showing any errors.

I can't understand what's happening; I generated those four big files with cat as well, and it worked properly.

What could the problem be?

What other options do I have to concatenate gzipped files without using cat?

I'm working on a CentOS server. I still have 197G of free space, so that shouldn't be an issue (or it should show an error, at least).

edited Jul 19 at 14:14 by Kusalananda

asked Jul 19 at 13:24 by LinuxBlanket

When you're counting lines, don't you want to count the uncompressed lines? Also, the pipeline cat file1 file2 >file3 | wc -l does not make sense, as wc would get no data. What's the command that you are actually using?
– Kusalananda
Jul 19 at 13:28

What command(s) did you use to count the lines in the original files? It's possible that you unintentionally used some wrapper that silently decompressed them first. Try showing the size in bytes (using wc -c) instead of lines.
– JigglyNaga
Jul 19 at 13:28

@Kusalananda I obtained the line count of the four big files by doing zcat *P.gz | wc -l. The actual command was cat file1 file2 > file3; wc -l file3, but I didn't precede it with zcat, and that might be the root of my problem. If that's so, I'll feel really stupid...
– LinuxBlanket
Jul 19 at 13:31

@LinuxBlanket Yes, you need to count the uncompressed lines, since lines are defined by \n and there is no reason to expect any specific number of \n characters in the compressed file.
– terdon♦
Jul 19 at 13:34

Obviously a cat eating big files is not a healthy diet. :)
– Rui F Ribeiro
Jul 19 at 13:58

1 Answer

accepted

Note that the files are compressed. You therefore can't use wc -l on them directly to count the original number of lines; you have to decompress them first.

It's OK to use cat for concatenating these types of compressed files as the resulting file is a valid compressed file in itself. Uncompressing it later would result in a file that is the concatenation of the uncompressed data from the two files.

cat A_1P.gz B_1P.gz >C_1P.gz

To count the number of lines in C_1P.gz, use any one of the following:

zcat C_1P.gz | wc -l

gunzip -c C_1P.gz | wc -l

gzip -dc C_1P.gz | wc -l

Note that we need to uncompress the file to count the lines; otherwise we would be counting the "random" newline bytes that happen to occur in the compressed data, which have nothing to do with the lines in your uncompressed file.
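If you want to convince yourself of this on a small scale, here is a rough sketch rather than anything definitive. The scratch directory and the file names A.gz, B.gz and C.gz are made up for illustration and are not the files from the question. It builds two tiny gzip files, concatenates them with cat, checks the result with gzip -t, and then compares uncompressed line counts and compressed byte sizes (the wc -c check suggested in the comments).

```shell
#!/bin/sh
# Hypothetical scratch area and file names, used only for this demonstration.
tmp=$(mktemp -d)

# Two small gzip files with known contents: 3 lines and 2 lines.
printf 'a1\na2\na3\n' | gzip > "$tmp/A.gz"
printf 'b1\nb2\n'     | gzip > "$tmp/B.gz"

# Plain byte-wise concatenation of the compressed files.
cat "$tmp/A.gz" "$tmp/B.gz" > "$tmp/C.gz"

# gzip -t exits with status 0 if the file is a valid gzip stream.
gzip -t "$tmp/C.gz" && echo "C.gz passes the gzip integrity test"

# Count lines in the *uncompressed* data of each file.
lines_a=$(gzip -dc "$tmp/A.gz" | wc -l)
lines_b=$(gzip -dc "$tmp/B.gz" | wc -l)
lines_c=$(gzip -dc "$tmp/C.gz" | wc -l)
echo "uncompressed lines: A=$lines_a B=$lines_b C=$lines_c"

# Compressed sizes in bytes: cat adds nothing, so C should be A + B.
bytes_a=$(wc -c < "$tmp/A.gz")
bytes_b=$(wc -c < "$tmp/B.gz")
bytes_c=$(wc -c < "$tmp/C.gz")
echo "compressed bytes:   A=$bytes_a B=$bytes_b C=$bytes_c"

# Remove the scratch directory.
rm -rf "$tmp"
```

The line count reported for C should equal the sum of the counts for A and B, and the same holds for the byte sizes. Running wc -l directly on C.gz, by contrast, only counts newline bytes inside the compressed stream.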

edited Jul 19 at 14:15

answered Jul 19 at 13:31 by Kusalananda

...yes, I realized thanks to your comment that, unlike for the A_1P.gz and B_1P.gz line counts, I didn't uncompress the file before counting lines, and doing zcat file | wc -l yielded the correct line number. I'm sorry for the silly question; I don't know how I didn't see it before...
– LinuxBlanket
Jul 19 at 13:54

@LinuxBlanket It's an easy mistake to make.
– Kusalananda
Jul 19 at 14:00
