Concatenate multiple zipped files, skipping header lines in all but the first file

I have a collection of gzipped files that I want to combine into a single file. They each have identical format. I want to keep the header information from only the first file and skip it in the subsequent files.



As a simple example, I have four identical files with the following content:



$ gzcat file1.gz
# header
1
2


I want to end up with



# header
1
2
1
2
1
2
1
2


In reality, I can have a varying number of files, so I would like to be able to do this programmatically. Here is the non-programmatic solution I have so far:



cat <(gzcat file1.gz) <(tail -q -n +2 <(gzcat file2.gz) <(gzcat file3.gz) <(gzcat file4.gz))


This command works, but it is “hard coded” to handle four files,
and I need to generalize it for any number of files. 
I am using bash as the shell, if that helps. My preference is for performance (in reality the files can be millions of lines long), so I am OK with a less-than-elegant solution if it is speedy.










Tags: shell-script text-processing cat tail gzip






asked Sep 17 at 1:26 by SethMMorton, last edited Sep 17 at 6:33 by G-Man

  • Are you saying that the command you have works (for four files), but you need to be able to generalize it for any number of files?  Or is there some other problem with the command you have?
    – G-Man
    Sep 17 at 2:16

  • @G-Man I am saying it works, but I need to generalize it for any number of files. Having said that, I am open to completely different solutions as well.
    – SethMMorton
    Sep 17 at 3:54

2 Answers

If the command that you show in your question basically works (for a hard-coded number of files), then



first=1
for f in file*.gz
do
    if [ "$first" ]
    then
        gzcat "$f"
        first=
    else
        gzcat "$f" | tail -n +2
    fi
done > collection_single_file


should work for you. 
I hope the logic is fairly clear. 
Look at all the files (change the wildcard as appropriate for your file names). 
If it’s the first one in the list, gzcat it, so you get the entire file
(including the header). 
Otherwise, use tail to strip the header. 
Once the first file has been handled, first is cleared, so no later file is treated as the first.



This invokes tail N−1 times, instead of just once (as your command does).
Aside from that, it should perform about the same as your version.
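
As a rough sketch of the same idea, the loop can also be wrapped in a function so that the file list is passed as arguments instead of being hard-coded in a glob (the name combine_gz is made up purely for illustration):

# Same logic as above: keep the header of the first file,
# strip the first line of every later file.
combine_gz() {
    first=1
    for f in "$@"
    do
        if [ "$first" ]
        then
            gzcat "$f"               # first file: keep the header
            first=
        else
            gzcat "$f" | tail -n +2  # later files: drop the header line
        fi
    done
}

# Usage:
combine_gz file*.gz > collection_single_file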






  • Ah! I was not yet aware one could redirect output from an entire for loop... it certainly makes things easier!
    – SethMMorton
    Sep 17 at 4:20

A variation on G-Man's solution that does not use a separate variable to keep track of the first file:



set -- file*.gz

{
    gzcat "$1"; shift

    for file do
        gzcat "$file" | sed '1d'
    done
} >combined.txt


This uncompresses the first file and then loops over the remaining ones, passing each through a short sed script that deletes the first line. The output is redirected to combined.txt.



The set -- file*.gz command sets the positional parameters ($1, $2, etc., which together make up the list "$@") to the filenames matching the given pattern. The shift removes $1 from that list after it has been uncompressed. The loop then iterates over the remaining filenames and could also have been written



for file in "$@"; do
    gzcat "$file" | sed '1d'
done


The { ... } grouping allows us to redirect the output of all the commands within it to a file in one go.
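
For instance, roughly the same technique could be put into a small standalone script that takes the file names on its command line (the script name combine-gz.sh is only illustrative):

#!/bin/sh
# combine-gz.sh (illustrative): keep the header of the first file,
# strip the first line of every following file.
# Usage: ./combine-gz.sh file1.gz file2.gz ... > combined.txt

gzcat "$1"; shift

for file do
    gzcat "$file" | sed '1d'
done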




Even shorter, with the additional assumption that a header line always starts with a # character (as in the example in the question) and that no other lines in the data start with #:



gzcat file*.gz | awk 'NR > 1 && /^#/ { next }; 1' >combined.txt


or,



gzcat file*.gz | sed '2,${ /^#/d; }' >combined.txt


Both of these skip any line starting with # if it occurs on the second line or later of the combined uncompressed data.
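
To try either one-liner, the example data from the question can be recreated with something like this (assuming gzip and gzcat are available):

# Create four small test files matching the example in the question.
for i in 1 2 3 4; do
    printf '# header\n1\n2\n' | gzip > "file$i.gz"
done

gzcat file*.gz | awk 'NR > 1 && /^#/ { next }; 1'

Either pipeline should then print the header once, followed by the data lines of all four files.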





