Concatenate multiple zipped files, skipping header lines in all but the first file
I have a collection of gzipped files that I want to combine into a single file. They all have an identical format. I want to keep the header information from only the first file and skip it in the subsequent files.

As a simple example, I have four identical files with the following content:

    $ gzcat file1.gz
    # header
    1
    2

I want to end up with

    # header
    1
    2
    1
    2
    1
    2
    1
    2

In reality, I can have a varying number of files, so I would like to be able to do this programmatically. Here is the non-programmatic solution I have so far:

    cat <(gzcat file1.gz) <(tail -q -n +2 <(gzcat file2.gz) <(gzcat file3.gz) <(gzcat file4.gz))

This command works, but it is "hard coded" to handle four files, and I need to generalize it for any number of files.

I am using bash as the shell, if that helps. My preference is for performance (in reality the files can be millions of lines long), so I am OK with a less-than-elegant solution if it is speedy.

shell-script text-processing cat tail gzip
edited Sep 17 at 6:33 by G-Man
asked Sep 17 at 1:26 by SethMMorton
Are you saying that the command you have works (for four files), but you need to be able to generalize it for any number of files? Or is there some other problem with the command you have?
– G-Man, Sep 17 at 2:16

@G-Man I am saying it works, but I need to generalize it for any number of files. Having said that, I am open to completely different solutions as well.
– SethMMorton, Sep 17 at 3:54
2 Answers
If the command that you show in your question basically works (for a hard-coded number of files), then

    first=1
    for f in file*.gz
    do
        if [ "$first" ]
        then
            gzcat "$f"
            first=
        else
            gzcat "$f" | tail -n +2
        fi
    done > collection_single_file

should work for you. I hope the logic is fairly clear. Look at all the files (change the wildcard as appropriate for your file names). If it's the first one in the list, gzcat it, so you get the entire file (including the header). Otherwise, use tail to strip the header. After you've handled a file, no other file will be the first.

This invokes tail N-1 times, instead of just once (as the command in your question does). Aside from that, my answer should perform the same as your command.
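
For anyone who wants the loop above in reusable form, here is a minimal sketch wrapped in a function. The function name concat_skip_headers is made up for illustration, and gzip -dc is swapped in for gzcat on the assumption that it is the more widely available spelling; treat it as one possible packaging of the same idea rather than a tested drop-in.

```shell
#!/usr/bin/env bash
# concat_skip_headers is a hypothetical helper name (not from the answer).
# It writes the first file in full, then every later file minus its first line.
concat_skip_headers() {
    local first=1 f
    for f in "$@"; do
        if [ -n "$first" ]; then
            gzip -dc -- "$f"                # first file: keep the header
            first=
        else
            gzip -dc -- "$f" | tail -n +2   # later files: start at line 2
        fi
    done
}
```

It would be called as concat_skip_headers file*.gz > combined.txt, so the shell expands the wildcard and the number of files never has to be spelled out.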
answered Sep 17 at 3:57 by G-Man (accepted answer)
Ah! I was not yet aware one could redirect output from an entire for loop... it certainly makes things easier!
– SethMMorton, Sep 17 at 4:20
A variation on G-Man's solution that does not use a separate variable to keep track of the first file:

    {
        set -- file*.gz
        gzcat "$1"; shift
        for file do
            gzcat "$file" | sed '1d'
        done
    } >combined.txt

This uncompresses the first file and then loops over the remaining ones, passing each through a short sed script that deletes the first line. The output is redirected to combined.txt.

The set -- file*.gz command sets the positional parameters ($1, $2, etc., which collectively form the list "$@") to the filenames matching the given pattern. The shift removes $1 from that list after the file has been uncompressed. The loop iterates over the remaining filenames and could also have been written

    for file in "$@"; do
        gzcat "$file" | sed '1d'
    done
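
To see what set -- and shift do to the positional parameters without touching any real files, here is a small sketch. The three file names are hypothetical stand-ins for whatever file*.gz would expand to.

```shell
# Hypothetical names standing in for the expansion of file*.gz.
set -- file1.gz file2.gz file3.gz
first=$1          # the first positional parameter
shift             # drop $1; the list is now file2.gz file3.gz
rest=""
for name do       # with no "in" list, the loop walks "$@"
    rest="$rest $name"
done
echo "first: $first; remaining ($#):$rest"
```

After the shift, $# is 2 and the loop only ever sees the second and third names, which is why no "is this the first file?" flag is needed.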
The { ...; } grouping allows us to redirect the output of all the commands within it to a file in one go.
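
As a standalone illustration of that point, the sketch below sends the output of a plain command and of a loop to the same file with a single redirection. The scratch file from mktemp and the echoed strings are invented for the demonstration.

```shell
out=$(mktemp)     # scratch file, only for this demonstration
{
    echo "from a plain command"
    for i in 1 2; do
        echo "from the loop: $i"
    done
} >"$out"         # the file is opened once for the whole group
content=$(cat "$out")
rm -f "$out"
printf '%s\n' "$content"
```

Because the redirection is attached to the group rather than to each command, the file is truncated once and nothing inside the braces needs >> to avoid overwriting earlier output.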
Even shorter, with the additional assumption that a "header line" always starts with a # character (as in the example in the question), and that there are no other such lines in the data:

    gzcat file*.gz | awk 'NR > 1 && /^#/ { next } 1' >combined.txt

or,

    gzcat file*.gz | sed '2,$ { /^#/d; }' >combined.txt

Both of these skip any line starting with # if it occurs on the second line or later in the combined contents of the uncompressed data.
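
The assumption about # lines matters, and a quick sketch on made-up input shows why. The sample function below is hypothetical and simply imitates two concatenated files in which the second one has a data line that happens to begin with #.

```shell
# Made-up input: two "files" back to back; the second has a data line starting with '#'.
sample() {
    printf '%s\n' '# header' '1' '2' '# header' '#3 is a data line' '4'
}
filtered_awk=$(sample | awk 'NR > 1 && /^#/ { next } 1')
filtered_sed=$(sample | sed '2,$ { /^#/d; }')
printf '%s\n' "$filtered_awk"
```

Both filters keep the first header and the lines 1, 2 and 4, but they also discard the "#3 is a data line" row, so these one-liners are only safe when data lines never start with #.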
edited Sep 19 at 15:13
answered Sep 19 at 7:57 by Kusalananda