Concatenate multiple zipped files, skipping header lines in all but the first file

I have a collection of gzipped files that I want to combine into a single file. They all have an identical format. I want to keep the header information from only the first file and skip it in all subsequent files.

As a simple example, I have four identical files with the following content:

$ gzcat file1.gz
# header
1
2

I want to end up with:

# header
1
2
1
2
1
2
1
2
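
For reference, test files like these can be recreated with something along these lines (the file names simply match the example):

printf '# header\n1\n2\n' | gzip > file1.gz     # one small gzipped file with a header
for i in 2 3 4; do cp file1.gz "file$i.gz"; done  # three identical copies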


In reality I can have a varying number of files, so I would like to be able to do this programmatically. Here is the non-programmatic solution I have so far:

cat <(gzcat file1.gz) <(tail -q -n +2 <(gzcat file2.gz) <(gzcat file3.gz) <(gzcat file4.gz))

This command works, but it is hard-coded to handle four files, and I need to generalize it to any number of files. I am using bash as the shell, if that helps. My preference is for performance (in reality the files can be millions of lines long), so I am OK with a less-than-elegant solution as long as it is speedy.










shell-script text-processing cat tail gzip

asked Sep 17 at 1:26 by SethMMorton; edited Sep 17 at 6:33 by G-Man

  • Are you saying that the command you have works (for four files), but you need to be able to generalize it for any number of files?  Or is there some other problem with the command you have?
    – G-Man
    Sep 17 at 2:16

  • @G-Man I am saying it works, but I need to generalize it for any number of files. Having said that, I am open to completely different solutions as well.
    – SethMMorton
    Sep 17 at 3:54

2 Answers

If the command that you show in your question basically works (for a hard-coded number of files), then



first=1
for f in file*.gz
do
    if [ "$first" ]
    then
        gzcat "$f"
        first=
    else
        gzcat "$f" | tail -n +2
    fi
done > collection_single_file


should work for you. I hope the logic is fairly clear: loop over all the files (changing the wildcard as appropriate for your file names); if a file is the first one in the list, gzcat it so you get the entire file, including the header; otherwise, pipe it through tail to strip the header. Once the first file has been handled, the flag is cleared, so no other file is treated as the first.



This invokes tail N−1 times instead of just once, as your command does. Aside from that, it should perform about the same as your approach.
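
If this is needed more than once, the same loop can also be wrapped in a small function that takes the file names as arguments (a sketch; combine_gz is just a made-up name):

combine_gz() {
    local first=1 f
    for f in "$@"
    do
        if [ "$first" ]
        then
            gzcat "$f"                  # first file: keep the header
            first=
        else
            gzcat "$f" | tail -n +2     # later files: drop the header line
        fi
    done
}

combine_gz file*.gz > collection_single_file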






answered Sep 17 at 3:57 by G-Man (accepted)

  • Ah! I was not yet aware one could redirect output from an entire for loop... it certainly makes things easier!
    – SethMMorton
    Sep 17 at 4:20

A variation on G-Man's solution that does not use a separate variable to keep track of the first file:



set -- file*.gz

{
    gzcat "$1"; shift

    for file do
        gzcat "$file" | sed '1d'
    done
} >combined.txt


This uncompresses the first file and then loops over the remaining ones, passing each through a short sed script that deletes the first line. The output is redirected to combined.txt.



The set -- file*.gz command sets the positional parameters ($1, $2, etc., which together make up the list $@) to the filenames matching the given pattern. The shift removes $1 from that list after it has been uncompressed. The loop then iterates over the remaining filenames and could also have been written



for file in "$@"; do
    gzcat "$file" | sed '1d'
done


The { ... } command grouping allows us to redirect the output of all the commands within it to a file in one go.
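
Since everything runs inside that group, the same construction also makes it easy to compress the combined result on the fly; a sketch of that variation (the output name is just an example):

set -- file*.gz

{
    gzcat "$1"; shift

    for file do
        gzcat "$file" | sed '1d'
    done
} | gzip >combined.txt.gz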




Even shorter, with the additional assumption that a "header line" always starts with a # character (as in the example in the question) and that there are no other such lines in the data:



gzcat file*.gz | awk 'NR > 1 && /^#/ { next } 1' >combined.txt


or,



gzcat file*.gz | sed '2,$ { /^#/d; }' >combined.txt


Both of these skip any line starting with # if it occurs on the second line or later of the combined uncompressed data.
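
For instance, run against the four example files from the question, the awk variant should reproduce the desired output:

$ gzcat file*.gz | awk 'NR > 1 && /^#/ { next } 1'
# header
1
2
1
2
1
2
1
2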






answered Sep 19 at 7:57 by Kusalananda; edited Sep 19 at 15:13