How to speed up awk script that uses several large gzip files?

I have two data files:




  1. file_1.in, containing over 2k lines like "12 AB0001":



    10 AB0001
    11 AC0002
    12 AD0003
    ...



  2. A set of gzipped files (*.gz, about 1 to 3 million lines) that I need to
    decompress and parse, creating one output file per line of file_1.in,
    named after the second column of that line.



    ##comment..
    ##comment..
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB0001 AC0002 AD0003
    21 1234567 ab11111 G A 100 PASS info1;info2 GT 0|0 0|1 0|0
    21 1234568 ab22222 C A 100 PASS info1,info2 GT 1:23:2 .:.:. 0:32:2
    21 1234569 ab33333 A C 100 PASS info1;info2 GT 0|2 1|0 0|0


After trying different approaches, I came up with this:




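FNR == NR { list[$1] = $2; next }        # first file: column index -> sample name
!/^#/ {
    for (p in list) {
        split($p, sp, ":")               # sp[1]: sample field up to the first ":"
        if (sp[1] != "0|0" && sp[1] != "0" && sp[1] != ".")
            printf("%s %s %s %s %s %s %s %s %s %s\n",
                   $1, $2, $3, $4, $5, $6, $7, $8, $9, $p) >> (out "/" list[p] ".tmp")
    }
}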






which I run from the command line:



awk -v out="outfolder/" -f myscript.awk file_1.in <(zcat *.gz)


But it takes more than two hours to create just one file. Is there a way to improve my code? I think most of the time is spent running zcat on each file and on the append writes, which seem to be slow. What do you think?
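One way to check that guess is to time the stages separately. The sketch below is only an illustration under stated assumptions: the sample file and the variable names (tmp, t0, t1, t2, records) are hypothetical, `gzip -dc` stands in for zcat, and `date +%s` only gives one-second resolution, which should be enough for a job that runs for hours.

```shell
# Hypothetical stand-in for the real *.gz inputs (two header lines, three records).
tmp=$(mktemp -d)
printf '%s\n' \
  '##comment..' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB0001 AC0002 AD0003' \
  '21 1234567 ab11111 G A 100 PASS info1;info2 GT 0|0 0|1 0|0' \
  '21 1234568 ab22222 C A 100 PASS info1,info2 GT 1:23:2 .:.:. 0:32:2' \
  '21 1234569 ab33333 A C 100 PASS info1;info2 GT 0|2 1|0 0|0' \
  | gzip > "$tmp/sample.gz"

# Stage 1: decompression only, output discarded.
t0=$(date +%s)
gzip -dc "$tmp"/*.gz > /dev/null
t1=$(date +%s)

# Stage 2: decompression plus awk reading every record, no files written.
# NF is referenced so that awk actually splits each record into fields.
records=$(gzip -dc "$tmp"/*.gz | awk '!/^#/ && NF { n++ } END { print n + 0 }')
t2=$(date +%s)

echo "decompress only:  $((t1 - t0)) s"
echo "decompress + awk: $((t2 - t1)) s ($records records)"
```

If both stages finish quickly on the real files, the remaining time of the full run is probably spent in the per-sample loop and the appended writes rather than in decompression.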







  • 3




    I’m missing the zcat portion...
    – Jeff Schaller
    Nov 19 '17 at 0:55










  • Can you add some data, that would help me understand what you're doing :)
    – tink
    Nov 19 '17 at 2:01










  • I still can't figure out what you expect the value of $p to be ...
    – tink
    Nov 19 '17 at 3:11










  • @JeffSchaller sorry I forgot the command, just edited, it won't work without.
    – redrich
    Nov 19 '17 at 10:12










  • @tink I just added a few examples covering almost all cases. $p is the index of the column I need, and each column is written to its own file named after that column. I know it is possible to process the columns without taking file_1.in as input, but I've noticed that a for loop over more than 2k columns in 24 files takes too much time, so I came up with this solution (not sure if it is the best one), since the header of the files is always the same.
    – redrich
    Nov 19 '17 at 10:20
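Since the header is said to be the same in every file, the index-to-name mapping could in principle be read from the #CHROM line once instead of being maintained by hand. A minimal sketch, assuming the sample columns start at field 10 (right after FORMAT, as in the header shown in the question); the sample file and the names tmp and mapping are hypothetical:

```shell
# Hypothetical sample mirroring the header shown in the question.
tmp=$(mktemp -d)
printf '%s\n' \
  '##comment..' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB0001 AC0002 AD0003' \
  '21 1234567 ab11111 G A 100 PASS info1;info2 GT 0|0 0|1 0|0' \
  | gzip > "$tmp/sample.gz"

# Print "column-index sample-name" for every sample column (field 10 onward),
# then stop reading as soon as the header line has been seen.
mapping=$(gzip -dc "$tmp/sample.gz" | awk '
    /^#CHROM/ { for (i = 10; i <= NF; i++) print i, $i; exit }
')
printf '%s\n' "$mapping"
```

The output has the same two-column layout as file_1.in, so it could be redirected to a file and passed to the existing script unchanged.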














edited May 6 at 13:42 by agc
asked Nov 19 '17 at 0:19 by redrich






