How to speed up awk script that uses several large gzip files?

I have two data files:




  1. file_1.in, containing over 2k lines like "12 AB0001":



    10 AB0001
    11 AC0002
    12 AD0003
    ...



  2. A set of gzipped *.gz files (each about 1 to 3 million lines) that I
     need to extract and parse, creating one output file per entry in
     file_1.in, named after its second column.



    ##comment..
    ##comment..
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB0001 AC0002 AD0003
    21 1234567 ab11111 G A 100 PASS info1;info2 GT 0|0 0|1 0|0
    21 1234568 ab22222 C A 100 PASS info1,info2 GT 1:23:2 .:.:. 0:32:2
    21 1234569 ab33333 A C 100 PASS info1;info2 GT 0|2 1|0 0|0


After trying different approaches I came to this:




    {
    if (FNR == NR) { list[$1] = $2; next }
    if (!/^#/)
        for (p in list) {
            split($p, sp, ":")
            if (sp[1] != "0|0" && sp[1] != "0" && sp[1] != ".")
                printf("%s %s %s %s %s %s %s %s %s %s\n",
                    $1, $2, $3, $4, $5, $6, $7, $8, $9, $p) >> out "/" list[p] ".tmp"
        }
    }






executed by command line:



awk -v out="outfolder/" -f myscript.awk file_1.in <(zcat *.gz)


But it takes more than two hours to create just one file. Is there a way to improve my code? I think most of the time is spent by zcat on each file and by the append writes, which seem to be slow. What do you think?
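A quick way to check the first suspicion, that decompression dominates, is to time zcat alone with the output discarded. This is a sketch: `sample.gz` below is just a throwaway stand-in so the commands run anywhere; on the real data you would run `time zcat *.gz > /dev/null` in the input folder.

```shell
# Create a tiny throwaway .gz (stand-in for the real input files).
printf '21 1234567 ab11111 G A 100 PASS info GT 0|0\n' | gzip > sample.gz

# Time raw decompression, discarding output: on the real *.gz set this
# isolates how much of the runtime is spent in zcat itself.
time zcat sample.gz > /dev/null

rm -f sample.gz
```

If zcat alone accounts for most of the two hours, the awk script is not the bottleneck and speeding up decompression would matter more than restructuring the script.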

























  • I’m missing the zcat portion...
    – Jeff Schaller, Nov 19 '17 at 0:55

  • Can you add some data? That would help me understand what you're doing :)
    – tink, Nov 19 '17 at 2:01

  • I still can't figure out what you expect the value of $p to be ...
    – tink, Nov 19 '17 at 3:11

  • @JeffSchaller sorry, I forgot the command; just edited. It won't work without it.
    – redrich, Nov 19 '17 at 10:12

  • @tink I just added a few examples covering almost all cases. $p will be the index of the column I need, which will be written to a single file named after that column. I know it is possible to process the columns without taking file_1.in as input, but I've noticed that a for loop over more than 2k columns in 24 files takes too much time, so I came up with this solution (not sure if it is the best one), since the header of the files is always the same.
    – redrich, Nov 19 '17 at 10:20














edited May 6 at 13:42 by agc
asked Nov 19 '17 at 0:19 by redrich