How to speed up awk script that uses several large gzip files?
I have two data files:
file_1.in, containing over 2k lines like "12 AB0001":
10 AB0001
11 AC0002
12 AD0003
...
A list of *.gz gzipped files (about 1 to 3 million lines each) that I should extract and parse, creating one output file per line of file_1.in, named after its second column:
##comment..
##comment..
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB0001 AC0002 AD0003
21 1234567 ab11111 G A 100 PASS info1;info2 GT 0|0 0|1 0|0
21 1234568 ab22222 C A 100 PASS info1,info2 GT 1:23:2 .:.:. 0:32:2
21 1234569 ab33333 A C 100 PASS info1;info2 GT 0|2 1|0 0|0
Trying different approaches I came to this:
{
    if (FNR == NR) { list[$1] = $2; next }   # file_1.in: column number -> sample name
    if (!/^#/)
        for (p in list) {
            split($p, sp, ":")               # first subfield of the genotype column
            if (sp[1] != "0|0" && sp[1] != "0" && sp[1] != ".")
                printf("%s %s %s %s %s %s %s %s %s %s\n",
                       $1, $2, $3, $4, $5, $6, $7, $8, $9, $p) >> (out "/" list[p] ".tmp")
        }
}
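
To make the filter concrete, here is how it treats the genotype columns of the second sample row above, assuming the split on ":" shown in the script:

$10 = "1:23:2"  ->  sp[1] = "1"  ->  appended to outfolder/AB0001.tmp
$11 = ".:.:."   ->  sp[1] = "."  ->  skipped
$12 = "0:32:2"  ->  sp[1] = "0"  ->  skipped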
executed from the command line:
awk -v out="outfolder/" -f myscript.awk file_1.in <(zcat *.gz)
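
Equivalently, as a sketch of the same pipeline processing the archives one at a time (the trailing "-" makes awk read the decompressed stream from stdin after file_1.in):

for f in *.gz; do
    zcat -- "$f" | awk -v out="outfolder/" -f myscript.awk file_1.in -
done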
But it takes more than two hours to create just one file. Is there a way to improve my code? I think most of the time is spent zcat-ing each file and in the append-mode writes, which seem to be slow. What do you think?
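
A rough way to check that hypothesis (a sketch; one_file.gz stands in for any single real archive):

time zcat one_file.gz > /dev/null                                            # decompression alone
time zcat one_file.gz | awk -v out="outfolder/" -f myscript.awk file_1.in -  # decompression + parsing + writes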
bash awk files scripting gawk
edited May 6 at 13:42 by agc
asked Nov 19 '17 at 0:19 by redrich
I'm missing the zcat portion...
– Jeff Schaller
Nov 19 '17 at 0:55

Can you add some data, that would help me understand what you're doing :)
– tink
Nov 19 '17 at 2:01

I still can't figure out what you expect the value of $p to be ...
– tink
Nov 19 '17 at 3:11

@JeffSchaller sorry, I forgot the command; just edited it in. It won't work without it.
– redrich
Nov 19 '17 at 10:12

@tink just added a few examples covering almost all cases. $p will be the index of the column I need, which will be written to a single file named after that column. I know it is possible to process the columns without taking file_1.in as input, but I've noticed that doing a for loop over more than 2k columns in 24 files takes too much time, so I came up with this solution (not sure if it is the best one), since the header of the files is always the same.
– redrich
Nov 19 '17 at 10:20