How to speed up awk script that uses several large gzip files?
I have two data files:
file_1.in, containing over 2k lines like "12 AB0001":
10 AB0001
11 AC0002
12 AD0003
...
A list of *.gz gzipped files (about 1 to 3 million lines each) that I should extract and parse, creating one output file per line of file_1.in, named after its second column:
##comment..
##comment..
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB0001 AC0002 AD0003
21 1234567 ab11111 G A 100 PASS info1;info2 GT 0|0 0|1 0|0
21 1234568 ab22222 C A 100 PASS info1,info2 GT 1:23:2 .:.:. 0:32:2
21 1234569 ab33333 A C 100 PASS info1;info2 GT 0|2 1|0 0|0
Trying different approaches I came to this:
{
    if (FNR == NR) { list[$1] = $2; next }   # file_1.in: column number -> sample name
    if (!/^#/)
        for (p in list) {
            split($p, sp, ":")               # first subfield of the genotype column
            if (sp[1] != "0|0" && sp[1] != "0" && sp[1] != ".")
                printf("%s %s %s %s %s %s %s %s %s %s\n",
                       $1, $2, $3, $4, $5, $6, $7, $8, $9, $p) >> (out "/" list[p] ".tmp")
        }
}
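
To make the filter concrete, here is how it treats the genotype columns of the second sample row above, assuming the split on ":" shown in the script:

$10 = "1:23:2"  ->  sp[1] = "1"  ->  appended to outfolder/AB0001.tmp
$11 = ".:.:."   ->  sp[1] = "."  ->  skipped
$12 = "0:32:2"  ->  sp[1] = "0"  ->  skipped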
executed from the command line:
awk -v out="outfolder/" -f myscript.awk file_1.in <(zcat *.gz)
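
Equivalently, as a sketch of the same pipeline processing the archives one at a time (the trailing "-" makes awk read the decompressed stream from stdin after file_1.in):

for f in *.gz; do
    zcat -- "$f" | awk -v out="outfolder/" -f myscript.awk file_1.in -
done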
But it takes more than two hours to create just one file. Is there a way to improve my code? I think most of the time is spent zcat-ing each file and in the append-mode writes, which seem to be slow. What do you think?
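
A rough way to check that hypothesis (a sketch; one_file.gz stands in for any single real archive):

time zcat one_file.gz > /dev/null                                            # decompression alone
time zcat one_file.gz | awk -v out="outfolder/" -f myscript.awk file_1.in -  # decompression + parsing + writes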
bash awk files scripting gawk
edited May 6 at 13:42 by agc
asked Nov 19 '17 at 0:19 by redrich
I'm missing the zcat portion...
– Jeff Schaller
Nov 19 '17 at 0:55

Can you add some data, that would help me understand what you're doing :)
– tink
Nov 19 '17 at 2:01

I still can't figure out what you expect the value of $p to be ...
– tink
Nov 19 '17 at 3:11

@JeffSchaller sorry, I forgot the command; just edited it in. It won't work without it.
– redrich
Nov 19 '17 at 10:12

@tink just added a few examples covering almost all cases. $p will be the index of the column I need, which will be written to a single file named after that column. I know it is possible to process the columns without taking file_1.in as input, but I've noticed that doing a for loop over more than 2k columns in 24 files takes too much time, so I came up with this solution (not sure if it is the best one), since the header of the files is always the same.
– redrich
Nov 19 '17 at 10:20