Merge 2 huge files matching multiple columns and preserving the order (print matching and non-matching values) - scale up from awk

I'm having trouble merging data from two files. It is genetic data with chromosome, position, reference and alternate alleles, and I need to merge the files by matching all four of those columns, with the reference and alternate alleles accepted either way round. That is, columns $1, $2, $4 and $5 (or $1, $2, $5 and $4) of the lookup file must exactly match columns $1, $5, $6 and $7 of the datafile. It is really important that I keep the exact order of the datafile - I can't sort it, so sadly I can't use join, which is the answer usually suggested for this sort of question.

I have used awk and got the code working for sample files with a few thousand rows, but it does not scale up to my large dataset (the lookup file has >300 million rows; the datafile has 30 million) - presumably because the code keeps two huge lookup arrays in memory. Any suggestions for scalable code (in perl?) gratefully received!

The format of the lookup file is:

1 10150 rs371194064 C T
1 10165 rs796884232 A AC
1 10177 rs367896724 A AC
1 10177 rs201752861 A C
1 10180 rs201694901 T C
1 10199 rs905327004 A T
1 10231 rs200279319 C A
1 10234 rs145599635 C T
1 10235 rs540431307 T TA
1 10235 rs1035249121 T A
1 10235 rs1035249121 T C
1 10241 rs960927773 T C
1 10247 rs796996180 T C
1 10248 rs148908337 A T
1 10249 rs774211241 AAC A


and the format of my datafile is:

1 chr1 chr1:10177 1:10177_A_AC 10177 A AC
1 chr1 chr1:10235 1:10235_T_TA 10235 T TA
1 chr1 chr1:10352 1:10352_T_TA 10352 T TA
1 chr1 chr1:10505 1:10505_A_T 10505 A T
1 chr1 chr1:10506 1:10506_C_G 10506 C G
1 chr1 chr1:10511 1:10511_G_A 10511 G A
1 chr1 chr1:10539 1:10539_C_A 10539 C A
1 chr1 chr1:10542 1:10542_C_T 10542 C T
1 chr1 chr1:10579 1:10579_C_A 10579 C A


The output should look like:

1 rs367896724 1:10177_A_AC 10177 A AC A AC
1 rs540431307 1:10235_T_TA 10235 T TA T TA
1 chr1:10352 1:10352_T_TA 10352 T TA T TA
1 chr1:10505 1:10505_A_T 10505 A T A T
1 chr1:10506 1:10506_C_G 10506 C G C G
1 chr1:10511 1:10511_G_A 10511 G A G A
1 chr1:10539 1:10539_C_A 10539 C A C A
1 chr1:10542 1:10542_C_T 10542 C T C T
1 chr1:10579 1:10579_C_A 10579 C A C A


The awk code I have managed to get working for the sample file is as follows:

awk 'BEGIN { OFS = "\t" }
NR == FNR {                      # lookup file (323 million rows)
    key = $1 "," $2 "," $4 "," $5
    present[key] = 1
    ID[key]  = $3
    Ref[key] = $4
    Alt[key] = $5

    key1 = $1 "," $2 "," $4 "," $5   # same key as above; the allele swap is applied on the datafile side below
    present1[key1] = 1
    ID1[key1]  = $3
    Ref1[key1] = $4
    Alt1[key1] = $5

    next
}
{                                # my data file (3 million rows)
    key  = $1 "," $5 "," $6 "," $7
    key1 = $1 "," $5 "," $7 "," $6   # reference and alternate swapped
    if (present[key])        print $1, ID[key],   $4, $5, $6, $7, Ref[key],   Alt[key];
    else if (present1[key1]) print $1, ID1[key1], $4, $5, $6, $7, Ref1[key1], Alt1[key1];
    else                     print $1, $3, $4, $5, $6, $7, $6, $7
}' "$lookupfile" "$mydatafile" > "$outputfile"

awk columns merge

edited Aug 21 at 12:41 by Rui F Ribeiro, 36.7k
asked Sep 19 '17 at 8:14 by Katie Fletcher, 162

  • "Merge", "huge files", "preserve order": pick two.
    – Satō Katsura, Sep 19 '17 at 8:20

  • Your best bet is probably to add line numbers (to preserve order) and import both files into a SQL database. Then your problem becomes a simple join (sketched below).
    – Satō Katsura, Sep 19 '17 at 8:30

  • @SatōKatsura The order of the data file is to be kept, but this is not critical. The critical part is the lookup table, but you can modify that as you like.
    – Philippos, Sep 19 '17 at 10:46

  • This is on topic here and welcome to stay. I would also expect you have a much better chance of an answer here than on Stack Overflow. That said, it might be wrong to look at this as a simple text-parsing issue. It looks like you are essentially trying to attach rsIDs to a set of variants, i.e. to annotate a VCF file, or something very similar. You might want to explain your issue from the beginning and ask on Bioinformatics instead; there, just explain what your final objective is. I am guessing there's no need to use the enormous lookup file to begin with.
    – terdon♦, Sep 19 '17 at 12:55

  • One more thing. Why is the order important? In your example, it looks like you've just sorted by chromosome and position. Is that so? If yes, we can easily get the original order back by sorting again. Would that be acceptable?
    – terdon♦, Sep 19 '17 at 12:56
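
[Editor's note] Following up Satō Katsura's database suggestion: below is a hedged perl sketch of the shape such a solution could take, not a confirmed implementation. It assumes DBD::SQLite is installed; the database file lookup.db, the table name and the command-line arguments are placeholders. Because the datafile itself drives the final loop, its original order is preserved without the line-number trick. In practice, bulk-loading 323 million rows would be far faster with sqlite3's .import than with row-by-row inserts as shown here.

#!/usr/bin/env perl
# Sketch: load the lookup table into SQLite once, index it, then walk the
# datafile in its original order and query per line. Names are placeholders.
use strict;
use warnings;
use DBI;

my ($lookupfile, $datafile) = @ARGV;
my $dbh = DBI->connect("dbi:SQLite:dbname=lookup.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE IF NOT EXISTS lookup
          (chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)");
my $ins = $dbh->prepare("INSERT INTO lookup VALUES (?,?,?,?,?)");
open my $l, '<', $lookupfile or die "$lookupfile: $!";
while (<$l>) { $ins->execute(split) }    # slow for 323M rows; use .import instead
close $l;
$dbh->do("CREATE INDEX IF NOT EXISTS k ON lookup (chrom, pos, ref, alt)");
$dbh->commit;

my $q = $dbh->prepare(
    "SELECT id, ref, alt FROM lookup WHERE chrom=? AND pos=? AND ref=? AND alt=?");
open my $d, '<', $datafile or die "$datafile: $!";
while (<$d>) {
    my @f = split;                       # awk's $1..$7 become $f[0]..$f[6]
    # Try the direct key first, then the allele-swapped key.
    my $row = $dbh->selectrow_arrayref($q, undef, $f[0], $f[4], $f[5], $f[6])
           // $dbh->selectrow_arrayref($q, undef, $f[0], $f[4], $f[6], $f[5]);
    my ($id, $ref, $alt) = $row ? @$row : ($f[2], $f[5], $f[6]);
    print join("\t", $f[0], $id, @f[3..6], $ref, $alt), "\n";
}
close $d;

The output columns mirror the awk version: $1, rsID (or the original $3 when unmatched), $4-$7, then the lookup's reference and alternate alleles.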














1 Answer

The problem is not keeping that file in memory, but scanning through the lookup table for each line of the data file. Your code doesn't show it, but behind the scenes you do 3,000,000 × 323,000,000 / 2 ≈ 4.8 × 10^14 string comparisons - almost half a quadrillion - moving thousands of terabytes over your memory bus. Even with fast memory at 200 GBit/s this will take many hours.



So the key to the problem is how you store the lookup table. I suggest using a binary tree to exponentially reduce the execution time. You can do this in perl or C or some other language, but at that point it becomes off-topic here.
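
[Editor's note] In the same spirit of shrinking what each lookup touches, here is a minimal perl sketch of a different arrangement than the binary tree described above: hold only the (much smaller) datafile's keys in memory, stream the huge lookup file past them once, then re-read the datafile so the output keeps its original order. This is an illustration under stated assumptions - whitespace-separated fields as in the samples, placeholder file arguments - and it still holds tens of millions of short hash keys (likely several GB) in memory.

#!/usr/bin/env perl
use strict;
use warnings;

my ($lookupfile, $datafile) = @ARGV;   # placeholder arguments

# Pass 1: collect every key the datafile can ask for, in both allele orders.
my %want;
open my $d, '<', $datafile or die "$datafile: $!";
while (<$d>) {
    my @f = split;                               # awk's $1..$7 -> $f[0]..$f[6]
    $want{"$f[0],$f[4],$f[5],$f[6]"} = 1;        # chrom,pos,ref,alt
    $want{"$f[0],$f[4],$f[6],$f[5]"} = 1;        # alleles swapped
}
close $d;

# Pass 2: stream the 323-million-row lookup file, keeping only rows whose
# key the datafile needs; everything else is discarded on the fly.
my %id;
open my $l, '<', $lookupfile or die "$lookupfile: $!";
while (<$l>) {
    my @f = split;
    my $key = "$f[0],$f[1],$f[3],$f[4]";         # chrom,pos,ref,alt
    $id{$key} = $f[2] if exists $want{$key};
}
close $l;

# Pass 3: re-read the datafile in order and print, mirroring the awk output.
open $d, '<', $datafile or die "$datafile: $!";
while (<$d>) {
    my @f = split;
    my $key  = "$f[0],$f[4],$f[5],$f[6]";
    my $key1 = "$f[0],$f[4],$f[6],$f[5]";
    my ($rsid, @alleles);
    if    (exists $id{$key})  { ($rsid, @alleles) = ($id{$key},  $f[5], $f[6]) }
    elsif (exists $id{$key1}) { ($rsid, @alleles) = ($id{$key1}, $f[6], $f[5]) }
    else                      { ($rsid, @alleles) = ($f[2],      $f[5], $f[6]) }
    print join("\t", $f[0], $rsid, @f[3..6], @alleles), "\n";
}
close $d;

If several lookup rows share the same chrom,pos,ref,alt key, the last one read wins, matching the behaviour of the awk version.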



The unix command toolset will not help you with this problem.






answered Sep 19 '17 at 8:39 by Philippos, 5,952