Merge 2 huge files matching multiple columns and preserving the order (print matching and non-matching values) - scale up from awk

I'm having trouble merging data from two files. It is genetic data with chromosome, position, reference and alternate alleles, and I need to merge the files by matching all four of those columns, with the reference and alternate alleles accepted either way round. That is, columns $1, $2, $4 and $5 (or $1, $2, $5 and $4) of the lookup file must exactly match columns $1, $5, $6 and $7 of the datafile. It is really important that I keep the exact order of the datafile - I can't sort it, so sadly I can't use join, which is the answer usually suggested for this sort of question.

I have used awk and got the code working for sample files with a few thousand rows, but it does not scale up to my large dataset (the lookup file has >300 million rows; the datafile has 30 million) - presumably because the code keeps two huge lookup arrays in memory. Any suggestions for scalable code (in perl?) gratefully received!

The format of the lookup file is:

1 10150 rs371194064 C T
1 10165 rs796884232 A AC
1 10177 rs367896724 A AC
1 10177 rs201752861 A C
1 10180 rs201694901 T C
1 10199 rs905327004 A T
1 10231 rs200279319 C A
1 10234 rs145599635 C T
1 10235 rs540431307 T TA
1 10235 rs1035249121 T A
1 10235 rs1035249121 T C
1 10241 rs960927773 T C
1 10247 rs796996180 T C
1 10248 rs148908337 A T
1 10249 rs774211241 AAC A


and the format of my datafile is:

1 chr1 chr1:10177 1:10177_A_AC 10177 A AC
1 chr1 chr1:10235 1:10235_T_TA 10235 T TA
1 chr1 chr1:10352 1:10352_T_TA 10352 T TA
1 chr1 chr1:10505 1:10505_A_T 10505 A T
1 chr1 chr1:10506 1:10506_C_G 10506 C G
1 chr1 chr1:10511 1:10511_G_A 10511 G A
1 chr1 chr1:10539 1:10539_C_A 10539 C A
1 chr1 chr1:10542 1:10542_C_T 10542 C T
1 chr1 chr1:10579 1:10579_C_A 10579 C A


The output should look like:

1 rs367896724 1:10177_A_AC 10177 A AC A AC
1 rs540431307 1:10235_T_TA 10235 T TA T TA
1 chr1:10352 1:10352_T_TA 10352 T TA T TA
1 chr1:10505 1:10505_A_T 10505 A T A T
1 chr1:10506 1:10506_C_G 10506 C G C G
1 chr1:10511 1:10511_G_A 10511 G A G A
1 chr1:10539 1:10539_C_A 10539 C A C A
1 chr1:10542 1:10542_C_T 10542 C T C T
1 chr1:10579 1:10579_C_A 10579 C A C A


The awk code I have managed to get working for the sample file is as follows:

awk 'BEGIN { OFS = "\t" }
NR == FNR {                      # lookup file (323 million rows)
    key = $1 "," $2 "," $4 "," $5
    present[key] = 1
    ID[key]  = $3
    Ref[key] = $4
    Alt[key] = $5

    key1 = $1 "," $2 "," $4 "," $5   # same key as above; the allele swap is applied on the datafile side below
    present1[key1] = 1
    ID1[key1]  = $3
    Ref1[key1] = $4
    Alt1[key1] = $5

    next
}
{                                # my data file (3 million rows)
    key  = $1 "," $5 "," $6 "," $7
    key1 = $1 "," $5 "," $7 "," $6   # reference and alternate swapped
    if (present[key])        print $1, ID[key],   $4, $5, $6, $7, Ref[key],   Alt[key];
    else if (present1[key1]) print $1, ID1[key1], $4, $5, $6, $7, Ref1[key1], Alt1[key1];
    else                     print $1, $3, $4, $5, $6, $7, $6, $7
}' "$lookupfile" "$mydatafile" > "$outputfile"

awk columns merge

edited Aug 21 at 12:41 by Rui F Ribeiro, 36.7k
asked Sep 19 '17 at 8:14 by Katie Fletcher, 162

  • "Merge", "huge files", "preserve order": pick two.
    – Satō Katsura, Sep 19 '17 at 8:20

  • Your best bet is probably to add line numbers (to preserve order) and import both files into a SQL database. Then your problem becomes a simple join (sketched below).
    – Satō Katsura, Sep 19 '17 at 8:30

  • @SatōKatsura The order of the data file is to be kept, but this is not critical. The critical part is the lookup table, but you can modify that as you like.
    – Philippos, Sep 19 '17 at 10:46

  • This is on topic here and welcome to stay. I would also expect you have a much better chance of an answer here than on Stack Overflow. That said, it might be wrong to look at this as a simple text-parsing issue. It looks like you are essentially trying to attach rsIDs to a set of variants, i.e. to annotate a VCF file, or something very similar. You might want to explain your issue from the beginning and ask on Bioinformatics instead; there, just explain what your final objective is. I am guessing there's no need to use the enormous lookup file to begin with.
    – terdon♦, Sep 19 '17 at 12:55

  • One more thing. Why is the order important? In your example, it looks like you've just sorted by chromosome and position. Is that so? If yes, we can easily get the original order back by sorting again. Would that be acceptable?
    – terdon♦, Sep 19 '17 at 12:56
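
[Editor's note] Following up Satō Katsura's database suggestion: below is a hedged perl sketch of the shape such a solution could take, not a confirmed implementation. It assumes DBD::SQLite is installed; the database file lookup.db, the table name and the command-line arguments are placeholders. Because the datafile itself drives the final loop, its original order is preserved without the line-number trick. In practice, bulk-loading 323 million rows would be far faster with sqlite3's .import than with row-by-row inserts as shown here.

#!/usr/bin/env perl
# Sketch: load the lookup table into SQLite once, index it, then walk the
# datafile in its original order and query per line. Names are placeholders.
use strict;
use warnings;
use DBI;

my ($lookupfile, $datafile) = @ARGV;
my $dbh = DBI->connect("dbi:SQLite:dbname=lookup.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE IF NOT EXISTS lookup
          (chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)");
my $ins = $dbh->prepare("INSERT INTO lookup VALUES (?,?,?,?,?)");
open my $l, '<', $lookupfile or die "$lookupfile: $!";
while (<$l>) { $ins->execute(split) }    # slow for 323M rows; use .import instead
close $l;
$dbh->do("CREATE INDEX IF NOT EXISTS k ON lookup (chrom, pos, ref, alt)");
$dbh->commit;

my $q = $dbh->prepare(
    "SELECT id, ref, alt FROM lookup WHERE chrom=? AND pos=? AND ref=? AND alt=?");
open my $d, '<', $datafile or die "$datafile: $!";
while (<$d>) {
    my @f = split;                       # awk's $1..$7 become $f[0]..$f[6]
    # Try the direct key first, then the allele-swapped key.
    my $row = $dbh->selectrow_arrayref($q, undef, $f[0], $f[4], $f[5], $f[6])
           // $dbh->selectrow_arrayref($q, undef, $f[0], $f[4], $f[6], $f[5]);
    my ($id, $ref, $alt) = $row ? @$row : ($f[2], $f[5], $f[6]);
    print join("\t", $f[0], $id, @f[3..6], $ref, $alt), "\n";
}
close $d;

The output columns mirror the awk version: $1, rsID (or the original $3 when unmatched), $4-$7, then the lookup's reference and alternate alleles.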














1 Answer

The problem is not keeping that file in memory, but scanning through the lookup table for each line of the data file. Your code doesn't show it, but behind the scenes you do 3,000,000 × 323,000,000 / 2 ≈ 4.8 × 10^14 string comparisons - almost half a quadrillion - moving thousands of terabytes over your memory bus. Even with fast memory at 200 GBit/s this will take many hours.



So the key to the problem is how you store the lookup table. I suggest using a binary tree to exponentially reduce the execution time. You can do this in perl or C or some other language, but at that point it becomes off-topic here.
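
[Editor's note] In the same spirit of shrinking what each lookup touches, here is a minimal perl sketch of a different arrangement than the binary tree described above: hold only the (much smaller) datafile's keys in memory, stream the huge lookup file past them once, then re-read the datafile so the output keeps its original order. This is an illustration under stated assumptions - whitespace-separated fields as in the samples, placeholder file arguments - and it still holds tens of millions of short hash keys (likely several GB) in memory.

#!/usr/bin/env perl
use strict;
use warnings;

my ($lookupfile, $datafile) = @ARGV;   # placeholder arguments

# Pass 1: collect every key the datafile can ask for, in both allele orders.
my %want;
open my $d, '<', $datafile or die "$datafile: $!";
while (<$d>) {
    my @f = split;                               # awk's $1..$7 -> $f[0]..$f[6]
    $want{"$f[0],$f[4],$f[5],$f[6]"} = 1;        # chrom,pos,ref,alt
    $want{"$f[0],$f[4],$f[6],$f[5]"} = 1;        # alleles swapped
}
close $d;

# Pass 2: stream the 323-million-row lookup file, keeping only rows whose
# key the datafile needs; everything else is discarded on the fly.
my %id;
open my $l, '<', $lookupfile or die "$lookupfile: $!";
while (<$l>) {
    my @f = split;
    my $key = "$f[0],$f[1],$f[3],$f[4]";         # chrom,pos,ref,alt
    $id{$key} = $f[2] if exists $want{$key};
}
close $l;

# Pass 3: re-read the datafile in order and print, mirroring the awk output.
open $d, '<', $datafile or die "$datafile: $!";
while (<$d>) {
    my @f = split;
    my $key  = "$f[0],$f[4],$f[5],$f[6]";
    my $key1 = "$f[0],$f[4],$f[6],$f[5]";
    my ($rsid, @alleles);
    if    (exists $id{$key})  { ($rsid, @alleles) = ($id{$key},  $f[5], $f[6]) }
    elsif (exists $id{$key1}) { ($rsid, @alleles) = ($id{$key1}, $f[6], $f[5]) }
    else                      { ($rsid, @alleles) = ($f[2],      $f[5], $f[6]) }
    print join("\t", $f[0], $rsid, @f[3..6], @alleles), "\n";
}
close $d;

If several lookup rows share the same chrom,pos,ref,alt key, the last one read wins, matching the behaviour of the awk version.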



The unix command toolset will not help you with this problem.






answered Sep 19 '17 at 8:39 by Philippos, 5,952