Merge 2 huge files matching multiple columns and preserving the order (print matching and non-matching values) - scale up from awk
I'm having trouble merging data from 2 files. It is genetic data with chromosome, position, reference and alternate alleles, and I need to merge the files by matching all 4 of those columns, with the reference and alternate alleles either way round. So I need columns $1,$2,$4 and $5 OR $1,$2,$5 and $4 in the lookup file to exactly match columns $1,$5,$6 and $7 in the datafile. It is really important that I keep the exact order in the datafile; I can't sort it (so sadly can't use join, which is the suggested answer I have found in other instances of this sort of question).
I have used awk and got the code working for sample files with a few thousand rows, but it does not scale up well to my large dataset (the lookup file has >300 million rows; the datafile has 30 million), presumably because the code keeps 2 huge lookup arrays in memory. Any suggestions for scalable code (in Perl, perhaps?) gratefully received!
Format of lookup file is:
1 10150 rs371194064 C T
1 10165 rs796884232 A AC
1 10177 rs367896724 A AC
1 10177 rs201752861 A C
1 10180 rs201694901 T C
1 10199 rs905327004 A T
1 10231 rs200279319 C A
1 10234 rs145599635 C T
1 10235 rs540431307 T TA
1 10235 rs1035249121 T A
1 10235 rs1035249121 T C
1 10241 rs960927773 T C
1 10247 rs796996180 T C
1 10248 rs148908337 A T
1 10249 rs774211241 AAC A
and format of my datafile is
1 chr1 chr1:10177 1:10177_A_AC 10177 A AC
1 chr1 chr1:10235 1:10235_T_TA 10235 T TA
1 chr1 chr1:10352 1:10352_T_TA 10352 T TA
1 chr1 chr1:10505 1:10505_A_T 10505 A T
1 chr1 chr1:10506 1:10506_C_G 10506 C G
1 chr1 chr1:10511 1:10511_G_A 10511 G A
1 chr1 chr1:10539 1:10539_C_A 10539 C A
1 chr1 chr1:10542 1:10542_C_T 10542 C T
1 chr1 chr1:10579 1:10579_C_A 10579 C A
The output should look like:
1 rs367896724 1:10177_A_AC 10177 A AC A AC
1 rs540431307 1:10235_T_TA 10235 T TA T TA
1 chr1:10352 1:10352_T_TA 10352 T TA T TA
1 chr1:10505 1:10505_A_T 10505 A T A T
1 chr1:10506 1:10506_C_G 10506 C G C G
1 chr1:10511 1:10511_G_A 10511 G A G A
1 chr1:10539 1:10539_C_A 10539 C A C A
1 chr1:10542 1:10542_C_T 10542 C T C T
1 chr1:10579 1:10579_C_A 10579 C A C A
The awk code I have managed to get working for the sample file is as follows:
awk 'BEGIN { OFS = "\t" }
NR == FNR {                      # lookup file (323 million rows)
    key = $1 "," $2 "," $4 "," $5        # ref/alt as given
    ID[key] = $3; Ref[key] = $4; Alt[key] = $5
    key1 = $1 "," $2 "," $5 "," $4       # ref/alt swapped
    ID1[key1] = $3; Ref1[key1] = $4; Alt1[key1] = $5
    next
}
{                                # my data file (3 million rows)
    key = $1 "," $5 "," $6 "," $7
    if      (key in ID)  print $1, ID[key],  $4, $5, $6, $7, Ref[key],  Alt[key]
    else if (key in ID1) print $1, ID1[key], $4, $5, $6, $7, Ref1[key], Alt1[key]
    else                 print $1, $3, $4, $5, $6, $7, $6, $7
}' "$lookupfile" "$mydatafile" > "$outputfile"
(Note: the posted version set key1 from the unswapped columns $4,$5, so the swapped-allele branch could never match; the separate present/present1 flag arrays are redundant once `key in ID` is used.)
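One way to attack the memory problem without sorting anything: since the datafile is an order of magnitude smaller than the lookup file, invert the roles so the in-memory hash holds only datafile keys. A minimal three-pass sketch (not from the post; file names and sample rows are illustrative stand-ins for $lookupfile/$mydatafile):

```shell
#!/bin/sh
# Pass 1 collects the keys the datafile needs; pass 2 streams the huge lookup
# file and keeps only matching rows; pass 3 re-reads the datafile, so the
# original order is preserved automatically.

cat > lookup.txt <<'EOF'
1 10177 rs367896724 A AC
1 10235 rs540431307 T TA
EOF

cat > data.txt <<'EOF'
1 chr1 chr1:10177 1:10177_A_AC 10177 A AC
1 chr1 chr1:10235 1:10235_T_TA 10235 T TA
1 chr1 chr1:10352 1:10352_T_TA 10352 T TA
EOF

awk '
FNR == 1 { pass++ }
pass == 1 {                      # pass 1: datafile keys, both allele orders
    want[$1 "," $5 "," $6 "," $7] = 1
    want[$1 "," $5 "," $7 "," $6] = 1
    next
}
pass == 2 {                      # pass 2: keep only lookup rows that are needed
    k = $1 "," $2 "," $4 "," $5
    if (k in want) { id[k] = $3; ref[k] = $4; alt[k] = $5 }
    next
}
{                                # pass 3: datafile again, in original order
    k  = $1 "," $5 "," $6 "," $7
    k1 = $1 "," $5 "," $7 "," $6
    if      (k  in id) print $1, id[k],  $4, $5, $6, $7, ref[k],  alt[k]
    else if (k1 in id) print $1, id[k1], $4, $5, $6, $7, ref[k1], alt[k1]
    else               print $1, $3,     $4, $5, $6, $7, $6, $7
}' data.txt lookup.txt data.txt > merged.txt

cat merged.txt
```

Memory now scales with the datafile rather than the lookup file, and the lookup file is only streamed once; the cost is reading the datafile twice.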
Tags: awk, columns, merge
"Merge", "huge files", "preserve order": pick two.
– Satō Katsura, Sep 19 '17 at 8:20
Your best bet is probably to add line numbers (to preserve order) and import both files into a SQL database. Then your problem becomes a simple join.
– Satō Katsura, Sep 19 '17 at 8:30
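The same "decorate with line numbers, join, restore order" idea can be sketched with the standard toolset instead of a database (illustrative only, not from the thread; file names and sample rows are made up). Each lookup row is emitted twice so one join covers both allele orders:

```shell
#!/bin/sh
# C locale so sort and join agree on collation.
export LC_ALL=C

cat > lookup.txt <<'EOF'
1 10177 rs367896724 A AC
1 10235 rs540431307 T TA
EOF

cat > data.txt <<'EOF'
1 chr1 chr1:10177 1:10177_A_AC 10177 A AC
1 chr1 chr1:10235 1:10235_T_TA 10235 T TA
1 chr1 chr1:10352 1:10352_T_TA 10352 T TA
EOF

# Lookup rows keyed chrom,pos,ref,alt; emitted twice to cover swapped alleles.
awk '{ print $1 "," $2 "," $4 "," $5, $3, $4, $5
       print $1 "," $2 "," $5 "," $4, $3, $4, $5 }' lookup.txt |
    sort -k1,1 > lookup.keyed

# Datafile rows: join key first, original line number second (to restore order).
awk '{ print $1 "," $5 "," $6 "," $7, NR, $1, $3, $4, $5, $6, $7 }' data.txt |
    sort -k1,1 > data.keyed

# -a1 keeps unmatched datafile rows; sort -k2,2n restores the original order.
# Matched join output has 11 fields (key, lineno, 6 data cols, id, ref, alt);
# unmatched output has only the 8 datafile fields.
join -a1 data.keyed lookup.keyed |
    sort -k2,2n |
    awk '{ if (NF == 11) print $3, $9, $5, $6, $7, $8, $10, $11   # matched
           else          print $3, $4, $5, $6, $7, $8, $7, $8     # unmatched
         }' > joined.txt

cat joined.txt
```

sort does an external merge on disk, so memory stays bounded; the trade-off is the one-time cost of sorting the 300-million-row decorated lookup stream.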
@Satà ÂKatsura The order of the data file is to be kept, but this is not critical. The critical part is the lookup table, but you can modify that as you like.
â Philippos
Sep 19 '17 at 10:46
This is on topic here and welcome to stay. I would also expect you have a much better chance of an answer here than on Stack Overflow. That said, it might be wrong to look at this as a simple text parsing issue. It looks like you are essentially trying to attach rsIDs to a set of variants, so essentially to annotate a VCF file, or something very similar. You might want to explain your issue from the beginning and ask on Bioinformatics instead. But there, just explain what your final objective is. I am guessing there's no need to use the enormous lookup file to begin with.
– terdon♦, Sep 19 '17 at 12:55
One more thing. Why is the order important? In your example, it looks like you've just sorted by chromosome and position. Is that so? If yes, we can easily get the original order back by sorting again. Would that be acceptable?
– terdon♦, Sep 19 '17 at 12:56
edited Aug 21 at 12:41 by Rui F Ribeiro
asked Sep 19 '17 at 8:14 by Katie Fletcher
1 Answer
The problem is not keeping that file in memory, but scanning through the lookup table for each line of the data file. Your code doesn't show it, but behind the scenes you do 3,000,000 × 323,000,000/2, almost half a quadrillion, string comparisons, moving thousands of terabytes over your memory bus. Even for fast memory at 200 GBit/s this will take many hours.
So the key to the problem is how to store the lookup table. I suggest using a binary tree to cut the execution time exponentially. You can do this in Perl or C or some other language, but at that point it becomes off-topic here. The Unix command toolset will not help you with this problem.
answered Sep 19 '17 at 8:39 by Philippos