Comparing CSV files to look up column 1, then check values in column 2
OK, I will try to explain what I need to do as clearly as possible.
Basically I have two CSV files, as per the examples below:
File 1:
Column 1, Column 2
abc , 123
def , 234
adf , 567
File 2:
Column 1, Column 2
abc , 123
def , 234
adf , 578
I need to write either a shell script or simple command that will do the following:
- Sort both files by column 1
- Row by row, do the following:
- Using column 1 in file 1, search for this value in column 1 in file 2.
- if found, compare the value in column 2 in file 1 against the value in column 2 of file 2
- if it matches, write column 1, column 2 and "Validated" in column 3 to a separate file
- if it does not match, write column 1, column 2 and "Failed" to a separate file
This results in two output files: the first containing every row where column 1 was found and column 2 matches, and a second containing rows where the column 1 lookup failed, or where column 1 was found but column 2 did not match. Essentially, column 1 is used as the key to check column 2.
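Something like the following is the shape of pipeline I have in mind, though I do not have a working version; this is only an untested sketch, and file1.csv, file2.csv, validated.csv and failed.csv are placeholder names:
sort file1.csv > f1.sorted
sort file2.csv > f2.sorted
# join on column 1; -a 1 also keeps rows from file 1 whose key is absent from file 2
join -t, -a 1 f1.sorted f2.sorted | awk -F' *, *' '
    $2 == $3 { print $1 ", " $2 ", Validated" > "validated.csv" }
    $2 != $3 { print $1 ", " $2 ", Failed"    > "failed.csv"    }'
# (header rows would need to be skipped or handled separately)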
shell-script text-processing columns file-comparison
asked Mar 21 '17 at 17:08 by Veyron
edited Mar 21 '17 at 18:08 by G-Man
Your question would be clearer if you (1) showed exactly what you want the output files to look like for your example input, and (2) described (and illustrated) what you want to happen if a value in column 1 of file 1 is not present in file 2. Also, have you made any attempt to solve this yourself? Done any research? What have you found? What have you tried?
– G-Man, Mar 21 '17 at 18:10
Hi, apologies, yes, I have been looking into this extensively. I first tried the following, but it did not give the desired result, so I had to play around with the columns etc.: awk -F, 'NR==FNR{a[$1,$3]++; next} (a[$1,$2])' file1.txt file2.txt. I have also been looking at whether I should be doing this in SQL, but that means importing one of the files into SQL and running inner joins, which with 150 million lines can take a while. I am still investigating and will obviously share anything I find that works on this post.
– Veyron, Mar 22 '17 at 14:24
2 Answers
Given the following input files:
$ cat in1 in2
Column 1, Column 2
abc , 123
def , 234
adf , 567
Column 1, Column 2
abc , 123
def , 234
adf , 578
First, we sort them; we can then stitch them together into one file:
$ sort in1 > in1.sorted; sort in2 > in2.sorted; paste in{1,2}.sorted
Column 1, Column 2 Column 1, Column 2
abc , 123 abc , 123
adf , 567 adf , 578
def , 234 def , 234
awk will help us here, but the commas get in our way; we can get rid of them with sed first:
$ paste in{1,2}.sorted | sed 's/,//g'
Column 1 Column 2 Column 1 Column 2
abc 123 abc 123
adf 567 adf 578
def 234 def 234
And then we can dump that through a quick awk:
$ paste in{1,2}.sorted | sed 's/,//g' | awk '$2 == $4 {print $1,"Validated"}; $2 != $4 {print $1,"Failed"}'
Column Failed
abc Validated
adf Failed
def Validated
This can also be done in raw awk, with the advantage of being able to strip out the header rows and not rely on the same data being in the same order, thus removing the need for sorting:
$ awk 'FNR != 1 && NR == FNR {data[$1]=$3} FNR != 1 && NR != FNR {if (data[$1]==$3) {print $1, "Validated"} else {print $1, "Failed"}}' in{1,2}
abc Validated
adf Failed
def Validated
This relies on a few magic awk built-in variables and tricks related to them:
- NR - the total number of records processed
- FNR - the number of records processed so far in the current file
- FNR != 1 - skips the first row of each file (does not treat the headers as data)
- NR != FNR - true only after the first file has been completely read and we have started reading subsequent files; this allows us to prepopulate the data array for testing once we start chewing on the second file
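If the two separate output files from the original question are needed, the same idea extends naturally. The following is only an untested sketch along those lines (validated.txt and failed.txt are placeholder names):
$ awk 'FNR != 1 && NR == FNR {data[$1] = $3; next}
       FNR != 1 {
           # second file: compare against the stored value for this key
           if ($1 in data && data[$1] == $3)
               print $1", "$3", Validated" > "validated.txt"
           else
               print $1", "$3", Failed" > "failed.txt"
       }' in{1,2}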
answered Mar 21 '17 at 17:25 by DopeGhoti
edited Mar 21 '17 at 22:50
Here is a modified version as a one-liner: paste <(sort in1) <(sort in2) | tr -d , | awk '$2 == $4 {print $1,"Validated"}; $2 != $4 {print $1,"Failed"}'
– hschou, Mar 21 '17 at 22:23
Thank you. I am working on the final lists to compare; I will give this a try, let you know the end results, and see whether I need to tweak anything. Will update you on progress.
– Veyron, Mar 22 '17 at 14:28
OK, so I did a test with this. The first file is the complete source list of 15 million lines; the second file is a subset of what has been checked, ~22,000 lines. The end result is that only about 50 lines were validated. I am looking into another method and will post here shortly.
– Veyron, Mar 22 '17 at 15:48
I think I have this sorted out now with the following, in case anyone else reads this and needs it. Thanks again.
FNR == NR {
    for (i = 2; i <= NF; i++) a[i,$1] = $i;
    b[$1];
    next;
}
($1 in b) {    # check if row in file2 existed in file1
    for (i = 2; i <= NF; i++) {
        if (a[i,$1] == $i)
            printf("%s->col%d: %s vs %s: Valid\n", $1, i-1, a[i,$1], $i);
        else
            printf("%s->col%d: %s vs %s: Failure\n", $1, i-1, a[i,$1], $i);
    }
    delete b[$1];    # delete entries which are processed
}
END {
    for (left in b)    # look at which keys didn't match
        for (i = 2; i <= NF; i++)
            printf("%s->col%d: %s vs (blank): Not Equal\n", left, i-1, a[i,left]);
}
answered Mar 23 '17 at 11:31 by Veyron
Further to my email: although I thought I had an answer, I have tested this with a couple of CSV files and it does not work. I have spent hours going through this but cannot seem to work out where the awk statement is wrong. Any help would be appreciated; thanks in advance.
– Veyron, Mar 27 '17 at 4:28