Matching two files and printing lines that appear first time

I have two files that look like this:

file1 (unique IDs):

and file2:

 1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
 2 C98230482 score: -57.431 nathvy = 47 nconfs = 575
 3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
 4 C36510773 score: -56.502 nathvy = 38 nconfs = 7595
 5 C04355288 score: -56.400 nathvy = 41 nconfs = 50502
 6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
 7 C96209347 score: -54.713 nathvy = 24 nconfs = 162
 8 C96209347 score: -53.901 nathvy = 24 nconfs = 159
 9 C06169346 score: -53.438 nathvy = 22 nconfs = 105
 10 C95696352 score: -52.848 nathvy = 38 nconfs = 878
 11 C98216318 score: -52.061 nathvy = 52 nconfs = 1092
 12 C04285713 score: -52.009 nathvy = 38 nconfs = 1355
 13 C96209347 score: -51.477 nathvy = 24 nconfs = 1375
 14 C98222837 score: -50.730 nathvy = 34 nconfs = 588
 15 C98216318 score: -50.694 nathvy = 52 nconfs = 1136
 16 C32832068 score: -50.546 nathvy = 22 nconfs = 548
 17 C95696352 score: -50.475 nathvy = 38 nconfs = 3220
 18 C32832068 score: -50.457 nathvy = 22 nconfs = 16235
 19 C95696352 score: -50.234 nathvy = 38 nconfs = 3048
 20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
 21 C72332782 score: -49.676 nathvy = 41 nconfs = 3942
 22 C97970648 score: -49.616 nathvy = 45 nconfs = 17640
 23 C04285713 score: -49.594 nathvy = 38 nconfs = 14038
 24 C98043133 score: -49.370 nathvy = 43 nconfs = 1236
 25 C89372772 score: -49.308 nathvy = 22 nconfs = 471
 26 C97970648 score: -49.297 nathvy = 45 nconfs = 17850
 27 C85594749 score: -49.122 nathvy = 44 nconfs = 4158
 28 C70006381 score: -49.092 nathvy = 24 nconfs = 880

I would like to match the IDs from file1 against the IDs in the second column of file2 and print the matching lines. Some IDs in file2 repeat, such as C96209347 (although the whole lines are not identical). For each ID, I want to print only the line where it appears for the first time and skip the others. So for C96209347 in this example, only the third line of file2 should be printed. Can anybody help?

edited Aug 31 at 7:45

pa4080


asked Aug 31 at 7:25

sergio


command-line text-processing


2 Answers

accepted

Try this,

grep -f file1 file2 | awk '!_[$2]++'

 1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
 3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
 6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
 20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536

Explanation

grep -f file1 file2: search file2 for lines that match any of the patterns read from file1 (one pattern per line).

awk '!_[$2]++': print a line only if its field $2 has not been seen before.
- _ is the array name (it can be anything, e.g. "seen").
- _[$2]++ creates an array entry keyed by the content of field $2 on first use and adds 1 to it. Because it is a post-increment, the expression evaluates to the old value, which is 0 the first time.
- If _[$2] was not (!) already set, the condition is true and the line is printed. print is the default action that awk performs when a condition matches and no action is given.
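
The same idea can be written with a named array and an explicit action, which may be easier to read. The following is only a sketch: the temporary directory, the shortened sample data and the variable name result are made up for illustration and are not the asker's real files. It also swaps in grep -Fw, because plain grep -f treats every line of file1 as a regular expression and matches it anywhere in the line; -F (fixed strings) and -w (whole words) are a stricter variant, assuming the IDs should be matched literally.

```shell
# Hypothetical sample data in a temporary directory (not the asker's real files).
tmp=$(mktemp -d)
printf '%s\n' C95696352 C96209347 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
 1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
 2 C98230482 score: -57.431 nathvy = 47 nconfs = 575
 3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
 7 C96209347 score: -54.713 nathvy = 24 nconfs = 162
 10 C95696352 score: -52.848 nathvy = 38 nconfs = 878
EOF

# -F: patterns are fixed strings, -w: match whole words only,
# -f: read the patterns (one ID per line) from file1.
# The awk part is '!_[$2]++' with a readable array name and an explicit print.
result=$(grep -Fwf "$tmp/file1" "$tmp/file2" |
  awk '!seen[$2]++ { print }')

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data only the lines numbered 1 and 3 are kept; lines 7 and 10 repeat an ID that was already printed, and line 2 has an ID that is not in file1.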

edited Aug 31 at 7:52

answered Aug 31 at 7:40

RoVo


This works. Thank you very much! All the best
- sergio, Aug 31 at 7:45

Wow, nice and simple solution.
- abu_bua, yesterday


With awk alone:

$ awk 'NR==FNR {a[$1]=1; next} $2 in a {print; delete a[$2]}' file1 file2
 1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
 3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
 6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
 20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
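
For readers new to the two-file awk idiom, the one-liner can be unpacked roughly as below. This is a sketch under assumptions: the temporary directory, the shortened sample data, the array name wanted and the variable first_hits are hypothetical and chosen only for illustration.

```shell
# Hypothetical sample data in a temporary directory (not the asker's real files).
tmp=$(mktemp -d)
printf '%s\n' C96209347 C89372772 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
 3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
 4 C36510773 score: -56.502 nathvy = 38 nconfs = 7595
 6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
 7 C96209347 score: -54.713 nathvy = 24 nconfs = 162
EOF

first_hits=$(awk '
  # NR == FNR holds only while the first file (file1) is being read:
  # remember each ID, then skip to the next input line.
  NR == FNR { wanted[$1] = 1; next }
  # Second file: if the ID in field 2 is still wanted, print the line
  # and forget the ID so that later repeats are skipped.
  $2 in wanted { print; delete wanted[$2] }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$first_hits"
rm -rf "$tmp"
```

Because the IDs are looked up as exact array keys, this compares the whole second field rather than matching a pattern anywhere in the line. One caveat of the NR==FNR test: if file1 is empty, the lines of file2 are treated as if they belonged to the first file.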

answered Aug 31 at 10:57

steeldriver

