How to remove duplicate values in a tab-delimited text file

up vote
5
down vote

I have a tab-delimited text file like the one below:

A B1 B1 C1
B B2 D2 
C C12 C13 C13
D D3 D5 D9
G F2 F2

How can I convert the table above into the one below?

A B1 C1
B B2 D2 
C C12 C13
D D3 D5 D9
G F2

I have extracted a sample of my real data file. It is a tab-delimited file, and the command line you (Stéphane Chazelas?) posted works fine, but it couldn't remove the duplicates in the last column:

A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274
B NEK2 NEK6 NEK10 NEK10 NEKL-4
C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3
D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8
E AGO2 AGO2 AGO2 AGO2 AGO2

The output needs to be as below:

A CD274 CD276 PDCD1LG2
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2

edited Sep 26 '17 at 22:33

Kusalananda

asked Sep 26 '17 at 21:15

desu

Is the order of the fields within an output line important? E.g. AGO2 E, or C OTUD7B TNFAIP3.
– αғsнιη
Sep 27 '17 at 9:09

@αғsнιη A B C seems to be the line numbering, I think at least they should stay there.
– dessert
Sep 27 '17 at 10:01

If you're happy with one or several of the answers, upvote them. If one is solving your issue, accepting it would be the best way of saying "Thank You!" :-)
– Kusalananda
Sep 27 '17 at 10:50

text-processing csv-simple

7 Answers
up vote
7
down vote

First set of example data:

$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) { if (!t[$i]++) { r = r ? r OFS $i : $i } } print r }' file
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

Second set of example data (same awk script):

$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) { if (!t[$i]++) { r = r ? r OFS $i : $i } } print r }' file
A CD274 PDCD1LG2 CD276
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2

The script reads the input file, file, line by line, and for each line it goes through each field, building up the output line, r. If the value in a field has already been added to the output line (determined by a lookup table, t, of used field values), then the field is ignored, otherwise it's added.

When all the fields of an input line have been processed, the constructed line is printed.

The output field delimiter is set to tab through -vOFS='\t' on the command line.

The awk script unravelled:

{
    r = ""
    delete t

    for (i = 1; i <= NF; ++i) {
        if (!t[$i]++) {
            r = r ? r OFS $i : $i
        }
    }

    print r
}
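
For awk implementations that lack whole-array delete, the same idea can be sketched with split("", seen) to empty the lookup table. This is only a sketch under a few assumptions that are not in the answer: the wrapper name dedupe_fields and the variable names seen, out and n are made up, the input separator is set to a tab explicitly, and empty fields (such as the one produced by a trailing tab) are skipped.

```sh
# dedupe_fields (hypothetical helper): remove repeated fields within each
# tab-delimited line, keeping the first occurrence of every value.
dedupe_fields() {
    awk -F '\t' -v OFS='\t' '{
        split("", seen)                  # portable way to empty the array
        out = ""
        n = 0                            # number of fields kept so far
        for (i = 1; i <= NF; ++i) {
            # skip empty fields and values already seen on this line
            if ($i == "" || seen[$i]++) continue
            out = (n++ ? out OFS : "") $i
        }
        print out
    }' "$@"
}

# Usage: two lines from the second example in the question.
printf 'A\tCD274\tPDCD1LG2\tCD276\tPDCD1LG2\tCD274\nE\tAGO2\tAGO2\tAGO2\tAGO2\tAGO2\n' |
    dedupe_fields
```

Counting the kept fields in n, rather than testing whether the output string is empty, avoids relying on how awk evaluates a value such as 0 in a boolean context.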

edited Sep 26 '17 at 23:22

answered Sep 26 '17 at 22:54

Kusalananda

2

See split("", t) for the POSIX equivalent to delete t
– Stéphane Chazelas
Sep 27 '17 at 6:45

up vote
6
down vote

sed/tr, uniq and paste

while read -r l; do sed 's/\t/\n/g' <<< "$l" | uniq | paste -s; done < test

or POSIX compliant:

while read -r l; do echo "$l" | tr '\t' '\n' | uniq | paste -s -; done < test

For the file test this will line by line replace all Tab characters with linebreaks, run uniq to delete dupes and replace the linebreaks with Tab characters again.

$ cat test
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2

$ while read -r l; do sed 's/\t/\n/g' <<< "$l" | uniq | paste -s; done < test
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

NB: This solution will not work for duplicates over multiple rows, e.g. C1 in

A B1 B1 C1
C1 B B2 D2
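
A related limitation: uniq only collapses adjacent repeats, so a line where a value comes back after other fields (as in the question's second data set) keeps its duplicates. A minimal sketch of the same loop with an awk filter in place of uniq, assuming tab-separated input; the function name dedupe_line_by_line and the array name seen are made up for illustration:

```sh
# dedupe_line_by_line (hypothetical helper): same tr/paste pipeline as above,
# but awk remembers every value seen on the current line, adjacent or not.
dedupe_line_by_line() {
    while IFS= read -r l || [ -n "$l" ]; do
        printf '%s\n' "$l" |
            tr '\t' '\n' |              # one field per line
            awk 'NF && !seen[$0]++' |   # keep first occurrence, drop empty fields
            paste -s -                  # join the survivors with tabs again
    done
}

# Usage: a line with non-adjacent duplicates.
printf 'A\tCD274\tPDCD1LG2\tCD276\tPDCD1LG2\tCD274\n' | dedupe_line_by_line
```

Each input line gets its own awk process, so seen starts empty for every line. That keeps the pipeline simple but, like the loop above, it spawns several processes per line.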

edited Sep 26 '17 at 22:19

answered Sep 26 '17 at 21:26

dessert

up vote
6
down vote

Maybe something like:

gawk -vRS='\\s*\\S*' -vORS= '{$0=RT};$1!=prev;{prev=$1}'

The RS=pattern...{$0=RT} trick lets you process records defined as the parts that match the pattern.

So here, we're slicing the input into <whitespace><non-whitespace> $0 records, <non-whitespace> goes in $1 (the first and only field). We're printing the records whose $1 is not equal to the previous one.

On an input like:

A B1 B1 C1
B B2 D2 
C C12 C13 C13
D D3 D5 D9
G F2 F2

The records are:


[A][ B1][ B1][ C1][
B][ B2][ D2][ 
C][ C12][ C13][ C13][
D][ D3][ D5][ D9][
G][ F2][ F2][
]

Doesn't work for your second example though and note that it could remove some newline characters.

edited Sep 27 '17 at 6:48

answered Sep 26 '17 at 21:34

Stéphane Chazelas

What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
– dessert
Sep 26 '17 at 22:00

A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
– desu
Sep 26 '17 at 22:13

3

@desu, whatever you're trying to say to clarify your question, please edit it into your question. You may want to take the tour for some advice on how to ask great questions.
– Stéphane Chazelas
Sep 26 '17 at 22:17

@desu add -F'\n' to separate the input lines, so gawk -F'\n' -vRS='\\s*\\S*' -vORS= '{$0=RT};$1!=prev;{prev=$1}'
– αғsнιη
Sep 27 '17 at 9:48

1

@αғsнιη, not sure what you mean. \n is already included in the default FS. The problem here is that if that \n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
– Stéphane Chazelas
Sep 27 '17 at 9:57

up vote
2
down vote

This is more of a code-golf / freak challenge solution:

xargs -L1 -I{} echo '; {}' < ./test.txt | 
 xargs -n1 | 
 uniq | 
 xargs | 
 sed -e 's/; /\n/g' -e 's/ \+/\t/g'

But it avoids using loops and all other heavy machinery seen in other answers.

It also assumes that your data doesn't contain a ; character.

answered Sep 27 '17 at 7:08

wvxvw

It also assumes no ", ' or backslash characters and that none of the words look like -n, -e, -nEne... (depending on the echo implementation). It also assumes GNU sed. It still spawns one echo process per line. But it's true that it's less heavy than some of the while loops seen around. It doesn't work for the updated requirements where the duplicated words may no longer be contiguous.
– Stéphane Chazelas
Sep 27 '17 at 10:43

@StéphaneChazelas the argument to echo is quoted, so that the values that look like options won't be interpreted as such. What part of the sed call isn't POSIX? (I honestly don't know).
– wvxvw
Sep 27 '17 at 10:53

No, quoting doesn't prevent option processing. Try printf '%s\n' -n -ne foo | xargs. Note that xargs -n1 means that one echo is being run for each word, which is quite heavy actually. \n, \+ and \t are GNU extensions, though you do find some other implementations supporting them nowadays.
– Stéphane Chazelas
Sep 27 '17 at 12:29

@StéphaneChazelas Well, maybe it's an echo implementation issue, but for me echo "-n 'foo'" | xargs -L1 -I{} echo '; {}' prints ; -n foo, i.e. -n wasn't treated as an option. Or, do you mean this will propagate to uniq? I think I see your point now.
– wvxvw
Sep 27 '17 at 13:19

Yes, it doesn't apply to the first echo as the argument starts with ;, it applies to the other ones (the ones implicitly run by xargs upon xargs or xargs -n1 alone).
– Stéphane Chazelas
Sep 27 '17 at 13:53

up vote
1
down vote

With perl:

unique words on each line:

perl -MList::Util=uniq -lape '$_ = join "\t", uniq @F'

unique words globally:

perl -lape '$_ = join "\t", grep {!$count{$_}++} @F'

Or to only consider words of each line starting with the 2nd one:

perl -lape '$_ = join "\t", shift(@F), grep {!$count{$_}++} @F'

edited Sep 27 '17 at 10:45

answered Sep 27 '17 at 10:08

Stéphane Chazelas

up vote
0
down vote

With bash v4.3 (if you don't mind the order of the fields: they are sorted, except for the first one)

while IFS=$'\n' read -r line; 
 do aline=( $line );
 echo ${aline[0]} $(sort -u <(printf "%s\n" ${aline[@]:1}));
done < infile

Explanation:

aline=( $line ) splits the line into words and saves them in an array 'aline'

${aline[0]} expands to the first element of the array 'aline' (array indexes start at zero in bash)

printf "%s\n" ${aline[@]:1} prints each element of the array 'aline' on a separate line, skipping the first element; then

sort -u sorts those lines and removes duplicate entries

echo then joins the sorted elements back into a single line, because the unquoted command substitution is split into words. Note that echo separates them with spaces, not tabs.

See the example below for a better view of this step:
```
printf "C\n4\nB\nC" |sort -u 
4
B
C
echo $(printf "C\n4\nB\nC" |sort -u)
4 B C
```

This will give output as:

A CD274 CD276 PDCD1LG2
B NEK10 NEK2 NEK6 NEKL-4
C OTUD7B TNFAIP3
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2
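
If the output should stay tab-delimited, the same sort -u idea can be sketched with paste doing the joining. This is only an illustration under stated assumptions: sort_unique_fields is a made-up name, fields are assumed to contain no spaces (the line is split on default whitespace, as in the loop above), and LC_ALL=C is set so that the sort order does not depend on the locale.

```sh
# sort_unique_fields (hypothetical helper): keep the first field in place,
# sort and deduplicate the remaining fields, and join everything with tabs.
sort_unique_fields() {
    while IFS= read -r line || [ -n "$line" ]; do
        set -f              # no pathname expansion while splitting
        set -- $line        # split on default IFS (space, tab, newline)
        set +f
        [ "$#" -eq 0 ] && { printf '\n'; continue; }
        first=$1
        shift
        if [ "$#" -gt 0 ]; then
            # one field per line -> unique, sorted -> joined with tabs
            rest=$(printf '%s\n' "$@" | LC_ALL=C sort -u | paste -s -)
            printf '%s\t%s\n' "$first" "$rest"
        else
            printf '%s\n' "$first"
        fi
    done
}

# Usage: the second line of the question's real data.
printf 'B\tNEK2\tNEK6\tNEK10\tNEK10\tNEKL-4\n' | sort_unique_fields
```

As with the bash loop above, the original field order is lost for everything after the first column.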

edited Sep 27 '17 at 10:46

answered Sep 27 '17 at 10:08

αғsнιη

up vote
0
down vote

sed substitution with back reference

sed -re 's/\s+$//; s/(\t[^\t]+)\1+$/\1/'

(s/\s+$// gets rid of trailing white-space like in your example.)

answered Sep 27 '17 at 11:36

David Foerster

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f394634%2fhow-to-remove-duplicate-value-in-a-tab-delimited-text-file%23new-answer', 'question_page');

);

Post as a guest

Name

7 Answers
7

active

oldest

votes

7 Answers
7

active

oldest

votes

up vote
7
down vote

First set of example data:

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

Second set of example data (same awk script):

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A CD274 PDCD1LG2 CD276
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2

When all the fields of an input line have been processed, the constructed line is outputted.

The output field delimiter is set to tab through -vOFS='t' on the command line.

The awk script unravelled:


 r = ""
 delete t

 for (i = 1; i <= NF; ++i) 
 if (!t[$i]++) 
 r = r ? r OFS $i : $i
 
 

 print r

edited Sep 26 '17 at 23:22

answered Sep 26 '17 at 22:54

Kusalananda

106k14209327

2

See split("", t) for the POSIX equivalent to delete t
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 6:45

add a commentÂ |Â

up vote
7
down vote

First set of example data:

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

Second set of example data (same awk script):

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A CD274 PDCD1LG2 CD276
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2

When all the fields of an input line have been processed, the constructed line is outputted.

The output field delimiter is set to tab through -vOFS='t' on the command line.

The awk script unravelled:


 r = ""
 delete t

 for (i = 1; i <= NF; ++i) 
 if (!t[$i]++) 
 r = r ? r OFS $i : $i
 
 

 print r

edited Sep 26 '17 at 23:22

answered Sep 26 '17 at 22:54

Kusalananda

106k14209327

2

See split("", t) for the POSIX equivalent to delete t
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 6:45

add a commentÂ |Â

up vote
7
down vote

First set of example data:

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

Second set of example data (same awk script):

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A CD274 PDCD1LG2 CD276
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2

When all the fields of an input line have been processed, the constructed line is outputted.

The output field delimiter is set to tab through -vOFS='t' on the command line.

The awk script unravelled:


 r = ""
 delete t

 for (i = 1; i <= NF; ++i) 
 if (!t[$i]++) 
 r = r ? r OFS $i : $i
 
 

 print r

edited Sep 26 '17 at 23:22

answered Sep 26 '17 at 22:54

Kusalananda

106k14209327

First set of example data:

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

Second set of example data (same awk script):

$ awk -vOFS='t' ' r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i print r ' file
A CD274 PDCD1LG2 CD276
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2

When all the fields of an input line have been processed, the constructed line is outputted.

The output field delimiter is set to tab through -vOFS='t' on the command line.

The awk script unravelled:


 r = ""
 delete t

 for (i = 1; i <= NF; ++i) 
 if (!t[$i]++) 
 r = r ? r OFS $i : $i
 
 

 print r

edited Sep 26 '17 at 23:22

answered Sep 26 '17 at 22:54

Kusalananda

106k14209327

edited Sep 26 '17 at 23:22

answered Sep 26 '17 at 22:54

Kusalananda

106k14209327

answered Sep 26 '17 at 22:54

Kusalananda

106k14209327

answered Sep 26 '17 at 22:54

Kusalananda

106k14209327

2

See split("", t) for the POSIX equivalent to delete t
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 6:45

add a commentÂ |Â

2

See split("", t) for the POSIX equivalent to delete t
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 6:45

See split("", t) for the POSIX equivalent to delete t
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 6:45

add a commentÂ |Â

up vote
6
down vote

sed/tr, uniq and paste

while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test

or POSIX compliant:

while read -r l; do echo "$l" | tr 't' 'n' | uniq | paste -s -; done < test

For the file test this will line by line replace all Tab characters with linebreaks, run uniq to delete dupes and replace the linebreaks with Tab characters again.

$ cat test
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2

$ while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

NB: This solution will not work for duplicates over multiple rows, e.g. C1 in

A B1 B1 C1
C1 B B2 D2

edited Sep 26 '17 at 22:19

answered Sep 26 '17 at 21:26

dessert

1,013321

add a commentÂ |Â

up vote
6
down vote

sed/tr, uniq and paste

while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test

or POSIX compliant:

while read -r l; do echo "$l" | tr 't' 'n' | uniq | paste -s -; done < test

For the file test this will line by line replace all Tab characters with linebreaks, run uniq to delete dupes and replace the linebreaks with Tab characters again.

$ cat test
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2

$ while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

NB: This solution will not work for duplicates over multiple rows, e.g. C1 in

A B1 B1 C1
C1 B B2 D2

edited Sep 26 '17 at 22:19

answered Sep 26 '17 at 21:26

dessert

1,013321

add a commentÂ |Â

up vote
6
down vote

sed/tr, uniq and paste

while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test

or POSIX compliant:

while read -r l; do echo "$l" | tr 't' 'n' | uniq | paste -s -; done < test

For the file test this will line by line replace all Tab characters with linebreaks, run uniq to delete dupes and replace the linebreaks with Tab characters again.

$ cat test
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2

$ while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

NB: This solution will not work for duplicates over multiple rows, e.g. C1 in

A B1 B1 C1
C1 B B2 D2

edited Sep 26 '17 at 22:19

answered Sep 26 '17 at 21:26

dessert

1,013321

sed/tr, uniq and paste

while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test

or POSIX compliant:

while read -r l; do echo "$l" | tr 't' 'n' | uniq | paste -s -; done < test

For the file test this will line by line replace all Tab characters with linebreaks, run uniq to delete dupes and replace the linebreaks with Tab characters again.

$ cat test
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2

$ while read -r l; do sed 's/t/n/g' <<< "$l" | uniq | paste -s; done < test
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2

NB: This solution will not work for duplicates over multiple rows, e.g. C1 in

A B1 B1 C1
C1 B B2 D2

edited Sep 26 '17 at 22:19

answered Sep 26 '17 at 21:26

dessert

1,013321

edited Sep 26 '17 at 22:19

answered Sep 26 '17 at 21:26

dessert

1,013321

answered Sep 26 '17 at 21:26

dessert

1,013321

answered Sep 26 '17 at 21:26

dessert

1,013321

add a commentÂ |Â

up vote
6
down vote

Maybe something like:

gawk -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'

The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.

On an input like:

A B1 B1 C1
B B2 D2 
C C12 C13 C13
D D3 D5 D9
G F2 F2

The records are:


[A][ B1][ B1][ C1][
B][ B2][ D2][ 
C][ C12][ C13][ C13][
D][ D3][ D5][ D9][
G][ F2][ F2][
]

Doesn't work for your second example though and note that it could remove some newline characters.

edited Sep 27 '17 at 6:48

answered Sep 26 '17 at 21:34

StÃ©phane Chazelas

284k53523859

What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
â€“Â dessert
Sep 26 '17 at 22:00

A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â€“Â desu
Sep 26 '17 at 22:13

3

@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â€“Â StÃ©phane Chazelas
Sep 26 '17 at 22:17

@desu add -F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Sep 27 '17 at 9:48

1

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·, not sure what you mean. n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 9:57

Â |Â
show 3 more comments

up vote
6
down vote

Maybe something like:

gawk -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'

The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.

On an input like:

A B1 B1 C1
B B2 D2 
C C12 C13 C13
D D3 D5 D9
G F2 F2

The records are:


[A][ B1][ B1][ C1][
B][ B2][ D2][ 
C][ C12][ C13][ C13][
D][ D3][ D5][ D9][
G][ F2][ F2][
]

Doesn't work for your second example though and note that it could remove some newline characters.

edited Sep 27 '17 at 6:48

answered Sep 26 '17 at 21:34

StÃ©phane Chazelas

284k53523859

What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
â€“Â dessert
Sep 26 '17 at 22:00

A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â€“Â desu
Sep 26 '17 at 22:13

3

@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â€“Â StÃ©phane Chazelas
Sep 26 '17 at 22:17

@desu add -F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Sep 27 '17 at 9:48

1

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·, not sure what you mean. n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 9:57

Â |Â
show 3 more comments

up vote
6
down vote

Maybe something like:

gawk -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'

The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.

On an input like:

A B1 B1 C1
B B2 D2 
C C12 C13 C13
D D3 D5 D9
G F2 F2

The records are:


[A][ B1][ B1][ C1][
B][ B2][ D2][ 
C][ C12][ C13][ C13][
D][ D3][ D5][ D9][
G][ F2][ F2][
]

Doesn't work for your second example though and note that it could remove some newline characters.

edited Sep 27 '17 at 6:48

answered Sep 26 '17 at 21:34

StÃ©phane Chazelas

284k53523859

Maybe something like:

gawk -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'

The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.

On an input like:

A B1 B1 C1
B B2 D2 
C C12 C13 C13
D D3 D5 D9
G F2 F2

The records are:


[A][ B1][ B1][ C1][
B][ B2][ D2][ 
C][ C12][ C13][ C13][
D][ D3][ D5][ D9][
G][ F2][ F2][
]

Doesn't work for your second example though and note that it could remove some newline characters.

edited Sep 27 '17 at 6:48

answered Sep 26 '17 at 21:34

StÃ©phane Chazelas

284k53523859

edited Sep 27 '17 at 6:48

answered Sep 26 '17 at 21:34

StÃ©phane Chazelas

284k53523859

answered Sep 26 '17 at 21:34

StÃ©phane Chazelas

284k53523859

answered Sep 26 '17 at 21:34

StÃ©phane Chazelas

284k53523859

What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
â€“Â dessert
Sep 26 '17 at 22:00

A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â€“Â desu
Sep 26 '17 at 22:13

3

@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â€“Â StÃ©phane Chazelas
Sep 26 '17 at 22:17

@desu add -F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Sep 27 '17 at 9:48

1

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·, not sure what you mean. n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 9:57

Â |Â
show 3 more comments

What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
â€“Â dessert
Sep 26 '17 at 22:00

A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â€“Â desu
Sep 26 '17 at 22:13

3

@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â€“Â StÃ©phane Chazelas
Sep 26 '17 at 22:17

@desu add -F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Sep 27 '17 at 9:48

1

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·, not sure what you mean. n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 9:57

What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
â€“Â dessert
Sep 26 '17 at 22:00

A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â€“Â desu
Sep 26 '17 at 22:13

@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â€“Â StÃ©phane Chazelas
Sep 26 '17 at 22:17

@desu add -F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Sep 27 '17 at 9:48

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·, not sure what you mean. n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 9:57

Â |Â
show 3 more comments

up vote
2
down vote

This is more of a code-golf / freak challenge solution:

xargs -L1 -I echo '; ' < ./test.txt | 
 xargs -n1 | 
 uniq | 
 xargs | 
 sed -e 's/; /n/g' -e 's/ +/t/g'

But it avoids using loops and all other heavy machinery seen in other answers.

It also builds on an assumption your data doesn't contain ; character.

answered Sep 27 '17 at 7:08

wvxvw

3362412

It also assumes no ", ' backslash characters and that none of the words look like -n, -e, -nEne... (depending on the echo implementation) It also assume GNU sed. It still spawns one echo process per line. But it's true that it's less heavy than some of the while loops seen around. It doesn't work for the updated requirements where the duplicated words may no longer be contiguous.
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 10:43

@StÃ©phaneChazelas the argument to echo is quoted, so that the values that look like options won't be interpreted as such. What part of sed call isn't POSIX? (I honestly don't know).
â€“Â wvxvw
Sep 27 '17 at 10:53

No quoting doesn't prevent option processing. Try printf '%sn' -n -ne foo | xargs. Note that xargs -n1 means that one echo is being run for each word which is quite heavy actually. n, + and t are GNU extensions, though you do find some other implementations supporting it nowadays.
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 12:29

@StÃ©phaneChazelas Well, maybe it's echo implementation issue, but for me echo "-n 'foo'" | xargs -L1 -I echo '; ' prints ; -n foo, i.e. -n wasn't treated as an option. Or, do you mean this will propagate to uniq? I think I see your point now.
â€“Â wvxvw
Sep 27 '17 at 13:19

Yes, it doesn't apply to the first echo as the argument starts with ;, it applies to the other ones (the ones implictely run by xargs upon xargs or xargs -n1 alone).
â€“Â StÃ©phane Chazelas
Sep 27 '17 at 13:53

up vote
1
down vote

With perl:

unique words on each line:

perl -MList::Util=uniq -lape '$_ = join "\t", uniq @F'

unique words globally:

perl -lape '$_ = join "\t", grep {!$count{$_}++} @F'

Or to only consider words of each line starting with the 2nd one:

perl -lape '$_ = join "\t", shift(@F), grep {!$count{$_}++} @F'
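
For comparison, a per-line variant (keep the first field, then the first occurrence of each later word on that line) could be sketched with awk called from the shell. This is only a sketch under assumptions: dedupe_fields is a hypothetical function name, the input is taken to be strictly tab-delimited, and empty trailing fields are dropped.

```shell
# dedupe_fields (hypothetical name): reads tab-delimited lines from files or stdin.
dedupe_fields() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    split("", seen)            # clear the lookup table for each new line
    out = $1                   # the first field is always kept
    for (i = 2; i <= NF; i++)  # later fields: keep first occurrences only
      if ($i != "" && !seen[$i]++) out = out OFS $i
    print out
  }' "$@"
}

printf 'A\tCD274\tPDCD1LG2\tCD276\tPDCD1LG2\tCD274\n' | dedupe_fields
```

Unlike the perl one-liners with %count, the table is reset on every line, so a word that appears on two different lines is kept on both.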

edited Sep 27 '17 at 10:45

answered Sep 27 '17 at 10:08

284k53523859

up vote
0
down vote

With bash v4.3 (if you don't mind the order of the fields, since every field except the first ends up sorted):

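while IFS= read -r line; do
 # split the line on whitespace into the array 'aline'
 aline=( $line )
 # print the first field, then the remaining fields sorted and de-duplicated
 echo ${aline[0]} $(sort -u <(printf "%s\n" ${aline[@]:1}))
done < infile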

Explanation:

aline=( $line ) splits the line on whitespace and saves the fields in the array 'aline'.

${aline[0]} expands to the first element of the array 'aline' (array indexes start at zero in bash).

printf "%s\n" ${aline[@]:1} prints every element of the array 'aline' except the first, one per line; then

sort -u sorts those lines and removes duplicate entries.

echo then joins the sorted elements back into a single line, because the unquoted command substitution is split on newlines.

The example below illustrates this step:
```
printf "C\n4\nB\nC" | sort -u
4
B
C
echo $(printf "C\n4\nB\nC" | sort -u)
4 B C
```

This gives the following output:

A CD274 CD276 PDCD1LG2
B NEK10 NEK2 NEK6 NEKL-4
C OTUD7B TNFAIP3
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2
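
Note that echo joins the fields with spaces. If the result has to stay tab-delimited, the same "first field plus sort -u of the rest" idea might be sketched as below. The function name sort_rest_tabbed is hypothetical, and LC_ALL=C is an assumption made here only to get a predictable sort order.

```shell
# sort_rest_tabbed (hypothetical name): reads lines on stdin, writes tab-joined lines.
sort_rest_tabbed() {
  while IFS= read -r line || [ -n "$line" ]; do
    set -f                  # disable globbing while the line is split
    set -- $line            # split on whitespace into $1, $2, ...
    set +f
    [ "$#" -gt 0 ] || { echo; continue; }   # pass empty lines through
    first=$1
    shift
    if [ "$#" -eq 0 ]; then
      printf '%s\n' "$first"
    else
      # paste -s joins the sorted, de-duplicated words with tabs (its default)
      rest=$(printf '%s\n' "$@" | LC_ALL=C sort -u | paste -s -)
      printf '%s\t%s\n' "$first" "$rest"
    fi
  done
}

printf 'B\tNEK2\tNEK6\tNEK10\tNEK10\tNEKL-4\n' | sort_rest_tabbed
```

It avoids arrays and process substitution, so it should also run in a plain POSIX sh.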

edited Sep 27 '17 at 10:46

answered Sep 27 '17 at 10:08

αғsнιη

15.7k92563

up vote
0
down vote

sed substitution with back reference:

sed -re 's/\s+$//; s/(\t[^\t]+)\1+$/\1/'

(s/\s+$// gets rid of trailing white-space like in your example.)
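
To see what the expression does and does not match, here is a small sketch. It swaps in sed -E with a literal tab and [[:space:]] instead of -r, \t and \s, on the assumption that this is a little more portable; it still relies on the sed in use supporting back references in extended regular expressions. strip_trailing_repeat is a hypothetical name.

```shell
# strip_trailing_repeat (hypothetical name): same substitution, literal tab.
strip_trailing_repeat() {
  tab=$(printf '\t')
  # 1st command: drop trailing whitespace
  # 2nd command: collapse a tab-prefixed field repeated at the end of the line
  sed -E "s/[[:space:]]+\$//; s/(${tab}[^${tab}]+)\\1+\$/\\1/"
}

printf 'G\tF2\tF2\n' | strip_trailing_repeat       # repeat at end of line: collapsed
printf 'A\tB1\tB1\tC1\n' | strip_trailing_repeat   # repeat in the middle: unchanged
```

Because the pattern is anchored with $, only a run of identical fields at the end of the line is reduced.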

answered Sep 27 '17 at 11:36

David Foerster

918616
