How to remove duplicate value in a tab-delimited text file

Score: 5
I have a tab-delimited text file with columns like the one below:
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2
How could I convert the above table into the one below?
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2
I have extracted my real data file. It is a tab-delimited file, and I have tried the command line you (Stéphane Chazelas?) posted. It works fine, but it couldn't remove the duplicates in the last column:
A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274
B NEK2 NEK6 NEK10 NEK10 NEKL-4
C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3
D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8
E AGO2 AGO2 AGO2 AGO2 AGO2
The output needs to be as below:
A CD274 CD276 PDCD1LG2
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2
text-processing csv-simple
asked Sep 26 '17 at 21:15 by desu; edited Sep 26 '17 at 22:33 by Kusalananda
Is the order of the fields within an output line important? For example, would AGO2 E or C OTUD7B TNFAIP3 be acceptable?
– αғsнιη Sep 27 '17 at 9:09
@αғsнιη A B C seems to be the line numbering; I think at least those should stay where they are.
– dessert Sep 27 '17 at 10:01
If you're happy with one or several of the answers, upvote them. If one is solving your issue, accepting it would be the best way of saying "Thank You!" :-)
– Kusalananda Sep 27 '17 at 10:50
7 Answers
Score: 7
First set of example data:
$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i; print r }' file
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2
Second set of example data (same awk script):
$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i; print r }' file
A CD274 PDCD1LG2 CD276
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2
The script reads the input file, file, line by line. For each line it goes through the fields in order, building up the output line, r. If a field's value has already been added to the output line (determined by a lookup table, t, of the values used so far on that line), the field is skipped; otherwise it is appended.
When all the fields of an input line have been processed, the constructed line is printed.
The output field delimiter is set to a tab through -vOFS='\t' on the command line.
The awk script unravelled:
{
    r = ""                            # output line built so far
    delete t                          # forget the values seen on the previous line
    for (i = 1; i <= NF; ++i)
        if (!t[$i]++)                 # true only the first time a value is seen
            r = r ? r OFS $i : $i     # append, without a leading delimiter
    print r
}
See split("", t) for the POSIX equivalent to delete t.
– Stéphane Chazelas Sep 27 '17 at 6:45
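Following up on the split("", t) suggestion, here is a rough sketch of the same per-line lookup-table idea wrapped in a shell function. It is not the answerer's exact script: the function name dedupe_fields and the array name seen are made up for illustration, and the sketch assumes the input really is tab-separated, so it sets -F '\t' explicitly and always keeps the first column as the row label.

```shell
# dedupe_fields: hypothetical helper name, not taken from the thread.
# Reads tab-separated lines on stdin and drops repeated values within each
# line, keeping the first occurrence of every value in its original position.
dedupe_fields() {
    awk -F '\t' -v OFS='\t' '{
        split("", seen)           # POSIX-portable way to empty the array per line
        out = $1; seen[$1] = 1    # always keep the first column (the row label)
        for (i = 2; i <= NF; i++)
            if (!seen[$i]++)      # true only the first time a value appears
                out = out OFS $i
        print out
    }'
}

# Example call with one row of the second sample data set
printf 'A\tCD274\tPDCD1LG2\tCD276\tPDCD1LG2\tCD274\n' | dedupe_fields
```

For that row the function should print A, CD274, PDCD1LG2 and CD276 separated by tabs, that is, in first-seen order rather than the sorted order shown in the question.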
Score: 6
sed/tr, uniq and paste
while read -r l; do sed 's/\t/\n/g' <<< "$l" | uniq | paste -s; done < test
or POSIX compliant:
while read -r l; do echo "$l" | tr '\t' '\n' | uniq | paste -s -; done < test
For the file test, this replaces, line by line, all tab characters with line breaks, runs uniq to delete adjacent duplicates, and then joins the lines with tab characters again.
$ cat test
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2
$ while read -r l; do sed 's/\t/\n/g' <<< "$l" | uniq | paste -s; done < test
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2
NB: This solution will not work for duplicates over multiple rows, e.g. C1 in
A B1 B1 C1
C1 B B2 D2
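Note that uniq only collapses repeats that sit next to each other, which is why the loop above handles the first sample but would leave PDCD1LG2 and CD274 duplicated in the second. One possible variation, offered as a sketch rather than as part of the original answer, swaps uniq for the common awk '!seen[$0]++' idiom, which drops any value already seen on that row while keeping first-seen order. The function name dedupe_rows is hypothetical.

```shell
# dedupe_rows: hypothetical helper name, not taken from the thread.
# Same tr / paste round trip as above, with awk standing in for uniq so that
# non-adjacent repeats within a row are removed as well.
dedupe_rows() {
    while IFS= read -r l; do
        printf '%s\n' "$l" |
            tr '\t' '\n' |          # one field per line
            awk '!seen[$0]++' |     # print a value only the first time it appears
            paste -s -              # join the lines with tabs again
    done
}

# Example call with one row of the second sample data set
printf 'A\tCD274\tPDCD1LG2\tCD276\tPDCD1LG2\tCD274\n' | dedupe_rows
```

Each row still costs a few processes, so this stays as heavy as the original loop; it only changes which duplicates are caught.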
Score: 6
Maybe something like:
gawk -vRS='\\s*\\S*' -vORS= '{$0=RT};$1!=prev;{prev=$1}'
The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.
So here, we're slicing the input into <whitespace><non-whitespace> $0 records, <non-whitespace> goes in $1 (the first and only field). We're printing the records whose $1 is not equal to the previous one.
On an input like:
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2
The records are:
[A][ B1][ B1][ C1][
B][ B2][ D2][
C][ C12][ C13][ C13][
D][ D3][ D5][ D9][
G][ F2][ F2][
]
This doesn't work for your second example, though, and note that it could remove some newline characters.
What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The line break clearly should not get removed even then.
– dessert Sep 26 '17 at 22:00
A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
– desu Sep 26 '17 at 22:13
@desu, whatever you're trying to say to clarify your question, please edit it into your question. You may want to take the tour for some advice on how to ask great questions.
– Stéphane Chazelas Sep 26 '17 at 22:17
@desu add -F'\n' to separate the input lines, so: gawk -F'\n' -vRS='\\s*\\S*' -vORS= '{$0=RT};$1!=prev;{prev=$1}'
– αғsнιη Sep 27 '17 at 9:48
@αғsнιη, not sure what you mean. \n is already included in the default FS. The problem here is that if that \n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick, which may be useful in other situations.
– Stéphane Chazelas Sep 27 '17 at 9:57
Score: 2
This is more of a code-golf / freak-challenge solution:
xargs -L1 -I{} echo '; {}' < ./test.txt |
xargs -n1 |
uniq |
xargs |
sed -e 's/; /\n/g' -e 's/ \+/\t/g'
But it avoids using loops and all the other heavy machinery seen in other answers.
It also builds on the assumption that your data doesn't contain the ; character.
It also assumes no ", ' or backslash characters and that none of the words look like -n, -e, -nEne... (depending on the echo implementation). It also assumes GNU sed. It still spawns one echo process per line. But it's true that it's less heavy than some of the while loops seen around. It doesn't work for the updated requirements, where the duplicated words may no longer be contiguous.
– Stéphane Chazelas Sep 27 '17 at 10:43
@StéphaneChazelas the argument to echo is quoted, so values that look like options won't be interpreted as such. What part of the sed call isn't POSIX? (I honestly don't know.)
– wvxvw Sep 27 '17 at 10:53
No, quoting doesn't prevent option processing. Try printf '%s\n' -n -ne foo | xargs. Note that xargs -n1 means that one echo is run for each word, which is actually quite heavy. \n, \+ and \t are GNU extensions, though you do find some other implementations supporting them nowadays.
– Stéphane Chazelas Sep 27 '17 at 12:29
@StéphaneChazelas Well, maybe it's an echo implementation issue, but for me echo "-n 'foo'" | xargs -L1 -I{} echo '; {}' prints ; -n foo, i.e. -n wasn't treated as an option. Or do you mean this will propagate to uniq? I think I see your point now.
– wvxvw Sep 27 '17 at 13:19
Yes, it doesn't apply to the first echo, as the argument starts with ;. It applies to the other ones (the ones implicitly run by xargs upon xargs or xargs -n1 alone).
– Stéphane Chazelas Sep 27 '17 at 13:53
Score: 1
With perl:
unique words on each line:
perl -MList::Util=uniq -lape '$_ = join "\t", uniq @F'
unique words globally:
perl -lape '$_ = join "\t", grep {!$count{$_}++} @F'
Or to only consider words of each line starting with the 2nd one:
perl -lape '$_ = join "\t", shift(@F), grep {!$count{$_}++} @F'
Score: 0
With bash v4.3 (if you don't mind the order of the fields, as they are sorted, except for the first):
while IFS= read -r line; do
    aline=( $line )
    echo "${aline[0]}" $(sort -u <(printf "%s\n" "${aline[@]:1}"))
done < infile
Explanation:
aline=( $line ) splits the line into the array aline.
${aline[0]} expands to the first element of aline (array indexes start at zero in bash).
printf "%s\n" "${aline[@]:1}" prints each element of aline on a separate line, skipping the first element.
sort -u then sorts those lines and removes duplicate entries.
echo joins the sorted elements back into a single line. The example below gives a better view of this step:
printf "C\n4\nB\nC" | sort -u
4
B
C
echo $(printf "C\n4\nB\nC" | sort -u)
4 B C
This will give output as:
A CD274 CD276 PDCD1LG2
B NEK10 NEK2 NEK6 NEKL-4
C OTUD7B TNFAIP3
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2
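The loop above joins the sorted values with spaces, because echo separates its arguments with a single space. If the tabs should survive in the output, one possible variation (a sketch, not the answerer's code) splits off the label with cut and rejoins the sorted remainder with paste. The function name sort_dedupe_rows is made up for illustration, and LC_ALL=C is an assumption added to make the sort order predictable across locales.

```shell
# sort_dedupe_rows: hypothetical helper name, not taken from the thread.
# Keeps the first column in place, sorts and de-duplicates the remaining
# columns, and writes the row back out tab-separated.
sort_dedupe_rows() {
    while IFS= read -r line; do
        label=$(printf '%s\n' "$line" | cut -f1)
        # cut -s prints nothing for rows that contain no tab at all
        rest=$(printf '%s\n' "$line" | cut -s -f2- | tr '\t' '\n' |
               LC_ALL=C sort -u | paste -s -)
        if [ -n "$rest" ]; then
            printf '%s\t%s\n' "$label" "$rest"
        else
            printf '%s\n' "$label"
        fi
    done
}

# Example call with one row of the second sample data set
printf 'B\tNEK2\tNEK6\tNEK10\tNEK10\tNEKL-4\n' | sort_dedupe_rows
```

As with the original loop, the values after the label come out sorted, so this only suits cases where field order does not matter.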
Score: 0
sed substitution with back reference
sed -re 's/\s+$//; s/(\t[^\t]+)\1+$/\1/'
(s/\s+$// gets rid of trailing white-space like in your example.)
The awk answer (score 7): answered Sep 26 '17 at 22:54 by Kusalananda, edited Sep 26 '17 at 23:22
The sed/tr, uniq and paste answer: answered Sep 26 '17 at 21:26 by dessert, edited Sep 26 '17 at 22:19
The gawk answer: answered Sep 26 '17 at 21:34 by Stéphane Chazelas, edited Sep 27 '17 at 6:48
What if a row begins with a dupe from the preceding line, e.g. if we addC1at the beginning of row 2? The linebreak clearly should not get removed even then.
â dessert
Sep 26 '17 at 22:00
A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â desu
Sep 26 '17 at 22:13
3
@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â Stéphane Chazelas
Sep 26 '17 at 22:17
@desu add-F'n'to separate each input lines, sogawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
â Ã±ÃÂsýù÷
Sep 27 '17 at 9:48
1
@ñÃÂsýù÷, not sure what you mean.nis already included in the default FS. The problem here is that if thatnis part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
â Stéphane Chazelas
Sep 27 '17 at 9:57
 |Â
show 3 more comments
What if a row begins with a dupe from the preceding line, e.g. if we addC1at the beginning of row 2? The linebreak clearly should not get removed even then.
â dessert
Sep 26 '17 at 22:00
A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â desu
Sep 26 '17 at 22:13
3
@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â Stéphane Chazelas
Sep 26 '17 at 22:17
@desu add-F'n'to separate each input lines, sogawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
â Ã±ÃÂsýù÷
Sep 27 '17 at 9:48
1
@ñÃÂsýù÷, not sure what you mean.nis already included in the default FS. The problem here is that if thatnis part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
â Stéphane Chazelas
Sep 27 '17 at 9:57
What if a row begins with a dupe from the preceding line, e.g. if we add
C1 at the beginning of row 2? The linebreak clearly should not get removed even then.â dessert
Sep 26 '17 at 22:00
What if a row begins with a dupe from the preceding line, e.g. if we add
C1 at the beginning of row 2? The linebreak clearly should not get removed even then.â dessert
Sep 26 '17 at 22:00
A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â desu
Sep 26 '17 at 22:13
A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
â desu
Sep 26 '17 at 22:13
3
3
@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â Stéphane Chazelas
Sep 26 '17 at 22:17
@desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
â Stéphane Chazelas
Sep 26 '17 at 22:17
@desu add
-F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'â Ã±ÃÂsýù÷
Sep 27 '17 at 9:48
@desu add
-F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'â Ã±ÃÂsýù÷
Sep 27 '17 at 9:48
1
1
@ñÃÂsýù÷, not sure what you mean.
n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.â Stéphane Chazelas
Sep 27 '17 at 9:57
@ñÃÂsýù÷, not sure what you mean.
n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.â Stéphane Chazelas
Sep 27 '17 at 9:57
 |Â
show 3 more comments
up vote
2
down vote
This is more of a code-golf / freak challenge solution:
xargs -L1 -I echo '; ' < ./test.txt |
xargs -n1 |
uniq |
xargs |
sed -e 's/; /n/g' -e 's/ +/t/g'
But it avoids using loops and all other heavy machinery seen in other answers.
It also builds on an assumption your data doesn't contain ; character.
up vote
2
down vote
This is more of a code-golf / freak challenge solution:
xargs -L1 -I {} echo '; {}' < ./test.txt |
xargs -n1 |
uniq |
xargs |
sed -e 's/; /\n/g' -e 's/ \+/\t/g'
But it avoids using loops and all the other heavy machinery seen in other answers.
It also relies on the assumption that your data doesn't contain the ; character.
answered Sep 27 '17 at 7:08
wvxvw
It also assumes no ", ' or backslash characters and that none of the words look like -n, -e, -nEne... (depending on the echo implementation). It also assumes GNU sed. It still spawns one echo process per line. But it's true that it's less heavy than some of the while loops seen around. It doesn't work for the updated requirements where the duplicated words may no longer be contiguous.
– Stéphane Chazelas
Sep 27 '17 at 10:43
@StéphaneChazelas the argument to echo is quoted, so that the values that look like options won't be interpreted as such. What part of the sed call isn't POSIX? (I honestly don't know).
– wvxvw
Sep 27 '17 at 10:53
No, quoting doesn't prevent option processing. Try printf '%s\n' -n -ne foo | xargs. Note that xargs -n1 means that one echo is being run for each word, which is quite heavy actually. \n, \+ and \t are GNU extensions, though you do find some other implementations supporting them nowadays.
– Stéphane Chazelas
Sep 27 '17 at 12:29
@StéphaneChazelas Well, maybe it's an echo implementation issue, but for me echo "-n 'foo'" | xargs -L1 -I {} echo '; {}' prints ; -n foo, i.e. -n wasn't treated as an option. Or, do you mean this will propagate to uniq? I think I see your point now.
– wvxvw
Sep 27 '17 at 13:19
Yes, it doesn't apply to the first echo as the argument starts with ;, it applies to the other ones (the ones implicitly run by xargs upon xargs or xargs -n1 alone).
– Stéphane Chazelas
Sep 27 '17 at 13:53
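The two objections raised in these comments (one echo process per word, and duplicates that are no longer contiguous) can both be sidestepped with a single awk pass. The following is only a minimal sketch, not wvxvw's pipeline: the function name dedupe_fields is made up for illustration, and it assumes tab-delimited input whose first field is a label that must always be kept.

```shell
# Hypothetical helper: drop repeated values on each tab-delimited line,
# keeping the first field (the label) and the first occurrence of each value.
dedupe_fields() {
  awk -F '\t' '
    {
      split("", seen)              # portable way to empty the per-line table
      out = $1                     # always keep the label in field 1
      for (i = 2; i <= NF; i++) {
        if ($i == "" || ($i in seen)) continue   # skip empty or repeated values
        seen[$i] = 1
        out = out "\t" $i
      }
      print out
    }
  ' "$@"
}

# Usage: dedupe_fields infile
```

Values are printed in first-seen order, so the result is not sorted the way the sample output in the question is; piping each line through a sort would be a separate step.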
up vote
1
down vote
With perl:
unique words on each line:
perl -MList::Util=uniq -lape '$_ = join "\t", uniq @F'
unique words globally (a word already seen on an earlier line is dropped as well):
perl -lape '$_ = join "\t", grep {!$count{$_}++} @F'
Or to only consider words of each line starting with the 2nd one:
perl -lape '$_ = join "\t", shift(@F), grep {!$count{$_}++} @F'
edited Sep 27 '17 at 10:45
answered Sep 27 '17 at 10:08
Stéphane Chazelas
up vote
0
down vote
With bash v4.3 (if you don't mind the fields being reordered: every field except the first one is sorted)
while IFS= read -r line;
do aline=( $line );
echo "${aline[0]}" $(sort -u <(printf "%s\n" "${aline[@]:1}"));
done < infile
Explanation:
aline=( $line ) splits the line on whitespace and saves it into the array 'aline'.
${aline[0]} prints the first element of the array 'aline' (array indices start at zero in bash).
printf "%s\n" "${aline[@]:1}" prints each element of the array 'aline' on a separate line, skipping the first element; sort -u then sorts those lines and removes duplicate entries.
echo also joins the sorted elements back into a single line. See the example below for a better view of this step:
printf "C\n4\nB\nC" |sort -u
4
B
C
echo $(printf "C\n4\nB\nC" |sort -u)
4 B C
This will give output as:
A CD274 CD276 PDCD1LG2
B NEK10 NEK2 NEK6 NEKL-4
C OTUD7B TNFAIP3
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2
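If the original field order matters, or the output has to stay tab-delimited, the loop can track already-seen values itself instead of calling sort -u. The following is a rough sketch, not the answer's method: the function name dedupe_keep_order and its variables are invented for illustration, it reads from standard input, and it is written in portable sh syntax so it also runs under bash.

```shell
# Hypothetical helper: de-duplicate the tab-separated fields of each line,
# keeping the first occurrence of every value in its original position.
dedupe_keep_order() {
  tab=$(printf '\t')
  while IFS= read -r line || [ -n "$line" ]; do
    out=
    seen=$tab                      # tab-delimited list of values already emitted
    old_ifs=$IFS
    IFS=$tab                       # split the line on tabs only
    set -f                         # no filename expansion on field values
    for field in $line; do
      case $seen in
        *"$tab$field$tab"*) ;;     # value already emitted on this line
        *) seen=$seen$field$tab
           out=${out:+$out$tab}$field ;;
      esac
    done
    set +f
    IFS=$old_ifs
    printf '%s\n' "$out"
  done
}

# Usage: dedupe_keep_order < infile
```

The label in the first field is treated like any other value, so a later field identical to the label would be dropped too. No external process is started per line, which also addresses the cost of running sort once per line.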
edited Sep 27 '17 at 10:46
answered Sep 27 '17 at 10:08
ñÃÂsýù÷
up vote
0
down vote
sed substitution with a back reference (this collapses a value that is repeated back to back at the end of the line):
sed -re 's/\s+$//; s/(\t[^\t]+)\1+$/\1/'
(s/\s+$// gets rid of trailing white-space like in your example.)
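The same back-reference idea can be extended from the end of the line to any position by repeating the substitution in a loop. This is a sketch only, assuming GNU sed (for \t and for back-references with -E); the function name dedupe_adjacent and the label again are made up. The trailing (\t|$) group ensures that a whole field is compared, so NEK1 followed by NEK10 is not mistaken for a repeat.

```shell
# Hypothetical helper (GNU sed assumed): collapse runs of identical
# adjacent fields anywhere on a tab-delimited line.
dedupe_adjacent() {
  sed -E -e 's/[[:space:]]+$//' \
         -e ':again' \
         -e 's/(\t[^\t]+)\1(\t|$)/\1\2/' \
         -e 't again' "$@"
}

# Usage: dedupe_adjacent infile
```

Like the original command, this only removes duplicates that sit next to each other; repeated values separated by other fields are left alone.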
answered Sep 27 '17 at 11:36
David Foerster