How to remove duplicate value in a tab-delimited text file

I have tab-delimited text like the following:



A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2


How can I convert the table above into the following?



A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2



I have extracted a sample of my real data file. It is tab-delimited, and I tried the command line you (Stéphane Chazelas?) posted. It works fine, but it doesn't remove the duplicates in the last column:



A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274
B NEK2 NEK6 NEK10 NEK10 NEKL-4
C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3
D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8
E AGO2 AGO2 AGO2 AGO2 AGO2


The output needs to be as follows:



A CD274 CD276 PDCD1LG2
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2









  • Is the order of fields within an output line important? For example, would AGO2 E or C OTUD7B TNFAIP3 be acceptable?
    – αғsнιη
    Sep 27 '17 at 9:09











  • @αғsнιη A, B, C seem to be row labels; I think those at least should stay in place.
    – dessert
    Sep 27 '17 at 10:01










  • If you're happy with one or several of the answers, upvote them. If one is solving your issue, accepting it would be the best way of saying "Thank You!" :-)
    – Kusalananda
    Sep 27 '17 at 10:50














text-processing csv-simple






edited Sep 26 '17 at 22:33 by Kusalananda

asked Sep 26 '17 at 21:15 by desu

7 Answers
First set of example data:



$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i; print r }' file
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2


Second set of example data (same awk script):



$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) if (!t[$i]++) r = r ? r OFS $i : $i; print r }' file
A CD274 PDCD1LG2 CD276
B NEK2 NEK6 NEK10 NEKL-4
C TNFAIP3 OTUD7B
D DUSP16 DUSP4 DUSP8 VHP-1
E AGO2


The script reads the input file file line by line, and for each line it goes through each field, building up the output line, r. If the value in a field has already been added to the output line (determined by a lookup table, t, of used field values), then the field is ignored, otherwise it's added.



When all the fields of an input line have been processed, the constructed line is printed.



The output field delimiter is set to tab through -vOFS='\t' on the command line.




The awk script unravelled:




{
    r = ""
    delete t

    for (i = 1; i <= NF; ++i)
        if (!t[$i]++)
            r = r ? r OFS $i : $i

    print r
}
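The per-line lookup-table approach described in this answer could be wrapped for reuse roughly as follows. This is only a sketch under a few assumptions: the function name dedupe_fields is made up for illustration, the input field separator is set to tab explicitly (so that values containing spaces would stay in one field), and split("", seen) is used to empty the array instead of delete.

```shell
# dedupe_fields: hypothetical helper name, not part of the original answer.
# Reads tab-delimited text on stdin and prints each line with repeated
# field values removed, keeping the first occurrence of each value.
dedupe_fields() {
    awk -F '\t' -v OFS='\t' '
    {
        out = ""
        split("", seen)              # empty the lookup table for this line
        for (i = 1; i <= NF; ++i)
            if (!seen[$i]++)         # true only the first time a value appears
                out = (i == 1) ? $i : out OFS $i
        print out
    }'
}

# Example: the first line of the second data set from the question.
printf 'A\tCD274\tPDCD1LG2\tCD276\tPDCD1LG2\tCD274\n' | dedupe_fields
```

Building the output with an explicit i == 1 test (rather than testing whether the output string is empty) is a design choice made here so that a first field of 0 or an empty first field is still handled.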






  • See split("", t) for the POSIX equivalent to delete t
    – Stéphane Chazelas
    Sep 27 '17 at 6:45


















sed/tr, uniq and paste



while read -r l; do sed 's/\t/\n/g' <<< "$l" | uniq | paste -s; done < test


or POSIX compliant:



while read -r l; do echo "$l" | tr '\t' '\n' | uniq | paste -s -; done < test


For the file test, this goes line by line: it replaces all tab characters with line breaks, runs uniq to delete duplicates (uniq only collapses adjacent repeats), and replaces the line breaks with tab characters again.



$ cat test
A B1 B1 C1
B B2 D2
C C12 C13 C13
D D3 D5 D9
G F2 F2

$ while read -r l; do sed 's/\t/\n/g' <<< "$l" | uniq | paste -s; done < test
A B1 C1
B B2 D2
C C12 C13
D D3 D5 D9
G F2


NB: This solution will not work for duplicates over multiple rows, e.g. C1 in



A B1 B1 C1
C1 B B2 D2





    Maybe something like:



    gawk -vRS='\\s*\\S*' -vORS= '{$0=RT};$1!=prev;{prev=$1}'


    The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.



    So here, we're slicing the input into <whitespace><non-whitespace> $0 records, <non-whitespace> goes in $1 (the first and only field). We're printing the records whose $1 is not equal to the previous one.



    On an input like:



    A B1 B1 C1
    B B2 D2
    C C12 C13 C13
    D D3 D5 D9
    G F2 F2


    The records are:




    [A][ B1][ B1][ C1][
    B][ B2][ D2][
    C][ C12][ C13][ C13][
    D][ D3][ D5][ D9][
    G][ F2][ F2][
    ]


    This doesn't work for your second example, though, and note that it could remove some newline characters.






    • What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
      – dessert
      Sep 26 '17 at 22:00










    • A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
      – desu
      Sep 26 '17 at 22:13






    • @desu, whatever you're trying to say to clarify your question, please edit it into your question. You may want to take the tour for some advice on how to ask great questions.
      – Stéphane Chazelas
      Sep 26 '17 at 22:17










    • @desu add -F'\n' to separate the input lines, so gawk -F'\n' -vRS='\\s*\\S*' -vORS= '{$0=RT};$1!=prev;{prev=$1}'
      – αғsнιη
      Sep 27 '17 at 9:48







    • @αғsнιη, not sure what you mean. \n is already included in the default FS. The problem here is that if that \n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick, which may be useful in other situations.
      – Stéphane Chazelas
      Sep 27 '17 at 9:57

















    This is more of a code-golf / freak challenge solution:



    xargs -L1 -I {} echo '; {}' < ./test.txt | 
    xargs -n1 |
    uniq |
    xargs |
    sed -e 's/; /\n/g' -e 's/ \+/\t/g'


    But it avoids using loops and all other heavy machinery seen in other answers.



    It also assumes that your data doesn't contain a ; character.






    • It also assumes no ", ' or backslash characters, and that none of the words look like -n, -e, -nEne... (depending on the echo implementation). It also assumes GNU sed. It still spawns one echo process per line. But it's true that it's less heavy than some of the while loops seen around. It doesn't work for the updated requirements, where the duplicated words may no longer be contiguous.
      – Stéphane Chazelas
      Sep 27 '17 at 10:43











    • @StéphaneChazelas the argument to echo is quoted, so that the values that look like options won't be interpreted as such. What part of sed call isn't POSIX? (I honestly don't know).
      – wvxvw
      Sep 27 '17 at 10:53










    • No, quoting doesn't prevent option processing. Try printf '%s\n' -n -ne foo | xargs. Note that xargs -n1 means that one echo is being run for each word, which is quite heavy actually. \n, \+ and \t are GNU extensions, though you do find some other implementations supporting them nowadays.
      – Stéphane Chazelas
      Sep 27 '17 at 12:29











    • @StéphaneChazelas Well, maybe it's an echo implementation issue, but for me echo "-n 'foo'" | xargs -L1 -I {} echo '; {}' prints ; -n foo, i.e. -n wasn't treated as an option. Or do you mean this will propagate to uniq? I think I see your point now.
      – wvxvw
      Sep 27 '17 at 13:19











    • Yes, it doesn't apply to the first echo as the argument starts with ;, it applies to the other ones (the ones implicitly run by xargs upon xargs or xargs -n1 alone).
      – Stéphane Chazelas
      Sep 27 '17 at 13:53

















    With perl:



    unique words on each line:



    perl -MList::Util=uniq -lape '$_ = join "\t", uniq @F'


    unique words globally:



    perl -lape '$_ = join "\t", grep {!$count{$_}++} @F'


    Or to only consider words of each line starting with the 2nd one:



    perl -lape '$_ = join "\t", shift(@F), grep {!$count{$_}++} @F'





      With bash v4.3 (if you don't mind the order of the fields, since all but the first are sorted):



      while IFS= read -r line; 
      do aline=( $line );
      echo ${aline[0]} $(sort -u <(printf "%s\n" ${aline[@]:1}));
      done < infile


      Explanation:




      • aline=( $line ) splits the line on whitespace and saves the fields into the array 'aline'


      • ${aline[0]} expands to the first element of the array 'aline' (array indexes start at zero in bash)


      • printf "%s\n" ${aline[@]:1} prints each element of the array 'aline' on a separate line, skipping the first element


      • sort -u sorts those lines and removes duplicate entries


      • echo then joins the sorted elements back into a single line.



        The following example illustrates this step:



        printf "C\n4\nB\nC" | sort -u
        4
        B
        C
        echo $(printf "C\n4\nB\nC" | sort -u)
        4 B C


      This gives the following output:



      A CD274 CD276 PDCD1LG2
      B NEK10 NEK2 NEK6 NEKL-4
      C OTUD7B TNFAIP3
      D DUSP16 DUSP4 DUSP8 VHP-1
      E AGO2
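The loop in this answer joins the fields with spaces. If tab-delimited output is wanted, something along these lines might work; it is a sketch only, written for plain sh rather than bash, with a hypothetical function name (sort_dedupe) and with LC_ALL=C added as an assumption to make the sort order predictable.

```shell
# sort_dedupe: hypothetical helper name. Keeps the first field in place,
# sorts and de-duplicates the remaining fields, and joins them with tabs.
sort_dedupe() {
    tab=$(printf '\t')
    while IFS= read -r line || [ -n "$line" ]; do
        first=${line%%"$tab"*}                 # text before the first tab
        case $line in
            *"$tab"*) rest=${line#*"$tab"} ;;  # everything after the first tab
            *)        printf '%s\n' "$first"; continue ;;  # single-field row
        esac
        sorted=$(printf '%s\n' "$rest" | tr '\t' '\n' | LC_ALL=C sort -u | paste -s -)
        printf '%s\t%s\n' "$first" "$sorted"
    done
}

# Example: one row from the second data set in the question.
printf 'B\tNEK2\tNEK6\tNEK10\tNEK10\tNEKL-4\n' | sort_dedupe
```

As with the original loop, the fields after the first one come out in sorted order rather than in input order.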





        sed substitution with back reference



        sed -re 's/\s+$//; s/(\t[^\t]+)\1+$/\1/'


        (s/\s+$// gets rid of trailing white-space like in your example.)
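The substitution in this answer is anchored with $, so it only collapses a run of repeats at the end of a line. A looped variant that collapses adjacent repeats anywhere in the line could look like the sketch below. The function name squeeze_adjacent is invented for illustration, a literal tab is passed in through a shell variable instead of \t, and back-references in extended regular expressions are assumed to be supported by the local sed (they are in GNU and BSD sed).

```shell
# squeeze_adjacent: hypothetical helper name. Collapses runs of identical
# adjacent tab-separated fields, wherever they occur after the first field.
squeeze_adjacent() {
    tab=$(printf '\t')
    sed -E \
        -e ':again' \
        -e "s/(${tab}[^${tab}]+)\\1(${tab}|\$)/\\1\\2/" \
        -e 't again'
}

# Example: the repeated B1 sits in the middle of the row.
printf 'A\tB1\tB1\tC1\n' | squeeze_adjacent
```

The trailing (tab or end of line) group keeps a field such as C1 from being treated as a repeat when the next field is C13. Like the original, this does nothing about repeats that are not next to each other.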






awk answer: answered Sep 26 '17 at 22:54, edited Sep 26 '17 at 23:22, by Kusalananda

sed/tr, uniq and paste answer: answered Sep 26 '17 at 21:26, edited Sep 26 '17 at 22:19, by dessert

              1,013321




              1,013321




















                  up vote
                  6
                  down vote













                  Maybe something like:



                  gawk -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'


                  The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.



                  So here, we're slicing the input into <whitespace><non-whitespace> $0 records, <non-whitespace> goes in $1 (the first and only field). We're printing the records whose $1 is not equal to the previous one.



                  On an input like:



                  A B1 B1 C1
                  B B2 D2
                  C C12 C13 C13
                  D D3 D5 D9
                  G F2 F2


                  The records are:




                  [A][ B1][ B1][ C1][
                  B][ B2][ D2][
                  C][ C12][ C13][ C13][
                  D][ D3][ D5][ D9][
                  G][ F2][ F2][
                  ]


                  Doesn't work for your second example though and note that it could remove some newline characters.






                  share|improve this answer






















                  • What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
                    – dessert
                    Sep 26 '17 at 22:00










                  • A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
                    – desu
                    Sep 26 '17 at 22:13






                  • 3




                    @desu, whatever you're trying to say to clarify your question, please edit it in your question. You may want to take the tour for some advise on how to ask great questions.
                    – Stéphane Chazelas
                    Sep 26 '17 at 22:17










                  • @desu add -F'n' to separate each input lines, so gawk -F'n' -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'
                    – Î±Ò“sнιη
                    Sep 27 '17 at 9:48







                  • 1




                    @αғsнιη, not sure what you mean. n is already included in the default FS. The problem here is that if that n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick which may be useful in other situations.
                    – Stéphane Chazelas
                    Sep 27 '17 at 9:57














                  up vote
                  6
                  down vote













                  Maybe something like:



                  gawk -vRS='\s*\S*' -vORS= '$0=RT;$1!=prev;prev=$1'


                  The RS=pattern...$0=RT trick lets you process records defined as the parts that match the pattern.



                  So here, we're slicing the input into <whitespace><non-whitespace> $0 records, <non-whitespace> goes in $1 (the first and only field). We're printing the records whose $1 is not equal to the previous one.



                  On an input like:



                  A B1 B1 C1
                  B B2 D2
                  C C12 C13 C13
                  D D3 D5 D9
                  G F2 F2


                  The records are:




                  [A][ B1][ B1][ C1][
                  B][ B2][ D2][
                  C][ C12][ C13][ C13][
                  D][ D3][ D5][ D9][
                  G][ F2][ F2][
                  ]


                  Doesn't work for your second example though and note that it could remove some newline characters.














                  edited Sep 27 '17 at 6:48

























                  answered Sep 26 '17 at 21:34









                  Stéphane Chazelas












                  • What if a row begins with a dupe from the preceding line, e.g. if we add C1 at the beginning of row 2? The linebreak clearly should not get removed even then.
                    – dessert
                    Sep 26 '17 at 22:00










                  • A CD274 PDCD1LG2 CD276 PDCD1LG2 CD274 B NEK2 NEK6 NEK10 NEK10 NEKL-4 C TNFAIP3 OTUD7B OTUD7B TNFAIP3 TNFAIP3 D DUSP16 DUSP4 DUSP8 VHP-1 DUSP8 E AGO2 AGO2 AGO2 AGO2 AGO2
                    – desu
                    Sep 26 '17 at 22:13






                  • @desu, whatever you're trying to say to clarify your question, please edit it into your question. You may want to take the tour for some advice on how to ask great questions.
                    – Stéphane Chazelas
                    Sep 26 '17 at 22:17










                  • @desu add -F'\n' to separate the input lines, so gawk -F'\n' -vRS='\s*\S*' -vORS= '{$0=RT};$1!=prev;{prev=$1}'
                    – αғsнιη
                    Sep 27 '17 at 9:48







                  • @αғsнιη, not sure what you mean. \n is already included in the default FS. The problem here is that if that \n is part of a record that is deleted, it will be deleted. Anyway, that answer doesn't answer the OP's question any more with their updated requirements. I'm only leaving it in for the trick, which may be useful in other situations.
                    – Stéphane Chazelas
                    Sep 27 '17 at 9:57


























                  up vote
                  2
                  down vote













                  This is more of a code-golf / freak challenge solution:



                  xargs -L1 -I{} echo '; {}' < ./test.txt |
                  xargs -n1 |
                  uniq |
                  xargs |
                  sed -e 's/; /\n/g' -e 's/ \+/\t/g'


                  But it avoids using loops and all other heavy machinery seen in other answers.



                  It also relies on the assumption that your data doesn't contain the ; character.
















                  answered Sep 27 '17 at 7:08









                  wvxvw












                  • It also assumes no ", ' or backslash characters, and that none of the words look like -n, -e, -nEne... (depending on the echo implementation). It also assumes GNU sed. It still spawns one echo process per line. But it's true that it's less heavy than some of the while loops seen around. It doesn't work for the updated requirements, where the duplicated words may no longer be contiguous.
                    – Stéphane Chazelas
                    Sep 27 '17 at 10:43











                  • @StéphaneChazelas the argument to echo is quoted, so that values that look like options won't be interpreted as such. Which part of the sed call isn't POSIX? (I honestly don't know.)
                    – wvxvw
                    Sep 27 '17 at 10:53










                  • No, quoting doesn't prevent option processing. Try printf '%s\n' -n -ne foo | xargs. Note that xargs -n1 means that one echo is being run for each word, which is actually quite heavy. \n, \+ and \t are GNU extensions, though you do find some other implementations supporting them nowadays.
                    – Stéphane Chazelas
                    Sep 27 '17 at 12:29











                  • @StéphaneChazelas Well, maybe it's an echo implementation issue, but for me echo "-n 'foo'" | xargs -L1 -I{} echo '; {}' prints ; -n foo, i.e. -n wasn't treated as an option. Or do you mean this will propagate to uniq? I think I see your point now.
                    – wvxvw
                    Sep 27 '17 at 13:19











                  • Yes, it doesn't apply to the first echo as the argument starts with ;, it applies to the other ones (the ones implicitly run by xargs upon xargs or xargs -n1 alone).
                    – Stéphane Chazelas
                    Sep 27 '17 at 13:53


























                  up vote
                  1
                  down vote













                  With perl:



                  unique words on each line:



                  perl -MList::Util=uniq -lape '$_ = join "\t", uniq @F'


                  unique words globally:



                  perl -lape '$_ = join "\t", grep {!$count{$_}++} @F'


                  Or to only consider words of each line starting with the 2nd one:



                  perl -lape '$_ = join "\t", shift(@F), grep {!$count{$_}++} @F'
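For readers without perl at hand, the "keep only the first occurrence" filter used in the grep commands above carries over to awk as !seen[$i]++. The sketch below is an illustration under stated assumptions, not part of the answer: the function name dedup_fields and its scope argument are invented here, the input is assumed to be tab-delimited, and the first field is always kept without being counted as seen.

```shell
# Hypothetical helper: keep the first occurrence of each word after the label.
#   dedup_fields          forget the seen words at every new line
#   dedup_fields global   remember them across the whole input
dedup_fields() {
  awk -v scope="${1:-line}" 'BEGIN { FS = OFS = "\t" }
  {
    if (scope == "line")
      split("", seen)             # portable way to empty an array
    out = $1                      # row label, kept as-is
    for (i = 2; i <= NF; i++)
      if (!seen[$i]++)            # true only the first time a word shows up
        out = out OFS $i
    print out
  }'
}

printf 'C\tTNFAIP3\tOTUD7B\tOTUD7B\tTNFAIP3\tTNFAIP3\n' | dedup_fields
printf 'A\tB1\tC1\nB\tB1\tD2\n' | dedup_fields global
```

The order of first occurrence is preserved, so nothing gets sorted. In global mode the second B1 is dropped from row B because it was already seen in row A.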













                      edited Sep 27 '17 at 10:45

























                      answered Sep 27 '17 at 10:08









                      Stéphane Chazelas





















                          up vote
                          0
                          down vote













                          With bash v4.3 (if you don't mind the order of the fields, since all but the first get sorted):



                          while IFS=$'\n' read -r line;
                          do aline=( $line );
                          echo ${aline[0]} $(sort -u <(printf "%s\n" ${aline[@]:1}));
                          done < infile


                          Explanation:




                          • aline=( $line ) splits the line on whitespace and saves the words into the array 'aline'.


                          • ${aline[0]} expands to the first element of the array 'aline' (array indices start at zero in bash).


                          • printf "%s\n" ${aline[@]:1} prints each element of the array 'aline' on a separate line, skipping the first element.


                          • sort -u sorts those lines and removes duplicate entries.


                          • echo then joins the sorted elements back into a single line.



                            See the example below for a better view of this step:



                            printf "C\n4\nB\nC" | sort -u
                            4
                            B
                            C
                            echo $(printf "C\n4\nB\nC" | sort -u)
                            4 B C


                          This gives the following output:



                          A CD274 CD276 PDCD1LG2
                          B NEK10 NEK2 NEK6 NEKL-4
                          C OTUD7B TNFAIP3
                          D DUSP16 DUSP4 DUSP8 VHP-1
                          E AGO2
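The loop above joins the fields with spaces because echo is given them as separate arguments. If tab-delimited output is needed, the same sort -u idea can be sketched in POSIX sh as below. This is an illustrative variant, not the answer's code: dedup_sorted is a made-up name, LC_ALL=C is added only to make the sort order predictable, and fields are assumed to contain no spaces.

```shell
# Hypothetical helper: keep the first field, sort and de-duplicate the rest,
# and join everything with tabs.
dedup_sorted() {
  while IFS= read -r line; do
    set -f                        # no filename expansion while splitting
    set -- $line                  # split the line on whitespace into $1, $2, ...
    set +f
    [ "$#" -gt 0 ] || { echo; continue; }   # pass empty lines through
    key=$1; shift
    if [ "$#" -gt 0 ]; then
      # one word per line -> unique sorted words -> one tab-joined line
      rest=$(printf '%s\n' "$@" | LC_ALL=C sort -u | paste -s -d '\t' -)
      printf '%s\t%s\n' "$key" "$rest"
    else
      printf '%s\n' "$key"
    fi
  done
}

printf 'D\tDUSP16\tDUSP4\tDUSP8\tVHP-1\tDUSP8\n' | dedup_sorted
```

It still starts one sort and one paste per input line, so for large files an awk or perl solution is likely to be faster.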





                          share|improve this answer


























                            up vote
                            0
                            down vote













                            With bash v4.3 (if you don't mind the order of fields as it's sorted except first)



                            while IFS='n' read -r line; 
                            do aline=( $line );
                            echo $aline[0] $(sort -u <(printf "%sn" $aline[@]:1));
                            done < infile


                            Explanation:




                            • aline=( $line ) this make the line save into an array 'aline'


                            • $aline[0] prints first element of an array 'aline' (array index is starting with zero in bash)


                            • printf "%sn" $aline[@]:1 prints each element of array 'aline' in separate lines and ignore first element; Then


                            • sort -u sorts each line and remove duplicates entries


                            • echo this also combine splited line elements after sort into one linear.



                              Please see below example to have better view of this step:



                              printf "Cn4nBnC" |sort -u 
                              4
                              B
                              C
                              echo $(printf "Cn4nBnC" |sort -u)
                              4 B C


                            This will give output as:



                            A CD274 CD276 PDCD1LG2
                            B NEK10 NEK2 NEK6 NEKL-4
                            C OTUD7B TNFAIP3
                            D DUSP16 DUSP4 DUSP8 VHP-1
                            E AGO2





                            share|improve this answer
























                              up vote
                              0
                              down vote










                              up vote
                              0
                              down vote









                              With bash v4.3 (if you don't mind the order of fields as it's sorted except first)



                              while IFS='n' read -r line; 
                              do aline=( $line );
                              echo $aline[0] $(sort -u <(printf "%sn" $aline[@]:1));
                              done < infile


                              Explanation:




                              • aline=( $line ) this make the line save into an array 'aline'


                              • $aline[0] prints first element of an array 'aline' (array index is starting with zero in bash)


                              • printf "%sn" $aline[@]:1 prints each element of array 'aline' in separate lines and ignore first element; Then


                              • sort -u sorts each line and remove duplicates entries


                              • echo this also combine splited line elements after sort into one linear.



                                Please see below example to have better view of this step:



                                printf "Cn4nBnC" |sort -u 
                                4
                                B
                                C
                                echo $(printf "Cn4nBnC" |sort -u)
                                4 B C


                              This will give output as:



                              A CD274 CD276 PDCD1LG2
                              B NEK10 NEK2 NEK6 NEKL-4
                              C OTUD7B TNFAIP3
                              D DUSP16 DUSP4 DUSP8 VHP-1
                              E AGO2





                              share|improve this answer














                              With bash v4.3 (if you don't mind the order of fields as it's sorted except first)



                              while IFS='n' read -r line; 
                              do aline=( $line );
                              echo $aline[0] $(sort -u <(printf "%sn" $aline[@]:1));
                              done < infile


Explanation:

• aline=( $line ) splits the line on whitespace and saves the fields in the array 'aline'.

• ${aline[0]} expands to the first element of 'aline' (array indices start at zero in bash).

• printf "%s\n" ${aline[@]:1} prints every element of 'aline' except the first, one per line.

• sort -u sorts those lines and removes duplicate entries.

• echo joins the sorted fields back into a single line: the unquoted command substitution is split into words, which echo prints separated by spaces.



The following example illustrates this last step:

    $ printf "C\n4\nB\nC" | sort -u
    4
    B
    C
    $ echo $(printf "C\n4\nB\nC" | sort -u)
    4 B C


This gives the following output:



                              A CD274 CD276 PDCD1LG2
                              B NEK10 NEK2 NEK6 NEKL-4
                              C OTUD7B TNFAIP3
                              D DUSP16 DUSP4 DUSP8 VHP-1
                              E AGO2
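If the original field order matters, the sorting approach above is not enough. As a possible alternative (a different technique, not part of the answer above), here is a minimal awk sketch that keeps the first occurrence of each value in its original position. The function name dedupe_keep_order is made up for illustration, and the sketch assumes tab-separated input on stdin with the row label in the first column.

```shell
# dedupe_keep_order: hypothetical helper name, not from the answer above.
# Reads tab-delimited lines on stdin. Column 1 is printed as-is; from
# column 2 on, only the first occurrence of each value is kept, in the
# original order. Empty fields (e.g. from a trailing tab) are dropped.
dedupe_keep_order() {
    awk -F'\t' '{
        split("", seen)              # reset the per-line lookup table
        out = $1
        for (i = 2; i <= NF; i++) {
            if ($i == "" || ($i in seen)) continue   # skip empty or repeated fields
            seen[$i] = 1
            out = out "\t" $i
        }
        print out
    }'
}

# Example with one line of the sample data:
printf 'A\tCD274\tPDCD1LG2\tCD276\tPDCD1LG2\tCD274\n' | dedupe_keep_order
```

Unlike the sort -u version, this prints the fields of line A as CD274, PDCD1LG2, CD276 (first-seen order) and separates them with tabs rather than spaces.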














                              edited Sep 27 '17 at 10:46

























                              answered Sep 27 '17 at 10:08









                              αғsнιη


































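sed substitution with a back-reference:

    sed -re 's/\s+$//; s/(\t[^\t]+)\1+$/\1/'

(s/\s+$// gets rid of trailing white-space like in your example. The second substitution collapses a field that is repeated consecutively at the end of the line.)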
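To see what the sed command does and does not cover, here is a small sketch that wraps it in a function and feeds it two lines from the question's first example. The wrapper name collapse_trailing is made up for illustration, and the sketch assumes GNU sed, since -r, \s and \t are GNU extensions.

```shell
# collapse_trailing: hypothetical wrapper name around the sed command.
# Assumes GNU sed (-r, \s and \t are GNU extensions).
collapse_trailing() {
    sed -re 's/\s+$//; s/(\t[^\t]+)\1+$/\1/'
}

# A field repeated at the end of the line is collapsed to one copy.
printf 'G\tF2\tF2\n' | collapse_trailing

# A repeat in the middle of the line is left untouched.
printf 'A\tB1\tB1\tC1\n' | collapse_trailing
```

The first call prints G and F2 separated by a tab. The second prints its input unchanged, because the pattern is anchored to the end of the line with $.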
















                                      answered Sep 27 '17 at 11:36









                                      David Foerster




























                                           
