comm command behaving strangely

I have two files:



  1. one generated using find command in a folder to list files, sorting them numerically and writing to a file,

  2. and the other generated by a python script, which is not sorted, so I explicitly sort it numerically.

The problem is that my comm output only has two columns and is as follows:



500016
500016
500174
500174
500277
500277


As you can see, even the common entries are shown separately in two columns, and the third column is missing altogether, implying that the two files have nothing in common, whereas these first three entries are indeed the same. comm otherwise works as expected with some test files that I made.



I know that comm needs both files to be lexically sorted. Here is what I tried, without success:



comm <(sort file1.txt) <(sort file2.txt)


from https://unix.stackexchange.com/a/377689/187419 failed. I also tried passing the -d option to sort explicitly, and rewriting the files with a dictionary sort; neither worked.



comm --check-order <(sort file1.txt) <(sort file2.txt)


from https://unix.stackexchange.com/a/186101/187419 did not return any order error; it ran as usual giving two output columns.



This solution for a problem very close to mine is also not working.



Thinking that it might be because of some additional characters in the file, I also tried the solution mentioned here of running :set list in vim.



Just to test if sort is causing issues, I deliberately sorted the test files I made (with which comm worked earlier) numerically and comm still worked.



I tried the solutions I could find, to no avail. Any other suggestions?







  • Line endings? Trailing whitespace? (Check with hex, hexdump, od -c, etc.) On a UNIX-type system the line ending should be (just) \n. There should be no \r.
    – roaima
    May 7 at 19:42
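
The check suggested in this comment can be sketched roughly as follows. The sample file and its contents are made up for illustration; only od -c comes from the comment itself, and the grep count is an extra convenience, not something the commenter proposed.

```shell
# Hypothetical sample file with Windows-style (CRLF) line endings.
tmp=$(mktemp -d)
printf '500016\r\n500174\r\n' > "$tmp/sample.txt"

# od -c prints every byte; a carriage return appears as \r just before \n.
od -c "$tmp/sample.txt"

# Count the lines that end in a carriage return.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr\$" "$tmp/sample.txt")
echo "lines ending in CR: $crlf_lines"

rm -rf "$tmp"
```

A count of zero on a real file would suggest that carriage returns are not the culprit and that spaces, tabs, or other bytes are worth looking for instead.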















asked May 7 at 19:30 by user128785











1 Answer
accepted










You are almost certainly right that additional characters on each line are causing corresponding lines to fail to match exactly. Those additional characters might have the form of carriage-return characters from Windows-style line terminators, space or tab characters, or possibly other non-printing characters. For example, maybe the Python script is right-justifying the numbers so that some or all of them have leading spaces.



The surest thing to do would be to filter out all such unwanted characters, and since the data are strictly numeric, that's pretty easy to do with, for example, sed:



sed 's/[^0-9]//g' < input > output


You could interpose that at various points in your process. Here's just one:



comm <(sed 's/[^0-9]//g' file1.txt | sort) <(sed 's/[^0-9]//g' file2.txt | sort)
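
If the stray characters turn out to be carriage returns only (an assumption here, though it matches what the asker later found with od -c), a narrower filter than deleting every non-digit may be enough. The following is a minimal sketch with made-up sample data and temporary file names; it uses temporary files rather than process substitution so that it also runs in a plain POSIX sh, and it pins LC_ALL=C so that sort and comm agree on byte ordering.

```shell
# Use plain byte ordering so sort and comm compare lines identically.
export LC_ALL=C

# Hypothetical sample data: the same numbers, but file2 has CRLF line endings.
tmp=$(mktemp -d)
printf '500016\n500174\n500277\n' > "$tmp/file1.txt"
printf '500016\r\n500174\r\n500277\r\n' > "$tmp/file2.txt"

# comm -12 prints only the lines common to both files.
# Before cleaning, nothing matches because of the trailing \r in file2.
before=$(comm -12 "$tmp/file1.txt" "$tmp/file2.txt")

# Strip only the carriage returns, then sort both inputs the same way.
sort "$tmp/file1.txt" > "$tmp/a.sorted"
tr -d '\r' < "$tmp/file2.txt" | sort > "$tmp/b.sorted"
after=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted")

printf 'common before: [%s]\n' "$before"
printf 'common after:\n%s\n' "$after"

rm -rf "$tmp"
```

Removing only \r leaves any other unexpected characters in place, which makes them easier to notice later; the sed filter above is the safer choice when the data are known to be purely numeric and the source of the noise is unknown.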





  • This worked! I want to find out which extra characters are lingering in those files. Can you suggest a way? This is just for my learning. Also, if I know, I will be better able to integrate the solution into my larger program.
    – user128785
    May 7 at 20:36










  • @user128785, the comment on your original question suggests several alternatives aimed exactly at looking for and identifying the extra characters. Which tools are available to you will depend on your particular system. For instance, I don't have hex, but I do have hexdump and od. Alternatively, various editors have options that would reveal the extra characters, but details vary, of course.
    – John Bollinger
    May 7 at 21:05










  • Following the suggestion of @roaima, I checked od -c file1.txt and similarly for file2.txt, and found that the file being written by the python script had \r\n line-endings. I installed and used dos2unix to convert the line-endings to \n, and simple comm without any options worked normally. I also modified my python3 script to have the newline='\n' attribute explicitly specified but I have not tested this yet. After this modification I expect the script to work, but if I have issues with it I will post another question.
    – user128785
    May 7 at 21:56
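
For the question raised in these comments (which extra characters are in the file), one possible approach is to delete everything that is expected and look at what remains. This is only a sketch: the sample file is hypothetical, and it assumes that the expected content is digits and newlines, as in the question.

```shell
# Hypothetical sample file standing in for the one written by the Python script.
tmp=$(mktemp -d)
printf '500016\r\n500174\r\n500277\r\n' > "$tmp/file2.txt"

# Delete the bytes we expect (digits and newlines) and dump whatever is left.
tr -d '[:digit:]\n' < "$tmp/file2.txt" | od -c

# Count the stray bytes; the arithmetic expansion drops any padding from wc.
stray=$(( $(tr -d '[:digit:]\n' < "$tmp/file2.txt" | wc -c) ))
echo "stray bytes: $stray"

rm -rf "$tmp"
```

With this sample, the od -c dump shows only \r characters, one per line of input, which points at CRLF line endings rather than spaces or tabs.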











answered May 7 at 19:49 by John Bollinger










