comm command behaving strangely

I have two files:

one generated using find command in a folder to list files, sorting them numerically and writing to a file,

and the other generated by a python script, which is not sorted, so I explicitly sort it numerically.

The problem is that my comm output has only two columns and is as follows:

As you can see, even the common entries are shown separately in two columns and the third column is missing altogether, implying that the two files have nothing in common, whereas these first three entries are in fact the same. comm otherwise works as expected with some test files that I made.

I know that comm needs the two files to be lexically sorted. Here is a list of the options I tried, all of which failed:

comm <(sort file1.txt) <(sort file2.txt)

from https://unix.stackexchange.com/a/377689/187419 failed. I also tried passing the -d option to sort explicitly, and rewriting the files in dictionary order; neither worked.

comm --check-order <(sort file1.txt) <(sort file2.txt)

from https://unix.stackexchange.com/a/186101/187419 did not return any order error; it ran as usual giving two output columns.

This solution for a problem very close to mine is also not working.

Thinking that it might be because of some additional characters in the file, I also tried the solution mentioned here, running :set list in vim.

Just to test if sort is causing issues, I deliberately sorted the test files I made (with which comm worked earlier) numerically and comm still worked.

I tried the solutions I could find, to no avail. Any other suggestions?

asked May 7 at 19:30

user128785

Line endings? Trailing whitespace? (Check with hex, hexdump, od -c, etc.) On a UNIX-type system the line ending should be (just) \n. There should be no \r.
– roaima
May 7 at 19:42
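
The check suggested in this comment can be sketched roughly as follows. This is only an illustration: the file names sample_crlf.txt and sample_lf.txt and their contents are made up here to imitate a file with Windows-style line endings and one with UNIX-style line endings; they are not the asker's data.

```shell
# Hypothetical sample data (not from the question): one file with \r\n
# line endings, one with plain \n line endings.
tmpdir=$(mktemp -d)
printf '1\r\n2\r\n3\r\n' > "$tmpdir/sample_crlf.txt"
printf '1\n2\n3\n' > "$tmpdir/sample_lf.txt"

# od -c prints every byte; a carriage return shows up as \r before each \n.
od -c "$tmpdir/sample_crlf.txt"

# Count the lines that contain a carriage return in each file.
# grep exits non-zero when nothing matches, hence the "|| true".
cr=$(printf '\r')
crlf_count=$(grep -c "$cr" "$tmpdir/sample_crlf.txt" || true)
lf_count=$(grep -c "$cr" "$tmpdir/sample_lf.txt" || true)
printf 'lines with CR: crlf file=%s, lf file=%s\n' "$crlf_count" "$lf_count"

rm -rf "$tmpdir"
```

A non-zero count for one file and zero for the other would point to mismatched line endings as the reason the lines do not compare equal.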

1 Answer

accepted

You are almost certainly right that additional characters on each line are causing corresponding lines to fail to match exactly. Those additional characters might have the form of carriage-return characters from Windows-style line terminators, space or tab characters, or possibly other non-printing characters. For example, maybe the Python script is right-justifying the numbers so that some or all of them have leading spaces.

The surest thing to do would be to filter out all such unwanted characters, and since the data are strictly numeric, that's pretty easy to do with, for example, sed:

sed 's/[^0-9]//g' < input > output

You could interpose that at various points in your process. Here's just one:

comm <(sed 's/[^0-9]//g' file1.txt | sort) <(sed 's/[^0-9]//g' file2.txt | sort)
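
To see why the filter helps, here is a small sketch using made-up files (a.txt and b.txt are hypothetical, as is their content). It writes intermediate files instead of using process substitution so that it also runs in a plain POSIX shell, and it sets LC_ALL=C, which is an assumption added here to make the comparison strictly byte-by-byte. comm -12 prints only the lines common to both inputs.

```shell
# Hypothetical inputs: the same three numbers, but a.txt has \r\n endings.
work=$(mktemp -d)
printf '10\r\n20\r\n30\r\n' > "$work/a.txt"
printf '10\n20\n30\n' > "$work/b.txt"

# Raw comparison: "10\r" and "10" differ, so no line is reported as common.
common_raw=$(LC_ALL=C comm -12 "$work/a.txt" "$work/b.txt" | wc -l | tr -d ' ')

# Apply the filter from the answer first, then sort and compare again.
sed 's/[^0-9]//g' "$work/a.txt" | LC_ALL=C sort > "$work/a.clean"
sed 's/[^0-9]//g' "$work/b.txt" | LC_ALL=C sort > "$work/b.clean"
common_clean=$(LC_ALL=C comm -12 "$work/a.clean" "$work/b.clean" | wc -l | tr -d ' ')

printf 'common lines before cleaning: %s, after: %s\n' "$common_raw" "$common_clean"
rm -rf "$work"
```

Note that the sed expression deletes everything that is not a digit, so it is only appropriate when the lines really are plain non-negative integers.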

answered May 7 at 19:49

John Bollinger

2168

This worked! I want to find out which extra characters are lingering in those files. Can you suggest a way? This is just for my learning. Also, if I know I will be better able to integrate the solution in my larger program.
– user128785
May 7 at 20:36

@user128785, the comment on your original question suggests several alternatives aimed exactly at looking for and identifying the extra characters. Which tools are available to you will depend on your particular system. For instance, I don't have hex, but I do have hexdump and od. Alternatively, various editors have options that would reveal the extra characters, but details vary, of course.
– John Bollinger
May 7 at 21:05

Following the suggestion of @roaima, I checked od -c file1.txt and similarly for file2.txt, and found that the file written by the python script had \r\n line endings. I installed and used dos2unix to convert the line endings to \n, and plain comm without any options then worked normally. I also modified my python3 script to specify newline='\n' explicitly, but I have not tested this yet. After this modification I expect the script to work, but if I have issues with it I will post another question.
– user128785
May 7 at 21:56
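
Where dos2unix is not installed, a similar conversion can be sketched with tr, which is a swapped-in alternative and not what was used above. The file name from_python.txt and its contents are hypothetical. Unlike dos2unix, tr -d '\r' removes every carriage return in the file, not only those at line ends, which is harmless for a list of numbers.

```shell
# Hypothetical file imitating the script's output with \r\n line endings.
scratch=$(mktemp -d)
printf '7\r\n8\r\n9\r\n' > "$scratch/from_python.txt"

# Delete all carriage returns, leaving plain \n line endings.
tr -d '\r' < "$scratch/from_python.txt" > "$scratch/from_python.unix.txt"

# Verify: no line should contain a carriage return any more, and the file
# should have shrunk from 9 bytes to 6 (three digits plus three newlines).
cr=$(printf '\r')
remaining=$(grep -c "$cr" "$scratch/from_python.unix.txt" || true)
byte_count=$(wc -c < "$scratch/from_python.unix.txt" | tr -d ' ')
printf 'lines still containing CR: %s, size in bytes: %s\n' "$remaining" "$byte_count"

rm -rf "$scratch"
```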
