comm command behaving strangely
I have two files:
- one generated by running the `find` command in a folder to list files, sorted numerically and written to a file, and
- the other generated by a Python script, which is not sorted, so I explicitly sort it numerically.
The problem is that my `comm` output only has two columns and is as follows:

    500016
    500016
    500174
    500174
    500277
    500277

As you can see, even the common entries are shown separately in two columns and the third column is missing altogether, implying that the two files have nothing in common, whereas these first three entries are in fact the same. `comm` otherwise works as expected with some test files that I made.
I know that `comm` needs the two files to be lexically sorted. Here is a list of the options I tried, none of which worked:
- `comm <(sort file1.txt) <(sort file2.txt)` from https://unix.stackexchange.com/a/377689/187419 failed. I also tried giving the `-d` option to `sort` explicitly, and also tried explicitly rewriting the files with dictionary sort. Neither worked.
- `comm --check-order <(sort file1.txt) <(sort file2.txt)` from https://unix.stackexchange.com/a/186101/187419 did not report any order error; it ran as usual, giving two output columns.
- This solution for a problem very close to mine is also not working.
- Thinking that it might be because of some additional characters in the file, I also tried the solution mentioned here, running `:set list` in vim.
- Just to test whether `sort` is causing issues, I deliberately sorted the test files I made (with which `comm` worked earlier) numerically, and `comm` still worked.
I tried the solutions I could find, to no avail. Any other suggestions?
bash comm
asked May 7 at 19:30 by user128785
Line endings? Trailing whitespace? (Check with `hex`, `hexdump`, `od -c`, etc.) On a UNIX-type system the line ending should be (just) `\n`. There should be no `\r`.
– roaima, May 7 at 19:42
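As a minimal sketch of the check suggested in this comment, the following builds two small sample files (the file names and contents are made up for illustration, not taken from the question's data), dumps one with `od -c`, and counts the lines that contain a carriage return:

```shell
# Hypothetical sample data: one file with LF endings, one with CRLF endings.
tmpdir=$(mktemp -d)
printf '500016\n500174\n' > "$tmpdir/unix.txt"
printf '500016\r\n500174\r\n' > "$tmpdir/dos.txt"

# od -c prints every byte; a carriage return appears as \r right before \n.
od -c "$tmpdir/dos.txt"

# Count the lines that contain a carriage return in each file.
# grep -c exits non-zero when the count is 0, hence the "|| true".
cr=$(printf '\r')
unix_cr_lines=$(grep -c "$cr" "$tmpdir/unix.txt" || true)
dos_cr_lines=$(grep -c "$cr" "$tmpdir/dos.txt" || true)
echo "unix.txt: $unix_cr_lines line(s) with CR, dos.txt: $dos_cr_lines line(s) with CR"

rm -rf "$tmpdir"
```

A non-zero count for either input file would explain why `comm` sees visually identical lines as different.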
1 Answer (accepted)
You are almost certainly right that additional characters on each line are causing corresponding lines to fail to match exactly. Those additional characters might take the form of carriage-return characters from Windows-style line terminators, space or tab characters, or possibly other non-printing characters. For example, maybe the Python script is right-justifying the numbers so that some or all of them have leading spaces.
The surest thing to do would be to filter out all such unwanted characters, and since the data are strictly numeric, that's pretty easy to do with, for example, `sed`:

    sed 's/[^0-9]//g' < input > output

You could interpose that at various points in your process. Here's just one:

    comm <(sed 's/[^0-9]//g' file1.txt | sort) <(sed 's/[^0-9]//g' file2.txt | sort)
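To see the effect of that filter, here is a small sketch on made-up data (the file contents are assumptions, chosen so that one file has CRLF line endings). It uses temporary files instead of the `<(...)` process substitution, which is a bash feature, so that it also runs in a plain POSIX shell. `LC_ALL=C` is set here only so that `sort` and `comm` use the same byte-wise ordering.

```shell
# Assumption: byte-wise collation, so sort and comm agree on the order.
LC_ALL=C
export LC_ALL

tmpdir=$(mktemp -d)
# Hypothetical sample data: file1 has LF endings, file2 has CRLF endings.
printf '500016\n500174\n500277\n' > "$tmpdir/file1.txt"
printf '500016\r\n500277\r\n500300\r\n' > "$tmpdir/file2.txt"

# Without cleaning: comm -12 prints only the lines common to both files,
# and finds none, because "500016" and "500016\r" are different strings.
sort "$tmpdir/file1.txt" > "$tmpdir/raw1"
sort "$tmpdir/file2.txt" > "$tmpdir/raw2"
common_raw=$(comm -12 "$tmpdir/raw1" "$tmpdir/raw2")

# With the filter: drop every non-digit character, then sort and compare.
sed 's/[^0-9]//g' "$tmpdir/file1.txt" | sort > "$tmpdir/clean1"
sed 's/[^0-9]//g' "$tmpdir/file2.txt" | sort > "$tmpdir/clean2"
common_clean=$(comm -12 "$tmpdir/clean1" "$tmpdir/clean2")

printf 'common before cleaning: [%s]\n' "$common_raw"
printf 'common after cleaning:  [%s]\n' "$common_clean"

rm -rf "$tmpdir"
```

Note that the filter removes anything that is not a digit, so it is only appropriate when every line is purely numeric, as in the question.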
answered May 7 at 19:49 by John Bollinger
This worked! I want to find out which extra characters are lingering in those files. Can you suggest a way? This is just for my learning. Also, if I know, I will be better able to integrate the solution into my larger program.
– user128785, May 7 at 20:36
@user128785, the comment on your original question suggests several alternatives aimed exactly at looking for and identifying the extra characters. Which tools are available to you will depend on your particular system. For instance, I don't have `hex`, but I do have `hexdump` and `od`. Alternatively, various editors have options that would reveal the extra characters, but details vary, of course.
– John Bollinger, May 7 at 21:05
Following the suggestion of @roaima, I checked `od -c file1.txt`, and similarly for `file2.txt`, and found that the file being written by the Python script had `\r\n` line endings. I installed and used `dos2unix` to convert the line endings to `\n`, and simple `comm` without any options worked normally. I also modified my Python 3 script to have the `newline='\n'` attribute explicitly specified, but I have not tested this yet. After this modification I expect the script to work, but if I have issues with it I will post another question.
– user128785, May 7 at 21:56
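Where `dos2unix` is not installed, a rough stand-in is `tr -d '\r'`. The sketch below uses a made-up sample file; unlike `dos2unix`, `tr` deletes every carriage return in the file rather than only those that are part of a `\r\n` pair, which makes no difference for purely numeric lines like these.

```shell
tmpdir=$(mktemp -d)
# Hypothetical CRLF file standing in for the Python script's output.
printf '500016\r\n500174\r\n' > "$tmpdir/file2.txt"

# Delete all carriage returns, writing the result to a new file.
tr -d '\r' < "$tmpdir/file2.txt" > "$tmpdir/file2.unix.txt"

# Verify that no line in the converted file still contains a carriage return.
cr=$(printf '\r')
remaining=$(grep -c "$cr" "$tmpdir/file2.unix.txt" || true)
converted=$(cat "$tmpdir/file2.unix.txt")
echo "lines still containing CR: $remaining"

rm -rf "$tmpdir"
```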