What makes grep consider a file to be binary?
I have some database dumps from a Windows system on my box. They are text files. I'm using cygwin to grep through them. These appear to be plain text files; I open them with text editors such as notepad and wordpad and they look legible. However, when I run grep on them, it will say binary file foo.txt matches.
I have noticed that the files contain some ASCII NUL characters, which I believe are artifacts from the database dump.
So what makes grep consider these files to be binary? The NUL character? Is there a flag on the filesystem? What do I need to change to get grep to show me the line matches?
Tag: grep
--null-data may be useful if NUL is the delimiter. – Steve-o, Sep 1 '11 at 13:27
asked Sep 1 '11 at 13:21 by user394; edited Feb 9 '15 at 11:02 by Michel de Ruiter
9 Answers
Accepted answer (109 votes)
If there is a NUL character anywhere in the file, grep will consider it a binary file.
A possible workaround is to strip all NUL characters first and then search the result: cat file | tr -d '\000' | yourgrep
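A minimal sketch of that workaround, assuming a POSIX shell with tr and grep available. The sample file contents and the helper name strip_nul_grep are made up for illustration.

```shell
# Build a small sample file containing NUL bytes (hypothetical data).
sample=$(mktemp)
printf 'id=1\000name=alice\nid=2\000name=bob\n' > "$sample"

# Hypothetical helper: delete every NUL byte, then search the cleaned stream.
strip_nul_grep() {
    # $1 = pattern, $2 = file
    tr -d '\000' < "$2" | grep -- "$1"
}

strip_nul_grep 'alice' "$sample"   # prints: id=1name=alice
rm -f "$sample"
```

Note that the NUL bytes are deleted rather than replaced, so fields that were separated only by NUL end up glued together in the output.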
... or use -a / --text, at least with GNU grep. – derobert, Nov 26 '12 at 20:44
@derobert: actually, on some (older) systems, grep sees the lines, but its output truncates each matching line at the first NUL (probably because it calls C's printf and passes it the matched line?). On such a system, grep cmd .sh_history returns as many empty lines as there are lines matching 'cmd', since each line of sh_history has a specific format with a NUL at the beginning. (But your "at least with GNU grep" caveat probably holds. I don't have one at hand right now to test, but I expect it handles this nicely.) – Olivier Dulac, Nov 25 '13 at 11:46
Is the presence of a NUL character the only criterion? I doubt it. It's probably smarter than that. Anything falling outside the ASCII 32-126 range would be my guess, but we'd have to look at the source code to be sure. – Michael Martinez, Aug 14 '15 at 16:58
My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs. – bbaja42, Aug 18 '15 at 22:31
I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete. – cp.engr, Feb 15 '16 at 16:15
Answer (87 votes)
grep -a worked for me:
    $ grep --help
    [...]
    -a, --text equivalent to --binary-files=text
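To illustrate, here is a small sketch assuming a grep that supports -a (GNU grep does). The sample data and the helper name grep_text are hypothetical.

```shell
# Sample file with embedded NUL bytes (hypothetical data).
sample=$(mktemp)
printf 'id=1\000name=alice\nid=2\000name=bob\n' > "$sample"

# Hypothetical helper: force grep to treat the file as text.
grep_text() {
    # $1 = pattern, $2 = file
    grep -a -- "$1" "$2"
}

# The NUL bytes are passed through unchanged, so tr is used here
# only to make them visible on a terminal.
grep_text 'alice' "$sample" | tr '\000' '|'   # prints: id=1|name=alice
rm -f "$sample"
```

Unlike the tr workaround, this reads the file once and leaves its bytes intact, but raw binary output can confuse a terminal if the file really is binary.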
This is the best, least expensive answer IMO. – pydsigner, Sep 24 '16 at 18:32
Answer (20 votes)
You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.
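A short sketch of the strings approach, assuming the strings utility (for example from binutils) is installed. The log contents and the helper name strings_grep are made up for illustration.

```shell
# Sample "log" with some binary junk mixed in (hypothetical data).
sample=$(mktemp)
printf 'ERROR disk full\000\001\002\nINFO all good\n' > "$sample"

# Hypothetical helper: extract printable runs, then search them.
strings_grep() {
    # $1 = pattern, $2 = file
    strings "$2" | grep -- "$1"
}

strings_grep 'ERROR' "$sample"   # prints: ERROR disk full
rm -f "$sample"
```

By default strings only emits runs of at least 4 printable characters, so very short lines are dropped before grep ever sees them.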
Ideal for grepping log files that might be partly corrupted. – Hannes R., Feb 27 '15 at 7:43
Yes, sometimes binary mixed logging also happens. This is good. – sdkks, Sep 3 '17 at 16:59
Answer (11 votes)
GNU grep 2.24 RTFS
Conclusion: 2 and 2 cases only:
- NUL, e.g. printf 'a\0' | grep 'a'
- encoding error according to the C99 mbrlen(), e.g.:
    export LC_CTYPE='en_US.UTF-8'
    printf 'a\x80' | grep 'a'
  because \x80 cannot be the first byte of a UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org
Furthermore, as mentioned by Stéphane Chazelas (What makes grep consider a file to be binary? | Unix & Linux Stack Exchange), those checks are only done up to the first buffer read of length TODO.
Only up to the first buffer read
So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyway.
I imagine this is for performance reasons.
E.g.: this prints the line:
    printf '%10000000s\n\x80a' | grep 'a'
but this does not:
    printf '%10s\n\x80a' | grep 'a'
The actual buffer size depends on how the file is read. E.g. compare:
    export LC_CTYPE='en_US.UTF-8'
    (printf '\n\x80a') | grep 'a'
    (printf '\n'; sleep 1; printf '\x80a') | grep 'a'
With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.
RTFS
    git clone git://git.savannah.gnu.org/grep.git
    cd grep
    git checkout v2.24
Find where the "Binary file" message is generated:
    git grep 'Binary file'
Leads us to /src/grep.c:
    if (!out_quiet && (encoding_error_output
                       || (0 <= nlines_first_null && nlines_first_null < nlines)))
      {
        printf (_("Binary file %s matches\n"), filename);
If those variables were well named, we basically reached the conclusion.
encoding_error_output
Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:
    clen = mbrlen (p, buf + size - p, &mbs);
    if ((size_t) -2 <= clen)
      return true;
then just man mbrlen.
nlines_first_null and nlines
Initialized as:
    intmax_t nlines_first_null = -1;
    nlines = 0;
so when a null is found 0 <= nlines_first_null becomes true.
TODO when can nlines_first_null < nlines ever be false? I got lazy.
POSIX
Does not define binary options (grep - search a file for a pattern | pubs.opengroup.org), and GNU grep does not document it, so RTFS is the only way.
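The two conditions described in this answer can be approximated from the shell. This is only a rough sketch, not grep's actual code: the helper names are hypothetical, iconv stands in for the mbrlen() check and assumes a UTF-8 locale is the one of interest, and the 32768-byte window is the size one commenter observed for regular files rather than a documented constant.

```shell
# Assumed inspection window; not a documented grep constant.
PEEK_BYTES=32768

# True if the first PEEK_BYTES bytes of the file contain a NUL byte.
has_nul() {
    total=$(head -c "$PEEK_BYTES" "$1" | wc -c)
    kept=$(head -c "$PEEK_BYTES" "$1" | tr -d '\000' | wc -c)
    [ "$total" -ne "$kept" ]
}

# True if the first PEEK_BYTES bytes are not valid UTF-8
# (iconv exits non-zero on an invalid or truncated sequence).
has_utf8_error() {
    ! head -c "$PEEK_BYTES" "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# True if either condition holds.
looks_binary() {
    has_nul "$1" || has_utf8_error "$1"
}

# Usage on three small hypothetical samples.
f=$(mktemp)
printf 'plain text\n' > "$f"; looks_binary "$f" && echo binary || echo text
printf 'a\000b\n'     > "$f"; looks_binary "$f" && echo binary || echo text
printf 'a\200b\n'     > "$f"; looks_binary "$f" && echo binary || echo text
rm -f "$f"
```

Because the window is cut at a fixed byte count, a multibyte character split at the boundary is reported as an error here; treat the result as a hint about why grep might complain, not as a reimplementation.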
Impressive explication! – user394, Apr 13 '16 at 2:02
Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare (printf '\n\0y') | grep y with (printf '\n'; sleep 1; printf '\0y') | grep y for instance. – Stéphane Chazelas, Apr 13 '16 at 12:18
@StéphaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean the export LC_CTYPE='en_US.UTF-8' as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-) – Ciro Santilli, Apr 13 '16 at 13:05
I didn't look into great detail either, but did very recently. – Stéphane Chazelas, Apr 13 '16 at 13:09
@CiroSantilli what version of GNU grep did you test against? – jrw32982, Jun 8 '16 at 23:33
Answer (6 votes)
One of my text files was suddenly being seen as binary by grep:
    $ file foo.txt
    foo.txt: ISO-8859 text
Solution was to convert it by using iconv:
    iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
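As a small sketch of that conversion, assuming iconv is installed and the source encoding really is ISO-8859-1 (iconv cannot detect it for you). The helper name to_utf8 and the sample content are hypothetical.

```shell
# Hypothetical helper: convert a file from a known legacy encoding to UTF-8.
to_utf8() {
    # $1 = source encoding (must be known in advance), $2 = file
    iconv -f "$1" -t UTF-8 "$2"
}

legacy=$(mktemp)
printf 'M\374ller\n' > "$legacy"   # byte 0xFC is "ü" in ISO-8859-1

# Convert on the fly and search the converted stream,
# leaving the original file untouched.
to_utf8 ISO-8859-1 "$legacy" | grep 'ller'
rm -f "$legacy"
```

Converting in a pipeline avoids keeping a second copy of a large dump; redirect to a new file instead if you plan to grep it repeatedly.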
This happened to me as well. In particular, the cause was an ISO-8859-1-encoded non-breaking space, which I had to replace with a regular space in order to get grep to search in the file. – Gallaecio, Jun 9 '15 at 13:50
grep 2.21 treats ISO-8859 text files as if they are binary; add export LC_ALL=C before the grep command. – netawater, Aug 17 '15 at 2:52
@netawater Thanks! This is e.g. the case if you have something like Müller in a text file. That's 0xFC hexadecimal, so outside the range grep would expect for utf8 (up to 0x7F). Check with printf 'a\x7F' | grep 'a' as Ciro describes above. – Anne van Rossum, Nov 26 '16 at 16:51
Answer (5 votes)
The file /etc/magic or /usr/share/misc/magic has a list of sequences that the command file uses for determining the file type.
Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.
grep on Linux has some options to handle binary files, like --binary-files or -U / --binary.
More precisely, encoding error according to C99's mbrlen(). Example and source interpretation at: unix.stackexchange.com/a/276028/32558 – Ciro Santilli, Apr 12 '16 at 20:51
Answer (2 votes)
One of my students had this problem. There is a bug in grep in Cygwin. If the file has non-ASCII characters, grep and egrep see it as binary.
That sounds like a feature, not a bug. Especially given there is a command-line option to control it (-a / --text). – Will Sheppard, Jan 29 at 11:39
Answer (2 votes)
Actually answering the question "What makes grep consider a file to be binary?", you can use iconv:
    $ iconv < myfile.java
    iconv: (stdin):267:70: cannot convert
In my case there were Spanish characters that showed up correctly in text editors but grep considered them as binary; the iconv output pointed me to the line and column numbers of those characters.
In the case of NUL characters, iconv will consider them normal and will not print that kind of output, so this method is not suitable for them.
Answer (1 vote)
I had the same problem. I used vi -b [filename] to see the added characters. I found the control characters ^@ and ^M. Then in vi type :1,$s/^@//g to remove the ^@ characters. Repeat this command for ^M.
Warning: To get the "blue" control characters press Ctrl+v then Ctrl+M or Ctrl+@. Then save and exit vi.
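For a non-interactive variant of the same cleanup, tr can delete both characters in one pass. This is a sketch with a hypothetical helper name and sample data; like the vi substitution with the g flag, it removes every ^@ (NUL) and ^M (carriage return), not only those at line ends.

```shell
# Hypothetical helper: drop NUL (^@) and carriage return (^M) bytes.
strip_nul_cr() {
    # $1 = file; cleaned content goes to stdout
    tr -d '\000\r' < "$1"
}

dump=$(mktemp)
printf 'row1\000\r\nrow2\r\n' > "$dump"

# Write the cleaned copy next to the original, then search it.
strip_nul_cr "$dump" > "$dump.clean"
grep 'row' "$dump.clean"
rm -f "$dump" "$dump.clean"
```

Writing to a separate file keeps the original dump intact in case the NUL bytes turn out to be meaningful delimiters.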
answered Sep 1 '11 at 13:28 by bbaja42 (accepted answer)
answered Sep 2 '15 at 9:43 by Plouff (grep -a answer)
answered Nov 26 '12 at 20:24 by holgero (strings answer)
1
Ideal for grepping log files that might be partly corrupted
â Hannes R.
Feb 27 '15 at 7:43
yes, sometimes binary mixed logging also happens. This is good.
â sdkks
Sep 3 '17 at 16:59
add a comment |Â
1
Ideal for grepping log files that might be partly corrupted
â Hannes R.
Feb 27 '15 at 7:43
yes, sometimes binary mixed logging also happens. This is good.
â sdkks
Sep 3 '17 at 16:59
1
1
Ideal for grepping log files that might be partly corrupted
â Hannes R.
Feb 27 '15 at 7:43
Ideal for grepping log files that might be partly corrupted
â Hannes R.
Feb 27 '15 at 7:43
yes, sometimes binary mixed logging also happens. This is good.
â sdkks
Sep 3 '17 at 16:59
yes, sometimes binary mixed logging also happens. This is good.
â sdkks
Sep 3 '17 at 16:59
add a comment |Â
up vote
11
down vote
GNU grep 2.24 RTFS
Conclusion: 2 and 2 cases only:
NUL
, e.g.printf 'a' | grep 'a'
encoding error according to the C99
mbrlen()
, e.g.:export LC_CTYPE='en_US.UTF-8'
printf 'ax80' | grep 'a'because
x80
cannot be the first byte of an UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org
Furthermore, as mentioned by Stéphane Chazelas What makes grep consider a file to be binary? | Unix & Linux Stack Exchange, those checks are only done up to the first buffer read of length TODO.
Only up to the first buffer read
So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.
I imagine this is for performance reasons.
E.g.: this prints the line:
printf '%10000000snx80a' | grep 'a'
but this does not:
printf '%10snx80a' | grep 'a'
The actual buffer size depends on how the file is read. E.g. compare:
export LC_CTYPE='en_US.UTF-8'
(printf 'nx80a') | grep 'a'
(printf 'n'; sleep 1; printf 'x80a') | grep 'a'
With the sleep
, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.
RTFS
git clone git://git.savannah.gnu.org/grep.git
cd grep
git checkout v2.24
Find where the stderr error message is encoded:
git grep 'Binary file'
Leads us to /src/grep.c
:
if (!out_quiet && (encoding_error_output
|| (0 <= nlines_first_null && nlines_first_null < nlines)))
{
printf (_("Binary file %s matchesn"), filename);
If those variables were well named, we basically reached the conclusion.
encoding_error_output
Quick grepping for encoding_error_output
shows that the only code path that can modify it goes through buf_has_encoding_errors
:
clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
return true;
then just man mbrlen
.
nlines_first_null and nlines
Initialized as:
intmax_t nlines_first_null = -1;
nlines = 0;
so when a null is found 0 <= nlines_first_null
becomes true.
TODO when can nlines_first_null < nlines
ever be false? I got lazy.
POSIX
Does not define binary options grep - search a file for a pattern | pubs.opengroup.org , and GNU grep does not document it, so RTFS is the only way.
1
Impressive explication!
â user394
Apr 13 '16 at 2:02
2
Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare(printf 'ny') | grep y
with(printf 'n'; sleep 1; printf 'y') | grep y
for instance.
â Stéphane Chazelas
Apr 13 '16 at 12:18
@StéphaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean about theexport LC_CTYPE='en_US.UTF-8'
as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
â Ciro Santilli æ°çÂÂæ¹é ä¸Âå¿ å ÂÃ¥ÂÂäºÂ件 æ³Âè½®åÂÂ
Apr 13 '16 at 13:05
1
I didn't look into great detail either, but did very recently
â Stéphane Chazelas
Apr 13 '16 at 13:09
1
@CiroSantilliå·´æ¿馬æÂÂ件å ÂÃ¥ÂÂäºÂ件æ³Âè½®å what version of GNU grep did you test against?
â jrw32982
Jun 8 '16 at 23:33
 |Â
show 6 more comments
up vote
11
down vote
GNU grep 2.24 RTFS
Conclusion: 2 and 2 cases only:
NUL
, e.g.printf 'a' | grep 'a'
encoding error according to the C99
mbrlen()
, e.g.:export LC_CTYPE='en_US.UTF-8'
printf 'ax80' | grep 'a'because
x80
cannot be the first byte of an UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org
Furthermore, as mentioned by Stéphane Chazelas What makes grep consider a file to be binary? | Unix & Linux Stack Exchange, those checks are only done up to the first buffer read of length TODO.
Only up to the first buffer read
So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.
I imagine this is for performance reasons.
E.g.: this prints the line:
printf '%10000000snx80a' | grep 'a'
but this does not:
printf '%10snx80a' | grep 'a'
The actual buffer size depends on how the file is read. E.g. compare:
export LC_CTYPE='en_US.UTF-8'
(printf 'nx80a') | grep 'a'
(printf 'n'; sleep 1; printf 'x80a') | grep 'a'
With the sleep
, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.
RTFS
git clone git://git.savannah.gnu.org/grep.git
cd grep
git checkout v2.24
Find where the stderr error message is encoded:
git grep 'Binary file'
Leads us to /src/grep.c
:
if (!out_quiet && (encoding_error_output
|| (0 <= nlines_first_null && nlines_first_null < nlines)))
{
printf (_("Binary file %s matchesn"), filename);
If those variables were well named, we basically reached the conclusion.
encoding_error_output
Quick grepping for encoding_error_output
shows that the only code path that can modify it goes through buf_has_encoding_errors
:
clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
return true;
then just man mbrlen
.
nlines_first_null and nlines
Initialized as:
intmax_t nlines_first_null = -1;
nlines = 0;
so when a null is found 0 <= nlines_first_null
becomes true.
TODO when can nlines_first_null < nlines
ever be false? I got lazy.
POSIX
Does not define binary options grep - search a file for a pattern | pubs.opengroup.org , and GNU grep does not document it, so RTFS is the only way.
1
Impressive explication!
â user394
Apr 13 '16 at 2:02
2
Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare(printf 'ny') | grep y
with(printf 'n'; sleep 1; printf 'y') | grep y
for instance.
â Stéphane Chazelas
Apr 13 '16 at 12:18
@StéphaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean about theexport LC_CTYPE='en_US.UTF-8'
as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
â Ciro Santilli æ°çÂÂæ¹é ä¸Âå¿ å ÂÃ¥ÂÂäºÂ件 æ³Âè½®åÂÂ
Apr 13 '16 at 13:05
1
I didn't look into great detail either, but did very recently
â Stéphane Chazelas
Apr 13 '16 at 13:09
1
@CiroSantilliå·´æ¿馬æÂÂ件å ÂÃ¥ÂÂäºÂ件æ³Âè½®å what version of GNU grep did you test against?
â jrw32982
Jun 8 '16 at 23:33
 |Â
show 6 more comments
up vote
11
down vote
up vote
11
down vote
GNU grep 2.24 RTFS
Conclusion: 2 and 2 cases only:
NUL
, e.g.printf 'a' | grep 'a'
encoding error according to the C99
mbrlen()
, e.g.:export LC_CTYPE='en_US.UTF-8'
printf 'ax80' | grep 'a'because
x80
cannot be the first byte of an UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org
Furthermore, as mentioned by Stéphane Chazelas What makes grep consider a file to be binary? | Unix & Linux Stack Exchange, those checks are only done up to the first buffer read of length TODO.
Only up to the first buffer read
So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.
I imagine this is for performance reasons.
E.g.: this prints the line:
printf '%10000000snx80a' | grep 'a'
but this does not:
printf '%10snx80a' | grep 'a'
The actual buffer size depends on how the file is read. E.g. compare:
export LC_CTYPE='en_US.UTF-8'
(printf 'nx80a') | grep 'a'
(printf 'n'; sleep 1; printf 'x80a') | grep 'a'
With the sleep
, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.
RTFS
git clone git://git.savannah.gnu.org/grep.git
cd grep
git checkout v2.24
Find where the stderr error message is encoded:
git grep 'Binary file'
Leads us to /src/grep.c
:
if (!out_quiet && (encoding_error_output
|| (0 <= nlines_first_null && nlines_first_null < nlines)))
{
printf (_("Binary file %s matchesn"), filename);
If those variables are well named, we have basically reached the conclusion.
encoding_error_output
Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:
clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
  return true;
then just man mbrlen.
nlines_first_null and nlines
Initialized as:
intmax_t nlines_first_null = -1;
nlines = 0;
so when a NUL is found, 0 <= nlines_first_null becomes true.
TODO: when can nlines_first_null < nlines ever be false? I got lazy.
POSIX
POSIX does not define options for binary files (grep - search a file for a pattern | pubs.opengroup.org), and GNU grep does not document it, so RTFS is the only way.
edited Apr 15 at 7:34 by Drakonoved
answered Apr 12 '16 at 20:50 by Ciro Santilli
1
Impressive explication!
– user394
Apr 13 '16 at 2:02
2
Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file, which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare
(printf '\n\0y') | grep y
with
(printf '\n'; sleep 1; printf '\0y') | grep y
for instance.
– Stéphane Chazelas
Apr 13 '16 at 12:18
@StéphaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean the
export LC_CTYPE='en_US.UTF-8'
as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
– Ciro Santilli
Apr 13 '16 at 13:05
1
I didn't look into great detail either, but did very recently
– Stéphane Chazelas
Apr 13 '16 at 13:09
1
@CiroSantilli what version of GNU grep did you test against?
– jrw32982
Jun 8 '16 at 23:33
up vote
6
down vote
One of my text files was suddenly being seen as binary by grep:
$ file foo.txt
foo.txt: ISO-8859 text
The solution was to convert it using iconv:
iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
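To see what the conversion does at the byte level, here is a small sketch. The wrapper name latin1_to_utf8 and the sample file are hypothetical and only serve the illustration; the iconv options are the ones from the answer.

```shell
# Hypothetical wrapper around the iconv call shown above.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 "$1"
}

# Build a sample: "Müller" with ü stored as the single Latin-1 byte 0xFC (octal 374).
sample=$(mktemp)
printf 'M\374ller\n' > "$sample"

# After conversion, ü is the two-byte UTF-8 sequence 0xC3 0xBC.
latin1_to_utf8 "$sample" | od -An -tx1

rm -f "$sample"
```

Converting changes the file contents, so it is worth keeping the original around, as the answer does by writing to foo_new.txt.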
edited Jun 1 '15 at 20:24 by kenorb
answered Dec 8 '14 at 21:30 by zzapper
1
This happened to me as well. In particular, the cause was an ISO-8859-1-encoded non-breaking space, which I had to replace with a regular space in order to get grep to search in the file.
– Gallaecio
Jun 9 '15 at 13:50
4
grep 2.21 treats ISO-8859 text files as if they are binary; add export LC_ALL=C before the grep command.
– netawater
Aug 17 '15 at 2:52
@netawater Thanks! This is e.g. the case if you have something like Müller in a text file. That's 0xFC hexadecimal, so outside the range grep would expect for UTF-8 (up to 0x7F). Check with printf 'a\x7F' | grep 'a' as Ciro describes above.
– Anne van Rossum
Nov 26 '16 at 16:51
up vote
5
down vote
The file /etc/magic or /usr/share/misc/magic has a list of sequences that the command file uses for determining the file type.
Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.
grep on Linux has some options to handle binary files, like --binary-files or -U / --binary.
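As a rough illustration of those options, assuming GNU grep: the sample file below is made up, and the exact wording of the default "binary file" message varies between versions.

```shell
# Build a sample whose first line contains a NUL byte.
sample=$(mktemp)
printf 'first\0line\nsecond line\n' > "$sample"

# Default: GNU grep typically prints only a "Binary file ... matches" notice.
grep 'second' "$sample"

# -a (equivalent to --binary-files=text) processes the file as text
# and prints the matching line itself.
grep -a 'second' "$sample"

# --binary-files=without-match assumes binary files never match,
# so nothing is printed and the exit status is non-zero.
grep --binary-files=without-match 'second' "$sample" || echo "no match reported"

rm -f "$sample"
```

Note that -a can send raw control bytes to the terminal when a matching line really does contain binary data.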
edited Feb 9 '15 at 11:49 by fduff
answered Sep 1 '11 at 13:27 by klapaucius
More precisely, encoding error according to C99's mbrlen(). Example and source interpretation at: unix.stackexchange.com/a/276028/32558
– Ciro Santilli
Apr 12 '16 at 20:51
up vote
2
down vote
One of my students had this problem. There is a bug in grep in Cygwin. If the file has non-ASCII characters, grep and egrep see it as binary.
edited Sep 10 '15 at 11:14 by Tejas
answered Sep 10 '15 at 9:31 by Joan Pontius
That sounds like a feature, not a bug. Especially given there is a command-line option to control it (-a / --text)
– Will Sheppard
Jan 29 at 11:39
up vote
2
down vote
Actually answering the question "What makes grep consider a file to be binary?", you can use iconv:
$ iconv < myfile.java
iconv: (stdin):267:70: cannot convert
In my case there were Spanish characters that showed up correctly in text editors, but grep considered them binary; the iconv output pointed me to the line and column numbers of those characters.
In the case of NUL characters, iconv will consider them normal and will not print that kind of output, so this method is not suitable.
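For the NUL case that iconv does not report, one possible workaround is sketched below. The function name nul_lines is hypothetical, and the approach assumes the file does not already contain the byte 0x01, which is used as a stand-in marker.

```shell
# Hypothetical helper: print the line numbers of lines containing a NUL byte.
# NUL cannot be passed as a grep pattern argument, so each NUL is first
# translated to 0x01 (assumed not to occur in the file) and grep looks for that.
nul_lines() {
    marker=$(printf '\001')
    tr '\000' '\001' < "$1" | LC_ALL=C grep -a -n "$marker" | cut -d: -f1
}
```

Running nul_lines myfile.java prints one line number per affected line, and nothing if the file contains no NUL bytes.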
edited Apr 14 '16 at 16:49
answered May 20 '15 at 15:12 by golimar
up vote
1
down vote
I had the same problem. I used vi -b [filename] to see the added characters. I found the control characters ^@ and ^M. Then in vi type :1,$s/^@//g to remove the ^@ characters. Repeat this command for ^M.
Warning: To get the "blue" control characters press Ctrl+v then Ctrl+M or Ctrl+@. Then save and exit vi.
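A non-interactive equivalent of those two substitutions might look like the sketch below. The function name strip_nul_cr is hypothetical; like the vi commands with the g flag, it removes every NUL (^@) and every carriage return (^M), not only those at line ends.

```shell
# Hypothetical helper: write the file to stdout with all NUL (^@)
# and carriage-return (^M) bytes removed.
strip_nul_cr() {
    tr -d '\000\r' < "$1"
}
```

A typical use would be strip_nul_cr foo.txt > foo_clean.txt, which leaves the original file untouched.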
edited Jun 1 '15 at 20:29 by kenorb
answered Apr 3 '15 at 18:58 by Not Sure