What makes grep consider a file to be binary?

up vote
150
down vote

favorite

I have some database dumps from a Windows system on my box. They are text files, and I'm using Cygwin to grep through them. They appear to be plain text: I can open them in text editors such as Notepad and WordPad and they look legible. However, when I run grep on them, it only reports "Binary file foo.txt matches" instead of showing the matching lines.

I have noticed that the files contain some ASCII NUL characters, which I believe are artifacts from the database dump.

So what makes grep consider these files to be binary? The NUL character? Is there a flag on the filesystem? What do I need to change to get grep to show me the line matches?

edited Feb 9 '15 at 11:02

Michel de Ruiter

1033

asked Sep 1 '11 at 13:21

user394

4,747155071

2

--null-data may be useful if NUL is the delimiter.
– Steve-o
Sep 1 '11 at 13:27

grep

9 Answers

up vote
109
down vote

accepted

If there is a NUL character anywhere in the file, grep will consider it as a binary file.

There might be a workaround like this: cat file | tr -d '\000' | yourgrep. This strips all NUL bytes first and then searches through the result.
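As a minimal sketch of that workaround (the sample file and its contents are made up for illustration), the NUL bytes are deleted with tr before grep sees the data:

```shell
# Hypothetical sample: a "text" file with a stray NUL byte in it.
tmp=$(mktemp)
printf 'first line\0\nsecond match line\n' > "$tmp"

# Delete every NUL byte, then search the cleaned stream.
cleaned=$(tr -d '\000' < "$tmp" | grep 'match')
echo "$cleaned"

rm -f "$tmp"
```

Reading the file with a redirect instead of cat avoids one extra process; the result is the same.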

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

116

... or use -a/--text, at least with GNU grep.
– derobert
Nov 26 '12 at 20:44

1

@derobert: actually, on some (older) systems, grep sees the lines, but its output truncates each matching line at the first NUL (probably because it calls C's printf and passes it the matched line?). On such a system, grep cmd .sh_history returns as many empty lines as there are lines matching 'cmd', as each line of sh_history has a specific format with a NUL at the beginning of each line. (But your comment "at least with GNU grep" probably holds true. I don't have one at hand right now to test, but I expect it handles this nicely.)
– Olivier Dulac
Nov 25 '13 at 11:46

4

Is the presence of a NUL character the only criterion? I doubt it. It's probably smarter than that. Anything falling outside the ASCII 32-126 range would be my guess, but we'd have to look at the source code to be sure.
– Michael Martinez
Aug 14 '15 at 16:58

2

My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
– bbaja42
Aug 18 '15 at 22:31

2

I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
– cp.engr
Feb 15 '16 at 16:15


up vote
87
down vote

grep -a worked for me:

$ grep --help
[...]
 -a, --text equivalent to --binary-files=text
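A short sketch of what -a changes, assuming GNU grep and a made-up sample file with embedded NUL bytes:

```shell
# Hypothetical sample file: two records, each with an embedded NUL byte.
sample=$(mktemp)
printf 'id=1\0name=alice\nid=2\0name=bob\n' > "$sample"

# -a / --text makes grep treat the file as text and print the matching line.
# The NUL is stripped afterwards only so the result can be stored in a variable.
as_text=$(grep -a 'bob' "$sample" | tr -d '\000')
echo "$as_text"

rm -f "$sample"
```

Note that with -a the matching lines are printed as they are, NUL bytes included, so the terminal may show them oddly.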

answered Sep 2 '15 at 9:43

Plouff

97153

2

This is the best, least expensive answer IMO.
– pydsigner
Sep 24 '16 at 18:32


up vote
20
down vote

You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.
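For illustration, a small sketch with a made-up file that mixes readable text and non-printable bytes. It assumes the strings utility (usually shipped with GNU binutils) is installed:

```shell
# Hypothetical sample: readable text mixed with non-printable bytes.
mixed=$(mktemp)
printf 'hello world\0\001\002binary tail\n' > "$mixed"

# strings prints runs of at least 4 printable characters by default,
# which grep can then search as ordinary text.
found=$(strings "$mixed" | grep 'hello')
echo "$found"

rm -f "$mixed"
```

Keep in mind that strings splits the input at every non-printable byte, so a match is reported against the extracted fragment rather than the original line.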

answered Nov 26 '12 at 20:24

holgero

30125

1

Ideal for grepping log files that might be partly corrupted
– Hannes R.
Feb 27 '15 at 7:43

Yes, sometimes logs with binary data mixed in also happen. This is good.
– sdkks
Sep 3 '17 at 16:59


up vote
11
down vote

GNU grep 2.24 RTFS

Conclusion: 2 and 2 cases only:

NUL, e.g. printf 'a\0' | grep 'a'

encoding error according to the C99 mbrlen(), e.g.:
```
export LC_CTYPE='en_US.UTF-8'
printf 'a\x80' | grep 'a'
```
because \x80 cannot be the first byte of a UTF-8 encoded Unicode code point: UTF-8 - Description | en.wikipedia.org
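The two cases can be approximated outside of grep with standard tools. This is only a rough sketch, not grep's actual implementation: the function name looks_binary is made up, it inspects the whole file rather than a single buffer, and it assumes the text is meant to be UTF-8 and that iconv is available.

```shell
# Rough approximation of the two checks; NOT grep's actual implementation.
looks_binary() {
    total=$(wc -c < "$1")
    without_nul=$(tr -d '\000' < "$1" | wc -c)
    # Case 1: removing NUL bytes changed the size, so at least one NUL exists.
    if [ "$total" -ne "$without_nul" ]; then
        return 0
    fi
    # Case 2: iconv fails when the input is not valid UTF-8.
    if ! iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        return 0
    fi
    return 1
}

dir=$(mktemp -d)
printf 'plain text\n' > "$dir/plain.txt"
printf 'a\0\n' > "$dir/nul.txt"
printf 'a\200\n' > "$dir/bad_utf8.txt"   # \200 is octal for byte 0x80

report=$(for f in plain.txt nul.txt bad_utf8.txt; do
    if looks_binary "$dir/$f"; then
        echo "$f: binary"
    else
        echo "$f: text"
    fi
done)
echo "$report"

rm -rf "$dir"
```

The octal escape \200 is used instead of \x80 because not every shell's printf understands hexadecimal escapes.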

Furthermore, as mentioned by Stéphane Chazelas (What makes grep consider a file to be binary? | Unix & Linux Stack Exchange), those checks are only done up to the first buffer read of length TODO.

Only up to the first buffer read

So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.

I imagine this is for performance reasons.

E.g.: this prints the line:

printf '%10000000s\n\x80a' | grep 'a'

but this does not:

printf '%10s\n\x80a' | grep 'a'

The actual buffer size depends on how the file is read. E.g. compare:

export LC_CTYPE='en_US.UTF-8'
(printf '\n\x80a') | grep 'a'
(printf '\n'; sleep 1; printf '\x80a') | grep 'a'

With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.

RTFS

git clone git://git.savannah.gnu.org/grep.git 
cd grep
git checkout v2.24

Find where the stderr error message is encoded:

git grep 'Binary file'

Leads us to /src/grep.c:

if (!out_quiet && (encoding_error_output
 || (0 <= nlines_first_null && nlines_first_null < nlines)))
 {
 printf (_("Binary file %s matches\n"), filename);

If those variables were well named, we basically reached the conclusion.

encoding_error_output

Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:

clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
 return true;

then just man mbrlen.

nlines_first_null and nlines

Initialized as:

intmax_t nlines_first_null = -1;
nlines = 0;

so when a null is found 0 <= nlines_first_null becomes true.

TODO when can nlines_first_null < nlines ever be false? I got lazy.

POSIX

Does not define binary options (grep - search a file for a pattern | pubs.opengroup.org), and GNU grep does not document it, so RTFS is the only way.

edited Apr 15 at 7:34

Drakonoved

684518

answered Apr 12 '16 at 20:50

4,54323938

1

Impressive explication!
– user394
Apr 13 '16 at 2:02

2

Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file, which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare (printf '\n\0y') | grep y with (printf '\n'; sleep 1; printf '\0y') | grep y for instance.
– Stéphane Chazelas
Apr 13 '16 at 12:18

@StéphaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean the export LC_CTYPE='en_US.UTF-8' as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
– Ciro Santilli 新疆改造中心 六四事件 法轮功
Apr 13 '16 at 13:05

1

I didn't look into great detail either, but did very recently
– Stéphane Chazelas
Apr 13 '16 at 13:09

1

@CiroSantilliÃ¥Â·Â´Ã¦Â‹Â¿Ã©Â¦Â¬Ã¦Â–Â‡Ã¤Â»Â¶Ã¥Â…ÂÃ¥Â›Â›Ã¤ÂºÂ‹Ã¤Â»Â¶Ã¦Â³Â•Ã¨Â½Â®Ã¥ÂŠÂŸ what version of GNU grep did you test against?
â€“Â jrw32982
Jun 8 '16 at 23:33


up vote
6
down vote

One of my text files was suddenly being seen as binary by grep:

$ file foo.txt
foo.txt: ISO-8859 text

Solution was to convert it by using iconv:

iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
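A small sketch of that conversion on a made-up one-line file, showing how the bytes change. The file contents are an assumption for illustration:

```shell
legacy=$(mktemp)
converted=$(mktemp)

# "Müller" in ISO-8859-1: the u-umlaut is the single byte 0xFC (octal 374).
printf 'M\374ller\n' > "$legacy"

# Re-encode to UTF-8, where the same character becomes the bytes 0xC3 0xBC.
iconv -f ISO-8859-1 -t UTF-8 "$legacy" > "$converted"
hex=$(od -An -tx1 "$converted" | tr -d ' \n')
echo "$hex"

# Count matching lines in the converted file.
matches=$(grep -c 'ller' "$converted")
echo "$matches"

rm -f "$legacy" "$converted"
```

The conversion only helps if the source encoding passed to -f is the right one; iconv cannot detect it, so checking with the file command first, as above, is a reasonable step.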

edited Jun 1 '15 at 20:24

kenorb

7,841365105

answered Dec 8 '14 at 21:30

zzapper

709513

1

This happened to me as well. In particular, the cause was an ISO-8859-1-encoded non-breaking space, which I had to replace with a regular space in order to get grep to search in the file.
– Gallaecio
Jun 9 '15 at 13:50

4

grep 2.21 treats ISO-8859 text files as if they are binary, add export LC_ALL=C before grep command.
– netawater
Aug 17 '15 at 2:52

@netawater Thanks! This is e.g. the case if you have something like Müller in a text file. The ü is 0xFC in ISO-8859-1, which is outside the single-byte range that grep expects for UTF-8 (up to 0x7F). Check with printf 'a\x7F' | grep 'a' as Ciro describes above.
– Anne van Rossum
Nov 26 '16 at 16:51


up vote
5
down vote

The file /etc/magic or /usr/share/misc/magic has a list of sequences that the command file uses for determining the file type.

Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.

grep on Linux has some options to handle binary files like --binary-files or -U / --binary
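As a hedged illustration of --binary-files, assuming GNU grep and a made-up file containing a NUL byte, the values text and without-match give opposite results for the same search:

```shell
data=$(mktemp)
printf 'alpha\0beta\n' > "$data"

# Treat the file as text (same as -a): the matching line is counted normally.
text_count=$(grep -c --binary-files=text 'beta' "$data")

# Assume binary files never match (same as -I): grep reports no match.
skip_status=0
grep --binary-files=without-match 'beta' "$data" > /dev/null || skip_status=$?

echo "text_count=$text_count skip_status=$skip_status"
rm -f "$data"
```

The without-match mode is handy in recursive searches where real binaries should simply be skipped.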

edited Feb 9 '15 at 11:49

fduff

2,61931933

answered Sep 1 '11 at 13:27

klapaucius

45624

More precisely, encoding error according to C99's mbrlen(). Example and source interpretation at: unix.stackexchange.com/a/276028/32558
– Ciro Santilli 新疆改造中心 六四事件 法轮功
Apr 12 '16 at 20:51


up vote
2
down vote

One of my students had this problem. There is a bug in grep in Cygwin. If the file has non-Ascii characters, grep and egrep see it as binary.

edited Sep 10 '15 at 11:14

Tejas

1,77821837

answered Sep 10 '15 at 9:31

Joan Pontius

291

That sounds like a feature, not a bug. Especially given there is a command-line option to control it (-a / --text)
– Will Sheppard
Jan 29 at 11:39


up vote
2
down vote

Actually answering the question "What makes grep consider a file to be binary?", you can use iconv:

$ iconv < myfile.java
iconv: (stdin):267:70: cannot convert

In my case there were Spanish characters that showed up correctly in text editors, but grep considered them binary; the iconv output pointed me to the line and column numbers of those characters.

In the case of NUL characters, iconv will consider them normal and will not print that kind of output, so this method is not suitable for that case.
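A minimal sketch of both behaviours, using made-up files and assuming the text is expected to be UTF-8. It relies only on iconv's exit status, since the exact error message differs between implementations:

```shell
dir=$(mktemp -d)
printf 'ma\361ana\n' > "$dir/latin1.txt"   # \361 is byte 0xF1 (n-tilde in ISO-8859-1)
printf 'a\0b\n' > "$dir/nul.txt"

# A non-zero exit status means the file did not decode cleanly as UTF-8.
latin1_status=0
iconv -f UTF-8 -t UTF-8 "$dir/latin1.txt" > /dev/null 2>&1 || latin1_status=$?

# A NUL byte is valid UTF-8, so iconv does not complain about it.
nul_status=0
iconv -f UTF-8 -t UTF-8 "$dir/nul.txt" > /dev/null 2>&1 || nul_status=$?

echo "latin1_status=$latin1_status nul_status=$nul_status"
rm -rf "$dir"
```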

edited Apr 14 '16 at 16:49

answered May 20 '15 at 15:12

golimar

27519


up vote
1
down vote

I had the same problem. I used vi -b [filename] to see the added characters. I found the control characters ^@ and ^M. Then in vi type :1,$s/^@//g to remove the ^@ characters. Repeat this command for ^M.

Warning: To get the "blue" control characters press Ctrl+v then Ctrl+M or Ctrl+@. Then save and exit vi.

edited Jun 1 '15 at 20:29

kenorb

7,841365105

answered Apr 3 '15 at 18:58

Not Sure

112

add a commentÂ |Â

protected by Communityâ™¦ 5 mins ago

Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).

Would you like to answer one of these unanswered questions instead?

9 Answers
9

active

oldest

votes

9 Answers
9

active

oldest

votes

up vote
109
down vote

accepted

If there is a NUL character anywhere in the file, grep will consider it as a binary file.

There might a workaround like this cat file | tr -d '00' | yourgrep to eliminate all null first, and then to search through file.

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

116

... or use -a/--text, at least with GNU grep.
â€“Â derobert
Nov 26 '12 at 20:44

1

@derobert: actually, on some (older) systems, grep see lines, but its output will truncate each matching line at the first NUL (probably becauses it calls C's printf and gives it the matched line?). On such a system a grep cmd .sh_history will return as many empty lines as there are lines matching 'cmd', as each line of sh_history has a specific format with a NUL at the begining of each line. (but your comment "at least on GNU grep" probably comes true. I don't have one at hand right now to test, but I expect they handle this nicely)
â€“Â Olivier Dulac
Nov 25 '13 at 11:46

4

Is the presence of a NUL character the only criteria? I doubt it. It's probably smarter than that. Anything falling outside the Ascii 32-126 range would be my guess, but we'd have to look at the source code to be sure.
â€“Â Michael Martinez
Aug 14 '15 at 16:58

2

My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
â€“Â bbaja42
Aug 18 '15 at 22:31

2

I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
â€“Â cp.engr
Feb 15 '16 at 16:15

Â |Â
show 4 more comments

up vote
109
down vote

accepted

If there is a NUL character anywhere in the file, grep will consider it as a binary file.

There might a workaround like this cat file | tr -d '00' | yourgrep to eliminate all null first, and then to search through file.

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

116

... or use -a/--text, at least with GNU grep.
â€“Â derobert
Nov 26 '12 at 20:44

1

@derobert: actually, on some (older) systems, grep see lines, but its output will truncate each matching line at the first NUL (probably becauses it calls C's printf and gives it the matched line?). On such a system a grep cmd .sh_history will return as many empty lines as there are lines matching 'cmd', as each line of sh_history has a specific format with a NUL at the begining of each line. (but your comment "at least on GNU grep" probably comes true. I don't have one at hand right now to test, but I expect they handle this nicely)
â€“Â Olivier Dulac
Nov 25 '13 at 11:46

4

Is the presence of a NUL character the only criteria? I doubt it. It's probably smarter than that. Anything falling outside the Ascii 32-126 range would be my guess, but we'd have to look at the source code to be sure.
â€“Â Michael Martinez
Aug 14 '15 at 16:58

2

My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
â€“Â bbaja42
Aug 18 '15 at 22:31

2

I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
â€“Â cp.engr
Feb 15 '16 at 16:15

Â |Â
show 4 more comments

up vote
109
down vote

accepted

If there is a NUL character anywhere in the file, grep will consider it as a binary file.

There might a workaround like this cat file | tr -d '00' | yourgrep to eliminate all null first, and then to search through file.

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

If there is a NUL character anywhere in the file, grep will consider it as a binary file.

There might a workaround like this cat file | tr -d '00' | yourgrep to eliminate all null first, and then to search through file.

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

answered Sep 1 '11 at 13:28

bbaja42

1,80721015

116

... or use -a/--text, at least with GNU grep.
â€“Â derobert
Nov 26 '12 at 20:44

1

@derobert: actually, on some (older) systems, grep see lines, but its output will truncate each matching line at the first NUL (probably becauses it calls C's printf and gives it the matched line?). On such a system a grep cmd .sh_history will return as many empty lines as there are lines matching 'cmd', as each line of sh_history has a specific format with a NUL at the begining of each line. (but your comment "at least on GNU grep" probably comes true. I don't have one at hand right now to test, but I expect they handle this nicely)
â€“Â Olivier Dulac
Nov 25 '13 at 11:46

4

Is the presence of a NUL character the only criteria? I doubt it. It's probably smarter than that. Anything falling outside the Ascii 32-126 range would be my guess, but we'd have to look at the source code to be sure.
â€“Â Michael Martinez
Aug 14 '15 at 16:58

2

My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
â€“Â bbaja42
Aug 18 '15 at 22:31

2

I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
â€“Â cp.engr
Feb 15 '16 at 16:15

Â |Â
show 4 more comments

116

... or use -a/--text, at least with GNU grep.
â€“Â derobert
Nov 26 '12 at 20:44

1

@derobert: actually, on some (older) systems, grep see lines, but its output will truncate each matching line at the first NUL (probably becauses it calls C's printf and gives it the matched line?). On such a system a grep cmd .sh_history will return as many empty lines as there are lines matching 'cmd', as each line of sh_history has a specific format with a NUL at the begining of each line. (but your comment "at least on GNU grep" probably comes true. I don't have one at hand right now to test, but I expect they handle this nicely)
â€“Â Olivier Dulac
Nov 25 '13 at 11:46

4

Is the presence of a NUL character the only criteria? I doubt it. It's probably smarter than that. Anything falling outside the Ascii 32-126 range would be my guess, but we'd have to look at the source code to be sure.
â€“Â Michael Martinez
Aug 14 '15 at 16:58

2

My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
â€“Â bbaja42
Aug 18 '15 at 22:31

2

I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
â€“Â cp.engr
Feb 15 '16 at 16:15

116

... or use -a/--text, at least with GNU grep.
â€“Â derobert
Nov 26 '12 at 20:44

@derobert: actually, on some (older) systems, grep see lines, but its output will truncate each matching line at the first NUL (probably becauses it calls C's printf and gives it the matched line?). On such a system a grep cmd .sh_history will return as many empty lines as there are lines matching 'cmd', as each line of sh_history has a specific format with a NUL at the begining of each line. (but your comment "at least on GNU grep" probably comes true. I don't have one at hand right now to test, but I expect they handle this nicely)
â€“Â Olivier Dulac
Nov 25 '13 at 11:46

Is the presence of a NUL character the only criteria? I doubt it. It's probably smarter than that. Anything falling outside the Ascii 32-126 range would be my guess, but we'd have to look at the source code to be sure.
â€“Â Michael Martinez
Aug 14 '15 at 16:58

My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
â€“Â bbaja42
Aug 18 '15 at 22:31

I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
â€“Â cp.engr
Feb 15 '16 at 16:15

Â |Â
show 4 more comments

up vote
87
down vote

grep -a worked for me:

$ grep --help
[...]
 -a, --text equivalent to --binary-files=text

answered Sep 2 '15 at 9:43

Plouff

97153

2

This is the best, least expensive answer IMO.
â€“Â pydsigner
Sep 24 '16 at 18:32

add a commentÂ |Â

up vote
87
down vote

grep -a worked for me:

$ grep --help
[...]
 -a, --text equivalent to --binary-files=text

answered Sep 2 '15 at 9:43

Plouff

97153

2

This is the best, least expensive answer IMO.
â€“Â pydsigner
Sep 24 '16 at 18:32

add a commentÂ |Â

up vote
87
down vote

grep -a worked for me:

$ grep --help
[...]
 -a, --text equivalent to --binary-files=text

answered Sep 2 '15 at 9:43

Plouff

97153

grep -a worked for me:

$ grep --help
[...]
 -a, --text equivalent to --binary-files=text

answered Sep 2 '15 at 9:43

Plouff

97153

answered Sep 2 '15 at 9:43

Plouff

97153

answered Sep 2 '15 at 9:43

Plouff

97153

answered Sep 2 '15 at 9:43

Plouff

97153

2

This is the best, least expensive answer IMO.
â€“Â pydsigner
Sep 24 '16 at 18:32

add a commentÂ |Â

2

This is the best, least expensive answer IMO.
â€“Â pydsigner
Sep 24 '16 at 18:32

This is the best, least expensive answer IMO.
â€“Â pydsigner
Sep 24 '16 at 18:32

add a commentÂ |Â

up vote
20
down vote

You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.

answered Nov 26 '12 at 20:24

holgero

30125

1

Ideal for grepping log files that might be partly corrupted
â€“Â Hannes R.
Feb 27 '15 at 7:43

yes, sometimes binary mixed logging also happens. This is good.
â€“Â sdkks
Sep 3 '17 at 16:59

add a commentÂ |Â

up vote
20
down vote

You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.

answered Nov 26 '12 at 20:24

holgero

30125

1

Ideal for grepping log files that might be partly corrupted
â€“Â Hannes R.
Feb 27 '15 at 7:43

yes, sometimes binary mixed logging also happens. This is good.
â€“Â sdkks
Sep 3 '17 at 16:59

add a commentÂ |Â

up vote
20
down vote

You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.

answered Nov 26 '12 at 20:24

holgero

30125

You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.

answered Nov 26 '12 at 20:24

holgero

30125

answered Nov 26 '12 at 20:24

holgero

30125

answered Nov 26 '12 at 20:24

holgero

30125

answered Nov 26 '12 at 20:24

holgero

30125

1

Ideal for grepping log files that might be partly corrupted
â€“Â Hannes R.
Feb 27 '15 at 7:43

yes, sometimes binary mixed logging also happens. This is good.
â€“Â sdkks
Sep 3 '17 at 16:59

add a commentÂ |Â

1

Ideal for grepping log files that might be partly corrupted
â€“Â Hannes R.
Feb 27 '15 at 7:43

yes, sometimes binary mixed logging also happens. This is good.
â€“Â sdkks
Sep 3 '17 at 16:59

Ideal for grepping log files that might be partly corrupted
â€“Â Hannes R.
Feb 27 '15 at 7:43

yes, sometimes binary mixed logging also happens. This is good.
â€“Â sdkks
Sep 3 '17 at 16:59

add a commentÂ |Â

up vote
11
down vote

GNU grep 2.24 RTFS

Conclusion: 2 and 2 cases only:

NUL, e.g. printf 'a' | grep 'a'

encoding error according to the C99 mbrlen(), e.g.:
```
export LC_CTYPE='en_US.UTF-8'
printf 'ax80' | grep 'a'
```
because x80 cannot be the first byte of an UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org

Only up to the first buffer read

So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.

I imagine this is for performance reasons.

E.g.: this prints the line:

printf '%10000000snx80a' | grep 'a'

but this does not:

printf '%10snx80a' | grep 'a'

The actual buffer size depends on how the file is read. E.g. compare:

export LC_CTYPE='en_US.UTF-8'
(printf 'nx80a') | grep 'a'
(printf 'n'; sleep 1; printf 'x80a') | grep 'a'

With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.

RTFS

git clone git://git.savannah.gnu.org/grep.git 
cd grep
git checkout v2.24

Find where the stderr error message is encoded:

git grep 'Binary file'

Leads us to /src/grep.c:

if (!out_quiet && (encoding_error_output
 || (0 <= nlines_first_null && nlines_first_null < nlines)))
 {
 printf (_("Binary file %s matchesn"), filename);

If those variables were well named, we basically reached the conclusion.

encoding_error_output

Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:

clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
 return true;

then just man mbrlen.

nlines_first_null and nlines

Initialized as:

intmax_t nlines_first_null = -1;
nlines = 0;

so when a null is found 0 <= nlines_first_null becomes true.

TODO when can nlines_first_null < nlines ever be false? I got lazy.

POSIX

Does not define binary options grep - search a file for a pattern | pubs.opengroup.org , and GNU grep does not document it, so RTFS is the only way.

edited Apr 15 at 7:34

Drakonoved

684518

answered Apr 12 '16 at 20:50

4,54323938

1

Impressive explication!
â€“Â user394
Apr 13 '16 at 2:02

2

Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare (printf 'ny') | grep y with (printf 'n'; sleep 1; printf 'y') | grep y for instance.
â€“Â StÃ©phane Chazelas
Apr 13 '16 at 12:18

@StÃ©phaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean about the export LC_CTYPE='en_US.UTF-8' as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
â€“Â Ciro Santilli Ã¦Â–Â°Ã§Â–Â†Ã¦Â”Â¹Ã©Â€Â Ã¤Â¸ÂÃ¥Â¿Âƒ Ã¥Â…ÂÃ¥Â›Â›Ã¤ÂºÂ‹Ã¤Â»Â¶ Ã¦Â³Â•Ã¨Â½Â®Ã¥ÂŠÂŸ
Apr 13 '16 at 13:05

1

I didn't look into great detail either, but did very recently
â€“Â StÃ©phane Chazelas
Apr 13 '16 at 13:09

1

@CiroSantilliÃ¥Â·Â´Ã¦Â‹Â¿Ã©Â¦Â¬Ã¦Â–Â‡Ã¤Â»Â¶Ã¥Â…ÂÃ¥Â›Â›Ã¤ÂºÂ‹Ã¤Â»Â¶Ã¦Â³Â•Ã¨Â½Â®Ã¥ÂŠÂŸ what version of GNU grep did you test against?
â€“Â jrw32982
Jun 8 '16 at 23:33

Â |Â
show 6 more comments

up vote
11
down vote

GNU grep 2.24 RTFS

Conclusion: 2 and 2 cases only:

NUL, e.g. printf 'a' | grep 'a'

encoding error according to the C99 mbrlen(), e.g.:
```
export LC_CTYPE='en_US.UTF-8'
printf 'ax80' | grep 'a'
```
because x80 cannot be the first byte of an UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org

Only up to the first buffer read

So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.

I imagine this is for performance reasons.

E.g.: this prints the line:

printf '%10000000snx80a' | grep 'a'

but this does not:

printf '%10snx80a' | grep 'a'

The actual buffer size depends on how the file is read. E.g. compare:

export LC_CTYPE='en_US.UTF-8'
(printf 'nx80a') | grep 'a'
(printf 'n'; sleep 1; printf 'x80a') | grep 'a'

With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.

RTFS

git clone git://git.savannah.gnu.org/grep.git 
cd grep
git checkout v2.24

Find where the stderr error message is encoded:

git grep 'Binary file'

Leads us to /src/grep.c:

if (!out_quiet && (encoding_error_output
 || (0 <= nlines_first_null && nlines_first_null < nlines)))
 {
 printf (_("Binary file %s matchesn"), filename);

If those variables were well named, we basically reached the conclusion.

encoding_error_output

Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:

clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
 return true;

then just man mbrlen.

nlines_first_null and nlines

Initialized as:

intmax_t nlines_first_null = -1;
nlines = 0;

so when a null is found 0 <= nlines_first_null becomes true.

TODO when can nlines_first_null < nlines ever be false? I got lazy.

POSIX

Does not define binary options grep - search a file for a pattern | pubs.opengroup.org , and GNU grep does not document it, so RTFS is the only way.

edited Apr 15 at 7:34

Drakonoved

684518

answered Apr 12 '16 at 20:50

4,54323938

1

Impressive explication!
â€“Â user394
Apr 13 '16 at 2:02

2

Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare (printf 'ny') | grep y with (printf 'n'; sleep 1; printf 'y') | grep y for instance.
â€“Â StÃ©phane Chazelas
Apr 13 '16 at 12:18

@StÃ©phaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean about the export LC_CTYPE='en_US.UTF-8' as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
â€“Â Ciro Santilli Ã¦Â–Â°Ã§Â–Â†Ã¦Â”Â¹Ã©Â€Â Ã¤Â¸ÂÃ¥Â¿Âƒ Ã¥Â…ÂÃ¥Â›Â›Ã¤ÂºÂ‹Ã¤Â»Â¶ Ã¦Â³Â•Ã¨Â½Â®Ã¥ÂŠÂŸ
Apr 13 '16 at 13:05

1

I didn't look into great detail either, but did very recently
â€“Â StÃ©phane Chazelas
Apr 13 '16 at 13:09

1

@CiroSantilliÃ¥Â·Â´Ã¦Â‹Â¿Ã©Â¦Â¬Ã¦Â–Â‡Ã¤Â»Â¶Ã¥Â…ÂÃ¥Â›Â›Ã¤ÂºÂ‹Ã¤Â»Â¶Ã¦Â³Â•Ã¨Â½Â®Ã¥ÂŠÂŸ what version of GNU grep did you test against?
â€“Â jrw32982
Jun 8 '16 at 23:33

Â |Â
show 6 more comments

up vote
11
down vote

GNU grep 2.24 RTFS

Conclusion: 2 and 2 cases only:

NUL, e.g. printf 'a' | grep 'a'

encoding error according to the C99 mbrlen(), e.g.:
```
export LC_CTYPE='en_US.UTF-8'
printf 'ax80' | grep 'a'
```
because x80 cannot be the first byte of an UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org

Only up to the first buffer read

So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.

I imagine this is for performance reasons.

E.g.: this prints the line:

printf '%10000000snx80a' | grep 'a'

but this does not:

printf '%10snx80a' | grep 'a'

The actual buffer size depends on how the file is read. E.g. compare:

export LC_CTYPE='en_US.UTF-8'
(printf 'nx80a') | grep 'a'
(printf 'n'; sleep 1; printf 'x80a') | grep 'a'

With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.

RTFS

git clone git://git.savannah.gnu.org/grep.git 
cd grep
git checkout v2.24

Find where the stderr error message is encoded:

git grep 'Binary file'

Leads us to /src/grep.c:

if (!out_quiet && (encoding_error_output
 || (0 <= nlines_first_null && nlines_first_null < nlines)))
 {
 printf (_("Binary file %s matchesn"), filename);

If those variables were well named, we basically reached the conclusion.

encoding_error_output

Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:

clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
 return true;

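Then just man mbrlen: it returns (size_t) -1 for an invalid sequence and (size_t) -2 for an incomplete one, so the test above is true in exactly those two cases.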

nlines_first_null and nlines

Initialized as:

intmax_t nlines_first_null = -1;
nlines = 0;

so when a null is found 0 <= nlines_first_null becomes true.

TODO when can nlines_first_null < nlines ever be false? I got lazy.

POSIX

POSIX does not define any binary-file options for grep (see "grep - search a file for a pattern" on pubs.opengroup.org), and GNU grep does not document it, so RTFS is the only way.

edited Apr 15 at 7:34

Drakonoved

684518

answered Apr 12 '16 at 20:50

4,54323938


1

Impressive explication!
– user394
Apr 13 '16 at 2:02

2

Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare (printf '\n\0y') | grep y with (printf '\n'; sleep 1; printf '\0y') | grep y for instance.
– Stéphane Chazelas
Apr 13 '16 at 12:18


up vote
6
down vote

One of my text files was suddenly being seen as binary by grep:

$ file foo.txt
foo.txt: ISO-8859 text

Solution was to convert it by using iconv:

iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
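The conversion can be tried end to end with a small sketch. The sample contents and temporary file names are made up for illustration, and it assumes the source really is ISO-8859-1 (file only says "ISO-8859 text", so the exact variant is a guess). The last command shows an alternative that avoids converting at all; it relies on grep treating every byte as a valid character in the C locale, which should hold for recent GNU grep but is worth checking on your version.

```shell
# Sample ISO-8859-1 file; the contents are made up. \374 is "ü" in that encoding.
sample=$(mktemp)
printf 'M\374ller\n' > "$sample"

# Convert to UTF-8 as in the answer, writing to a second file.
converted=$(mktemp)
iconv -f ISO-8859-1 -t UTF-8 "$sample" > "$converted"

# The converted file no longer contains invalid UTF-8, so a plain grep on it
# should print the matching line instead of "Binary file ... matches".
grep 'ller' "$converted"

# Alternative without converting: search the original file in the C locale.
# The matching line is printed with its raw ISO-8859-1 byte intact.
LC_ALL=C grep 'ller' "$sample"
```

Converting is the better choice when the pattern itself contains non-ASCII characters, because the pattern and the file then use the same encoding.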

edited Jun 1 '15 at 20:24

kenorb

7,841365105

answered Dec 8 '14 at 21:30

zzapper

709513

1

This happened to me as well. In particular, the cause was an ISO-8859-1-encoded non-breaking space, which I had to replace with a regular space in order to get grep to search in the file.
– Gallaecio
Jun 9 '15 at 13:50

4

grep 2.21 treats ISO-8859 text files as if they are binary, add export LC_ALL=C before grep command.
– netawater
Aug 17 '15 at 2:52

@netawater Thanks! This is e.g. the case if you have something like Müller in a text-file. That's 0xFC hexadecimal, so outside the range grep would expect for utf8 (up to 0x7F). Check with printf 'a\x7F' | grep 'a' as Ciro describes above.
– Anne van Rossum
Nov 26 '16 at 16:51


up vote
5
down vote

The file /etc/magic or /usr/share/misc/magic has a list of sequences that the command file uses for determining the file type.

Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.

grep on Linux has some options to handle binary files like --binary-files or -U / --binary
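For the question's case of stray NUL bytes, the simplest of these is probably -a (long form --text), which GNU grep documents as equivalent to --binary-files=text. A minimal sketch, with made-up sample input:

```shell
# Sample input containing a NUL byte (\000); without -a, GNU grep reports
# "Binary file (standard input) matches" instead of printing the line.
# tr replaces the NUL with "?" only to make the output printable.
printf 'foo\000bar\n' | grep -a 'foo' | tr '\000' '?'
# prints: foo?bar
```

Be aware that with -a the raw bytes go straight to the terminal, which is why the example filters the NUL out before displaying the line.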

edited Feb 9 '15 at 11:49

fduff

2,61931933

answered Sep 1 '11 at 13:27

klapaucius

45624

More precisely, encoding error according to C99's mbrlen(). Example and source interpretation at: unix.stackexchange.com/a/276028/32558
– Ciro Santilli 新疆改造中心 六四事件 法轮功
Apr 12 '16 at 20:51


up vote
2
down vote

One of my students had this problem. There is a bug in grep in Cygwin. If the file has non-ASCII characters, grep and egrep see it as binary.

edited Sep 10 '15 at 11:14

Tejas

1,77821837

answered Sep 10 '15 at 9:31

Joan Pontius

291

That sounds like a feature, not a bug. Especially given there is a command-line option to control it (-a / --text)
– Will Sheppard
Jan 29 at 11:39


up vote
2
down vote

Actually answering the question "What makes grep consider a file to be binary?", you can use iconv:

$ iconv < myfile.java
iconv: (stdin):267:70: cannot convert

In my case there were Spanish characters that showed up correctly in text editors but that grep treated as binary; the iconv output pointed me to the line and column numbers of those characters.

In the case of NUL characters, iconv considers them normal and does not print that kind of output, so this method is not suitable for them.
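For the NUL case, a rough sketch like the following can list the affected line numbers instead. The helper name nul_lines is made up, and the trick of mapping NUL to SOH (\001) first is only there because a shell argument cannot contain a NUL byte; it assumes the file has no SOH bytes of its own.

```shell
# Hypothetical helper: print the numbers of the lines that contain a NUL byte.
# Assumption: the file does not already contain SOH (\001) bytes.
nul_lines() {
    soh=$(printf '\001')
    # Map NUL to SOH, search for SOH with line numbers, keep only the numbers.
    tr '\000' '\001' < "$1" | LC_ALL=C grep -a -n "$soh" | cut -d: -f1
}
```

Running nul_lines myfile.java prints one line number per line that contains at least one NUL, and nothing if there are none.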

edited Apr 14 '16 at 16:49

answered May 20 '15 at 15:12

golimar

27519


up vote
1
down vote

Warning: To get the "blue" control characters press Ctrl+v then Ctrl+M or Ctrl+@. Then save and exit vi.

edited Jun 1 '15 at 20:29

kenorb

7,841365105

answered Apr 3 '15 at 18:58

Not Sure

112
