What makes grep consider a file to be binary?












150 votes, 31 favorites












I have some database dumps from a Windows system on my box. They are text files, and I'm using Cygwin to grep through them. They appear to be plain text: I can open them with text editors such as Notepad and WordPad and they look legible. However, when I run grep on them, it says "Binary file foo.txt matches".



I have noticed that the files contain some ASCII NUL characters, which I believe are artifacts from the database dump.



So what makes grep consider these files to be binary? The NUL character? Is there a flag on the filesystem? What do I need to change to get grep to show me the line matches?





























  • 2




    --null-data may be useful if NUL is the delimiter.
    – Steve-o
    Sep 1 '11 at 13:27






















edited Feb 9 '15 at 11:02 by Michel de Ruiter
asked Sep 1 '11 at 13:21 by user394

















9 Answers

















109 votes (accepted answer)










If there is a NUL character anywhere in the file, grep will consider it as a binary file.



There might be a workaround like this: cat file | tr -d '\000' | yourgrep, which eliminates all NUL characters first and then searches through the file.
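A minimal sketch of that workaround, for illustration only. The helper name grep_without_nul and the sample data are made up here; the only tools assumed are tr and grep:

```shell
# Sketch: strip NUL bytes before searching, so grep only sees plain text.
# "grep_without_nul" is a hypothetical helper name, not a standard tool.
grep_without_nul() {
    pattern=$1
    file=$2
    # tr -d '\000' deletes every NUL byte; all other bytes pass through.
    tr -d '\000' < "$file" | grep -- "$pattern"
}

# Demo on a throwaway file whose matching line contains a NUL.
demo=$(mktemp)
printf 'first line\nfoo\000bar\nlast line\n' > "$demo"
grep_without_nul 'foo' "$demo"    # prints: foobar
rm -f "$demo"
```

Note that the printed line no longer contains the NUL, so it is not byte-for-byte what is in the file.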






















  • 116




    ... or use -a/--text, at least with GNU grep.
    – derobert
    Nov 26 '12 at 20:44






  • 1




    @derobert: actually, on some (older) systems, grep sees the lines, but its output truncates each matching line at the first NUL (probably because it calls C's printf and gives it the matched line?). On such a system, grep cmd .sh_history returns as many empty lines as there are lines matching 'cmd', because each line of sh_history has a specific format with a NUL at the beginning. (But your comment "at least with GNU grep" is probably true. I don't have one at hand right now to test, but I expect it handles this nicely.)
    – Olivier Dulac
    Nov 25 '13 at 11:46







  • 4




    Is the presence of a NUL character the only criterion? I doubt it. It's probably smarter than that. Anything falling outside the ASCII 32-126 range would be my guess, but we'd have to look at the source code to be sure.
    – Michael Martinez
    Aug 14 '15 at 16:58






  • 2




    My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
    – bbaja42
    Aug 18 '15 at 22:31






  • 2




    I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
    – cp.engr
    Feb 15 '16 at 16:15

















87 votes













grep -a worked for me:



$ grep --help
[...]
-a, --text equivalent to --binary-files=text
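To see the effect on a small sample, here is a sketch that assumes a grep supporting -a (GNU grep does). The sample records are invented, and the final tr step is only there to make the NUL visible as a space:

```shell
# Sketch: with -a (--text), grep prints the matching line even though the
# input contains NUL bytes. The two records below are made-up sample data.
out=$(printf 'id=1\000name=foo\nid=2\000name=bar\n' \
        | grep -a 'name=foo' \
        | tr '\000' ' ')    # show the NUL as a space for readability
printf '%s\n' "$out"        # prints: id=1 name=foo
```

Unlike stripping the NULs beforehand, -a leaves the matched line untouched, so the NUL is still present in grep's own output.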





















  • 2




    This is the best, least expensive answer IMO.
    – pydsigner
    Sep 24 '16 at 18:32

















20 votes













You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.
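A small sketch of the same idea, assuming strings (usually shipped with binutils) is installed. The helper name grep_strings and the sample bytes are made up for illustration:

```shell
# Sketch: search only the printable runs that strings(1) extracts.
# "grep_strings" is a hypothetical helper name.
grep_strings() {
    pattern=$1
    file=$2
    # strings prints runs of printable characters (4 or more by default),
    # one per line, so grep never sees the NULs or control bytes.
    strings "$file" | grep -- "$pattern"
}

# Demo: a log message surrounded by non-printable bytes.
demo=$(mktemp)
printf 'junk\001\002error: disk full\000\003more\n' > "$demo"
grep_strings 'error' "$demo"
rm -f "$demo"
```

One trade-off: strings splits the input at every non-printable byte, so a match is reported as a printable fragment rather than as the original line.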






















  • 1




    Ideal for grepping log files that might be partly corrupted
    – Hannes R.
    Feb 27 '15 at 7:43










  • yes, sometimes binary mixed logging also happens. This is good.
    – sdkks
    Sep 3 '17 at 16:59

















11 votes













GNU grep 2.24 RTFS



Conclusion: 2 and 2 cases only:



  • NUL, e.g. printf 'a\0' | grep 'a'

  • encoding error according to the C99 mbrlen(), e.g.:

    export LC_CTYPE='en_US.UTF-8'
    printf 'a\x80' | grep 'a'

    because \x80 cannot be the first byte of a UTF-8 encoded Unicode code point: UTF-8 - Description | en.wikipedia.org
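To tell which of the two conditions applies to a given file, something like the following sketch may help. The helper name why_binary is made up for illustration, the encoding check assumes UTF-8 is the intended encoding and that iconv is available, and it scans the whole file rather than mimicking exactly what grep inspects:

```shell
# Sketch of a diagnostic for the two cases listed above.
# "why_binary" is a hypothetical helper, not part of grep.
why_binary() {
    file=$1
    # Case 1: count NUL bytes by deleting everything that is not a NUL.
    nul_count=$(LC_ALL=C tr -cd '\000' < "$file" | wc -c | tr -d ' ')
    if [ "$nul_count" -gt 0 ]; then
        echo "contains $nul_count NUL byte(s)"
    fi
    # Case 2 (assuming UTF-8): iconv exits non-zero on invalid sequences.
    if ! iconv -f UTF-8 -t UTF-8 < "$file" > /dev/null 2>&1; then
        echo "not valid UTF-8"
    fi
}
```

Usage would be why_binary foo.txt; no output means neither condition was found.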



Furthermore, as mentioned by Stéphane Chazelas (What makes grep consider a file to be binary? | Unix & Linux Stack Exchange), those checks are only done up to the first buffer read of length TODO.



Only up to the first buffer read



So if a NUL or encoding error occurs in the middle of a very large file, the file might be grepped anyway.



I imagine this is for performance reasons.



E.g.: this prints the line:



printf '%10000000s\n\x80a' | grep 'a'


but this does not:



printf '%10s\n\x80a' | grep 'a'


The actual buffer size depends on how the file is read. E.g. compare:



export LC_CTYPE='en_US.UTF-8'
(printf '\n\x80a') | grep 'a'
(printf '\n'; sleep 1; printf '\x80a') | grep 'a'


With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.



RTFS



git clone git://git.savannah.gnu.org/grep.git 
cd grep
git checkout v2.24


Find where the "Binary file" message is generated:



git grep 'Binary file'


Leads us to /src/grep.c:



if (!out_quiet && (encoding_error_output
                   || (0 <= nlines_first_null && nlines_first_null < nlines)))
  {
    printf (_("Binary file %s matches\n"), filename);


If those variables were well named, we basically reached the conclusion.



encoding_error_output



Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:



clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
  return true;


then just man mbrlen.



nlines_first_null and nlines



Initialized as:



intmax_t nlines_first_null = -1;
nlines = 0;


so when a null is found 0 <= nlines_first_null becomes true.



TODO when can nlines_first_null < nlines ever be false? I got lazy.



POSIX



POSIX does not define binary options (see grep - search a file for a pattern | pubs.opengroup.org), and GNU grep does not document it, so RTFS is the only way.
























  • 1




    Impressive explication!
    – user394
    Apr 13 '16 at 2:02






  • 2




    Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file, which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare (printf '\n\0y') | grep y with (printf '\n'; sleep 1; printf '\0y') | grep y for instance.
    – Stéphane Chazelas
    Apr 13 '16 at 12:18











  • @StéphaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean about the export LC_CTYPE='en_US.UTF-8' as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
    – Ciro Santilli 新疆改造中心 六四事件 法轮功
    Apr 13 '16 at 13:05






  • 1




    I didn't look into great detail either, but did very recently
    – Stéphane Chazelas
    Apr 13 '16 at 13:09






  • 1




    @CiroSantilli巴拿馬文件六四事件法轮功 what version of GNU grep did you test against?
    – jrw32982
    Jun 8 '16 at 23:33

















6 votes













One of my text files was suddenly being seen as binary by grep:



$ file foo.txt
foo.txt: ISO-8859 text


Solution was to convert it by using iconv:



iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
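If you would rather not create a second file, the conversion can be piped straight into grep. This is only a sketch: it assumes the input really is ISO-8859-1, and the sample word is made up (\374 is "ü" in that encoding):

```shell
# Sketch: convert on the fly instead of writing foo_new.txt.
# Assumes ISO-8859-1 input; a wrong -f value would silently produce
# the wrong characters rather than an error.
converted=$(printf 'M\374ller\n' \
              | iconv -f ISO-8859-1 -t UTF-8 \
              | grep 'ller')
printf '%s\n' "$converted"    # the matching line, now encoded as UTF-8
```

For a real file, the first stage of the pipeline would be iconv -f ISO-8859-1 -t UTF-8 foo.txt instead of the printf.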























  • 1




    This happened to me as well. In particular, the cause was an ISO-8859-1-encoded non-breaking space, which I had to replace with a regular space in order to get grep to search in the file.
    – Gallaecio
    Jun 9 '15 at 13:50






  • 4




    grep 2.21 treats ISO-8859 text files as if they were binary; add export LC_ALL=C before the grep command.
    – netawater
    Aug 17 '15 at 2:52











  • @netawater Thanks! This is e.g. the case if you have something like Müller in a text file. That's 0xFC hexadecimal, so outside the range grep would expect for UTF-8 (up to 0x7F). Check with printf 'a\x7F' | grep 'a' as Ciro describes above.
    – Anne van Rossum
    Nov 26 '16 at 16:51

















5 votes













The file /etc/magic or /usr/share/misc/magic has a list of sequences that the command file uses for determining the file type.



Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.



grep on Linux has some options to handle binary files, such as --binary-files or -U / --binary.




























  • More precisely, encoding error according to C99's mbrlen(). Example and source interpretation at: unix.stackexchange.com/a/276028/32558
    – Ciro Santilli 新疆改造中心 六四事件 法轮功
    Apr 12 '16 at 20:51

















2 votes













One of my students had this problem. There is a bug in grep in Cygwin: if the file has non-ASCII characters, grep and egrep see it as binary.




























  • That sounds like a feature, not a bug. Especially given there is a command-line option to control it (-a / --text)
    – Will Sheppard
    Jan 29 at 11:39


















2 votes













Actually answering the question "What makes grep consider a file to be binary?", you can use iconv:



$ iconv < myfile.java
iconv: (stdin):267:70: cannot convert


In my case there were Spanish characters that showed up correctly in text editors, but grep treated the file as binary; the iconv output pointed me to the line and column numbers of those characters.

NUL characters are a different case: iconv considers them normal and prints no such message, so this method is not suitable for finding them.



































    1 vote













    I had the same problem. I used vi -b [filename] to see the added characters. I found the control characters ^@ and ^M. Then in vi type :1,$s/^@//g to remove the ^@ characters. Repeat this command for ^M.



    Warning: To get the "blue" control characters press Ctrl+v then Ctrl+M or Ctrl+@. Then save and exit vi.
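The same cleanup can probably be done without an editor. A minimal sketch, where strip_nul_cr is a made-up helper name and the output file name in the usage note is only an example:

```shell
# Sketch: remove NUL (^@) and carriage return (^M) bytes non-interactively.
# "strip_nul_cr" is a hypothetical helper; it writes the cleaned text to
# standard output and leaves the original file untouched.
strip_nul_cr() {
    tr -d '\000\r' < "$1"
}
```

Usage would be something like strip_nul_cr foo.txt > foo_clean.txt, followed by grep on the new file. Note that this deletes every carriage return, not only those at line ends.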





































      answered Sep 1 '11 at 13:28 by bbaja42 (the accepted answer, 109 votes)

















      answered Sep 2 '15 at 9:43 by Plouff (the grep -a answer, 87 votes)

















      answered Nov 26 '12 at 20:24 by holgero (the strings answer, 20 votes)







      • 1




        Ideal for grepping log files that might be partly corrupted
        – Hannes R.
        Feb 27 '15 at 7:43










      • yes, sometimes binary mixed logging also happens. This is good.
        – sdkks
        Sep 3 '17 at 16:59












      • 1




        Ideal for grepping log files that might be partly corrupted
        – Hannes R.
        Feb 27 '15 at 7:43










      • yes, sometimes binary mixed logging also happens. This is good.
        – sdkks
        Sep 3 '17 at 16:59







      1




      1




      Ideal for grepping log files that might be partly corrupted
      – Hannes R.
      Feb 27 '15 at 7:43




      Ideal for grepping log files that might be partly corrupted
      – Hannes R.
      Feb 27 '15 at 7:43












      yes, sometimes binary mixed logging also happens. This is good.
      – sdkks
      Sep 3 '17 at 16:59




      yes, sometimes binary mixed logging also happens. This is good.
      – sdkks
      Sep 3 '17 at 16:59










      up vote
      11
      down vote













      GNU grep 2.24 RTFS



      Conclusion: 2 and 2 cases only:



      • NUL, e.g. printf 'a' | grep 'a'



      • encoding error according to the C99 mbrlen(), e.g.:



        export LC_CTYPE='en_US.UTF-8'
        printf 'ax80' | grep 'a'


        because x80 cannot be the first byte of an UTF-8 Unicode point: UTF-8 - Description | en.wikipedia.org



      Furthermore, as mentioned by Stéphane Chazelas What makes grep consider a file to be binary? | Unix & Linux Stack Exchange, those checks are only done up to the first buffer read of length TODO.



      Only up to the first buffer read



      So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyways.



      I imagine this is for performance reasons.



      E.g.: this prints the line:



      printf '%10000000snx80a' | grep 'a'


      but this does not:



      printf '%10snx80a' | grep 'a'


      The actual buffer size depends on how the file is read. E.g. compare:



      export LC_CTYPE='en_US.UTF-8'
      (printf 'nx80a') | grep 'a'
      (printf 'n'; sleep 1; printf 'x80a') | grep 'a'


      With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.



      RTFS



      git clone git://git.savannah.gnu.org/grep.git 
      cd grep
      git checkout v2.24


      Find where the stderr error message is encoded:



      git grep 'Binary file'


      Leads us to /src/grep.c:



      if (!out_quiet && (encoding_error_output
      || (0 <= nlines_first_null && nlines_first_null < nlines)))
      {
      printf (_("Binary file %s matchesn"), filename);


      If those variables were well named, we basically reached the conclusion.



      encoding_error_output



      Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:



      clen = mbrlen (p, buf + size - p, &mbs);
      if ((size_t) -2 <= clen)
      return true;


      then just man mbrlen.



      nlines_first_null and nlines



      Initialized as:



      intmax_t nlines_first_null = -1;
      nlines = 0;


      so when a null is found 0 <= nlines_first_null becomes true.



      TODO when can nlines_first_null < nlines ever be false? I got lazy.



      POSIX



      Does not define binary options grep - search a file for a pattern | pubs.opengroup.org , and GNU grep does not document it, so RTFS is the only way.






      share|improve this answer


















      • 1




        Impressive explication!
        – user394
        Apr 13 '16 at 2:02






      • 2




        Note that the check for valid UTF-8 only happens in UTF-8 locales. Also note that the check is only done on the first buffer read from the file which for a regular file seems to be 32768 bytes on my system, but for a pipe or socket can be as small as one byte. Compare (printf 'ny') | grep y with (printf 'n'; sleep 1; printf 'y') | grep y for instance.
        – Stéphane Chazelas
        Apr 13 '16 at 12:18











      • @StéphaneChazelas "Note that the check for valid UTF-8 only happens in UTF-8 locales": do you mean about the export LC_CTYPE='en_US.UTF-8' as in my example, or something else? Buf read: amazing example, added to answer. You have obviously read the source more than me, reminds me of those hacker koans "The student was enlightened" :-)
        – Ciro Santilli 新疆改造中心 六四事件 法轮功
        Apr 13 '16 at 13:05






      • 1




        I didn't look into great detail either, but did very recently
        – Stéphane Chazelas
        Apr 13 '16 at 13:09






      • 1




        @CiroSantilli巴拿馬文件六四事件法轮功 what version of GNU grep did you test against?
        – jrw32982
        Jun 8 '16 at 23:33






















      edited Apr 15 at 7:34
      Drakonoved

      answered Apr 12 '16 at 20:50
      Ciro Santilli 新疆改造中心 六四事件 法轮功

















      up vote
      6
      down vote













      One of my text files was suddenly being seen as binary by grep:



      $ file foo.txt
      foo.txt: ISO-8859 text


      The solution was to convert it to UTF-8 with iconv:



      iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
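A rough way to confirm that such a conversion helped is to pass the file through iconv from UTF-8 to UTF-8, which fails on bytes that are not valid UTF-8. The sketch below assumes an iconv that exits non-zero on invalid input (glibc's does) and builds its own sample file; the paths and the sample text are illustrative only.

```shell
#!/bin/sh
# Sketch: check UTF-8 validity before and after an ISO-8859-1 -> UTF-8 conversion.
tmpdir=$(mktemp -d)

# "Müller" in ISO-8859-1: the u-umlaut is the single byte 0xFC (octal 374).
printf 'M\374ller\n' > "$tmpdir/foo.txt"

# A UTF-8 to UTF-8 pass is used purely as a validity check here.
if iconv -f UTF-8 -t UTF-8 "$tmpdir/foo.txt" >/dev/null 2>&1
then before=valid; else before=invalid; fi

# The conversion itself writes to a new file, so the original is kept.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/foo.txt" > "$tmpdir/foo_new.txt"

if iconv -f UTF-8 -t UTF-8 "$tmpdir/foo_new.txt" >/dev/null 2>&1
then after=valid; else after=invalid; fi

# The converted file is one byte longer: 0xFC became the two bytes 0xC3 0xBC.
new_size=$(wc -c < "$tmpdir/foo_new.txt")
new_size=$((new_size))   # normalise any padding that wc adds

echo "before: $before, after: $after, converted size: $new_size bytes"
rm -rf "$tmpdir"
```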























      • 1




        This happened to me as well. In particular, the cause was an ISO-8859-1-encoded non-breaking space, which I had to replace with a regular space in order to get grep to search in the file.
        – Gallaecio
        Jun 9 '15 at 13:50






      • 4




        grep 2.21 treats ISO-8859 text files as if they are binary; add export LC_ALL=C before the grep command.
        – netawater
        Aug 17 '15 at 2:52











      • @netawater Thanks! This is e.g. the case if you have something like Müller in a text-file. That's 0xFC hexadecimal, so outside the range grep would expect for utf8 (up to 0x7F). Check with printf 'a\x7F' | grep 'a' as Ciro describes above.
        – Anne van Rossum
        Nov 26 '16 at 16:51






















      edited Jun 1 '15 at 20:24
      kenorb

      answered Dec 8 '14 at 21:30
      zzapper

















      up vote
      5
      down vote













      The file /etc/magic or /usr/share/misc/magic has a list of sequences that the command file uses for determining the file type.



      Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.



      grep on Linux has some options to handle binary files like --binary-files or -U / --binary
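As a sketch of how the --binary-files option can be used, assuming GNU grep: -a (--text) is equivalent to --binary-files=text and prints matching lines even when the file contains a NUL, while -I is equivalent to --binary-files=without-match and skips such files. The sample file below is created only for the demonstration.

```shell
#!/bin/sh
# Sketch, assuming GNU grep: different treatments of a file that contains a NUL.
tmpfile=$(mktemp)
printf 'foo\0bar\nbaz\n' > "$tmpfile"

# -a / --text: search and print the file as if it were text.
# The NUL is stripped afterwards only so the line can be stored in a variable.
as_text=$(grep -a 'foo' "$tmpfile" | tr -d '\000')

# Long spelling of the same behaviour.
as_text_long=$(grep --binary-files=text 'foo' "$tmpfile" | tr -d '\000')

# -I / --binary-files=without-match: a file judged binary never matches.
if grep -I 'foo' "$tmpfile" >/dev/null; then skipped=no; else skipped=yes; fi

echo "with -a:                  $as_text"
echo "with --binary-files=text: $as_text_long"
echo "skipped by -I:            $skipped"
rm -f "$tmpfile"
```

Printing binary data with -a can send control bytes to the terminal, so piping the output through a filter or a pager is usually the safer choice for unknown files.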




























      • More precisely, encoding error according to C99's mbrlen(). Example and source interpretation at: unix.stackexchange.com/a/276028/32558
        – Ciro Santilli 新疆改造中心 六四事件 法轮功
        Apr 12 '16 at 20:51






















      edited Feb 9 '15 at 11:49
      fduff

      answered Sep 1 '11 at 13:27
      klapaucius





















      up vote
      2
      down vote













      One of my students had this problem. There is a bug in grep in Cygwin: if the file has non-ASCII characters, grep and egrep see it as binary.




























      • That sounds like a feature, not a bug. Especially given there is a command-line option to control it (-a / --text)
        – Will Sheppard
        Jan 29 at 11:39























      edited Sep 10 '15 at 11:14
      Tejas

      answered Sep 10 '15 at 9:31
      Joan Pontius






















      up vote
      2
      down vote













      Actually answering the question "What makes grep consider a file to be binary?", you can use iconv:



      $ iconv < myfile.java
      iconv: (stdin):267:70: cannot convert


      In my case there were Spanish characters that showed up correctly in text editors but that grep considered binary; the iconv output pointed me to the line and column numbers of those characters.



      NUL characters, however, are considered normal by iconv and do not produce that kind of output, so this method is not suitable for finding them.
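For the NUL case, one possible approach is to count the NUL bytes with tr and to list the affected lines by first mapping each NUL to another control byte that grep can search for. This is only a sketch: it assumes the file does not already contain the byte 0x01 (SOH), and the sample file is made up for the demonstration.

```shell
#!/bin/sh
# Sketch: count NUL bytes and report which lines contain them.
tmpfile=$(mktemp)
printf 'one\ntw\0o\nthree\nfo\0ur\0\n' > "$tmpfile"

# Count NULs: delete every byte except NUL, then count what is left.
nul_count=$(tr -cd '\000' < "$tmpfile" | wc -c)
nul_count=$((nul_count))   # normalise any padding that wc adds

# Line numbers: map NUL to SOH (0x01), then search for SOH.
# Assumption: the input contains no SOH bytes of its own.
soh=$(printf '\001')
nul_lines=$(tr '\000' '\001' < "$tmpfile" \
    | LC_ALL=C grep -a -n "$soh" \
    | cut -d: -f1 \
    | tr '\n' ' ')

echo "NUL bytes: $nul_count"
echo "lines containing NUL: $nul_lines"
rm -f "$tmpfile"
```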








































          edited Apr 14 '16 at 16:49

          answered May 20 '15 at 15:12
          golimar




















              up vote
              1
              down vote













              I had the same problem. I used vi -b [filename] to see the added characters. I found the control characters ^@ and ^M. Then in vi type :1,$s/^@//g to remove the ^@ characters. Repeat this command for ^M.



              Warning: ^@ and ^M here are single control characters (shown in blue), not a caret followed by a letter. To enter them, press Ctrl+V and then Ctrl+@ or Ctrl+M. Then save and exit vi.
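The same cleanup can be sketched non-interactively with tr, which deletes every NUL (^@) and carriage return (^M) in one pass. The file names below are placeholders and the sample content is invented. If the NULs come from UTF-16 text rather than stray padding, converting the encoding is probably the better fix, since deleting bytes only happens to work for plain ASCII content.

```shell
#!/bin/sh
# Sketch: strip NUL (^@) and CR (^M) bytes without opening an editor.
tmpdir=$(mktemp -d)
printf 'first\0 line\r\nsecond line\r\n' > "$tmpdir/dump.txt"

# Write to a new file so the original dump is left untouched.
tr -d '\000\r' < "$tmpdir/dump.txt" > "$tmpdir/dump_clean.txt"

# Verify: no NUL or CR bytes remain, and grep now counts both lines.
remaining=$(tr -cd '\000\r' < "$tmpdir/dump_clean.txt" | wc -c)
remaining=$((remaining))   # normalise any padding that wc adds
matches=$(grep -c 'line' "$tmpdir/dump_clean.txt")

echo "control bytes left: $remaining"
echo "matching lines:     $matches"
rm -rf "$tmpdir"
```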








































                  edited Jun 1 '15 at 20:29
                  kenorb

                  answered Apr 3 '15 at 18:58
                  Not Sure














