Should the shell read a script one character at a time?

While reading a script, the shell may read it from a file, from a pipe, or possibly from some other source (stdin?). Under some corner conditions the input may not be seekable (there is no way to rewind the file position to an earlier point).



It has been said that read reads stdin one byte at a time until it finds an unescaped newline character.
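
For example, the effect seems easy to observe with something like this (a minimal sketch; sh stands for whichever shell is being examined):

printf 'first line\nsecond line\n' | sh -c 'read a; echo "read got: $a"; cat'

Here read should consume only up to the first newline, so cat still sees the second line.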



Should the shell also read one character at a time from its script input?

I mean the script itself, not an additional data file that it might read.



If so: why is that needed? Is it defined in some spec?



Do all shells work similarly? Which ones do not?










      shell-script shell read






      asked Jan 8 at 7:25 by Isaac (edited Jan 8 at 11:25)




















          2 Answers

































          the shell will read from the script file or from a device descriptor




          Or from a pipe, which is probably the easiest way to get a non-seekable input fd.




          Should the shell also read one character at a time from its script input?




          If it wants to support scripts that run commands which read from stdin and expect those commands to get their input from lines of the script itself.



          As in something like this:



          $ cat foo.sh
          #!/bin/sh
          line | sed -e 's/^/* /'
          xxx
          echo "end."

          $ cat foo.sh | bash
          * xxx
          end.


          The line command reads a single line from standard input (the xxx line), and the shell reads the other lines as commands. For this to work, line also needs to take care not to read the input too far, as otherwise the shell would not see the following lines. With GNU utilities, head -n1 would read too much, as would e.g. sed. The line utility from util-linux takes care to read one byte at a time so as not to read past the newline.
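
          To illustrate that technique, here is a rough sketch of a byte-at-a-time line reader in plain sh using dd (the name read_one_line is just for illustration; this is not the actual util-linux implementation, and it is slow and cannot handle NUL bytes, but it shows the idea):

          nl=$(printf '\nx'); nl=${nl%x}    # a variable holding a literal newline
          read_one_line() {
            line=
            while c=$(dd bs=1 count=1 2>/dev/null; echo x) && c=${c%x} && [ -n "$c" ]; do
              [ "$c" = "$nl" ] && break     # stop exactly at the newline, never read past it
              line=$line$c
            done
            printf '%s\n' "$line"
          }

          # e.g. printf 'first\nsecond\n' | { read_one_line; cat; }
          # should print "first" and leave "second" on stdin for cat.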



          The above script doesn't work with e.g. dash, as dash reads the script a full block at a time:



          $ cat foo.sh | dash
          *
          dash: 3: xxx: not found
          end.


          Dash and Busybox read full blocks; the others I tested (Bash, Ksh, mksh and Zsh) read byte-by-byte.



          Note that that's a rather convoluted script, and it doesn't work properly if run as, e.g., bash foo.sh, since in that case stdin doesn't point to the script itself, and the xxx line would be taken as a command. It would probably be better to use a here-doc if one wants to include data within the script itself. This works with any shell, whether run as sh bar.sh, sh < bar.sh or cat bar.sh | sh:



          $ cat bar.sh
          #!/bin/sh
          sed -e 's/^/* /' <<EOF
          xxx
          EOF
          echo "end."
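
          For reference, the here-doc version should give the same output as the foo.sh example above, whichever way it is invoked (a quick check, not specific to any shell):

          $ sh bar.sh
          * xxx
          end.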





          answered Jan 8 at 9:11 by ilkkachu























          • @Isaac, yes. I was trying to say that you make a distinction between a file and "device descriptor" in the first sentence of your question. I can't see what you're aiming at with that distinction, especially with stdin as an example of the second group. If you did something like sh < a.sh, then the stdin of the script would be a file instead.

            – ilkkachu
            Jan 8 at 11:16











          • Ah, just that a file in the filesystem is usually seekable, while a file descriptor (like stdin 0,1,2,...) is usually not. Exactly your description of what happens with the pipe, but in different words; hope that makes sense.

            – Isaac
            Jan 8 at 11:21











          • @Isaac, Hmm. Bash still reads byte-by-byte if we do something like mkfifo p; cat foo.sh > p & bash p. That seems a bit overly careful: I can't really see there would be a chance for commands in the script to read the same pipe in that case (they can't really get the fd directly, Bash seems to open it at number 255, the script would need to know that). None of the other shells seem to do a byte-by-byte read in that case.

            – ilkkachu
            Jan 8 at 12:31











          • It is a POSIX requirement; read my (added) answer. (Any comments?)

            – Isaac
            Jan 9 at 2:07
































          OK, I contacted the bash developer, and he had this to say:




          POSIX requires it for scripts that are read from stdin. When reading from a script given as an argument, bash reads blocks.




          And, indeed, the POSIX spec says this (emphasis mine):




          When the shell is using standard input and it invokes a command that also uses standard input, the shell shall ensure that the standard input file pointer points directly after the command it has read when the command begins execution. It shall not read ahead in such a manner that any characters intended to be read by the invoked command are consumed by the shell (whether interpreted by the shell or not) or that characters that are not read by the invoked command are not seen by the shell.




          That is: for a script read from stdin, the shell shall read one character at a time.



          In the C locale, one character is one byte.



          It seems that posh, mksh, lksh, attsh, yash, ksh, zsh and bash conform to this requirement.



          However, ash (busybox sh) and dash do not.
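
          A quick way to probe a particular shell is to feed it, on stdin, a script whose first command itself reads from stdin (a rough sketch along the lines of the other answer's test; a byte-at-a-time shell should let the read builtin see the data line, while a block-reading shell typically prints an empty value and then complains that data is not a command):

          $ printf 'read x; echo "got: $x"\ndata\necho end\n' | bash
          got: data
          end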






          answered Jan 9 at 2:04 by Isaac (edited Jan 18 at 16:31)

























          • @ilkkachu (1) The length of a character is generally related to the encoding. A UTF-16 encoded file has 2-byte characters at odd-even offsets (except for the surrogate range needed for everything outside the BMP, which anyway uses two odd-even pairs). Assuming only BMP characters, a UTF-16 encoded file could be read by getting pairs of bytes. A UTF-32 file uses four-byte characters. A (correctly) UTF-32 encoded file could be read by getting four bytes each time. So, no, "in general" not all files need to be read one byte at a time (if the encoding is correct and known). (Cont...)

            – Isaac
            Jan 18 at 17:17











          • @ilkkachu (2) However, a UTF-16 (or the simpler UCS-2) encoding will generate an invalid Unix text file. A Unix text file cannot contain NUL bytes, which, for UTF-16, is a common first byte value. It is the existence of these fixed-size encodings that breaks the assumption of the need to read one byte first. (Cont.)

            – Isaac
            Jan 18 at 17:20











          • @ilkkachu (3) For UTF-8, only the first byte carries the information of how many bytes follow. Its first bits will be 1s followed by a 0, and the number of 1s indicates the number of bytes that follow; all the other bytes start with 10. Therefore, the number of bytes to read to synchronize to the encoding is variable: for current-day UTF-8 rules, no more than three (3) bytes. Older rules allowed up to 5 following bytes.

            – Isaac
            Jan 18 at 17:27












          • @ilkkachu That is assuming you know which is the first encoded byte. The point is that reading at a random file offset might require reading several bytes (one byte at a time) to get back in sync and find the "first (encoded) byte".

            – Isaac
            Jan 18 at 17:39











          • @ilkkachu If any read sequence is found to be invalid, it could be rejected (right there and then), but generally it isn't. Valid C char values (any byte value) could generate any sequence of bytes. A shell must be able to deal with such sequences, even if they are invalid in some particular locale encoding. Try, for example: printf 'echo one\xea\xd5two' | sh -s.

            – Isaac
            Jan 18 at 17:48










