Should the shell read (a script) one character at a time?

While reading a script, the shell reads it from a file, from a pipe, or possibly from some other source (stdin?). In some corner cases the input may not be seekable (there is no way to rewind the file position to an earlier point).

It has been said that read reads stdin one byte at a time until it finds an unescaped newline character.
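
That claim about read can be checked with a small experiment. The following is a minimal sketch, assuming a POSIX-like sh; the variable names first and rest are only illustrative. Three lines are written into a pipe (a non-seekable input), read takes the first one, and cat gets whatever read left behind:

```shell
#!/bin/sh
# The producer writes three lines into a pipe, which cannot be rewound.
# `read` consumes exactly the first line, so `cat` still sees the rest.
rest=$(printf 'first\nsecond\nthird\n' | { IFS= read -r first; cat; })

# Show what was left in the pipe after `read` returned.
printf '%s\n' "$rest"
# prints:
# second
# third
```

If read had pulled a whole block from the pipe, cat would have found nothing left to print.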

Should the shell also read one character at a time from its script input?

I mean the script, not an additional data text file that could be used.

If so: why is that needed? Is it defined in some spec?

Do all shells work similarly? Which ones do not?

edited Jan 8 at 11:25

asked Jan 8 at 7:25

Isaac

shell-script shell read


2 Answers

the shell will read from the script file or from a device descriptor

Or from a pipe, which is probably the easiest way to get a non-seekable input fd.

Should the shell also read one character at a time from its script input?

It has to, if it wants to support scripts that run commands that read from stdin and expect to get their input from lines of the script itself.

As in something like this:

$ cat foo.sh
#!/bin/sh
line | sed -e 's/^/* /'
xxx
echo "end."

$ cat foo.sh | bash
* xxx
end.

The line command reads a single line from standard input (the xxx line), and the shell reads the other lines as commands. For this to work, line also needs to take care not to read too far into the input, as otherwise the shell would not see the following lines. With GNU utilities, head -n1 would read too much, as would e.g. sed. The line utility from util-linux takes care to read one byte at a time so that it does not read past the newline.

The foo.sh script above doesn't work with e.g. dash, as it reads the script in full blocks:

$ cat foo.sh | dash
* 
dash: 3: xxx: not found
end.

Dash and Busybox read full blocks, the others I tested (Bash, Ksh, mksh and Zsh) read byte-by-byte.

Note that foo.sh is a rather convoluted script, and it doesn't work properly if run as, e.g., bash foo.sh, since in that case stdin doesn't point to the script itself, and the xxx line would be taken as a command. It would probably be better to use a here-doc if one wants to include data within the script itself. This works with any shell, when run as sh bar.sh, sh < bar.sh or cat bar.sh | sh:

$ cat bar.sh
#!/bin/sh
sed -e 's/^/* /' <<EOF
xxx
EOF
echo "end."
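
The line utility is not installed on every system, so here is a sketch of the same foo.sh experiment with the read builtin standing in for it. This is an illustration under assumptions, not the original test: it assumes bash is installed, and the names script, l and out are made up for the example.

```shell
#!/bin/sh
# Hypothetical stand-in for foo.sh: the `read` builtin replaces `line`.
# Line 1 is a command, line 2 is data meant for that command,
# line 3 is the next command for the shell.
script='{ IFS= read -r l; printf "%s\n" "$l"; } | sed -e "s/^/* /"
xxx
echo "end."
'

# Feed the script through a pipe so the shell's stdin is not seekable.
out=$(printf '%s' "$script" | bash)
printf '%s\n' "$out"
```

With a shell that reads its stdin script byte by byte, the output should match the Bash run above ("* xxx" followed by "end."). Replacing bash with dash in the pipeline should reproduce the failure instead.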

answered Jan 8 at 9:11

ilkkachu


@Isaac, yes. I was trying to say that you make a distinction between a file and "device descriptor" in the first sentence of your question. I can't see what you're aiming at with that distinction, especially with stdin as an example of the second group. If you did something like sh < a.sh, then the stdin of the script would be a file instead.

– ilkkachu
Jan 8 at 11:16

Ah, just that a file in the filesystem is usually seekable, while a file descriptor (like stdin; 0, 1, 2, ...) is usually not. Exactly your description of what happens with the pipe, but in different words. Hope that makes sense.

– Isaac
Jan 8 at 11:21

@Isaac, Hmm. Bash still reads byte-by-byte if we do something like mkfifo p; cat foo.sh > p & bash p. That seems a bit overly careful: I can't really see there would be a chance for commands in the script to read the same pipe in that case (they can't really get the fd directly, Bash seems to open it at number 255, the script would need to know that). None of the other shells seem to do a byte-by-byte read in that case.

– ilkkachu
Jan 8 at 12:31

It is a POSIX requirement, read my (added) answer. (any comment?).

– Isaac
Jan 9 at 2:07


OK, I contacted the Bash developer, and he had this to say:

POSIX requires it for scripts that are read from stdin. When reading from a script given as an argument, bash reads blocks.

And, indeed, the POSIX spec says this (emphasis mine):

When the shell is using standard input and it invokes a command that also uses standard input, the shell shall ensure that the standard input file pointer points directly after the command it has read when the command begins execution. It shall not read ahead in such a manner that any characters intended to be read by the invoked command are consumed by the shell (whether interpreted by the shell or not) or that characters that are not read by the invoked command are not seen by the shell.

That is: for a script read from stdin, the shell shall in effect read one character at a time.

In the C locale, one character is one byte.

It seems that posh, mksh, lksh, attsh, yash, ksh, zsh and bash conform to this requirement.

However, ash (busybox sh) and dash do not.
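
One rough way to check a given shell against the quoted requirement is to pipe it a two-line script whose first line is a command that reads stdin (cat) and whose second line is data for that command. The sketch below is only an illustration: check_shell and the list of shell names are assumptions made for the example, and shells that are not installed are skipped.

```shell
#!/bin/sh
# Two-line script: the first line is a command, the second is its input.
script='cat
DATA
'

# check_shell SHELL: print "conforms" if cat received the DATA line,
# "reads ahead" if the shell consumed it before cat could.
check_shell() {
    result=$(printf '%s' "$script" | "$1" 2>/dev/null)
    if [ "$result" = "DATA" ]; then
        echo "conforms"
    else
        echo "reads ahead"
    fi
}

# Try whichever of these shells happen to be installed.
for sh in bash dash ksh zsh mksh yash; do
    command -v "$sh" >/dev/null 2>&1 || continue
    printf '%s: %s\n' "$sh" "$(check_shell "$sh")"
done
```

Going by the lists above, bash should report "conforms" and dash "reads ahead". The script is kept tiny on purpose, so that a block-reading shell swallows the DATA line in its first read.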

edited Jan 18 at 16:31

answered Jan 9 at 2:04

Isaac


@ilkkachu (1) The length of a character is generally related to the encoding. A UTF-16 encoded file has 2-byte characters at odd-even offsets (except for the surrogate range needed for everything outside the BMP, which anyway uses two odd-even pairs). Assuming only BMP characters, a UTF-16 encoded file could be read by getting pairs of bytes. A UTF-32 file uses four-byte characters. A (correctly) UTF-32 encoded file could be read by getting four bytes each time. So, no, "in general" not all files need to be read one byte at a time (if the encoding is correct and known). (Cont...)

– Isaac
Jan 18 at 17:17

@ilkkachu (2) However, a UTF-16 (or the simpler UCS-2) encoding will generate an invalid Unix text file. A Unix text file cannot contain NUL bytes, which, for UTF-16, is a common first byte value. It is the existence of these (fixed-size) encodings that breaks the assumption of the need to read one byte first. (Cont.)

– Isaac
Jan 18 at 17:20

@ilkkachu (3) For UTF-8, only the first byte carries the information of how many bytes follow. Its first bits will be 1s followed by a 0. The number of leading 1s gives the total length of the sequence in bytes, so one fewer bytes follow. All the other bytes start with 10. Therefore, the number of bytes to read to synchronize to the encoding is variable: under current UTF-8 rules, no more than three (3) bytes. Older rules allowed up to 5 following bytes.

– Isaac
Jan 18 at 17:27

@ilkkachu That is assuming you know which is the first encoded byte. The point is that: reading at a random file offset might require reading several bytes (one-byte-at-a-time) to get in sync again and find the "first (encoded) byte".

– Isaac
Jan 18 at 17:39
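
The lead-byte rule described in these comments can be written down as a small function. This is a sketch for illustration only; utf8_following is a made-up name, it takes the byte as a decimal number, and it only looks at the bit pattern (a few lead bytes that match the pattern, such as 0xC0, are still not valid in current UTF-8).

```shell
#!/bin/sh
# utf8_following BYTE: given a byte value (decimal 0-255), print how many
# bytes follow it in a UTF-8 sequence, "continuation" for a 10xxxxxx byte,
# or "invalid" for patterns not used by current UTF-8.
utf8_following() {
    b=$1
    if   [ "$b" -lt 128 ]; then echo 0             # 0xxxxxxx: single byte
    elif [ "$b" -lt 192 ]; then echo continuation  # 10xxxxxx: not a lead byte
    elif [ "$b" -lt 224 ]; then echo 1             # 110xxxxx: one byte follows
    elif [ "$b" -lt 240 ]; then echo 2             # 1110xxxx: two bytes follow
    elif [ "$b" -lt 248 ]; then echo 3             # 11110xxx: three bytes follow
    else echo invalid                              # 11111xxx: older rules only
    fi
}

utf8_following 65    # 'A'  -> 0
utf8_following 195   # 0xC3 -> 1
utf8_following 226   # 0xE2 -> 2
utf8_following 169   # 0xA9 -> continuation
```

A reader that lands at a random offset would skip bytes classified as "continuation" until it reaches a lead byte, which is the resynchronization step mentioned above.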

@ilkkachu If any sequence read is found to be invalid, it could be rejected (right there and then), but generally it isn't. Valid C char values (any byte value) could generate any sequence of bytes. A shell must be able to deal with such sequences, even if they are invalid in some particular locale encoding. Try, for example: printf 'echo one\xea\xd5two' | sh -s.

– Isaac
Jan 18 at 17:48
