Split out individual characters using the null string
I read this in the Gawk manual:
GNU EXTENSIONS
[...]
The ability to split out individual characters using the null string as the
value of FS, and as the third argument to split().
However, this does not seem to be the case. This works as expected:
$ gawk 'BEGIN {print split("quebec", z, "")}'
6
and I can disable other extensions:
$ export POSIXLY_CORRECT
$ gawk 'BEGIN {typeof(1)}'
gawk: cmd. line:1: fatal: function `typeof' not defined
but I cannot disable the split behavior:
$ export POSIXLY_CORRECT
$ gawk 'BEGIN {print split("quebec", z, "")}'
6
$ gawk --posix 'BEGIN {print split("quebec", z, "")}'
6
I also looked at the Mawk manual:
If FS = "", then mawk breaks the record into individual characters, and,
similarly, split(s,A,"") places the individual characters of s into A.
[...]
Posix explicitly leaves the behavior of FS = "" undefined, and mentions
splitting the record into characters as a possible interpretation, but
currently this use is not portable across implementations.
So, with which implementations can you not get single characters with FS and split?
awk posix gawk mawk
edited Jan 14 at 21:38
asked Jan 14 at 20:05
Steven Penny
I suppose you're asking which implementations do something else, or whether there's some other reason it's left unspecified in POSIX? I'm a bit confused as to why it should be possible to change or disable that behaviour. If you want to write a script that relies only on POSIX features, then you just don't call split with an empty separator, so it doesn't matter what your implementation actually does.
– ilkkachu, Jan 14 at 20:35
@isaac, typeof() is a recent addition to gawk.
– Stéphane Chazelas, Jan 15 at 17:28
1 Answer
That's not POSIX in the sense that you can't use it in POSIX scripts, because POSIX leaves the behaviour unspecified. That means that while an application (a script) can't use it if it wants to be portable, an implementation (an awk implementation) can do whatever it wants when you do, and still be POSIX-compliant. POSIX does not require awk to split into characters or bytes, or report an error, or reboot the computer; it leaves it unspecified.
So gawk has no reason to change its behaviour in that regard when $POSIXLY_CORRECT is in the environment¹: no behaviour is more POSIXly correct than another in that instance.
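For a script that has to stay portable, one way around the unspecified case is to not pass an empty separator at all. The following is only a minimal sketch: chars is a hypothetical helper name, and it walks the string with the standard length() and substr() functions instead of split(s, a, ""). On an awk without multibyte support the units it prints are bytes rather than characters.

```shell
# Hypothetical helper (name is an assumption): print each character
# of the first argument on its own line, without an empty separator.
chars() {
    # ARGV[1] is read inside BEGIN, so awk never treats it as a file name.
    awk 'BEGIN {
        s = ARGV[1]
        n = length(s)
        for (i = 1; i <= n; i++)
            print substr(s, i, 1)
    }' "$1"
}

chars quebec
```

The string is passed as an argument rather than with -v because -v assignments undergo escape-sequence processing, which would alter strings containing backslashes.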
As you found out, that extension is found in gawk (since 3.0, January 1996) and mawk (since version 1.2, January 1996). It's also in busybox awk (from the start, in 2002), and since May 1996 also in the one maintained by Brian Kernighan (the k in awk); its FIXES file refers to gawk, etc. as inspiration. It looks like it was added to all three within a few months, suggesting it may have been discussed among their maintainers. I'm not sure now who got the idea first.
With Brian Kernighan's awk, or the ones based on it such as those on FreeBSD or OpenBSD, note that while an empty FS or an empty third argument passed to split() causes the string to be split into its individual characters (well, bytes, see below), awk -F '' returns an error (awk -v FS= is OK though).
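Since the result differs between implementations, it can be checked directly. This is a rough probe under stated assumptions: probe_empty_fs is a hypothetical name, and it uses the awk -v FS= form because that is the one reported above to be accepted by Brian Kernighan's awk. It only distinguishes the per-character and no-splitting behaviours described in this answer.

```shell
# Hypothetical probe (name is an assumption): report how the awk found
# in $PATH treats an empty FS on the three-letter record "abc".
probe_empty_fs() {
    printf 'abc\n' | awk -v FS= '{
        if (NF == 3)      print "chars"    # one field per character
        else if (NF == 1) print "unsplit"  # record left as a single field
        else              print "other"
    }'
}

probe_empty_fs
```

An awk that rejects the assignment outright prints nothing on standard output, so an empty result is also informative.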
On Solaris, with both nawk and /usr/xpg4/bin/awk (and also the old /bin/awk from the 70s), an empty FS seems to disable splitting altogether. nawk -F '' returns an error. I'd expect the same on other commercial Unices based on AT&T code such as AIX or HP/UX, though I cannot test it there.
Also note that mawk, bwk's awk (though some awks based on it differ) and busybox awk don't support multibyte characters. So for instance, in UTF-8:
echo Stéphane | awk -v FS= '{print $4}'
would print the second half of the third character in my first name. So with those, it's more correct to say that an empty FS splits into individual bytes, not characters.
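To see which unit a given awk uses, the field count for a string with one two-byte character is enough. This is a sketch with a hypothetical helper name, probe_fs_units. The octal escapes \303\251 are the UTF-8 encoding of é, so the record is 8 characters but 9 bytes. The answer depends on both the implementation and the current locale, so no particular output is claimed here.

```shell
# Hypothetical probe (name is an assumption): does an empty FS split
# the record into bytes or into characters?
probe_fs_units() {
    # "St\303\251phane" is "Stéphane" in UTF-8: 8 characters, 9 bytes.
    printf 'St\303\251phane\n' | awk -v FS= '{
        if (NF == 9)      print "bytes"
        else if (NF == 8) print "characters"
        else              print "other: " NF
    }'
}

probe_fs_units
```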
¹ I realise now that with POSIXLY_CORRECT or --posix, gawk disables some extensions that otherwise don't conflict with POSIX (typeof does make gawk non-compliant though), so you could say it's an omission. It would not be the first, though. For instance, gawk does not disable nextfile even though it does conflict with POSIX (awk 'nextfile = 1' is meant to assign 1 to the nextfile variable but reports an error in gawk even under POSIXLY_CORRECT).
edited Mar 22 at 10:58
answered Jan 14 at 22:00
Stéphane Chazelas
Are you sure about BSD? I just checked FreeBSD and DragonFly BSD, which both use BWK awk. Running awk 'BEGIN{print split("hello",x,"")}' gives me 5 with both.
– Steven Penny, Mar 21 at 2:47
@StevenPenny, looking at my shell history there, it looks like I tested awk -F '' (which doesn't work) but not split(x, y, "") (which does seem to work). I'll update the answer.
– Stéphane Chazelas, Mar 21 at 8:31