Split out individual characters using the null string
























I read this in the Gawk manual:




GNU EXTENSIONS



[...]



The ability to split out individual characters using the null string as the
value of FS, and as the third argument to split().




However this seems to not be the case. This works as expected:



$ gawk 'BEGIN { print split("quebec", z, "") }'
6


and I can disable other extensions:



$ export POSIXLY_CORRECT
$ gawk 'BEGIN { typeof(1) }'
gawk: cmd. line:1: fatal: function `typeof' not defined


but I cannot disable the split behavior:



$ export POSIXLY_CORRECT
$ gawk 'BEGIN { print split("quebec", z, "") }'
6

$ gawk --posix 'BEGIN { print split("quebec", z, "") }'
6


I also looked at the Mawk manual:




If FS = "", then mawk breaks the record into individual characters, and,
similarly, split(s,A,"") places the individual characters of s into A.



[...]



Posix explicitly leaves the behavior of FS = "" undefined, and mentions
splitting the record into characters as a possible interpretation, but
currently this use is not portable across implementations.




So, with what implementations can you not get single characters with FS and
split?

























  • 1




    I suppose you're asking which implementations do something else, or if there's some other reason it's left unspecified in POSIX? I'm a bit confused as to why it should be possible to change/disable that behaviour. If you want to make a script that only relies on POSIX features, then you just don't call split with an empty separator so it doesn't matter what your implementation actually does.
    – ilkkachu
    Jan 14 at 20:35






  • 1




    @isaac, typeof() is a recent addition to gawk.
    – Stéphane Chazelas
    Jan 15 at 17:28














edited Jan 14 at 21:38

























asked Jan 14 at 20:05









Steven Penny











1 Answer



accepted










That's not POSIX in the sense that you can't rely on it in POSIX scripts: POSIX leaves the behaviour unspecified. That means that while an application (a script) can't use it if it wants to be portable, an implementation (an awk implementation) can do whatever it wants when a script does use it, and still be POSIX-compliant. POSIX does not require awk to split into characters or bytes, or report an error, or reboot the computer; it leaves it unspecified.



So gawk has no reason to change its behaviour in that regard when $POSIXLY_CORRECT is in the environment¹: there is no behaviour that is more POSIXly correct than another in that instance.
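If a script has to stay portable, one way to avoid the unspecified empty separator altogether is to walk the string with substr(). The following is only a sketch; chars_of is a hypothetical helper name, and passing the string with -v means awk will interpret backslash escapes in it.

```shell
# Portable sketch: list the characters of a string without using an
# empty FS or an empty third argument to split().
# chars_of is a hypothetical helper name, not something awk provides.
chars_of() {
    # Note: -v processes backslash escapes in the assigned value.
    awk -v s="$1" 'BEGIN {
        n = length(s)
        for (i = 1; i <= n; i++)
            print substr(s, i, 1)   # one character (or byte) per line
    }'
}

chars_of quebec
```

For "quebec" this prints the six letters, one per line. Whether substr() counts characters or bytes for non-ASCII input still depends on the awk implementation and the locale.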



As you found out, that extension is found in gawk (since 3.0, January 1996) and mawk (since version 1.2, January 1996). It's also in busybox awk (from the start, in 2002) and, since May 1996, in the one maintained by Brian Kernighan (the k in awk); its FIXES file refers to gawk, etc. as inspiration. It looks like it was added to all three within a few months, suggesting it may have been discussed among their maintainers. I'm not so sure now who got the idea first.



With Brian Kernighan's awk or the ones based on it like on FreeBSD or OpenBSD, note that while an empty FS or an empty third argument passed to split() causes the string to be split into its individual characters (well, bytes, see below), awk -F '' returns an error (awk -v FS= is OK though).



On Solaris, with both nawk and /usr/xpg4/bin/awk (and also the old /bin/awk from the 70s), an empty FS seems to disable splitting altogether, and nawk -F '' returns an error. I'd expect it to be the same on other commercial Unices based on AT&T code, like AIX or HP-UX, though I cannot test it there.
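To see which of these behaviours a given awk has, a quick probe along these lines may help. It is a rough sketch: probe_empty_fs is a made-up helper name, and it only distinguishes the cases described above by looking at NF for a three-letter ASCII record.

```shell
# Rough probe of what an awk does with an empty FS.
# probe_empty_fs is a hypothetical helper; pass the awk binary to test
# as the first argument (defaults to "awk").
probe_empty_fs() {
    nf=$(printf 'abc\n' | "${1:-awk}" -v FS= '{ print NF }' 2>/dev/null)
    case $nf in
        3) echo "splits into characters" ;;
        1) echo "no splitting" ;;
        *) echo "other: ${nf:-error}" ;;
    esac
}

probe_empty_fs
```

It uses -v FS= rather than -F '' because, as noted above, some implementations reject the latter.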



Also note that mawk, bwk's awk (though some implementations based on it differ) and busybox awk don't support multibyte characters. So for instance, in UTF-8:



echo Stéphane | awk -v FS= '{print $4}'


would print the second half of the third character in my first name. So with those, it's more correct to say that an empty FS splits into individual bytes, not characters.
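A way to check whether a particular awk splits on bytes or characters is to count the fields of a short record containing one two-byte UTF-8 character. This is a sketch under the assumption that the awk splits on an empty FS at all; count_fields is a hypothetical name, and the octal escapes spell "aé" in UTF-8 so the test does not depend on the script's own encoding.

```shell
# Count the fields an empty FS yields for "aé" encoded as UTF-8
# (3 bytes, 2 characters). count_fields is a hypothetical helper.
# 2 suggests character-wise splitting, 3 byte-wise splitting,
# 1 that an empty FS does not split at all.
count_fields() {
    printf 'a\303\251\n' | "${1:-awk}" -v FS= '{ print NF }'
}

count_fields
```

The result depends on both the implementation and the current locale, so the same awk can give different answers under LC_ALL=C and a UTF-8 locale.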




¹ I realise now that with POSIXLY_CORRECT or --posix, gawk disables some extensions that otherwise don't conflict with POSIX (typeof does make gawk non-compliant though), so you could say it's an omission. It would not be the first: for instance, gawk does not disable nextfile even though it does conflict with POSIX (awk '{nextfile = 1}' is meant to assign 1 to the nextfile variable, but reports an error in gawk even under POSIXLY_CORRECT).




























  • are you sure about BSD? I just checked FreeBSD and DragonFly BSD, which both use BWK awk. running awk 'BEGIN{print split("hello",x,"")}' gives me 5 with both
    – Steven Penny
    Mar 21 at 2:47










  • @StevenPenny, looking at my shell history there, it looks like I tested awk -F '' (which doesn't work) but not split(x, y, "") (which seems to work indeed). I'll update the answer.
    – Stéphane Chazelas
    Mar 21 at 8:31











edited Mar 22 at 10:58

























answered Jan 14 at 22:00









Stéphane Chazelas
