How do I create a text file (1 gigabyte) containing random characters with UTF-8 character encoding?
The following command does not produce valid UTF-8:
head -c 1M </dev/urandom >myfile.txt
Tags: files, unicode, text, random
As long as each char is <= 0x7F, then you have a UTF-8 character. – Alastair McCormack, Nov 26 '15 at 10:16

After executing head -c 1M </dev/urandom >myfile.txt, if I open myfile.txt with gedit it says that there are problems with the UTF-8 character encoding. – Message Passing, Nov 26 '15 at 10:29

Sorry, I meant that to get valid UTF-8, you must only have single bytes of <= 0x7F or build valid multi-byte UTF-8 sequences. The former is obviously easier from a random perspective. – Alastair McCormack, Nov 26 '15 at 10:39

Do you want it covering \u0 to \U7fffffff, or just \u0..\ud7ff + \ue000..\u10ffff, or only those that are currently specified in the latest Unicode spec? – Stéphane Chazelas, Nov 26 '15 at 11:44
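A minimal sketch of that ASCII-only idea (my addition, not from the thread; it assumes GNU tr and head, and runs tr in the C locale so it works on bytes): tr -dc deletes every byte not in the given set, leaving printable ASCII plus newlines, which is valid UTF-8 by definition.

LC_ALL=C tr -dc '[:print:]\n' < /dev/urandom | head -c 1G > myfile.txt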
asked Nov 26 '15 at 9:52 by Message Passing; edited Aug 26 at 14:21 by Jeff Schaller
4 Answers
If you want UTF-8 encodings of code points 0 to 0x7FFFFFFF (which the UTF-8 encoding algorithm was originally designed to work on):

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \4}  # read fixed 4-byte records
  no warnings "utf8";
  print chr(unpack("L>",$_) & 0x7fffffff)'

Nowadays, Unicode is restricted to 0..D7FF, E000..10FFFF (though some of those code points are not assigned, and some never will be (they are defined as non-characters)).

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}  # read fixed 3-byte records
  no warnings "utf8";
  # pad to 4 bytes, then scale 0..2^24-1 onto the 0x10f800 valid
  # scalar values, skipping the D800..DFFF surrogate range
  $c = unpack("L>","\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  print chr($c)'

If you only want assigned characters, you can pipe that to:

uconv -x '[:unassigned:]>;'

Or change that to:

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}
  no warnings "utf8";
  $c = unpack("L>","\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  $c = chr $c;
  print $c if $c =~ /\P{Unassigned}/'

You may prefer:

print $c if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/

to get only graphical and spacing characters (excluding those from the private-use areas).

Now, to get 1 GiB of that, you can pipe it to head -c1G (assuming GNU head), but beware that the last character may be cut in the middle.

edited Nov 26 '15 at 15:25; answered Nov 26 '15 at 14:44 by Stéphane Chazelas (4 votes)
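One way to guard against that torn final character (a hedged sketch of mine, not from the original answer; it assumes GNU head and an iconv that supports -c, which omits invalid sequences from the output) is to re-validate the stream after truncating, so a multi-byte sequence cut by head is dropped rather than left dangling:

... | head -c1G | iconv -f UTF-8 -t UTF-8 -c > myfile.txt

The result may then fall a few bytes short of 1 GiB, but everything in it is well-formed UTF-8.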
The most efficient way to create a 10 MB text file with UTF-8 character encoding is:

base64 /dev/urandom | head -c 10000000 | egrep -ao "\w" | tr -d '\n' > file10MB.txt

answered Nov 26 '15 at 13:41 by Message Passing (2 votes)

Is the egrep and tr necessary? – Alastair McCormack, Nov 26 '15 at 13:50

That doesn't produce a 10 MB file. Also, you either need to remove base64 or add a base64 -d somewhere; what you have now only outputs characters used by the Base64 encoding. – Gilles, Nov 26 '15 at 23:23
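Following up on Gilles' size objection, a hedged variant (my sketch, assuming GNU coreutils): count the bytes only after all filtering, so the file really is 10,000,000 bytes; the Base64 alphabet is pure ASCII and therefore valid UTF-8.

base64 /dev/urandom | tr -d '\n' | head -c 10000000 > file10MB.txt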
Grep for ASCII (a subset of UTF-8) chars, on Linux/GNU:

dd if=/dev/random bs=1 count=1G | egrep -ao "\w" | tr -d '\n'

answered Nov 26 '15 at 10:34 by Alastair McCormack

base64 might be considerably more efficient. – muru, Nov 26 '15 at 11:08

haha, doh! That's what happens if you over-think the question :) You should offer that as an answer :) – Alastair McCormack, Nov 26 '15 at 11:10

I'm looking for some way to get as many UTF-8 characters as I can. If I do, I'll post that as an answer. :) – muru, Nov 26 '15 at 11:11

dd if=/dev/random bs=1 count=10M | egrep -ao "\w" | tr -d '\n' > file10MB.txt gets stuck. It is too slow. – Message Passing, Nov 26 '15 at 13:27

@muru you really need to use urandom here; random will get stuck really fast because it needs entropy. This is not fast, but it will get the job done: dd if=/dev/urandom bs=1 count=1G | egrep -ao "\w" | tr -d '\n' – theduke, Jan 8 '17 at 11:23
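A hedged aside on the slowness (mine, not from the thread): bs=1 makes dd issue a separate read() per byte. Letting head do the byte counting over large reads (assuming GNU coreutils) is much faster; as with the original pipeline, the filtered output is smaller than the number of bytes read.

head -c 1G /dev/urandom | egrep -ao "\w" | tr -d '\n' > file1G.txt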
If you want non-ASCII characters, then you'll need a way to build valid UTF-8 sequences. The chance of two consecutive random bytes forming a valid UTF-8 sequence is very low.

Instead, this Python script creates random 8-bit values that can be converted into Unicode chars, then written out as UTF-8:

import random
import io

char_count = 0
# Python 2 script (uses unichr); on Python 3, use chr() instead
with io.open("random-utf8.txt", "w", encoding="utf-8") as my_file:
    while char_count <= 1000000 * 1024:
        rand_long = random.getrandbits(8)
        # Skip C0 controls and space (<= 0x20), DEL (0x7F), and C1 controls (0x80-0x9F)
        if rand_long <= 32 or (0x7F <= rand_long <= 0x9F):
            continue
        unicode_char = unichr(rand_long)
        my_file.write(unicode_char)
        char_count += 1

You could also change it to use a random 16-bit number, which would yield non-Latin values.

It's not fast, but fairly accurate.

answered Nov 26 '15 at 13:47 by Alastair McCormack