How do I create a text file (1 gigabyte) containing random characters with UTF-8 character encoding?
The following command does not produce valid UTF-8:
head -c 1M </dev/urandom >myfile.txt
Tags: files, unicode, text, random
As long as each char is <= 0x7F, then you have a UTF-8 character. – Alastair McCormack, Nov 26 '15 at 10:16

After executing head -c 1M </dev/urandom >myfile.txt, if I open myfile.txt with gedit it says that there are problems with the UTF-8 character encoding. – Message Passing, Nov 26 '15 at 10:29

Sorry, I meant that to get valid UTF-8, you must only have single bytes of <= 0x7F or build valid multi-byte UTF-8 sequences. The former is obviously easier from a random perspective. – Alastair McCormack, Nov 26 '15 at 10:39

Do you want it covering \u0 to \U7fffffff, or just \u0..\ud7ff + \ue000..\u10ffff, or only those that are currently specified in the latest Unicode spec? – Stéphane Chazelas, Nov 26 '15 at 11:44
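A minimal sketch of that ASCII-only idea (my addition, not from the thread; it assumes GNU tr and head, and runs tr in the C locale so it works on bytes): tr -dc deletes every byte not in the given set, leaving printable ASCII plus newlines, which is valid UTF-8 by definition.

LC_ALL=C tr -dc '[:print:]\n' < /dev/urandom | head -c 1G > myfile.txt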
asked Nov 26 '15 at 9:52 by Message Passing; edited Aug 26 at 14:21 by Jeff Schaller
4 Answers
If you want UTF-8 encodings of code points 0 to 0x7FFFFFFF (which the UTF-8 encoding algorithm was originally designed to work on):

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \4}  # read fixed 4-byte records
  no warnings "utf8";
  print chr(unpack("L>",$_) & 0x7fffffff)'

Nowadays, Unicode is restricted to 0..D7FF, E000..10FFFF (though some of those code points are not assigned, and some never will be (they are defined as non-characters)).

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}  # read fixed 3-byte records
  no warnings "utf8";
  # pad to 4 bytes, then scale 0..2^24-1 onto the 0x10f800 valid
  # scalar values, skipping the D800..DFFF surrogate range
  $c = unpack("L>","\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  print chr($c)'

If you only want assigned characters, you can pipe that to:

uconv -x '[:unassigned:]>;'

Or change that to:

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}
  no warnings "utf8";
  $c = unpack("L>","\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  $c = chr $c;
  print $c if $c =~ /\P{Unassigned}/'

You may prefer:

print $c if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/

to get only graphical and spacing characters (excluding those from the private-use areas).

Now, to get 1 GiB of that, you can pipe it to head -c1G (assuming GNU head), but beware that the last character may be cut in the middle.

edited Nov 26 '15 at 15:25; answered Nov 26 '15 at 14:44 by Stéphane Chazelas (4 votes)
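One way to guard against that torn final character (a hedged sketch of mine, not from the original answer; it assumes GNU head and an iconv that supports -c, which omits invalid sequences from the output) is to re-validate the stream after truncating, so a multi-byte sequence cut by head is dropped rather than left dangling:

... | head -c1G | iconv -f UTF-8 -t UTF-8 -c > myfile.txt

The result may then fall a few bytes short of 1 GiB, but everything in it is well-formed UTF-8.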
The most efficient way to create a 10 MB text file with UTF-8 character encoding is:

base64 /dev/urandom | head -c 10000000 | egrep -ao "\w" | tr -d '\n' > file10MB.txt

answered Nov 26 '15 at 13:41 by Message Passing (2 votes)

Is the egrep and tr necessary? – Alastair McCormack, Nov 26 '15 at 13:50

That doesn't produce a 10 MB file. Also, you either need to remove base64 or add a base64 -d somewhere; what you have now only outputs characters used by the Base64 encoding. – Gilles, Nov 26 '15 at 23:23
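Following up on Gilles' size objection, a hedged variant (my sketch, assuming GNU coreutils): count the bytes only after all filtering, so the file really is 10,000,000 bytes; the Base64 alphabet is pure ASCII and therefore valid UTF-8.

base64 /dev/urandom | tr -d '\n' | head -c 10000000 > file10MB.txt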
Grep for ASCII (a subset of UTF-8) chars, on Linux/GNU:

dd if=/dev/random bs=1 count=1G | egrep -ao "\w" | tr -d '\n'

answered Nov 26 '15 at 10:34 by Alastair McCormack

base64 might be considerably more efficient. – muru, Nov 26 '15 at 11:08

haha, doh! That's what happens if you over-think the question :) You should offer that as an answer :) – Alastair McCormack, Nov 26 '15 at 11:10

I'm looking for some way to get as many UTF-8 characters as I can. If I do, I'll post that as an answer. :) – muru, Nov 26 '15 at 11:11

dd if=/dev/random bs=1 count=10M | egrep -ao "\w" | tr -d '\n' > file10MB.txt gets stuck. It is too slow. – Message Passing, Nov 26 '15 at 13:27

@muru you really need to use urandom here; random will get stuck really fast because it needs entropy. This is not fast, but it will get the job done: dd if=/dev/urandom bs=1 count=1G | egrep -ao "\w" | tr -d '\n' – theduke, Jan 8 '17 at 11:23
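A hedged aside on the slowness (mine, not from the thread): bs=1 makes dd issue a separate read() per byte. Letting head do the byte counting over large reads (assuming GNU coreutils) is much faster; as with the original pipeline, the filtered output is smaller than the number of bytes read.

head -c 1G /dev/urandom | egrep -ao "\w" | tr -d '\n' > file1G.txt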
If you want non-ASCII characters, then you'll need a way to build valid UTF-8 sequences. The chance of two consecutive random bytes forming a valid UTF-8 sequence is very low.

Instead, this Python script creates random 8-bit values that can be converted into Unicode chars, then written out as UTF-8:

import random
import io

char_count = 0
# Python 2 script (uses unichr); on Python 3, use chr() instead
with io.open("random-utf8.txt", "w", encoding="utf-8") as my_file:
    while char_count <= 1000000 * 1024:
        rand_long = random.getrandbits(8)
        # Skip C0 controls and space (<= 0x20), DEL (0x7F), and C1 controls (0x80-0x9F)
        if rand_long <= 32 or (0x7F <= rand_long <= 0x9F):
            continue
        unicode_char = unichr(rand_long)
        my_file.write(unicode_char)
        char_count += 1

You could also change it to use a random 16-bit number, which would yield non-Latin values.

It's not fast, but fairly accurate.

answered Nov 26 '15 at 13:47 by Alastair McCormack