How do I create a text file (1 gigabyte) containing random characters with UTF-8 character encoding?

The following command does not produce valid UTF-8: head -c 1M </dev/urandom >myfile.txt

files unicode text random

asked Nov 26 '15 at 9:52 by Message Passing, edited Aug 26 at 14:21 by Jeff Schaller

  • As long as each char is <= 0x7F, then you have a UTF-8 character
    – Alastair McCormack
    Nov 26 '15 at 10:16

  • After executing head -c 1M </dev/urandom >myfile.txt, if I open myfile.txt with gedit it says that there are problems with the UTF-8 character encoding
    – Message Passing
    Nov 26 '15 at 10:29

  • Sorry, I meant that to get valid UTF-8, you must only have single bytes of <= 0x7F or build valid multi-byte UTF-8 sequences. The former is obviously easier from a random perspective.
    – Alastair McCormack
    Nov 26 '15 at 10:39

  • Do you want it covering \u0 to \U7fffffff, or just \u0..\ud7ff plus \ue000..\u10ffff, or only those that are currently specified in the latest Unicode spec?
    – Stéphane Chazelas
    Nov 26 '15 at 11:44






4 Answers

up vote 4 down vote













If you want UTF-8 encodings of code points 0 to 0x7FFFFFFF (which the UTF-8 encoding algorithm was originally designed to work on):



< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \4}                              # read the input 4 bytes at a time
  no warnings "utf8";
  print chr(unpack("L>", $_) & 0x7fffffff)'   # mask to 31 bits: code points 0..0x7FFFFFFF


Nowadays, Unicode is restricted to 0..D7FF, E000..10FFFF (though some of those code points are not assigned, and some never will be, as they are defined as non-characters). This variant scales a 24-bit random number onto the 0x10F800 (0x110000 - 0x800) valid scalar values, then shifts anything landing at or above 0xD800 up by 0x800 to step over the surrogate range:



< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}                                # read 3 bytes (24 bits) at a time
  no warnings "utf8";
  # NUL-pad to 4 bytes for unpack, then scale onto the 0x10F800 valid values
  $c = unpack("L>", "\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;                  # step over the surrogates D800..DFFF
  print chr($c)'


If you only want assigned characters, you can pipe that to:



uconv -x '[:unassigned:]>;'


Or change that to:



< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}
  no warnings "utf8";
  $c = unpack("L>", "\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  $c = chr $c;
  print $c if $c =~ /\P{Unassigned}/'


You may prefer:



 if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/


That keeps only graphical and spacing characters (and excludes those from the private-use sections).



Now, to get 1 GiB of that, you can pipe it to head -c1G (assuming GNU head), but beware that the last character may be cut in the middle.
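
Putting that together, a minimal end-to-end sketch (the output file name is just an example; this assumes GNU head, plus an iconv whose -c option silently discards invalid sequences, so a character cut in half by head is dropped rather than left dangling):

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}
  no warnings "utf8";
  $c = unpack("L>", "\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  print chr($c)' |
  head -c1G |
  iconv -f UTF-8 -t UTF-8 -c > random-utf8.txt

The result may come out a few bytes short of 1 GiB when a truncated final character is removed.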






answered Nov 26 '15 at 14:44 by Stéphane Chazelas, edited Nov 26 '15 at 15:25

up vote 2 down vote













The most efficient way to create a text file with size 10 MB and UTF-8 character encoding is base64 /dev/urandom | head -c 10000000 | egrep -ao "\w" | tr -d '\n' > file10MB.txt






answered Nov 26 '15 at 13:41 by Message Passing

  • Are the egrep and tr necessary?
    – Alastair McCormack
    Nov 26 '15 at 13:50

  • That doesn't produce a 10MB file. Also, you either need to remove base64 or add a base64 -d somewhere; what you have now only outputs characters used by the Base64 encoding.
    – Gilles
    Nov 26 '15 at 23:23
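
As the comments point out, the command above does not actually produce a 10 MB file (the egrep/tr stages shrink the output after head has already truncated it), and Base64 output is plain ASCII anyway, which is already valid UTF-8. One corrected sketch along the lines of those comments (assuming GNU coreutils base64, whose -w0 option disables line wrapping, so no egrep or tr is needed):

base64 -w0 /dev/urandom | head -c 10000000 > file10MB.txt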

















up vote 0 down vote













Grep for ASCII (a subset of UTF-8) chars, on Linux/GNU:



dd if=/dev/random bs=1 count=1G | egrep -ao "\w" | tr -d '\n'





answered Nov 26 '15 at 10:34 by Alastair McCormack

  • base64 might be considerably more efficient.
    – muru
    Nov 26 '15 at 11:08

  • haha, doh! That's what happens if you over-think the question :) You should offer that as an answer :)
    – Alastair McCormack
    Nov 26 '15 at 11:10

  • I'm looking for some way to get as many UTF8 characters in as I can. If I do, I'll post that as an answer. :)
    – muru
    Nov 26 '15 at 11:11

  • dd if=/dev/random bs=1 count=10M | egrep -ao "\w" | tr -d '\n' > file10MB.txt gets stuck. It is too slow.
    – Message Passing
    Nov 26 '15 at 13:27

  • @muru you really need to use urandom here; random will get stuck really fast because it needs entropy. This is not fast, but it will get the job done. dd if=/dev/urandom bs=1 count=1G | egrep -ao "\w" | tr -d '\n'
    – theduke
    Jan 8 '17 at 11:23
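
The slowness complained about above comes largely from bs=1, which makes dd issue a separate read for every byte. A hedged faster sketch of the same ASCII-subset idea (assuming GNU head, and a C/POSIX locale so that tr's [:print:] class matches only printable ASCII bytes; the file name is illustrative):

< /dev/urandom tr -cd '[:print:]' | head -c 1G > random-ascii.txt

Filtering first and truncating afterwards yields exactly 1 GiB of valid UTF-8 (all ASCII and, like the tr -d '\n' variants, with no newlines).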


















up vote 0 down vote













If you want non-ASCII characters, then you'll need a way to build valid UTF-8 sequences. The chance of two consecutive random bytes yielding a valid UTF-8 sequence is very low.



Instead, this Python script creates random 8-bit values that can be converted into Unicode chars, then written out as UTF-8:



import random
import io

char_count = 0

# Python 2 code: unichr() builds a one-character Unicode string; on Python 3 use chr()
with io.open("random-utf8.txt", "w", encoding="utf-8") as my_file:
    while char_count <= 1000000 * 1024:
        rand_long = random.getrandbits(8)

        # Ignore control characters
        if rand_long <= 32 or (rand_long <= 0x9F and rand_long > 0x7F):
            continue

        unicode_char = unichr(rand_long)
        my_file.write(unicode_char)
        char_count += 1


You could also change it to use a random 16-bit number, which would yield non-Latin values.

It's not fast but fairly accurate.






answered Nov 26 '15 at 13:47 by Alastair McCormack