Need to split large csv file

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
2
down vote

favorite












I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.




cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx






share|improve this question

















  • 1




    what should the output files look like? I assume it's not just one set of headers per file.
    – glenn jackman
    Jun 15 at 13:42










  • Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
    – Melioratus
    Jun 15 at 17:18














up vote
2
down vote

favorite












I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.




cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx






share|improve this question

















  • 1




    what should the output files look like? I assume it's not just one set of headers per file.
    – glenn jackman
    Jun 15 at 13:42










  • Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
    – Melioratus
    Jun 15 at 17:18












up vote
2
down vote

favorite









up vote
2
down vote

favorite











I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.




cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx






share|improve this question













I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.




cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx








share|improve this question












share|improve this question




share|improve this question








edited Jun 15 at 17:49









Vadim Kotov

1137




1137









asked Jun 15 at 12:32









Daniel

132




132







  • 1




    what should the output files look like? I assume it's not just one set of headers per file.
    – glenn jackman
    Jun 15 at 13:42










  • Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
    – Melioratus
    Jun 15 at 17:18












  • 1




    what should the output files look like? I assume it's not just one set of headers per file.
    – glenn jackman
    Jun 15 at 13:42










  • Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
    – Melioratus
    Jun 15 at 17:18







1




1




what should the output files look like? I assume it's not just one set of headers per file.
– glenn jackman
Jun 15 at 13:42




what should the output files look like? I assume it's not just one set of headers per file.
– glenn jackman
Jun 15 at 13:42












Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
– Melioratus
Jun 15 at 17:18




Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
– Melioratus
Jun 15 at 17:18










2 Answers
2






active

oldest

votes

















up vote
1
down vote



accepted










ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:



$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134


Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:



$ for i in MySplit.0*; do
split --additional-suffix=".$i" -l 20000 "$i"
done


For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):



$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv





share|improve this answer




























    up vote
    6
    down vote













    You could do something like this with awk:



    awk -vfilename=output -vcut=1000 '
    BEGIN nl=0; nf=1; f=filename "." nf;
    ++nl >= cut && /^cust header,/
    close(f); nl=0; f=filename "." ++nf
    print > f' < file


    It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)






    share|improve this answer





















      Your Answer







      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "106"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      convertImagesToLinks: false,
      noModals: false,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );








       

      draft saved


      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f449990%2fneed-to-split-large-csv-file%23new-answer', 'question_page');

      );

      Post as a guest






























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      1
      down vote



      accepted










      ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:



      $ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
      0
      174
      134
      134


      Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:



      $ for i in MySplit.0*; do
      split --additional-suffix=".$i" -l 20000 "$i"
      done


      For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):



      $ ls -lhXB
      total 44K
      -rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
      -rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
      -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
      -rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
      -rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
      -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
      -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
      -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
      -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
      -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
      -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
      -rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv





      share|improve this answer

























        up vote
        1
        down vote



        accepted










        ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:



        $ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
        0
        174
        134
        134


        Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:



        $ for i in MySplit.0*; do
        split --additional-suffix=".$i" -l 20000 "$i"
        done


        For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):



        $ ls -lhXB
        total 44K
        -rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
        -rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
        -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
        -rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
        -rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
        -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
        -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
        -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
        -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
        -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
        -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
        -rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv





        share|improve this answer























          up vote
          1
          down vote



          accepted







          up vote
          1
          down vote



          accepted






          ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:



          $ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
          0
          174
          134
          134


          Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:



          $ for i in MySplit.0*; do
          split --additional-suffix=".$i" -l 20000 "$i"
          done


          For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):



          $ ls -lhXB
          total 44K
          -rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
          -rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
          -rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
          -rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
          -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
          -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
          -rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv





          share|improve this answer













          ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:



          $ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
          0
          174
          134
          134


          Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:



          $ for i in MySplit.0*; do
          split --additional-suffix=".$i" -l 20000 "$i"
          done


          For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):



          $ ls -lhXB
          total 44K
          -rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
          -rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
          -rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
          -rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
          -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
          -rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
          -rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
          -rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv






          share|improve this answer













          share|improve this answer



          share|improve this answer











          answered Jun 15 at 17:42









          hunteke

          1713




          1713






















              up vote
              6
              down vote













              You could do something like this with awk:



              awk -vfilename=output -vcut=1000 '
              BEGIN nl=0; nf=1; f=filename "." nf;
              ++nl >= cut && /^cust header,/
              close(f); nl=0; f=filename "." ++nf
              print > f' < file


              It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)






              share|improve this answer

























                up vote
                6
                down vote













                You could do something like this with awk:



                awk -vfilename=output -vcut=1000 '
                BEGIN nl=0; nf=1; f=filename "." nf;
                ++nl >= cut && /^cust header,/
                close(f); nl=0; f=filename "." ++nf
                print > f' < file


                It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)






                share|improve this answer























                  up vote
                  6
                  down vote










                  up vote
                  6
                  down vote









                  You could do something like this with awk:



                  awk -vfilename=output -vcut=1000 '
                  BEGIN nl=0; nf=1; f=filename "." nf;
                  ++nl >= cut && /^cust header,/
                  close(f); nl=0; f=filename "." ++nf
                  print > f' < file


                  It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)






                  share|improve this answer













                  You could do something like this with awk:



                  awk -vfilename=output -vcut=1000 '
                  BEGIN nl=0; nf=1; f=filename "." nf;
                  ++nl >= cut && /^cust header,/
                  close(f); nl=0; f=filename "." ++nf
                  print > f' < file


                  It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)







                  share|improve this answer













                  share|improve this answer



                  share|improve this answer











                  answered Jun 15 at 12:49









                  ilkkachu

                  47.5k668130




                  47.5k668130






















                       

                      draft saved


                      draft discarded


























                       


                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f449990%2fneed-to-split-large-csv-file%23new-answer', 'question_page');

                      );

                      Post as a guest













































































                      Popular posts from this blog

                      Peggy Mitchell

                      Palaiologos

                      The Forum (Inglewood, California)