Need to split large csv file

up vote
2
down vote

favorite

I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.


cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx

edited Jun 15 at 17:49

Vadim Kotov

1137

asked Jun 15 at 12:32

Daniel

132

1

what should the output files look like? I assume it's not just one set of headers per file.
â€“Â glenn jackman
Jun 15 at 13:42

Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â€“Â Melioratus
Jun 15 at 17:18

add a commentÂ |Â

up vote
2
down vote

favorite


cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx

edited Jun 15 at 17:49

Vadim Kotov

1137

asked Jun 15 at 12:32

Daniel

132

1

what should the output files look like? I assume it's not just one set of headers per file.
â€“Â glenn jackman
Jun 15 at 13:42

Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â€“Â Melioratus
Jun 15 at 17:18

add a commentÂ |Â

up vote
2
down vote

favorite


cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx

edited Jun 15 at 17:49

Vadim Kotov

1137

asked Jun 15 at 12:32

Daniel

132


cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx

edited Jun 15 at 17:49

Vadim Kotov

1137

asked Jun 15 at 12:32

Daniel

132

edited Jun 15 at 17:49

Vadim Kotov

1137

edited Jun 15 at 17:49

Vadim Kotov

1137

edited Jun 15 at 17:49

Vadim Kotov

1137

asked Jun 15 at 12:32

Daniel

132

asked Jun 15 at 12:32

Daniel

132

asked Jun 15 at 12:32

Daniel

132

1

what should the output files look like? I assume it's not just one set of headers per file.
â€“Â glenn jackman
Jun 15 at 13:42

Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â€“Â Melioratus
Jun 15 at 17:18

add a commentÂ |Â

1

what should the output files look like? I assume it's not just one set of headers per file.
â€“Â glenn jackman
Jun 15 at 13:42

Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â€“Â Melioratus
Jun 15 at 17:18

what should the output files look like? I assume it's not just one set of headers per file.
â€“Â glenn jackman
Jun 15 at 13:42

Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â€“Â Melioratus
Jun 15 at 17:18

add a commentÂ |Â

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

ilkkachu's solution is slick, uses a single executable, and is ~~probably~~ the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:

$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134

Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:

$ for i in MySplit.0*; do
 split --additional-suffix=".$i" -l 20000 "$i"
 done

For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):

$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv

answered Jun 15 at 17:42

hunteke

1713

add a commentÂ |Â

up vote
6
down vote

You could do something like this with awk:

awk -vfilename=output -vcut=1000 '
 BEGIN nl=0; nf=1; f=filename "." nf; 
 ++nl >= cut && /^cust header,/ 
 close(f); nl=0; f=filename "." ++nf
 print > f' < file

It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)

answered Jun 15 at 12:49

ilkkachu

47.5k668130

add a commentÂ |Â

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f449990%2fneed-to-split-large-csv-file%23new-answer', 'question_page');

);

Post as a guest

Name

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134

Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:

$ for i in MySplit.0*; do
 split --additional-suffix=".$i" -l 20000 "$i"
 done

For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):

$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv

answered Jun 15 at 17:42

hunteke

1713

add a commentÂ |Â

up vote
1
down vote

accepted

$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134

Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:

$ for i in MySplit.0*; do
 split --additional-suffix=".$i" -l 20000 "$i"
 done

For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):

$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv

answered Jun 15 at 17:42

hunteke

1713

add a commentÂ |Â

up vote
1
down vote

accepted

$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134

Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:

$ for i in MySplit.0*; do
 split --additional-suffix=".$i" -l 20000 "$i"
 done

For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):

$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv

answered Jun 15 at 17:42

hunteke

1713

$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134

Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:

$ for i in MySplit.0*; do
 split --additional-suffix=".$i" -l 20000 "$i"
 done

For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):

$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv

answered Jun 15 at 17:42

hunteke

1713

answered Jun 15 at 17:42

hunteke

1713

answered Jun 15 at 17:42

hunteke

1713

answered Jun 15 at 17:42

hunteke

1713

add a commentÂ |Â

up vote
6
down vote

You could do something like this with awk:

awk -vfilename=output -vcut=1000 '
 BEGIN nl=0; nf=1; f=filename "." nf; 
 ++nl >= cut && /^cust header,/ 
 close(f); nl=0; f=filename "." ++nf
 print > f' < file

answered Jun 15 at 12:49

ilkkachu

47.5k668130

add a commentÂ |Â

up vote
6
down vote

You could do something like this with awk:

awk -vfilename=output -vcut=1000 '
 BEGIN nl=0; nf=1; f=filename "." nf; 
 ++nl >= cut && /^cust header,/ 
 close(f); nl=0; f=filename "." ++nf
 print > f' < file

answered Jun 15 at 12:49

ilkkachu

47.5k668130

add a commentÂ |Â

up vote
6
down vote

You could do something like this with awk:

awk -vfilename=output -vcut=1000 '
 BEGIN nl=0; nf=1; f=filename "." nf; 
 ++nl >= cut && /^cust header,/ 
 close(f); nl=0; f=filename "." ++nf
 print > f' < file

answered Jun 15 at 12:49

ilkkachu

47.5k668130

You could do something like this with awk:

awk -vfilename=output -vcut=1000 '
 BEGIN nl=0; nf=1; f=filename "." nf; 
 ++nl >= cut && /^cust header,/ 
 close(f); nl=0; f=filename "." ++nf
 print > f' < file

answered Jun 15 at 12:49

ilkkachu

47.5k668130

answered Jun 15 at 12:49

ilkkachu

47.5k668130

answered Jun 15 at 12:49

ilkkachu

47.5k668130

answered Jun 15 at 12:49

ilkkachu

47.5k668130

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

mjhjmtu