Need to split large csv file

Clash Royale CLAN TAG#URR8PPP
up vote
2
down vote
favorite
I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
text-processing
add a comment |Â
up vote
2
down vote
favorite
I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
text-processing
1
what should the output files look like? I assume it's not just one set of headers per file.
â glenn jackman
Jun 15 at 13:42
Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â Melioratus
Jun 15 at 17:18
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
text-processing
I've got this csv file that needs to be broken up in smaller files.
no problem with split -l 20000 test.csv my problem is that the file contains different headers. want to split say every +-1000 lines but it needs to split after pay header and new file needs to start with cust header.
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
text-processing
edited Jun 15 at 17:49
Vadim Kotov
1137
1137
asked Jun 15 at 12:32
Daniel
132
132
1
what should the output files look like? I assume it's not just one set of headers per file.
â glenn jackman
Jun 15 at 13:42
Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â Melioratus
Jun 15 at 17:18
add a comment |Â
1
what should the output files look like? I assume it's not just one set of headers per file.
â glenn jackman
Jun 15 at 13:42
Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â Melioratus
Jun 15 at 17:18
1
1
what should the output files look like? I assume it's not just one set of headers per file.
â glenn jackman
Jun 15 at 13:42
what should the output files look like? I assume it's not just one set of headers per file.
â glenn jackman
Jun 15 at 13:42
Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â Melioratus
Jun 15 at 17:18
Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â Melioratus
Jun 15 at 17:18
add a comment |Â
2 Answers
2
active
oldest
votes
up vote
1
down vote
accepted
ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:
$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134
Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:
$ for i in MySplit.0*; do
split --additional-suffix=".$i" -l 20000 "$i"
done
For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):
$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv
add a comment |Â
up vote
6
down vote
You could do something like this with awk:
awk -vfilename=output -vcut=1000 '
BEGIN nl=0; nf=1; f=filename "." nf;
++nl >= cut && /^cust header,/
close(f); nl=0; f=filename "." ++nf
print > f' < file
It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:
$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134
Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:
$ for i in MySplit.0*; do
split --additional-suffix=".$i" -l 20000 "$i"
done
For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):
$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv
add a comment |Â
up vote
1
down vote
accepted
ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:
$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134
Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:
$ for i in MySplit.0*; do
split --additional-suffix=".$i" -l 20000 "$i"
done
For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):
$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv
add a comment |Â
up vote
1
down vote
accepted
up vote
1
down vote
accepted
ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:
$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134
Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:
$ for i in MySplit.0*; do
split --additional-suffix=".$i" -l 20000 "$i"
done
For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):
$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv
ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:
$ csplit --prefix="MySplit." test.csv '/^cust header,/' '*'
0
174
134
134
Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:
$ for i in MySplit.0*; do
split --additional-suffix=".$i" -l 20000 "$i"
done
For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):
$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke 0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke 81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke 26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke 67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv
answered Jun 15 at 17:42
hunteke
1713
1713
add a comment |Â
add a comment |Â
up vote
6
down vote
You could do something like this with awk:
awk -vfilename=output -vcut=1000 '
BEGIN nl=0; nf=1; f=filename "." nf;
++nl >= cut && /^cust header,/
close(f); nl=0; f=filename "." ++nf
print > f' < file
It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)
add a comment |Â
up vote
6
down vote
You could do something like this with awk:
awk -vfilename=output -vcut=1000 '
BEGIN nl=0; nf=1; f=filename "." nf;
++nl >= cut && /^cust header,/
close(f); nl=0; f=filename "." ++nf
print > f' < file
It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)
add a comment |Â
up vote
6
down vote
up vote
6
down vote
You could do something like this with awk:
awk -vfilename=output -vcut=1000 '
BEGIN nl=0; nf=1; f=filename "." nf;
++nl >= cut && /^cust header,/
close(f); nl=0; f=filename "." ++nf
print > f' < file
It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)
You could do something like this with awk:
awk -vfilename=output -vcut=1000 '
BEGIN nl=0; nf=1; f=filename "." nf;
++nl >= cut && /^cust header,/
close(f); nl=0; f=filename "." ++nf
print > f' < file
It keeps a count of the lines it's seen, and reopens a new output file if the count is larger than cut (1000 here), and the current line starts with cust header,. The output files are named output.1, output.2, ... (the filename variable)
answered Jun 15 at 12:49
ilkkachu
47.5k668130
47.5k668130
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f449990%2fneed-to-split-large-csv-file%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
1
what should the output files look like? I assume it's not just one set of headers per file.
â glenn jackman
Jun 15 at 13:42
Will the cust header sections need to be broken into multiple files with cust header at top of each file, e.g. cust header1 - lines 1-1000 and cust header1 -line 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000?
â Melioratus
Jun 15 at 17:18