Cut the SUBSTRINGS to a specific length in a CSV file

up vote
1
down vote

favorite

I have a file like below,
cat Test.csv

"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,

After modification :

cat Test.csv

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

The 3rd field containing substrings delimited with ";" has to be replaced with their substrings

edited Oct 22 '17 at 10:37

RomanPerekhrest

22.5k12145

asked Oct 22 '17 at 6:56

Pavan

Only 5 columns ?and only third column should be replaced?or it's can appear in different columns too?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:51

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ· The data has 5 columns, with the 4th and 5th being empty, it seems.
â€“Â Kusalananda
Oct 22 '17 at 7:52

ah, yes, edited my comment
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:53

add a commentÂ |Â

up vote
1
down vote

favorite

I have a file like below,
cat Test.csv

"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,

After modification :

cat Test.csv

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

The 3rd field containing substrings delimited with ";" has to be replaced with their substrings

edited Oct 22 '17 at 10:37

RomanPerekhrest

22.5k12145

asked Oct 22 '17 at 6:56

Pavan

Only 5 columns ?and only third column should be replaced?or it's can appear in different columns too?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:51

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ· The data has 5 columns, with the 4th and 5th being empty, it seems.
â€“Â Kusalananda
Oct 22 '17 at 7:52

ah, yes, edited my comment
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:53

add a commentÂ |Â

up vote
1
down vote

favorite

I have a file like below,
cat Test.csv

"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,

After modification :

cat Test.csv

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

The 3rd field containing substrings delimited with ";" has to be replaced with their substrings

edited Oct 22 '17 at 10:37

RomanPerekhrest

22.5k12145

asked Oct 22 '17 at 6:56

Pavan

I have a file like below,
cat Test.csv

"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,

After modification :

cat Test.csv

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

The 3rd field containing substrings delimited with ";" has to be replaced with their substrings

edited Oct 22 '17 at 10:37

RomanPerekhrest

22.5k12145

asked Oct 22 '17 at 6:56

Pavan

edited Oct 22 '17 at 10:37

RomanPerekhrest

22.5k12145

edited Oct 22 '17 at 10:37

RomanPerekhrest

22.5k12145

edited Oct 22 '17 at 10:37

RomanPerekhrest

22.5k12145

asked Oct 22 '17 at 6:56

Pavan

asked Oct 22 '17 at 6:56

Pavan

asked Oct 22 '17 at 6:56

Pavan

Only 5 columns ?and only third column should be replaced?or it's can appear in different columns too?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:51

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ· The data has 5 columns, with the 4th and 5th being empty, it seems.
â€“Â Kusalananda
Oct 22 '17 at 7:52

ah, yes, edited my comment
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:53

add a commentÂ |Â

Only 5 columns ?and only third column should be replaced?or it's can appear in different columns too?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:51

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ· The data has 5 columns, with the 4th and 5th being empty, it seems.
â€“Â Kusalananda
Oct 22 '17 at 7:52

ah, yes, edited my comment
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:53

Only 5 columns ?and only third column should be replaced?or it's can appear in different columns too?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:51

@ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ· The data has 5 columns, with the 4th and 5th being empty, it seems.
â€“Â Kusalananda
Oct 22 '17 at 7:52

ah, yes, edited my comment
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Oct 22 '17 at 7:53

add a commentÂ |Â

4 Answers
4

active

oldest

votes

up vote
0
down vote

The following is using csvkit, because parsing CSV data that contains commas in quoted fields with awk directly is error prone.

This will get column three on the correct format:

csvcut -c 3 file.csv |
sed -r 's/^"|"$//g' |
awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' >tmp-3rd

For the given input, this produces

"AB;xy;15"
"C4;x2;rt"

csvcut will cut out the third column.

sed will remove any double quotes from the data, if they appear first or last on the line.

The awk program will go through the ;-delimited fields and cut them down to a length of two characters per field. It prints out the data with double quotes around it.

The output is written to the file tmp-3rd.

Then it's just a matter of reassembling this with the original data (this is assuming bash or any other shell that can do process substitutions with <(...)):

paste -d, <( csvcut -c 1,2 file.csv ) tmp-3rd <( csvcut -c 4,5 file.csv ) | csvformat

paste will put the columns together with commas in-between.

The first process substitution produces the first two columns from the original file, and the second produces the last two columns. In the middle, we provide the modified third column.

As an optional step, we pass the data through csvformat which will quote or unquote fields as needed.

The output will be

pav,12345,AB;xy;15,,
xyz,,C4;x2;rt,,

Bypassing the need for the temporary file:

paste -d, 
 <( csvcut -c 1,2 file.csv ) 
 <( csvcut -c 3 file.csv | sed -r 's/^"|"$//g' |
 awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' ) 
 <( csvcut -c 4,5 file.csv ) | csvformat

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

I doesn't seem to have a csvkit , will try to use thae AWK as the rest had concquered some how, Thanks :)
â€“Â Pavan
Oct 22 '17 at 8:18

@Pavan pip install --user csvkit will install the commands in $HOME/.local/bin.
â€“Â Kusalananda
Oct 22 '17 at 8:29

I tried with this which works fine , but what i actually wasnt to is, the third field in above example has to be replaced with the modified one in he file could you help me : awk -F',"' 'print $5 "t" ' test.csv|awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 25) printf(""%s"n", $0) '
â€“Â Pavan
Oct 23 '17 at 15:10

@Pavan I don't understand. Please update the question with the appropriate information.
â€“Â Kusalananda
Oct 23 '17 at 15:14

I mean now i am able to see the output on the screen as expected, But my actual need is to replace that in the actual file. awk -F',"' 'print $3 "t" ' test.csv --> helping me to print the 3rd field eg. "ABCD,EF;xyz23;15rtg" and awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' --> helping me to limit the substrings to 2 char. the desired output is seen on the screen, but i want that to be updated in the file and the file wold have like 50k lines and the action should happen to all lines
â€“Â Pavan
Oct 23 '17 at 15:22

Â |Â
show 1 more comment

up vote
0
down vote

With perl

Assuming ; is only in third field

$ perl -pe 's/"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

"K to match " before string of interest and (?=") to match " after string of interest. But the " themselves not part of captured string as these are lookarounds

[^;"]*;[^"]* match any non ; or " characters followed by ; followed by non " characters

$&=~s|([^;]2)[^;]+|$1|gr to perform another substitution on the matched string

e modifier allows to use Perl code in substitution section

To restrict only for 3rd field

$ cat ip.txt 
"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

$ perl -pe 's/^("[^"]*",|[^,]*,)2"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

add a commentÂ |Â

up vote
0
down vote

Accurate and robust Python 3.x solution (based on csv.reader object):

parse_csv.py script:

import csv, sys
with open(sys.argv[1]) as f:
 reader = csv.reader(f)
 for l in reader:
 l = [s if ';' not in s else ';'.join(_[:2] for _ in s.split(';')) for s in l]
 print(','.join(i if not i or i.isnumeric() else '""'.format(i) for i in l))

Usage:

python3 parse_csv.py Test.csv

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

Python's csv module provides robust and flexible support for csv data.

edited Oct 22 '17 at 10:32

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

add a commentÂ |Â

up vote
0
down vote

Complex GNU AWK solution (parsing csv data):

awk -v FPAT='"[^"]+"|[^",]+|,,' ' 
 for (i=1;i<=NF;i++) 
 if ($i~/^".*;./) 
 len=split($i,a,";"); v=substr(a[1],1,3); 
 for (j=2;j<=len;j++) v= v";"substr(a[j],1,2);
 v=v"42"
 printf "%s%s",(v? v: ($i~/^,,/? (i==NF? ",":""):$i )),
 (i==NF? ORS:OFS); v="" 
 
 ' OFS=',' Test.csv

FPAT='"[^"]+"|[^",]+|,,' - complex regex pattern defining field value

if ($i~/^".*;./) ... - if the current field $i contains ; character(s)

len=split($i,a,";") - split the field value $i into array a by separator ;. len is assigned with number of elements/chunks created

v=substr(a[1],1,3); - capturing the first chunk of the needed length including leading " char, for ex. "AB will be extracted from "ABCD,EF

for (j=2;j<=len;j++) ... - iterating through remaining chunks/items

v=v"42" - add trailing double quote " to the processed sequence v. 43 is ASCII octal code representing the double quote char ".

($i~/^,,/? (i==NF? ",":""):$i ) - each empty field ,, is recreated with single comma , and common delimiter (also ,). This is to avoid redundant comma cluttering like "pav",,,

(i==NF? ORS:OFS) - on encountering the last field i==NF - print output record separator ORS, otherwise - print output filed separator OFS

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

edited Oct 22 '17 at 10:38

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

Thanks Roman, Works perfect. Could you help me in understanding bit better, like the 042 and arrays look bit complex.
â€“Â Pavan
Oct 22 '17 at 8:25

@Pavan, welcome, see my explanation
â€“Â RomanPerekhrest
Oct 22 '17 at 10:30

add a commentÂ |Â

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f399655%2fcut-the-substrings-to-a-specific-length-in-a-csv-file%23new-answer', 'question_page');

);

Post as a guest

Name

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

up vote
0
down vote

The following is using csvkit, because parsing CSV data that contains commas in quoted fields with awk directly is error prone.

This will get column three on the correct format:

csvcut -c 3 file.csv |
sed -r 's/^"|"$//g' |
awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' >tmp-3rd

For the given input, this produces

"AB;xy;15"
"C4;x2;rt"

csvcut will cut out the third column.

sed will remove any double quotes from the data, if they appear first or last on the line.

The awk program will go through the ;-delimited fields and cut them down to a length of two characters per field. It prints out the data with double quotes around it.

The output is written to the file tmp-3rd.

Then it's just a matter of reassembling this with the original data (this is assuming bash or any other shell that can do process substitutions with <(...)):

paste -d, <( csvcut -c 1,2 file.csv ) tmp-3rd <( csvcut -c 4,5 file.csv ) | csvformat

paste will put the columns together with commas in-between.

The first process substitution produces the first two columns from the original file, and the second produces the last two columns. In the middle, we provide the modified third column.

As an optional step, we pass the data through csvformat which will quote or unquote fields as needed.

The output will be

pav,12345,AB;xy;15,,
xyz,,C4;x2;rt,,

Bypassing the need for the temporary file:

paste -d, 
 <( csvcut -c 1,2 file.csv ) 
 <( csvcut -c 3 file.csv | sed -r 's/^"|"$//g' |
 awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' ) 
 <( csvcut -c 4,5 file.csv ) | csvformat

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

I doesn't seem to have a csvkit , will try to use thae AWK as the rest had concquered some how, Thanks :)
â€“Â Pavan
Oct 22 '17 at 8:18

@Pavan pip install --user csvkit will install the commands in $HOME/.local/bin.
â€“Â Kusalananda
Oct 22 '17 at 8:29

I tried with this which works fine , but what i actually wasnt to is, the third field in above example has to be replaced with the modified one in he file could you help me : awk -F',"' 'print $5 "t" ' test.csv|awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 25) printf(""%s"n", $0) '
â€“Â Pavan
Oct 23 '17 at 15:10

@Pavan I don't understand. Please update the question with the appropriate information.
â€“Â Kusalananda
Oct 23 '17 at 15:14

I mean now i am able to see the output on the screen as expected, But my actual need is to replace that in the actual file. awk -F',"' 'print $3 "t" ' test.csv --> helping me to print the 3rd field eg. "ABCD,EF;xyz23;15rtg" and awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' --> helping me to limit the substrings to 2 char. the desired output is seen on the screen, but i want that to be updated in the file and the file wold have like 50k lines and the action should happen to all lines
â€“Â Pavan
Oct 23 '17 at 15:22

Â |Â
show 1 more comment

up vote
0
down vote

The following is using csvkit, because parsing CSV data that contains commas in quoted fields with awk directly is error prone.

This will get column three on the correct format:

csvcut -c 3 file.csv |
sed -r 's/^"|"$//g' |
awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' >tmp-3rd

For the given input, this produces

"AB;xy;15"
"C4;x2;rt"

csvcut will cut out the third column.

sed will remove any double quotes from the data, if they appear first or last on the line.

The awk program will go through the ;-delimited fields and cut them down to a length of two characters per field. It prints out the data with double quotes around it.

The output is written to the file tmp-3rd.

Then it's just a matter of reassembling this with the original data (this is assuming bash or any other shell that can do process substitutions with <(...)):

paste -d, <( csvcut -c 1,2 file.csv ) tmp-3rd <( csvcut -c 4,5 file.csv ) | csvformat

paste will put the columns together with commas in-between.

The first process substitution produces the first two columns from the original file, and the second produces the last two columns. In the middle, we provide the modified third column.

As an optional step, we pass the data through csvformat which will quote or unquote fields as needed.

The output will be

pav,12345,AB;xy;15,,
xyz,,C4;x2;rt,,

Bypassing the need for the temporary file:

paste -d, 
 <( csvcut -c 1,2 file.csv ) 
 <( csvcut -c 3 file.csv | sed -r 's/^"|"$//g' |
 awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' ) 
 <( csvcut -c 4,5 file.csv ) | csvformat

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

I doesn't seem to have a csvkit , will try to use thae AWK as the rest had concquered some how, Thanks :)
â€“Â Pavan
Oct 22 '17 at 8:18

@Pavan pip install --user csvkit will install the commands in $HOME/.local/bin.
â€“Â Kusalananda
Oct 22 '17 at 8:29

I tried with this which works fine , but what i actually wasnt to is, the third field in above example has to be replaced with the modified one in he file could you help me : awk -F',"' 'print $5 "t" ' test.csv|awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 25) printf(""%s"n", $0) '
â€“Â Pavan
Oct 23 '17 at 15:10

@Pavan I don't understand. Please update the question with the appropriate information.
â€“Â Kusalananda
Oct 23 '17 at 15:14

I mean now i am able to see the output on the screen as expected, But my actual need is to replace that in the actual file. awk -F',"' 'print $3 "t" ' test.csv --> helping me to print the 3rd field eg. "ABCD,EF;xyz23;15rtg" and awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' --> helping me to limit the substrings to 2 char. the desired output is seen on the screen, but i want that to be updated in the file and the file wold have like 50k lines and the action should happen to all lines
â€“Â Pavan
Oct 23 '17 at 15:22

Â |Â
show 1 more comment

up vote
0
down vote

The following is using csvkit, because parsing CSV data that contains commas in quoted fields with awk directly is error prone.

This will get column three on the correct format:

csvcut -c 3 file.csv |
sed -r 's/^"|"$//g' |
awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' >tmp-3rd

For the given input, this produces

"AB;xy;15"
"C4;x2;rt"

csvcut will cut out the third column.

sed will remove any double quotes from the data, if they appear first or last on the line.

The awk program will go through the ;-delimited fields and cut them down to a length of two characters per field. It prints out the data with double quotes around it.

The output is written to the file tmp-3rd.

Then it's just a matter of reassembling this with the original data (this is assuming bash or any other shell that can do process substitutions with <(...)):

paste -d, <( csvcut -c 1,2 file.csv ) tmp-3rd <( csvcut -c 4,5 file.csv ) | csvformat

paste will put the columns together with commas in-between.

The first process substitution produces the first two columns from the original file, and the second produces the last two columns. In the middle, we provide the modified third column.

As an optional step, we pass the data through csvformat which will quote or unquote fields as needed.

The output will be

pav,12345,AB;xy;15,,
xyz,,C4;x2;rt,,

Bypassing the need for the temporary file:

paste -d, 
 <( csvcut -c 1,2 file.csv ) 
 <( csvcut -c 3 file.csv | sed -r 's/^"|"$//g' |
 awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' ) 
 <( csvcut -c 4,5 file.csv ) | csvformat

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

The following is using csvkit, because parsing CSV data that contains commas in quoted fields with awk directly is error prone.

This will get column three on the correct format:

csvcut -c 3 file.csv |
sed -r 's/^"|"$//g' |
awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' >tmp-3rd

For the given input, this produces

"AB;xy;15"
"C4;x2;rt"

csvcut will cut out the third column.

sed will remove any double quotes from the data, if they appear first or last on the line.

The awk program will go through the ;-delimited fields and cut them down to a length of two characters per field. It prints out the data with double quotes around it.

The output is written to the file tmp-3rd.

Then it's just a matter of reassembling this with the original data (this is assuming bash or any other shell that can do process substitutions with <(...)):

paste -d, <( csvcut -c 1,2 file.csv ) tmp-3rd <( csvcut -c 4,5 file.csv ) | csvformat

paste will put the columns together with commas in-between.

The first process substitution produces the first two columns from the original file, and the second produces the last two columns. In the middle, we provide the modified third column.

As an optional step, we pass the data through csvformat which will quote or unquote fields as needed.

The output will be

pav,12345,AB;xy;15,,
xyz,,C4;x2;rt,,

Bypassing the need for the temporary file:

paste -d, 
 <( csvcut -c 1,2 file.csv ) 
 <( csvcut -c 3 file.csv | sed -r 's/^"|"$//g' |
 awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' ) 
 <( csvcut -c 4,5 file.csv ) | csvformat

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

answered Oct 22 '17 at 7:51

Kusalananda

105k14209326

I doesn't seem to have a csvkit , will try to use thae AWK as the rest had concquered some how, Thanks :)
â€“Â Pavan
Oct 22 '17 at 8:18

@Pavan pip install --user csvkit will install the commands in $HOME/.local/bin.
â€“Â Kusalananda
Oct 22 '17 at 8:29

I tried with this which works fine , but what i actually wasnt to is, the third field in above example has to be replaced with the modified one in he file could you help me : awk -F',"' 'print $5 "t" ' test.csv|awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 25) printf(""%s"n", $0) '
â€“Â Pavan
Oct 23 '17 at 15:10

@Pavan I don't understand. Please update the question with the appropriate information.
â€“Â Kusalananda
Oct 23 '17 at 15:14

I mean now i am able to see the output on the screen as expected, But my actual need is to replace that in the actual file. awk -F',"' 'print $3 "t" ' test.csv --> helping me to print the 3rd field eg. "ABCD,EF;xyz23;15rtg" and awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' --> helping me to limit the substrings to 2 char. the desired output is seen on the screen, but i want that to be updated in the file and the file wold have like 50k lines and the action should happen to all lines
â€“Â Pavan
Oct 23 '17 at 15:22

Â |Â
show 1 more comment

I doesn't seem to have a csvkit , will try to use thae AWK as the rest had concquered some how, Thanks :)
â€“Â Pavan
Oct 22 '17 at 8:18

@Pavan pip install --user csvkit will install the commands in $HOME/.local/bin.
â€“Â Kusalananda
Oct 22 '17 at 8:29

I tried with this which works fine , but what i actually wasnt to is, the third field in above example has to be replaced with the modified one in he file could you help me : awk -F',"' 'print $5 "t" ' test.csv|awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 25) printf(""%s"n", $0) '
â€“Â Pavan
Oct 23 '17 at 15:10

@Pavan I don't understand. Please update the question with the appropriate information.
â€“Â Kusalananda
Oct 23 '17 at 15:14

I mean now i am able to see the output on the screen as expected, But my actual need is to replace that in the actual file. awk -F',"' 'print $3 "t" ' test.csv --> helping me to print the 3rd field eg. "ABCD,EF;xyz23;15rtg" and awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' --> helping me to limit the substrings to 2 char. the desired output is seen on the screen, but i want that to be updated in the file and the file wold have like 50k lines and the action should happen to all lines
â€“Â Pavan
Oct 23 '17 at 15:22

I doesn't seem to have a csvkit , will try to use thae AWK as the rest had concquered some how, Thanks :)
â€“Â Pavan
Oct 22 '17 at 8:18

@Pavan pip install --user csvkit will install the commands in $HOME/.local/bin.
â€“Â Kusalananda
Oct 22 '17 at 8:29

I tried with this which works fine , but what i actually wasnt to is, the third field in above example has to be replaced with the modified one in he file could you help me : awk -F',"' 'print $5 "t" ' test.csv|awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 25) printf(""%s"n", $0) '
â€“Â Pavan
Oct 23 '17 at 15:10

@Pavan I don't understand. Please update the question with the appropriate information.
â€“Â Kusalananda
Oct 23 '17 at 15:14

I mean now i am able to see the output on the screen as expected, But my actual need is to replace that in the actual file. awk -F',"' 'print $3 "t" ' test.csv --> helping me to print the 3rd field eg. "ABCD,EF;xyz23;15rtg" and awk -F';' -vOFS=';' ' for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) printf(""%s"n", $0) ' --> helping me to limit the substrings to 2 char. the desired output is seen on the screen, but i want that to be updated in the file and the file wold have like 50k lines and the action should happen to all lines
â€“Â Pavan
Oct 23 '17 at 15:22

Â |Â
show 1 more comment

up vote
0
down vote

With perl

Assuming ; is only in third field

$ perl -pe 's/"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

"K to match " before string of interest and (?=") to match " after string of interest. But the " themselves not part of captured string as these are lookarounds

[^;"]*;[^"]* match any non ; or " characters followed by ; followed by non " characters

$&=~s|([^;]2)[^;]+|$1|gr to perform another substitution on the matched string

e modifier allows to use Perl code in substitution section

To restrict only for 3rd field

$ cat ip.txt 
"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

$ perl -pe 's/^("[^"]*",|[^,]*,)2"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

add a commentÂ |Â

up vote
0
down vote

With perl

Assuming ; is only in third field

$ perl -pe 's/"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

"K to match " before string of interest and (?=") to match " after string of interest. But the " themselves not part of captured string as these are lookarounds

[^;"]*;[^"]* match any non ; or " characters followed by ; followed by non " characters

$&=~s|([^;]2)[^;]+|$1|gr to perform another substitution on the matched string

e modifier allows to use Perl code in substitution section

To restrict only for 3rd field

$ cat ip.txt 
"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

$ perl -pe 's/^("[^"]*",|[^,]*,)2"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

add a commentÂ |Â

up vote
0
down vote

With perl

Assuming ; is only in third field

$ perl -pe 's/"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

"K to match " before string of interest and (?=") to match " after string of interest. But the " themselves not part of captured string as these are lookarounds

[^;"]*;[^"]* match any non ; or " characters followed by ; followed by non " characters

$&=~s|([^;]2)[^;]+|$1|gr to perform another substitution on the matched string

e modifier allows to use Perl code in substitution section

To restrict only for 3rd field

$ cat ip.txt 
"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

$ perl -pe 's/^("[^"]*",|[^,]*,)2"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

With perl

Assuming ; is only in third field

$ perl -pe 's/"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

"K to match " before string of interest and (?=") to match " after string of interest. But the " themselves not part of captured string as these are lookarounds

[^;"]*;[^"]* match any non ; or " characters followed by ; followed by non " characters

$&=~s|([^;]2)[^;]+|$1|gr to perform another substitution on the matched string

e modifier allows to use Perl code in substitution section

To restrict only for 3rd field

$ cat ip.txt 
"pav",12345,"ABCD,EF;xyz23;15rtg",,
"xyz",,"C4DEF;x23yu;rtg",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

$ perl -pe 's/^("[^"]*",|[^,]*,)2"K[^;"]*;[^"]*(?=")/$&=~s|([^;]2)[^;]+|$1|gr/e' ip.txt
"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,
"foo;12,23;good",124,253
12,5232,"xyz","ijk;5545;62"

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

answered Oct 22 '17 at 8:46

Sundeep

6,9611826

add a commentÂ |Â

up vote
0
down vote

Accurate and robust Python 3.x solution (based on csv.reader object):

parse_csv.py script:

import csv, sys
with open(sys.argv[1]) as f:
 reader = csv.reader(f)
 for l in reader:
 l = [s if ';' not in s else ';'.join(_[:2] for _ in s.split(';')) for s in l]
 print(','.join(i if not i or i.isnumeric() else '""'.format(i) for i in l))

Usage:

python3 parse_csv.py Test.csv

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

Python's csv module provides robust and flexible support for csv data.

edited Oct 22 '17 at 10:32

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

add a commentÂ |Â

up vote
0
down vote

Accurate and robust Python 3.x solution (based on csv.reader object):

parse_csv.py script:

import csv, sys
with open(sys.argv[1]) as f:
 reader = csv.reader(f)
 for l in reader:
 l = [s if ';' not in s else ';'.join(_[:2] for _ in s.split(';')) for s in l]
 print(','.join(i if not i or i.isnumeric() else '""'.format(i) for i in l))

Usage:

python3 parse_csv.py Test.csv

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

Python's csv module provides robust and flexible support for csv data.

edited Oct 22 '17 at 10:32

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

add a commentÂ |Â

up vote
0
down vote

Accurate and robust Python 3.x solution (based on csv.reader object):

parse_csv.py script:

import csv, sys
with open(sys.argv[1]) as f:
 reader = csv.reader(f)
 for l in reader:
 l = [s if ';' not in s else ';'.join(_[:2] for _ in s.split(';')) for s in l]
 print(','.join(i if not i or i.isnumeric() else '""'.format(i) for i in l))

Usage:

python3 parse_csv.py Test.csv

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

Python's csv module provides robust and flexible support for csv data.

edited Oct 22 '17 at 10:32

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

Accurate and robust Python 3.x solution (based on csv.reader object):

parse_csv.py script:

import csv, sys
with open(sys.argv[1]) as f:
 reader = csv.reader(f)
 for l in reader:
 l = [s if ';' not in s else ';'.join(_[:2] for _ in s.split(';')) for s in l]
 print(','.join(i if not i or i.isnumeric() else '""'.format(i) for i in l))

Usage:

python3 parse_csv.py Test.csv

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

Python's csv module provides robust and flexible support for csv data.

edited Oct 22 '17 at 10:32

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

edited Oct 22 '17 at 10:32

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

answered Oct 22 '17 at 10:13

RomanPerekhrest

22.5k12145

add a commentÂ |Â

up vote
0
down vote

Complex GNU AWK solution (parsing csv data):

awk -v FPAT='"[^"]+"|[^",]+|,,' ' 
 for (i=1;i<=NF;i++) 
 if ($i~/^".*;./) 
 len=split($i,a,";"); v=substr(a[1],1,3); 
 for (j=2;j<=len;j++) v= v";"substr(a[j],1,2);
 v=v"42"
 printf "%s%s",(v? v: ($i~/^,,/? (i==NF? ",":""):$i )),
 (i==NF? ORS:OFS); v="" 
 
 ' OFS=',' Test.csv

FPAT='"[^"]+"|[^",]+|,,' - complex regex pattern defining field value

if ($i~/^".*;./) ... - if the current field $i contains ; character(s)

len=split($i,a,";") - split the field value $i into array a by separator ;. len is assigned with number of elements/chunks created

v=substr(a[1],1,3); - capturing the first chunk of the needed length including leading " char, for ex. "AB will be extracted from "ABCD,EF

for (j=2;j<=len;j++) ... - iterating through remaining chunks/items

v=v"42" - add trailing double quote " to the processed sequence v. 43 is ASCII octal code representing the double quote char ".

($i~/^,,/? (i==NF? ",":""):$i ) - each empty field ,, is recreated with single comma , and common delimiter (also ,). This is to avoid redundant comma cluttering like "pav",,,

(i==NF? ORS:OFS) - on encountering the last field i==NF - print output record separator ORS, otherwise - print output filed separator OFS

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

edited Oct 22 '17 at 10:38

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

Thanks Roman, Works perfect. Could you help me in understanding bit better, like the 042 and arrays look bit complex.
â€“Â Pavan
Oct 22 '17 at 8:25

@Pavan, welcome, see my explanation
â€“Â RomanPerekhrest
Oct 22 '17 at 10:30

add a commentÂ |Â

up vote
0
down vote

Complex GNU AWK solution (parsing csv data):

awk -v FPAT='"[^"]+"|[^",]+|,,' ' 
 for (i=1;i<=NF;i++) 
 if ($i~/^".*;./) 
 len=split($i,a,";"); v=substr(a[1],1,3); 
 for (j=2;j<=len;j++) v= v";"substr(a[j],1,2);
 v=v"42"
 printf "%s%s",(v? v: ($i~/^,,/? (i==NF? ",":""):$i )),
 (i==NF? ORS:OFS); v="" 
 
 ' OFS=',' Test.csv

FPAT='"[^"]+"|[^",]+|,,' - complex regex pattern defining field value

if ($i~/^".*;./) ... - if the current field $i contains ; character(s)

len=split($i,a,";") - split the field value $i into array a by separator ;. len is assigned with number of elements/chunks created

v=substr(a[1],1,3); - capturing the first chunk of the needed length including leading " char, for ex. "AB will be extracted from "ABCD,EF

for (j=2;j<=len;j++) ... - iterating through remaining chunks/items

v=v"42" - add trailing double quote " to the processed sequence v. 43 is ASCII octal code representing the double quote char ".

($i~/^,,/? (i==NF? ",":""):$i ) - each empty field ,, is recreated with single comma , and common delimiter (also ,). This is to avoid redundant comma cluttering like "pav",,,

(i==NF? ORS:OFS) - on encountering the last field i==NF - print output record separator ORS, otherwise - print output filed separator OFS

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

edited Oct 22 '17 at 10:38

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

Thanks Roman, Works perfect. Could you help me in understanding bit better, like the 042 and arrays look bit complex.
â€“Â Pavan
Oct 22 '17 at 8:25

@Pavan, welcome, see my explanation
â€“Â RomanPerekhrest
Oct 22 '17 at 10:30

add a commentÂ |Â

up vote
0
down vote

Complex GNU AWK solution (parsing csv data):

awk -v FPAT='"[^"]+"|[^",]+|,,' ' 
 for (i=1;i<=NF;i++) 
 if ($i~/^".*;./) 
 len=split($i,a,";"); v=substr(a[1],1,3); 
 for (j=2;j<=len;j++) v= v";"substr(a[j],1,2);
 v=v"42"
 printf "%s%s",(v? v: ($i~/^,,/? (i==NF? ",":""):$i )),
 (i==NF? ORS:OFS); v="" 
 
 ' OFS=',' Test.csv

FPAT='"[^"]+"|[^",]+|,,' - complex regex pattern defining field value

if ($i~/^".*;./) ... - if the current field $i contains ; character(s)

len=split($i,a,";") - split the field value $i into array a by separator ;. len is assigned with number of elements/chunks created

v=substr(a[1],1,3); - capturing the first chunk of the needed length including leading " char, for ex. "AB will be extracted from "ABCD,EF

for (j=2;j<=len;j++) ... - iterating through remaining chunks/items

v=v"42" - add trailing double quote " to the processed sequence v. 43 is ASCII octal code representing the double quote char ".

($i~/^,,/? (i==NF? ",":""):$i ) - each empty field ,, is recreated with single comma , and common delimiter (also ,). This is to avoid redundant comma cluttering like "pav",,,

(i==NF? ORS:OFS) - on encountering the last field i==NF - print output record separator ORS, otherwise - print output filed separator OFS

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

edited Oct 22 '17 at 10:38

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

Complex GNU AWK solution (parsing csv data):

awk -v FPAT='"[^"]+"|[^",]+|,,' ' 
 for (i=1;i<=NF;i++) 
 if ($i~/^".*;./) 
 len=split($i,a,";"); v=substr(a[1],1,3); 
 for (j=2;j<=len;j++) v= v";"substr(a[j],1,2);
 v=v"42"
 printf "%s%s",(v? v: ($i~/^,,/? (i==NF? ",":""):$i )),
 (i==NF? ORS:OFS); v="" 
 
 ' OFS=',' Test.csv

FPAT='"[^"]+"|[^",]+|,,' - complex regex pattern defining field value

if ($i~/^".*;./) ... - if the current field $i contains ; character(s)

len=split($i,a,";") - split the field value $i into array a by separator ;. len is assigned with number of elements/chunks created

v=substr(a[1],1,3); - capturing the first chunk of the needed length including leading " char, for ex. "AB will be extracted from "ABCD,EF

for (j=2;j<=len;j++) ... - iterating through remaining chunks/items

v=v"42" - add trailing double quote " to the processed sequence v. 43 is ASCII octal code representing the double quote char ".

($i~/^,,/? (i==NF? ",":""):$i ) - each empty field ,, is recreated with single comma , and common delimiter (also ,). This is to avoid redundant comma cluttering like "pav",,,

(i==NF? ORS:OFS) - on encountering the last field i==NF - print output record separator ORS, otherwise - print output filed separator OFS

The output:

"pav",12345,"AB;xy;15",,
"xyz",,"C4;x2;rt",,

edited Oct 22 '17 at 10:38

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

edited Oct 22 '17 at 10:38

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

answered Oct 22 '17 at 7:45

RomanPerekhrest

22.5k12145

Thanks Roman, Works perfect. Could you help me in understanding bit better, like the 042 and arrays look bit complex.
â€“Â Pavan
Oct 22 '17 at 8:25

@Pavan, welcome, see my explanation
â€“Â RomanPerekhrest
Oct 22 '17 at 10:30

add a commentÂ |Â

Thanks Roman, Works perfect. Could you help me in understanding bit better, like the 042 and arrays look bit complex.
â€“Â Pavan
Oct 22 '17 at 8:25

@Pavan, welcome, see my explanation
â€“Â RomanPerekhrest
Oct 22 '17 at 10:30

Thanks Roman, Works perfect. Could you help me in understanding bit better, like the 042 and arrays look bit complex.
â€“Â Pavan
Oct 22 '17 at 8:25

@Pavan, welcome, see my explanation
â€“Â RomanPerekhrest
Oct 22 '17 at 10:30

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

mjhjmtu