Field separator part of a column - incorrect parsing unix

I want to check the number of columns in a CSV file before processing it. The problem is that the delimiter (comma) also occurs in the text of some fields, so I cannot parse the file correctly and I get too many columns.

Eg:

~new file: 12345~,~125.5~,,,~ example (45), case (20)~,,

7 columns

~new file: 12345~

~125.5~

empty

empty

~ example (45), case (20)~

empty

empty

The problem is the comma inside ~ example (45), case (20)~ in the 5th column.
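To make the miscount concrete, here is a minimal sketch, assuming a POSIX awk is available, that splits the sample line on every comma and ignores the ~ quoting. It reports one column more than the 7 expected, because the comma inside the fifth field is counted as a separator. The variable names are only illustrative.

```shell
# Sample line from the question; ~ acts as the quote character.
line='~new file: 12345~,~125.5~,,,~ example (45), case (20)~,,'

# Naive split: every comma counts as a separator, quoted or not.
naive_count=$(printf '%s\n' "$line" | awk -F, '{ print NF }')
echo "$naive_count"
```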

I tried replacing the delimiter , with ; using sed, but it took more than one pass.

I would like a general rule that handles multiple cases in a more efficient way.

edited 9 mins ago

Kusalananda

109k14211334

asked 1 hour ago

Mathew Linton

New contributor

How do you know which commas are part of the data and which are field separators? Proper CSV uses double quotes to surround such fields.
– roaima
1 hour ago

This is a text file extracted from an application, and the separator was set to comma when the extract was made.
– Mathew Linton
57 mins ago

@MathewLinton Thank you, Mat. Are columns 3, 4, 6 and 7 empty in all the rows of your data? If not, and I assume NOT, then the column command is more than enough. If YES, then you don't have seven columns and the column command is more than enough ;-)
– Goro
24 mins ago

It looks like the ~ character is the quoting character (so ~hello, world~ is one field). Is that correct?
– roaima
4 mins ago

@Goro In your example you remove all the commas. In the 5th column I want to keep the comma, since it is part of the column value ~ example (45), case (20)~ and I don't want to alter the data.
– Mathew Linton
1 min ago

text-processing awk sed csv

3 Answers

Assuming , is the column delimiter, I would just run the column command:

echo "~new file: 12345~,~125.5~,,,~ example (45), case (20)~,," | column -t -s','

column -t -s',' file

output:

~new file: 12345~ ~125.5~ ~ example (45) case (20)~

answered 1 hour ago

Goro

8,24354182

I think you misunderstood the question. Here ~ example (45), case (20)~ is a single column, but column is splitting it into two columns. As I see it, each non-empty field has the form ,~...~, and a comma can also occur within a field, as in ,~something with comma, and rest~,. The ~ is used in place of the quote character ".
– sddgob
1 hour ago

No, we have 7 columns. This is an example of the output that can be used: ~new file: 12345~;~125.5~;;;~ example (45), case(20)~;;
– Mathew Linton
56 mins ago

I amended the question with each column value
– Mathew Linton
42 mins ago
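The semicolon-delimited output requested in the comment above could be produced in a single pass. The following is only a sketch, not something posted in the thread: commas_to_semicolons is a hypothetical helper name, and it assumes that ~ never occurs inside a field and that no field spans several lines.

```shell
# Hypothetical helper (not from the thread): turn unquoted commas into
# semicolons and leave commas inside ~...~ untouched.
# Assumes ~ never appears inside a field and fields do not span lines.
commas_to_semicolons() {
    awk 'BEGIN { FS = OFS = "~" }
         {
             # Splitting on ~ puts the text outside the quotes in the
             # odd-numbered fields and the quoted text in the even ones.
             for (i = 1; i <= NF; i += 2)
                 gsub(/,/, ";", $i)
             print
         }' "$@"
}

# Usage: read from standard input, or pass file names as arguments.
printf '%s\n' '~new file: 12345~,~125.5~,,,~ example (45), case (20)~,,' |
    commas_to_semicolons
```

Because the field and output separators are both ~, rebuilding the line puts every tilde back where it was, so only the unquoted commas change.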


Using awk you would do:

awk -F, '{ gsub(/~[^~]*~/,""); print NF }' infile

for an input like:

~new file: 12345~,~125.5~,,,~ example (45), case (20)~,,
,~125.5~,,,~ example (45), case (20)~

It will return:

7
5

In gsub(/~[^~]*~/,""), we replace every span that starts at a ~ and runs to the next ~ (like ~...~) with an empty string; see below:

awk -F, '{ gsub(/~[^~]*~/,""); print $0 }' infile
,,,,,,
,,,,

This assumes that there is no inner ~, as in ,~some~thing~, in your input.

Then print NF prints the number of fields according to the field separator specified with -F.
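Since the goal in the question is to check the column count before processing, the same gsub idea can be wrapped in a check that fails when any line has an unexpected number of fields. This is a rough sketch under the same assumption (no inner ~); check_columns and its expected argument are hypothetical names, not part of the answer above.

```shell
# Hypothetical helper: succeed only if every input line has exactly the
# expected number of columns. Assumes there is no ~ inside a quoted field.
check_columns() {
    expected=$1
    shift
    awk -F, -v expected="$expected" '
        {
            gsub(/~[^~]*~/, "")   # drop quoted fields and the commas in them
            if (NF != expected) {
                printf "line %d: %d columns, expected %d\n", NR, NF, expected
                bad = 1
            }
        }
        END { exit bad + 0 }' "$@"
}

# Usage: the second sample line has only 5 columns, so the check fails.
printf '%s\n' \
    '~new file: 12345~,~125.5~,,,~ example (45), case (20)~,,' \
    ',~125.5~,,,~ example (45), case (20)~' |
    check_columns 7 || echo "column count mismatch"
```

The non-zero exit status lets a calling script skip processing when the file does not have the expected layout.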

edited 33 mins ago

community wiki

3 revs
αғsнιη


This looks like a CSV file that uses commas as field delimiters and a tilde as the quoting character.

Using a proper CSV parser, like the one provided by the Text::CSV Perl module:

perl -MText::CSV -e 'print scalar(@{Text::CSV->new({quote_char=>"~"})->getline(*STDIN)})' <file.csv

This would read the first line of the CSV file file.csv and print the number of columns in it. We instantiate a parser that understands that the quote character is a tilde before reading the first line with this parser. The getline() method on this parser would read a line from the given filehandle and return a reference to an array of data, one item per parsed column. The print scalar(...) is a fairly common way to print the length of an array in Perl.

Another way, using the CSVKit command line CSV parser toolkit:

csvstat -n -q '~' <file.csv | wc -l

or equivalently, using long options,

csvstat --names --quotechar '~' <file.csv | wc -l

This would likewise read the first line of the input file and return a listing of the headers (the first line of a CSV file usually contains column headers), one per line. The wc -l counts the number of lines returned.

When you later parse the CSV file, I suggest that you use one of these approaches, or look for a proper parser in the programming language that you are most used to. awk and sed may be used on simple CSV data, but in this case your data uses some CSV format features that these tools have difficulty coping with unless great care is taken.

edited 5 mins ago

answered 19 mins ago

Kusalananda

109k14211334
