Having trouble parsing a csv file from the commandline

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP












2















I have a CSV file that I've been working on for most of the day and I'm not having any luck getting it parsed properly using a Regex with awk.



awk is not processing the Regex as expected.



Here are the inputs:



  • GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)

  • Regex: /([(.*?)])|[^,]+/g

  • Sample text hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3],[recipe1, recipe2, recipe3],2019-01-10 06:06:31

  • Raw text (before stripping out double-quotes, which I do at a step not listed explicitly in this question):
    hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31

When I run this through Regexr.com, it shows the proper matches:
Screencap from Regexr



I pipe from cat -> sed -> awk (the sample text above is what comes out of sed) and run the following command (I only want the first 9 fields, which includes the entirety of the first field enclosed in , but not anything after that):



awk '/([(.*?)])|[^,]+/gprint $1,$2,$3,$4,$5,$6,$7,$8,$9'


What I'm expecting to see as an output:
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



NOTE: The important part about this is treating the field with the roles (between the brackets) as a single field, or at least including all of the roles in the output, but none of the recipes)



What I'm actually getting out is the full line that was fed in.



From playing around with the variables, I found the following field assigments coming out of awk:



  • $1 = hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1,

  • $2 = role2,

  • $3 = role3],[recipe1,

  • $4 = recipe2,

  • $5 = recipe3],2019-01-10

  • $6 = 06:06:31

I've tried using the accepted answer from This Stack Overflow question outright, and I've tried tweaking it to use as the delimiters instead of ", and that gets me closer, but it's still not treating the role field as a single field.










share|improve this question
























  • What do you intend for the g after the regular expression to mean? This would be a syntax error in an extended regular expression (which awk uses).

    – Kusalananda
    Jan 11 at 20:31











  • Actually, what are you trying to do with the regex generally? Which lines do you want to print?

    – terdon
    Jan 11 at 20:32











  • I only included the g because Regexr did, in the current state, it doesn't change the output at all. @terdon - I want it to print all of hte fields up to (and including) the field that contains the 3 roles, but nothing after that

    – Taegost
    Jan 11 at 20:34











  • @Taegost OK, but what is the regex doing? Your print statement already does that.

    – terdon
    Jan 11 at 20:38















2















I have a CSV file that I've been working on for most of the day and I'm not having any luck getting it parsed properly using a Regex with awk.



awk is not processing the Regex as expected.



Here are the inputs:



  • GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)

  • Regex: /([(.*?)])|[^,]+/g

  • Sample text hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3],[recipe1, recipe2, recipe3],2019-01-10 06:06:31

  • Raw text (before stripping out double-quotes, which I do at a step not listed explicitly in this question):
    hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31

When I run this through Regexr.com, it shows the proper matches:
Screencap from Regexr



I pipe from cat -> sed -> awk (the sample text above is what comes out of sed) and run the following command (I only want the first 9 fields, which includes the entirety of the first field enclosed in , but not anything after that):



awk '/([(.*?)])|[^,]+/gprint $1,$2,$3,$4,$5,$6,$7,$8,$9'


What I'm expecting to see as an output:
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



NOTE: The important part about this is treating the field with the roles (between the brackets) as a single field, or at least including all of the roles in the output, but none of the recipes)



What I'm actually getting out is the full line that was fed in.



From playing around with the variables, I found the following field assigments coming out of awk:



  • $1 = hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1,

  • $2 = role2,

  • $3 = role3],[recipe1,

  • $4 = recipe2,

  • $5 = recipe3],2019-01-10

  • $6 = 06:06:31

I've tried using the accepted answer from This Stack Overflow question outright, and I've tried tweaking it to use as the delimiters instead of ", and that gets me closer, but it's still not treating the role field as a single field.










share|improve this question
























  • What do you intend for the g after the regular expression to mean? This would be a syntax error in an extended regular expression (which awk uses).

    – Kusalananda
    Jan 11 at 20:31











  • Actually, what are you trying to do with the regex generally? Which lines do you want to print?

    – terdon
    Jan 11 at 20:32











  • I only included the g because Regexr did, in the current state, it doesn't change the output at all. @terdon - I want it to print all of hte fields up to (and including) the field that contains the 3 roles, but nothing after that

    – Taegost
    Jan 11 at 20:34











  • @Taegost OK, but what is the regex doing? Your print statement already does that.

    – terdon
    Jan 11 at 20:38













2












2








2








I have a CSV file that I've been working on for most of the day and I'm not having any luck getting it parsed properly using a Regex with awk.



awk is not processing the Regex as expected.



Here are the inputs:



  • GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)

  • Regex: /([(.*?)])|[^,]+/g

  • Sample text hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3],[recipe1, recipe2, recipe3],2019-01-10 06:06:31

  • Raw text (before stripping out double-quotes, which I do at a step not listed explicitly in this question):
    hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31

When I run this through Regexr.com, it shows the proper matches:
Screencap from Regexr



I pipe from cat -> sed -> awk (the sample text above is what comes out of sed) and run the following command (I only want the first 9 fields, which includes the entirety of the first field enclosed in , but not anything after that):



awk '/([(.*?)])|[^,]+/gprint $1,$2,$3,$4,$5,$6,$7,$8,$9'


What I'm expecting to see as an output:
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



NOTE: The important part about this is treating the field with the roles (between the brackets) as a single field, or at least including all of the roles in the output, but none of the recipes)



What I'm actually getting out is the full line that was fed in.



From playing around with the variables, I found the following field assigments coming out of awk:



  • $1 = hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1,

  • $2 = role2,

  • $3 = role3],[recipe1,

  • $4 = recipe2,

  • $5 = recipe3],2019-01-10

  • $6 = 06:06:31

I've tried using the accepted answer from This Stack Overflow question outright, and I've tried tweaking it to use as the delimiters instead of ", and that gets me closer, but it's still not treating the role field as a single field.










share|improve this question
















I have a CSV file that I've been working on for most of the day and I'm not having any luck getting it parsed properly using a Regex with awk.



awk is not processing the Regex as expected.



Here are the inputs:



  • GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)

  • Regex: /([(.*?)])|[^,]+/g

  • Sample text hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3],[recipe1, recipe2, recipe3],2019-01-10 06:06:31

  • Raw text (before stripping out double-quotes, which I do at a step not listed explicitly in this question):
    hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31

When I run this through Regexr.com, it shows the proper matches:
Screencap from Regexr



I pipe from cat -> sed -> awk (the sample text above is what comes out of sed) and run the following command (I only want the first 9 fields, which includes the entirety of the first field enclosed in , but not anything after that):



awk '/([(.*?)])|[^,]+/gprint $1,$2,$3,$4,$5,$6,$7,$8,$9'


What I'm expecting to see as an output:
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



NOTE: The important part about this is treating the field with the roles (between the brackets) as a single field, or at least including all of the roles in the output, but none of the recipes)



What I'm actually getting out is the full line that was fed in.



From playing around with the variables, I found the following field assigments coming out of awk:



  • $1 = hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1,

  • $2 = role2,

  • $3 = role3],[recipe1,

  • $4 = recipe2,

  • $5 = recipe3],2019-01-10

  • $6 = 06:06:31

I've tried using the accepted answer from This Stack Overflow question outright, and I've tried tweaking it to use as the delimiters instead of ", and that gets me closer, but it's still not treating the role field as a single field.







awk regular-expression csv






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 11 at 22:13









Rui F Ribeiro

39.6k1479132




39.6k1479132










asked Jan 11 at 20:22









TaegostTaegost

1155




1155












  • What do you intend for the g after the regular expression to mean? This would be a syntax error in an extended regular expression (which awk uses).

    – Kusalananda
    Jan 11 at 20:31











  • Actually, what are you trying to do with the regex generally? Which lines do you want to print?

    – terdon
    Jan 11 at 20:32











  • I only included the g because Regexr did, in the current state, it doesn't change the output at all. @terdon - I want it to print all of hte fields up to (and including) the field that contains the 3 roles, but nothing after that

    – Taegost
    Jan 11 at 20:34











  • @Taegost OK, but what is the regex doing? Your print statement already does that.

    – terdon
    Jan 11 at 20:38

















  • What do you intend for the g after the regular expression to mean? This would be a syntax error in an extended regular expression (which awk uses).

    – Kusalananda
    Jan 11 at 20:31











  • Actually, what are you trying to do with the regex generally? Which lines do you want to print?

    – terdon
    Jan 11 at 20:32











  • I only included the g because Regexr did, in the current state, it doesn't change the output at all. @terdon - I want it to print all of hte fields up to (and including) the field that contains the 3 roles, but nothing after that

    – Taegost
    Jan 11 at 20:34











  • @Taegost OK, but what is the regex doing? Your print statement already does that.

    – terdon
    Jan 11 at 20:38
















What do you intend for the g after the regular expression to mean? This would be a syntax error in an extended regular expression (which awk uses).

– Kusalananda
Jan 11 at 20:31





What do you intend for the g after the regular expression to mean? This would be a syntax error in an extended regular expression (which awk uses).

– Kusalananda
Jan 11 at 20:31













Actually, what are you trying to do with the regex generally? Which lines do you want to print?

– terdon
Jan 11 at 20:32





Actually, what are you trying to do with the regex generally? Which lines do you want to print?

– terdon
Jan 11 at 20:32













I only included the g because Regexr did, in the current state, it doesn't change the output at all. @terdon - I want it to print all of hte fields up to (and including) the field that contains the 3 roles, but nothing after that

– Taegost
Jan 11 at 20:34





I only included the g because Regexr did, in the current state, it doesn't change the output at all. @terdon - I want it to print all of hte fields up to (and including) the field that contains the 3 roles, but nothing after that

– Taegost
Jan 11 at 20:34













@Taegost OK, but what is the regex doing? Your print statement already does that.

– terdon
Jan 11 at 20:38





@Taegost OK, but what is the regex doing? Your print statement already does that.

– terdon
Jan 11 at 20:38










2 Answers
2






active

oldest

votes


















1














By default, awk will use whitespace to define fields, which explains why you're getting the output you see. Since you want to use a comma to separate the fields, you need to say so with -F:



awk -F, '...' 


To have awk print comma separated output, you need to set the OFS variable:



awk -F, -vOFS=, '...' 


The real difficulty here is that you are then trying to treat [role1, role2, role3] as a single field, but that's 3 fields. There are commas there, so that will be split into [role1, role2 and role3]. If you know there will always be exactly 3 fields there, it's easy:



$ awk -F, -vOFS=, 'print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



However, based on the raw data you have now added, while a proper CSV parser will always be the better approach, you can still do it in awk. Just run this on your original input data:



$ awk -F']' -vOFS=, 'gsub(/"/,"");print $1"]"' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]


The trick is to use ] as the field separator and tell awk to only print the 1st field. This will print everything up to the first ]. We then add back the ] (since that is removed when the fields are built). The gsub removes all the quotes.






share|improve this answer

























  • That field with the roles and the recipes are a single data field, which is the problem. I stripped out the quotation marks that were in it initially because I couldn't get any Regex pattern to treat it as a single field, but still separate the rest on a comma. I'll update the question with the raw data, that may help. It'll take me a few minutes

    – Taegost
    Jan 11 at 20:36











  • @Taegost they might be for csv, but they aren't for awk.

    – terdon
    Jan 11 at 20:39











  • Correct, which is the entire point of the regex: To separate the fields by a comma, except for the ones that include multiple entries (the role and recipe fields), where everything within the brackets needs to be treated as a single field

    – Taegost
    Jan 11 at 20:42











  • If you're using the same character to delimit data and also as part of the data, you're going to have a bad time.

    – DopeGhoti
    Jan 11 at 20:52











  • @Taegost given your input data, you can use the fact that you only want the text until the first ]. But a dedicated CSV parser is a safer approach.

    – terdon
    Jan 11 at 21:24


















4














If you're dealing with a complex CSV file - in particular, one whose fields may contain quoted delimiters (in this case, commas) then a proper CSV parser is going to save a lot of headaches e.g. with csvtool



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 -
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]"


or (to remove the quotes)



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 - | tr -d '"'
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



If you can't obtain a standalone CSV parser such as csvtool, then both Perl and Python have CSV modules e.g.



perl -MText::CSV -lpe '
BEGIN$p = Text::CSV->new()
$_ = join ",", map $_ = s/"//gr ($p->fields())[0..8] if $p->parse($_)
'





share|improve this answer

























  • This is an excellent answer and is definitely the right way to go. Unfortunately installing 3rd party tools is locked way down and I don't have Perl or Python (But I do have Ruby, didn't think to try that). Thank you for taking the time to write this up, in the future this will definitely be my preferred approach!

    – Taegost
    Jan 11 at 21:41











  • @Taegost if you figure out how to do it in Ruby, please add that as an answer!

    – steeldriver
    Jan 11 at 21:45










Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f494013%2fhaving-trouble-parsing-a-csv-file-from-the-commandline%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














By default, awk will use whitespace to define fields, which explains why you're getting the output you see. Since you want to use a comma to separate the fields, you need to say so with -F:



awk -F, '...' 


To have awk print comma separated output, you need to set the OFS variable:



awk -F, -vOFS=, '...' 


The real difficulty here is that you are then trying to treat [role1, role2, role3] as a single field, but that's 3 fields. There are commas there, so that will be split into [role1, role2 and role3]. If you know there will always be exactly 3 fields there, it's easy:



$ awk -F, -vOFS=, 'print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



However, based on the raw data you have now added, while a proper CSV parser will always be the better approach, you can still do it in awk. Just run this on your original input data:



$ awk -F']' -vOFS=, 'gsub(/"/,"");print $1"]"' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]


The trick is to use ] as the field separator and tell awk to only print the 1st field. This will print everything up to the first ]. We then add back the ] (since that is removed when the fields are built). The gsub removes all the quotes.






share|improve this answer

























  • That field with the roles and the recipes are a single data field, which is the problem. I stripped out the quotation marks that were in it initially because I couldn't get any Regex pattern to treat it as a single field, but still separate the rest on a comma. I'll update the question with the raw data, that may help. It'll take me a few minutes

    – Taegost
    Jan 11 at 20:36











  • @Taegost they might be for csv, but they aren't for awk.

    – terdon
    Jan 11 at 20:39











  • Correct, which is the entire point of the regex: To separate the fields by a comma, except for the ones that include multiple entries (the role and recipe fields), where everything within the brackets needs to be treated as a single field

    – Taegost
    Jan 11 at 20:42











  • If you're using the same character to delimit data and also as part of the data, you're going to have a bad time.

    – DopeGhoti
    Jan 11 at 20:52











  • @Taegost given your input data, you can use the fact that you only want the text until the first ]. But a dedicated CSV parser is a safer approach.

    – terdon
    Jan 11 at 21:24















1














By default, awk will use whitespace to define fields, which explains why you're getting the output you see. Since you want to use a comma to separate the fields, you need to say so with -F:



awk -F, '...' 


To have awk print comma separated output, you need to set the OFS variable:



awk -F, -vOFS=, '...' 


The real difficulty here is that you are then trying to treat [role1, role2, role3] as a single field, but that's 3 fields. There are commas there, so that will be split into [role1, role2 and role3]. If you know there will always be exactly 3 fields there, it's easy:



$ awk -F, -vOFS=, 'print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



However, based on the raw data you have now added, while a proper CSV parser will always be the better approach, you can still do it in awk. Just run this on your original input data:



$ awk -F']' -vOFS=, 'gsub(/"/,"");print $1"]"' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]


The trick is to use ] as the field separator and tell awk to only print the 1st field. This will print everything up to the first ]. We then add back the ] (since that is removed when the fields are built). The gsub removes all the quotes.






share|improve this answer

























  • That field with the roles and the recipes are a single data field, which is the problem. I stripped out the quotation marks that were in it initially because I couldn't get any Regex pattern to treat it as a single field, but still separate the rest on a comma. I'll update the question with the raw data, that may help. It'll take me a few minutes

    – Taegost
    Jan 11 at 20:36











  • @Taegost they might be for csv, but they aren't for awk.

    – terdon
    Jan 11 at 20:39











  • Correct, which is the entire point of the regex: To separate the fields by a comma, except for the ones that include multiple entries (the role and recipe fields), where everything within the brackets needs to be treated as a single field

    – Taegost
    Jan 11 at 20:42











  • If you're using the same character to delimit data and also as part of the data, you're going to have a bad time.

    – DopeGhoti
    Jan 11 at 20:52











  • @Taegost given your input data, you can use the fact that you only want the text until the first ]. But a dedicated CSV parser is a safer approach.

    – terdon
    Jan 11 at 21:24













1












1








1







By default, awk will use whitespace to define fields, which explains why you're getting the output you see. Since you want to use a comma to separate the fields, you need to say so with -F:



awk -F, '...' 


To have awk print comma separated output, you need to set the OFS variable:



awk -F, -vOFS=, '...' 


The real difficulty here is that you are then trying to treat [role1, role2, role3] as a single field, but that's 3 fields. There are commas there, so that will be split into [role1, role2 and role3]. If you know there will always be exactly 3 fields there, it's easy:



$ awk -F, -vOFS=, 'print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



However, based on the raw data you have now added, while a proper CSV parser will always be the better approach, you can still do it in awk. Just run this on your original input data:



$ awk -F']' -vOFS=, 'gsub(/"/,"");print $1"]"' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]


The trick is to use ] as the field separator and tell awk to only print the 1st field. This will print everything up to the first ]. We then add back the ] (since that is removed when the fields are built). The gsub removes all the quotes.






share|improve this answer















By default, awk will use whitespace to define fields, which explains why you're getting the output you see. Since you want to use a comma to separate the fields, you need to say so with -F:



awk -F, '...' 


To have awk print comma separated output, you need to set the OFS variable:



awk -F, -vOFS=, '...' 


The real difficulty here is that you are then trying to treat [role1, role2, role3] as a single field, but that's 3 fields. There are commas there, so that will be split into [role1, role2 and role3]. If you know there will always be exactly 3 fields there, it's easy:



$ awk -F, -vOFS=, 'print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



However, based on the raw data you have now added, while a proper CSV parser will always be the better approach, you can still do it in awk. Just run this on your original input data:



$ awk -F']' -vOFS=, 'gsub(/"/,"");print $1"]"' file
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]


The trick is to use ] as the field separator and tell awk to only print the 1st field. This will print everything up to the first ]. We then add back the ] (since that is removed when the fields are built). The gsub removes all the quotes.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 11 at 21:23

























answered Jan 11 at 20:33









terdonterdon

129k32253430




129k32253430












  • That field with the roles and the recipes are a single data field, which is the problem. I stripped out the quotation marks that were in it initially because I couldn't get any Regex pattern to treat it as a single field, but still separate the rest on a comma. I'll update the question with the raw data, that may help. It'll take me a few minutes

    – Taegost
    Jan 11 at 20:36











  • @Taegost they might be for csv, but they aren't for awk.

    – terdon
    Jan 11 at 20:39











  • Correct, which is the entire point of the regex: To separate the fields by a comma, except for the ones that include multiple entries (the role and recipe fields), where everything within the brackets needs to be treated as a single field

    – Taegost
    Jan 11 at 20:42











  • If you're using the same character to delimit data and also as part of the data, you're going to have a bad time.

    – DopeGhoti
    Jan 11 at 20:52











  • @Taegost given your input data, you can use the fact that you only want the text until the first ]. But a dedicated CSV parser is a safer approach.

    – terdon
    Jan 11 at 21:24

















  • That field with the roles and the recipes are a single data field, which is the problem. I stripped out the quotation marks that were in it initially because I couldn't get any Regex pattern to treat it as a single field, but still separate the rest on a comma. I'll update the question with the raw data, that may help. It'll take me a few minutes

    – Taegost
    Jan 11 at 20:36











  • @Taegost they might be for csv, but they aren't for awk.

    – terdon
    Jan 11 at 20:39











  • Correct, which is the entire point of the regex: To separate the fields by a comma, except for the ones that include multiple entries (the role and recipe fields), where everything within the brackets needs to be treated as a single field

    – Taegost
    Jan 11 at 20:42











  • If you're using the same character to delimit data and also as part of the data, you're going to have a bad time.

    – DopeGhoti
    Jan 11 at 20:52











  • @Taegost given your input data, you can use the fact that you only want the text until the first ]. But a dedicated CSV parser is a safer approach.

    – terdon
    Jan 11 at 21:24
















That field with the roles and the recipes are a single data field, which is the problem. I stripped out the quotation marks that were in it initially because I couldn't get any Regex pattern to treat it as a single field, but still separate the rest on a comma. I'll update the question with the raw data, that may help. It'll take me a few minutes

– Taegost
Jan 11 at 20:36





That field with the roles and the recipes are a single data field, which is the problem. I stripped out the quotation marks that were in it initially because I couldn't get any Regex pattern to treat it as a single field, but still separate the rest on a comma. I'll update the question with the raw data, that may help. It'll take me a few minutes

– Taegost
Jan 11 at 20:36













@Taegost they might be for csv, but they aren't for awk.

– terdon
Jan 11 at 20:39





@Taegost they might be for csv, but they aren't for awk.

– terdon
Jan 11 at 20:39













Correct, which is the entire point of the regex: To separate the fields by a comma, except for the ones that include multiple entries (the role and recipe fields), where everything within the brackets needs to be treated as a single field

– Taegost
Jan 11 at 20:42





Correct, which is the entire point of the regex: To separate the fields by a comma, except for the ones that include multiple entries (the role and recipe fields), where everything within the brackets needs to be treated as a single field

– Taegost
Jan 11 at 20:42













If you're using the same character to delimit data and also as part of the data, you're going to have a bad time.

– DopeGhoti
Jan 11 at 20:52





If you're using the same character to delimit data and also as part of the data, you're going to have a bad time.

– DopeGhoti
Jan 11 at 20:52













@Taegost given your input data, you can use the fact that you only want the text until the first ]. But a dedicated CSV parser is a safer approach.

– terdon
Jan 11 at 21:24





@Taegost given your input data, you can use the fact that you only want the text until the first ]. But a dedicated CSV parser is a safer approach.

– terdon
Jan 11 at 21:24













4














If you're dealing with a complex CSV file - in particular, one whose fields may contain quoted delimiters (in this case, commas) then a proper CSV parser is going to save a lot of headaches e.g. with csvtool



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 -
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]"


or (to remove the quotes)



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 - | tr -d '"'
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



If you can't obtain a standalone CSV parser such as csvtool, then both Perl and Python have CSV modules e.g.



perl -MText::CSV -lpe '
BEGIN$p = Text::CSV->new()
$_ = join ",", map $_ = s/"//gr ($p->fields())[0..8] if $p->parse($_)
'





share|improve this answer

























  • This is an excellent answer and is definitely the right way to go. Unfortunately installing 3rd party tools is locked way down and I don't have Perl or Python (But I do have Ruby, didn't think to try that). Thank you for taking the time to write this up, in the future this will definitely be my preferred approach!

    – Taegost
    Jan 11 at 21:41











  • @Taegost if you figure out how to do it in Ruby, please add that as an answer!

    – steeldriver
    Jan 11 at 21:45















4














If you're dealing with a complex CSV file - in particular, one whose fields may contain quoted delimiters (in this case, commas) then a proper CSV parser is going to save a lot of headaches e.g. with csvtool



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 -
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]"


or (to remove the quotes)



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 - | tr -d '"'
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



If you can't obtain a standalone CSV parser such as csvtool, then both Perl and Python have CSV modules e.g.



perl -MText::CSV -lpe '
BEGIN$p = Text::CSV->new()
$_ = join ",", map $_ = s/"//gr ($p->fields())[0..8] if $p->parse($_)
'





share|improve this answer

























  • This is an excellent answer and is definitely the right way to go. Unfortunately installing 3rd party tools is locked way down and I don't have Perl or Python (But I do have Ruby, didn't think to try that). Thank you for taking the time to write this up, in the future this will definitely be my preferred approach!

    – Taegost
    Jan 11 at 21:41











  • @Taegost if you figure out how to do it in Ruby, please add that as an answer!

    – steeldriver
    Jan 11 at 21:45













4












4








4







If you're dealing with a complex CSV file - in particular, one whose fields may contain quoted delimiters (in this case, commas) then a proper CSV parser is going to save a lot of headaches e.g. with csvtool



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 -
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]"


or (to remove the quotes)



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 - | tr -d '"'
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



If you can't obtain a standalone CSV parser such as csvtool, then both Perl and Python have CSV modules e.g.



perl -MText::CSV -lpe '
BEGIN$p = Text::CSV->new()
$_ = join ",", map $_ = s/"//gr ($p->fields())[0..8] if $p->parse($_)
'





share|improve this answer















If you're dealing with a complex CSV file - in particular, one whose fields may contain quoted delimiters (in this case, commas) then a proper CSV parser is going to save a lot of headaches e.g. with csvtool



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 -
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]"


or (to remove the quotes)



$ echo 'hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,"[""role1"", ""role2"", ""role3""]","[""recipe1"", ""recipe2"", ""recipe3""]",2019-01-10 06:06:31' | 
csvtool col 1-9 - | tr -d '"'
hostname,hostname.domain.com,hostname.domain.com,windows,6.2.9200,1.2.3,location,environment,[role1, role2, role3]



If you can't obtain a standalone CSV parser such as csvtool, then both Perl and Python have CSV modules e.g.



perl -MText::CSV -lpe '
BEGIN$p = Text::CSV->new()
$_ = join ",", map $_ = s/"//gr ($p->fields())[0..8] if $p->parse($_)
'






share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 11 at 21:33

























answered Jan 11 at 20:52









steeldriversteeldriver

35.7k35286




35.7k35286












  • This is an excellent answer and is definitely the right way to go. Unfortunately installing 3rd party tools is locked way down and I don't have Perl or Python (But I do have Ruby, didn't think to try that). Thank you for taking the time to write this up, in the future this will definitely be my preferred approach!

    – Taegost
    Jan 11 at 21:41











  • @Taegost if you figure out how to do it in Ruby, please add that as an answer!

    – steeldriver
    Jan 11 at 21:45

















  • This is an excellent answer and is definitely the right way to go. Unfortunately installing 3rd party tools is locked way down and I don't have Perl or Python (But I do have Ruby, didn't think to try that). Thank you for taking the time to write this up, in the future this will definitely be my preferred approach!

    – Taegost
    Jan 11 at 21:41











  • @Taegost if you figure out how to do it in Ruby, please add that as an answer!

    – steeldriver
    Jan 11 at 21:45
















This is an excellent answer and is definitely the right way to go. Unfortunately installing 3rd party tools is locked way down and I don't have Perl or Python (But I do have Ruby, didn't think to try that). Thank you for taking the time to write this up, in the future this will definitely be my preferred approach!

– Taegost
Jan 11 at 21:41





This is an excellent answer and is definitely the right way to go. Unfortunately installing 3rd party tools is locked way down and I don't have Perl or Python (But I do have Ruby, didn't think to try that). Thank you for taking the time to write this up, in the future this will definitely be my preferred approach!

– Taegost
Jan 11 at 21:41













@Taegost if you figure out how to do it in Ruby, please add that as an answer!

– steeldriver
Jan 11 at 21:45





@Taegost if you figure out how to do it in Ruby, please add that as an answer!

– steeldriver
Jan 11 at 21:45

















draft saved

draft discarded
















































Thanks for contributing an answer to Unix & Linux Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f494013%2fhaving-trouble-parsing-a-csv-file-from-the-commandline%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown






Popular posts from this blog

How to check contact read email or not when send email to Individual?

How many registers does an x86_64 CPU actually have?

Nur Jahan