Using sed to get specific text from file

Not sure why I'm not getting this. I've been searching and testing my command for a couple hours and I'm not getting anywhere.

The text is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><result expand="changes,testResults,metadata,logEntries,plan,vcsRevisions,artifacts,comments,labels,jiraIssues" key="EP-ED-JOB1-174" state="Failed" lifeCycleState="Finished" number="174" ....

I just want to pull out the ' state="Failed" ' part; it could also be ' state="Successful" '.

I've tried a million variations of this:

sed '/state=".*"/p' htmlResponse.txt

But with parens, escape slashes, etc., it still seems to match the entire chunk of text. What's wrong with my regex?

asked Oct 16 '17 at 15:31

Justin

You need to put a capture group around the part you want and use substitution to print only that portion. To avoid the greedy-match issue, in this case you can use [^"]* instead of .*. But really, you should use an XML parser instead of a regex.
– Sundeep
Oct 16 '17 at 15:39

If I do sed -n '/state="[^"]*/p' htmlResponse.html it still gives me back everything.
– Justin
Oct 16 '17 at 15:42

Use xmllint instead. Use the right tools for the right job.
– Valentin B
Oct 16 '17 at 15:48


3 Answers

accepted

Putting aside the obligatory "you should really be using a proper XML parser because regexes aren't powerful enough to parse XML" comment, I see two problems in your sed line:

".*" will match from the first " to the last, since . matches "

The sed command /.../p prints the whole line if it matches the regex, not just the part that matched.
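Both problems are easy to reproduce on a single line. The sketch below is only an illustration: it assumes a shortened, hypothetical version of the question's XML (the real response is longer), and the variable names are made up for the demo.

```shell
#!/bin/sh
# Hypothetical sample: a shortened stand-in for the XML in the question.
line='<result key="EP-ED-JOB1-174" state="Failed" lifeCycleState="Finished" number="174">'

# Problem 1: ".*" is greedy, so the match runs to the LAST quote on the line.
greedy=$(printf '%s\n' "$line" | grep -o 'state=".*"')

# Problem 2: /regex/p prints the whole matching line, and without -n sed
# also auto-prints every line, so the line comes out twice.
copies=$(printf '%s\n' "$line" | sed '/state=".*"/p' | wc -l | tr -d ' ')

printf 'greedy match: %s\n' "$greedy"
printf 'copies of the line printed by sed: %s\n' "$copies"
```

On this sample the greedy match swallows the lifeCycleState and number attributes as well, and sed emits the line twice.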

Here are two things I'd suggest for quick-and-dirty HTML-scraping shell scripts:

Use "[^"]*" to match "quote, any number of non-quote characters, end quote"

It's a lot easier to use grep -o to pull out just the bits of a file that match a regex

So that would make your command more like:

grep -o 'state="[^"]*"'

Or, if you really must use sed:

sed -n 's/.*\(state="[^"]*"\).*/\1/p'
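To see the two commands side by side, here is a small sketch. It assumes a hypothetical sample file written to a temporary path, standing in for htmlResponse.txt; the XML is a shortened version of what the question shows.

```shell
#!/bin/sh
# Hypothetical sample file; the real response is longer.
tmp=$(mktemp)
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?><result key="EP-ED-JOB1-174" state="Failed" lifeCycleState="Finished" number="174"/>' > "$tmp"

# grep -o prints only the matched text, not the whole line.
with_grep=$(grep -o 'state="[^"]*"' "$tmp")

# sed: replace the whole line with capture group 1; -n plus the p flag
# prints only lines where the substitution actually happened.
with_sed=$(sed -n 's/.*\(state="[^"]*"\).*/\1/p' "$tmp")

printf 'grep: %s\nsed:  %s\n' "$with_grep" "$with_sed"
rm -f "$tmp"
```

Both commands print state="Failed" for this sample. The match is case-sensitive, which is why lifeCycleState="Finished" is not picked up.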

edited Oct 16 '17 at 15:46

answered Oct 16 '17 at 15:41

wwoods

Thanks! I went with grep as the command just looks easier to type and understand.
– Justin
Oct 16 '17 at 16:16

The right way is to use XML parsers like xmlstarlet:

printf 'state="%s"\n' "$(xmlstarlet sel -t -v "//result/@state" -n htmlResponse.txt)"

The output:

state="Failed"

answered Oct 16 '17 at 15:59

RomanPerekhrest

You likely want to match the whole line and print just the matching group:

sed -r 's/.*state="([^"]*)".*/\1/' htmlResponse.txt

That actually just pulls out the Failed or Successful (without including the state= part that precedes it), which I suspect is what you want. But if you do need that, you can add it back easily, or use a slightly different regex, as in wwoods's answer.
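Two caveats with a command of this shape: without -n and the p flag, sed prints non-matching lines unchanged, and an unanchored state= can also match the tail of a longer attribute name. Here is a minimal sketch of a stricter variant. The buildstate attribute is hypothetical, invented only to show the difference, and -E is used in place of -r because both GNU and BSD sed accept it.

```shell
#!/bin/sh
# Hypothetical line: "buildstate" is invented for this demo.
line='<result state="Successful" buildstate="n/a" lifeCycleState="Finished">'

# Unanchored: the greedy .* skips ahead to the last 'state="', inside buildstate.
loose=$(printf '%s\n' "$line" | sed -n -E 's/.*state="([^"]*)".*/\1/p')

# Requiring a space before "state" keeps the match on the real attribute.
strict=$(printf '%s\n' "$line" | sed -n -E 's/.* state="([^"]*)".*/\1/p')

# With -n and p, a line without the attribute produces no output at all.
none=$(printf '%s\n' 'no attribute here' | sed -n -E 's/.* state="([^"]*)".*/\1/p')

printf 'loose=%s strict=%s none=[%s]\n' "$loose" "$strict" "$none"
```

Here the loose pattern returns n/a while the strict one returns Successful, which is one small example of how fragile regex matching on markup can be.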

However, as Sundeep mentions, it is not at all robust to parse HTML (or XML) with a regular expression. It's one thing to use grep or sed to search for things interactively, but if this is part of a script that needs to carry out an important task and actually work, you should parse the XML properly.
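As a rough idea of what parsing it properly could look like from a shell script, here is a sketch using xmllint, which was suggested in the comments on the question. It assumes xmllint from libxml2 is installed and that the response is a complete, well-formed document; the sample file below is a hypothetical stand-in, since the XML in the question is truncated.

```shell
#!/bin/sh
tmp=$(mktemp)
# Hypothetical well-formed sample; the real response has more attributes and children.
printf '%s\n' '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><result key="EP-ED-JOB1-174" state="Failed" lifeCycleState="Finished" number="174"/>' > "$tmp"

if command -v xmllint >/dev/null 2>&1; then
    # string(...) makes XPath return the attribute's value as plain text.
    state=$(xmllint --xpath 'string(/result/@state)' "$tmp")
    printf 'state="%s"\n' "$state"
else
    echo 'xmllint not found; install the libxml2 utilities to try this' >&2
fi
rm -f "$tmp"
```

Unlike the regex versions, this does not depend on attribute order, spacing, or which quote character the server happens to emit.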

answered Oct 16 '17 at 15:42

Eliah Kagan
