Using sed to get specific text from file
























Not sure why I'm not getting this. I've been searching and testing my command for a couple hours and I'm not getting anywhere.



The text is:



<?xml version="1.0" encoding="UTF-8" standalone="yes"?><result expand="changes,testResults,metadata,logEntries,plan,vcsRevisions,artifacts,comments,labels,jiraIssues" key="EP-ED-JOB1-174" state="Failed" lifeCycleState="Finished" number="174" ....


And I just want to pull out the ' state="Failed" ' part; it could also be ' state="Successful" '



I've tried a million variations of this:



sed '/state=".*"/p' htmlResponse.txt


But parens, escape slashes, etc. all seem to match the entire chunk of text. What's wrong with my regex?







asked Oct 16 '17 at 15:31 by Justin











  • You need to put a capture group around what you want and use substitution to print only that portion. To avoid the greedy-match issue, in this case you can use [^"]* instead of .* ... but really, you should use an XML parser instead of a regex.
    – Sundeep
    Oct 16 '17 at 15:39











  • If I do sed -n '/state="[^"]*/p' htmlResponse.html it still gives me back everything.
    – Justin
    Oct 16 '17 at 15:42










  • Use xmllint instead. Use the right tools for the right job.
    – Valentin B
    Oct 16 '17 at 15:48
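To illustrate the exchange in the comments above, here is a minimal sketch of why the /regex/p attempt still returns everything even with [^"]*. It assumes a shortened two-line stand-in for htmlResponse.txt; the sample text and the variable names are made up for illustration.

```shell
# Hypothetical two-line input standing in for htmlResponse.txt.
input='<result key="EP-ED-JOB1-174" state="Failed" number="174">
<other line="without the attribute">'

# /re/p means "print the line if it matches": the regex only SELECTS lines,
# so the whole first line comes back, however tight the regex is.
selected=$(printf '%s\n' "$input" | sed -n '/state="[^"]*"/p')

# s/re/replacement/p rewrites the line before printing it, so only the
# captured text survives. \( \) is a basic-regex group and \1 refers to it.
extracted=$(printf '%s\n' "$input" | sed -n 's/.*\(state="[^"]*"\).*/\1/p')

printf 'selected:  %s\n' "$selected"
printf 'extracted: %s\n' "$extracted"
```

The first command prints the full matching line; the second prints only state="Failed".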


























3 Answers



      accepted






Putting aside the obligatory "you should really be using a proper XML parser because regexes aren't powerful enough to parse XML" comment, I see two problems in your sed line:

  1. ".*" will match from the first " to the last one on the line, since . also matches " and the match is as long as possible.

  2. The sed command /.../p prints the whole line if it matches the regex.

Here are two things I'd suggest for quick-and-dirty HTML-scraping shell scripts:

  1. Use "[^"]*" to match "quote, any number of non-quote characters, end quote".

  2. It's a lot easier to use grep -o to pull out the bits of a file that match a regex.

So that would make your command more like:

grep -o 'state="[^"]*"'

Or, if you really must use sed:

sed -n 's/.*\(state="[^"]*"\).*/\1/p'
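The difference between the greedy pattern and the negated character class can be seen side by side in a small sketch. The sample line is a shortened stand-in modelled on the question's input (the real file is longer), and the variable names are illustrative only. Note that grep -o is a GNU/BSD extension rather than a POSIX option.

```shell
# Shortened sample line modelled on the question's input.
sample='<result key="EP-ED-JOB1-174" state="Failed" lifeCycleState="Finished" number="174">'

# Greedy: .* runs on to the LAST double quote on the line.
greedy=$(printf '%s\n' "$sample" | grep -o 'state=".*"')

# Negated class: [^"]* cannot cross a quote, so the match stops at the
# first closing quote after state=".
bounded=$(printf '%s\n' "$sample" | grep -o 'state="[^"]*"')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

With this sample the greedy form yields state="Failed" lifeCycleState="Finished" number="174", while the bounded form yields just state="Failed". The pattern is case-sensitive, so lifeCycleState= is not picked up as a second match.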






answered Oct 16 '17 at 15:41 by wwoods, edited Oct 16 '17 at 15:46











      • Thanks! I went with grep as the command just looks easier to type and understand.
        – Justin
        Oct 16 '17 at 16:16
















The right way is to use an XML parser such as xmlstarlet:

printf 'state="%s"\n' "$(xmlstarlet sel -t -v "//result/@state" -n htmlResponse.txt)"

The output:

state="Failed"
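As a hedged sketch of the parser-based approach, the helper below tries xmlstarlet and falls back to xmllint (the tool suggested in the comments). The function name get_state_xml is made up for illustration, and the sample document is a shortened, well-formed stand-in for the truncated XML in the question, assuming result is the root element.

```shell
# get_state_xml FILE: print the state attribute of the root <result> element.
# Tries xmlstarlet first, then xmllint; returns 127 if neither is installed.
get_state_xml() {
    if command -v xmlstarlet >/dev/null 2>&1; then
        xmlstarlet sel -t -v '/result/@state' -n "$1"
    elif command -v xmllint >/dev/null 2>&1; then
        xmllint --xpath 'string(/result/@state)' "$1"
    else
        echo 'get_state_xml: need xmlstarlet or xmllint' >&2
        return 127
    fi
}

# Shortened, well-formed stand-in for htmlResponse.txt.
tmp=$(mktemp)
printf '%s\n' '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><result key="EP-ED-JOB1-174" state="Failed" lifeCycleState="Finished" number="174"/>' > "$tmp"

if state=$(get_state_xml "$tmp"); then
    printf 'state="%s"\n' "$state"
fi
rm -f "$tmp"
```

Unlike the regex versions, this does not depend on attribute order, quoting style, or everything being on one line.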






answered Oct 16 '17 at 15:59 by RomanPerekhrest




















You likely want to match the whole line and print just the matching group:

sed -r 's/.*state="([^"]*)".*/\1/' htmlResponse.txt

That actually just pulls out the Failed or Successful (without including the state= part that precedes it), which I suspect is what you want. But if you do need that part, you can add it back easily, or use a slightly different regex, as in wwoods's answer.

However, as Sundeep mentions, it is not at all robust to parse HTML (or XML) with a regular expression. It's one thing to use grep or sed to search for things interactively, but if this is part of a script that needs to carry out an important task and actually work, you should parse the XML properly.
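If the bare value is going to drive a script, one possible shape is sketched below. It adds -n and the p flag so that lines without the attribute are dropped rather than passed through unchanged. The function name get_state, the sample input, and the "build passed/failed" wording are assumptions for illustration, not part of the original question. -E is the spelling of -r that both GNU and BSD sed accept.

```shell
# get_state FILE: print only the value of state="..." for lines that have one.
get_state() {
    sed -nE 's/.*state="([^"]*)".*/\1/p' "$1"
}

# Hypothetical input: one matching line and one line without the attribute.
tmp=$(mktemp)
printf '%s\n' '<result key="EP-ED-JOB1-174" state="Failed" number="174">' \
              'no attribute here' > "$tmp"

# Without -n and p, the non-matching line is printed unchanged as well.
plain=$(sed -E 's/.*state="([^"]*)".*/\1/' "$tmp")

state=$(get_state "$tmp")
case $state in
    Successful) verdict="build passed" ;;
    Failed)     verdict="build failed" ;;
    *)          verdict="unknown state: $state" ;;
esac
printf '%s\n' "$verdict"
rm -f "$tmp"
```

Because the leading .* is as long as possible, a line containing several state="..." attributes would yield the last one; that is worth checking against the real file before relying on it.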







answered Oct 16 '17 at 15:42 by Eliah Kagan



























                       
