sed - if condition met, use next pattern

I have a number of plain text files with similar but slightly different structure, and I need to extract a particular line from each of them.

This line doesn't follow any particular pattern (i.e. its content is always different) and is not always in the same place in the file, though it is usually close to the beginning.

These files are press releases (originally PDFs, converted to text on the fly with pdftotext), and the line I need to extract is the subject, which I then need to use as the filename.

If I just run sed -n '1p' on these files to extract the very first line, I sometimes get the result I want, but more often I don't.

A sample of the different results I get:

Title of the press release # correct result
# wrong, here the first line is empty
29.9.2016 # wrong, here the first line contains the date
PRESS RELEASE # also wrong, I would need to scan further down

These are pretty much all of the cases. What gives me hope is that, since these files have a very similar structure and contain a title close to the beginning, if I keep scanning down I will sooner or later find what I'm looking for.

Is there any way to tell sed, within a single sed command, to keep testing lines against a set of conditions until it finds one that passes them all?

In my case I would need to tell sed to:

  • check that the line is not empty

  • check that the line doesn't contain a date

  • check that the line doesn't contain the words "Press Release"

If the line passes all of these checks, output it; if it fails any of them, skip to the next line.

Is this something that sed would be able to do?

Tags: shell-script, shell, sed

asked Aug 7 at 12:47 by zool (edited Aug 7 at 12:54)

          1 Answer

This finds the first line that is not empty (and does not consist only of whitespace), does not consist only of digits and dots, and does not contain the string PRESS RELEASE (in capitals):

sed '/^[[:blank:]]*$/d; /^[0-9.]*$/d; /PRESS RELEASE/d; q' file
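As a quick sanity check, the command can be run against small sample files that mimic the four cases from the question. This is only a sketch: the file names and contents below are invented for illustration, and mktemp is assumed to be available.

```shell
# Build four small sample files mirroring the cases in the question.
# (File names and contents are made up for this illustration.)
tmp=$(mktemp -d)
printf 'Title of the press release\nBody text\n' > "$tmp/case1.txt"
printf '\nTitle of the press release\nBody text\n' > "$tmp/case2.txt"
printf '29.9.2016\nTitle of the press release\nBody text\n' > "$tmp/case3.txt"
printf 'PRESS RELEASE\n\n29.9.2016\nTitle of the press release\nBody text\n' > "$tmp/case4.txt"

# Run the command on each file; every invocation prints just the title line.
for f in "$tmp"/case*.txt; do
    sed '/^[[:blank:]]*$/d; /^[0-9.]*$/d; /PRESS RELEASE/d; q' "$f"
done
```

In each case the unwanted leading lines are deleted and the first remaining line is printed.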


If dates can have - and spaces in them, and if PRESS RELEASE could also be written press release, Press Release or Press release (or pRESS Release, or any other combination in which the rest of each word after its first letter is entirely uppercase or entirely lowercase):

sed -E '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /[Pp](RESS|ress) [Rr](ELEASE|elease)/d; q' file


Or, with GNU sed, using the I flag for case-insensitive matching of press release:

sed '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /press release/Id; q' file


Each time one of the patterns matches, the d command deletes that line and starts a new cycle with the next line of input, so the remaining commands are skipped. If none of the patterns matches, q makes the script exit; because sed is not run with -n, the current line is printed first.
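Because q stops sed at the first surviving line, the command prints at most one line, so its output can be captured directly. Since the question mentions using the title as a filename, a minimal sketch of that step could look like the following. The function names extract_title and rename_by_title, the .txt suffix, and the choice to replace / with - are assumptions made for illustration, not part of the original answer.

```shell
# Hypothetical helper: print the first line that survives the three filters.
extract_title() {
    sed '/^[[:blank:]]*$/d; /^[0-9.]*$/d; /PRESS RELEASE/d; q' "$1"
}

# Hypothetical helper: rename one converted text file after its title.
# Assumptions: '/' in the title is replaced with '-' because it cannot
# appear in a filename; files without a usable title are left alone;
# an existing target is never overwritten.
rename_by_title() {
    file=$1
    title=$(extract_title "$file" | tr '/' '-')
    [ -n "$title" ] || return 0
    target=$(dirname "$file")/$title.txt
    if [ -e "$target" ]; then
        echo "skipping $file: $target already exists" >&2
        return 1
    fi
    mv -- "$file" "$target"
}

# Example use (the glob is an assumption about where the files live):
# for f in ./*.txt; do rename_by_title "$f"; done
```

Titles may contain other characters that are awkward in filenames, so further sanitizing may be needed depending on the data.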







answered Aug 7 at 13:00 by Kusalananda (edited Aug 7 at 13:41)











• Thanks, this is very helpful. Is the /[Pp](RESS|ress) [Rr](ELEASE|elease)/d bit really necessary? Isn't there a flag to tell sed to match in a case-insensitive manner?
  – zool
  Aug 7 at 13:36

• @zool With GNU sed, you could use /press release/Id (that's a capital I, lowercase d). Since I don't know what sed you are using, I kept to standard sed constructs.
  – Kusalananda
  Aug 7 at 13:40

• I am indeed on macOS, where the sed implementation doesn't support the I switch, but I installed gnu-sed via Homebrew and now I'm good to go. Thanks a lot!
  – zool
  Aug 7 at 14:25
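For scripts that have to run both on macOS and on systems where sed is GNU sed, one possible approach is to look for a GNU sed at run time and fall back to the portable form otherwise. This is a sketch under assumptions: first_title and gnu_sed are hypothetical names, Homebrew's gnu-sed is assumed to be installed under the name gsed, and the fallback only covers the capitalization variants that the bracket expression spells out.

```shell
# Pick a sed that understands the GNU-only I flag, if one is available.
# Assumption: Homebrew's gnu-sed package provides the binary as "gsed".
if sed --version 2>/dev/null | grep -q '(GNU sed)'; then
    gnu_sed=sed
elif command -v gsed >/dev/null 2>&1; then
    gnu_sed=gsed
else
    gnu_sed=
fi

# Hypothetical wrapper: use the I flag with GNU sed, otherwise fall back
# to the portable bracket-and-alternation form from the answer.
first_title() {
    if [ -n "$gnu_sed" ]; then
        "$gnu_sed" '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /press release/Id; q' "$1"
    else
        sed -E '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /[Pp](RESS|ress) [Rr](ELEASE|elease)/d; q' "$1"
    fi
}
```

Calling first_title file then behaves like the commands in the answer, whichever sed happens to be installed.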















