sed - if condition met, use next pattern

I have a number of plain text files with similar but slightly different structures, from which I need to extract a particular line.

This line of text doesn't follow any particular pattern (i.e. its content is always different) and is not always in the same place in the file, though it is usually close to the beginning.

These files are press releases (originally in PDF, converted to text on the fly with pdftotext), and the line I need to extract is the subject, which I need to use as the filename afterwards.

If I just run sed -n '1p' on these files, extracting the very first line, sometimes I get the result I want, more often not.

A sample of the different results I get:

Title of the press release # correct result
 # wrong, here the first line is empty
29.9.2016 # wrong, here the first line contains the date
PRESS RELEASE # also wrong, I would need to scan further down

These are pretty much all of the cases. What gives me hope is that, since these files have very similar structure and contain a title close to the beginning, if I keep scanning down sooner or later I will find what I'm looking for.

Is there any way to tell sed, in a single sed command, to keep trying the next line until none of a set of conditions is met?

In my case I would need to tell sed to:

check that the line is not empty

check that the line doesn't contain a date

check that the line doesn't contain the words "Press Release"

If all of these checks pass, output the line; if any of them fails, skip to the next line.

Is this something that sed would be able to do?

edited Aug 7 at 12:54

asked Aug 7 at 12:47

zool

shell-script shell sed

1 Answer (accepted)

To find the first line that is not empty (and does not contain only whitespace), does not consist solely of digits and dots, and does not contain the string PRESS RELEASE (in capitals):

sed '/^[[:blank:]]*$/d; /^[0-9.]*$/d; /PRESS RELEASE/d; q' file

If dates can contain - and spaces, and if PRESS RELEASE could also be written press release, Press Release or Press release (each word may start with either case, followed by the rest of the word all in upper case or all in lower case, so pRESS Release also matches):

sed -E '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /[Pp](RESS|ress) [Rr](ELEASE|elease)/d; q' file

or with GNU sed for case insensitive matching of press release:

sed '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /press release/Id; q' file

Each time a pattern matches, the d command deletes the current line and a new cycle starts with the next line of input. If no pattern matches, the q command makes sed exit, but the current line is printed first (because -n is not used).
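As a rough illustration of the d/q behaviour described above, here is a small sketch that runs the first command against three made-up files resembling the cases in the question. The file names, their contents and the helper name first_title are assumptions made for the example only; they are not part of the original answer.

```shell
#!/bin/sh
# Hypothetical sample files mimicking the cases from the question.
tmpdir=$(mktemp -d)

# Case 1: the title is already on the first line.
printf '%s\n' 'Title of the press release' 'Body text' > "$tmpdir/a.txt"
# Case 2: empty line, date and PRESS RELEASE come before the title.
printf '%s\n' '' '29.9.2016' 'PRESS RELEASE' 'Second title' 'Body text' > "$tmpdir/b.txt"
# Case 3: whitespace-only line first, then the same noise in another order.
printf '%s\n' '   ' 'PRESS RELEASE' '29.9.2016' 'Third title' > "$tmpdir/c.txt"

# first_title is a hypothetical helper name, not a sed feature.
# Each d restarts the cycle on the next line; q prints the first
# line that survives all three checks and then exits.
first_title() {
    sed '/^[[:blank:]]*$/d; /^[0-9.]*$/d; /PRESS RELEASE/d; q' "$1"
}

title_a=$(first_title "$tmpdir/a.txt")
title_b=$(first_title "$tmpdir/b.txt")
title_c=$(first_title "$tmpdir/c.txt")

printf '%s\n' "$title_a" "$title_b" "$title_c"

rm -r "$tmpdir"
```

With these sample files the script prints "Title of the press release", "Second title" and "Third title", one per line. Note that the first title survives the /PRESS RELEASE/ check only because that match is case sensitive.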

edited Aug 7 at 13:41

answered Aug 7 at 13:00

Kusalananda


Thanks, this is very helpful. Is the /[Pp](RESS|ress) [Rr](ELEASE|elease)/d bit really necessary? Isn't there a flag to tell sed to match in a case insensitive manner?
– zool
Aug 7 at 13:36

@zool With GNU sed, you could use /press release/Id (that's a capital I, lowercase d). Since I don't know what sed you are using, I kept to standard sed constructs.
– Kusalananda
Aug 7 at 13:40

I am indeed on macOS where the sed implementation doesn't support the I switch, but installed gnu-sed via homebrew and now I'm good to go. Thanks a lot!
– zool
Aug 7 at 14:25
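
Following up on the comments above, one possible way to make a script work on both kinds of system is to test whether a sed that accepts the I flag is available and to fall back to the portable pattern otherwise. This is only a sketch: the helper names has_I_flag and first_title and the sample file are assumptions for illustration, and gsed is assumed to be the command name provided by Homebrew's gnu-sed package.

```shell
#!/bin/sh
# has_I_flag is a hypothetical helper: it succeeds if the given sed
# exists and accepts the case-insensitive I flag on an address.
has_I_flag() {
    command -v "$1" >/dev/null 2>&1 &&
        [ "$(printf 'A\n' | "$1" -n '/a/Ip' 2>/dev/null)" = "A" ]
}

# Prefer gsed (assumed Homebrew name), then the system sed.
if has_I_flag gsed; then
    SED=gsed
elif has_I_flag sed; then
    SED=sed
else
    SED=''
fi

# first_title is a hypothetical helper wrapping the commands from the answer.
first_title() {
    if [ -n "$SED" ]; then
        "$SED" '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /press release/Id; q' "$1"
    else
        # Portable fallback using standard constructs only.
        sed -E '/^[[:blank:]]*$/d; /^[0-9. -]*$/d; /[Pp](RESS|ress) [Rr](ELEASE|elease)/d; q' "$1"
    fi
}

# Made-up sample input with a dashed date and mixed-case "Press Release".
sample=$(mktemp)
printf '%s\n' '' '2016-09-29' 'Press Release' 'Quarterly results' > "$sample"

title=$(first_title "$sample")
printf '%s\n' "$title"

rm -f "$sample"
```

Either branch should print "Quarterly results" for this sample, since both patterns skip the empty line, the dashed date and the mixed-case Press Release line.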
