Find or extract text between two patterns either on the same line or in many lines

I need to print the text between two patterns, wherever they occur in the file. The two patterns may be on the same line or on different lines, and other text may appear between one pair and the next.

The patterns are <abc> and </abc>.

Example:

aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>
cccc
dddd</abc>ieurwioeru<abc>eeee</abc>asdasd

I need output like the following (or comma-separated), so that the values between the two patterns are displayed whatever else the file contains:

aaaa
bbbb
cccc
dddd
eeee

edited 2 hours ago

Isaac

9,26411442

asked Oct 27 at 15:17

Gebbo

Is this in fact an XML document?
– Kusalananda
Oct 27 at 15:58

text-processing sed python perl

3 Answers
I don't recommend parsing any formal language (markup or code) with text-processing tools. They are designed to process plain human-readable text, and sooner or later you will get stuck on a problem that cannot be solved that way. Use dedicated tools instead (an HTML parser, a C++ compiler, etc.).
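To illustrate the "dedicated tools" advice: if the data really is XML-like, a parser can do the extraction. This is only a minimal Python sketch, assuming the text becomes well-formed XML once it is wrapped in a root element; the <root> wrapper and the function name are made up for this example.

```python
import xml.etree.ElementTree as ET

def extract_with_parser(text, tag="abc"):
    # Wrap the fragment in a hypothetical <root> element so the parser
    # accepts the text that sits outside the <abc> elements.
    root = ET.fromstring("<root>" + text + "</root>")
    # Collect the text of every <abc> element, in document order.
    return [elem.text or "" for elem in root.iter(tag)]

sample = (
    "aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>\n"
    "cccc\n"
    "dddd</abc>ieurwioeru<abc>eeee</abc>asdasd\n"
)
print(extract_with_parser(sample))
```

If the input is not well-formed (unclosed tags, stray < or & characters), ET.fromstring raises ParseError, which is arguably the point: the parser tells you when the assumption does not hold.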

With that being said in this case you can try pcregrep:

pcregrep -Mo '<abc>\K(.|\n)*?(?=</abc>)' file

The result is

aaaa
bbbb

cccc
dddd
eeee

Yes, there is an empty line between bbbb and cccc, because the third match starts with the newline that follows <abc> in the original file. Of course you can pipe the output through tr, sed or similar to remove the whitespace, but as I've said, with real-life input you may encounter more unexpected results.
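Since the question is also tagged python, here is a rough equivalent of the non-greedy multi-line match that also does the whitespace cleanup in the same step. It is a sketch, not a general solution: the function name is made up, and it assumes the values themselves contain no whitespace that must be preserved.

```python
import re

def extract_values(text, start="<abc>", end="</abc>"):
    # Non-greedy match between the two patterns; DOTALL lets "." match newlines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    values = []
    for block in re.findall(pattern, text, flags=re.DOTALL):
        # split() with no argument splits on any run of whitespace
        # and drops empty pieces, so "\ncccc\ndddd" gives two values.
        values.extend(block.split())
    return values

sample = (
    "aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>\n"
    "cccc\n"
    "dddd</abc>ieurwioeru<abc>eeee</abc>asdasd\n"
)
print("\n".join(extract_values(sample)))
```

Joining with "," instead of "\n" gives the comma-separated form mentioned in the question.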

answered Oct 27 at 16:03

jimmij

30.2k867102

I forgot to add a link to a proper description of "don't parse html with regex": stackoverflow.com/a/1732454/4488514
– jimmij
Oct 27 at 23:46

How could we differentiate the newlines that are "field separators" from those that were included in the original file? Shouldn't you use a comma (,) as requested?
– Isaac
Nov 3 at 5:44

For that simple case, try

sed ':L1; N; $bL2; bL1; :L2; s#<abc>#^A#g; s#^[^^A]*^A##; s#</abc>[^^A]*^A#\n#g; s#</abc>.*$##; ' file
aaaa
bbbb

cccc
dddd
eeee

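Collect all lines into the pattern space, replace each opening pattern with ^A, remove everything from the beginning up to the first ^A, replace each closing pattern together with the text up to the next ^A with a newline, remove the last closing pattern and everything after it, and print.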

answered Oct 28 at 13:54

RudiC

2,8081211

The sequence ':L1; N; $bL2; bL1; :L2; could be replaced by ':L1; N; $!bL1;. In fact, this is shorter and simpler: 'H;$!d;g; (capture everything in the hold buffer and, on the last line, "get" it).
– Isaac
Nov 3 at 5:02

How could we differentiate the newlines that are "field separators" from those that were included in the original file? Shouldn't you use a comma (,) as requested?
– Isaac
Nov 3 at 5:16

sed

A sed solution is to convert the patterns <abc> and </abc> to two other characters that are not used anywhere else in the file. That reduces the problem to the general one of extracting the text between two single characters.

First, convert each pattern to single characters:

sed 'H;$!d;x; s#<abc>#^A#g; s#</abc>#^B#g;' file

That is assuming that you typed Ctrl-V Ctrl-A for each ^A and similarly for ^B.

The initial H;$!d;x; captures the whole file in the pattern space. That means:
- H: append every line to the hold space.
- $!d: if it is not the last line ($!), delete the pattern space and start the next cycle (d).
- x: on the last line, exchange the hold space and the pattern space to get all the stored lines. (It could be g, but x needs less memory, as the whole file is not copied from the hold space to the pattern space.)

The general process to extract a pattern between two single characters (assume x and y here) is:

sed 's#^[^x]*x##; s#y[^y]*$##; s#y[^x]*x#,#g;'

That is:
- remove leading characters before the first (^) x.
- remove the trailing characters after the last ($) y.
- Convert characters between y and x to a delimiter (comma (,) in this case).
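For readers more comfortable with Python, the same marker idea can be sketched as follows. This is an illustration of the steps above, not a replacement for the sed command: the function name is made up, and it assumes the bytes \x01 (Ctrl-A) and \x02 (Ctrl-B) never occur in the input.

```python
import re

START, END = "\x01", "\x02"  # Ctrl-A and Ctrl-B, assumed absent from the input

def extract_csv(text, open_pat="<abc>", close_pat="</abc>", sep=","):
    # Convert each pattern to a single marker character.
    s = text.replace(open_pat, START).replace(close_pat, END)
    # Remove everything up to and including the first START marker.
    s = re.sub(r"\A[^\x01]*\x01", "", s)
    # Remove the last END marker and everything after it.
    s = re.sub(r"\x02[^\x02]*\Z", "", s)
    # Replace each END ... START gap with the separator.
    return re.sub(r"\x02[^\x01]*\x01", sep, s)

sample = (
    "aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>\n"
    "cccc\n"
    "dddd</abc>ieurwioeru<abc>eeee</abc>asdasd\n"
)
print(extract_csv(sample))
```

Newlines inside a value are kept as they are, so only the commas act as field separators.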

All together:

$ sed 'H;$!d;x; s#<abc>#^A#g; s#</abc>#^B#g; s#^[^^A]*^A##; s#^B[^^B]*$##; s#^B[^^A]*^A#,#g;' file
aaaa,bbbb,
cccc
dddd,eeee

grep

It could be done with (GNU) grep but it needs the help of paste to put the commas (only) in the right places:

$ grep -ozP '(?s)<abc>\K.*?(?=</abc>)' file | paste -zsd ','; echo
aaaa,bbbb,
cccc
dddd,eeee

answered Nov 3 at 5:43

Isaac

9,26411442
