Find or extract text between two patterns either on the same line or in many lines
I need to print the text between two patterns, regardless of where they occur in the file. The two patterns may be on the same line or on different lines, with other text between them.
The patterns are <abc> and </abc>.
Example:
aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>
cccc
dddd</abc>ieurwioeru<abc>eeee</abc>asdasd
I need output like the following (or comma-separated), showing the values between the two patterns wherever they occur in the file:
aaaa
bbbb
cccc
dddd
eeee
text-processing sed python perl
edited 2 hours ago by Isaac
asked Oct 27 at 15:17 by Gebbo
Is this in fact an XML document?
– Kusalananda
Oct 27 at 15:58
3 Answers
I don't recommend parsing any formal language (markup or code) with text-processing tools. They are designed for processing human-readable text only, and sooner or later you will get stuck on a problem that cannot be solved this way. Use dedicated tools instead (an HTML parser, a C++ compiler, etc.).
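To make "dedicated tool" a bit more concrete, here is a minimal sketch using Python's standard html.parser module on the sample from the question. It assumes the input is HTML-like enough for that parser, which the question does not confirm; the names AbcExtractor, extract_with_parser and SAMPLE are made up for the illustration.

```python
from html.parser import HTMLParser

# Sample input copied from the question.
SAMPLE = (
    "aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>\n"
    "cccc\n"
    "dddd</abc>ieurwioeru<abc>eeee</abc>asdasd\n"
)


class AbcExtractor(HTMLParser):
    """Collect the text found inside each <abc>...</abc> element."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are between <abc> and </abc>
        self.values = []      # one string per <abc> element

    def handle_starttag(self, tag, attrs):
        if tag == "abc":
            self.inside = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "abc":
            self.inside = False

    def handle_data(self, data):
        # Text outside <abc> elements is ignored.
        if self.inside:
            self.values[-1] += data


def extract_with_parser(text):
    parser = AbcExtractor()
    parser.feed(text)
    parser.close()
    return parser.values


print(extract_with_parser(SAMPLE))
```

Note that html.parser lowercases tag names and decodes character references such as &amp;, which may or may not be what you want for a file that is not really HTML.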
With that being said, in this case you can try pcregrep:
pcregrep -Mo '<abc>\K(.|\n)*?(?=</abc>)' file
The result is:
aaaa
bbbb

cccc
dddd
eeee
Yes, there is an empty line between bbbb and cccc, because the original file has a newline at that point. Of course, you can pipe the output through tr, sed or whatever to remove the whitespace if you want to, but as I've said: in real-life examples you may encounter more unexpected results.
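Since the question is also tagged python, the same non-greedy, multi-line match can be sketched with the standard re module. This is only an illustration of the regex idea, not a replacement for a real parser; extract_abc and SAMPLE are names invented for the example.

```python
import re

# Sample input copied from the question.
SAMPLE = (
    "aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>\n"
    "cccc\n"
    "dddd</abc>ieurwioeru<abc>eeee</abc>asdasd\n"
)


def extract_abc(text):
    # (.*?) is non-greedy, so each match stops at the nearest </abc>.
    # re.DOTALL lets '.' match newlines, so a match may span several lines.
    return re.findall(r"<abc>(.*?)</abc>", text, flags=re.DOTALL)


for value in extract_abc(SAMPLE):
    print(value)
```

The third value is "\ncccc\ndddd", so printing it produces the same empty line before cccc.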
answered Oct 27 at 16:03 by jimmij
I forgot to add a link to a proper description of "don't parse HTML with regex": stackoverflow.com/a/1732454/4488514
– jimmij
Oct 27 at 23:46
How could we differentiate between newlines that are "field separators" and those that were included in the original file? Shouldn't you use a comma (,) as requested?
– Isaac
Nov 3 at 5:44
For that simple case, try:
sed ':L1; N; $bL2; bL1; :L2; s#<abc>#^A#g; s#^[^^A]*^A##; s#</abc>[^^A]*^A#\n#g; s#</abc>.*$##; ' file
aaaa
bbbb
cccc
dddd
eeee
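This collects all lines into the pattern space, replaces each opening pattern with ^A (a literal Ctrl-A character), removes everything from the beginning up to the first ^A, replaces each stretch from a closing pattern to the next ^A with a newline, removes everything from the last closing pattern to the end, and prints the result.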
answered Oct 28 at 13:54 by RudiC
The sequence ':L1; N; $bL2; bL1; :L2; could be replaced by ':L1; N; $!bL1; and, in fact, this is shorter/simpler: 'H;$!d;g; (capture everything in the hold buffer and, on the last line, "get" it).
– Isaac
Nov 3 at 5:02
How could we differentiate between newlines that are "field separators" and those that were included in the original file? Shouldn't you use a comma (,) as requested?
– Isaac
Nov 3 at 5:16
sed
A sed solution is to convert the patterns <abc> and </abc> to two other characters that are not used anywhere else in the file. That reduces the problem to the general one of extracting the text between two single characters.
First, convert each pattern to a single character:
sed 'H;$!d;x; s#<abc>#^A#g; s#</abc>#^B#g;' file
This assumes that you typed Ctrl-V Ctrl-A for each ^A, and similarly for ^B.
The initial H;$!d;x; captures the whole file in the pattern space. That means:
- H: hold every line (append it to the hold space).
- $!d: if it is not the last line ($!), erase the pattern space and start the next cycle (d).
- x: get all the lines stored in the hold space. (It could be g, but x needs less memory, as the whole file is not copied from the hold space to the pattern space.)
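As a rough illustration of why the pattern space ends up holding the whole file (with a leading newline), here is a small Python model of H;$!d;x. It only mimics the hold-space bookkeeping described above and is not how sed is implemented; the function name slurp_like_sed is made up.

```python
def slurp_like_sed(lines):
    """Model of sed 'H;$!d;x': return the final pattern space."""
    hold = ""      # the hold space starts out empty
    pattern = ""   # the pattern space
    for number, line in enumerate(lines, start=1):
        pattern = line              # sed reads the next line into the pattern space
        hold += "\n" + pattern      # H: append a newline plus the pattern space
        if number != len(lines):
            pattern = ""            # $!d: not the last line, delete and start over
            continue
        pattern, hold = hold, pattern   # x: swap hold space and pattern space
    return pattern


print(repr(slurp_like_sed(["one", "two", "three"])))  # '\none\ntwo\nthree'
```

The leading newline comes from H being applied to the first line while the hold space is still empty; any substitution that deletes everything before the first marker removes it again.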
The general process to extract the text between two single characters (assume x and y here) is:
sed 's#^[^x]*x##; s#y[^y]*$##; s#y[^x]*x#,#g;'
That is:
- Remove the leading characters before the first (^) x.
- Remove the trailing characters after the last ($) y.
- Convert the characters between y and x to a delimiter (a comma (,) in this case).
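The same steps can be sketched in Python, which makes each substitution easy to try on its own. This is a sketch under the same assumption as the sed version, namely that the two marker characters (\x01 for Ctrl-A and \x02 for Ctrl-B here) do not occur anywhere in the input; the function name and its parameters are invented for the example.

```python
import re

# Sample input copied from the question.
SAMPLE = (
    "aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>\n"
    "cccc\n"
    "dddd</abc>ieurwioeru<abc>eeee</abc>asdasd\n"
)


def extract_between_markers(text, open_pat="<abc>", close_pat="</abc>", sep=","):
    # Replace each pattern with a single marker character.
    # Assumption: \x01 and \x02 never appear in the input itself.
    text = text.replace(open_pat, "\x01").replace(close_pat, "\x02")
    # Remove the leading characters up to and including the first start marker.
    text = re.sub(r"\A[^\x01]*\x01", "", text)
    # Remove the last end marker and everything after it.
    text = re.sub(r"\x02[^\x02]*\Z", "", text)
    # Turn each "end marker ... start marker" stretch into the delimiter.
    return re.sub(r"\x02[^\x01]*\x01", sep, text)


print(extract_between_markers(SAMPLE))
```

On the sample this prints aaaa,bbbb, followed by cccc and dddd,eeee on separate lines, because the newlines inside the third value are kept. If the input contains no pattern at all, the text comes back unchanged, so real use would need a check for that case.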
All together:
$ sed 'H;$!d;x; s#<abc>#^A#g; s#</abc>#^B#g; s#^[^^A]*^A##; s#^B[^^B]*$##; s#^B[^^A]*^A#,#g;' file
aaaa,bbbb,
cccc
dddd,eeee
grep
It could be done with (GNU) grep, but it needs the help of paste to put the commas (only) in the right places:
$ grep -ozP '(?s)<abc>\K.*?(?=</abc>)' file | paste -zsd ','; echo
aaaa,bbbb,
cccc
dddd,eeee
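If the goal is the five separate values from the question, with cccc and dddd as fields of their own, one possible sketch in Python is to split every match on whitespace before joining. That treats any whitespace inside a value as a separator, which is an assumption about the data and not something the question states; values_as_csv and SAMPLE are made-up names.

```python
import re

# Sample input copied from the question.
SAMPLE = (
    "aslkdjas<abc>aaaa</abc><abc>bbbb</abc>sdkljasdl<abc>\n"
    "cccc\n"
    "dddd</abc>ieurwioeru<abc>eeee</abc>asdasd\n"
)


def values_as_csv(text):
    fields = []
    for body in re.findall(r"<abc>(.*?)</abc>", text, flags=re.DOTALL):
        # Assumption: whitespace (including newlines) inside a match
        # separates values, so "\ncccc\ndddd" becomes two fields.
        fields.extend(body.split())
    return ",".join(fields)


print(values_as_csv(SAMPLE))  # aaaa,bbbb,cccc,dddd,eeee
```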
answered Nov 3 at 5:43 by Isaac