Extract the Children of a Specific XML Element Type
Given a specific XML element (i.e. a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More specifically, I have the following snippet of (not quite valid) XML data:
<!-- data.xml -->
<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab=3 >
<c1>cc</c1>
<c2>cc</c2>
</instance>
I would like a script or command which takes this data as input and produces the following output:
<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>
I would like for the solution to use standard text-processing tools such as sed or awk.
I tried using the following sed command, but it did not work:
sed -n '/<Sample/,/</Sample/p' data.xml
text-processing awk sed xml ctags
It's totally unclear what you're asking for here. Why would the sed command you're using have any effect with the data you're using? The search string you're using doesn't appear in the data. And why do you want to use sed and awk specifically? Are you sure that's a requirement?
-- igal, Nov 11 '17 at 14:26
The input file is not a properly formatted XML file. It lacks a single root element.
-- Kusalananda, Nov 11 '17 at 14:47
... and it's indented peculiarly.
-- G-Man, Nov 11 '17 at 15:16
And the attribute values aren't quoted.
-- igal, Nov 11 '17 at 15:40
If your question has been resolved you should accept an answer so that the issue is closed. Otherwise this question will remain open and people may continue to submit solutions.
-- igal, Dec 9 '17 at 22:48
Question asked Nov 11 '17 at 14:10 by Abhi S; edited Nov 11 '17 at 17:48 by igal.
2 Answers
If you really want sed- or awk-like command-line processing for XML files, then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:
- xmlstarlet
- xmllint
- BaseX
- XQilla
You should also be aware that there are several XML-specific programming/query languages:
- XPath
- XQuery
- XSLT
Note that (in order to be well-formed XML) your data needs a single root element and your attribute values must be quoted, i.e. your data file should look more like this:
<!-- data.xml -->
<instances>
  <instance ab='1'>
    <a1>aa</a1>
    <a2>aa</a2>
  </instance>
  <instance ab='2'>
    <b1>bb</b1>
    <b2>bb</b2>
  </instance>
  <instance ab='3'>
    <c1>cc</c1>
    <c2>cc</c2>
  </instance>
</instances>
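If editing the file by hand is impractical, the two repairs described above (quoting the attribute values and adding a root element) can be sketched in Python. This is only a rough illustration, not a general XML fixer: the names `BARE_ATTR`, `quote_attrs` and `repair_fragment` are made up for this example, the root tag name `instances` is an assumption, and the regular expression only handles simple bare values like the ones in the sample data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper pattern: a name=value pair whose value is not quoted.
# Only simple values (no spaces, quotes, slashes or angle brackets) are handled.
BARE_ATTR = re.compile(r"""(\s[\w:.-]+)=([^\s'"<>/]+)""")


def quote_attrs(tag_match):
    """Quote bare attribute values inside one start tag, e.g. ab=1 -> ab='1'."""
    return BARE_ATTR.sub(r"\1='\2'", tag_match.group(0))


def repair_fragment(text, root_tag="instances"):
    """Quote bare attributes and wrap the fragment in a single root element."""
    # Only start tags are rewritten; comments and closing tags are left alone.
    fixed = re.sub(r"<[A-Za-z][^<>]*>", quote_attrs, text)
    return "<{0}>\n{1}\n</{0}>".format(root_tag, fixed)


if __name__ == "__main__":
    raw = (
        "<!-- data.xml -->\n"
        "<instance ab=1 >\n<a1>aa</a1>\n<a2>aa</a2>\n</instance>\n"
        "<instance ab=2 >\n<b1>bb</b1>\n<b2>bb</b2>\n</instance>\n"
    )
    repaired = repair_fragment(raw)
    # The repaired text should now be accepted by a strict parser.
    root = ET.fromstring(repaired)
    print([inst.get("ab") for inst in root.iter("instance")])
```

The repaired text could then be written to a new file and passed to any of the XML tools listed above.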
If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:
xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml
This produces the following output:
<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>
Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:
#!/usr/bin/env python3
"""extract_instance_children.py"""
import sys
import xml.etree.ElementTree as ET

# Load the data
tree = ET.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
    # encoding="unicode" makes tostring() return str instead of bytes;
    # strip() drops the whitespace that follows each child element
    print("".join(ET.tostring(child, encoding="unicode").strip() for child in instance))
And here is how you could run the script:
python3 extract_instance_children.py data.xml
This uses the xml package from the Python Standard Library, which is also a strict XML parser.
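To illustrate what "strict" means here, the following sketch feeds ElementTree a fragment shaped like the original data and a repaired version of it. The helper name `is_well_formed` and the two sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shaped like the question's data: unquoted attribute value, no root element.
ORIGINAL = "<instance ab=1 >\n<a1>aa</a1>\n</instance>\n"
# The same content with a root element and a quoted attribute value.
REPAIRED = "<instances><instance ab='1'><a1>aa</a1></instance></instances>"


def is_well_formed(text):
    """Return True if ElementTree can parse the text, False on a ParseError."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("original:", is_well_formed(ORIGINAL))
    print("repaired:", is_well_formed(REPAIRED))
```

The original-style fragment is rejected with a `ParseError`, so a script built on this module needs the repaired form of the file.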
If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):
#!/usr/bin/awk -f
# extract_instance_children.awk

BEGIN {
    addchild = 0;
    children = "";
}

{
    # Opening tag for "instance" element - set the "addchild" flag
    if ($0 ~ "^ *<instance[^<>]+>") {
        addchild = 1;
    }
    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if ($0 ~ "^ *</instance>" && addchild == 1) {
        addchild = 0;
        printf("%s\n", children);
        children = "";
    }
    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/, "", $0);
        gsub(/[ \t]+$/, "", $0);
        children = children $0;
    }
}
To execute the script from a file, you would use a command like this one:
awk -f extract_instance_children.awk data.xml
And here is a Bash script that produces the desired output:
#!/bin/bash
# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
# (read trims leading and trailing whitespace; -r keeps backslashes literal)
while read -r line; do
    # Set the instance flag to true if we come across an opening tag
    if echo "$line" | grep -q '<instance.*>'; then
        instance=1
    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "$line" | grep -q '</instance>'; then
        instance=0
        echo
    # If we're inside an instance tag then print the child element
    elif [[ $instance == 1 ]]; then
        # Use a fixed format string so that % characters in the data are printed literally
        printf '%s' "$line"
    fi
done < "$1"
You would execute it like this:
bash extract_instance_children.bash data.xml
Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:
#!/usr/bin/env python3
"""extract_instance_children.py"""
import sys
from bs4 import BeautifulSoup

with open(sys.argv[1], "r") as xmlfile:
    # The lenient "html.parser" backend accepts the unquoted attributes and the missing root element
    soup = BeautifulSoup(xmlfile.read(), "html.parser")

for instance in soup.find_all("instance"):
    # recursive=False limits the result to direct children, so nested elements are not printed twice
    print("".join(str(child) for child in instance.find_all(recursive=False)))
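For readers who cannot install Beautiful Soup, here is a rough sketch of the same lenient approach using only `html.parser.HTMLParser` from the standard library (the module that the "html.parser" backend name refers to). This is not Beautiful Soup itself and is far less capable: the class name `InstanceChildren` is invented for this example, tag names are lowercased by the parser, and text content is stripped of surrounding whitespace, which may not be what you want for other data.

```python
from html.parser import HTMLParser


class InstanceChildren(HTMLParser):
    """Collect the markup found inside each <instance> element (hypothetical helper)."""

    def __init__(self, tag="instance"):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.parts = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            # Entering the element of interest: start a fresh buffer
            self.inside = True
            self.parts = []
        elif self.inside:
            # Keep the child's start tag exactly as it appeared in the input
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag:
            # Leaving the element of interest: emit everything collected
            self.inside = False
            self.rows.append("".join(self.parts))
        elif self.inside:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Skip the whitespace-only text between elements
        if self.inside and data.strip():
            self.parts.append(data.strip())


if __name__ == "__main__":
    sample = (
        "<!-- data.xml -->\n"
        "<instance ab=1 >\n<a1>aa</a1>\n<a2>aa</a2>\n</instance>\n"
        "<instance ab=2 >\n<b1>bb</b1>\n<b2>bb</b2>\n</instance>\n"
    )
    parser = InstanceChildren()
    parser.feed(sample)
    parser.close()
    for row in parser.rows:
        print(row)
```

The parser accepts the unquoted `ab=1` attribute and the missing root element without complaint, which is the tolerance that a strict XML parser lacks.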
Thanks. The bash script is working as expected. Thanks for your prompt and quick help. My problem has been resolved. No need to spend more time on it.
-- Abhi S, Nov 11 '17 at 15:44
@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
-- igal, Nov 11 '17 at 16:21
@AbhiS If this solution worked for you, could you please accept it?
-- igal, Feb 1 at 1:47
This may be of help:
#!/bin/bash
awk -v tag=instance -v p=0 '{
    if ($0 ~ ("^<" tag))  { p = 1; next }
    if ($0 ~ ("^</" tag)) { p = 0; printf("\n"); next }
    if (p == 1)           { $1 = $1; printf("%s", $0) }
}' infile
This assumes that the Sample text in your example is a mistake, and keeps things simple.
The p variable decides when to print. Assigning $1=$1 makes awk rebuild the record, which removes leading (and trailing) whitespace.
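The same flag-based idea can be sketched in Python, with the tag name as a parameter. This is an illustrative equivalent rather than a translation of the awk code: the function name `extract_children` is made up here, it strips only leading and trailing whitespace (awk's `$1=$1` also collapses internal runs of whitespace), and like the awk version it assumes each tag sits on its own line.

```python
import re


def extract_children(lines, tag="instance"):
    """Yield one string per <tag> element, joining its stripped child lines."""
    # Require whitespace or ">" after the name so "<instances>" does not match "instance"
    opening = re.compile(r"<%s[\s>]" % re.escape(tag))
    closing = re.compile(r"</%s\s*>" % re.escape(tag))
    inside = False  # plays the role of the awk variable p
    children = []
    for line in lines:
        stripped = line.strip()
        if opening.match(stripped):
            inside = True
            children = []
        elif closing.match(stripped):
            if inside:
                yield "".join(children)
            inside = False
        elif inside:
            children.append(stripped)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python3 extract_children.py data.xml
    with open(sys.argv[1]) as handle:
        for row in extract_children(handle):
            print(row)
```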
This actually didn't work for me; I got no output at all. I changed the if($0~"\<"tag) condition to if($0~"</"tag) and got the expected output, but not in the correct format (there was additional whitespace).
-- igal, Nov 12 '17 at 0:06
@igal Maybe now, answer edited (made even simpler), spaces removed.
-- Arrow, Nov 12 '17 at 2:15
@Arrow Yup! Very nice. Upvoted!
-- igal, Nov 12 '17 at 4:27
First answer: answered Nov 11 '17 at 15:07 by igal; edited Mar 30 at 19:19.
Thanks. bash script is working as expected.. Thanks for your prompt and quick help.. My problem had been resolved.. No need to speed more time on it.
â Abhi S
Nov 11 '17 at 15:44
2
@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
â igal
Nov 11 '17 at 16:21
@AbhiS If this solution worked for you, could you please accept it?
â igal
Feb 1 at 1:47
add a comment |Â
Thanks. bash script is working as expected.. Thanks for your prompt and quick help.. My problem had been resolved.. No need to speed more time on it.
â Abhi S
Nov 11 '17 at 15:44
2
@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
â igal
Nov 11 '17 at 16:21
@AbhiS If this solution worked for you, could you please accept it?
â igal
Feb 1 at 1:47
Thanks. bash script is working as expected.. Thanks for your prompt and quick help.. My problem had been resolved.. No need to speed more time on it.
â Abhi S
Nov 11 '17 at 15:44
Thanks. bash script is working as expected.. Thanks for your prompt and quick help.. My problem had been resolved.. No need to speed more time on it.
â Abhi S
Nov 11 '17 at 15:44
2
2
@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
â igal
Nov 11 '17 at 16:21
@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
â igal
Nov 11 '17 at 16:21
@AbhiS If this solution worked for you, could you please accept it?
â igal
Feb 1 at 1:47
@AbhiS If this solution worked for you, could you please accept it?
â igal
Feb 1 at 1:47
add a comment |Â
up vote
3
down vote
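This may be of help:
#!/bin/bash
awk -v tag=instance -v p=0 '{
  if ($0 ~ ("^<" tag))  { p = 1; next }                # opening tag: start collecting
  if ($0 ~ ("^</" tag)) { p = 0; printf("\n"); next }  # closing tag: end the output line
  if (p == 1)           { $1 = $1; printf("%s", $0) }  # child line: print it without a newline
}' infile
This assumes that the Sample text in your example is a mistake, and keeps things simple.
The p variable decides when to print. The assignment $1=$1 makes awk rebuild the record, which removes leading spaces.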
edited Nov 12 '17 at 14:13
answered Nov 11 '17 at 14:44
Arrow
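A minimal sketch of the approach in the answer above, wrapped in a shell function so it can be tried on the question's sample data without creating a file. The function name extract_children, the flag name inside, and the indentation of the child lines in the sample are assumptions made for illustration; the logic (a flag toggled by the opening and closing tags, with $1=$1 used to strip leading whitespace) follows the answer.

```shell
#!/bin/sh
# extract_children TAG: read XML-like text on stdin and print the children
# of each <TAG ...> ... </TAG> block joined on a single line.
# Assumes the opening and closing tags each start at the beginning of a line.
extract_children() {
  awk -v tag="$1" '
    $0 ~ ("^</" tag) { inside = 0; printf "\n"; next }  # closing tag: end the output line
    $0 ~ ("^<" tag)  { inside = 1; next }               # opening tag: start collecting
    inside           { $1 = $1; printf "%s", $0 }       # child line: rebuild the record to trim whitespace
  '
}

# Sample data from the question (the indentation of the child lines is assumed).
extract_children instance <<'EOF'
<instance ab=1 >
  <a1>aa</a1>
  <a2>aa</a2>
</instance>
<instance ab=2 >
  <b1>bb</b1>
  <b2>bb</b2>
</instance>
EOF
```

Two limitations are worth keeping in mind. The tag is matched as a prefix, so a tag named instances would match as well. And $1=$1 also collapses runs of whitespace inside a line into single spaces, which may matter if the child elements contain text with significant spacing.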
This actually didn't work for me; I got no output at all. I changed the if($0~"\<"tag) condition to if($0~"</"tag) and got the expected output, but not in the correct format (there was additional whitespace).
– igal
Nov 12 '17 at 0:06
@igal Maybe now, answer edited (made even simpler), spaces removed.
– Arrow
Nov 12 '17 at 2:15
@Arrow Yup! Very nice. Upvoted!
– igal
Nov 12 '17 at 4:27
It's totally unclear what you're asking for here. Why would the sed command you're using have any effect with the data you're using? The search string you're using doesn't appear in the data. And why do you want to use sed and awk specifically? Are you sure that's a requirement?
– igal
Nov 11 '17 at 14:26
The input file is not a properly formatted XML file. It lacks a single root element.
– Kusalananda
Nov 11 '17 at 14:47
... and it's indented peculiarly.
– G-Man
Nov 11 '17 at 15:16
And the attribute values aren't quoted.
– igal
Nov 11 '17 at 15:40
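The comments above point out that the input has no single root element and that its attribute values are unquoted. Purely as an illustration, and assuming that attributes always have the form name=value with no spaces or quotes in the value, a filter along these lines could repair both problems before the data is handed to a real XML tool. The function name make_well_formed and the root element name root are hypothetical, and the -E option is assumed to be available (it is in GNU and BSD sed).

```shell
#!/bin/sh
# make_well_formed: read XML-like text on stdin, quote bare attribute
# values, and wrap everything in a single (hypothetical) <root> element.
# Caveat: the pattern is not tag-aware, so text content that happens to
# look like name=value would be rewritten too.
make_well_formed() {
  printf '<root>\n'
  sed -E 's/([A-Za-z_][A-Za-z0-9_]*)=([^" >]+)/\1="\2"/g'
  printf '</root>\n'
}

# Example with one block from the question's data.
make_well_formed <<'EOF'
<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
EOF
```

This is only a stopgap for data shaped like the sample; fixing the program that produces the file would be more robust than patching its output.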
If your question has been resolved you should accept an answer so that the issue is closed. Otherwise this question will remain open and people may continue to submit solutions.
– igal
Dec 9 '17 at 22:48