Extract the Children of a Specific XML Element Type

Given a specific XML element (i.e. a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More specifically, I have the following snippet of (not quite valid) XML data:



<!-- data.xml -->

<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab=3 >
<c1>cc</c1>
<c2>cc</c2>
</instance>


I would like a script or command which takes this data as input and produces the following output:



<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>


I would like for the solution to use standard text-processing tools such as sed or awk.



I tried using the following sed command, but it did not work:



sed -n '/<Sample/,/</Sample/p' data.xml




























  • It's totally unclear what you're asking for here. Why would the sed command you're using have any effect with the data you're using? The search string you're using doesn't appear in the data. And why do you want to use sed and awk specifically? Are you sure that's a requirement?
    – igal
    Nov 11 '17 at 14:26










  • The input file is not a properly formatted XML file. It lacks a single root element.
    – Kusalananda
    Nov 11 '17 at 14:47










  • ... and it’s indented peculiarly.
    – G-Man
    Nov 11 '17 at 15:16










  • And the attribute values aren't quoted.
    – igal
    Nov 11 '17 at 15:40










  • If your question has been resolved you should accept an answer so that the issue is closed. Otherwise this question will remain open and people may continue to submit solutions.
    – igal
    Dec 9 '17 at 22:48














asked Nov 11 '17 at 14:10 by Abhi S
edited Nov 11 '17 at 17:48 by igal











2 Answers






























If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:



  • xmlstarlet

  • xmllint

  • BaseX

  • XQilla

You should also be aware that there are several XML-specific programming/query languages:



  • XPath

  • XQuery

  • XSLT

Note that (in order to be well-formed XML) your data needs a single root element and quoted attribute values, i.e. your data file should look more like this:



<!-- data.xml -->

<instances>

<instance ab='1'>
<a1>aa</a1>
<a2>aa</a2>
</instance>

<instance ab='2'>
<b1>bb</b1>
<b2>bb</b2>
</instance>

<instance ab='3'>
<c1>cc</c1>
<c2>cc</c2>
</instance>

</instances>
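If the data arrives in the not-quite-XML form shown in the question, one possible way to bring it into the shape above is to quote the bare attribute values and wrap everything in a root element. The following is only a minimal Python sketch: it assumes that attribute values are simple unquoted tokens without spaces or "=" characters, and the function name make_well_formed and the default root tag instances are illustrative choices rather than part of any library.

```python
import re
import xml.etree.ElementTree as ET


def make_well_formed(text, root_tag="instances"):
    """Quote bare attribute values and wrap the snippet in one root element.

    Hypothetical helper for illustration; it only handles simple cases
    such as ab=1 and leaves already-quoted values alone.
    """
    def quote_attrs(match):
        # Inside a single opening tag, turn name=value into name='value'
        return re.sub(r"(\w+)=([^\s'\"<>]+)", r"\1='\2'", match.group(0))

    # Only touch opening tags; comments, declarations and closing tags
    # start with "<!", "<?" or "</" and are skipped by the character class.
    quoted = re.sub(r"<[^<>!?/][^<>]*>", quote_attrs, text)
    return "<{0}>\n{1}\n</{0}>".format(root_tag, quoted)


if __name__ == "__main__":
    snippet = "<instance ab=1 >\n<a1>aa</a1>\n<a2>aa</a2>\n</instance>"
    fixed = make_well_formed(snippet)
    # Parsing succeeds only if the repaired text is well-formed
    ET.fromstring(fixed)
    print(fixed)
```

The repaired text can then be saved and passed to any of the strict XML tools mentioned in this answer.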


If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:



xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml


This produces the following output:



<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>


Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:



#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
import xml.etree.ElementTree

# Load the data
tree = xml.etree.ElementTree.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
    # encoding="unicode" makes tostring() return str rather than bytes (Python 3)
    print(''.join([xml.etree.ElementTree.tostring(child, encoding="unicode").strip() for child in instance]))


And here is how you could run the script:



python3 extract_instance_children.py data.xml


This uses the xml package from the Python Standard Library which is also a strict XML parser.
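To see what "strict" means in practice, here is a small sketch showing that the standard-library parser rejects the data in its original form (unquoted attribute value) but accepts the corrected form. The helper name is_well_formed is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if ElementTree can parse the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Raised for problems such as unquoted attributes or multiple roots
        return False
    return True


if __name__ == "__main__":
    original = "<instance ab=1 >\n<a1>aa</a1>\n</instance>"
    corrected = "<instances><instance ab='1'><a1>aa</a1></instance></instances>"
    print(is_well_formed(original))
    print(is_well_formed(corrected))
```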



If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):



#!/usr/bin/awk -f

# extract_instance_children.awk

BEGIN {
    addchild=0;
    children="";
}

{
    # Opening tag for "instance" element - set the "addchild" flag
    if($0 ~ "^ *<instance[^<>]+>") {
        addchild=1;
    }

    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }

    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}



To execute the script from a file, you would use a command like this one:



awk -f extract_instance_children.awk data.xml
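The awk script is a small line-by-line state machine: a flag records whether the current line lies inside an "instance" element, and the child lines are accumulated until the closing tag is seen. For readers who find that easier to follow in Python, here is a rough equivalent. It is only a sketch with the same limitation (it expects each tag on its own line), and the function name extract_children is made up for this example.

```python
import re


def extract_children(lines, tag="instance"):
    """Collect the concatenated child lines of each <tag> ... </tag> block."""
    opening = re.compile(r"^\s*<%s[\s>]" % re.escape(tag))
    closing = re.compile(r"^\s*</%s\s*>" % re.escape(tag))
    results, children, inside = [], [], False
    for line in lines:
        if opening.match(line):
            # Opening tag: start collecting children
            inside, children = True, []
        elif closing.match(line) and inside:
            # Closing tag: emit everything collected so far as one string
            results.append("".join(children))
            inside = False
        elif inside:
            # Child line: strip surrounding whitespace and keep it
            children.append(line.strip())
    return results


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for row in extract_children(handle):
            print(row)
```

Like the awk version, this does no real XML parsing, so it works on the unmodified data from the question.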


And here is a Bash script that produces the desired output:



#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
# (read strips leading/trailing whitespace; -r keeps backslashes literal)
while read -r line; do

    # Set the instance flag to true if we come across an opening tag
    if echo "$line" | grep -q '<instance.*>'; then
        instance=1

    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "$line" | grep -q '</instance>'; then
        instance=0
        echo

    # If we're inside an instance tag then print the child element
    elif [[ $instance == 1 ]]; then
        # Use an explicit format so "%" in the data is not interpreted
        printf '%s' "$line"
    fi

done < "$1"


You would execute it like this:



bash extract_instance_children.bash data.xml


Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:



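#!/usr/bin/env python3
"""Extract the children of each "instance" element using Beautiful Soup."""

import sys
from bs4 import BeautifulSoup as Soup

# Read the whole file and parse it leniently with the HTML parser
with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")

# Print the direct child elements of each "instance" element on one line
for instance in soup.find_all('instance'):
    print(''.join([str(child) for child in instance.find_all(True, recursive=False)]))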



























  • Thanks. The bash script is working as expected. Thanks for your prompt help. My problem has been resolved; no need to spend more time on it.
    – Abhi S
    Nov 11 '17 at 15:44






  • @AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
    – igal
    Nov 11 '17 at 16:21










  • @AbhiS If this solution worked for you, could you please accept it?
    – igal
    Feb 1 at 1:47






























This may be of help:



#!/bin/bash

awk -v tag=instance -v p=0 '{
    if($0~("^<"tag)){p=1;next}
    if($0~("^</"tag)){p=0;printf("\n");next}
    if(p==1){$1=$1;printf("%s",$0)}
}' infile


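This assumes that the Sample text in your sed example is a mistake, and it keeps things simple.



The p variable decides when to print. Assigning $1=$1 makes awk rebuild the record from its fields, which removes leading and trailing whitespace (and collapses internal runs of whitespace into single spaces).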
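The effect of the $1=$1 assignment can be illustrated outside awk as well. With awk's default field and output separators, rebuilding the record behaves roughly like splitting the line on whitespace and joining the pieces with single spaces. The Python function below (rebuild_record is a name invented for this sketch) mimics that behaviour:

```python
def rebuild_record(line):
    """Mimic awk's `$1=$1` with the default field separators."""
    # split() with no arguments drops leading/trailing whitespace and
    # splits on runs of whitespace; join() plays the role of OFS (a space).
    return " ".join(line.split())


if __name__ == "__main__":
    print(repr(rebuild_record("    <a1>aa</a1>   ")))
```

Note that any whitespace inside an element's text is collapsed too, which may or may not be acceptable for a given data set.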




























  • This actually didn't work for me; I got no output at all. I changed the if($0~"\<"tag) condition to if($0~"</"tag) and got the expected output, but not in the correct format (there was additional whitespace).
    – igal
    Nov 12 '17 at 0:06










  • @igal Maybe now, answer edited (made even simpler), spaces removed.
    – Arrow
    Nov 12 '17 at 2:15










  • @Arrow Yup! Very nice. Upvoted!
    – igal
    Nov 12 '17 at 4:27











Your Answer







StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f403904%2fextract-the-children-of-a-specific-xml-element-type%23new-answer', 'question_page');

);

Post as a guest






























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
7
down vote













If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:



  • xmlstarlet

  • xmllint

  • BaseX

  • XQilla

You should also be aware that there are several XML-specific programming/query languages:



  • XPath

  • XQuery

  • XSLT

Note that (in order to be valid XML) your XML data needs a root node and that your attribute values should be quoted, i.e. your data file should look more like this:



<!-- data.xml -->

<instances>

<instance ab='1'>
<a1>aa</a1>
<a2>aa</a2>
</instance>

<instance ab='2'>
<b1>bb</b1>
<b2>bb</b2>
</instance>

<instance ab='3'>
<c1>cc</c1>
<c2>cc</c2>
</instance>

</instances>


If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:



xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml


This produces the following output:



<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>


Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:



#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
import xml.etree.ElementTree

# Load the data
tree = xml.etree.ElementTree.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
print(''.join([xml.etree.ElementTree.tostring(child).strip() for child in instance]))


And here is how you could run the script:



python extract_instance_children.py data.xml


This uses the xml package from the Python Standard Library which is also a strict XML parser.



If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):



#!/usr/bin/env awk

# extract_instance_children.awk

BEGIN
addchild=0;
children="";



# Opening tag for "instance" element - set the "addchild" flag
if($0 ~ "^ *<instance[^<>]+>")
addchild=1;


# Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
else if($0 ~ "^ *</instance>" && addchild == 1)
addchild=0;
printf("%sn", children);
children="";


# Concatenating child elements - strip whitespace
else if (addchild == 1)
gsub(/^[ t]+/,"",$0);
gsub(/[ t]+$/,"",$0);
children=children $0;




To execute the script from a file, you would use a command like this one:



awk -f extract_instance_children.awk data.xml


And here is a Bash script that produces the desired output:



#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
while read line; do

# Set the instance flag to true if we come across an opening tag
if echo "$line" | grep -q '<instance.*>'; then
instance=1

# Set the instance flag to false and print a newline if we come across a closing tag
elif echo "$line" | grep -q '</instance>'; then
instance=0
echo

# If we're inside an instance tag then print the child element
elif [[ $instance == 1 ]]; then
printf "$line"
fi

done < "$1"


You would execute it like this:



bash extract_instance_children.bash data.xml


Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:



#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
soup = Soup(xmlfile.read(), "html.parser")
for instance in soup.findAll('instance'):
print(''.join([str(child) for child in instance.findChildren()]))





share|improve this answer






















  • Thanks. bash script is working as expected.. Thanks for your prompt and quick help.. My problem had been resolved.. No need to speed more time on it.
    – Abhi S
    Nov 11 '17 at 15:44






  • 2




    @AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
    – igal
    Nov 11 '17 at 16:21










  • @AbhiS If this solution worked for you, could you please accept it?
    – igal
    Feb 1 at 1:47














up vote
7
down vote













If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:



  • xmlstarlet

  • xmllint

  • BaseX

  • XQilla

You should also be aware that there are several XML-specific programming/query languages:



  • XPath

  • XQuery

  • XSLT

Note that (in order to be valid XML) your XML data needs a root node and that your attribute values should be quoted, i.e. your data file should look more like this:



<!-- data.xml -->

<instances>

<instance ab='1'>
<a1>aa</a1>
<a2>aa</a2>
</instance>

<instance ab='2'>
<b1>bb</b1>
<b2>bb</b2>
</instance>

<instance ab='3'>
<c1>cc</c1>
<c2>cc</c2>
</instance>

</instances>


If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:



xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml


This produces the following output:



<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>


Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:



#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
import xml.etree.ElementTree

# Load the data
tree = xml.etree.ElementTree.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
print(''.join([xml.etree.ElementTree.tostring(child).strip() for child in instance]))


And here is how you could run the script:



python extract_instance_children.py data.xml


This uses the xml package from the Python Standard Library which is also a strict XML parser.



If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):



#!/usr/bin/env awk

# extract_instance_children.awk

BEGIN
addchild=0;
children="";



# Opening tag for "instance" element - set the "addchild" flag
if($0 ~ "^ *<instance[^<>]+>")
addchild=1;


# Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
else if($0 ~ "^ *</instance>" && addchild == 1)
addchild=0;
printf("%sn", children);
children="";


# Concatenating child elements - strip whitespace
else if (addchild == 1)
gsub(/^[ t]+/,"",$0);
gsub(/[ t]+$/,"",$0);
children=children $0;




To execute the script from a file, you would use a command like this one:



awk -f extract_instance_children.awk data.xml


And here is a Bash script that produces the desired output:



#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
while read line; do

# Set the instance flag to true if we come across an opening tag
if echo "$line" | grep -q '<instance.*>'; then
instance=1

# Set the instance flag to false and print a newline if we come across a closing tag
elif echo "$line" | grep -q '</instance>'; then
instance=0
echo

# If we're inside an instance tag then print the child element
elif [[ $instance == 1 ]]; then
printf "$line"
fi

done < "$1"


You would execute it like this:



bash extract_instance_children.bash data.xml


Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:



#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
soup = Soup(xmlfile.read(), "html.parser")
for instance in soup.findAll('instance'):
print(''.join([str(child) for child in instance.findChildren()]))





share|improve this answer






















  • Thanks. bash script is working as expected.. Thanks for your prompt and quick help.. My problem had been resolved.. No need to speed more time on it.
    – Abhi S
    Nov 11 '17 at 15:44






  • 2




    @AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
    – igal
    Nov 11 '17 at 16:21










  • @AbhiS If this solution worked for you, could you please accept it?
    – igal
    Feb 1 at 1:47












up vote
7
down vote










up vote
7
down vote









If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:



  • xmlstarlet

  • xmllint

  • BaseX

  • XQilla

You should also be aware that there are several XML-specific programming/query languages:



  • XPath

  • XQuery

  • XSLT

Note that (in order to be valid XML) your XML data needs a root node and that your attribute values should be quoted, i.e. your data file should look more like this:



<!-- data.xml -->

<instances>

<instance ab='1'>
<a1>aa</a1>
<a2>aa</a2>
</instance>

<instance ab='2'>
<b1>bb</b1>
<b2>bb</b2>
</instance>

<instance ab='3'>
<c1>cc</c1>
<c2>cc</c2>
</instance>

</instances>


If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:



xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml


This produces the following output:



<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>


Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:



#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
import xml.etree.ElementTree

# Load the data
tree = xml.etree.ElementTree.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
print(''.join([xml.etree.ElementTree.tostring(child).strip() for child in instance]))


And here is how you could run the script:



python extract_instance_children.py data.xml


This uses the xml package from the Python Standard Library which is also a strict XML parser.



If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):



#!/usr/bin/env awk

# extract_instance_children.awk

BEGIN
addchild=0;
children="";



# Opening tag for "instance" element - set the "addchild" flag
if($0 ~ "^ *<instance[^<>]+>")
addchild=1;


# Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
else if($0 ~ "^ *</instance>" && addchild == 1)
addchild=0;
printf("%sn", children);
children="";


# Concatenating child elements - strip whitespace
else if (addchild == 1)
gsub(/^[ t]+/,"",$0);
gsub(/[ t]+$/,"",$0);
children=children $0;




To execute the script from a file, you would use a command like this one:



awk -f extract_instance_children.awk data.xml


And here is a Bash script that produces the desired output:



#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
while read line; do

# Set the instance flag to true if we come across an opening tag
if echo "$line" | grep -q '<instance.*>'; then
instance=1

# Set the instance flag to false and print a newline if we come across a closing tag
elif echo "$line" | grep -q '</instance>'; then
instance=0
echo

# If we're inside an instance tag then print the child element
elif [[ $instance == 1 ]]; then
printf "$line"
fi

done < "$1"


You would execute it like this:



bash extract_instance_children.bash data.xml


Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:



#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
soup = Soup(xmlfile.read(), "html.parser")
for instance in soup.findAll('instance'):
print(''.join([str(child) for child in instance.findChildren()]))





share|improve this answer














If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:



  • xmlstarlet

  • xmllint

  • BaseX

  • XQilla

You should also be aware that there are several XML-specific programming/query languages:



  • XPath

  • XQuery

  • XSLT

Note that (in order to be valid XML) your XML data needs a root node and that your attribute values should be quoted, i.e. your data file should look more like this:



<!-- data.xml -->

<instances>

<instance ab='1'>
<a1>aa</a1>
<a2>aa</a2>
</instance>

<instance ab='2'>
<b1>bb</b1>
<b2>bb</b2>
</instance>

<instance ab='3'>
<c1>cc</c1>
<c2>cc</c2>
</instance>

</instances>


If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:



xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml


This produces the following output:



<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>


Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:



#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""extract_instance_children.bash"""

import sys
import xml.etree.ElementTree

# Load the data
tree = xml.etree.ElementTree.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
print(''.join([xml.etree.ElementTree.tostring(child).strip() for child in instance]))


And here is how you could run the script:



python extract_instance_children.py data.xml


This uses the xml package from the Python Standard Library which is also a strict XML parser.



If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):



#!/usr/bin/env awk

# extract_instance_children.awk

BEGIN {
    addchild=0;
    children="";
}

{
    # Opening tag for "instance" element - set the "addchild" flag
    if($0 ~ "^ *<instance[^<>]+>") {
        addchild=1;
    }

    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }

    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}



To execute the script from a file, you would use a command like this one:



awk -f extract_instance_children.awk data.xml


And here is a Bash script that produces the desired output:



#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
# (read trims leading/trailing whitespace; -r keeps backslashes literal)
while read -r line; do

    # Set the instance flag to true if we come across an opening tag
    if echo "$line" | grep -q '<instance.*>'; then
        instance=1

    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "$line" | grep -q '</instance>'; then
        instance=0
        echo

    # If we're inside an instance tag then print the child element
    # (use a fixed format string so that % characters in the data are printed literally)
    elif [[ $instance == 1 ]]; then
        printf '%s' "$line"
    fi

done < "$1"


You would execute it like this:



bash extract_instance_children.bash data.xml
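For comparison, the same flag-based, line-by-line idea used by the awk and Bash scripts above can be sketched in Python. This is not from the original answer: the function name extract_children, its tag parameter, and the inline sample are assumptions for illustration. Like the shell versions, it assumes that every opening tag, closing tag, and child element sits on its own line.

```python
import re

SAMPLE = """<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab=3 >
<c1>cc</c1>
<c2>cc</c2>
</instance>"""


def extract_children(lines, tag="instance"):
    """Yield one string per <tag> block: its child lines, stripped and joined."""
    # The tag name must be followed by whitespace or '>' so that
    # a wrapper such as <instances> is not mistaken for <instance>.
    opening = re.compile(r"^\s*<" + re.escape(tag) + r"[\s>]")
    closing = re.compile(r"^\s*</" + re.escape(tag) + r"\s*>")
    inside = False
    children = []
    for line in lines:
        if closing.match(line):
            if inside:
                yield "".join(children)
            inside = False
            children = []
        elif opening.match(line):
            inside = True
            children = []
        elif inside:
            children.append(line.strip())


for row in extract_children(SAMPLE.splitlines()):
    print(row)
```

Because no XML parsing is involved, this works on the original, not-quite-valid data, but it will break as soon as the layout changes (for example, if a child element shares a line with its parent tag).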


Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:



#!/usr/bin/env python3
# -*- encoding: ascii -*-
"""extract_instance_children.py"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")
    for instance in soup.find_all('instance'):
        # recursive=False restricts the search to direct children, so nested elements are not printed twice
        print(''.join([str(child) for child in instance.find_all(recursive=False)]))






edited Mar 30 at 19:19
answered Nov 11 '17 at 15:07
igal

  • Thanks. The bash script is working as expected. Thanks for your prompt and quick help. My problem has been resolved. No need to spend more time on it.
    – Abhi S
    Nov 11 '17 at 15:44

  • @AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
    – igal
    Nov 11 '17 at 16:21

  • @AbhiS If this solution worked for you, could you please accept it?
    – igal
    Feb 1 at 1:47




























up vote
3
down vote













This may be of help:



#!/bin/bash

awk -vtag=instance -vp=0 '{
if($0~("^<"tag)){p=1;next}
if($0~("^</"tag)){p=0;printf("\n");next}
if(p==1){$1=$1;printf("%s",$0)}
}' infile


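This assumes that the Sample text in your sed command is a mistake, and it keeps things simple.



The p variable controls when lines are printed. The assignment $1=$1 makes awk rebuild the record from its fields, which removes leading and trailing whitespace (and collapses runs of whitespace between fields into single spaces).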
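
For readers less familiar with awk, the effect of $1=$1 can be approximated in Python as shown below. This is only a rough analogue, assuming awk's default field separator and default output field separator; the function name rebuild_record is hypothetical.

```python
def rebuild_record(line):
    # Split on runs of whitespace (which also drops leading and trailing
    # whitespace), then join the fields with single spaces, roughly what awk
    # does when a field is assigned with the default FS and OFS.
    return " ".join(line.split())


print(repr(rebuild_record("   <a1>aa</a1>  ")))      # '<a1>aa</a1>'
print(repr(rebuild_record("<a1>some   text</a1>")))  # '<a1>some text</a1>'
```

The second call shows the side effect worth knowing about: whitespace inside an element's text is normalized too, not just the indentation.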






edited Nov 12 '17 at 14:13
answered Nov 11 '17 at 14:44
Arrow

  • This actually didn't work for me; I got no output at all. I changed the if($0~"\<"tag) condition to if($0~"</"tag) and got the expected output, but not in the correct format (there was additional whitespace).
    – igal
    Nov 12 '17 at 0:06










  • @igal Maybe now, answer edited (made even simpler), spaces removed.
    – Arrow
    Nov 12 '17 at 2:15










  • @Arrow Yup! Very nice. Upvoted!
    – igal
    Nov 12 '17 at 4:27



































 
