Extract the Children of a Specific XML Element Type

Given a specific XML element (i.e. a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More specifically, I have the following snippet of (not quite valid) XML data:

<!-- data.xml -->

<instance ab=1 >
 <a1>aa</a1>
 <a2>aa</a2>
</instance>
<instance ab=2 >
 <b1>bb</b1>
 <b2>bb</b2>
</instance>
<instance ab=3 >
 <c1>cc</c1>
 <c2>cc</c2>
</instance>

I would like a script or command which takes this data as input and produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

I would like for the solution to use standard text-processing tools such as sed or awk.

I tried using the following sed command, but it did not work:

sed -n '/<Sample/,/</Sample/p' data.xml

edited Nov 11 '17 at 17:48 by igal

asked Nov 11 '17 at 14:10 by Abhi S

It's totally unclear what you're asking for here. Why would the sed command you're using have any effect with the data you're using? The search string you're using doesn't appear in the data. And why do you want to use sed and awk specifically? Are you sure that's a requirement?
– igal
Nov 11 '17 at 14:26

The input file is not a properly formatted XML file. It lacks a single root element.
– Kusalananda
Nov 11 '17 at 14:47

... and it's indented peculiarly.
– G-Man
Nov 11 '17 at 15:16

And the attribute values aren't quoted.
– igal
Nov 11 '17 at 15:40

If your question has been resolved you should accept an answer so that the issue is closed. Otherwise this question will remain open and people may continue to submit solutions.
– igal
Dec 9 '17 at 22:48

2 Answers

If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:

xmlstarlet

xmllint

BaseX

XQilla

You should also be aware that there are several XML-specific programming/query languages:

XPath

XQuery

XSLT

Note that (in order to be well-formed XML) your data needs a single root element and your attribute values must be quoted, i.e. your data file should look more like this:

<!-- data.xml -->

<instances>

 <instance ab='1'>
 <a1>aa</a1>
 <a2>aa</a2>
 </instance>

 <instance ab='2'>
 <b1>bb</b1>
 <b2>bb</b2>
 </instance>

 <instance ab='3'>
 <c1>cc</c1>
 <c2>cc</c2>
 </instance>

</instances>
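To see the difference in practice, here is a minimal Python sketch that hands a shortened copy of each version of the data to the standard library's xml.etree.ElementTree parser. The helper name is_well_formed and the two inline strings are illustrative assumptions and not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the original data: no single root element,
# and the attribute values are not quoted.
ORIGINAL = """<instance ab=1 >
 <a1>aa</a1>
</instance>
<instance ab=2 >
 <b1>bb</b1>
</instance>"""

# Shortened copy of the repaired data: one root element,
# and the attribute values are quoted.
REPAIRED = """<instances>
 <instance ab='1'>
  <a1>aa</a1>
 </instance>
 <instance ab='2'>
  <b1>bb</b1>
 </instance>
</instances>"""


def is_well_formed(text):
    """Return True if the strict parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("original:", is_well_formed(ORIGINAL))
    print("repaired:", is_well_formed(REPAIRED))
```

Any strict XML tool should reject the first string for the same reasons, which is why the data has to be repaired before the parser-based approaches can be used.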

If your data is well-formed XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:

xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml

This produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:

#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
import xml.etree.ElementTree as ET

# Load the data
tree = ET.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
    # encoding="unicode" makes tostring() return str instead of bytes
    print(''.join(ET.tostring(child, encoding="unicode").strip() for child in instance))

And here is how you could run the script:

python3 extract_instance_children.py data.xml

This uses the xml.etree.ElementTree module from the Python Standard Library, which (like xmlstarlet) is a strict XML parser and will reject data that is not well-formed.

If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):

#!/usr/bin/awk -f

# extract_instance_children.awk

BEGIN {
    addchild=0;
    children="";
}

{
    # Opening tag for "instance" element - set the "addchild" flag
    if($0 ~ "^ *<instance[^<>]+>") {
        addchild=1;
    }

    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }

    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}

To execute the script from a file, you would use a command like this one:

awk -f extract_instance_children.awk data.xml
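For readers who find the flag-based logic easier to follow in another language, here is a rough Python sketch of the same line-by-line state machine. It is an illustration only: the function name extract_children, the regular expressions, and the inline SAMPLE data are assumptions, and the opening-tag pattern is slightly stricter than the awk one (it uses a word boundary so that a tag named instances is not treated as an instance).

```python
import re

# An opening <instance ...> tag; \b keeps <instances> from matching.
OPEN_TAG = re.compile(r"^\s*<instance\b[^<>]*>")
# A closing </instance> tag.
CLOSE_TAG = re.compile(r"^\s*</instance>")


def extract_children(lines):
    """Return one string per instance block, with its child lines concatenated."""
    results = []
    inside = False   # plays the role of the "addchild" flag
    children = []    # plays the role of the "children" string
    for line in lines:
        if OPEN_TAG.match(line):
            inside = True
            children = []
        elif CLOSE_TAG.match(line) and inside:
            inside = False
            results.append("".join(children))
        elif inside:
            # Strip surrounding whitespace before concatenating
            children.append(line.strip())
    return results


SAMPLE = """<instance ab=1 >
 <a1>aa</a1>
 <a2>aa</a2>
</instance>
<instance ab=2 >
 <b1>bb</b1>
 <b2>bb</b2>
</instance>
"""

if __name__ == "__main__":
    for row in extract_children(SAMPLE.splitlines()):
        print(row)
```

Like the awk script, this assumes every tag sits on its own line; it is a text filter, not an XML parser.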

And here is a Bash script that produces the desired output:

#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file; with the default IFS, read also
# trims leading and trailing whitespace from each line
while read -r line; do

    # Set the instance flag to true if we come across an opening tag
    if echo "$line" | grep -q '<instance.*>'; then
        instance=1

    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "$line" | grep -q '</instance>'; then
        instance=0
        echo

    # If we're inside an instance tag then print the child element
    elif [[ $instance == 1 ]]; then
        # Pass the line as an argument so that any % or \ in it is printed literally
        printf '%s' "$line"
    fi

done < "$1"

You would execute it like this:

bash extract_instance_children.bash data.xml

Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:

#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")

for instance in soup.find_all('instance'):
    # recursive=False limits the search to the direct children of each instance
    print(''.join(str(child) for child in instance.find_all(recursive=False)))

edited Mar 30 at 19:19

answered Nov 11 '17 at 15:07 by igal

Thanks. The bash script is working as expected. Thanks for your prompt help. My problem has been resolved; no need to spend more time on it.
– Abhi S
Nov 11 '17 at 15:44

@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
– igal
Nov 11 '17 at 16:21

@AbhiS If this solution worked for you, could you please accept it?
– igal
Feb 1 at 1:47

This may be of help:

#!/bin/bash

awk -v tag=instance -v p=0 '{
if($0~("^<"tag)){p=1;next}
if($0~("^</"tag)){p=0;printf("\n");next}
if(p==1){$1=$1;printf("%s",$0)}
}' infile

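This assumes that the Sample text in your example is a mistake (the data uses instance), and it keeps things simple.

The p variable decides when to print. Assigning $1=$1 makes awk rebuild the record from its fields, which removes leading spaces.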
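As a rough illustration of what the $1=$1 assignment does, here is a small Python sketch that mimics it for awk's default field separator and output field separator. The function name rebuild_record is hypothetical, and the sketch only approximates awk's behavior.

```python
def rebuild_record(line):
    """Approximate awk's $1=$1 with default FS and OFS (illustrative helper).

    awk splits the record on runs of blanks and then rejoins the fields
    with a single space, so leading and trailing whitespace disappears.
    """
    return " ".join(line.split())


if __name__ == "__main__":
    print(repr(rebuild_record("   <a1>aa</a1>  ")))
    print(repr(rebuild_record("\t<x>a   b</x>")))
```

One side effect worth knowing about: runs of whitespace inside the element text are collapsed to a single space as well, which is harmless for the sample data but may matter for other input.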

edited Nov 12 '17 at 14:13

answered Nov 11 '17 at 14:44 by Arrow

This actually didn't work for me; I got no output at all. I changed the if($0~"\<"tag) condition to if($0~"</"tag) and got the expected output, but not in the correct format (there was additional whitespace).
– igal
Nov 12 '17 at 0:06

@igal Maybe now, answer edited (made even simpler), spaces removed.
– Arrow
Nov 12 '17 at 2:15

@Arrow Yup! Very nice. Upvoted!
– igal
Nov 12 '17 at 4:27
