How to parse the file

I have the file1.txt shown below. For each record, I want to take value1 through value7 and output them on one row. A record is everything between the words "Start" and "End". If a label/value pair is missing from a record, the output should show "NA" in its place.

Please see the wanted output.txt below.

In short, I want to copy the values between Start and End and output them on one line. If a label doesn't exist, its value should show as NA. Scanning should then continue with the next record (Start through End) until the end of file1.txt.

file1.txt

Start

label1 label2 label3 label4

value1 value2 value3 value4

label5

value5

label6 label7

value6 value7

End


Start

label1 label2 label4

valueA valueB valueD

label5

valueE

label6 

valueF 

End


Start
.
.
.
End

output.txt

label1 label2 label3 label4 label5 label6 label7

value1 value2 value3 value4 value5 value6 value7

valueA valueB NA valueD valueE valueF NA

edited May 9 at 13:41

Jeff Schaller

31.1k846105

asked May 9 at 13:39

user290080

Did you try anything?
– Kiwy
May 9 at 14:09

1 Answer
This Python script should do what you want:

#!/usr/bin/env python
# -*- encoding: ascii -*-
"""parse.py

Parses a custom-format data-file.
Processes the file first and then prints the results.
"""

import sys

# Initialize a dictionary to collect the values for each label
# (column order relies on dicts preserving insertion order, Python 3.7+)
labels = {}

# Initialize a stack to keep track of block state
stack = []

# Initialize a counter to count the number of blocks
block = 0

# Read the data from the file and process it line by line
with open(sys.argv[1], 'r') as file:
    for line in file:

        # Remove white-space
        line = line.strip()

        # The stack should be empty when we start a new block
        if line.lower() == "start":
            if stack:
                raise Exception("Invalid File Format: Bad Start")
            else:
                stack.append(line)

        # Otherwise the bottom of the stack should be a "Start"
        # When we reach the end of a block we empty the stack
        # and increment the block counter
        elif line.lower() == "end":
            if not stack or stack[0].lower() != "start":
                raise Exception("Invalid File Format: Bad End")
            else:
                block += 1
                stack = []

        # Other lines should come in consecutive label/value pairs
        # i.e. a value row should follow a label row
        elif line:

            # If there are an odd number of data rows in the stack then
            # the current row should be a value row - check that it matches
            # the corresponding label row
            if len(stack[1:]) % 2 == 1:

                _labels = stack[-1].split()
                _values = line.split()

                # Verify that the label row and value row have the same number
                # of columns
                if len(_labels) == len(_values):
                    stack.append(line)
                    for label, value in zip(_labels, _values):

                        # Add new labels to the labels dictionary
                        if label not in labels:
                            labels[label] = {
                                "cols": len(label)
                            }

                        # Add the value for the current block
                        labels[label][block] = value

                        # Keep track of the longest value for each label
                        # so we can format the output later
                        if len(value) > labels[label]["cols"]:
                            labels[label]["cols"] = len(value)
                else:
                    raise Exception("Invalid File Format: Label/Value Mismatch")

            # If there are an even number of data rows in the stack then
            # the current row should be a label row - append it to the stack
            else:
                stack.append(line)

# Construct the header row
header = ""
for label in labels:
    cols = labels[label]["cols"]
    header += "{0: <{width}}".format(label, width=cols+1)

# Construct the data rows
rows = []
for i in range(0, block):
    row = ""
    for label in labels:
        cols = labels[label]["cols"]
        row += "{0: <{width}}".format(labels[label].get(i, "NA"), width=cols+1)
    rows.append(row)

# Print the results
print(header)
for row in rows:
    print(row)

You can run it like this:

python parse.py file1.txt

It produces the following output on your example data:

label1 label2 label3 label4 label5 label6 label7
value1 value2 value3 value4 value5 value6 value7
valueA valueB NA     valueD valueE valueF NA
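
For comparison, here is a minimal sketch of the same approach split into two functions, one that collects each Start/End block into a dictionary and one that pads the columns. The names `parse_records` and `format_table` are hypothetical and only used for illustration. The sketch assumes labels and values never contain whitespace, that every label row is immediately followed by its value row, and that the elided third record (`Start . . . End`) is left out of the sample.

```python
def parse_records(lines):
    """Return (labels, records): labels in first-seen order, one dict per block."""
    labels, records = [], []
    current = None   # dict for the block being read, None outside a block
    pending = None   # label row still waiting for its value row

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        word = line.lower()
        if word == "start":
            if current is not None:
                raise ValueError("Start inside an open block")
            current, pending = {}, None
        elif word == "end":
            if current is None or pending is not None:
                raise ValueError("End without Start, or label row without values")
            records.append(current)
            current = None
        elif current is None:
            raise ValueError("data outside a Start/End block")
        elif pending is None:
            pending = line.split()          # this is a label row
        else:
            values = line.split()           # this is the matching value row
            if len(values) != len(pending):
                raise ValueError("label/value count mismatch")
            for label, value in zip(pending, values):
                if label not in labels:
                    labels.append(label)
                current[label] = value
            pending = None
    if current is not None:
        raise ValueError("missing End")
    return labels, records


def format_table(labels, records, missing="NA"):
    """Return the header and one row per record, left-aligned in padded columns."""
    rows = [labels] + [[rec.get(name, missing) for name in labels] for rec in records]
    widths = [max(len(row[i]) for row in rows) for i in range(len(labels))]
    return [" ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
            for row in rows]


# Sample data: the first two records from file1.txt, without the blank lines
sample = """\
Start
label1 label2 label3 label4
value1 value2 value3 value4
label5
value5
label6 label7
value6 value7
End
Start
label1 label2 label4
valueA valueB valueD
label5
valueE
label6
valueF
End
"""

labels, records = parse_records(sample.splitlines())
for row in format_table(labels, records):
    print(row)
```

Keeping the labels in a separate list makes the column order explicit instead of depending on dictionary ordering, and storing one dictionary per record means a missing label is simply an absent key that is filled with "NA" at print time.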

answered May 10 at 0:01

igal

4,785930
