Combine text files by title using grep awk sed

I'm trying to combine multiple files into one final file. Each file has many entries, most with overlapping titles. I would like to merge the content of both files under the title headings.
Think of combining two dictionaries and it makes more sense. Entries for a single word can be found in both, but definitions differ slightly in each. Some entries exist in one and not the other, etc.

For example, I'd like to merge these two files to produce a single output file:

File 1

Entry 1
Green Trees
Entry 3
Orange Fibers

File 2

Entry 1
Red Trees
Entry 2
Spotted Zebras
Entry 3
Blue Fibers

Output File

Entry 1
Green Trees
Red Trees
Entry 2
Spotted Zebras
Entry 3
Orange Fibers
Blue Fibers

Note that Entry 2 did not exist in File 1, but made it to the final product. Likewise, the content of each entry was merged anywhere the entry ID matches.

How can I accomplish this?

EDIT: The above is a simplified version for asking the question. Below is a sample of actual entries in the files.

The $$$00001 is the Entry title.

From File 1

$$$00001
<b><br>- Original: Α<b><br></b></b>- Transliteration: A<b><br></b></b>- Phonetic: al'-fah<b><br></b></b>-...
$$$00002
<b><br>- Original: script<b><br></b></b>- Translitera...

From File 2

$$$00001
<b><br>α<b><br></b></b>a; indeclinable...
$$$00002
<b><br>texts<b><br></b></b>A...

edited Apr 10 at 8:53

asked Apr 10 at 8:18

Matt Zabojnik

Are the headings all of the format Entry <num>?
– muru
Apr 10 at 8:43

Entry <num> is a simplified version of the headings in order to ask the question. More realistically, they will be zero-padded numbered entries with 5 digits.
– Matt Zabojnik
Apr 10 at 8:47

Well, how do we identify the headings then?
– muru
Apr 10 at 8:47

I've updated my question with a real example for clarity.
– Matt Zabojnik
Apr 10 at 8:53

1 Answer

accepted

A simple awk one-liner solves your example:

awk '/^Entry/{k=$0;next}{g[k]=g[k]"\n"$0}END{for(k in g)print k g[k]}' file1 file2

As you probably know, awk processes input lines one after another according to a program. This particular program is given as the first argument and consists of three statements. Let's analyze them one by one:

/^Entry/{k=$0;next} means: if the processed line matches /^Entry/, store it in the variable k and go on to the next input line, skipping the statements that follow.

{g[k]=g[k]"\n"$0} has no preceding condition, so it runs for every line that reaches it (that is, every line that is not a title), and means: update the value stored in the dictionary g under the key k. The new value is the concatenation of the (possibly empty) previous value g[k], a newline "\n", and the current line.

END{for(k in g)print k g[k]} has an END condition and is therefore executed once all input lines have been processed. It says: for each key in g, that is, for each title that has appeared in the input files, print the title followed by its associated value, which is the concatenation of all lines found under that title in the input files.
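
One caveat: awk does not define the order in which for (k in g) visits the keys, so the merged entries may not come out in title order. Below is a minimal sketch of one way around that, assuming plain string ordering of the titles is what you want. It remembers each title the first time it appears and sorts that list in the END block. The names keys and n are introduced here for illustration and are not part of the one-liner. The sketch recreates the two sample files from the question in a scratch directory so it can be run as-is.

```shell
# Recreate the two sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'Entry 1' 'Green Trees' 'Entry 3' 'Orange Fibers' > "$dir/file1"
printf '%s\n' 'Entry 1' 'Red Trees' 'Entry 2' 'Spotted Zebras' 'Entry 3' 'Blue Fibers' > "$dir/file2"

merged=$(LC_ALL=C awk '
    # Title line: make it the current key and record it the first time it is seen.
    /^Entry/ {
        k = $0
        if (!(k in g)) { keys[++n] = k; g[k] = "" }
        next
    }
    # Any other line: append it to the text collected under the current title.
    { g[k] = g[k] "\n" $0 }
    END {
        # Insertion sort of the recorded titles, compared as strings.
        for (i = 2; i <= n; i++) {
            t = keys[i]
            for (j = i - 1; j >= 1 && (keys[j] "") > (t ""); j--)
                keys[j + 1] = keys[j]
            keys[j + 1] = t
        }
        # Print each title followed by everything collected under it.
        for (i = 1; i <= n; i++) print keys[i] g[keys[i]]
    }
' "$dir/file1" "$dir/file2")

printf '%s\n' "$merged"
rm -rf "$dir"
```

Two things to keep in mind with this sketch: lines that appear before the first title are dropped, and string ordering puts "Entry 10" before "Entry 2", which is not a problem for zero-padded titles.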

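To use it on your real files, you have to replace /^Entry/ with the correct pattern (probably /^\$\$\$/; the dollar signs must be escaped, because an unescaped $ is the end-of-line anchor in a regular expression).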
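
Before merging the real files, it may help to check that the title test picks up exactly the lines you expect. The sketch below lists the distinct titles found across both files. Since $ normally acts as the end-of-line anchor in a regular expression, it shows two ways to match a literal $$$ prefix: a plain string test with index(), and an escaped regex. The body lines are hypothetical placeholders, not the question's data; only the $$$NNNNN title format comes from the question.

```shell
# Scratch files with hypothetical placeholder bodies; only the title format is from the question.
dir=$(mktemp -d)
printf '%s\n' '$$$00001' 'placeholder body A1' '$$$00002' 'placeholder body A2' > "$dir/file1"
printf '%s\n' '$$$00001' 'placeholder body B1' '$$$00003' 'placeholder body B3' > "$dir/file2"

# index($0, "$$$") == 1 is true when the line starts with the literal text $$$,
# so no regex escaping is needed. sort -u leaves one copy of each title.
titles=$(awk 'index($0, "$$$") == 1' "$dir/file1" "$dir/file2" | LC_ALL=C sort -u)
printf '%s\n' "$titles"

# The same test written as a regex, with each dollar sign escaped.
# Here it counts the title lines across both files.
count=$(awk '/^\$\$\$/ { n++ } END { print n + 0 }' "$dir/file1" "$dir/file2")
printf 'title lines: %s\n' "$count"

rm -rf "$dir"
```

Either test can take the place of /^Entry/ in the merge program.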

edited Apr 10 at 10:54

answered Apr 10 at 9:14

Dario

Excellent, this solution worked perfectly. Thank you! Would you mind explaining a little further on what's going on in there?
– Matt Zabojnik
Apr 10 at 9:46

@MattZabojnik Done.
– Dario
Apr 10 at 10:54
