Combine text files by title using grep awk sed
I'm trying to combine multiple files into one final file. Each file contains many entries, most with titles that also appear in the other files. I would like to merge the content of all files under the matching title headings.
Think of combining two dictionaries and it makes more sense. Entries for a single word can be found in both, but definitions differ slightly in each. Some entries exist in one and not the other, etc.
For example, I'd like to merge these two files to produce a single output file:
File 1:

    Entry 1
    Green Trees
    Entry 3
    Orange Fibers

File 2:

    Entry 1
    Red Trees
    Entry 2
    Spotted Zebras
    Entry 3
    Blue Fibers

Output File:

    Entry 1
    Green Trees
    Red Trees
    Entry 2
    Spotted Zebras
    Entry 3
    Orange Fibers
    Blue Fibers

Note that Entry 2 did not exist in File 1, but made it to the final product. Likewise, the content of each entry was merged anywhere the entry ID matches.
How can I accomplish this?
EDIT: The above is a simplified version for asking the question. Below is a sample of actual entries in the files.
The $$$00001 is the Entry title.
From File 1:

    $$$00001
    <b><br>- Original: Α<b><br></b></b>- Transliteration: A<b><br></b></b>- Phonetic: al'-fah<b><br></b></b>-...
    $$$00002
    <b><br>- Original: script<b><br></b></b>- Translitera...

From File 2:

    $$$00001
    <b><br>α<b><br></b></b>a; indeclinable...
    $$$00002
    <b><br>texts<b><br></b></b>A...

text-processing
Are the headings all of the format Entry <num>?
– muru
Apr 10 at 8:43
Entry <num> is a simplified version of the headings in order to ask the question. More realistically, they will be zero-padded numbered entries with 5 digits.
– Matt Zabojnik
Apr 10 at 8:47
Well, how do we identify the headings then?
– muru
Apr 10 at 8:47
I've updated my question with a real example for clarity.
– Matt Zabojnik
Apr 10 at 8:53
edited Apr 10 at 8:53
asked Apr 10 at 8:18
Matt Zabojnik
1 Answer
accepted
A simple awk one-liner solves your example:

    awk '/^Entry/{k=$0;next}{g[k]=g[k]"\n"$0}END{for(k in g)print k g[k]}' file1 file2

I suppose you know that awk basically processes input lines one after another according to a program. This particular awk program is given as the first argument and consists of three statements. Let's analyze them one by one:

/^Entry/{k=$0;next} means: if the current line matches /^Entry/, store it in the variable k and move on to the next line, skipping the following statements.

{g[k]=g[k]"\n"$0} has no preceding condition, so it runs for every line that reaches it. It means: update the value stored in the dictionary g under the key k. The new value is the concatenation of the (possibly empty) previous value g[k], a newline "\n", and the current line.

END{for(k in g)print k g[k]} has an END condition and is therefore executed once all input lines have been processed. It says: for each key in g, that is, for each title that has appeared in the input files, print the key followed by the associated value, which is the concatenation of all lines found under that title in the input files.
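The three statements can also be written out as a commented script. The following is only a minimal sketch: it recreates the two example files from the question in a scratch directory (the variable names tmpdir and merged are illustrative, not part of the answer) and runs the same program on them. Keep in mind that awk does not define the order in which for (k in g) visits the keys, so the entries may not come out in numeric order.

```shell
#!/bin/sh
# Recreate the two example files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Entry 1' 'Green Trees' 'Entry 3' 'Orange Fibers' > "$tmpdir/file1"
printf '%s\n' 'Entry 1' 'Red Trees' 'Entry 2' 'Spotted Zebras' \
              'Entry 3' 'Blue Fibers' > "$tmpdir/file2"

# Same program as the one-liner, spread out with one comment per statement.
merged=$(awk '
    # Heading line: remember it as the current key and read the next line.
    /^Entry/ { k = $0; next }
    # Any other line: append it to the text collected under the current key.
    { g[k] = g[k] "\n" $0 }
    # After the last line: print every heading followed by its collected text.
    END { for (k in g) print k g[k] }
' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

Each heading is printed once, followed by its lines from file1 and then from file2, because the files are read in the order given on the command line.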
To use it on your real files, you have to replace /^Entry/ with the correct pattern (probably /^\$\$\$/; each $ needs a backslash because an unescaped $ matches the end of the line).
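For headings such as $$$00001, every $ in the pattern has to be escaped. The sketch below is a variant under stated assumptions, not the answer's exact program: the entry bodies are short made-up stand-ins for the real data, and the END block adds a small insertion sort over the headings (an addition of this example) so that the output order no longer depends on how awk happens to walk the array. Plain string comparison is enough here because the numbers are zero-padded to the same width.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Made-up stand-ins for the real entries; only the $$$ headings matter here.
printf '%s\n' '$$$00001' 'file1 text for 1' \
              '$$$00003' 'file1 text for 3' > "$tmpdir/file1"
printf '%s\n' '$$$00001' 'file2 text for 1' \
              '$$$00002' 'file2 text for 2' \
              '$$$00003' 'file2 text for 3' > "$tmpdir/file2"

merged=$(awk '
    # Heading line (three literal dollar signs): remember it as the key.
    /^\$\$\$/ { k = $0; next }
    # Body line: append it to the text collected under the current key.
    { g[k] = g[k] "\n" $0 }
    END {
        # Copy the headings into a numbered list.
        n = 0
        for (k in g) keys[++n] = k
        # Insertion sort, comparing the headings as strings.
        for (i = 2; i <= n; i++) {
            v = keys[i]
            for (j = i - 1; j >= 1 && keys[j] > v; j--) keys[j + 1] = keys[j]
            keys[j + 1] = v
        }
        # Print each heading followed by its merged body.
        for (i = 1; i <= n; i++) print keys[i] g[keys[i]]
    }
' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

With the stand-in data, the three headings are printed in ascending order, and $$$00002, which exists only in the second file, still appears in the result.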
Excellent, this solution worked perfectly. Thank you! Would you mind explaining a little further what's going on in there?
– Matt Zabojnik
Apr 10 at 9:46
@MattZabojnik Done.
– Dario
Apr 10 at 10:54
edited Apr 10 at 10:54
answered Apr 10 at 9:14
Dario