Merging in Unix
I have a CSV file with vertical bars (|) as the delimiter, like below, to which I need to apply a merging technique in Unix. The file contains hundreds of thousands of records (four fields), but I gave only five records for ease of reading.
field1 |field2 | field3 |field4|
1|abc|def|ghi|
4|ijk|
|lmn|
5||opq|rst|
8|
uvw||xyz|
10|hjg|jsh|nbm|
And I want the output result as
field1|field2|field3|field4|
1|abc|def|ghi|
4|ijk||lmn|
5||opq|rst|
8|uvw||xyz|
10|hjg|jsh|nbm|
Can someone help me achieve this?
text-processing awk sed merge
edited Sep 25 at 20:54 by G-Man
asked Sep 25 at 17:59 by Sankar
so you want leading and trailing spaces around the pipe symbols as well as any newlines except those after every 4th pipe symbol removed? is that correct?
– Sam
Sep 25 at 18:08
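Read literally, the rule Sam describes above can be sketched as a two-stage pipeline: delete every newline, then cut the stream again after every fourth pipe while trimming blanks around the delimiters. The function name rechunk_fields is made up for illustration. The sketch assumes that fields never contain a literal |, and that the whole file fits into a single awk record (some awk implementations cap the number of fields per record, so this may not scale to very large files as written).

```shell
# Hypothetical helper: rebuild 4-field records purely by counting pipes.
rechunk_fields() {
  # Stage 1: join the whole input into one long line.
  cat "$@" | tr -d '\n' |
    # Stage 2: split on pipes (with optional blanks around them) and
    # print a newline after every 4th field. The last awk field is the
    # empty string after the final pipe, so the loop stops before it.
    awk -F'[ \t]*[|][ \t]*' '{
      for (i = 1; i < NF; i++)
        printf "%s|%s", $i, (i % 4 ? "" : "\n")
    }'
}
```

It could be called as rechunk_fields file > merged. Because it counts pipes rather than lines, this variant does not depend on where the line breaks happened to fall in the input.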
I'm sorry if you're stuck with data that look like this. While the answers that have been presented will handle this mangled structure in the best case, it is very precarious (sensitive) to data corruption. For example, if you have a file where every record is split across two lines (every line has two fields), and one line gets deleted (or totally scrambled), the rebuilt (output) file will be wrong from there on. You might want to specify that the first field (and only the first field) of each line is a number, so error checking becomes possible. … (Cont'd)
– G-Man
Sep 25 at 21:09
(Cont'd) … P.S. Is it possible for parts of multiple records to be on the same line? For example, 1|abc|def| / ghi|4|ijk| / |lmn| ? And is it possible for a field to be split across lines? For example, 10|hjg|j / sh|nbm| ?
– G-Man
Sep 25 at 21:09
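The kind of error checking G-Man hints at could look roughly like the sketch below, run on the rebuilt file rather than on the raw input. It is only an illustration: the function name check_records is hypothetical, and it assumes that the first line is a header, that every data record must contain exactly four pipes, and that the first field is always an unsigned integer (the question does not state this; it is G-Man's suggestion).

```shell
# Hypothetical sanity check for the merged output.
check_records() {
  awk -F'|' '
    NR == 1 { next }                     # assumed: line 1 is the header
    NF != 5 || $1 !~ /^[0-9]+$/ {        # 4 pipes give 5 awk fields
      printf "line %d looks malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }           # non-zero status if anything failed
  ' "$@"
}
```

Used as check_records merged, it prints the suspicious lines and returns a non-zero exit status, which makes it easy to chain after any of the merging commands.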
3 Answers
I'm assuming you don't want all those blank lines.
$ cat file
1|abc|def|ghi|
4|ijk|
|lmn|
5||opq|rst|
8|
uvw||xyz|
10|hjg|jsh|nbm|
$ awk -F'|' '{while (NF < 5) {getline nextline; $0 = $0 nextline}}1' file
1|abc|def|ghi|
4|ijk||lmn|
5||opq|rst|
8|uvw||xyz|
10|hjg|jsh|nbm|
Update for the question edit: remove whitespace around the field separator
awk -F'[[:blank:]]*[|][[:blank:]]*' -v OFS='|' '{
while (NF < 5) {getline nextline; $0 = $0 nextline}; $1=$1; print
}' file
edited Sep 25 at 21:42
answered Sep 25 at 18:18 by glenn jackman
genius solution !! what we call this process? may I kindly ask you to add some explanations for newbies like me. thank you!
– Shervan
Sep 25 at 18:37
Is there any particular bit you're unclear about? I assume a while loop is clear. getline reads the next line into the given variable. Then I concatenate the current line with the next line, and we re-check the number of fields. Other awk help can be found on the awk tag info page.
– glenn jackman
Sep 25 at 18:40
Yes, $0 = $0 and 1 at the end. thank you for any clarification!
– Shervan
Sep 25 at 18:42
It's not $0=$0, it's "assign to $0 the concatenation of $0 and nextline". awk doesn't have a concatenation operator: other languages might want $0 = $0 + nextline, but with awk you just put strings or variables side-by-side. For clarity we can write $0 = ($0 nextline)
– glenn jackman
Sep 25 at 19:55
The 1 is a common awk idiom that means "print the current record". Follow the link I gave and do some reading: it's well documented.
– glenn jackman
Sep 25 at 19:56
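The pieces discussed in these comments (juxtaposition as concatenation, $1=$1, and the trailing 1) can be seen together in the following sketch. It is a variant, not glenn jackman's exact command: the function name merge_records is hypothetical, blanks are matched with [ \t] instead of [[:blank:]] for the benefit of older awks, and the loop additionally checks the return value of getline. At end of input getline returns 0 and leaves nextline unchanged, so with an incomplete final record an unguarded loop could spin forever or re-append stale text.

```shell
# Hypothetical wrapper around a guarded version of the awk loop.
merge_records() {
  awk -F'[ \t]*[|][ \t]*' -v OFS='|' '
    {
      # Keep appending lines until the record has 4 pipes (5 awk fields);
      # stop when getline reports end of input (0) or an error (-1).
      while (NF < 5 && (getline nextline) > 0)
        $0 = $0 nextline      # two values side by side: concatenation
      $1 = $1                 # assigning a field rebuilds $0 with OFS
    }
    1                         # always-true pattern: print the record
  ' "$@"
}
```

Called as merge_records file, it should behave like the commands in the answer on well-formed input, while a truncated last record is printed as-is instead of stalling the run.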
With GNU sed:
sed ':loop /\(.*|\)\{4\}.*/ !{N; s/\n//; b loop}; s/ *| */|/g' file
The command dissected:
:loop
The : signals a label that we can use for branches. "loop" is just the name that I chose for the label.
/\(.*|\)\{4\}.*/
Is a line selector regex that matches lines that contain 4 pipe symbols, each allowed to be preceded by zero or more arbitrary characters (.*|), with zero or more arbitrary characters allowed to follow the last pipe.
!{ ... }
Applies the commands in the braces to any line that did not match the previous regex.
N; s/\n//; b loop
N appends a newline symbol and the next line from the source file to the current line in pattern space, then s/\n// removes the newline symbol and b loop branches back to the label we have defined at the start, so the concatenated line will be compared against the regex again.
Lastly
s/ *| */|/g
will be applied to any line in pattern space before it is output. This removes any spaces around pipe symbols.
edited Sep 26 at 7:26
answered Sep 25 at 18:25 by Sam
this code not working!
– Shervan
Sep 25 at 18:35
does too for me with GNU sed 4.4
– Sam
Sep 25 at 18:37
sed --version My sed (GNU sed) 4.2.2 Copyright (C) 2012 Free Software Foundation, Inc.
– Shervan
Sep 25 at 18:38
oh, man... the command is not at fault. you are definitely not typing it as displayed. you are using double quotes and your shell's history expansion feature is enabled.
– Sam
Sep 26 at 5:09
@Shervan is probably using csh or tcsh where that ! needs to be escaped, even inside single quotes.
– Stéphane Chazelas
Sep 26 at 7:37
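As the exchange above suggests, a dense one-liner is easy to mistype. A sketch of the same loop with one sed command per -e expression follows, so that the label, the braces and the branch each stand on their own. The wrapper name merge_with_sed is hypothetical, and an sh-compatible shell is assumed (in csh or tcsh the ! would still need escaping, as noted in the comments).

```shell
# Hypothetical wrapper: the sed loop from the answer, one command per -e.
merge_with_sed() {
  sed -e ':loop' \
      -e '/\(.*|\)\{4\}.*/!{' \
      -e 'N' \
      -e 's/\n//' \
      -e 'b loop' \
      -e '}' \
      -e 's/ *| */|/g' "$@"
}
# :loop          label to branch back to
# /.../!{ ... }  run the block only while the line has fewer than 4 pipes
# N; s/\n//      append the next input line, then drop the joining newline
# b loop         re-test the longer line
# s/ *| */|/g    finally strip spaces around every pipe
```

It can be run as merge_with_sed file, or with the data on standard input.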
If using Vim is an option:
vim -Nesc 'g!/\(.*|\)\{4}$/j!' -cwq input.txt
-Nes runs Vim in script mode, making it easier to automate
-c ... runs Vim commands after opening the file
g!/\(.*|\)\{4}$/j! - on every line (:g) that doesn't (!) match /\(.*|\)\{4}$/ (a regex matching 4 pipes separated by anything), join the next line to it (:j).
wq - save and quit.
answered Sep 26 at 7:43 by muru