Count multi-line patterns in file
I am looking for a way to search for a multi-line pattern across a file.
For example, say this list of numbers was my input file:
3
2
5
4
8
2
5
4
2
4
2
5
4
If I wanted to search for instances of lines 2-4 (inclusive), I would like the result to be:
3
since that is the number of times those particular lines are repeated exactly. I would also like this to work with any number of lines, as well as any line number range in the file.
bash text-processing
edited Jan 26 at 0:15 by Sparhawk
asked Jan 25 at 22:56 by ToasterFrogs
If it's inclusive then only the value in line 3 is repeated three times. The values in lines 2 and 4 are repeated four times.
– Nasir Riley
Jan 25 at 23:10
@NasirRiley I think they are asking for a multi-line grep, i.e. 2\n5\n4
– Sparhawk
Jan 25 at 23:16
I really can't tell what OP is looking for. Is it possible to reword it in simpler terms?
– Jesse_b
Jan 25 at 23:17
What @Sparhawk said is correct - I am looking for something like a multi line grep.
– ToasterFrogs
Jan 25 at 23:29
Is the input to this script "lines 2 through 4" or is it "the sequence of numbers 2,5,4"?
– Jeff Schaller
Jan 25 at 23:46
5 Answers
You could use pcregrep, which is available in most distros. The following command counts the matches of a fixed multi-line string.
pcregrep -Mc '^2\n5\n4$' input.txt
Explanation
From the man page, pcregrep is "a grep with Perl-compatible regular expressions."
-M: match the regex over multiple lines
-c: output the number of matches (count), instead of the matches themselves
^2\n5\n4$: regex for 2, 5, 4, each on a separate line
Pattern from specific lines instead
Later comments in the question suggest that the pattern to be matched is not a fixed string, but instead a general "lines 2 through 4". Here, you can use command substitution to extract the lines from the input file instead.
pcregrep -Mc "^\Q$(sed -n 2,4p input.txt)\E$" input.txt
Explanation
sed -n 2,4p input.txt: output only lines 2 through 4 of the file
\Q...\E: quote the ... part so that it is matched as a literal string rather than as a regexp (assumes the output of the command doesn't contain \E)
Note that this assumes the last lines of the output of sed ... input.txt are not empty, as command substitution ($(...)) strips all trailing newline characters.
edited Jan 26 at 22:31
answered Jan 26 at 0:05 by Sparhawk
sed -n 2,4p input.txt is, I think, clearer than the tail|head pipeline, and makes it simpler to plug in the start and end line numbers.
– glenn jackman
Jan 26 at 14:43
Thanks @glennjackman. Good point.
– Sparhawk
Jan 26 at 22:31
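The caveat about command substitution stripping trailing newlines can be worked around with a sentinel character. This is only a minimal sketch, not part of the answer: the function name get_block and the variable BLOCK are made up for illustration, and it assumes a POSIX shell and sed.

```shell
# get_block START END FILE
# Stores lines START..END of FILE in BLOCK (a hypothetical variable name),
# keeping any empty lines at the end of the range.
get_block() {
  nl='
'
  # The sentinel "x" is printed after the data, so the newlines before it
  # are no longer "trailing" and survive the command substitution.
  BLOCK=$(sed -n "${1},${2}p" "$3"; printf x)
  BLOCK=${BLOCK%x}        # remove the sentinel again
  BLOCK=${BLOCK%"$nl"}    # remove only the newline that ends the last line
}
```

After calling get_block 2 4 input.txt, "$BLOCK" could stand in for the $(sed ...) part between \Q and \E.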
$ perl -l -0777pe '$_=()=/^2\n5\n4$/mg' input_file
3
Working:
-0777 => slurp mode, meaning the whole file is read in as a single record.
-p => before reading the next record, print the current record, $_, to stdout.
-l => set ORS to "\n" (the value RS has when -l is processed, before -0777 changes it), so the printed count ends with a newline.
The regex /^2\n5\n4$/mg is implicitly applied to $_, which in our case is the whole file. The /m modifier makes ^ and $ match at line beginnings and line endings too, not just at the beginning and end of the string. The /g modifier finds all the matches in $_.
We do the match in list context and assign the result to an empty list. $_ thus gets reassigned the number of elements in that list, which is the number of times the regex matched.
HTH
answered Jan 26 at 11:34 by Rakesh Sharma
Without hardcoding the pattern, you can pass the start and end lines of the file and extract the pattern within the perl code:
perl -s -l -0777pe '$p = join "\n", (split /\n/)[$start-1 .. $end-1]; $_ = ()=/^$p$/mg' -- -start=2 -end=4 input_file
– glenn jackman
Jan 26 at 14:40
Thanks @glenn jackman, for providing the generalization.
– Rakesh Sharma
Jan 27 at 3:23
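One thing to watch with the line-range variant: the extracted lines are interpolated as a regex, so a line such as a.b would also match axb. A possible sketch that quotes the extracted block with \Q...\E is shown here. The wrapper name count_block_perl is hypothetical, and it assumes perl is installed and that the requested range lies within the file.

```shell
# count_block_perl START END FILE
# Counts literal occurrences of lines START..END of FILE within FILE.
count_block_perl() {
  perl -s -0777 -ne '
    my @lines = split /\n/, $_, -1;                  # the file as a list of lines
    my $p = join "\n", @lines[$start-1 .. $end-1];   # the block to look for
    my $count = () = /^\Q$p\E$/mg;                   # \Q...\E: match $p literally
    print "$count\n";
  ' -- -start="$1" -end="$2" "$3"
}
```

For the sample file from the question, count_block_perl 2 4 input_file should print 3.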
Your post doesn't mention any requirement for regular expression support, so I'm going to assume that you will be searching for fixed, literal text strings.
This probably isn't the fastest algorithm you've ever seen, but it works, if you have enough time. It has the slight defect that if a different N-line block begins with the same first line as the pattern and has the same SHA256 hash, it will be counted as a match. In other words, it assumes that all possible N-line blocks have unique SHA256 hashes.
It will be tediously slow on large files, especially those which contain numerous occurrences of the first line of the pattern.
#!/usr/bin/env bash
# What's the name of the list file?
LIST=list
# What's the name of the pattern file?
PATTERN=pattern
# We'll figure out how many times the pattern lines appear (consecutively) in the list.
# Where's your SHA256 tool?
SHA256=/sbin/sha256
# what's the first line of pattern?
PATTERN_START="$(head -n 1 "$PATTERN")"
# where in the list does that single line appear (what line numbers?)
# -F: treat the line as a fixed string, not as a regex
START_LINES="$(grep -nxF -- "$PATTERN_START" "$LIST" | sed -e 's/:.*//')"
# how many lines long is the pattern?
PAT_LEN="$(grep -c ^ < "$PATTERN")"
echo "Pattern is $PAT_LEN lines long, and might start at any of these lines:"
# unquoted on purpose, so that the line numbers print on a single line
echo $START_LINES
PAT_HASH="$($SHA256 < "$PATTERN")"
# So how many times does $PATTERN appear consecutively in $LIST?
PAT_COUNT=0
for LINE in $START_LINES
do
    # hash the PAT_LEN lines of the list that start at this candidate line
    HASH="$(tail -n +"$LINE" "$LIST" | head -n "$PAT_LEN" | $SHA256 -q)"
    if [ "$HASH" = "$PAT_HASH" ]
    then
        echo "match at line $LINE"
        PAT_COUNT=$((PAT_COUNT+1))
    fi
done
echo "The pattern was found $PAT_COUNT times"
The output:
$ cat list
3
2
5
4
8
2
5
4
2
4
2
5
4
$ cat pattern
2
5
4
$ . foo.sh
Pattern is 3 lines long, and might start at any of these lines:
2 6 9 11
match at line 2
match at line 6
match at line 11
The pattern was found 3 times
edited Jan 26 at 0:04
answered Jan 25 at 23:30 by Jim L.
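For comparison, the same idea works without hashing by comparing the lines directly, which sidesteps the hash-collision caveat. This is a rough sketch rather than the script above: the function name count_block_lines is made up, it takes a line range instead of a pattern file, and it assumes a POSIX awk and a file small enough to hold in memory.

```shell
# count_block_lines START END FILE
# Counts every position in FILE where the lines START..END of FILE
# occur consecutively (overlapping occurrences are counted too).
count_block_lines() {
  awk -v s="$1" -v e="$2" '
    { line[NR] = $0 }                          # keep every line in memory
    END {
      n = e - s + 1                            # number of lines in the block
      if (n < 1 || e > NR) { print 0; exit }   # empty or out-of-range block
      count = 0
      for (i = 1; i + n - 1 <= NR; i++) {      # every possible start line
        same = 1
        for (j = 0; j < n; j++) {
          # appending "" forces a string comparison, so "2" and "2.0" differ
          if ((line[i + j] "") != (line[s + j] "")) { same = 0; break }
        }
        count += same
      }
      print count
    }
  ' "$3"
}
```

With the list file shown above, count_block_lines 2 4 list should report 3.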
mpc() tail -n $line_count)
awk -v RS='' -v FPAT="$multiline_pattern" 'print NF' "$3"
# count how many times multiline-pattern defined by lines 2 to 4 (inclusive) occurs
mpc 2 4 input_file
Requirement:
The second argument must be at least equal to or greater than the first argument. I make no guarantee to the output if you violate that.
Disclaimer:
This doesn't work if characters and/or
$
appear in any of the lines included as a pattern. awk
struggles to process those characters as parts of a pattern even if they're backslash-escaped.
edited Jan 26 at 2:56
answered Jan 26 at 2:21
Niko Gambt
How about
a="2 5 4"; tr '\n' ' ' < test | grep -o "[^0-9]$a[^0-9]" | wc -l
With the separator of your choice.
The [^0-9] guards in the regex are needed to prevent a false match on input such as 22 5 44.
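To make the role of those guards concrete, here is a small sketch that wraps the same pipeline in a function. The name count_seq and the leading-space padding are assumptions added for illustration; the padding is there so that a pattern starting on the very first line still has a non-digit character in front of it.

```shell
# count_seq SEQ FILE  (hypothetical helper name)
# SEQ is the pattern with its lines joined by single spaces, e.g. "2 5 4".
count_seq() {
    # Pad the front with one space; the final newline of FILE, turned
    # into a space by tr, supplies the non-digit after the last line.
    { printf ' '; tr '\n' ' ' < "$2"; } |
        grep -o "[^0-9]$1[^0-9]" | wc -l | tr -d ' '
}
```

On the list from the question, `count_seq "2 5 4" test` should print 3, while a file holding 22, 5 and 44 gives 0. One caveat: grep -o reports non-overlapping matches, so when two occurrences sit directly back to back they share the separating space and the second one is not counted.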
edited Jan 26 at 15:02
answered Jan 26 at 14:55
bu5hman