Bash script only functioning on certain inputs

I have a bash script that I've been working on for a while. Basically, it searches through text to find repetitions of multiple lines. Here is what I have so far:

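#!/bin/bash

# count TEXT FIRST LAST: print how often lines FIRST..LAST of TEXT occur in TEXT.
# The function body was partly lost in this copy; the sed extraction of
# $pattern is an assumption based on how count() is called below.
count() {
 pattern=$(echo "$1" | sed -n "${2},${3}p")
 echo "$1" | pcregrep -Mc "^\Q$(echo "$pattern")\E$"
}

file=$1
# Keep only the digit runs from lines that contain none of '=', '!' or '*'.
fileprep=$(grep -v '=' "$file" | grep -v '!' | grep -v '*' | grep -o '[[:digit:]]*' | grep . )
linecount=$(echo "$fileprep" | wc -l)
len=10
start=1
end=$(( $linecount - $len + 1 ))

# For every window of $len lines, record how often that window occurs.
for i in $(seq $start $end); do
 test="$test\n$(count "$fileprep" $i $((i+len-1)))"
done

# Drop the windows that occur only once.
a=$(printf "$test" | grep -v '\b1\b' )

mostrepetitions=$(echo "$a" | sort -rn | head -n1)

# A pattern that occurs i times is reported by i windows, so divide by i.
for i in $(seq 1 $mostrepetitions); do
 var1=$(printf "$a" | grep '\b'$i'\b' | wc -l)
 var2="$var2\n$(echo $(( var1 / i )))"
done

# Sum the per-count totals.
printf "$var2" | tr '\n' '+' | awk '{print "0"$0}' | bc -l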

I have found that this works correctly on a simple file that contains the numbers 1-10, one per line, repeated twice.

On this file, it correctly outputs 1 (with the len variable set to 10). When len is changed to 9, it correctly outputs 2, because both 1-9 and 2-10 are nine-line patterns that occur at least twice.
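The test file described here can be recreated with seq. This is a minimal sketch; make_sample is a hypothetical helper name, not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: print the numbers 1-10, one per line, twice (20 lines).
make_sample() {
    seq 1 10
    seq 1 10
}

# Show the line count of the generated test input.
make_sample | wc -l
```

Redirecting the output, for example with make_sample > sample.txt (the file name is arbitrary), gives a file that can be passed to the script as its first argument.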

However, when I run this on my target files (an example of which can be found here), I get impossible results.

As I understand it, the number of nine-line patterns found must always be at least double the number of ten-line patterns. Take the above example of 1-10: there, 1-10 is the only ten-line pattern, but it contains both 1-9 and 2-10, each of which is repeated twice. When I run my script on my target file, though, I get an output of 2 for ten-line repeated patterns and also 2 for nine-line patterns. This is clearly incorrect. Why is this happening?

Note: the fileprep variable extracts the list of numbers from the input file (see the sample file I linked).

asked Jan 27 at 15:29 by ToasterFrogs

Some comments in the code would help to understand what the idea behind the various parts is, and what they should do.

– nohillside
Jan 27 at 16:33

What's the overall purpose? To find cycles? Or to do a sort of frequency analysis?

– Kusalananda
Jan 27 at 17:11

bash text-processing


1 Answer

The phenomenon you describe is actually not impossible, so your script is not the problem. The smallest example I can think of is with len=3 as opposed to len=2, and the input file is

With len=3, you get the result 2, but with len=2 you don't get a number ≥ 4, as you might expect, but again the result 2. To get the same number of distinct repeating patterns with len=10 as with len=9, you just need to extend the file to 13 lines.
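A claim like this can be checked without pcregrep by counting the distinct windows directly. The following is a minimal sketch under two assumptions: overlapping occurrences are counted separately (which may differ from how pcregrep -Mc counts), and the six-line input 1 2 1 2 1 2 is a hypothetical example, not the file from this answer. The name repeated_windows is also made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read lines on stdin and print how many distinct
# windows of $1 consecutive lines occur at least twice.
# Assumption: overlapping occurrences are counted separately.
repeated_windows() {
    awk -v len="$1" '
        { line[NR] = $0 }                     # remember every input line
        END {
            # Build each window of len lines and count how often it is seen.
            for (i = 1; i + len - 1 <= NR; i++) {
                w = line[i]
                for (j = i + 1; j < i + len; j++) w = w "\n" line[j]
                seen[w]++
            }
            # Count the distinct windows seen two or more times.
            for (w in seen) if (seen[w] >= 2) n++
            print n + 0
        }'
}

# Hypothetical alternating input: both window lengths give the same count.
printf '%s\n' 1 2 1 2 1 2 | repeated_windows 3
printf '%s\n' 1 2 1 2 1 2 | repeated_windows 2
```

Both calls print 2: with three-line windows the repeated patterns are 1-2-1 and 2-1-2, and with two-line windows they are 1-2 and 2-1. Shortening the window adds more occurrences of existing patterns rather than new distinct ones.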

Addendum:

I modified the count() function to

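# The function body was partly lost in this copy; the sed extraction of
# $pattern is an assumption based on how the script calls count().
count() {
 pattern=$(echo "$1" | sed -n "${2},${3}p")
 occur=$(echo "$1" | pcregrep -Mc "^\Q$(echo "$pattern")\E$")
 # Report every pattern that occurs at least twice on standard error.
 [ $occur -ge 2 ] && echo "$pattern occurs $occur times." >&2
 echo $occur
}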

This prints each repeated pattern to standard error. It reports that the 10-line pattern

16
...
16

appears 360 times, while the 10-line pattern

16
...
16
8

appears twice. On the other hand, the 9-line pattern

16
...
16

appears 362 times, while

16
...
16
8

appears twice. Your file contains many blocks of consecutive lines containing 16. What puzzles me is why the nine-line run of 16s does not occur once more for each such block, but only twice more in total than the ten-line run.

answered Jan 27 at 17:18 by Stefan Hamcke, edited Jan 28 at 15:34

Thank you very much, I had never thought about the fact that there are more distinct 10 line possibilities than 9 line.

– ToasterFrogs
Feb 9 at 17:24
