I have a bash script that I've been working on for a while. Basically, it searches through text to find repetitions of multiple lines. Here is what I have so far:

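#!/bin/bash

# count TEXT FIRST LAST
# Print how often lines FIRST..LAST of TEXT occur in TEXT as one block.
count() {
    count=$(( $3 - $2 + 1 ))
    pattern=$(echo  "$1" | head -n $3 | tail -n $count)
    echo "$1" | pcregrep -Mc "^\Q$(echo "$pattern")\E$"
}

file=$1
# Drop lines containing '=', '!' or '*', then keep only the digit runs, one per line.
fileprep=$(grep -v '=' "$file" | grep -v '!' | grep -v '*' |  grep -o '[[:digit:]]*' | grep . )
linecount=$(echo "$fileprep" | wc -l)
len=10      # window length in lines
start=1
end=$(( $linecount - $len + 1 ))

# Collect the occurrence count of every len-line window, one count per line.
for i in $(seq $start $end); do
    test="$test\n$(count "$fileprep" $i $((i+len-1)))"
done

# Discard windows that occur only once.
a=$(printf "$test" | grep -v '\b1\b' )

mostrepetitions=$(echo "$a" | sort -rn | head -n1)

# A pattern that occurs i times shows up in i windows, so dividing the number
# of windows with count i by i gives the number of distinct such patterns.
for i in $(seq 1 $mostrepetitions); do
    var1=$(printf "$a" | grep '\b'$i'\b' | wc -l)
    var2="$var2\n$(echo $(( var1 / i )))"
done

# Sum the per-count totals.
printf "$var2" | tr '\n' '+' | awk '{print "0"$0}' | bc -l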

I have found that this works correctly on a simple file that has the numbers 1-10 repeated twice (like so):

1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10

On this file, it correctly outputs 1 (with the len variable set to 10). When len is changed to 9, it correctly outputs 2, because both 1-9 and 2-10 are 9-line patterns that occur at least twice.

However, when I run this on my target files (an example of which can be found here), I get impossible results.

As I understand it, the number of nine-line patterns found should always be at least double the number of ten-line patterns. Take the above example of 1-10. There, 1-10 is the only ten-line pattern. However, it contains both 1-9 and 2-10, each of which occurs twice. When I run my script on my target file, though, I get an output of 2 for ten-line repeated patterns and also an output of 2 for nine-line patterns. This seems clearly incorrect. Why is this happening?

Note - the fileprep variable builds a list of numbers from the input file (see the sample file I linked).

  • Some comments in the code would help to understand what the idea behind the various parts is, and what they should do. Commented Jan 27, 2019 at 16:33
  • What's the overall purpose? To find cycles? Or to do a sort of frequency analysis? Commented Jan 27, 2019 at 17:11

1 Answer

What you describe is in fact possible, so your script is not the problem. The smallest example I can think of compares len=3 with len=2 on the input file

1
2
1
2
1
2

With len=3, you get the result 2, but with len=2 you do not get some number ≥4, as you might expect, but again the result 2. To get the same number of distinct repeating patterns with len=10 as with len=9, you just need to extend the file to 13 lines.
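To check small cases like this without running the full script, here is a minimal sketch that swaps pcregrep for an awk sliding window. The function name count_repeated_windows is hypothetical, and the sketch assumes that overlapping occurrences are counted, which is what the results above imply. It prints the number of distinct len-line patterns that occur at least twice.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script: read lines from stdin
# and print how many distinct windows of LEN consecutive lines occur at least
# twice. Overlapping occurrences are counted (an assumption).
count_repeated_windows() {
    local len=$1
    awk -v len="$len" '
        { line[NR] = $0 }                      # remember every input line
        END {
            for (i = 1; i + len - 1 <= NR; i++) {
                key = line[i]                  # join the window into one key
                for (j = 1; j < len; j++) key = key "\n" line[i + j]
                seen[key]++
            }
            n = 0
            for (k in seen) if (seen[k] >= 2) n++
            print n
        }'
}

printf '%s\n' 1 2 1 2 1 2 | count_repeated_windows 3   # prints 2
printf '%s\n' 1 2 1 2 1 2 | count_repeated_windows 2   # prints 2
```

On the 1-10 file from the question, the same function should give 1 for a length of 10 and 2 for a length of 9.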

Addendum:

I modified the count() function to

count() {
    count=$(( $3 - $2 + 1 ))
    pattern=$(echo  "$1" | head -n $3 | tail -n $count)
    occur=$(echo "$1" | pcregrep -Mc "^\Q$(echo "$pattern")\E$")
    [ $occur -ge 2 ] && echo "$pattern occurs $occur times." >&2
    echo $occur
}

so that it prints each repeating pattern to standard error. It says that the 10-line pattern

16
...
16

appears 360 times, while the 10-line pattern

16
...
16
8

appears twice. On the other hand, the 9-line pattern

16
...
16

appears 362 times, while

16
...
16
8

appears twice. Your file contains many blocks of consecutive lines containing 16. What puzzles me is why the 9 lines of 16 do not occur once more for each such block, but only two times more than the 10 lines in total.

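One way to look into this is to inspect the run lengths directly. If occurrences are counted with overlaps, a run of k consecutive identical lines contains k - len + 1 windows of len lines (for k ≥ len). The sketch below uses uniq -c to get the run lengths and sums that quantity for one value. The function name windows_in_runs is hypothetical, and it assumes that each line is a single token without whitespace, as in the prepared number list.

```shell
#!/bin/bash
# Hypothetical helper: read lines from stdin and print how many overlapping
# windows of LEN consecutive lines consist only of VALUE.
# Assumes every line is one whitespace-free token.
windows_in_runs() {
    local value=$1 len=$2
    # uniq -c prints "<run length> <line>" for each run of identical lines.
    uniq -c | awk -v v="$value" -v len="$len" '
        $2 == v && $1 >= len { total += $1 - len + 1 }
        END { print total + 0 }'
}

# Made-up input: a run of twelve 16s, one 8, then a run of nine 16s.
{ yes 16 | head -n 12; echo 8; yes 16 | head -n 9; } | windows_in_runs 16 10   # prints 3
{ yes 16 | head -n 12; echo 8; yes 16 | head -n 9; } | windows_in_runs 16 9    # prints 5
```

Comparing these totals with the counts that pcregrep reports on the real file would show whether the two ways of counting agree.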
  • Thank you very much, I had never thought about the fact that there are more distinct 10-line possibilities than 9-line ones. Commented Feb 9, 2019 at 17:24
