I have a bash script that I've been working on for a while. It searches through text for sequences of consecutive lines that are repeated. Here is what I have so far:
#!/bin/bash
# count TEXT FIRST LAST: print how often lines FIRST..LAST of TEXT occur in TEXT
count() {
    count=$(( $3 - $2 + 1 ))
    pattern=$(echo "$1" | head -n $3 | tail -n $count)
    echo "$1" | pcregrep -Mc "^\Q$(echo "$pattern")\E$"
}
file=$1
fileprep=$(grep -v '=' $file | grep -v '!' | grep -v '*' | grep -o '[[:digit:]]*' | grep . )
linecount=$(echo "$fileprep" | wc -l)
len=10
start=1
end=$(( $linecount - $len + 1 ))
for i in $(seq $start $end); do
    test="$test\n$(count "$fileprep" $i $((i+len-1)))"
done
a=$(printf $test | grep -v '\b1\b' )
mostrepetitions=$(echo "$a" | sort -rn | head -n1)
for i in $(seq 1 $mostrepetitions); do
    var1=$(printf "$a" | grep '\b'$i'\b' | wc -l)
    var2="$var2\n$(echo $(( var1 / i )))"
done
printf "$var2" | tr '\n' '+' | awk '{print "0"$0}' | bc -l
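To show what the count function is meant to slice out, here is a small standalone sketch of the head/tail step on its own. The names first, last and slice are only for this illustration and do not appear in my script:

```shell
#!/bin/bash
# Extract lines 3 through 5 of a 10-line input, the same way count() does:
# head keeps everything up to the last wanted line, tail keeps the window.
first=3
last=5
slice=$(seq 1 10 | head -n "$last" | tail -n "$(( last - first + 1 ))")
echo "$slice"   # prints 3, 4 and 5 on separate lines
```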
I have found that this works correctly on a simple file that lists the numbers 1-10 twice, like so:
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
On this file, the script correctly outputs 1 (with the len variable set to 10). When len is changed to 9, it correctly outputs 2, because both 1-9 and 2-10 are nine-line patterns that occur at least twice.
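For anyone who wants to reproduce this, the test file can be generated with something like the following. The file name sample.txt is just a placeholder I picked:

```shell
#!/bin/bash
# Write the numbers 1-10 twice, one per line (20 lines in total).
# sample.txt is a placeholder name, not something my script requires.
{ seq 1 10; seq 1 10; } > sample.txt
wc -l < sample.txt
```

The script is then run with that file as its only argument.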
However, when I run this on my target files (an example of which can be found here), I get impossible results.
My reasoning is that the number of nine-line patterns found should always be at least double the number of ten-line patterns. Take the above example of 1-10. There, 1-10 is the only ten-line pattern. However, it contains both 1-9 and 2-10, each of which occurs twice. When I run my script on my target files, though, I get an output of 2 for ten-line repeated patterns and also an output of 2 for nine-line patterns. This is clearly incorrect. Why is this happening?
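As a cross-check of what I think the script should be counting, here is a minimal awk sketch that counts the distinct windows of a given length that occur at least twice. It is not my script, only an independent way to get the expected numbers; count_repeated_windows is a hypothetical helper name, and it reads the prepared list of numbers from standard input:

```shell
#!/bin/bash
# Hypothetical helper (not in my script): print how many distinct
# windows of $1 consecutive lines occur at least twice on stdin.
count_repeated_windows() {
    awk -v len="$1" '
        { line[NR] = $0 }                      # remember every input line
        END {
            # Build each window of len lines and tally how often it occurs.
            for (i = 1; i + len - 1 <= NR; i++) {
                key = line[i]
                for (j = i + 1; j < i + len; j++) key = key "\n" line[j]
                seen[key]++
            }
            n = 0
            for (k in seen) if (seen[k] >= 2) n++
            print n
        }'
}

{ seq 1 10; seq 1 10; } | count_repeated_windows 10   # prints 1 (the pattern 1-10)
{ seq 1 10; seq 1 10; } | count_repeated_windows 9    # prints 2 (the patterns 1-9 and 2-10)
```

On the 1-10 test file this agrees with what my script prints for len set to 10 and to 9.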
Note - the fileprep variable builds a list of numbers, one per line, from the input file (see the sample file I linked).
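To illustrate what that pipeline does, here it is applied to a few made-up lines. The input below is hypothetical and is not taken from my real files; the output shown is what I get with GNU grep:

```shell
#!/bin/bash
# Hypothetical input, made up for illustration only.
input=$(mktemp)
printf '%s\n' '! header 5' 'x = 3' '* note 9' '12 abc' '7' > "$input"

# Same pipeline as in my script: drop lines containing =, ! or *,
# then keep only the runs of digits, one per output line.
fileprep=$(grep -v '=' "$input" | grep -v '!' | grep -v '*' | grep -o '[[:digit:]]*' | grep . )
echo "$fileprep"   # prints 12 and 7 on separate lines
rm -f "$input"
```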