Here's the Rabin-Karp implementation that I've written in Python:
```python
from functools import lru_cache
from itertools import islice
from typing import List, Optional

BASE = 139          # base of the polynomial rolling hash
MOD = 2 ** 31 - 1   # Mersenne prime modulus


@lru_cache(maxsize=None)
def char_to_int(c: str) -> int:
    # +1 so that no character maps to 0, which would make e.g. "a" and "aa" collide.
    return ord(c) + 1


def calculate_hash(s: str, length: Optional[int] = None) -> int:
    """Polynomial hash of `s`, or of its first `length` characters."""
    ret_hash = 0
    power = len(s) - 1 if length is None else length - 1
    if length is not None:
        s = islice(s, 0, length)
    for c in s:
        ret_hash += char_to_int(c) * BASE ** power
        power -= 1
    return ret_hash % MOD


def roll_hash(prev_hash: int, prev_char: str, next_char: str, length: int) -> int:
    # hash(s[i+1 : i+1+n]) = (hash(s[i : i+n]) - c[i] * BASE**(n-1)) * BASE + c[i+n]
    new_hash = ((prev_hash - char_to_int(prev_char) * BASE ** (length - 1)) * BASE) % MOD
    new_hash += char_to_int(next_char)
    return new_hash % MOD


def rabin_karp(text: str, pattern: str) -> List[int]:
    """
    Returns a list of indices where each entry corresponds to the starting
    index of a substring of `text` that exactly matches `pattern`.
    """
    p_hash = calculate_hash(pattern)
    n = len(pattern)
    curr_hash = calculate_hash(text, n)
    indices = []
    # Compare the actual substring on a hash hit to rule out collisions.
    if p_hash == curr_hash and text[:n] == pattern:
        indices.append(0)
    for i in range(1, len(text) - n + 1):
        curr_hash = roll_hash(curr_hash, text[i - 1], text[i + n - 1], n)
        if p_hash == curr_hash and text[i:i + n] == pattern:
            indices.append(i)
    return indices


if __name__ == "__main__":
    with open("lorem_ipsum.txt", "r") as f:
        text = f.read()
    pattern = "lorem"
    indices = rabin_karp(text, pattern)
    print(f"indices: {indices}")
```
I'm trying to optimize the code as much as possible, so I've done some dynamic analysis to better understand the bottlenecks: I profiled the function calls with cProfile and iterated on the code accordingly to arrive at the version above. Here is the final cProfile output:
```
Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    60301    0.091    0.000    0.091    0.000 rabin_karp.py:25(roll_hash)
        1    0.035    0.035    0.126    0.126 rabin_karp.py:30(rabin_karp)
       42    0.000    0.000    0.000    0.000 rabin_karp.py:11(char_to_int)
        2    0.000    0.000    0.000    0.000 rabin_karp.py:15(calculate_hash)
       57    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       42    0.000    0.000    0.000    0.000 {built-in method builtins.ord}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.len}
```
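The profile was produced with something along these lines (reconstructed from memory, so the exact invocation may have differed):

```python
import cProfile
import pstats

# Sorting by "tottime" gives the "Ordered by: internal time" header above.
cProfile.run("rabin_karp(text, pattern)", "rk.prof")
pstats.Stats("rk.prof").sort_stats("tottime").print_stats()
```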
Are there any other ways I could further optimize the code? Another interesting thing of note is that adding `@lru_cache` actually increases execution time as measured by `timeit`, even though the cache reduces the number of calls to `char_to_int()` from 120612 to 42 (presumably the overhead of hashing the argument and looking it up in the cache outweighs the cost of `ord()` itself).
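The `timeit` comparison was done roughly like this (a sketch, not the exact benchmark; the cached and uncached variants were produced by toggling the `@lru_cache` decorator on `char_to_int`):

```python
import timeit

elapsed = timeit.timeit(
    "rabin_karp(text, pattern)",
    globals={"rabin_karp": rabin_karp, "text": text, "pattern": "lorem"},
    number=10,
)
print(f"{elapsed / 10:.4f} s per call")
```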
One commenter notes that the regex equivalent, `[m.start(0) for m in re.finditer(pattern, text)]`, runs in milliseconds over a file of 100K lines, while the `rabin_karp` above takes seconds.
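For anyone reproducing that comparison, a sketch of the regex baseline (note that `re.finditer` reports non-overlapping matches, so the two approaches can legitimately disagree on patterns that overlap themselves):

```python
import re

# re.escape guards against metacharacters in `pattern`; the original
# comment used the raw pattern directly.
regex_indices = [m.start(0) for m in re.finditer(re.escape(pattern), text)]
```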