My use case is that I have a list of emails and I need to find if the email is in the chain:
emails = [
[' RE: I just got your test || This is a top level email roughly 300+ words in length on average'],
['RE:RE: Something | RE: Something | This is another entirely distinct email that is the top level'],
['RE: FooBar | A third email that is entirely distinct']
... #performance issues start occurring with emails growing to > 50 items
]
Now I would search for an email which is a reply:
test_email = """
A third email that is entirely distinct
"""
Currently I am using a KMP type search but I believe this is inefficient because it searches for multiple copies of the needle. I also believe the in operator is inefficient because it is using the niave method of searching the string.
An implementation of KMP in python I found and modified slightly:
class KMP:
@classmethod
def partial(self, pattern):
""" Calculate partial match table: String -> [Int]"""
ret = [0]
for i in range(1, len(pattern)):
j = ret[i - 1]
while j > 0 and pattern[j] != pattern[i]:
j = ret[j - 1]
ret.append(j + 1 if pattern[j] == pattern[i] else j)
return ret
@classmethod
def search(self, T, P):
"""
KMP search main algorithm: String -> String -> [Int]
Return all the matching position of pattern string P in S
"""
partial, ret, j = KMP.partial(P), [], 0
for i in range(len(T)):
while j > 0 and T[i] != P[j]:
j = partial[j - 1]
try:
if T[i] == P[j]: j += 1
except:
return False
if j == len(P):
return True
return False
Used here:
for choice in choices: #choice[0] is email in the reply chain
if current_email == choice[0]:
incident_type = choice[1]
break
elif len(current_email) > len(choice[0]):
if KMP.search(current_email, choice[0]):
incident_type = choice[1]
break
else:
if KMP.search(choice[0], current_email):
incident_type = choice[1]
break
Should I fix this implementation or scrap it in favor of another algo?