I am trying to work out which entries in my data store are near-duplicates using approximate string matching.
Is there any implementation of the following approach in python, or do i need to try and roll my own?
Thanks :)
...
A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n3 m)
A better solution[3][4], utilizing dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the i first characters of the pattern, Pi, and any substring Tj',j of T that ends at position j.
What is the most efficient way to apply this to many strings?