
I am trying to work out which entries in my data store are near-duplicates using approximate string matching.

Is there any implementation of the following approach in Python, or do I need to roll my own?

Thanks :)

From Wikipedia:

...

A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n³m).

A better solution [3][4], utilizing dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the first i characters of the pattern, P_i, and any substring T_{j'..j} of T that ends at position j.

What is the most efficient way to apply this to many strings?
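For reference, here is a minimal sketch of what rolling my own might look like, following the formulation quoted above (the function name and layout are my own):

def approx_substring_distance(pattern, text):
    # Minimum edit distance between pattern and any substring of text.
    # Row 0 is all zeros, so a match may begin at any position in text.
    m, n = len(pattern), len(text)
    prev = [0] * (n + 1)               # distances for the empty pattern
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # pattern[:i] vs empty text costs i
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # skip a pattern character
                          curr[j - 1] + 1,     # skip a text character
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return min(prev)                   # best match may end anywhere in text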

4 Answers


Yes.

google("python levenshtein")



difflib.get_close_matches should do the job.
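A minimal example, adapted from the standard-library docs:

from difflib import get_close_matches

# Candidates are ranked by similarity; matches below the cutoff
# (default 0.6) are dropped
get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
# ['apple', 'ape']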



difflib might be the answer, e.g.:

from difflib import context_diff

a = 'acaacbaaca'
b = 'accabcaacc'

# context_diff compares sequences of lines, so two plain strings
# are diffed character by character here
print('\n'.join(context_diff(a, b, lineterm='')))
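For near-duplicate scoring specifically, difflib's SequenceMatcher gives a similarity ratio rather than a diff. A minimal sketch (the 0.8 threshold is illustrative, not a recommendation):

from difflib import SequenceMatcher

a = 'acaacbaaca'
b = 'accabcaacc'

# ratio() returns 2*M/T, where M is the number of matching characters
# and T is the total length of both strings; values near 1.0 suggest
# near-duplicates
if SequenceMatcher(None, a, b).ratio() > 0.8:
    print('near-duplicate')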



Levenshtein distance performs very similarly to fuzzywuzzy's standard ratio() function; fuzzywuzzy uses difflib under the hood: http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

example from the fuzzywuzzy documentation: https://github.com/seatgeek/fuzzywuzzy

from fuzzywuzzy import fuzz

fuzz.ratio("this is a test", "this is a test!")
    97
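To score one string against many candidates at once (the situation in the question), fuzzywuzzy's process module may help; example adapted from the same README:

from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

process.extract("new york jets", choices, limit=2)
# [('New York Jets', 100), ('New York Giants', 78)]

process.extractOne("cowboys", choices)
# ('Dallas Cowboys', 90)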

