Fuzzy String Matching using Python

Question

I have a training dataset, e.g.:

Letter    Word
A         Apple
B         Bat
C         Cat
D         Dog
E         Elephant

and I need to check a dataframe such as:

Letters   Words
AD        Apple Dog
AE        Applet Elephant
DC        Dog Cow
EB        Elephant Bag
AED       Apple Elephant Dog
D         Door
ABC       All Bat Cat

The instances AD, AE and EB are almost accurate (Apple and Applet are considered close to each other, and likewise Bat and Bag), but DC doesn't match.

Output Required:

Letters    Words               Status
AD         Apple Dog           Accept
AE         Applet Elephant     Accept
DC         Dog Cow             Reject
EB         Elephant Bag        Accept
AED        Apple Elephant Dog  Accept
D          Door                Reject
ABC        All Bat Cat         Accept

ABC is accepted because 2 of 3 words match.

Accepted words need to match at 70% (fuzzy match), although the threshold is subject to change. How can I find these matches using Python?

I'm having trouble framing the code. My attempts so far have produced nothing useful.
– spd, Commented Apr 17, 2022 at 5:15
Can I assume that if I have 2 letters (AD), I always have exactly 2 words (Apple Dog), not 3 and not 1, separated by a space?
– Corralien, Commented Apr 17, 2022 at 5:19
So 2 letters -> 2 words, 1 letter -> 1 word, 3 letters -> 3 words. You always have 1:1.
– Corralien, Commented Apr 17, 2022 at 5:29

Corralien · Accepted Answer · 2022-04-17 05:44:32Z

You can use thefuzz to solve your problem:

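# Python env: pip install thefuzz
# Conda env: conda install thefuzz
import numpy as np
import pandas as pd
from thefuzz import fuzz

THRESHOLD = 70

# df1 is the lookup table, df2 the data to check (both taken from the question).
df1 = pd.DataFrame({'Letter': list('ABCDE'),
                    'Word': ['Apple', 'Bat', 'Cat', 'Dog', 'Elephant']})
df2 = pd.DataFrame({'Letters': ['AD', 'AE', 'DC', 'EB', 'AED', 'D', 'ABC'],
                    'Words': ['Apple Dog', 'Applet Elephant', 'Dog Cow', 'Elephant Bag',
                              'Apple Elephant Dog', 'Door', 'All Bat Cat']})

# Split Letters into single letters, look up each word in df1, rejoin per row.
df2['Others'] = (df2['Letters'].apply(list).explode().reset_index()
                     .merge(df1, left_on='Letters', right_on='Letter')
                     .groupby('index')['Word'].agg(' '.join))

# Score the whole Words string against the rebuilt Others string (0 to 100).
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')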

Output:

>>> df2
  Letters               Words              Others  Ratio  Status
0      AD           Apple Dog           Apple Dog    100  Accept
1      AE     Applet Elephant      Apple Elephant     97  Accept
2      DC             Dog Cow             Dog Cat     71  Accept
3      EB        Elephant Bag        Elephant Bat     92  Accept
4     AED  Apple Elephant Dog  Apple Dog Elephant     78  Accept
5       D                Door                 Dog     57  Reject
6     ABC         All Bat Cat       Apple Cat Bat     67  Reject
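
Note that the whole-string ratio does not reproduce the Status column requested in the question: DC scores 71 and is accepted, while ABC scores 67 and is rejected. If the intent is "accept when most words match", one possible alternative is to score word by word. The following is only a minimal sketch under several assumptions that are not part of the answer: it uses the standard library's difflib.SequenceMatcher instead of thefuzz, it pairs letters and words by position, it takes "most" to mean strictly more than half, and the names LETTER_TO_WORD, word_score and row_status are hypothetical.

```python
from difflib import SequenceMatcher

# Lookup table from the question (letter -> reference word).
LETTER_TO_WORD = {"A": "Apple", "B": "Bat", "C": "Cat", "D": "Dog", "E": "Elephant"}


def word_score(expected: str, actual: str) -> float:
    """Similarity of two words on a 0-100 scale (difflib, not thefuzz)."""
    return 100 * SequenceMatcher(None, expected, actual).ratio()


def row_status(letters: str, words: str,
               word_threshold: float = 70, min_fraction: float = 0.5) -> str:
    """Accept a row when more than min_fraction of its words match.

    Assumption: the i-th letter corresponds to the i-th word (1:1, by position).
    """
    expected = [LETTER_TO_WORD.get(letter) for letter in letters]
    actual = words.split()
    if not expected or len(expected) != len(actual):
        return "Reject"
    # Count positions whose word is close enough to the reference word.
    hits = sum(
        1
        for exp, act in zip(expected, actual)
        if exp is not None and word_score(exp, act) >= word_threshold
    )
    return "Accept" if hits / len(expected) > min_fraction else "Reject"


if __name__ == "__main__":
    rows = [("AD", "Apple Dog"), ("AE", "Applet Elephant"), ("DC", "Dog Cow"),
            ("EB", "Elephant Bag"), ("AED", "Apple Elephant Dog"),
            ("D", "Door"), ("ABC", "All Bat Cat")]
    for letters, words in rows:
        print(letters, words, row_status(letters, words, word_threshold=60))
```

Under these assumptions and with difflib's scoring, a per-word threshold of 70 rejects EB, because Bag against Bat scores about 67. A per-word threshold of 60 gives the statuses listed in the question.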


6 Comments

spd Over a year ago

How can the code be modified if the ratio isn't 1:1?

Corralien Over a year ago

@spd. In fact, it doesn't matter. I created the Others column by exploding the Letters column. So for DC, I extracted Dog and Cat from your first dataframe and compared them to Dog Cow.
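
To make the explode step concrete, here is a small sketch of the intermediate results for two rows from the question. It assumes pandas is installed. The variable names exploded, merged and others are illustrative and do not appear in the answer, and .apply(list) is used here to split each Letters string into single characters.

```python
import pandas as pd

df1 = pd.DataFrame({"Letter": list("ABCDE"),
                    "Word": ["Apple", "Bat", "Cat", "Dog", "Elephant"]})
df2 = pd.DataFrame({"Letters": ["AD", "DC"],
                    "Words": ["Apple Dog", "Dog Cow"]})

# One row per letter; the "index" column remembers the original row in df2.
exploded = df2["Letters"].apply(list).explode().reset_index()

# Attach the reference word for each single letter.
merged = exploded.merge(df1, left_on="Letters", right_on="Letter")

# Collapse back to one string of reference words per original row.
others = merged.groupby("index")["Word"].agg(" ".join)

print(exploded)
print(others)
```

The words inside each joined string follow the row order of the merge result, which may differ from the letter order. The answer's output shows this for AED, where Others is Apple Dog Elephant.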

spd Over a year ago

Works, thanks. One more thing: if the first dataframe is keyed by combinations of letters, e.g. AE as Apple Elephant and AD as Apple Dog, how can it be done?

spd Over a year ago

I need to look up the whole Letters string from dataframe 2 against the Letter string in dataframe 1, then compare the corresponding words to find the fuzz ratio.

Corralien Over a year ago

If you have AE and EZ in your first dataframe and AEZ in your second, what should the result be?
