I have a training dataset, for example:

Letter    Word
A         Apple
B         Bat
C         Cat
D         Dog
E         Elephant

and I need to check a second dataframe against it, such as:

AD    Apple Dog
AE    Applet Elephant
DC    Dog Cow
EB    Elephant Bag
AED   Apple Elephant Dog  
D     Door                
ABC   All Bat Cat         

The instances AD, AE, and EB are close enough to count as matches (Apple and Applet are considered close to each other, and likewise Bat and Bag), but DC doesn't match.

Output Required:

Letters    Words               Status
AD         Apple Dog           Accept
AE         Applet Elephant     Accept
DC         Dog Cow             Reject
EB         Elephant Bag        Accept
AED        Apple Elephant Dog  Accept
D          Door                Reject
ABC        All Bat Cat         Accept

ABC is accepted because 2 of 3 words match.

An accepted word needs to match at 70% or better (fuzzy match), although the threshold is subject to change. How can I find these matches using Python?

  • Facing issues in framing the code. Nothing fruitful from the trials. Commented Apr 17, 2022 at 5:15
  • Can I assume if I have 2 letters (AD), I always have 2 and only 2 words (Apple Dog) (not 3, not 1) separated by space? Commented Apr 17, 2022 at 5:19
  • It can be variable. Commented Apr 17, 2022 at 5:21
  • Updated. Please check Commented Apr 17, 2022 at 5:27
  • So 2 letters -> 2 words, 1 letter -> 1 word, 3 letters -> 3 words. You always have 1:1. Commented Apr 17, 2022 at 5:29

1 Answer

You can use thefuzz to solve your problem:

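# Python env: pip install thefuzz
# Conda env: conda install thefuzz
import numpy as np
from thefuzz import fuzz

THRESHOLD = 70

# df1 holds the reference pairs (Letter, Word); df2 holds the rows to check (Letters, Words).
# Split each Letters string into single letters, look up each letter's word in df1,
# then join the reference words back into one string per original row.
df2['Others'] = (df2['Letters'].apply(list).explode().reset_index()
                     .merge(df1, left_on='Letters', right_on='Letter')
                     .groupby('index')['Word'].agg(' '.join))

# Compare the whole Words string with the reference string, then apply the threshold.
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')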

Output:

>>> df2
  Letters               Words              Others  Ratio  Status
0      AD           Apple Dog           Apple Dog    100  Accept
1      AE     Applet Elephant      Apple Elephant     97  Accept
2      DC             Dog Cow             Dog Cat     71  Accept
3      EB        Elephant Bag        Elephant Bat     92  Accept
4     AED  Apple Elephant Dog  Apple Dog Elephant     78  Accept
5       D                Door                 Dog     57  Reject
6     ABC         All Bat Cat       Apple Cat Bat     67  Reject
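
Note that the whole-string ratio in this output accepts DC (71) and rejects ABC (67), which differs from the statuses requested in the question. One possible alternative, sketched here under assumptions, scores each word against its own letter's reference word by position and accepts a row when enough of its words match. It swaps in the standard library's `difflib.SequenceMatcher` for `thefuzz` so that it runs without extra packages. `WORD_THRESHOLD`, `MIN_FRACTION`, `similarity`, and `status` are hypothetical names and values, not part of the original answer. The per-word cutoff is set to 65 rather than 70 because "Bat" vs "Bag" scores about 67, so a cutoff of 70 would not count it as a match.

```python
from difflib import SequenceMatcher

WORD_THRESHOLD = 65  # assumed per-word cutoff on a 0-100 scale
MIN_FRACTION = 0.6   # assumed share of words in a row that must match

# Reference pairs from the question (df1 expressed as a plain dict).
lookup = {"A": "Apple", "B": "Bat", "C": "Cat", "D": "Dog", "E": "Elephant"}

# Rows to check (the Letters and Words columns of df2).
rows = [
    ("AD", "Apple Dog"),
    ("AE", "Applet Elephant"),
    ("DC", "Dog Cow"),
    ("EB", "Elephant Bag"),
    ("AED", "Apple Elephant Dog"),
    ("D", "Door"),
    ("ABC", "All Bat Cat"),
]


def similarity(a, b):
    """Case-insensitive similarity between two words, from 0 to 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def status(letters, words):
    """Accept when enough words match their letter's reference word, by position."""
    tokens = words.split()
    # Assumption: one word per letter, as the comments under the question suggest.
    if not tokens or len(tokens) != len(letters):
        return "Reject"
    hits = sum(
        letter in lookup and similarity(lookup[letter], token) >= WORD_THRESHOLD
        for letter, token in zip(letters, tokens)
    )
    return "Accept" if hits / len(tokens) >= MIN_FRACTION else "Reject"


results = [status(letters, words) for letters, words in rows]
for (letters, words), result in zip(rows, results):
    print(f"{letters:<5}{words:<20}{result}")

# With pandas, the same function could fill a column:
# df2["Status"] = [status(l, w) for l, w in zip(df2["Letters"], df2["Words"])]
```

With these assumed settings, DC is rejected (1 of 2 words match) and ABC is accepted (2 of 3 words match). Both constants are meant to be tuned, since the question says the threshold may change.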

6 Comments

How can the code be modified if the ratio isn't 1:1?
@spd. In fact it doesn't matter. I created the Others column by exploding the Letters column. So for DC, I extracted Dog and Cat from your first dataframe and compared them with Dog Cow.
Works, thanks. One more thing: if I make the first dataframe a combination of letters, e.g. AE as Apple Elephant and AD as Apple Dog, how can it be done?
I need to look up the whole Letters string from dataframe 2 in the Letter column of dataframe 1 and compare the corresponding words with each other to find the fuzz ratio.
If you have AE and EZ in your first dataframe and AEZ in your second, what should the result be?