0

this seems like it should be an easy problem to solve, but I've been battling with it and cannot seem to find a solution.

I have two dataframes of different sizes and different column names. I am trying to compare a column in dataframe A, with another in dataframe B, and only retain the rows in dataframe A, if the strings are an EXACT match with B. I am not looking for a partial match or sub string, but an EXACT FULL STRING match. I have checked close to 40 questions now, and still keep getting partial matches.

Dataframe A
   ID                 Text     score
0   1                admin      10.4
1   2                Care        2.0
2   3                Office      5.0
3   4                Doctor      0.5
4   5                be          0.2
5   6                to          0.9
6   7                in          0.8 

And lets assume that the second dataframe is:

Dataframe B
   ID                 Labels              Places
0   1                Office                Upper
1   2                administration       Lower
2   3                Doctor               Internal
3   4                Reception           Outer
4   5                Tone                 Outer
5   6                Skin                 Internal
6   7                Behaviour            Outer
7   8                Injury               Upper

My desired output is

Dataframe A
   ID                 Text     score
2   3               Office       5.0
3   4               Doctor       0.5

This will be from comparing DataframeA['Text'] with DataframeB['Labels'] and only keeping the exact matches.

I have tried

df_A_new = df_A[df_A['Text'].isin(df_b['Labels'])]

and the output it gives me is

   ID                 Text     score
0   1                admin      10.4
2   3               Office       5.0
3   4               Doctor       0.5
4   5                be          0.2
5   6                to          0.9
6   7                in          0.8 

it maps the substring admin and administration. Merge has not helped either.

df_a_new = df_a.merge(df_b, left_on='Text', right_on='Lables')[['ID', 'Text', 'score']]

I have checked so many answers on stackoverflow, but they all seem to be matching substrings! Can anyone help with this?

7
  • 1
    I tested and your expression df_A_new=(df_A[df_A['Text'].isin(df_B['Labels'])]) is correct and gives the desired result. Commented Sep 30, 2024 at 15:28
  • Thanks @PepeNO. This is a fake dataset for illustration. The actual dataset is larger and gives these substring mappings when I run it. For example, it retains 'be' as one of the matched entries, but that's not present in the second dataset. A search shows things like 'members', 'behaviour', etc, so I think there's some substring matching going on. Commented Sep 30, 2024 at 15:39
  • 1
    Then add dummy data that really produces the wrong result that you are mentioning because your current example doesn't do that. Commented Sep 30, 2024 at 15:47
  • I got the same result as @PepeNO It would be better if you provide data that produces the error. Commented Sep 30, 2024 at 15:51
  • 1
    With your new sample data still gives a correct result. ID Text score 2 3 Office 5.0 3 4 Doctor 0.5 what is your setup in versions or where are you loading your data from? Commented Sep 30, 2024 at 16:03

1 Answer 1

1

Here is my suggestion.

Ensure string consistency

  • Remove all hidden characters like extra spaces, tabs, or non-printable characters.
  • Case sensitivity can be another issue. If the match should be case-insensitive, we can normalize all strings to lowercase.

Use merge() with strict matching

  • Instead of isin(), use merge() which performs a more explicit join between DataFrames and guarantees exact matches.

Optionally check string length

  • If substrings like be are sneaking through, we can ensure that the strings have the exact same length as an additional check.

Here is the code.

import pandas as pd

df_A = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Text': ['admin', 'Care', 'Office', 'Doctor'],
    'score': [10.4, 2.0, 5.0, 0.5]
})

df_B = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Labels': ['Office', 'administration', 'Doctor', 'Reception'],
    'Places': ['Upper', 'Lower', 'Internal', 'Outer']
})

# Strip all leading/trailing spaces and normalize to lowercase (or keep case-sensitive if needed)
df_A['Text'] = df_A['Text'].str.strip().str.lower()
df_B['Labels'] = df_B['Labels'].str.strip().str.lower()

# Ensure exact length matching (optional, if substring matches are suspected)
df_A_new = df_A[df_A['Text'].apply(len) == df_A['Text'].apply(len)]

# Use merge() for strict exact matching
df_A_new = df_A.merge(df_B, left_on='Text', right_on='Labels', how='inner')[['ID_x', 'Text', 'score']]

# Optionally rename ID column
df_A_new.rename(columns={'ID_x': 'ID'}, inplace=True)

# Display the result
print(df_A_new)

I hope this will help you a little.

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks. I'll try these. I already tried the lower() and strip(), so I'll try the rest.
ok, let me know if you have any trouble.
Thanks Temunel. This appears to have solved the problem. I did the stripping and lower casing again and then merged and took out duplicates.
glad to hear that it was helpful. :)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.