So far, I have managed to filter out roughly 10,000 rows of a ~20,000-row file, although my solution produces a lot of false positives. E.g. some rows marked TRUE for first names contain text like "Det er OK.", where Python (I assume) merges the entire text together and extracts any substring matching a name from the list; in this case I guess that could be "t er O" or "r OK", since my list has the names "Tero" and "Rok" (although the case does not match and it combines letters from 2-3 separate words, which is not what I want)... Weirdly enough, this does NOT happen for the same text written in lowercase and without the "." at the end, i.e. "det er ok", which is marked as FALSE! P.S. Unfortunately, a few names in the emails are written in lowercase letters and not capitalized as they should be...
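
A quick way to test the guess above, as a minimal sketch: `Series.str.contains` treats its pattern as a regular expression and looks for a contiguous, case-sensitive substring, so it should not stitch letters together across spaces. The short entry "De" below is purely hypothetical (it is not known to be in the real list); it only illustrates how a short capitalized name could explain why "Det er OK." is flagged while "det er ok" is not.

```python
import pandas as pd

messages = pd.Series(["Det er OK.", "det er ok"])

# The two names quoted in the question: neither occurs as a contiguous
# substring, so neither message is flagged.
quoted = messages.str.contains("Tero|Rok").tolist()
print(quoted)

# Hypothetical short entry "De": a case-sensitive substring hit inside "Det",
# but not inside the lowercase "det".
short = messages.str.contains("De").tolist()
print(short)
```

If the real list contains short or partial entries like this, they would be the first thing to look for.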

Hej Thomas,

De 24 timer var en af mange sager som vi havde med til møde med Lars og Ole. De har godkendt den under dette møde.

Mvh. Per

My code:

# Import datasets and create lists/variables
import pandas as pd
from pandas import ExcelWriter

namesdf = pd.read_excel('names.xlsx', sheet_name='Alle Navne')
names = list(namesdf['Names'])

lastnamesdf = pd.read_excel('names.xlsx', sheet_name='Frie Efternavne')
lastnames = list(lastnamesdf['Frie Efternavne'])


# Import dataset and drop NULLS
df = pd.read_excel(r'Entreprise Beskeder.xlsx', sheet_name='dataark')
df["Besked"].dropna(inplace = True)


# Compare dataset to the created lists to match first and last names
df["Navner"] = df["Besked"].str.contains("|".join(names)) # Creates new column and adds TRUE/FALSE for first names
df["Efternavner"] = df["Besked"].str.contains("|".join(lastnames)) # Creates new column and adds TRUE/FALSE for last names


# Save the result
writer = ExcelWriter('PythonExport.xlsx')
df.to_excel(writer)
writer.save()
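
One possible direction, sketched under assumptions rather than offered as a definitive fix: escape each name with `re.escape` and wrap the alternation in `\b` word boundaries, so that a name only matches as a whole word. The helper `build_name_pattern`, the sample rows, and the `Navner_ci` column are made up for illustration; `case=False` is shown as an option for the lowercase names mentioned above, at the cost of also matching ordinary words that happen to be spelled like a name.

```python
import re
import pandas as pd


def build_name_pattern(names):
    """Return a regex string matching any of the given names as a whole word."""
    # Drop missing/empty cells and surrounding whitespace from the Excel column.
    cleaned = {str(n).strip() for n in names if pd.notna(n) and str(n).strip()}
    # Escape regex metacharacters so every name is matched literally.
    alternatives = "|".join(re.escape(n) for n in sorted(cleaned))
    # \b ... \b only matches at word boundaries; (?:...) avoids capture groups.
    return rf"\b(?:{alternatives})\b"


# Made-up sample rows standing in for the "Besked" column.
df = pd.DataFrame({"Besked": [
    "Hej Thomas, De har godkendt den. Mvh. Per",
    "Det er OK.",
    "hej lars",
    None,
]})
first_names = ["Thomas", "Lars", "Ole", "Per", "Tero", "Rok"]

pattern = build_name_pattern(first_names)

# Case-sensitive whole-word match; na=False marks empty messages as False.
df["Navner"] = df["Besked"].str.contains(pattern, regex=True, na=False)

# Optional case-insensitive variant for names typed in lowercase.
df["Navner_ci"] = df["Besked"].str.contains(pattern, case=False, regex=True, na=False)

print(df)
```

Two side notes on the code above, separate from the matching issue: `df["Besked"].dropna(inplace=True)` acts on a column selection and does not remove rows from `df` (`df = df.dropna(subset=["Besked"])` is the usual form), and `ExcelWriter.save()` was removed in pandas 2.0, so `df.to_excel('PythonExport.xlsx')` is the simpler way to write the result.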

Python - use a list of names to find exact match in the free text pandas column containing emails
