So far I have managed to filter out roughly 10,000 rows of a ~20,000-row file, but my solution produces a lot of false positives. For example, some rows marked TRUE for first names contain text like "Det er OK.". I assume Python merges the entire text together and matches any substring against a name from the list; in this case I guess that could be "t er O" or "r OK", since my list has the names "Tero" and "Rok" (although the case does not match and it combines letters from two or three separate words, which is not what I want). Weirdly enough, the same text written in lowercase and without the "." at the end, i.e. "det er ok", is marked FALSE! P.S. Unfortunately, a few names in the emails are written in lowercase rather than capitalized as they should be.
Hej Thomas,
De 24 timer var en af mange sager som vi havde med til møde med Lars og Ole. De har godkendt den under dette møde.
Mvh. Per
Below is my code:
# Import the name lists
import pandas as pd
namesdf = pd.read_excel('names.xlsx', sheet_name='Alle Navne')
names = list(namesdf['Names'])
lastnamesdf = pd.read_excel('names.xlsx', sheet_name='Frie Efternavne')
lastnames = list(lastnamesdf['Frie Efternavne'])
# Import the messages and drop rows where the message is missing
# (dropna on the column alone with inplace=True does not remove rows from df)
df = pd.read_excel(r'Entreprise Beskeder.xlsx', sheet_name='dataark')
df = df.dropna(subset=["Besked"])
# Compare the messages to the lists to match first and last names
df["Navner"] = df["Besked"].str.contains("|".join(names)) # New column: TRUE/FALSE for first names
df["Efternavner"] = df["Besked"].str.contains("|".join(lastnames)) # New column: TRUE/FALSE for last names
# Save the result (ExcelWriter.save() is no longer available in pandas 2.x)
df.to_excel('PythonExport.xlsx')
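To make the behaviour described above easier to inspect, here is a minimal sketch using only the standard `re` module. It rests on assumptions: the `names` list is hypothetical (the real lists live in names.xlsx), and the short entry "De" is invented purely to show how a TRUE for "Det er OK." but FALSE for "det er ok" could arise. A regex alternation matches contiguous, case-sensitive substrings and does not skip spaces, so "Tero" or "Rok" would not match that text. `substring_hits` mimics the current `"|".join(names)` pattern (for names without regex metacharacters), and `whole_word_hits` is one possible way to restrict matches to complete words.

```python
import re

# Hypothetical sample list; "De" is an invented short entry for illustration.
names = ["Thomas", "Lars", "Ole", "Per", "Tero", "Rok", "De"]


def substring_hits(text, names):
    """Return names found anywhere in text, including inside longer words."""
    pattern = "|".join(re.escape(n) for n in names)
    return re.findall(pattern, text)


def whole_word_hits(text, names, ignore_case=False):
    """Return names that occur in text as complete words only."""
    # Longest names first, so a longer name is tried before a shorter one.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in ordered)
    # \b is a word boundary; (?:...) groups without capturing.
    pattern = r"\b(?:" + alternatives + r")\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.findall(pattern, text, flags)


print(substring_hits("Det er OK.", names))    # the entries that triggered a match
print(whole_word_hits("Det er OK.", names))   # complete-word matches only
print(whole_word_hits("hej thomas", names, ignore_case=True))
```

If this approach fits, the same pattern string could presumably be passed to `df["Besked"].str.contains(pattern, na=False)`, with `flags=re.IGNORECASE` to catch the lowercase names. Note that whole-word matching still cannot tell a name from an ordinary word spelled the same way, and ignoring case makes that more likely.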