edit approved Aug 4, 2020 at 12:50

So far, I have managed to filter out roughly 10,000 rows of a ~20,000-row file, although my solution produces a lot of false positives. E.g. some rows marked TRUE for first names contain text like "Det er OK.", where Python (I assume) merges the entire text together and extracts any substring that matches a name from the list. In this case I guess that could be "t er O" or "r OK", since my list has the names "Tero" and "Rok" (although the case does not match and it combines letters from 2-3 separate words, which is not what I want). Weirdly enough, this does NOT happen for the same text written in lowercase and without the "." at the end, i.e. "det er ok", which is marked as FALSE! P.S. Unfortunately, a few names in the emails are written in lowercase letters rather than capitalized as they should be.
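To make the symptom easier to reproduce, here is a minimal sketch with made-up data. The two messages come from the description above; the name lists, and in particular the short entry "De", are hypothetical and only stand in for whatever the real list contains. It suggests that `str.contains` does a case-sensitive regex search on each string as written, without joining letters across spaces, so a short list entry may be a more likely cause than "Tero" or "Rok".

```python
import pandas as pd

# The two example messages from the description above
messages = pd.Series(["Det er OK.", "det er ok"])

# str.contains treats the joined list as one regex and searches each string
# as written (case-sensitive); spaces are not skipped, so "Tero" and "Rok"
# do not match "t er O" or "r OK"
across_words = messages.str.contains("|".join(["Tero", "Rok"]))

# Hypothetical short list entry "De": it matches inside "Det",
# but not inside the lowercase "det"
short_name = messages.str.contains("|".join(["Tero", "Rok", "De"]))

print(across_words.tolist())  # [False, False]
print(short_name.tolist())    # [True, False]
```

If the real list contains short entries of this kind, that would also be consistent with the lowercase text being marked FALSE.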

Hej Thomas,

De 24 timer var en af mange sager som vi havde med til møde med Lars og Ole. De har godkendt den under dette møde.

Mvh. Per

My code:

# Import datasets and create lists/variables
import pandas as pd
from pandas import ExcelWriter

namesdf = pd.read_excel('names.xlsx', sheet_name='Alle Navne')
names = list(namesdf['Names'])

lastnamesdf = pd.read_excel('names.xlsx', sheet_name='Frie Efternavne')
lastnames = list(lastnamesdf['Frie Efternavne'])


# Import dataset and drop rows where the message is missing
df = pd.read_excel(r'Entreprise Beskeder.xlsx', sheet_name='dataark')
df = df.dropna(subset=["Besked"])  # dropna on the column alone would leave df unchanged


# Compare dataset to the created lists to match first and last names
df["Navner"] = df["Besked"].str.contains("|".join(names)) # Creates new column and adds TRUE/FALSE for first names
df["Efternavner"] = df["Besked"].str.contains("|".join(lastnames)) # Creates new column and adds TRUE/FALSE for last names


# Save the result (the context manager writes and closes the file)
with ExcelWriter('PythonExport.xlsx') as writer:
    df.to_excel(writer)
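For reference, the kind of exact (whole-word) match described in the title could be sketched roughly as follows. This is only an illustration under assumptions, not a tested solution for the real files: the helper `whole_word_pattern`, the name list, and the sample messages are hypothetical. It escapes each name with `re.escape`, wraps the alternatives in `\b` word boundaries, and passes `na=False` so empty cells count as no match.

```python
import re
import pandas as pd

def whole_word_pattern(names):
    """Build a regex that matches any of the names only as a whole word."""
    # Drop missing values, remove duplicates, and put longer names first
    cleaned = sorted({str(n) for n in names if pd.notna(n)}, key=len, reverse=True)
    # Escape regex metacharacters and require a word boundary on both sides
    return r"\b(?:" + "|".join(re.escape(n) for n in cleaned) + r")\b"

# Hypothetical name list and messages, for illustration only
names = ["Tero", "Rok", "Lars", "Ole", "Per", "De"]
messages = pd.Series([
    "Det er OK.",
    "møde med Lars og Ole",
    "mvh. per",
    None,
])

pattern = whole_word_pattern(names)

# Case-sensitive whole-word match; missing messages are treated as False
exact = messages.str.contains(pattern, na=False)

# Same match ignoring case, to also catch names typed in lowercase
ignore_case = messages.str.contains(pattern, case=False, na=False)

print(exact.tolist())        # [False, True, False, False]
print(ignore_case.tolist())  # [False, True, True, False]
```

Ignoring case catches names typed in lowercase, but it would also flag ordinary words that happen to be spelled like a name, so it trades one kind of false positive for another.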

Python - use a list of names to find exact match in the free text pandas column containing emails

asked Aug 3, 2020 at 12:19

mantasbacys
