The code could be optimized by using vectorized pandas functions instead of iterating through each value.
Explanation:
groupby(col)[col].transform(): Instead of iterating through each value, we group the DataFrame by the values in the current column, so each group holds every occurrence of one value. Using transform ensures the result has the same shape and index as the original column.
x.duplicated(keep='first').cumsum(): Within each group, we flag duplicated values using the duplicated method. keep='first' ensures the first occurrence is not considered a duplicate. cumsum() then gives each subsequent duplicate an incrementing number (1, 2, 3, ...), which is the number of dots to append.
x + ('.' * <count>): Each value in the group gets that many dots appended. The first occurrence has a count of 0 and stays unchanged. This assumes the column holds strings.

import tkinter as tk
from tkinter import filedialog

import pandas as pd


def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()


def process_excel():
    label_load.pack_forget()
    for col in df.columns:
        # Within each group of equal values, number the repeats 0, 1, 2, ...
        # and append that many dots (assumes the column holds strings).
        df[col] = df.groupby(col)[col].transform(
            lambda x: x + x.duplicated(keep='first').cumsum().map(lambda n: '.' * n)
        )
    df.to_excel("output.xlsx", index=False)
    label_done.pack()


root = tk.Tk()
root.geometry('700x400')
load_button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
load_button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
process_button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
process_button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()

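
To see what the dot-appending step is meant to produce without the GUI, here is a minimal sketch on a toy DataFrame. The column name `name`, the sample values and the helper `add_dots` are made up for illustration, and the sketch assumes the column holds strings. It uses `groupby(...).cumcount()`, which numbers the occurrences of each value directly, as one possible way to obtain the per-duplicate count.

```python
import pandas as pd


def add_dots(column: pd.Series) -> pd.Series:
    """Append n dots to the n-th repeat of each value (hypothetical helper)."""
    # cumcount() numbers the rows of each group 0, 1, 2, ... in original order.
    repeat_number = column.groupby(column).cumcount()
    # '.' * 0 is '', so first occurrences stay unchanged.
    dots = repeat_number.map(lambda n: "." * max(n, 0))
    return column + dots


df = pd.DataFrame({"name": ["x", "y", "x", "x", "y"]})
df["name"] = add_dots(df["name"])
print(df["name"].tolist())  # ['x', 'y', 'x.', 'x..', 'y.']
```

The row order is preserved, so the result can be assigned straight back to the column.
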
Approach #2
• Use duplicated across the entire DataFrame and iterate only where duplicates are found.

import tkinter as tk
from tkinter import filedialog

import pandas as pd


def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()


def process_excel():
    label_load.pack_forget()
    # Rows that are exact copies of another row, first occurrence included.
    duplicates = df[df.duplicated(keep=False)]
    for col in df.columns:
        # 0 for the first occurrence of a value, 1 for the second, and so on.
        occurrence = duplicates.groupby(col).cumcount()
        # Visit only the actual repeats (assumes the column holds text).
        for index, count in occurrence[occurrence > 0].items():
            df.at[index, col] = f"{df.at[index, col]}{'.' * count}"
    df.to_excel("output.xlsx", index=False)
    label_done.pack()


root = tk.Tk()
root.geometry('700x400')
load_button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
load_button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
process_button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
process_button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()

Explanation:
df[df.duplicated(keep=False)]: Selects every row that is an exact copy of another row, including the first occurrence. DataFrame.duplicated compares whole rows, so a value that repeats in one column only is left untouched by this approach.
duplicates.groupby(col).cumcount(): Within those rows, numbers each occurrence of a value 0, 1, 2, ..., which is the number of dots to append. Only entries with a count above zero are rewritten, so the loop visits nothing but actual repeats.
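
Because this approach calls duplicated on the whole DataFrame, it is worth checking what that selects compared with calling it on a single column. The following is a small sketch on made-up data (the column names `a` and `b` and their values are assumptions for illustration only):

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "x"], "b": ["1", "1", "2"]})

# DataFrame.duplicated compares entire rows.
row_mask = df.duplicated(keep=False)
# Series.duplicated looks at one column only.
col_mask = df["a"].duplicated(keep=False)

print(row_mask.tolist())  # [True, True, False]
print(col_mask.tolist())  # [True, True, True]
```

The third row repeats the value in column `a` but is not a duplicate row, so a row-level mask skips it. If repeats should be marked per column regardless of the other columns, the column-level mask is likely the one to use.
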
"//" does not start a comment in Python. A line such as "// load excel file" will result in a SyntaxError. Just change the comments to actual Python ones (starting with #) or remove the comments entirely.