How to unify different datatypes from an excel to a DataFrame using python

Question

I have an Excel file with data extracted from a platform called SAP. The dates usually come in dd/mm/yyyy format, but sometimes they are extracted as dd.mm.yyyy. When I load that column into a DataFrame with the pandas library, the formats end up inconsistent. This is the code I've been trying:

import pandas as pd
import re
from datetime import datetime

# Convert xlsx to csv
excel_data = pd.read_excel("Reportes/Crudos/Reporte SAP.xlsx", header=1)
excel_data['Asiento contable (Fecha de contabilización)'].to_csv("Reportes/Crudos/1.csv", index=False)
excel_data['Asiento contable (Fecha de contabilización)'].to_excel("Reportes/Crudos/1.xlsx", index=False)
# Print only the unique values, without repetition
print(excel_data['Asiento contable (Fecha de contabilización)'].unique())

The print gives me this:

[datetime.datetime(2025, 8, 1, 0, 0) datetime.datetime(2025, 9, 1, 0, 0)
 '13/01/2025' '17/01/2025' datetime.datetime(2025, 3, 1, 0, 0)
 '18/01/2025' '20/01/2025' datetime.datetime(2025, 10, 1, 0, 0)
 '14/01/2025' datetime.datetime(2025, 2, 1, 0, 0)
 datetime.datetime(2025, 6, 1, 0, 0) datetime.datetime(2025, 7, 1, 0, 0)
 datetime.datetime(2025, 11, 1, 0, 0) datetime.datetime(2025, 4, 1, 0, 0)
 '15/01/2025' '16/01/2025' datetime.datetime(2025, 12, 1, 0, 0)
 datetime.datetime(2025, 5, 1, 0, 0) '19/01/2025'
 datetime.datetime(2025, 1, 30, 0, 0) datetime.datetime(2025, 1, 31, 0, 0)
 datetime.datetime(2025, 1, 28, 0, 0) datetime.datetime(2025, 1, 23, 0, 0)
 datetime.datetime(2025, 1, 27, 0, 0) datetime.datetime(2025, 1, 29, 0, 0)
 datetime.datetime(2025, 1, 22, 0, 0) datetime.datetime(2025, 1, 24, 0, 0)
 datetime.datetime(2025, 1, 25, 0, 0) datetime.datetime(2025, 1, 21, 0, 0)
 datetime.datetime(2025, 1, 26, 0, 0)]

Opening the generated csv and xlsx files, I see this:

csv:

1. 01/08/2025 00:00
2. 01/09/2025 00:00
3. 13/01/2025
4. 17/01/2025
5. 01/03/2025 00:00
6. 18/01/2025
7. 20/01/2025
8. 01/10/2025 00:00
9. 14/01/2025
10. 01/02/2025 00:00
11. 01/06/2025 00:00
12. 01/07/2025 00:00
13. 01/11/2025 00:00
14. 01/04/2025 00:00
15. 15/01/2025
16. 16/01/2025
17. 01/12/2025 00:00
18. 01/05/2025 00:00
19. 19/01/2025
20. 30/01/2025 00:00
21. 31/01/2025 00:00
22. 28/01/2025 00:00
23. 23/01/2025 00:00
24. 27/01/2025 00:00
25. 29/01/2025 00:00
26. 22/01/2025 00:00
27. 24/01/2025 00:00
28. 25/01/2025 00:00
29. 21/01/2025 00:00
30. 26/01/2025 00:00

xlsx:

1. 2025-08-01 00:00:00
2. 2025-09-01 00:00:00
3. 13/01/2025
4. 17/01/2025
5. 2025-03-01 00:00:00
6. 18/01/2025
7. 20/01/2025
8. 2025-10-01 00:00:00
9. 14/01/2025
10. 2025-02-01 00:00:00
11. 2025-06-01 00:00:00
12. 2025-07-01 00:00:00
13. 2025-11-01 00:00:00
14. 2025-04-01 00:00:00
15. 15/01/2025
16. 16/01/2025
17. 2025-12-01 00:00:00
18. 2025-05-01 00:00:00
19. 19/01/2025
20. 2025-01-30 00:00:00
21. 2025-01-31 00:00:00
22. 2025-01-28 00:00:00
23. 2025-01-23 00:00:00
24. 2025-01-27 00:00:00
25. 2025-01-29 00:00:00
26. 2025-01-22 00:00:00
27. 2025-01-24 00:00:00
28. 2025-01-25 00:00:00
29. 2025-01-21 00:00:00
30. 2025-01-26 00:00:00

This means we have 3 types of format:

dd/mm/yyyy hh:mm (csv) | dd-mm-yyyy hh:mm (excel)
dd/mm/yyyy
mm/dd/yyyy hh:mm (csv) | mm-dd-yyyy hh:mm (excel)

Looking closer, the dates from January 1 to 12 come out in this format: cell: mm/dd/yyyy hh:mm, formula bar: mm/dd/yyyy hh:mm:ss a. m.

The dates from 13 to 20 come in this format: cell: dd/mm/yyyy, formula bar: dd/mm/yyyy

The dates from 21 to 31 come in this format: cell: dd/mm/yyyy hh:mm, formula bar: dd/mm/yyyy hh:mm:ss a. m.

I've tried this:

df["Asiento contable (Fecha de contabilización)"] = pd.to_datetime(df["Asiento contable (Fecha de contabilización)"], dayfirst=True)

But it doesn't recognize the dates correctly, and I end up with no data between the 1st and the 12th of January.

I want to know if there is a way to unify these 3 types of format into just one: dd/mm/yyyy

The original Excel data is not shown (as a table), but the list output indicates that your code fails because pd.read_excel, like pd.to_datetime, assumes a format of MM/DD/YYYY. The conversion to a datetime object failed whenever the first two digits, read as a month, were above 12. You could try the dayfirst=True option of pd.to_datetime.
– RA Prism, commented Feb 13, 2025 at 20:04

franjefriten · Accepted Answer · 2025-02-25 19:34:46Z

It seems that your dataset does indeed use more than one date format. For one, you can almost always ignore 00:00, since midnight is the default time when a date is parsed with either datetime.datetime or pd.to_datetime. The difference between 00:00:00 and 00:00 is only how the Excel and .csv files display the value, and is not itself an error. The real issue, in my opinion, is handling both formats, yyyy-mm-dd and dd/mm/yyyy. Keep in mind that an Excel column does not need to have a single data type. This means that 2025-08-01 00:00:00 is interpreted as a date, whereas 13/01/2025 is a string. The same thing happens in the .csv file: 01/09/2025 00:00 is a datetime and 13/01/2025 is a string.
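
To make the mixed-type point concrete, here is a minimal sketch that counts the Python types found in a column. It uses a plain list with a few values copied from the question's unique() output instead of the real file; the names values and type_counts are only illustrative.

```python
from collections import Counter
from datetime import datetime

# A few values copied from the question's unique() output (not the full column).
values = [
    datetime(2025, 8, 1, 0, 0),
    "13/01/2025",
    datetime(2025, 1, 30, 0, 0),
    "17/01/2025",
]

# Count how many cells of each Python type the column holds.
type_counts = Counter(type(v).__name__ for v in values)
print(type_counts)
```

On the real DataFrame, something like excel_data['Asiento contable (Fecha de contabilización)'].map(type).value_counts() should give the same kind of breakdown.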

What you need to do is pass the correct format to pd.to_datetime, like this:

df["Asiento contable (Fecha de contabilización)"] = pd.to_datetime(df["Asiento contable (Fecha de contabilización)"], format='mixed', dayfirst=True)

as stated in the docs.
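
If the end goal is dd/mm/yyyy text, as the question asks, the same unification can also be sketched in plain Python, cell by cell. This is not the pandas approach above, only a hedged alternative: to_ddmmyyyy and its fix_swapped parameter are hypothetical names, and the day/month swap rests on the assumption (taken from the question's own analysis) that every cell Excel already converted with a day of 12 or less was read as mm/dd.

```python
from datetime import datetime

# Hypothetical helper, not part of pandas; name and parameters are assumptions.
def to_ddmmyyyy(value, fix_swapped=False):
    """Return a date as 'dd/mm/yyyy' text.

    value may be a datetime (a cell that was already converted) or a string
    such as '13/01/2025', '13.01.2025' or '13/01/2025 00:00'.
    """
    if isinstance(value, datetime):
        # Assumption: with fix_swapped=True, an already-converted cell whose
        # day is <= 12 was read as mm/dd, so day and month are exchanged back.
        # Leave fix_swapped=False if that does not hold for your data.
        if fix_swapped and value.day <= 12:
            value = value.replace(day=value.month, month=value.day)
        return value.strftime("%d/%m/%Y")

    # Strings: accept '.' or '/' as separator and drop any trailing time.
    text = str(value).strip().replace(".", "/")
    date_part = text.split()[0]
    parsed = datetime.strptime(date_part, "%d/%m/%Y")  # raises ValueError if malformed
    return parsed.strftime("%d/%m/%Y")


print(to_ddmmyyyy(datetime(2025, 1, 30)))
print(to_ddmmyyyy("14.01.2025"))
print(to_ddmmyyyy(datetime(2025, 8, 1), fix_swapped=True))
```

A function like this could be applied with Series.map on the column. Empty cells are not handled here and would need to be filtered out or special-cased first.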
