I have an Excel file exported from a platform called SAP. The export normally uses the date format dd/mm/yyyy, but sometimes the date comes out as dd.mm.yyyy. When I load that column into a DataFrame with Python's pandas library, the formats get mixed up. This is the code I've been trying:

import pandas as pd
import re
from datetime import datetime

# Convert xlsx to csv
excel_data = pd.read_excel("Reportes/Crudos/Reporte SAP.xlsx", header=1)
excel_data['Asiento contable (Fecha de contabilización)'].to_csv("Reportes/Crudos/1.csv", index=False)
excel_data['Asiento contable (Fecha de contabilización)'].to_excel("Reportes/Crudos/1.xlsx", index=False)
# Print only the unique values, without repetition
print(excel_data['Asiento contable (Fecha de contabilización)'].unique())

The print gives me this:

[datetime.datetime(2025, 8, 1, 0, 0) datetime.datetime(2025, 9, 1, 0, 0)
 '13/01/2025' '17/01/2025' datetime.datetime(2025, 3, 1, 0, 0)
 '18/01/2025' '20/01/2025' datetime.datetime(2025, 10, 1, 0, 0)
 '14/01/2025' datetime.datetime(2025, 2, 1, 0, 0)
 datetime.datetime(2025, 6, 1, 0, 0) datetime.datetime(2025, 7, 1, 0, 0)
 datetime.datetime(2025, 11, 1, 0, 0) datetime.datetime(2025, 4, 1, 0, 0)
 '15/01/2025' '16/01/2025' datetime.datetime(2025, 12, 1, 0, 0)
 datetime.datetime(2025, 5, 1, 0, 0) '19/01/2025'
 datetime.datetime(2025, 1, 30, 0, 0) datetime.datetime(2025, 1, 31, 0, 0)
 datetime.datetime(2025, 1, 28, 0, 0) datetime.datetime(2025, 1, 23, 0, 0)
 datetime.datetime(2025, 1, 27, 0, 0) datetime.datetime(2025, 1, 29, 0, 0)
 datetime.datetime(2025, 1, 22, 0, 0) datetime.datetime(2025, 1, 24, 0, 0)
 datetime.datetime(2025, 1, 25, 0, 0) datetime.datetime(2025, 1, 21, 0, 0)
 datetime.datetime(2025, 1, 26, 0, 0)]

Going in the generated csv and xlsx it gives me this:

csv:

1. 01/08/2025 00:00
2. 01/09/2025 00:00
3. 13/01/2025
4. 17/01/2025
5. 01/03/2025 00:00
6. 18/01/2025
7. 20/01/2025
8. 01/10/2025 00:00
9. 14/01/2025
10. 01/02/2025 00:00
11. 01/06/2025 00:00
12. 01/07/2025 00:00
13. 01/11/2025 00:00
14. 01/04/2025 00:00
15. 15/01/2025
16. 16/01/2025
17. 01/12/2025 00:00
18. 01/05/2025 00:00
19. 19/01/2025
20. 30/01/2025 00:00
21. 31/01/2025 00:00
22. 28/01/2025 00:00
23. 23/01/2025 00:00
24. 27/01/2025 00:00
25. 29/01/2025 00:00
26. 22/01/2025 00:00
27. 24/01/2025 00:00
28. 25/01/2025 00:00
29. 21/01/2025 00:00
30. 26/01/2025 00:00

xlsx:

1. 2025-08-01 00:00:00
2. 2025-09-01 00:00:00
3. 13/01/2025
4. 17/01/2025
5. 2025-03-01 00:00:00
6. 18/01/2025
7. 20/01/2025
8. 2025-10-01 00:00:00
9. 14/01/2025
10. 2025-02-01 00:00:00
11. 2025-06-01 00:00:00
12. 2025-07-01 00:00:00
13. 2025-11-01 00:00:00
14. 2025-04-01 00:00:00
15. 15/01/2025
16. 16/01/2025
17. 2025-12-01 00:00:00
18. 2025-05-01 00:00:00
19. 19/01/2025
20. 2025-01-30 00:00:00
21. 2025-01-31 00:00:00
22. 2025-01-28 00:00:00
23. 2025-01-23 00:00:00
24. 2025-01-27 00:00:00
25. 2025-01-29 00:00:00
26. 2025-01-22 00:00:00
27. 2025-01-24 00:00:00
28. 2025-01-25 00:00:00
29. 2025-01-21 00:00:00
30. 2025-01-26 00:00:00

Which means we have 3 types of format:

  1. dd/mm/yyyy hh:mm (csv) | dd-mm-yyyy hh:mm (excel)
  2. dd/mm/yyyy
  3. mm/dd/yyyy hh:mm (csv) | mm-dd-yyyy hh:mm (excel)

Looking closer, dates from January 1 to 12 come out in this format: cell: mm/dd/yyyy hh:mm; formula bar: mm/dd/yyyy hh:mm:ss a. m.

Dates from the 13th to the 20th come in this format: cell: dd/mm/yyyy; formula bar: dd/mm/yyyy

And dates from the 21st to the 31st come in this format: cell: dd/mm/yyyy hh:mm; formula bar: dd/mm/yyyy hh:mm:ss a. m.

I've tried this:

df["Asiento contable (Fecha de contabilización)"] = pd.to_datetime(df["Asiento contable (Fecha de contabilización)"], dayfirst=True)

But it doesn't recognize the dates correctly, and I end up with no data between the 1st and the 12th of January.

I want to know if there is a way to unify these 3 formats into just one: dd/mm/yyyy

  • The original Excel data is not shown (as a table), but the printed output suggests that your code fails because pd.read_excel, like pd.to_datetime, assumes a format of MM/DD/YYYY. Conversion to a datetime object fails whenever the first two characters, read as a number, are above 12, so those values stay as strings. You could try the dayfirst=True option of pd.to_datetime. Commented Feb 13, 2025 at 20:04
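To illustrate the month-first reading described in this comment, here is a minimal sketch using only the standard library. The helper name parse_month_first and the two sample strings are hypothetical; it only shows why days 1 to 12 can be silently misread while days above 12 cannot be parsed at all.

```python
from datetime import datetime


def parse_month_first(text):
    """Try to read text as mm/dd/yyyy; return None when that is impossible."""
    try:
        return datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return None


# Intended as 8 January, but a month-first reading yields 1 August.
ambiguous = parse_month_first("08/01/2025")
# There is no month 13, so a month-first reading cannot succeed.
unambiguous = parse_month_first("13/01/2025")

print(ambiguous)
print(unambiguous)
```

The first value parses without any error, which is why the swapped dates are easy to miss.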

1 Answer

It seems that your dataset does indeed use more than one date format. For one, you can almost always ignore 00:00, since midnight is the default time when a date is parsed into a datetime.datetime or by pd.to_datetime. The difference between 00:00:00 and 00:00 is only a matter of how Excel and .csv files format the value and is not itself an error. The real issue, in my opinion, is handling both formats, YYYY-mm-dd and dd/mm/YYYY. Keep in mind that Excel does not require a column to hold a single data type. This means that 2025-08-01 00:00:00 is interpreted as a date, whereas 13/01/2025 is a string. The same thing happens with the .csv file: 01/09/2025 00:00 is a datetime and 13/01/2025 is a string.
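One way to confirm that the column really mixes types is to count how many cells are plain strings. This is only a sketch: the Series named column is a hypothetical stand-in for the real column, built by hand to mimic the printed output in the question.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample mimicking the mixed column from the question.
column = pd.Series(
    [datetime(2025, 8, 1), "13/01/2025", datetime(2025, 1, 30), "17/01/2025"],
    dtype=object,
)

# True where the cell is text, False where it is already a date object.
is_text = column.map(lambda value: isinstance(value, str))

n_text = int(is_text.sum())
n_dates = len(column) - n_text
print(n_text, "strings,", n_dates, "date objects")
```

If both counts are non-zero, the column holds a mix of strings and date objects.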

What you need to do is pass the right format option to pd.to_datetime, like this:

df["Asiento contable (Fecha de contabilización)"] = pd.to_datetime(df["Asiento contable (Fecha de contabilización)"], format='mixed', dayfirst=True)

as stated in the docs.
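Here is a small end-to-end sketch of that call followed by formatting back to dd/mm/yyyy text. It assumes pandas 2.0 or later (where format='mixed' is available), and the Series named raw is a hypothetical sample rather than the real SAP column.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample: one value already read as a date, two still strings.
raw = pd.Series(
    [datetime(2025, 1, 30), "13/01/2025", "05/01/2025"],
    dtype=object,
)

# Parse each element on its own; dayfirst=True reads the strings as dd/mm/yyyy.
parsed = pd.to_datetime(raw, format="mixed", dayfirst=True)

# Render every value as dd/mm/yyyy text.
unified = parsed.dt.strftime("%d/%m/%Y")
print(unified.tolist())
```

Note that dayfirst only influences how strings are parsed. Values that were already read as date objects pass through unchanged, so a cell that arrived as 2025-08-01 stays 1 August even if the source meant 8 January.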
