Thanks in advance for any help you can provide. I have a dataset containing some healthcare data and am trying my hand at using Python for EDA/regression modeling on it. I have one date column [date_of_incident] where a lot of the data is missing or incorrect. I also have a [treatment_date] column that has accurate information. I have converted both columns to datetime and created a new column, [dt_diff]=[treatment_date]-[date_of_incident], to get the number of days between the two dates.
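For context, the setup described above looks roughly like the sketch below. The sample rows and the month/day/year format string are assumptions for illustration only; the real values come from the dataset.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset.
df = pd.DataFrame({
    "date_of_incident": ["5/1/2025", None, "6/30/1965", "not a date"],
    "treatment_date": ["5/8/2025", "5/17/2025", "5/20/2025", "5/21/2025"],
})

# errors="coerce" turns missing or unparseable values into NaT instead of raising.
# The format string is an assumption (month/day/year).
for col in ["date_of_incident", "treatment_date"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y", errors="coerce")

# Subtracting two datetime columns gives a timedelta column.
# Rows with a missing date_of_incident get NaT here.
df["dt_diff"] = df["treatment_date"] - df["date_of_incident"]
```

Because [dt_diff] is a timedelta column, `df["dt_diff"].dt.days` gives the gap as a plain number of days.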
I want to use the average of [dt_diff] to impute new dates in the [date_of_incident] column. Basically [new_date_of_incident]=[treatment_date]-[dt_diff].mean(), but I only want to replace the missing or incorrect dates, not every date in the column.
For the sake of an example, say the average of [dt_diff] is 7 days. Case A's [date_of_incident] is NaN and has a [treatment_date] of 5/17/2025, Case B's [date_of_incident] is 6/30/1965 and has a [treatment_date] of 5/20/2025. What is the best way for me to change Case A's [date_of_incident] to 5/10/2025 and Case B's [date_of_incident] to 5/13/2025, but for well over 1,000 rows? The dataset is not large enough for me to drop these rows and that column is important to the goal of the analysis.
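The Case A / Case B example can be sketched in pandas with a boolean mask and `.loc`. This is only a rough illustration under stated assumptions: Cases C and D are made-up valid rows chosen so the average gap comes out to 7 days, and "incorrect" is assumed to mean a [date_of_incident] that falls after [treatment_date] or more than a hypothetical cutoff of 365 days before it. The actual rule for what counts as incorrect depends on the data.

```python
import pandas as pd

# Cases A and B are from the example; C and D are made-up valid rows
# chosen so that the average gap is 7 days.
df = pd.DataFrame({
    "case": ["A", "B", "C", "D"],
    "date_of_incident": pd.to_datetime([None, "1965-06-30", "2025-05-01", "2025-05-10"]),
    "treatment_date": pd.to_datetime(["2025-05-17", "2025-05-20", "2025-05-07", "2025-05-18"]),
})
df["dt_diff"] = df["treatment_date"] - df["date_of_incident"]

# Assumed definition of a bad row: missing, after treatment,
# or more than max_gap before treatment. max_gap is a hypothetical cutoff.
max_gap = pd.Timedelta(days=365)
bad = (
    df["date_of_incident"].isna()
    | (df["dt_diff"] < pd.Timedelta(0))
    | (df["dt_diff"] > max_gap)
)

# Average over the good rows only, so values like Case B's 1965 date
# do not distort the mean.
mean_gap = df.loc[~bad, "dt_diff"].mean()

# Overwrite only the bad rows; all other dates are left as they were.
df.loc[bad, "date_of_incident"] = df.loc[bad, "treatment_date"] - mean_gap

# Recompute the gap so it reflects the imputed dates.
df["dt_diff"] = df["treatment_date"] - df["date_of_incident"]
```

The mask approach is vectorized, so it works the same way for 4 rows or well over 1,000. On real data the mean will probably not be a whole number of days; `mean_gap.round("D")` is one option for rounding it to the nearest day before subtracting.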
I have calculated the average of [dt_diff] but don't know how to proceed from there. I'm a super noob to Python coding.
Would pandas be the right tool for this?