3

Thanks in advance for any help you can provide. I have a dataset containing some healthcare data and am trying my hand at using python for EDA/regression modeling on the set. I have one date column [date_of_incident] with a lot of the data missing or incorrect. I also have a [treatment_date] column that has accurate information. I have converted both columns to datetime and created a new column: [dt_diff]=[treatment_date]-[date_of_incident] to find out how many days between the two columns.

I want to use the average of [dt_diff] to impute new dates in the [date_of_incident] column. Basically [new_date_of_incident]=[treatment_date]-[dt_diff].mean(), but I do not want to replace all of the dates in the column. Just the missing or incorrect ones.

For the sake of an example, say the average of [dt_diff] is 7 days. Case A's [date_of_incident] is NaN and has a [treatment_date] of 5/17/2025, Case B's [date_of_incident] is 6/30/1965 and has a [treatment_date] of 5/20/2025. What is the best way for me to change Case A's [date_of_incident] to 5/10/2025 and Case B's [date_of_incident] to 5/13/2025, but for well over 1,000 rows? The dataset is not large enough for me to drop these rows and that column is important to the goal of the analysis.

I calculated the average of [dt_diff] and just don't know how to proceed. I'm a super noob to python coding.

1
  • are you using pandas? Commented May 29 at 4:03

1 Answer 1

1

Assuming that you are using pandas.

Using the following as an example:

>>> df
  date_of_incident treatment_date
0       2025-02-20     2025-04-29
1              NaT     2025-05-22
2              NaT     2025-05-27
3       2025-03-14     2025-04-26
4              NaT     2025-05-11
5       2025-02-16     2025-04-23
6              NaT     2025-05-04
7              NaT     2025-05-20
8       2025-03-18     2025-05-28
9       2025-02-22     2025-05-21

A possible solution could be to calculate the average difference like you mentioned, and then create a copy of the two relevant columns, and set/overwrite the "date_of_incident" column to the "treatment_date" value minus the average.

>>> subdf = df[df["date_of_incident"].notna()]
>>> avg_diff = (subdf["treatment_date"] - subdf["date_of_incident"]).mean().round('d')

>>> avg_diff
Timedelta('67 days 00:00:00')

>>> df_copy = df[["date_of_incident", "treatment_date"]].copy()
>>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff

Then use the update method on the original DataFrame with the overwrite keyword argument set to False to update the missing values with the calculated value from the copy.

>>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff
>>> df.update(df_copy, overwrite=False)
>>> df
  date_of_incident treatment_date
0       2025-02-20     2025-04-29
1       2025-07-28     2025-05-22
2       2025-08-02     2025-05-27
3       2025-03-14     2025-04-26
4       2025-07-17     2025-05-11
5       2025-02-16     2025-04-23
6       2025-07-10     2025-05-04
7       2025-07-26     2025-05-20
8       2025-03-18     2025-05-28
9       2025-02-22     2025-05-21

By using update with overwrite=False you are preserving the original values in the column while filling in any missing values that are found in the copy.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.