
I have two dataframes with timestamped data: sensor readouts from different sources. I want to combine them. The left dataframe (df1) can be quite large, as it will be a combination of multiple sources; the right dataframe (df2) will have at most 8 columns. Some columns of df2 may already be in df1, but with more or fewer timestamps that have values, and timestamps may also be duplicated. Some columns in df2 will be new to df1.

E.g.

import pandas as pd
from numpy import nan

df1 = pd.DataFrame(
    {
        "PT1": ["A0", "A1", "A2"],
        "PT2": ["B0", "B1", "B2"],
        "PT3": ["C0", "C1", "C2"],
    },
    index=pd.DatetimeIndex(["2025-05-01 10:00", "2025-05-01 10:01", "2025-05-01 10:02"]),
)

df2 = pd.DataFrame(
    {
        "PT1": ["A0", "A1", "A3"],
        "PT4": ["D0", "D1", "D3"],
    },
    index=pd.DatetimeIndex(["2025-05-01 10:00", "2025-05-01 10:01", "2025-05-01 10:03"]),
)

I tried concat & merge, but either I don't get the timestamps combined or I lose the index. :-/

Expected output would be:

df1updated = pd.DataFrame(
    {
        "PT1": ["A0", "A1", "A2", "A3"],
        "PT2": ["B0", "B1", "B2", nan ],
        "PT3": ["C0", "C1", "C2", nan ],
        "PT4": ["D0", "D1", nan,  "D3"],
    },
    index=pd.DatetimeIndex(["2025-05-01 10:00", "2025-05-01 10:01", "2025-05-01 10:02", , "2025-05-01 10:03"]),
)

Update after @ouroboros1's comment: Usually, there should only be duplicate entries across the two dataframes when the value is either the same or one of them is nan. Two different values could happen, but that can be solved on the data-source side. If the two values differ, it is because the source filled df2 with data from an earlier timestamp for that sensor. So I need to detect that somehow. My plan was to do that on df2 before combining it with df1, e.g. by checking for repeated values in df2 per column and replacing them with nan again.
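
A minimal sketch of that pre-cleaning idea (assuming "duplicate" here means a value that merely repeats the previous row's value for the same sensor; the helper name is made up):

def drop_repeated_values(df):
    """Replace values that just repeat the previous row's value with nan."""
    cleaned = df.copy()
    for col in cleaned.columns:
        # True where this row's value equals the value in the row above
        repeated = cleaned[col].eq(cleaned[col].shift())
        cleaned.loc[repeated, col] = nan
    return cleaned

df2_clean = drop_repeated_values(df2)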

    "Two different values could happen", do you mean that df2 could also have: "PT1": ["A1000", "A1", "A3"]? And if so, what should happen? Always pick the value from df1 ("A0") and silently ignore the one from df2 ("A1000")? If so, or if this does not happen, all you need is df1.combine_first(df2). If not, it would be useful to update your example with edge cases and explain what should happen when they appear. Commented May 21 at 9:41
  • Thank you so much! I did not find combine_first, because it is not mentioned on the concat, merge help page of pandas... I updated the question with your request Commented May 21 at 9:55
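
For reference, the one-liner from the comment above, applied to the example frames (assuming df1's values should win on conflicts), would be:

# Align on the union of both indexes and columns; take df1's value where
# present, otherwise fall back to df2's value.
df1updated = df1.combine_first(df2)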

1 Answer


If I understand you correctly: for each sensor and timestamp, if a value exists in both df1 and df2, use the value from df2; otherwise, use whichever value is available. This is a perfect job for combine_first.

# Merge on the indexes. df1 gets to keep its original column names but df2's
# columns are suffixed with "_y". If you have sensors whose names end with "_y",
# pick a different suffix.
merged = df1.merge(
    df2, how="outer", left_index=True, right_index=True, suffixes=("", "_y")
)

# For those columns that end with "_y" (or your custom suffix), combine its
# value with the original column from df1
y_cols = merged.columns[merged.columns.str.endswith("_y")]
for y_col in y_cols:
    x_col = y_col.rstrip("_y")
    merged[x_col] = merged[y_col].combine_first(merged[x_col])

# Drop the "_y" columns
merged.drop(columns=y_cols, inplace=True)

This way you keep the column order from df1; any new columns from df2 are appended on the right.
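
On the example frames, the result should look roughly like this (sketched from the sample data above; exact NaN display may differ):

print(merged)
#                   PT1  PT2  PT3  PT4
# 2025-05-01 10:00   A0   B0   C0   D0
# 2025-05-01 10:01   A1   B1   C1   D1
# 2025-05-01 10:02   A2   B2   C2  NaN
# 2025-05-01 10:03   A3  NaN  NaN   D3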
