1

I am trying to merge two pandas DataFrames, each consisting of two string columns and one date column.

df1
a    b      date
100  200    2022-01-03
100  200    2022-01-04
101  200    2022-01-05
101  200    2022-01-06
101  200    2022-01-07

df2
a    b      date
100  200    2022-01-04
100  200    2022-01-06
101  200    2022-01-03
101  200    2022-01-06
101  200    2022-01-09

The goal is to match rows on a and b and, for each date in df1, take the closest date in df2 that is on or after it (forward direction). Desired output:

df
a    b      date_x      date_y
100  200    2022-01-03  2022-01-04
100  200    2022-01-04  2022-01-04
101  200    2022-01-05  2022-01-06 (not 2022-01-03, because that date is earlier, not forward)
101  200    2022-01-06  2022-01-06
101  200    2022-01-07  2022-01-09
3
  • must a,b match ?
    – ansev
    Commented Jan 18, 2023 at 16:02
  • Yes; a,b must match Commented Jan 18, 2023 at 16:06
  • tell me if my solution works:)
    – ansev
    Commented Jan 18, 2023 at 16:21

3 Answers

3

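We can merge on a and b, then for each df1 row keep the df2 date with the smallest non-negative difference date_y - date_x (the forward direction). This assumes both date columns have already been converted with pd.to_datetime.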

new_df = df1.merge(df2, on=['a', 'b'], how='inner')\
            .assign(diff_date=lambda df: df['date_y']
                        .sub(df['date_x'])
                        .where(lambda x: df['date_y'].ge(df['date_x'])), 
                    mask=lambda df: df['diff_date']
                        .eq(df.groupby(['a', 'b', 'date_x'])['diff_date']
                        .transform('min')))\
            .loc[lambda df: df['mask']]\
            .drop(['diff_date', 'mask'], axis=1)
print(new_df)


    a    b     date_x     date_y
0   100  200 2022-01-03 2022-01-04
2   100  200 2022-01-04 2022-01-04
5   101  200 2022-01-05 2022-01-06
8   101  200 2022-01-06 2022-01-06
12  101  200 2022-01-07 2022-01-09
1
  • this is a nice solution. what I am looking for is not to merge df1 and df2 straight away. Trying to utilize pd.merge_asof here. If we cannot figure it out, I will accept your solution :) Commented Jan 18, 2023 at 16:50
3

You can also try

# merge on a, b and sort by date so the earliest date_y comes first for each date_x
m = df1.merge(df2, on=['a', 'b'], how='left').sort_values(['date_x', 'date_y'])
# keep only rows where the df1 date is <= the df2 date (forward direction)
df = m[m['date_x'] <= m['date_y']]
# keep the first remaining row (the closest date_y) for each a, b, date_x
final_df = df.loc[df[['a', 'b', 'date_x']].drop_duplicates(keep='first').index]
print(final_df)

      a    b     date_x     date_y
0   100  200 2022-01-03 2022-01-04
2   100  200 2022-01-04 2022-01-04
5   101  200 2022-01-05 2022-01-06
8   101  200 2022-01-06 2022-01-06
12  101  200 2022-01-07 2022-01-09
4
  • drop_duplicates is nice here
    – ansev
    Commented Jan 18, 2023 at 16:30
  • thank you, but I believe this is computationally expensive, especially when df2 is very large. Could we use pd.merge_asof by any chance? Commented Jan 18, 2023 at 16:43
  • @sakalansaka why do you think that, sort_values? I cannot really think of a way off the top of my head to use merge_asof because you want an exact match of a and b but a forward match on the date. Commented Jan 18, 2023 at 17:00
  • Suppose that df2 is a very large dataframe, most of it irrelevant to this operation, and we run a plain pd.merge on it. I thought that would be computationally expensive compared with pd.merge_asof. You are right about using pd.merge_asof though. Commented Jan 18, 2023 at 19:06
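
For reference on the pd.merge_asof question raised in these comments: merge_asof takes a `by` argument for columns that must match exactly, plus `direction='forward'` for the as-of key, so it may cover this case without the intermediate many-to-many merge. The following is a minimal sketch, not a benchmarked solution. It assumes the date columns are parsed with pd.to_datetime, and the names `left`, `right`, and `result` are illustrative only. The date columns are renamed before the call so that both survive in the output.

```python
import pandas as pd

# Sample frames from the question, with the date column parsed to datetime
df1 = pd.DataFrame({
    'a': ['100', '100', '101', '101', '101'],
    'b': ['200'] * 5,
    'date': pd.to_datetime(['2022-01-03', '2022-01-04', '2022-01-05',
                            '2022-01-06', '2022-01-07']),
})
df2 = pd.DataFrame({
    'a': ['100', '100', '101', '101', '101'],
    'b': ['200'] * 5,
    'date': pd.to_datetime(['2022-01-04', '2022-01-06', '2022-01-03',
                            '2022-01-06', '2022-01-09']),
})

# merge_asof requires both frames to be sorted by their as-of key
left = df1.rename(columns={'date': 'date_x'}).sort_values('date_x')
right = df2.rename(columns={'date': 'date_y'}).sort_values('date_y')

result = pd.merge_asof(
    left, right,
    left_on='date_x', right_on='date_y',
    by=['a', 'b'],        # a and b must match exactly
    direction='forward',  # take the first date_y that is >= date_x
)
print(result)
```

merge_asof behaves like a left join, so a df1 row with no later df2 date in its (a, b) group is kept with date_y set to NaT. Whether this is actually faster than the merge-then-filter answers depends on the data and would need to be measured.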
1
import pandas as pd

df1 = pd.DataFrame({'a': ['100', '100', '101', '101', '101'],
                    'b': ['200', '200', '200', '200', '200'],
                    'date': ['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07']})

df2 = pd.DataFrame({'a': ['100', '100', '101', '101', '101'],
                    'b': ['200', '200', '200', '200', '200'],
                    'date': ['2022-01-04', '2022-01-06', '2022-01-03', '2022-01-06', '2022-01-09']})
  
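# Parse the date strings so they compare chronologically
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# Match a with a and b with b, keep df2 dates on or after the df1 date,
# then take the earliest remaining df2 date for each df1 row
df3 = pd.merge(df1, df2, how='left', on=['a', 'b'], suffixes=('_x', '_y'))
df3 = (df3[df3['date_y'] >= df3['date_x']]
       .groupby(['a', 'b', 'date_x'], as_index=False)['date_y'].min())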
