3

What is the best way to create a new column and assign a value if date falls between two dates in another dataframe ?

e.g.

dataframe A    
date          values
2017-05-16      x  
2017-04-12      Y


dataframe B    #df contains dates to use to filter and associated id

start            end           id
2017-05-08     2017-05-18      34
2017-04-24     2017-05-08      33
2017-04-03     2017-04-24      32

desired result

dataframe A     
date          values    id
2017-05-16      x       34 
2017-04-12      Y       32

I have looked into pd.cut which doesn't seem to work for what I want and it seems inefficient to write a loop to iterate over the dataframe with multiple conditions.

2
  • I haven't tried this myself but for similar problems intervaltree package was suggested.
    – user2285236
    Commented May 18, 2017 at 17:50
  • 1
    What about the dates falling on the edges such as "2017-04-24"? Which ID should be picked?
    – Allen Qin
    Commented May 18, 2017 at 18:06

2 Answers 2

8

Using an IntervalIndex, which is new in Pandas 0.20.0. This looks to still be in the experimental phase though, so other solutions may be more reliable.

# Get the 'id' column indexed by the 'start'/'end' intervals.
s = pd.Series(df_b['id'].values, pd.IntervalIndex.from_arrays(df_b['start'], df_b['end']))

# Map based on the date of df_a.
df_a['id'] = df_a['date'].map(s)

The resulting output:

        date values  id
0 2017-05-16      x  34
1 2017-04-12      Y  32

Alternatively, if you don't mind altering the index of df_b, you could just directly convert to an IntervalIndex on it:

# Create an IntervalIndex on df_b.
df_b = df_b.set_index(['start', 'end'])
df_b.index = pd.IntervalIndex.from_tuples(df_b.index)

# Map based on the date of df_a.
df_a['id'] = df_a['date'].map(df_b['id'])
3

This is an example of something that's really straightforward to do in one step in sql, but not so much in Pandas. So with the proviso that I don't love this approach, here it is.

  1. do a full cartesian join
  2. filter down to desired rows & columns.

_

# First Full Outer Join Dataframes 
# (Requires a Common Column in Pandas Unlike SQL)
df_A['fake key'] = 1
df_B['fake key'] = 1
outer_join = pd.merge(df_A, df_B, how='outer', on='fake key')

# Now Filter Back down to Desired Rows/Columns
desired_rows    = outer_join.query('date < end and date > start')
desired_columns = ['date', 'values', 'id']

final = desired_rows[desired_columns]
final

output:

        date values  id
0 2017-05-16      x  34
5 2017-04-12      y  32

The things that make this answer somewhat unsatisfying to me are:

  1. The first step of a full cartesian join doesn't scale well at all to large data
  2. The cartesian join requires a common column, so in this case first creating that fake key column (see this github issue

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.