Pandas: splitting dataframe into multiple dataframe based on threshold value

Question

I have a dataframe like this:

                 Transport  Elapsed_Time     gap_time        gap_minutes 
0                  taxi         556.0   0 days 00:00:02          0.0 
1                  walk          95.0   0 days 00:53:34         53.0 
2                  taxi          44.0   0 days 02:02:00        122.0 
3                  taxi           2.0   0 days 17:05:56       1025.0 
4                  walk          73.0   0 days 00:14:31         14.0 
5                  boat          10.0   0 days 00:02:16          2.0 
6                  walk          34.0   0 days 00:00:42          0.0 
7                  boat           8.0   0 days 00:00:54          0.0 
8                 walk          37.0   0 days 00:07:25          7.0 
9                 boat          30.0   0 days 00:00:23          0.0 
10                 walk         105.0   0 days 00:04:59          4.0
11                 taxi          14.0   0 days 00:01:06          1.0
12                 walk          31.0   0 days 18:01:32       1081.0
13                 taxi          10.0   0 days 01:06:11         66.0
14                train          41.0   0 days 16:59:25       1019.0
15                 walk           3.0   0 days 00:02:28          2.0
16                 taxi         137.0 276 days 23:49:58       1429.0

I would like to partition the dataframe into multiple dataframes based on a threshold value of gap_minutes > 20.

The resulting dataframes should look like this:

df1:

0                  taxi         556.0   0 days 00:00:02          0.0 
1                  walk          95.0   0 days 00:53:34         53.0

df2:

2                  taxi          44.0   0 days 02:02:00        122.0

df3:

3                  taxi           2.0   0 days 17:05:56       1025.0

df4:

4                  walk          73.0   0 days 00:14:31         14.0 
5                  boat          10.0   0 days 00:02:16          2.0 
6                  walk          34.0   0 days 00:00:42          0.0 
7                  boat           8.0   0 days 00:00:54          0.0 
8                 walk          37.0   0 days 00:07:25          7.0 
9                 boat          30.0   0 days 00:00:23          0.0 
10                 walk         105.0   0 days 00:04:59          4.0
11                 taxi          14.0   0 days 00:01:06          1.0
12                 walk          31.0   0 days 18:01:32       1081.0

df5:

13                 taxi          10.0   0 days 01:06:11         66.0

df6:

14                train          41.0   0 days 16:59:25       1019.0

df7:

15                 walk           3.0   0 days 00:02:28          2.0 
16                 taxi         137.0 276 days 23:49:58       1429.0

How do you intend the partition to work? It does not seem to depend on gap_minutes, since elements with the same value in this column do not all end up in the same part.
– WNG, Commented Nov 21, 2017 at 14:20

Scott Boston · Accepted Answer · 2017-11-22 05:31:19Z

Let's try this. 'listofdf' is a dictionary of dataframes, with keys 1 to 7 in this case. First let's make sure gap_time has the pd.Timedelta dtype, then group:

import pandas as pd

df.gap_time = pd.to_timedelta(df.gap_time)
g = df.groupby((df.gap_time / pd.Timedelta('20 minutes')).ge(1)[::-1].cumsum())
listofdf = {}  # group number -> sub-dataframe
for n, part in g:
    listofdf[n] = part

Outputs:

print(listofdf[1])

       Transport  Elapsed_Time          gap_time  gap_minutes
15      walk           3.0   0 days 00:02:28          2.0
16      taxi         137.0 276 days 23:49:58       1429.0

print(listofdf[2])

   Transport  Elapsed_Time gap_time  gap_minutes
14     train          41.0 16:59:25       1019.0

. . .

print(listofdf[7])

  Transport  Elapsed_Time gap_time  gap_minutes
0      taxi         556.0 00:00:02          0.0
1      walk          95.0 00:53:34         53.0

How it works:

The best way to figure out how it works is to break the statement in question into parts.

First, let's figure out which gaps are at least 20 minutes long. If we divide gap_time by 20 minutes and get a value greater than or equal to 1, then we know that we need to start a new group.

(df.gap_time / pd.Timedelta('20 minutes')).ge(1)

Output:

0     False
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13     True
14     True
15    False
16     True
Name: gap_time, dtype: bool
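
As a small sketch of this step: dividing by a 20-minute Timedelta and testing for >= 1 should give the same mask as comparing the timedeltas directly. The first two sample values are taken from the table in the question; the third (exactly 20 minutes) is made up to show the boundary case. Note that both forms treat a gap of exactly 20 minutes as a break, whereas the question is worded as gap_minutes > 20.

```python
import pandas as pd

# Two values from the question's table plus one hypothetical boundary value.
gap_time = pd.to_timedelta(
    pd.Series(["0 days 00:00:02", "0 days 00:53:34", "0 days 00:20:00"])
)

threshold = pd.Timedelta("20 minutes")

# The answer's form: ratio of gap to threshold, flagged when >= 1.
by_division = (gap_time / threshold).ge(1)

# Equivalent direct comparison of timedeltas.
by_comparison = gap_time >= threshold

print(by_comparison.tolist())  # [False, True, True]
```

Using `gap_time > threshold` instead would match the strict wording of the question if a gap of exactly 20 minutes should not start a new group.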

This is the tricky part. Now I want to group all 'False' records with the 'True' record that follows them (look at gap_time and your expected output). To do this we need to reverse the order of the records and then use cumsum. Cumsum increments by one at every True record: in the reversed order, the first True record becomes 1, the False records after it also get 1 until the next True record, which becomes 2, then the False records after that get 2 until the next True record, and so on.

(df.gap_time / pd.Timedelta('20 minutes')).ge(1)[::-1].cumsum()

Output:

16    1
15    1
14    2
13    3
12    4
11    4
10    4
9     4
8     4
7     4
6     4
5     4
4     4
3     5
2     6
1     7
0     7
Name: gap_time, dtype: int64
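
To see the reversal trick in isolation, here is a minimal sketch on a made-up six-element mask (not the data from the question). It also shows a forward-running variant using shift, which is an alternative I am suggesting rather than part of the original answer; it produces the same partition but numbers the groups from 0 in top-to-bottom order.

```python
import pandas as pd

# Hypothetical mask: True marks the last row of a group.
mask = pd.Series([False, True, True, False, False, True])

# The answer's approach: reverse, then cumulative sum.
# Each False row inherits the number of the True row that follows it.
reversed_ids = mask[::-1].cumsum()
print(reversed_ids.sort_index().tolist())  # [3, 3, 2, 1, 1, 1]

# Alternative (assumption, not from the answer): shift the mask down by one
# so that the row *after* each True starts a new group, then cumsum forward.
forward_ids = mask.astype(int).shift(fill_value=0).cumsum()
print(forward_ids.tolist())  # [0, 0, 1, 2, 2, 2]
```

Both label series split the rows into {0, 1}, {2} and {3, 4, 5}; only the numbering differs.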

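Finally, we use this new series to group the dataframe into chunks: g = df.groupby(...) with the series above as the grouping key. Pandas aligns the key with the dataframe on the index, so the reversed order of the series does not matter.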
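
Putting the pieces together, the following is a self-contained sketch under a few assumptions: only the Transport and gap_time columns from the question are reproduced, the names is_break, group_ids and ordered are my own, and the direct comparison is used in place of the division. It builds the same 'listofdf' dictionary and then lists the parts in the original row order.

```python
import pandas as pd

# Transport and gap_time columns copied from the question's table.
df = pd.DataFrame({
    "Transport": ["taxi", "walk", "taxi", "taxi", "walk", "boat", "walk",
                  "boat", "walk", "boat", "walk", "taxi", "walk", "taxi",
                  "train", "walk", "taxi"],
    "gap_time": ["0 days 00:00:02", "0 days 00:53:34", "0 days 02:02:00",
                 "0 days 17:05:56", "0 days 00:14:31", "0 days 00:02:16",
                 "0 days 00:00:42", "0 days 00:00:54", "0 days 00:07:25",
                 "0 days 00:00:23", "0 days 00:04:59", "0 days 00:01:06",
                 "0 days 18:01:32", "0 days 01:06:11", "0 days 16:59:25",
                 "0 days 00:02:28", "276 days 23:49:58"],
})
df["gap_time"] = pd.to_timedelta(df["gap_time"])

# True where the gap is at least 20 minutes (these rows close a group).
is_break = df["gap_time"] >= pd.Timedelta("20 minutes")

# Reverse + cumsum: each row gets the number of the next break at or after it.
group_ids = is_break[::-1].cumsum()

# groupby aligns group_ids with df on the index, so the reversal is harmless.
listofdf = {key: part for key, part in df.groupby(group_ids)}

# Keys count from the bottom of the frame; sort descending for original order.
ordered = [listofdf[key] for key in sorted(listofdf, reverse=True)]

print(len(listofdf))                 # 7
print(listofdf[7].index.tolist())    # [0, 1]
print(listofdf[4].index.tolist())    # [4, 5, 6, 7, 8, 9, 10, 11, 12]
```

Here ordered[0] corresponds to df1 in the question and ordered[6] to df7.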

Although the code works, I would also like a conceptual understanding of how it works, especially the second statement, g = df.groupby((df.gap_time / pd.Timedelta('20 minutes')).ge(1)[::-1].cumsum()), and in particular its last part, .ge(1)[::-1].cumsum().
