I have a large pandas dataframe that can be represented structurally as:

      id          date    status
0     12    2015-05-01         0
1     12    2015-05-22         1
2     12    2015-05-14         1
3     12    2015-05-06         0
4     45    2015-05-03         1
5     45    2015-05-12         1
6     45    2015-05-02         0
7     51    2015-05-05         1
8     51    2015-05-01         0
9     51    2015-05-23         1
10    51    2015-05-17         1
11    51    2015-05-03         0
12    51    2015-05-05         0
13    76    2015-05-04         1
14    76    2015-05-22         1
15    76    2015-05-08         0

It can be created in Python 3.4 using:

import pandas as pd

tempDF = pd.DataFrame({ 'id': [12,12,12,12,45,45,45,51,51,51,51,51,51,76,76,76],
                        'date': ['2015-05-01','2015-05-22','2015-05-14','2015-05-06','2015-05-03','2015-05-12','2015-05-02','2015-05-05','2015-05-01','2015-05-23','2015-05-17','2015-05-03','2015-05-05','2015-05-04','2015-05-22','2015-05-08'],
                        'status': [0,1,1,0,1,1,0,1,0,1,1,0,0,1,1,0]})
# Convert the date strings to datetime64 so they sort chronologically
tempDF['date'] = pd.to_datetime(tempDF['date'])

I would like to divide the dataframe into groups based on variable 'id', sort within groups based on 'date' and then get the last 'status' value within each group.

So far, I have:

tempGrouped = tempDF.groupby('id')
tempGrouped['status'].last()

which produces:

id
12    0
45    0
51    0
76    0

However, the status should be 1 in each case (the value associated with the latest date). I can't work out how to sort the groups by date before selecting the last value. It's likely I'm a little snow-blind after trying to work this out for a while, so I apologise in advance if the solution is obvious.

1 Answer

You can sort first and then group. On pandas 0.17 and later the sorting method is sort_values (older versions called it sort):

tempDF.sort_values(['id','date']).groupby('id')['status'].last()
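
A self-contained sketch of the sort-then-group approach, assuming a recent pandas where `sort_values` is the sorting method. It reuses the question's data; the name `latest_status` is only illustrative.

```python
import pandas as pd

# Same data as in the question
tempDF = pd.DataFrame({
    'id': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51, 51, 51, 51, 76, 76, 76],
    'date': ['2015-05-01', '2015-05-22', '2015-05-14', '2015-05-06',
             '2015-05-03', '2015-05-12', '2015-05-02', '2015-05-05',
             '2015-05-01', '2015-05-23', '2015-05-17', '2015-05-03',
             '2015-05-05', '2015-05-04', '2015-05-22', '2015-05-08'],
    'status': [0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})
tempDF['date'] = pd.to_datetime(tempDF['date'])

# Sort by id, then by date, so the latest date is the last row of each id;
# groupby keeps that row order inside every group.
latest_status = (
    tempDF.sort_values(['id', 'date'])
          .groupby('id')['status']
          .last()
)
print(latest_status)
```

With this data, every id ends up with status 1, which is the value the question expects.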
  • Of course! I was stuck in the mindset of groupby-ing first and then sorting within groups. Thanks for the heads-up! Commented Jun 29, 2015 at 17:08
  • Why is sorting and then grouping any different from grouping and then sorting? I wonder why the operations are not commutative. Commented Aug 5, 2017 at 1:44
  • DataFrame.sort is deprecated; use tempDF.sort_values(by=['id', 'date']) instead. Commented Mar 9, 2018 at 0:30
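
On why the order of the two steps matters: `groupby` only partitions the rows and keeps them in the frame's current order within each group, and `last()` returns the last non-null value in that order. It does not look at the dates at all, so the sorting has to happen before the grouping. A hedged sketch that shows the difference, plus one possible alternative that avoids sorting by looking up the row with the maximum date per id (`unsorted_last`, `latest_rows` and `latest_status` are illustrative names, not from the original posts):

```python
import pandas as pd

# Same data as in the question
tempDF = pd.DataFrame({
    'id': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51, 51, 51, 51, 76, 76, 76],
    'date': ['2015-05-01', '2015-05-22', '2015-05-14', '2015-05-06',
             '2015-05-03', '2015-05-12', '2015-05-02', '2015-05-05',
             '2015-05-01', '2015-05-23', '2015-05-17', '2015-05-03',
             '2015-05-05', '2015-05-04', '2015-05-22', '2015-05-08'],
    'status': [0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})
tempDF['date'] = pd.to_datetime(tempDF['date'])

# Without sorting, last() follows the original row order, not the dates.
unsorted_last = tempDF.groupby('id')['status'].last()

# Alternative: find the index label of the latest date in each group,
# then select those rows directly.
latest_rows = tempDF.loc[tempDF.groupby('id')['date'].idxmax()]
latest_status = latest_rows.set_index('id')['status']

print(unsorted_last)
print(latest_status)
```

The `idxmax` variant returns one row per id, so it also gives access to the other columns of the latest record if they are needed.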