I have a dataframe news_count. Here are its column names, from the output of news_count.columns.values:

 [('date', '') ('EBIX UW Equity', 'NEWS_SENTIMENT_DAILY_AVG') ('Date', '')
  ('day', '') ('month', '') ('year', '')]

I need to group by year and month and sum the values of 'NEWS_SENTIMENT_DAILY_AVG'. Below are the two attempts I tried, but neither works:

Attempt 1

news_count.groupby(['year','month']).NEWS_SENTIMENT_DAILY_AVG.values.sum()

AttributeError: 'DataFrameGroupBy' object has no attribute

Attempt 2

news_count.groupby(['year','month']).iloc[:,1].values.sum()

AttributeError: Cannot access callable attribute 'iloc' of 'DataFrameGroupBy' objects, try using the 'apply' method

Input data:

      ticker        date            EBIX UW Equity  month  year
      field                NEWS_SENTIMENT_DAILY_AVG
      0       2007-05-25                    0.3992      5  2007
      1       2007-11-06                    0.3936     11  2007
      2       2007-11-07                    0.2039     11  2007
      3       2009-01-14                    0.2881      1  2014

  • And did you try news_count.groupby(['year','month']).NEWS_SENTIMENT_DAILY_AVG.sum()? Commented Oct 2, 2017 at 22:38
  • The problem is that it is not identifying the NEWS_SENTIMENT_DAILY_AVG column. Error message: AttributeError: 'DataFrameGroupBy' object has no attribute 'NEWS_SENTIMENT_DAILY_AVG' Commented Oct 2, 2017 at 22:50
  • Are you working with a multi index of columns? Commented Oct 2, 2017 at 22:52
  • Reset_index works for the index, not columns... Commented Oct 2, 2017 at 23:08
  • I'm not sure I can, because I'm not 100% sure I understand the structure of your dataframe; those columns look bad. Try explicitly reassigning them: df.columns = ['date', 'avg', 'day', 'month', 'year', ...] and so on. If you can do that, please update your dataframe, and try the suggestion in my first comment again. Commented Oct 2, 2017 at 23:27
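
The comments above point at the two-level column index as the likely cause. A minimal sketch of that idea, assuming the frame looks roughly like the input data in the question (the reconstruction below is hypothetical, and the names `flat` and `monthly` are only for illustration): flatten the column MultiIndex to plain names, then group and sum.

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame with two-level columns.
cols = pd.MultiIndex.from_tuples([
    ('date', ''),
    ('EBIX UW Equity', 'NEWS_SENTIMENT_DAILY_AVG'),
    ('month', ''),
    ('year', ''),
])
news_count = pd.DataFrame(
    [
        ['2007-05-25', 0.3992, 5, 2007],
        ['2007-11-06', 0.3936, 11, 2007],
        ['2007-11-07', 0.2039, 11, 2007],
        ['2009-01-14', 0.2881, 1, 2014],
    ],
    columns=cols,
)

# Flatten: keep the second level where it is non-empty, else the first level.
flat = news_count.copy()
flat.columns = [second or first for first, second in news_count.columns]

# With single-level names, the usual groupby/sum works.
monthly = flat.groupby(['year', 'month'], as_index=False)['NEWS_SENTIMENT_DAILY_AVG'].sum()
print(monthly)
```

If the real frame has several tickers under the first level, this flattening would produce duplicate column names, so joining both levels into one string may be the safer choice there.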

3 Answers

0

Extract the required columns into news_count_res, then group and apply the aggregation function (this assumes the columns have single-level names):

news_count_res = news_count[['year','month','NEWS_SENTIMENT_DAILY_AVG']]
news_count_res.groupby(['year','month']).sum()

1 Comment

Thanks for this...but I'm getting "AttributeError: 'SeriesGroupBy' object has no attribute 'sample'" at "df_sample = df.groupby("persons").sample(frac=percentage_to_flag, random_state=random_state)". If I can figure out why, maybe it'll work for me...
0

Thanks for the answers so far (I've commented on them, as I haven't got those solutions to work; maybe I'm not understanding something). In the meantime, I've come up with another approach, which I suspect still isn't very Pythonic. It gets the job done and doesn't take too long for my purposes, but it would be great if I could figure out how to tweak the approaches suggested above to get them to work. Any thoughts are very welcome!

Here's what I've got:

    import math
    import numpy as np
    import pandas as pd
    y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
    z = ['xyz'] * len(y)
    df = pd.DataFrame({'persons': y, 'data' : z})
    percent = 10  #CHANGE AS NEEDED

    #add a 'helper'column with random numbers
    df['rand'] = np.random.random(df.shape[0])
    df = df.sample(frac=1)  #optional:  this shuffles data, just to show order doesn't matter

    #CREATE A HELPER LIST
    helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
    for row in helper:
        df_temp = df[df['persons'] == row[0]][['persons','rand']]
        lim = math.ceil(len(df_temp) * percent * 0.01)
        # threshold = smallest 'rand' among the top `lim` values for this person
        row.append(df_temp.nlargest(lim,'rand').iloc[-1]['rand'])

    def flag(name,num):
        for row in helper:
            if row[0] == name:
                if num >= row[2]:
                    return 'yes'
                else:
                    return 'no'
    
    df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)

And to check the results:

piv = df.pivot_table(index="persons", columns="flag", values="data", aggfunc='count', fill_value=0)
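# DataFrame.append was removed in pandas 2.0, so the totals row is added with pd.concat
piv = pd.concat([piv, piv.sum().rename('Total').to_frame().T]).rename_axis('persons')
piv = piv.assign(Total=lambda x: x.sum(axis=1))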
piv['% selected'] = 100 * piv.yes/piv.Total
print(piv)

OUTPUT:
flag        no   yes  Total  % selected
persons                                
Alex      2088   233   2321   10.038776
Bob       8352   929   9281   10.009697
Chuck     1810   202   2012   10.039761
Doug     30710  3413  34123   10.002051
Total    42960  4777  47737   10.006913

Seems to work with different %s and different numbers of persons...but it would be nice to make it simpler, I think.
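
One possibly simpler route, offered as a sketch rather than a drop-in replacement: pandas 1.1 and later expose `sample` on groupby objects, which draws a fraction of rows from each group directly. The `random_state` value and the names `sampled` and `share` are illustrative assumptions, and the per-group count may differ by one from the `math.ceil` rule used above.

```python
import numpy as np
import pandas as pd

y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
df = pd.DataFrame({'persons': y, 'data': 'xyz'})

# Draw roughly 10% of the rows within each person (requires pandas >= 1.1).
sampled = df.groupby('persons').sample(frac=0.10, random_state=0)

# Flag the sampled rows; this relies on the index labels being unique.
df['flag'] = np.where(df.index.isin(sampled.index), 'yes', 'no')

# Share of flagged rows per person, as a quick check.
share = df['flag'].eq('yes').groupby(df['persons']).mean()
print(share)
```

On older pandas versions the groupby object has no `sample` method, which would raise an AttributeError.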

0
df = df.groupby(['col1', 'col2'], as_index=False).agg({'value1': 'sum', 'value2': 'sum'})


news_count = news_count.groupby(['year', 'month'],as_index = False).agg({'NEWS_SENTIMENT_DAILY_AVG':'sum'})
