I have a dataset ('sample_data.csv') of the form below:

timestamp,cell_id,crnti,enodeb_id,uplane_downlink.cqi_wideband_mean,uplane_downlink.distance_initial_km,uplane_downlink.non_gbr_dl_bytes,time_sec,timestamp_y,uplane_uplink.non_gbr_ul_bytes
1606208553171,1,1487,1002107,7,27,134,1606208553,1606208553256,146
1606208557195,1,1487,1002107,6,27,388,1606208557,1606208557369,577
1606208557236,1,2631,1002107,10,27,251,1606208557,1606208557351,194
1606208561301,1,1487,1002107,6,27,140,1606208561,1606208561463,245
1606208561339,1,2631,1002107,11,27,159,1606208561,1606208561381,108
1606208565415,1,1487,1002107,7,27,99,1606208565,1606208565472,54
1606208565435,1,2631,1002107,10,27,64,1606208565,1606208565495,54
1606208569518,1,1487,1002107,6,27,101,1606208569,1606208569601,289
1606208569533,1,2528,1002107,7,16,57,1606208569,1606208569621,0
1606208571263,1,2631,1002107,10,27,0,1606208571,1606208571263,8500
1606208573557,1,2528,1002107,8,16,0,1606208573,1606208573557,0
1606208575838,1,1487,1002107,7,27,0,1606208575,1606208575838,50

Here the column crnti represents a user and the column time_sec gives a timestamp for this user's session. Sessions that are separated by less than 9 secs are considered continuation (small sessions of a longer session). My end goal is to summarize session properties for each user. In the summary I want to show

  1. Session duration → calculated as (last = max(time_sec)) - (first = min(time_sec)) over a dataframe filtered for one crnti to records that are less than 9 sec apart. In the code below I do it as:

    session_summary['session_duration'] = int(session_summary['last'])-int(session_summary['first'])
    
  2. Sum of uplane_uplink.non_gbr_ul_bytes across all small sessions for a user → calculated as sum of all values in the filtered df (session_summary):

    session_summary = sub_session.astype(int).groupby(['crnti', 'cell_id', 'enodeb_id']).agg(
        distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'),
        first=('time_sec', 'min'),
        last=('time_sec', 'max'),
        cqi=('uplane_downlink.cqi_wideband_mean', 'mean'),
        non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'),
        non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'))
    
  3. Sum of uplane_downlink.non_gbr_dl_bytes across all small sessions for a user → calculation described above

  4. Mean of cqi_wideband_mean → calculation described above

  5. Mean of uplane_downlink.distance_initial_km → calculation described above

As an example, for crnti = 1487 the summary is the first row of the expected output shown at the end of this post.
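
To spell out the 9-second rule on that user: the time_sec values for crnti = 1487 are 1606208553, 1606208557, 1606208561, 1606208565, 1606208569 and 1606208575, giving gaps of 4, 4, 4, 4 and 6 seconds. Every gap is below 9, so all six records form one session with session_duration 1606208575 - 1606208553 = 22, non_gbr_ul_bytes 146 + 577 + 245 + 54 + 289 + 50 = 1361, non_gbr_dl_bytes 134 + 388 + 140 + 99 + 101 + 0 = 862, and cqi (7 + 6 + 6 + 7 + 6 + 7) / 6 = 6.5.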

I have the following code that gets me the expected output. However, my dataset is pretty big (6M rows), so this approach is not scalable and takes far too long. I came across the concept of vectorization using numpy or pandas and want to implement it, but I'm not sure where to start. Any pointers are appreciated.

Working but slow code:

import pandas as pd
import numpy as np

comb_dl_ul_full_191 = pd.read_csv('sample_data.csv')

column_names = ['crnti', 'enodeb_id', 'cell_id', 'uplane_downlink.cqi_wideband_mean', 'uplane_downlink.distance_initial_km', 'uplane_downlink.non_gbr_dl_bytes',
                                                 'time_sec',
                                                 'uplane_uplink.non_gbr_ul_bytes']

all_sessions_summary = pd.DataFrame()

enbs = [1002107]  ## this is an array. For simplicity I am only showing one element. Creating it as shown below

#enbs = np.unique(comb_dl_ul_full['enodeb_id'].to_list())

comb_dl_ul_full_191 = comb_dl_ul_full_191.astype(int) # converting all to INT


for enb in enbs:
  cells = [1] # this is an array. For simplicity I am only showing one element. Creating it as shown below
  #cells = np.unique(comb_dl_ul_full_191[comb_dl_ul_full_191['enodeb_id']==enb]['cell_id'].to_list())
  for cell in cells:
    dd = comb_dl_ul_full_191[(comb_dl_ul_full_191['enodeb_id']==enb) & (comb_dl_ul_full_191['cell_id']==cell)]
    rntis = np.unique(dd['crnti'].to_list())
    dd.sort_values(by=['time_sec'], inplace=True)
    for rnti in rntis:
      d = dd[(dd['enodeb_id']==enb) & (dd['crnti']==rnti) & (dd['cell_id']==cell)]
      f = d['time_sec'].to_list()
      timedeltas = [int(f[i-1])-int(f[i]) for i in range(1, len(f))]

      session_summary = pd.DataFrame()
      session_summary_overall = pd.DataFrame()

      count=0
      i=0
      while i < len(timedeltas):
          if timedeltas[i] > -9:
            sub_session = pd.DataFrame(columns=column_names)
            d = d[column_names]
            sub_session = sub_session.append(d.iloc[[i]])
            for  j in range(i, len(timedeltas)):
              if timedeltas[j] > -9: 
                sub_session = sub_session.append(d.iloc[[j+1]])
              else:
                break       

            count = int(len(sub_session['time_sec']))
            sub_session.astype(int)
            session_summary = sub_session.astype(int).groupby(['crnti', 'cell_id', 'enodeb_id']).agg(
                distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'),
                first=('time_sec', 'min'),
                last=('time_sec', 'max'),
                cqi=('uplane_downlink.cqi_wideband_mean', 'mean'),
                non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'),
                non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'))
            session_summary['session_duration'] = int(session_summary['last'])-int(session_summary['first'])
            session_summary = session_summary.reset_index()
            session_summary_overall = session_summary_overall.append(session_summary, ignore_index=True)
            i=i+count
          else:
            sub_session = pd.DataFrame(columns=column_names)
            sub_session = sub_session.append(d.iloc[[i]])
            session_summary = sub_session.astype(int).groupby(['crnti', 'cell_id', 'enodeb_id']).agg(
                distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'),
                first=('time_sec', 'min'),
                last=('time_sec', 'max'),
                cqi=('uplane_downlink.cqi_wideband_mean', 'mean'),
                non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'),
                non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'))
            session_summary['session_duration'] = 4.096
            session_summary = session_summary.reset_index()
            session_summary_overall = session_summary_overall.append(session_summary, ignore_index=True)
            count = count+1
            i=i+1
      all_sessions_summary = all_sessions_summary.append(session_summary_overall, ignore_index=True) 

Output using the sample dataset:

,crnti,cell_id,enodeb_id,distance_initial_km,first,last,cqi,non_gbr_dl_bytes,non_gbr_ul_bytes,session_duration
0,1487,1,1002107,27,1606208553,1606208575,6.5,862,1361,22
1,2528,1,1002107,16,1606208569,1606208573,7.5,57,0,4
2,2631,1,1002107,27,1606208557,1606208571,10.25,474,8856,14


 
Comments:

  • Those variable names, woof. I'm looking at it and, to be honest, it's a mess. Can you clarify how each of those 5 variables you want should be calculated? For example, what does session duration mean in terms of the sample data? If I can get a better handle on what your calculations look like, I can try to write some vectorized versions from scratch for you to see. Maybe not all of them, but enough to get you started on syntax and general approach. – Coupcoup, Nov 25, 2020 at 4:13
  • @Coupcoup - sorry for the long names. Thanks for looking into this; I will add some info re: calculations for the variables. – rfguy, Nov 25, 2020 at 4:18
  • @Coupcoup - Please let me know if you need any other clarifications. Again, thanks for your help with this. – rfguy, Nov 25, 2020 at 6:42
  • lol. Thanks a lot. You can assume comb_dl_ul_full_191 to be the sample dataset that I provided; you will need to load it as a dataframe. – rfguy, Nov 26, 2020 at 3:59
  • Can you rewrite the first few lines before the for statement so that your code runs when you copy-paste it into a new file? I can't get what's posted here to run. – Coupcoup, Nov 26, 2020 at 4:06

1 Answer

Sessions that are separated by less than 9 secs are considered continuation

OK, but is that really what this does?

if timedeltas[i] > -9:

If the time is monotonic then why would the time delta be negative? You guarantee monotonicity via dd.sort_values(by=['time_sec'], inplace=True). In fact, since the original computes f[i-1] - f[i] (previous minus current), the deltas are all negative for sorted data, so timedeltas[i] > -9 amounts to "the gap to the next record is under 9 seconds". Anyway, it's pretty difficult to tell what you wanted, because your test data never actually hits a gap of 9 seconds or more in either direction.

Continuing,

Don't do this:

enbs = np.unique(comb_dl_ul_full['enodeb_id'].to_list())

First, Numpy isn't necessary: the Pandas API supports a unique method directly. Also, every unique in this program needs to go away and be replaced with a groupby.
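
For instance (a sketch of the iteration pattern only; the full refactor appears below), one groupby call replaces all three nested loops and the boolean-mask filtering:

for (enb, cell, rnti), d in comb_dl_ul_full_191.groupby(['enodeb_id', 'cell_id', 'crnti']):
    print(enb, cell, rnti, len(d))  # d already holds exactly this group's rows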

None of the astype(int) seem justified; the data are already integral.

The calculation for timedeltas needs to be replaced with a diff().
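
That is, something like the following one-liner (note the sign flips: diff() is current minus previous, where the original computed previous minus current):

timedeltas = d['time_sec'].diff().iloc[1:]  # vectorized f[i] - f[i-1]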

There's an unnecessary intermediate dataframe session_summary_overall that can be deleted.

There's probably more that can be done here, but without more test data and explanation it's difficult to tell what's safe.

import io

import pandas as pd

with io.StringIO('''timestamp,cell_id,crnti,enodeb_id,uplane_downlink.cqi_wideband_mean,uplane_downlink.distance_initial_km,uplane_downlink.non_gbr_dl_bytes,time_sec,timestamp_y,uplane_uplink.non_gbr_ul_bytes
1606208553171,1,1487,1002107,7,27,134,1606208553,1606208553256,146
1606208557195,1,1487,1002107,6,27,388,1606208557,1606208557369,577
1606208557236,1,2631,1002107,10,27,251,1606208557,1606208557351,194
1606208561301,1,1487,1002107,6,27,140,1606208561,1606208561463,245
1606208561339,1,2631,1002107,11,27,159,1606208561,1606208561381,108
1606208565415,1,1487,1002107,7,27,99,1606208565,1606208565472,54
1606208565435,1,2631,1002107,10,27,64,1606208565,1606208565495,54
1606208569518,1,1487,1002107,6,27,101,1606208569,1606208569601,289
1606208569533,1,2528,1002107,7,16,57,1606208569,1606208569621,0
1606208571263,1,2631,1002107,10,27,0,1606208571,1606208571263,8500
1606208573557,1,2528,1002107,8,16,0,1606208573,1606208573557,0
1606208575838,1,1487,1002107,7,27,0,1606208575,1606208575838,50
''') as sample_data_csv:
    comb_dl_ul_full_191 = pd.read_csv(sample_data_csv)

column_names = [
    'crnti', 'enodeb_id', 'cell_id', 'uplane_downlink.cqi_wideband_mean',
    'uplane_downlink.distance_initial_km', 'uplane_downlink.non_gbr_dl_bytes',
    'time_sec', 'uplane_uplink.non_gbr_ul_bytes',
]

all_sessions_summary = pd.DataFrame()

for (enb, cell, rnti), d in comb_dl_ul_full_191[column_names].groupby(['enodeb_id', 'cell_id', 'crnti']):
    time_deltas = d['time_sec'].diff().values[1:]
    count = 0
    i = 0

    while i < len(time_deltas):
        # diff() is current minus previous (positive for increasing times),
        # the opposite sign of the question's deltas, so "gap under 9 s" is < 9.
        new = time_deltas[i] < 9
        if new:
            for j in range(i, len(time_deltas)):
                if time_deltas[j] >= 9:
                    break

            sub_session = d.iloc[i: j+2]
            count = len(sub_session)
        else:
            sub_session = d.iloc[i]
            count = 1

        session_summary = pd.DataFrame({
            'crnti': rnti,
            'cell_id': cell,
            'enodeb_id': enb,
            'distance_initial_km': sub_session['uplane_downlink.distance_initial_km'].mean(),
            'first': sub_session['time_sec'].min(),
            'last': sub_session['time_sec'].max(),
            'cqi': sub_session['uplane_downlink.cqi_wideband_mean'].mean(),
            'non_gbr_dl_bytes': sub_session['uplane_downlink.non_gbr_dl_bytes'].sum(),
            'non_gbr_ul_bytes': sub_session['uplane_uplink.non_gbr_ul_bytes'].sum(),
        }, index=(0,))

        if new:
            session_summary['session_duration'] = session_summary['last'] - session_summary['first']
        else:
            session_summary['session_duration'] = 4.096

        all_sessions_summary = pd.concat((all_sessions_summary, session_summary), ignore_index=True)

        i += count
        if not new:
            count += 1

# Regression test:
# all_sessions_summary.to_csv('reference.csv')
reference = pd.read_csv('reference.csv', index_col=0)
assert reference.equals(all_sessions_summary)

pd.options.display.max_columns = 99
print(all_sessions_summary)
   crnti  cell_id  enodeb_id  distance_initial_km       first        last  \
0   1487        1    1002107                 27.0  1606208553  1606208575   
1   2528        1    1002107                 16.0  1606208569  1606208573   
2   2631        1    1002107                 27.0  1606208557  1606208571   

     cqi  non_gbr_dl_bytes  non_gbr_ul_bytes  session_duration  
0   6.50               862              1361                22  
1   7.50                57                 0                 4  
2  10.25               474              8856                14  
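
Going further: the per-session while loop can itself be vectorized away. Here is a sketch, assuming the intended rule is simply that a gap of 9 seconds or more between consecutive records of the same (enodeb_id, cell_id, crnti) group starts a new session. It does not reproduce the question's hard-coded 4.096-second duration for single-record sessions; those come out with duration 0 here.

import pandas as pd

df = pd.read_csv('sample_data.csv')
keys = ['enodeb_id', 'cell_id', 'crnti']

# Order rows within each group by time so diff() yields the inter-record gaps.
df = df.sort_values(keys + ['time_sec'])

# Gap to the previous record of the same group; NaN at each group's first row.
gap = df.groupby(keys)['time_sec'].diff()

# Running session label: increments at every gap of 9 s or more. NaN >= 9 is
# False, so a group's first row never increments the counter. The counter is
# global, but grouping by keys + session_id below keeps users separate anyway.
df['session_id'] = (gap >= 9).cumsum()

summary = df.groupby(keys + ['session_id']).agg(
    distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'),
    first=('time_sec', 'min'),
    last=('time_sec', 'max'),
    cqi=('uplane_downlink.cqi_wideband_mean', 'mean'),
    non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'),
    non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'),
).reset_index()
summary['session_duration'] = summary['last'] - summary['first']
print(summary)

On the sample data every user's records are within 9 seconds of their neighbours, so each user collapses into a single session and the three summary rows above are reproduced (plus a session_id column). On 6 million rows this performs a constant number of groupby passes instead of a Python-level loop per record.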
