I have a dataset ('sample_data.csv')
of the form below:
timestamp,cell_id,crnti,enodeb_id,uplane_downlink.cqi_wideband_mean,uplane_downlink.distance_initial_km,uplane_downlink.non_gbr_dl_bytes,time_sec,timestamp_y,uplane_uplink.non_gbr_ul_bytes
1606208553171,1,1487,1002107,7,27,134,1606208553,1606208553256,146
1606208557195,1,1487,1002107,6,27,388,1606208557,1606208557369,577
1606208557236,1,2631,1002107,10,27,251,1606208557,1606208557351,194
1606208561301,1,1487,1002107,6,27,140,1606208561,1606208561463,245
1606208561339,1,2631,1002107,11,27,159,1606208561,1606208561381,108
1606208565415,1,1487,1002107,7,27,99,1606208565,1606208565472,54
1606208565435,1,2631,1002107,10,27,64,1606208565,1606208565495,54
1606208569518,1,1487,1002107,6,27,101,1606208569,1606208569601,289
1606208569533,1,2528,1002107,7,16,57,1606208569,1606208569621,0
1606208571263,1,2631,1002107,10,27,0,1606208571,1606208571263,8500
1606208573557,1,2528,1002107,8,16,0,1606208573,1606208573557,0
1606208575838,1,1487,1002107,7,27,0,1606208575,1606208575838,50
Here the column crnti represents a user and the column time_sec gives a timestamp for that user's session. Sessions separated by less than 9 seconds are considered continuations (small sessions of one longer session). My end goal is to summarize session properties for each user. In the summary I want to show:
Session duration → calculated as last - first, where last = max(time_sec) and first = min(time_sec), over a filtered dataframe (one containing only records less than 9 sec apart) for a given crnti. In the code below I do it as:
session_summary['session_duration'] = int(session_summary['last']) - int(session_summary['first'])
Sum of uplane_uplink.non_gbr_ul_bytes across all small sessions for a user → calculated as the sum of all values in the filtered df:
session_summary = sub_session.astype(int).groupby(['crnti', 'cell_id', 'enodeb_id']).agg(distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'), first=('time_sec', 'min'), last=('time_sec', 'max'), cqi=('uplane_downlink.cqi_wideband_mean', 'mean'), non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'), non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'))
Sum of uplane_downlink.non_gbr_dl_bytes across all small sessions for a user → calculated the same way (see the agg above)
Mean of uplane_downlink.cqi_wideband_mean → calculated the same way
Mean of uplane_downlink.distance_initial_km → calculated the same way
As an example, the summary row for crnti = 1487 is included in the output at the end of this post. I have the following code that produces the expected output. However, my dataset is quite large (about 6M rows), so this approach does not scale and takes far too long. I came across the concept of vectorization with NumPy/pandas and want to apply it, but I am not sure where to start. Any pointers are appreciated.
Working but slow code:
import pandas as pd
import numpy as np

comb_dl_ul_full_191 = pd.read_csv('sample_data.csv')
column_names = ['crnti', 'enodeb_id', 'cell_id', 'uplane_downlink.cqi_wideband_mean',
                'uplane_downlink.distance_initial_km', 'uplane_downlink.non_gbr_dl_bytes',
                'time_sec', 'uplane_uplink.non_gbr_ul_bytes']
all_sessions_summary = pd.DataFrame()
enbs = [1002107]  # this is an array; for simplicity I am only showing one element. Normally created as:
# enbs = np.unique(comb_dl_ul_full_191['enodeb_id'].to_list())
comb_dl_ul_full_191 = comb_dl_ul_full_191.astype(int)  # converting all columns to int
for enb in enbs:
    cells = [1]  # this is an array; for simplicity I am only showing one element. Normally created as:
    # cells = np.unique(comb_dl_ul_full_191[comb_dl_ul_full_191['enodeb_id'] == enb]['cell_id'].to_list())
    for cell in cells:
        dd = comb_dl_ul_full_191[(comb_dl_ul_full_191['enodeb_id'] == enb) & (comb_dl_ul_full_191['cell_id'] == cell)]
        rntis = np.unique(dd['crnti'].to_list())
        dd.sort_values(by=['time_sec'], inplace=True)
        for rnti in rntis:
            d = dd[(dd['enodeb_id'] == enb) & (dd['crnti'] == rnti) & (dd['cell_id'] == cell)]
            f = d['time_sec'].to_list()
            timedeltas = [int(f[i - 1]) - int(f[i]) for i in range(1, len(f))]
            session_summary = pd.DataFrame()
            session_summary_overall = pd.DataFrame()
            count = 0
            i = 0
            while i < len(timedeltas):
                if timedeltas[i] > -9:
                    sub_session = pd.DataFrame(columns=column_names)
                    d = d[column_names]
                    sub_session = sub_session.append(d.iloc[[i]])
                    for j in range(i, len(timedeltas)):
                        if timedeltas[j] > -9:
                            sub_session = sub_session.append(d.iloc[[j + 1]])
                        else:
                            break
                    count = int(len(sub_session['time_sec']))
                    session_summary = sub_session.astype(int).groupby(['crnti', 'cell_id', 'enodeb_id']).agg(
                        distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'),
                        first=('time_sec', 'min'),
                        last=('time_sec', 'max'),
                        cqi=('uplane_downlink.cqi_wideband_mean', 'mean'),
                        non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'),
                        non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'))
                    session_summary['session_duration'] = int(session_summary['last']) - int(session_summary['first'])
                    session_summary = session_summary.reset_index()
                    session_summary_overall = session_summary_overall.append(session_summary, ignore_index=True)
                    i = i + count
                else:
                    sub_session = pd.DataFrame(columns=column_names)
                    sub_session = sub_session.append(d.iloc[[i]])
                    session_summary = sub_session.astype(int).groupby(['crnti', 'cell_id', 'enodeb_id']).agg(
                        distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'),
                        first=('time_sec', 'min'),
                        last=('time_sec', 'max'),
                        cqi=('uplane_downlink.cqi_wideband_mean', 'mean'),
                        non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'),
                        non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'))
                    session_summary['session_duration'] = 4.096
                    session_summary = session_summary.reset_index()
                    session_summary_overall = session_summary_overall.append(session_summary, ignore_index=True)
                    count = count + 1
                    i = i + 1
            all_sessions_summary = all_sessions_summary.append(session_summary_overall, ignore_index=True)
Output using the sample dataset:
,crnti,cell_id,enodeb_id,distance_initial_km,first,last,cqi,non_gbr_dl_bytes,non_gbr_ul_bytes,session_duration
0,1487,1,1002107,27,1606208553,1606208575,6.5,862,1361,22
1,2528,1,1002107,16,1606208569,1606208573,7.5,57,0,4
2,2631,1,1002107,27,1606208557,1606208571,10.25,474,8856,14
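For reference, this is the direction I think a vectorized version could take: sort once, mark session boundaries per user with diff()/cumsum(), then do a single groupby instead of the nested loops. This is only a sketch under my assumptions (session_id is a helper column I introduce; the sample rows are inlined so the snippet is self-contained — swap the StringIO for pd.read_csv('sample_data.csv') on real data):

```python
import io
import pandas as pd

csv_text = """timestamp,cell_id,crnti,enodeb_id,uplane_downlink.cqi_wideband_mean,uplane_downlink.distance_initial_km,uplane_downlink.non_gbr_dl_bytes,time_sec,timestamp_y,uplane_uplink.non_gbr_ul_bytes
1606208553171,1,1487,1002107,7,27,134,1606208553,1606208553256,146
1606208557195,1,1487,1002107,6,27,388,1606208557,1606208557369,577
1606208557236,1,2631,1002107,10,27,251,1606208557,1606208557351,194
1606208561301,1,1487,1002107,6,27,140,1606208561,1606208561463,245
1606208561339,1,2631,1002107,11,27,159,1606208561,1606208561381,108
1606208565415,1,1487,1002107,7,27,99,1606208565,1606208565472,54
1606208565435,1,2631,1002107,10,27,64,1606208565,1606208565495,54
1606208569518,1,1487,1002107,6,27,101,1606208569,1606208569601,289
1606208569533,1,2528,1002107,7,16,57,1606208569,1606208569621,0
1606208571263,1,2631,1002107,10,27,0,1606208571,1606208571263,8500
1606208573557,1,2528,1002107,8,16,0,1606208573,1606208573557,0
1606208575838,1,1487,1002107,7,27,0,1606208575,1606208575838,50"""

df = pd.read_csv(io.StringIO(csv_text))
df = df.sort_values(['crnti', 'cell_id', 'enodeb_id', 'time_sec'])

# within each user, a gap of 9 sec or more starts a new session;
# the first row of each group has a NaN diff, which compares as False
gap = df.groupby(['crnti', 'cell_id', 'enodeb_id'])['time_sec'].diff()
df['session_id'] = gap.ge(9).cumsum()

summary = (df.groupby(['crnti', 'cell_id', 'enodeb_id', 'session_id'])
             .agg(distance_initial_km=('uplane_downlink.distance_initial_km', 'mean'),
                  first=('time_sec', 'min'),
                  last=('time_sec', 'max'),
                  cqi=('uplane_downlink.cqi_wideband_mean', 'mean'),
                  non_gbr_dl_bytes=('uplane_downlink.non_gbr_dl_bytes', 'sum'),
                  non_gbr_ul_bytes=('uplane_uplink.non_gbr_ul_bytes', 'sum'))
             .reset_index())
summary['session_duration'] = summary['last'] - summary['first']
print(summary)
```

On the sample rows this reproduces the three output rows above (durations 22, 4, 14). One behavioral difference: a record separated from its neighbors by 9 sec or more would get session_duration = 0 here rather than the hard-coded 4.096 in my loop version.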