
I am trying to take a DataFrame column that contains repeating values from a finite set and substitute each value with an index number. For example, if the values are [200, 20, 1000, 1], the assigned indexes would be [1, 2, 3, 4]. An actual data example is:

0    aaa
1    aaa
2    bbb
3    aaa
4    bbb
5    bbb
6    ccc
7    ddd
8    ccc
9    ddd

The desired output is:

0    1
1    1
2    2
3    1
4    2
5    2
6    4
7    3
8    4
9    3

I simply want to replace these strings with numbers; that's all. I do not care about the order of indexing, i.e. 1 could be 3 and so on, as long as the mapping is consistent: I don't care whether ['aaa','bbb','ccc','ddd'] is indexed by [1,2,3,4] or by [2,4,3,1].

Suppose the DataFrame's name is tbl and I want to change only a subset of the row indexes in column 'aaa'. Let's denote these indexes by tbl_ind. The way I currently do this is:

import numpy as np

tmp_r = tbl['aaa'].iloc[tbl_ind]      # the subset of values to relabel
un_r_ind = np.unique(tmp_r)           # the distinct values in that subset
for r_ind in range(len(un_r_ind)):
    # positions (within the subset) that hold the current distinct value
    r_ind_ind = np.array(np.where(tmp_r == un_r_ind[r_ind])[0])
    for j_ind in range(len(r_ind_ind)):
        tbl['aaa'].iloc[tbl_ind[r_ind_ind[j_ind]]] = r_ind

It works, but it is REALLY slow on large data sets. Python does not let me update tbl['aaa'].iloc[tbl_ind[r_ind_ind]] in one go, because tbl_ind is a plain list and cannot be fancy-indexed by the array r_ind_ind. How can I speed this up? Many thanks!

  • Can you post actual input data and the desired output? Your question makes little sense.
    – EdChum
    Commented Apr 2, 2015 at 13:57
  • Are you asking how to find where a range of values exists in a df and how to update them all? E.g. if your list of values is [200, 20, 1000, 1], you want to find all rows that have these values; do you want to change all these rows to the same value, or to a different value for each entry in the list?
    – EdChum
    Commented Apr 2, 2015 at 14:02
  • @EdChum I added an example, thanks. Commented Apr 2, 2015 at 14:13

3 Answers


I'd construct a dict of the values you want to replace and then call map:

In [7]:

df
Out[7]:
  data
0     
1  aaa
2  bbb
3  aaa
4  bbb
5  bbb
6  ccc
7  ddd
8  ccc
9  ddd
In [8]:

d = {'aaa':1,'bbb':2,'ccc':3,'ddd':4}
df['data'] = df['data'].map(d)
df

Out[8]:
   data
0      
1     1
2     2
3     1
4     2
5     2
6     3
7     4
8     3
9     4
  • What would you do if you had, say, half a million such distinct values over a few million data points (len(df) is 'very big')? I wonder about the size of the dictionary as well as the way it is created. Is there a way to create the dictionary quickly, without actually assigning d['fff'] = index for every possible key before applying map? Thanks. Commented Apr 4, 2015 at 5:53
  • You could construct it just from the distinct values: df['data'].unique() returns all the unique values, so you can pair them with an auto-generated integer index, e.g. d = dict(zip(df['data'].unique(), np.arange(len(df['data'].unique())))) (a sketch follows these comments)
    – EdChum
    Commented Apr 4, 2015 at 8:52
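
A minimal sketch of that comment's suggestion, assuming the same 'data' column as in the answer above (np.arange supplies the auto-generated integer labels):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data': ['aaa', 'aaa', 'bbb', 'aaa', 'bbb',
                            'bbb', 'ccc', 'ddd', 'ccc', 'ddd']})

# Build the mapping once from the distinct values, then replace in one pass.
uniques = df['data'].unique()                      # ndarray of distinct values
d = dict(zip(uniques, np.arange(len(uniques))))    # {'aaa': 0, 'bbb': 1, ...}
df['data'] = df['data'].map(d)

This scales with the number of distinct values rather than requiring each key to be written out by hand.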

You could use rank with the dense method:

>>> df[0].rank(method="dense")
0    1
1    1
2    2
3    1
4    2
5    2
6    3
7    4
8    3
9    4
Name: 0, dtype: float64

This basically sorts the values and maps the lowest to 1, the second-lowest to 2, and so on.
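
The result is float64; if integer labels are preferred, a cast afterwards should do it (a small usage note, assuming the same df):

>>> df[0].rank(method="dense").astype(int)   # same labels, as integers: 1, 1, 2, 1, 2, 2, 3, 4, 3, 4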


I am not sure I have understood correctly from your example. Is this what you are trying to achieve (apart from the bias in the indexing: zero-based instead of one-based)?

df = ['aaa','aaa','bbb','aaa','bbb','bbb','ccc','ddd','ccc','ddd']  # a plain list here, not a DataFrame
idx = {}  # value -> assigned index

def index_data(v):
    # return the index already assigned to v, or assign the next free one
    global idx

    if v in idx:
        return idx[v]
    else:
        n = len(idx)
        idx[v] = n
        return n

if __name__ == "__main__":
    outlist = []
    for i in df:
        outlist.append(index_data(i))
    for i, v in enumerate(outlist):
        print(i, v)

It outputs:

0 0
1 0
2 1
3 0
4 1
5 1
6 2
7 3
8 2
9 3

Obviously it can be optimised (e.g. by simply incrementing a counter for n instead of checking the size of the index dict each time).
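
For completeness, pandas also ships a vectorized built-in, pd.factorize, which produces exactly this zero-based labelling in a single call; a minimal sketch:

import pandas as pd

s = pd.Series(['aaa', 'aaa', 'bbb', 'aaa', 'bbb',
               'bbb', 'ccc', 'ddd', 'ccc', 'ddd'])

# factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(s)
print(codes)           # [0 0 1 0 1 1 2 3 2 3]
print(list(uniques))   # ['aaa', 'bbb', 'ccc', 'ddd']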

  • Thanks @Pynchia. I tried to work directly on tbl['aaa'].iloc[tbl_ind[r_ind_ind]], but it didn't work: r_ind_ind, returned by r_ind_ind = np.array(np.where(tmp_r == un_r_ind[r_ind])[0]), was a legitimate array index, but tbl_ind was a list. Casting tbl_ind to np.array(tbl_ind) before applying tbl_ind[r_ind_ind] solved the problem; a sketch of this fix follows below. Thank you for your solution too. Commented Apr 2, 2015 at 15:31
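
A sketch of the vectorized fix described in that comment, with toy data standing in for the question's tbl and tbl_ind, and with .loc replacing the chained .iloc assignment to avoid chained-indexing issues:

import numpy as np
import pandas as pd

tbl = pd.DataFrame({'aaa': ['aaa', 'aaa', 'bbb', 'aaa', 'bbb',
                            'bbb', 'ccc', 'ddd', 'ccc', 'ddd']})
tbl_ind = np.array(range(len(tbl)))     # cast to np.array once, as the comment says
tmp_r = tbl['aaa'].iloc[tbl_ind]
un_r_ind = np.unique(tmp_r)

for r_ind in range(len(un_r_ind)):
    r_ind_ind = np.where(tmp_r.values == un_r_ind[r_ind])[0]
    # one vectorized assignment per distinct value, no inner loop
    tbl.loc[tbl_ind[r_ind_ind], 'aaa'] = r_ind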
