
I am trying to take a DataFrame column that contains repeating values from a finite set and substitute each value with an index number. For example, if the values are [200, 20, 1000, 1], the assigned indexes would be [1, 2, 3, 4]. An actual data example is:

0    aaa
1    aaa
2    bbb
3    aaa
4    bbb
5    bbb
6    ccc
7    ddd
8    ccc
9    ddd

The desired output is:

0    1
1    1
2    2
3    1
4    2
5    2
6    4
7    3
8    4
9    3

I simply want to replace these strings with numbers; that's all. I do not care about the order of indexing, i.e. 1 could be 3 and so on, as long as the mapping is consistent: I don't care whether ['aaa','bbb','ccc','ddd'] is indexed by [1,2,3,4] or by [2,4,3,1].

Suppose the DataFrame's name is tbl and I want to change only a subset of the row indexes in column 'aaa'. Let's denote these indexes by tbl_ind. The way I currently do this is:

import numpy as np

tmp_r = tbl['aaa'].iloc[tbl_ind]      # the subset of values to relabel
un_r_ind = np.unique(tmp_r)           # the distinct values in that subset
for r_ind in range(len(un_r_ind)):
    # positions (within the subset) that hold the current distinct value
    r_ind_ind = np.array(np.where(tmp_r == un_r_ind[r_ind])[0])
    for j_ind in range(len(r_ind_ind)):
        tbl['aaa'].iloc[tbl_ind[r_ind_ind[j_ind]]] = r_ind

It works, but it is REALLY slow on large data sets. Python does not let me update tbl['aaa'].iloc[tbl_ind[r_ind_ind]] in one go, because tbl_ind is a plain list and cannot be fancy-indexed by the array r_ind_ind. How can I speed this up? Many thanks!

  • Can you post actual input data and the desired output? Your question makes little sense.
    – EdChum
    Commented Apr 2, 2015 at 13:57
  • Are you asking how to find where a range of values exists in a df and how to update them all? E.g. if your list of values is [200, 20, 1000, 1], you want to find all rows that have these values; do you want to change all these rows to the same value, or to a different value for each entry in the list?
    – EdChum
    Commented Apr 2, 2015 at 14:02
  • @EdChum I added an example, thanks. Commented Apr 2, 2015 at 14:13

3 Answers


I'd construct a dict of the values you want to replace and then call map:

In [7]:

df
Out[7]:
  data
0     
1  aaa
2  bbb
3  aaa
4  bbb
5  bbb
6  ccc
7  ddd
8  ccc
9  ddd
In [8]:

d = {'aaa':1,'bbb':2,'ccc':3,'ddd':4}
df['data'] = df['data'].map(d)
df

Out[8]:
   data
0      
1     1
2     2
3     1
4     2
5     2
6     3
7     4
8     3
9     4
  • What would you do if you had, say, half a million such distinct values over a few million data points (len(df) is 'very big')? I wonder about the size of the dictionary as well as the way it is created. Is there a way to create the dictionary quickly, without actually assigning d['fff'] = index for every possible key before applying map? Thanks. Commented Apr 4, 2015 at 5:53
  • You could construct it just from the distinct values: df['data'].unique() returns all the unique values, so you can pair them with an auto-generated integer index, e.g. d = dict(zip(df['data'].unique(), np.arange(len(df['data'].unique())))) (a sketch follows these comments)
    – EdChum
    Commented Apr 4, 2015 at 8:52
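
A minimal sketch of that comment's suggestion, assuming the same 'data' column as in the answer above (np.arange supplies the auto-generated integer labels):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data': ['aaa', 'aaa', 'bbb', 'aaa', 'bbb',
                            'bbb', 'ccc', 'ddd', 'ccc', 'ddd']})

# Build the mapping once from the distinct values, then replace in one pass.
uniques = df['data'].unique()                      # ndarray of distinct values
d = dict(zip(uniques, np.arange(len(uniques))))    # {'aaa': 0, 'bbb': 1, ...}
df['data'] = df['data'].map(d)

This scales with the number of distinct values rather than requiring each key to be written out by hand.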

You could use rank with the dense method:

>>> df[0].rank(method="dense")
0    1
1    1
2    2
3    1
4    2
5    2
6    3
7    4
8    3
9    4
Name: 0, dtype: float64

This basically sorts the values and maps the lowest to 1, the second-lowest to 2, and so on.
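
The result is float64; if integer labels are preferred, a cast afterwards should do it (a small usage note, assuming the same df):

>>> df[0].rank(method="dense").astype(int)   # same labels, as integers: 1, 1, 2, 1, 2, 2, 3, 4, 3, 4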


I am not sure I have understood correctly from your example. Is this what you are trying to achieve (apart from the bias in the indexing: zero-based instead of one-based)?

df = ['aaa','aaa','bbb','aaa','bbb','bbb','ccc','ddd','ccc','ddd']  # a plain list here, not a DataFrame
idx = {}  # value -> assigned index

def index_data(v):
    # return the index already assigned to v, or assign the next free one
    global idx

    if v in idx:
        return idx[v]
    else:
        n = len(idx)
        idx[v] = n
        return n

if __name__ == "__main__":
    outlist = []
    for i in df:
        outlist.append(index_data(i))
    for i, v in enumerate(outlist):
        print(i, v)

It outputs:

0 0
1 0
2 1
3 0
4 1
5 1
6 2
7 3
8 2
9 3

Obviously it can be optimised (e.g. by simply incrementing a counter for n instead of checking the size of the index dict each time).
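
For completeness, pandas also ships a vectorized built-in, pd.factorize, which produces exactly this zero-based labelling in a single call; a minimal sketch:

import pandas as pd

s = pd.Series(['aaa', 'aaa', 'bbb', 'aaa', 'bbb',
               'bbb', 'ccc', 'ddd', 'ccc', 'ddd'])

# factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(s)
print(codes)           # [0 0 1 0 1 1 2 3 2 3]
print(list(uniques))   # ['aaa', 'bbb', 'ccc', 'ddd']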

  • Thanks @Pynchia. I tried to work directly on tbl['aaa'].iloc[tbl_ind[r_ind_ind]], but it didn't work: r_ind_ind, returned by r_ind_ind = np.array(np.where(tmp_r == un_r_ind[r_ind])[0]), was a legitimate array index, but tbl_ind was a list. Casting tbl_ind to np.array(tbl_ind) before applying tbl_ind[r_ind_ind] solved the problem; a sketch of this fix follows below. Thank you for your solution too. Commented Apr 2, 2015 at 15:31
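
A sketch of the vectorized fix described in that comment, with toy data standing in for the question's tbl and tbl_ind, and with .loc replacing the chained .iloc assignment to avoid chained-indexing issues:

import numpy as np
import pandas as pd

tbl = pd.DataFrame({'aaa': ['aaa', 'aaa', 'bbb', 'aaa', 'bbb',
                            'bbb', 'ccc', 'ddd', 'ccc', 'ddd']})
tbl_ind = np.array(range(len(tbl)))     # cast to np.array once, as the comment says
tmp_r = tbl['aaa'].iloc[tbl_ind]
un_r_ind = np.unique(tmp_r)

for r_ind in range(len(un_r_ind)):
    r_ind_ind = np.where(tmp_r.values == un_r_ind[r_ind])[0]
    # one vectorized assignment per distinct value, no inner loop
    tbl.loc[tbl_ind[r_ind_ind], 'aaa'] = r_ind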
