I am trying to take a DataFrame column that contains repeating values from a finite set and replace each value with an index number, so that if the distinct values are [200,20,1000,1], their indexes will be [1,2,3,4]. An actual data example is:
0 aaa
1 aaa
2 bbb
3 aaa
4 bbb
5 bbb
6 ccc
7 ddd
8 ccc
9 ddd
The desired output is
0 1
1 1
2 2
3 1
4 2
5 2
6 4
7 3
8 4
9 3
I want to replace values that carry little meaning with numbers. That's all. I do not care about the order of indexing, i.e. 1 could be 3 and so on, as long as the mapping is consistent. For example, I don't care whether ['aaa','bbb','ccc','ddd'] is indexed by [1,2,3,4] or by [2,4,3,1].
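For the whole-column case, the mapping described above can be sketched with pandas' factorize function, which assigns one integer code per distinct value. This is only a minimal sketch: the DataFrame below is rebuilt from the sample data, and the shift to 1-based labels is an assumption made to match the numbering style of the desired output.

```python
import pandas as pd

# Sample data copied from the example above; the column name 'aaa' follows the question.
tbl = pd.DataFrame(
    {'aaa': ['aaa', 'aaa', 'bbb', 'aaa', 'bbb', 'bbb', 'ccc', 'ddd', 'ccc', 'ddd']}
)

# factorize returns 0-based integer codes (numbered in order of first appearance)
# together with the distinct values those codes refer to.
codes, uniques = pd.factorize(tbl['aaa'])

# Shift to 1-based labels (an assumption, see the note above).
labels = codes + 1

print(list(labels))
print(list(uniques))
```

The particular numbers differ from the desired output shown earlier ('ccc' and 'ddd' are swapped), which should be fine given that only consistency matters.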
Suppose the DataFrame is named tbl and I want to change only a subset of rows in column 'aaa'. Let's denote the positions of these rows by tbl_ind. The way I currently do that is:
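import numpy as np

tmp_r = tbl['aaa'].iloc[tbl_ind]  # values at the selected row positions
un_r_ind = np.unique(tmp_r)  # sorted distinct values
for r_ind in range(len(un_r_ind)):
    # positions within tmp_r that hold the current distinct value
    r_ind_ind = np.where(tmp_r == un_r_ind[r_ind])[0]
    for j_ind in range(len(r_ind_ind)):
        # one scalar assignment per row
        tbl['aaa'].iloc[tbl_ind[r_ind_ind[j_ind]]] = r_ind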
It works, but it is REALLY slow on large data sets.
Python does not let me update tbl['aaa'].iloc[tbl_ind[r_ind_ind]] in one step, as it's a list of indexes.
Could someone help, please? How can I speed this up?
Many thanks!
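For reference, here is one possible way to avoid the per-element loop, sketched under assumptions. np.unique with return_inverse=True yields, for every selected row, the index of its value among the sorted distinct values, which is the same r_ind the loop assigns. The sample tbl and the tbl_ind positions below are hypothetical, and the column is cast to object so that the new integers and the untouched strings can coexist.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and subset of row positions, for illustration only.
tbl = pd.DataFrame(
    {'aaa': ['aaa', 'aaa', 'bbb', 'aaa', 'bbb', 'bbb', 'ccc', 'ddd', 'ccc', 'ddd']}
)
tbl_ind = np.array([0, 2, 3, 6, 7, 8])

# Cast to object so integers and strings can share the column.
tbl['aaa'] = tbl['aaa'].astype(object)
col = tbl.columns.get_loc('aaa')  # integer position of the column

# inverse[i] is the index of subset[i] within the sorted distinct values.
subset = tbl.iloc[tbl_ind, col].to_numpy()
un_r_ind, inverse = np.unique(subset, return_inverse=True)

# Single vectorized assignment instead of one assignment per row.
tbl.iloc[tbl_ind, col] = inverse

print(tbl['aaa'].tolist())
```

Indexing with tbl.iloc[rows, col] addresses the DataFrame directly rather than going through tbl['aaa'].iloc[...], so the whole subset can be written in one assignment.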