tdy

If promotions is large, build the rank lookup as a Series instead of with dict(zip(...)), so that step is vectorized as well:

ranks = pd.Series(range(len(promotions)), index=promotions)  # ~40x faster given 10K jobcodes
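A minimal sketch to check that the Series lookup is a drop-in replacement for the dict(zip(...)) lookup when passed to map. The codes Series and the variable names ranks_dict, ranks_series, via_dict, and via_series are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

promotions = ('AGM4', 'GM2', 'ADO3')
codes = pd.Series(['---', 'ADO3', 'AGM4', 'GM2'])  # illustrative sample data

# Two ways to build the same jobcode -> rank lookup
ranks_dict = dict(zip(promotions, range(len(promotions))))
ranks_series = pd.Series(range(len(promotions)), index=promotions)

# Series.map accepts either a dict or a Series as the lookup;
# jobcodes missing from the lookup (here '---') become NaN in both cases
via_dict = codes.map(ranks_dict)
via_series = codes.map(ranks_series)

# Series.equals treats NaNs in the same position as equal
same = via_dict.equals(via_series)
print(same)
```

Whether the Series version is actually faster depends on the size of promotions, so it is worth timing both on your own data.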
Here, a simple way to actually vectorize is to map the jobcodes to their numerical ranks and compare the ranks directly (this assumes the promotions are ordered, which is the case in your example):

import pandas as pd

promotions = ('AGM4', 'GM2', 'ADO3')
df = pd.DataFrame({
    "PayGroup_prev": ["---", "ADO3", "AGM4", "AGM4", "AGM4", "AGM4", "ADO3"],
    "PayGroup_cur": ["AGM4", "GM2", "ADO3", "???", "AGM4", "GM2", "ADO3"],
})
#   PayGroup_prev  PayGroup_cur
# 0           ---          AGM4
# 1          ADO3           GM2
# 2          AGM4          ADO3
# 3          AGM4           ???
# 4          AGM4          AGM4
# 5          AGM4           GM2
# 6          ADO3          ADO3

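# Map each jobcode to its position in the promotion order
ranks = dict(zip(promotions, range(len(promotions))))
# {'AGM4': 0, 'GM2': 1, 'ADO3': 2}

# Jobcodes not in promotions (e.g. '---', '???') map to NaN
df['PayGroup_prev_rank'] = df['PayGroup_prev'].map(ranks)
df['PayGroup_cur_rank'] = df['PayGroup_cur'].map(ranks)

# Any comparison involving NaN evaluates to False
df['Promoted'] = df['PayGroup_cur_rank'] > df['PayGroup_prev_rank']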
#   PayGroup_prev  PayGroup_cur  PayGroup_prev_rank  PayGroup_cur_rank  Promoted
# 0           ---          AGM4                 NaN                0.0     False
# 1          ADO3           GM2                 2.0                1.0     False
# 2          AGM4          ADO3                 0.0                2.0      True
# 3          AGM4           ???                 0.0                NaN     False
# 4          AGM4          AGM4                 0.0                0.0     False
# 5          AGM4           GM2                 0.0                1.0      True
# 6          ADO3          ADO3                 2.0                2.0     False
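
If you don't need the intermediate rank columns, the same comparison can be wrapped in a small function. This is only a sketch: is_promoted is a hypothetical helper name, and it assumes promotions is ordered as in the example:

```python
import pandas as pd

def is_promoted(prev, cur, promotions):
    """Hypothetical helper: True where cur ranks strictly above prev."""
    # Jobcode -> position in the promotion order
    ranks = pd.Series(range(len(promotions)), index=promotions)
    # Unknown jobcodes map to NaN, and any comparison with NaN is False
    return cur.map(ranks) > prev.map(ranks)

promotions = ('AGM4', 'GM2', 'ADO3')
df = pd.DataFrame({
    "PayGroup_prev": ["---", "ADO3", "AGM4", "AGM4", "AGM4", "AGM4", "ADO3"],
    "PayGroup_cur": ["AGM4", "GM2", "ADO3", "???", "AGM4", "GM2", "ADO3"],
})

df['Promoted'] = is_promoted(df['PayGroup_prev'], df['PayGroup_cur'], promotions)
print(df)
```

This yields the same Promoted column as the step-by-step version, without adding the two rank columns to df.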