To me this looks like a classic map-reduce problem. I would implement map and reduce functions; the map over chunks looks like this:
First, set up thread-local data structures (no thread safety needed, because each one is used by only one thread) and assign one to each worker. Then, for every chunk:

1. If the current chunk is bigger than the threshold, divide it in half, search for the closest end of a token, and generate two tasks from the halves (each half re-enters step 1; see the sketch after this list).
2. If the chunk is smaller than or equal to the threshold, perform the normal tokenize.
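A minimal C++ sketch of these two steps, assuming a hypothetical `TaskPool` with a `submit` method; `LocalIndex`, `kThreshold`, and the whitespace-based tokenizer are illustrative stand-ins for the real thing:

```cpp
#include <cstddef>
#include <string_view>
#include <unordered_map>

// Illustrative per-worker structure; keys point into the original text,
// so the text buffer must outlive the index.
struct LocalIndex {
    std::unordered_map<std::string_view, std::size_t> counts;  // no locks needed
};
thread_local LocalIndex tls_index;             // one instance per worker thread

constexpr std::size_t kThreshold = 64 * 1024;  // tuning parameter, assumed

// Move the midpoint forward to the nearest end of a token (whitespace here).
std::size_t split_point(std::string_view chunk) {
    std::size_t mid = chunk.size() / 2;
    while (mid < chunk.size() && chunk[mid] != ' ') ++mid;
    return mid;
}

template <typename TaskPool>
void map_chunk(std::string_view chunk, TaskPool& pool) {
    if (chunk.size() > kThreshold) {
        std::size_t mid = split_point(chunk);
        // Step 1: each half becomes a new task that re-enters the same logic.
        pool.submit([=, &pool] { map_chunk(chunk.substr(0, mid), pool); });
        pool.submit([=, &pool] { map_chunk(chunk.substr(mid), pool); });
    } else {
        // Step 2: small enough, tokenize into this worker's own index.
        std::size_t pos = 0;
        while (pos < chunk.size()) {
            std::size_t end = chunk.find(' ', pos);
            if (end == std::string_view::npos) end = chunk.size();
            if (end > pos) ++tls_index.counts[chunk.substr(pos, end - pos)];
            pos = end + 1;
        }
    }
}
```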
This lets each worker concentrate on its own task without contending for resources with other workers. With workers being threads, it also improves data locality, so more useful work gets done per CPU cycle.
After the mapping process finishes, the reduce part could look like the following:
1. Agree on common lexeme ids.
2. Post a task for each worker to build a rename map from its local lexeme ids to the common ones.
3. Create a hashmap with an atomic value type (if false sharing is established to be a problem, pad the entries to break it; see the sketch after this list).
4. Post a task for every worker to sum its results into the shared index.
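A sketch of steps 3 and 4. Since the ids agreed on in step 1 form a dense range, this version uses a flat vector instead of a hashmap; the padding and all the names are illustrative:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Pad each counter to its own cache line so workers updating neighbouring
// ids do not invalidate each other's lines. 64 bytes is a typical line size;
// std::hardware_destructive_interference_size is the portable spelling.
struct alignas(64) PaddedCounter {
    std::atomic<std::size_t> value{0};
};

// Step 4: a worker adds its local counts into the shared index, translating
// its local lexeme ids through the rename map built in step 2.
void merge_worker(const std::vector<std::size_t>& local_counts,
                  const std::vector<std::size_t>& rename,  // local id -> common id
                  std::vector<PaddedCounter>& shared) {
    for (std::size_t local_id = 0; local_id < local_counts.size(); ++local_id) {
        shared[rename[local_id]].value.fetch_add(local_counts[local_id],
                                                 std::memory_order_relaxed);
    }
}
```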
Although it looks like step 4 might contend on the same memory page, each shared entry is updated at most once per worker, so contention is bounded by the number of workers rather than by the number of workers multiplied by the possible occurrences of each lexeme. I do not know how long step 1 of the reduce will take, but given a sufficiently big text, I suppose the whole scheme will be faster.
With the algorithm above there are several small taxes on performance (dividing the task, posting tasks to each worker, data structure allocations) and two big ones (establishing the common lexeme id map and the per-worker rename maps). Fortunately, the number of unique lexemes is likely much smaller than the text itself.
Static chunk scheduling suffers from variability in the CPU time available to the process. In most environments, the launched program is not the only one running on the machine. Getting scheduled out eats into the CPU time of the process, creating unevenness and stalling the other CPUs from progressing. Dynamic task scheduling, which divides the work into smaller chunks, creates more overhead but works a lot better in most situations.
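To make the contrast concrete, here is a minimal sketch of dynamic scheduling: workers pull small chunks off a shared atomic counter, so a thread that gets scheduled out simply claims fewer chunks rather than holding everyone at a barrier (`process_chunk` is a stand-in for the real work):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

void process_chunk(std::size_t /*chunk_index*/) {
    // stand-in for tokenizing one chunk
}

void run_dynamic(std::size_t num_chunks, unsigned num_workers) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            // Each worker grabs the next unclaimed chunk until none remain,
            // so faster threads naturally absorb work from slower ones.
            std::size_t i;
            while ((i = next.fetch_add(1, std::memory_order_relaxed)) < num_chunks) {
                process_chunk(i);
            }
        });
    }
    for (auto& t : workers) t.join();
}
```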
Non-composable threading does not make a great executor. The code in question provides no means of cooperating with other threads in the program, nor does it allow the work chunks to be executed on something else. It would be great to be able to separate the work division/scheduling from the executor.
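One possible shape for that separation, loosely modeled on the executor idea from the C++ executors proposals; the names here are illustrative, not an existing API:

```cpp
#include <utility>

// The algorithm only sees an "execute" hook; the caller decides what actually
// runs the work (a thread pool, a fiber scheduler, or the current thread).
struct InlineExecutor {
    template <typename F>
    void execute(F&& f) { std::forward<F>(f)(); }  // run immediately, in place
};

// Work division is templated on the executor, so the same tokenizing code
// composes with whatever threading the rest of the program already uses.
template <typename Executor>
void tokenize_parallel(Executor& ex /*, text, threshold, ... */) {
    // Divide the work here and hand each piece to the executor; the leaf
    // tasks would run the actual tokenizer.
    ex.execute([] { /* tokenize one piece */ });
}
```

With this shape, `InlineExecutor ex; tokenize_parallel(ex);` runs everything on the calling thread (handy for testing), while a thread-pool executor with the same `execute` signature slots in without changing the algorithm.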