So, to mitigate the cache fighting, if that is indeed a problem, you could limit each CPU's run to a portion of the A x B matrix that fits in the cache, and have all the CPUs work on that limited set until they are done, then move on to another subset of A x B. If one CPU finishes first, I probably would not even give it new work, presuming that (a) this cache thrashing is a real problem for your domain, and (b) all the CPUs will finish their A x B subset more-or-less at the same time. Under those assumptions, we'd probably be better off running all CPUs to completion on one subset before embarking on the next.
On the other hand, spawning exactly as many threads as cores, with an algorithm that works incrementally on well-defined subranges of A x B and waits for all the other cores to acknowledge completion before starting on the next subrange, may provide the best of all these approaches. Each thread announces completion of its subset and then suspends itself until notified that all threads have completed.
So each core would march through the same A x B subset using your notion of n/C iterations, and then all the cores would move on to the next A x B subset.
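Here's a minimal C++20 sketch of that scheme, assuming square row-major N x N matrices of doubles, a zero-initialized C, N divisible by TILE, and TILE sized to the cache (all the names and numbers are illustrative, not from your code). std::barrier plays the announce-and-suspend role: a thread that finishes its slice of the current tile parks at the barrier rather than taking on new work.

```cpp
#include <barrier>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sizes: TILE chosen so the blocks of A, B, and C in play
// fit in cache together; N divisible by TILE.
constexpr std::size_t N = 1024;
constexpr std::size_t TILE = 64;

void multiply_tiled_parallel(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C,  // must be zeroed
                             unsigned num_threads)
{
    // The rendezvous point: a thread that finishes its share of a tile
    // blocks here instead of being handed new work.
    std::barrier<> sync(num_threads);

    auto worker = [&](unsigned tid) {
        // Every thread walks the tiles of A x B in the same fixed order.
        for (std::size_t ti = 0; ti < N; ti += TILE)
            for (std::size_t tj = 0; tj < N; tj += TILE)
                for (std::size_t tk = 0; tk < N; tk += TILE) {
                    // Threads split the rows of the current tile, so they
                    // all work the same cache-resident blocks of A and B.
                    for (std::size_t i = ti + tid; i < ti + TILE; i += num_threads)
                        for (std::size_t k = tk; k < tk + TILE; ++k) {
                            const double a = A[i * N + k];
                            for (std::size_t j = tj; j < tj + TILE; ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
                    // Nobody advances to the next block until everyone is
                    // done with this one.
                    sync.arrive_and_wait();
                }
    };

    std::vector<std::jthread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, t);
    // jthread joins on destruction, so all work finishes before returning.
}
```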
In fact, even using subranges of A x B in a single-threaded algorithm might improve performance over ranging across all of A x B numerous times, which is entirely possible if the memory touched by the A x B matrix doesn't fit in the cache. Each pass through A x B needs to bring the entirety of both data structures into the cache again (and maybe even again and again just for one iteration through A x B), whereas a single thread running on a manageable subset could bring each such subset into the cache only once for the n iterations. So you might start with a single-threaded version of matrix subsetting, and then add the parallel synchronization.
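That single-threaded starting point is classic loop tiling (blocking). A sketch under the same assumptions as above (square row-major matrices, zeroed C, N divisible by TILE):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1024;
constexpr std::size_t TILE = 64;

void multiply_tiled(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C)  // must be zeroed
{
    // Finish each block-by-block interaction before moving on, so the
    // blocks of A and B it touches are brought into the cache once,
    // not once per full pass over A x B.
    for (std::size_t ti = 0; ti < N; ti += TILE)
        for (std::size_t tj = 0; tj < N; tj += TILE)
            for (std::size_t tk = 0; tk < N; tk += TILE)
                for (std::size_t i = ti; i < ti + TILE; ++i)
                    for (std::size_t k = tk; k < tk + TILE; ++k) {
                        const double a = A[i * N + k];
                        for (std::size_t j = tj; j < tj + TILE; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

A rough rule of thumb is to pick TILE so that three TILE x TILE blocks of doubles (about 3 * TILE * TILE * 8 bytes) fit in the cache level you're targeting, then tune from there.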