So, to mitigate the cache fighting, if that is indeed a problem, you could limit each CPU's run to a portion of the A x B matrix that fits in the cache, and have all the CPUs work on that limited set until they are done, then move on to another subset of A x B. If one CPU finishes first, I probably would not even give it new work, presuming that (a) this cache thrashing is a real problem for your domain, and (b) all the CPUs will finish their A x B subset more-or-less at the same time. Under those assumptions, we'd probably be better off running all CPUs to completion on one subset before embarking on the next.
On the other hand, spawning exactly as many threads as cores, with an algorithm that works incrementally on well-defined subranges of A x B and waits for all the other cores to acknowledge completion before starting on the next subrange, may provide the best of all these approaches. Each thread announces completion of its subset and then suspends itself until notified that all threads have completed.
So each core would march through the same A x B subset using your notion of n/C iterations, and then all the cores would move on to the next A x B subset.
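Here's a minimal C++20 sketch of that scheme, assuming square row-major N x N matrices of doubles, a zero-initialized C, N divisible by TILE, and TILE sized to the cache (all the names and numbers are illustrative, not from your code). std::barrier plays the announce-and-suspend role: a thread that finishes its slice of the current tile parks at the barrier rather than taking on new work.

```cpp
#include <barrier>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sizes: TILE chosen so the blocks of A, B, and C in play
// fit in cache together; N divisible by TILE.
constexpr std::size_t N = 1024;
constexpr std::size_t TILE = 64;

void multiply_tiled_parallel(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C,  // must be zeroed
                             unsigned num_threads)
{
    // The rendezvous point: a thread that finishes its share of a tile
    // blocks here instead of being handed new work.
    std::barrier<> sync(num_threads);

    auto worker = [&](unsigned tid) {
        // Every thread walks the tiles of A x B in the same fixed order.
        for (std::size_t ti = 0; ti < N; ti += TILE)
            for (std::size_t tj = 0; tj < N; tj += TILE)
                for (std::size_t tk = 0; tk < N; tk += TILE) {
                    // Threads split the rows of the current tile, so they
                    // all work the same cache-resident blocks of A and B.
                    for (std::size_t i = ti + tid; i < ti + TILE; i += num_threads)
                        for (std::size_t k = tk; k < tk + TILE; ++k) {
                            const double a = A[i * N + k];
                            for (std::size_t j = tj; j < tj + TILE; ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
                    // Nobody advances to the next block until everyone is
                    // done with this one.
                    sync.arrive_and_wait();
                }
    };

    std::vector<std::jthread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, t);
    // jthread joins on destruction, so all work finishes before returning.
}
```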
In fact, even using subranges of A x B in a single-threaded algorithm might improve performance over ranging across all of A x B numerous times, which is entirely possible if the memory touched by the A x B matrix doesn't fit in the cache. Each pass through A x B needs to bring the entirety of both data structures into the cache again (and maybe even again and again just for one iteration through A x B), whereas a single thread running on a manageable subset could bring each such subset into the cache only once for the n iterations. So you might start with a single-threaded version of matrix subsetting, and then add the parallel synchronization.
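That single-threaded starting point is classic loop tiling (blocking). A sketch under the same assumptions as above (square row-major matrices, zeroed C, N divisible by TILE):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1024;
constexpr std::size_t TILE = 64;

void multiply_tiled(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C)  // must be zeroed
{
    // Finish each block-by-block interaction before moving on, so the
    // blocks of A and B it touches are brought into the cache once,
    // not once per full pass over A x B.
    for (std::size_t ti = 0; ti < N; ti += TILE)
        for (std::size_t tj = 0; tj < N; tj += TILE)
            for (std::size_t tk = 0; tk < N; tk += TILE)
                for (std::size_t i = ti; i < ti + TILE; ++i)
                    for (std::size_t k = tk; k < tk + TILE; ++k) {
                        const double a = A[i * N + k];
                        for (std::size_t j = tj; j < tj + TILE; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

A rough rule of thumb is to pick TILE so that three TILE x TILE blocks of doubles (about 3 * TILE * TILE * 8 bytes) fit in the cache level you're targeting, then tune from there.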