  • 3
    I don't think specifying the signature is good advice: it prevents optimizations based on the contiguity of the data (sometimes resulting in noticeably degraded performance); see the signature sketch after this thread. Also, I'm not sure why you mention the GPU here; nothing in the question mentions a GPU. Commented Oct 25, 2017 at 11:44
  • 1
    But I like the part about the cost of parallel processing, especially the often-ignored point that "it is very, indeed VERY EASY to pay MUCH more than one may gain from"! Commented Oct 25, 2017 at 11:45
  • Re: the GPU) it was actually suggested in the comments above to try the numba @guvectorize tool, so I added a few remarks on the hidden, extreme costs of (indeed very often misused) GPU latency-masking SMX toys for this sort of problem. A GPU can help with "mathematically" large GPU kernels operating on a very compact and small data region, with minimal (ideally zero) SIMT synchronization, but not with anything else. Parallelization at any cost is all too common these days. "O tempora, o mores..." :o) Commented Oct 25, 2017 at 11:52
  • 1
    Thanks for this detailed answer. One thing to keep in mind is that I wrote a very similar csrMult function in C++, where it was trivial to parallelize the for loop (because C++ supports OpenMP natively), and by parallelizing the for loop I observed a 6x or 7x speedup on the same matrix. I would expect a similar speedup here. In any case, it should at least be possible to parallelize my for loop using prange() without the code crashing; a sketch of such a version follows this thread. In C++, I only needed to write #pragma omp parallel for above the for loop to make it execute in parallel. Commented Oct 25, 2017 at 11:54
  • 2
    If I am reading this correctly, there seems to be a mistaken assumption that the guvectorize decorator implies GPU computation, but this is not correct: indeed, I use such constructs all the time on CPU targets; see the sketch after this thread. Commented Dec 16, 2017 at 19:47
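
To make the first comment's point about signatures concrete, here is a minimal sketch (function names are made up for illustration) contrasting an eager signature for a generic strided array with one declaring C-contiguity, and with lazy compilation. An f8[:]-style signature forces Numba to emit strided loads, whereas f8[::1] (or letting Numba specialize lazily on a contiguous argument) lets it assume unit stride and vectorize more readily:

    import numpy as np
    from numba import njit, float64

    @njit(float64(float64[:]))       # eager signature: generic strided 1-D array
    def dot_self_strided(x):
        # Numba must assume arbitrary strides here, which can block SIMD.
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i] * x[i]
        return s

    @njit(float64(float64[::1]))     # eager signature: C-contiguous, unit stride
    def dot_self_contig(x):
        # Unit stride is guaranteed, so the loop vectorizes more readily,
        # but non-contiguous inputs (e.g. x[::2]) are now rejected.
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i] * x[i]
        return s

    @njit                            # lazy: specializes on the layout actually passed
    def dot_self_lazy(x):
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i] * x[i]
        return s

    x = np.ones(1_000_000)
    dot_self_strided(x); dot_self_contig(x); dot_self_lazy(x)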
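For the prange() route raised in the comments, here is a hedged sketch of what a prange-parallelized CSR matrix-vector product might look like; the name csrMult_parallel and the argument layout (data/indices/indptr, as in scipy.sparse.csr_matrix) are assumptions for illustration. Each row is an independent dot product, so the outer loop parallelizes safely, much like #pragma omp parallel for in C++:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def csrMult_parallel(x, Adata, Aindices, Aindptr, Ashape):
        nrows = Ashape[0]
        Ax = np.zeros(nrows)
        for i in prange(nrows):            # each row is an independent dot product
            s = 0.0
            for k in range(Aindptr[i], Aindptr[i + 1]):
                s += Adata[k] * x[Aindices[k]]
            Ax[i] = s
        return Ax

    # Usage with a scipy.sparse CSR matrix A and a dense vector x:
    #   y = csrMult_parallel(x, A.data, A.indices, A.indptr, A.shape)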
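And backing the last comment: @guvectorize does not imply GPU execution. target="cpu" (the default) and target="parallel" both run on the host; only target="cuda" compiles a CUDA kernel. A small self-contained example:

    import numpy as np
    from numba import guvectorize, float64

    @guvectorize([(float64[:], float64[:], float64[:])], "(n),(n)->(n)",
                 target="parallel")        # multi-threaded CPU, no GPU involved
    def add_vectors(a, b, out):
        for i in range(a.shape[0]):
            out[i] = a[i] + b[i]

    a = np.arange(4.0); b = np.ones(4)
    print(add_vectors(a, b))               # -> [1. 2. 3. 4.]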