  • 3
    I don't think specifying the signature is good advice: it prevents optimizations based on the contiguity of the data (sometimes resulting in noticeably degraded performance); see the signature sketch after this thread. Also, I'm not sure why you mention the GPU here; nothing in the question mentions a GPU. Commented Oct 25, 2017 at 11:44
  • 1
    But I like the part about the cost of parallel processing, especially the often-ignored point that "it is very, indeed VERY EASY to pay MUCH more than one may gain from"! Commented Oct 25, 2017 at 11:45
  • Re: the GPU) it was actually suggested in the comments above to try the numba @guvectorize tool, so I added a few remarks on the hidden, extreme costs of (indeed very often misused) GPU latency-masking SMX toys for this sort of problem. A GPU can help with "mathematically" large GPU kernels operating on a very compact and small data region, with minimal (ideally zero) SIMT synchronization, but not with anything else. Parallelization at any cost is all too common these days. "O tempora, o mores..." :o) Commented Oct 25, 2017 at 11:52
  • 1
    Thanks for this detailed answer. One thing to keep in mind is that I wrote a very similar csrMult function in C++, where it was trivial to parallelize the for loop (because C++ supports OpenMP natively), and by parallelizing the for loop I observed a 6x or 7x speedup on the same matrix. I would expect a similar speedup here. In any case, it should at least be possible to parallelize my for loop using prange() without the code crashing; a sketch of such a version follows this thread. In C++, I only needed to write #pragma omp parallel for above the for loop to make it execute in parallel. Commented Oct 25, 2017 at 11:54
  • 2
    If I am reading this correctly, there seems to be a mistaken assumption that the guvectorize decorator implies GPU computation, but this is not correct: indeed, I use such constructs all the time on CPU targets; see the sketch after this thread. Commented Dec 16, 2017 at 19:47
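
To make the first comment's point about signatures concrete, here is a minimal sketch (function names are made up for illustration) contrasting an eager signature for a generic strided array with one declaring C-contiguity, and with lazy compilation. An f8[:]-style signature forces Numba to emit strided loads, whereas f8[::1] (or letting Numba specialize lazily on a contiguous argument) lets it assume unit stride and vectorize more readily:

    import numpy as np
    from numba import njit, float64

    @njit(float64(float64[:]))       # eager signature: generic strided 1-D array
    def dot_self_strided(x):
        # Numba must assume arbitrary strides here, which can block SIMD.
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i] * x[i]
        return s

    @njit(float64(float64[::1]))     # eager signature: C-contiguous, unit stride
    def dot_self_contig(x):
        # Unit stride is guaranteed, so the loop vectorizes more readily,
        # but non-contiguous inputs (e.g. x[::2]) are now rejected.
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i] * x[i]
        return s

    @njit                            # lazy: specializes on the layout actually passed
    def dot_self_lazy(x):
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i] * x[i]
        return s

    x = np.ones(1_000_000)
    dot_self_strided(x); dot_self_contig(x); dot_self_lazy(x)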
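For the prange() route raised in the comments, here is a hedged sketch of what a prange-parallelized CSR matrix-vector product might look like; the name csrMult_parallel and the argument layout (data/indices/indptr, as in scipy.sparse.csr_matrix) are assumptions for illustration. Each row is an independent dot product, so the outer loop parallelizes safely, much like #pragma omp parallel for in C++:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def csrMult_parallel(x, Adata, Aindices, Aindptr, Ashape):
        nrows = Ashape[0]
        Ax = np.zeros(nrows)
        for i in prange(nrows):            # each row is an independent dot product
            s = 0.0
            for k in range(Aindptr[i], Aindptr[i + 1]):
                s += Adata[k] * x[Aindices[k]]
            Ax[i] = s
        return Ax

    # Usage with a scipy.sparse CSR matrix A and a dense vector x:
    #   y = csrMult_parallel(x, A.data, A.indices, A.indptr, A.shape)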
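And backing the last comment: @guvectorize does not imply GPU execution. target="cpu" (the default) and target="parallel" both run on the host; only target="cuda" compiles a CUDA kernel. A small self-contained example:

    import numpy as np
    from numba import guvectorize, float64

    @guvectorize([(float64[:], float64[:], float64[:])], "(n),(n)->(n)",
                 target="parallel")        # multi-threaded CPU, no GPU involved
    def add_vectors(a, b, out):
        for i in range(a.shape[0]):
            out[i] = a[i] + b[i]

    a = np.arange(4.0); b = np.ones(4)
    print(add_vectors(a, b))               # -> [1. 2. 3. 4.]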