> I'm open to a solution that uses a huge matrix in OpenCL (I'm aware that would be much more efficient than my solution).
The kernel code doesn't share any data between work-items (though neighbouring work-items produce similar data, see the 2D kernel part); each work-item works only on its own data in this program. So a larger image just increases the (computation + buffer copy) time relative to the fixed kernel launch overhead, and the perceived throughput increases.
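As a rough illustration (the numbers here are made up, just to show the proportions): if launch plus scheduling overhead is a fixed ~20 µs and each pixel costs ~1 ns, a 256x256 image computes in ~66 µs, so overhead is ~23% of the total; a 1024x1024 image computes in ~1.05 ms, and the same overhead drops to ~2%.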
But:

Since each compute unit runs work-items on SIMD units, neighbouring work-items should produce the same or similar colors, so decreasing the local group size as much as possible should use those SIMD units better: a difference in color means divergence in the pipeline, which is bad for performance.
Think of drawing a filled circle: interior pixels need more work, exterior pixels less. Smaller tiles mean the work is distributed more efficiently around the circle's edge.
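For illustration, a minimal sketch of such a kernel (not your original code; the image size and the per-pixel loop are placeholders standing in for real shading work):

```c
// 512x512 filled circle: interior pixels take a heavier path than
// exterior ones, so work-items straddling the edge diverge inside
// the same SIMD unit
kernel void circle(global uchar *out)
{
    const int w = 512;
    int x = get_global_id(0) % w;
    int y = get_global_id(0) / w;
    int dx = x - w / 2, dy = y - w / 2;
    uchar c = 0;
    if (dx * dx + dy * dy < 200 * 200)   // interior: heavier path
        for (int i = 0; i < 64; i++)     // stand-in for real shading work
            c = (uchar)(c + (x * i + y) % 7);
    out[x + y * w] = c;                  // exterior lanes idled through the loop
}
```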
2D Kernel
A scanline mapping is not enough. Neighbouring pixels along the Y axis can also have the same or similar values, so you should use a 2D NDRange kernel and map work-items in Z-order (see the sketch below), or at least in square tiles.
If each compute unit has 64 cores, try tiles of 8x8 instead of 2x32 or 32x2, to limit divergence between neighbouring pixel results.
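For the Z-order (Morton order) option, here is a minimal sketch of the standard bit de-interleaving trick; the helper names are mine, not from any library:

```c
// de-interleave: keep every other bit of v, packed into the low half
uint compact_bits(uint v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// Morton/Z-order: ids 0,1,2,3,... visit (0,0),(1,0),(0,1),(1,1),...
uint2 morton_decode(uint m)
{
    return (uint2)(compact_bits(m), compact_bits(m >> 1));
}
```

Feeding `get_local_id(0)` through `morton_decode` keeps each SIMD-wide batch of work-items inside a small square instead of a long line.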
Even with a 1D kernel, you can achieve the same performance:
- Get the group id, then derive the group's x and y values from it using modulus and division.
- Map the local id to a position inside a tile, again using modulus and division, so each local thread works on neighbouring pixels in a tile instead of a scanline.
```c
// 64 threads per group (square-mapped), 256x256 image
kernel void render(global uint *result)
{
    uint thread_group_x = get_group_id(0) % 32;  // 32 tiles along the X axis
    uint thread_group_y = get_group_id(0) / 32;  // 32 tiles along the Y axis
    uint thread_x = get_local_id(0) % 8;         // pixel-x inside the tile
    uint thread_y = get_local_id(0) / 8;         // pixel-y inside the tile

    uint color = 0; /* --- calculate the pixel here --- */

    result[thread_x + thread_group_x * 8
           + 256 * (thread_y + thread_group_y * 8)] = color;
}
```
With plain 1D indexing (as in simplePixel), on an 8k x 8k image with 64 local threads, threads 1, 2, 3, 4 only have the line neighbours 1-2, 2-3, 3-4. With the emulated 2D mapping below, threads 1, 2, 3, 4 (and the rest of the group) all end up as neighbours in an 8x8 patch:
```c
unsigned ix = (get_group_id(0) % 1024) * 8 + get_local_id(0) % 8;
unsigned iy = (get_group_id(0) / 1024) * 8 + get_local_id(0) / 8;
```
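For example, in group 0 (the image is 8192 wide, so 8192/8 = 1024 tiles per row): local id 7 maps to (7, 0) and local id 8 maps to (0, 1), directly below local id 0, so the 64 threads cover a compact 8x8 patch instead of a 64-pixel strip of a scanline.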
After the data locality problem is solved, you can optimize the buffer copies, to see the actual compute performance instead of the PCI-e bottleneck.
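As a starting point, here is a minimal host-side sketch for separating copy time from kernel time with OpenCL profiling events (it assumes a queue created with CL_QUEUE_PROFILING_ENABLE and the usual setup around it; `queue`, `kernel`, `buf`, `bytes`, and `host_ptr` are placeholders):

```c
cl_event kev, cev;
size_t global = 8192 * 8192, local = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &kev);
clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr, 1, &kev, &cev);
clWaitForEvents(1, &cev);

cl_ulong t0, t1;
clGetEventProfilingInfo(kev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
clGetEventProfilingInfo(kev, CL_PROFILING_COMMAND_END,   sizeof t1, &t1, NULL);
// (t1 - t0) nanoseconds is pure kernel time; repeat with cev for the
// copy time, to see how much of the total is really the PCI-e transfer
```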