I'm open to a solution that uses a huge matrix in OpenCL (I'm aware that would be much more efficient than my solution).

The kernel code doesn't share any data between work-items (although neighbouring work-items produce similar results; see the 2D kernel part). Each work-item works only on its own data in this program. So a larger image just shrinks the fixed kernel launch overhead relative to the (computation + buffer copy) time, so the perceived throughput increases.
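
As a rough illustration with assumed, not measured, numbers (a fixed ~50 µs launch/sync cost and ~3 ns of compute-plus-copy time per pixel), you can see how the fixed cost amortizes as the image grows:

 #include <stdio.h>

 int main(void) {
     const double overhead_s  = 50e-6;        /* assumed fixed per-launch cost         */
     const double per_pixel_s = 2e-9 + 1e-9;  /* assumed compute + copy time per pixel */
     long sizes[] = {256L * 256L, 8192L * 8192L};
     for (int i = 0; i < 2; ++i) {
         double total = overhead_s + sizes[i] * per_pixel_s;
         printf("%ld pixels -> %.0f Mpixel/s perceived\n",
                sizes[i], sizes[i] / total / 1e6);
     }
     return 0;
 }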

But,

Since each compute unit uses SIMD, neighbouring work-items should produce the same or similar colors, so decreasing the local group size as much as possible should use those SIMD lanes better, since a difference in color means divergence in the pipeline, which is bad for performance.

Think of drawing a filled circle: interior pixels need more work, the outer part less. Smaller tiles mean a more efficient work distribution around the boundary line.
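
For example, here is a minimal kernel sketch (not the original code; the names and the stand-in workload are made up) where work-items inside the circle take a long branch and the rest take a short one, so a SIMD unit whose work-items straddle the edge pays for both branches:

 __kernel void filled_circle(__global uchar *out, const int width, const int height)
 {
     int x = get_global_id(0);
     int y = get_global_id(1);
     if (x >= width || y >= height) return;

     float dx = x - width * 0.5f;
     float dy = y - height * 0.5f;
     float r  = width * 0.25f;

     uchar color = 32;                          /* exterior: almost free           */
     if (dx * dx + dy * dy < r * r) {
         for (int i = 0; i < 256; ++i)          /* interior: stand-in heavy work   */
             color = (uchar)((color * 31 + i) & 0xFF);
     }
     out[y * width + x] = color;
 }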

2D Kernel

A scanline is not enough. Neighbouring pixels along the Y axis can also have the same or similar values, so you should use a 2D NDRange kernel and have the work-items Z-ordered, or at least mapped to square tiles.
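
A minimal sketch of the Z-order idea inside a kernel (assuming a 1D launch; the loop-based bit de-interleave is chosen for clarity, not speed):

 /* Decode a 1D id into Z-ordered (Morton) x/y by de-interleaving its bits. */
 unsigned id = (unsigned)get_global_id(0);
 unsigned x = 0, y = 0;
 for (int b = 0; b < 16; ++b) {
     x |= ((id >> (2 * b))     & 1u) << b;   /* even bits -> x */
     y |= ((id >> (2 * b + 1)) & 1u) << b;   /* odd bits  -> y */
 }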

If each compute unit has 64 cores, try tiles of 8x8 instead of 2x16 or 16x2, because of pixel result divergence.
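
For instance, a host-side sketch (assuming a cl_command_queue queue and a cl_kernel kernel already exist) that launches an 8192x8192 image as a 2D NDRange with 8x8 work-groups, so each group covers a square tile instead of a strip:

 size_t global[2] = {8192, 8192};   /* one work-item per pixel             */
 size_t local[2]  = {8, 8};         /* 64 work-items, one square 8x8 tile  */
 cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                     global, local, 0, NULL, NULL);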

Even with a 1-D kernel, you can achieve the same performance:

  • Get the group id, then derive group-x and group-y values from it using modulus and division.
  • Map the local id to a position inside a tile, again using modulus and division, so each local thread works on neighbours in a tile instead of on a scanline.

// 64 work-items per group (square mapped), 256x256 image

 int thread_group_x = get_group_id(0) % 32;   // 32 tiles along the X axis
 int thread_group_y = get_group_id(0) / 32;   // 32 tiles along the Y axis
 int thread_x = get_local_id(0) % 8;          // pixel-x inside the tile
 int thread_y = get_local_id(0) / 8;          // pixel-y inside the tile
 // ... calculate this pixel's color ...
 // result = __global output buffer, color = computed value
 result[thread_x + thread_group_x * 8 + 256 * (thread_y + thread_group_y * 8)] = color;

In simple 1D, threads 1, 2, 3, 4 have only 1-2, 2-3, 3-4 as neighbours.

In (emulated) 2D, threads 1, 2, 3, 4 are all neighbours.

Pixel indices for the 1D kernel emulated as 2D, for the 8k x 8k example with 64 local threads:

 unsigned ix = (get_group_id(0) % 1024) * 8 + get_local_id(0) % 8;
 unsigned iy = (get_group_id(0) / 1024) * 8 + get_local_id(0) / 8;

Once the data locality problem is solved, you can optimize the buffer copies to see the actual compute performance instead of a PCI-e bottleneck.
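
One possible direction, sketched with assumed names (context, queue, width, height): allocate the output buffer with CL_MEM_ALLOC_HOST_PTR so the runtime can keep it in pinned memory, then map it instead of doing a separate read-back copy:

 cl_int err;
 cl_mem result = clCreateBuffer(context,
                                CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                (size_t)width * height * sizeof(cl_uchar),
                                NULL, &err);

 /* ... set kernel args and enqueue the NDRange as usual ... */

 /* Blocking map: exposes the pixels to the host; on platforms that honor the
    pinned allocation this avoids an extra clEnqueueReadBuffer copy. */
 cl_uchar *pixels = (cl_uchar *)clEnqueueMapBuffer(queue, result, CL_TRUE, CL_MAP_READ,
                                                   0, (size_t)width * height * sizeof(cl_uchar),
                                                   0, NULL, NULL, &err);
 /* ... consume pixels ... */
 clEnqueueUnmapMemObject(queue, result, pixels, 0, NULL, NULL);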
