Here are some OpenCL test results on Intel HD Graphics 400 with 12 compute units, using single-channel 1600 MHz DDR3 RAM:
- image: 1024 x 1024 and 1 byte per pixel
time: 4.5 ms
- image: 2048 x 2048 and 1 byte per pixel
time: 9.7 ms
- image: 4096 x 4096 and 1 byte per pixel
time: 21 ms
- image: 8192 x 8192 and 1 byte per pixel
time: 65 ms
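Kernel code (the number of threads is half the total number of pixels; each thread swaps a pixel in the upper half with its mirror pixel in the lower half and inverts both, so the uppermost line ends up exchanged with the bottommost line):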
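__kernel void test0(__global uchar *imagebuf)
{
    int i = get_global_id(0);   // one work-item per pixel of the upper half
    int height = 8192;          // hard-coded for the largest test image
    int width = 8192;
    int y = i / width;          // row within the upper half
    int x = i % width;          // column
    int top = x + y * width;
    int bottom = x + ((height - y) - 1) * width;   // same column, mirrored row
    uchar tmp = 255 - imagebuf[bottom];
    uchar tmp2 = 255 - imagebuf[top];
    imagebuf[top] = tmp;
    imagebuf[bottom] = tmp2;
}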
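To check the kernel's output, a plain CPU version of the same flip-and-invert can help. This is only a minimal sketch, assuming a row-major buffer with 1 byte per pixel; the function name flip_invert_reference is made up for illustration and is not part of the original test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPU reference (hypothetical helper): for each row y in the upper half,
   swap it with its mirror row (height - 1 - y) and invert every pixel
   (255 - value) while doing so. Assumes a row-major, 1-byte-per-pixel image. */
void flip_invert_reference(uint8_t *img, size_t width, size_t height)
{
    assert(img != NULL);
    for (size_t y = 0; y < height / 2; ++y) {
        uint8_t *top = img + y * width;
        uint8_t *bottom = img + (height - 1 - y) * width;
        for (size_t x = 0; x < width; ++x) {
            uint8_t new_top = (uint8_t)(255 - bottom[x]);
            uint8_t new_bottom = (uint8_t)(255 - top[x]);
            top[x] = new_top;
            bottom[x] = new_bottom;
        }
    }
}
```

Running this on a copy of the input and comparing it byte by byte with the buffer read back from the GPU should show any indexing mistake. With an odd height the middle row is left untouched, which matches launching one thread per pixel of the upper half only.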
Throughput increases for larger images, and the minimum latency depends on the hardware and on how thick the OpenCL wrapper is. This example was run through a fairly thick wrapper.
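The throughput trend can be worked out from the timings above. The helper below is just an illustrative sketch (the name throughput_mb_per_s is made up here); it counts each pixel once and takes 1 MB as 10^6 bytes, so the actual memory traffic is about twice this figure, since every pixel is both read and written.

```c
#include <stddef.h>

/* Effective throughput in MB/s (1 MB = 1e6 bytes) for a square image with
   1 byte per pixel and the given side length, processed in time_ms. */
double throughput_mb_per_s(size_t side, double time_ms)
{
    double bytes = (double)side * (double)side;  /* 1 byte per pixel */
    double seconds = time_ms * 1e-3;
    return bytes / seconds / 1e6;
}
```

For the four measurements this gives roughly 233 MB/s (1024 x 1024 in 4.5 ms), 432 MB/s (2048 x 2048 in 9.7 ms), 799 MB/s (4096 x 4096 in 21 ms) and 1032 MB/s (8192 x 8192 in 65 ms), which is consistent with a fixed launch overhead dominating the small images.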