Here are some OpenCL timings on Intel HD Graphics 400 (12 compute units) with single-channel 1600 MHz DDR3 RAM.
The timings include buffer copies, and no memory mapping was used unless noted. With memory mapping, the time drops considerably.
- image: 1024 x 1024 and 1 byte per pixel
time: 4.5 ms (1.1 ms with mapping)
- image: 2048 x 2048 and 1 byte per pixel
time: 9.7 ms (1.9 ms with mapping)
- image: 4096 x 4096 and 1 byte per pixel
time: 21 ms
- image: 8192 x 8192 and 1 byte per pixel
time: 65 ms
Kernel code (the number of threads is half the total number of pixels; each thread swaps a pixel in the upper half of the image with its mirrored pixel in the lower half, inverting both):
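__kernel void test0(__global uchar *imagebuf)
{
    // Launch with width*height/2 work-items: one per pixel in the upper half.
    int i = get_global_id(0);
    int height = 8192; // matches the 8192 x 8192 test; adjust for other sizes
    int width = 8192;
    int y = i / width;
    int x = i % width;
    int top = x + y * width;                      // pixel in the upper half
    int bottom = x + ((height - y) - 1) * width;  // mirrored pixel in the lower half
    // Swap the two pixels, inverting each one.
    uchar tmp = 255 - imagebuf[bottom];
    uchar tmp2 = 255 - imagebuf[top];
    imagebuf[top] = tmp;
    imagebuf[bottom] = tmp2;
}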
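To check the GPU result, it can help to compare it with a plain CPU version of the same operation. The following is only a sketch of such a reference in C, assuming the intended operation is a vertical flip with inversion, where the mirrored pixel sits at index x + (height - y - 1) * width. The function name flip_invert_reference is hypothetical and not part of the original test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPU reference for the flip-and-invert operation (hypothetical helper).
   Each loop iteration does the work of one work-item: it swaps a pixel in
   the upper half with its mirror in the lower half, inverting both. */
void flip_invert_reference(uint8_t *image, size_t width, size_t height)
{
    /* Assumes an even height so that every row has a partner row. */
    assert(height % 2 == 0);
    for (size_t i = 0; i < width * (height / 2); ++i) {
        size_t y = i / width;
        size_t x = i % width;
        size_t top = x + y * width;
        size_t bottom = x + (height - y - 1) * width;
        uint8_t tmp = (uint8_t)(255 - image[bottom]);
        uint8_t tmp2 = (uint8_t)(255 - image[top]);
        image[top] = tmp;
        image[bottom] = tmp2;
    }
}
```

Running the same input through this function and through the kernel, then comparing the two buffers byte by byte, should show whether the kernel indexes the mirrored row correctly. Applying the function twice restores the original image, which gives a second quick sanity check.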
Throughput increases for larger images, and the minimum latency depends on the hardware and on how thick the OpenCL wrapper is. This example was run through a fairly thick wrapper.
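The throughput trend can be read off the timings listed above. Here is a small C sketch of the arithmetic, assuming 1 byte per pixel and counting each pixel once; the helper name throughput_mb_per_s is hypothetical and only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput in megabytes per second (1 MB = 10^6 bytes) for a
   width x height image at 1 byte per pixel, processed in time_ms milliseconds. */
double throughput_mb_per_s(size_t width, size_t height, double time_ms)
{
    assert(time_ms > 0.0);
    double bytes = (double)width * (double)height;
    return bytes / (time_ms / 1000.0) / 1e6;
}
```

With the unmapped timings above (4.5, 9.7, 21 and 65 ms), this gives roughly 233, 432, 799 and 1032 MB/s for the four image sizes, so the fixed overhead per run matters less as the image grows.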
One of the pros: the CPU can be used for other things while the GPU is computing this.
One of the cons: depending on the image size, the completion time will vary between runs.