Revisions to Saving image data - Code Review Stack Exchange

added 105 characters in body

edited Mar 20, 2017 at 22:45

huseyin tugrul buyukisik

Here is an OpenCL test on Intel HD Graphics 400 with 12 compute units and single-channel 1600 MHz DDR3 RAM:

Timings include buffer copies. The first figure on each line was measured without mapping; the figure in parentheses was measured with memory mapping, which lowers the time.

Edit: fixed the mapped buffer access; it is even faster now (laptop battery nearly empty).

1 byte per pixel
1024 x 1024: 4.5 ms (1.52 ms with mapping)
2048 x 2048: 9.7 ms (4.19 ms with mapping)
4096 x 4096: 21 ms (13.93 ms with mapping)
8192 x 8192: 65 ms (55.16 ms with mapping)
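
To compare the rows of the table, each timing can be converted to an effective throughput. This is a minimal sketch in plain C; the helper name `throughput_gb_per_s` is hypothetical, 1 GB is taken as 1e9 bytes, and the image bytes are counted once even though the measured time covers copies in both directions.

```c
/* Effective throughput in GB/s (1 GB = 1e9 bytes) for processing
   `bytes` bytes in `ms` milliseconds. Hypothetical helper, not part
   of the original benchmark. */
static double throughput_gb_per_s(double bytes, double ms)
{
    double seconds = ms * 1e-3;
    return bytes / seconds / 1e9;
}

/* Number of bytes in a square image with 1 byte per pixel. */
static double image_bytes(int side)
{
    return (double)side * (double)side;
}
```

For example, `throughput_gb_per_s(image_bytes(8192), 65.0)` gives roughly 1.03 GB/s for the 8192 x 8192 copy-based run, while the 1024 x 1024 run at 4.5 ms comes out near 0.23 GB/s, which matches the remark that throughput grows with image size.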

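Kernel code (the number of threads is half the total number of pixels; each thread swaps a pixel in the upper half of the image with its mirror pixel in the lower half, inverting both):

__kernel void test0(__global char *imagebuf)
{
        int i=get_global_id(0);
        // image size is hard-coded (8192 x 8192 here)
        int height=8192;
        int width=8192;
        // row and column of the pixel handled by this thread
        int y=i/width;
        int x=i%width;
        // read the mirrored pair and invert each value
        char tmp=255-imagebuf[((height-y)-1)*width+x];
        char tmp2=255-imagebuf[x+y*width];
        // write them back swapped, which flips the image vertically
        imagebuf[x+y*width]=tmp;
        imagebuf[((height-y)-1)*width+x]=tmp2;
}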
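
To check the kernel's result without a GPU, the same per-thread logic can be emulated on the CPU. This is only a sketch under assumptions: the names `flip_invert_item` and `flip_invert` are hypothetical, width and height are passed as parameters instead of being hard-coded, pixels are treated as unsigned bytes, and the height is assumed to be even (with an odd height the middle row would be left untouched).

```c
#include <stddef.h>

/* CPU reference for one thread of the kernel: swap pixel (x, y) with its
   vertical mirror (x, height-1-y), inverting both values. */
static void flip_invert_item(unsigned char *imagebuf, size_t i,
                             size_t width, size_t height)
{
    size_t y = i / width;
    size_t x = i % width;
    size_t top = y * width + x;
    size_t bottom = (height - 1 - y) * width + x;
    unsigned char tmp  = (unsigned char)(255 - imagebuf[bottom]);
    unsigned char tmp2 = (unsigned char)(255 - imagebuf[top]);
    imagebuf[top] = tmp;
    imagebuf[bottom] = tmp2;
}

/* Emulate the whole launch: one iteration per thread, covering the
   upper half of the image (width * height / 2 threads). */
static void flip_invert(unsigned char *imagebuf, size_t width, size_t height)
{
    size_t items = width * (height / 2);
    for (size_t i = 0; i < items; ++i)
        flip_invert_item(imagebuf, i, width, height);
}
```

Running `flip_invert` on a small buffer and comparing it with the GPU output for the same input is one way to confirm that the number of launched threads really is half the pixel count; launching one thread per pixel would swap every pair twice and undo the flip.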

Throughput increases for larger images, and the minimum latency depends on the hardware and on how thick the OpenCL wrapper is. This example was run on a wrapper that is not thin.

One of the pros: the CPU can be used for other things while the GPU is computing this.

One of the cons: depending on the image size, completion time will vary.

The kernel has a low compute-to-data ratio, so it will not be faster on computers with the same PCIe bandwidth (just tried with an R7 240; 8192 x 8192 took 67 ms).


added 150 characters in body

edited Mar 1, 2017 at 23:44

huseyin tugrul buyukisik

Here is an OpenCL test on Intel HD Graphics 400 with 12 compute units and single-channel 1600 MHz DDR3 RAM:

Timings include buffer copies. The first figure on each line was measured without mapping; the figure in parentheses was measured with memory mapping, which lowers the time.

1 byte per pixel
1024 x 1024: 4.5 ms (2.3 ms with mapping)
2048 x 2048: 9.7 ms (3.7 ms with mapping)
4096 x 4096: 21 ms (8.9 ms with mapping)
8192 x 8192: 65 ms (28.3 ms with mapping)

Kernel code (the number of threads is half the total number of pixels; each thread swaps a pixel in the upper half of the image with its mirror pixel in the lower half, inverting both):

__kernel void test0(__global char *imagebuf)
{
        int i=get_global_id(0);
        int height=8192;
        int width=8192;
        int y=i/width;
        int x=i%width;
        char tmp=255-imagebuf[((height-y)-1)*width+x];
        char tmp2=255-imagebuf[x+y*width];
        imagebuf[x+y*width]=tmp;
        imagebuf[((height-y)-1)*width+x]=tmp2;
}

Throughput increases for larger images, and the minimum latency depends on the hardware and on how thick the OpenCL wrapper is. This example was run on a wrapper that is not thin.

One of the pros: the CPU can be used for other things while the GPU is computing this.

One of the cons: depending on the image size, completion time will vary.

The kernel has a low compute-to-data ratio, so it will not be faster on computers with the same PCIe bandwidth (just tried with an R7 240; 8192 x 8192 took 67 ms).


added 19 characters in body

edited Feb 20, 2017 at 23:02

huseyin tugrul buyukisik

Here is an OpenCL test on Intel HD Graphics 400 with 12 compute units and single-channel 1600 MHz DDR3 RAM:

Timings include buffer copies. The first figure on each line was measured without mapping; the figure in parentheses was measured with memory mapping, which lowers the time.

1 byte per pixel
1024 x 1024: 4.5 ms (2.3 ms with mapping)
2048 x 2048: 9.7 ms (3.7 ms with mapping)
4096 x 4096: 21 ms (8.9 ms with mapping)
8192 x 8192: 65 ms (28.3 ms with mapping)

Kernel code (the number of threads is half the total number of pixels; each thread swaps a pixel in the upper half of the image with its mirror pixel in the lower half, inverting both):

__kernel void test0(__global char *imagebuf)
{
        int i=get_global_id(0);
        int height=8192;
        int width=8192;
        int y=i/width;
        int x=i%width;
        char tmp=255-imagebuf[((height-y)-1)*width+x];
        char tmp2=255-imagebuf[x+y*width];
        imagebuf[x+y*width]=tmp;
        imagebuf[((height-y)-1)*width+x]=tmp2;
}

Throughput increases for larger images, and the minimum latency depends on the hardware and on how thick the OpenCL wrapper is. This example was run on a wrapper that is not thin.

One of the pros: the CPU can be used for other things while the GPU is computing this.

One of the cons: depending on the image size, completion time will vary.


added 19 characters in body

edited Feb 20, 2017 at 22:56

huseyin tugrul buyukisik

added 18 characters in body

edited Feb 20, 2017 at 22:48

huseyin tugrul buyukisik

added 18 characters in body

edited Feb 20, 2017 at 22:41

huseyin tugrul buyukisik

added 115 characters in body

edited Feb 20, 2017 at 22:36

huseyin tugrul buyukisik

answered Feb 20, 2017 at 22:22

huseyin tugrul buyukisik