I am writing some code in C in which I want the optional ability to have certain sections accelerated using OpenMP, with the further option of offloading them to devices such as GPUs. For example, my matrix multiplication function leverages GPU acceleration:
/* ... */
int numeric_matmul(const float_t *pt_a, const float_t *pt_b, float_t *pt_c, uintmax_t t_m, uintmax_t t_k, uintmax_t t_n)
{
#ifdef _OPENMP
#pragma omp target teams distribute parallel for collapse(2) schedule(dynamic) map(to: pt_a[0 : t_m * t_k], pt_b[0 : t_k * t_n]) map(from: pt_c[0 : t_m * t_n])
#endif
    for(uintmax_t l_i = 0; l_i < t_m; l_i++)
    {
        for(uintmax_t l_j = 0; l_j < t_n; l_j++)
        {
            /* Compute the sum. */
            float_t l_sum = 0.0;
            for(uintmax_t l_p = 0; l_p < t_k; l_p++) l_sum += pt_a[l_i * t_k + l_p] * pt_b[l_p * t_n + l_j];
            /* Store the result. */
            pt_c[l_i * t_n + l_j] = l_sum;
        }
    }
    /* Return with success. */
    return 0;
}
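For context, the function is called like any ordinary C routine whether or not OpenMP is enabled; a minimal, illustrative caller (the sizes and values below are just an example, not from my real code):
#include <math.h>   /* float_t */
#include <stdint.h> /* uintmax_t */
#include <stdio.h>

int main(void)
{
    /* Illustrative 2x3 * 3x2 multiplication. */
    float_t l_a[6] = {1, 2, 3, 4, 5, 6};
    float_t l_b[6] = {7, 8, 9, 10, 11, 12};
    float_t l_c[4];
    numeric_matmul(l_a, l_b, l_c, 2, 3, 2);
    /* Expected output: 58 64 / 139 154. */
    printf("%g %g\n%g %g\n", (double)l_c[0], (double)l_c[1], (double)l_c[2], (double)l_c[3]);
    return 0;
}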
And it works fine. However, when I try to use common mathematics-related functions (from math.h), I face an obstacle: how can I use those functions on both the CPU and the GPU?
I have this function here:
/**
 * @brief Perform the sigmoid function on a value.
 * @param t_x The input value.
 * @param pt_y The output value.
 * @return The result status code. In this case, it'll always return 0.
 */
static inline int numeric_sigmoid(float_t t_x, float_t *pt_y)
{
    /* Set the output value to the sigmoid of the input value. */
    *pt_y = 1.0 / (1.0 + expf(-t_x));
    /* Return with success. */
    return 0;
}
This function relies on expf. I want it to be capable of running on both the CPU and the GPU. It runs fine on the CPU side of my codebase, but as soon as I try leveraging the GPU's power:
#pragma omp target teams distribute parallel for schedule(dynamic) map(to: pt_feedforward->ppt_hidden_layer_bias_buffer[l_i][0 : l_next_layer_activation_buffer_size]) map(from: pl_next_layer_activation_buffer[0 : l_next_layer_activation_buffer_size])
for(uintmax_t l_j = 0; l_j < l_next_layer_activation_buffer_size; l_j++)
{
    pl_next_layer_activation_buffer[l_j] += pt_feedforward->ppt_hidden_layer_bias_buffer[l_i][l_j];
    numeric_sigmoid(pl_next_layer_activation_buffer[l_j], &pl_next_layer_activation_buffer[l_j]);
}
I face a runtime error:
libgomp: pointer target not mapped for attach
I get GCC to compile with GPU offloading for my NVIDIA card by passing some additional flags through CMake:
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS="-fopenmp -foffload=nvptx-none -foffload-options=-misa=sm_80 -fcf-protection=none -fno-stack-protector -no-pie" ..
I also tried using -ffast-math, but that didn't seem to do anything either...
1.0 / (1.0 + expf(-t_x)) certainly uses double precision for the computation, so you should use 1.0f / (1.0f + expf(-t_x)) instead. Double precision is generally (much) slower on GPUs (especially on client-side ones). For older GCC versions you also need -foffload=-lm so that the device-side math library gets linked: gcc.gnu.org/wiki/Offloading
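Putting the two hints together, a minimal sketch of the corrected function (the declare target pair is not in the original code: it explicitly compiles the routine for the device as well as the host; newer GCC versions may also mark it implicitly when it is only called from a target region in the same translation unit):
#include <math.h> /* expf, float_t */

#pragma omp declare target
static inline int numeric_sigmoid(float_t t_x, float_t *pt_y)
{
    /* Single-precision literals keep the whole expression in float. */
    *pt_y = 1.0f / (1.0f + expf(-t_x));
    /* Return with success. */
    return 0;
}
#pragma omp end declare target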