I am writing some code in C in which I want the optional ability to have certain sections accelerated using OpenMP, with the further option of offloading them to devices such as GPUs. For example, my matrix multiplication function leverages GPU acceleration:
/* ... */
int numeric_matmul(const float_t *pt_a, const float_t *pt_b, float_t *pt_c, uintmax_t t_m, uintmax_t t_k, uintmax_t t_n)
{
#ifdef _OPENMP
#pragma omp target teams distribute parallel for collapse(2) schedule(dynamic) map(to: pt_a[0 : t_m * t_k], pt_b[0 : t_k * t_n]) map(from: pt_c[0 : t_m * t_n])
#endif
    for(uintmax_t l_i = 0; l_i < t_m; l_i++)
    {
        for(uintmax_t l_j = 0; l_j < t_n; l_j++)
        {
            /* Compute the sum. */
            float_t l_sum = 0.0;
            for(uintmax_t l_p = 0; l_p < t_k; l_p++) l_sum += pt_a[l_i * t_k + l_p] * pt_b[l_p * t_n + l_j];
            /* Store the result. */
            pt_c[l_i * t_n + l_j] = l_sum;
        }
    }
    /* Return with success. */
    return 0;
}
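For context, the function is called like any ordinary C routine whether or not OpenMP is enabled; a minimal, illustrative caller (the sizes and values below are just an example, not from my real code):
#include <math.h>   /* float_t */
#include <stdint.h> /* uintmax_t */
#include <stdio.h>

int main(void)
{
    /* Illustrative 2x3 * 3x2 multiplication. */
    float_t l_a[6] = {1, 2, 3, 4, 5, 6};
    float_t l_b[6] = {7, 8, 9, 10, 11, 12};
    float_t l_c[4];
    numeric_matmul(l_a, l_b, l_c, 2, 3, 2);
    /* Expected output: 58 64 / 139 154. */
    printf("%g %g\n%g %g\n", (double)l_c[0], (double)l_c[1], (double)l_c[2], (double)l_c[3]);
    return 0;
}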
And it works fine. However, when I try to use common mathematics-related functions (from math.h), I face an obstacle: how can I use those functions on both the CPU and the GPU?
I have this function here:
/**
 * @brief Perform the sigmoid function on a value.
 * @param t_x The input value.
 * @param pt_y The output value.
 * @return The result status code. In this case, it'll always return 0.
 */
static inline int numeric_sigmoid(float_t t_x, float_t *pt_y)
{
    /* Set the output value to the sigmoid of the input value. */
    *pt_y = 1.0 / (1.0 + expf(-t_x));
    /* Return with success. */
    return 0;
}
This function relies on expf. I want it to be capable of running on both the CPU and the GPU. It runs fine on the CPU side of my codebase, but as soon as I try leveraging the GPU's power:
#pragma omp target teams distribute parallel for schedule(dynamic) map(to: pt_feedforward->ppt_hidden_layer_bias_buffer[l_i][0 : l_next_layer_activation_buffer_size]) map(from: pl_next_layer_activation_buffer[0 : l_next_layer_activation_buffer_size])
for(uintmax_t l_j = 0; l_j < l_next_layer_activation_buffer_size; l_j++)
{
    pl_next_layer_activation_buffer[l_j] += pt_feedforward->ppt_hidden_layer_bias_buffer[l_i][l_j];
    numeric_sigmoid(pl_next_layer_activation_buffer[l_j], &pl_next_layer_activation_buffer[l_j]);
}
I face a runtime error:
libgomp: pointer target not mapped for attach
I get GCC to compile with GPU offloading for my NVIDIA card by passing some additional flags through CMake:
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS="-fopenmp -foffload=nvptx-none -foffload-options=-misa=sm_80 -fcf-protection=none -fno-stack-protector -no-pie" ..
I also tried using -ffast-math, but that didn't seem to do anything either...
1.0 / (1.0 + expf(-t_x)) certainly uses double precision for the computation, so you should use 1.0f / (1.0f + expf(-t_x)) instead. Double precision is generally (much) slower on GPUs (especially on client-side ones). For older GCC versions you also need -foffload=-lm so that the device-side math library gets linked: gcc.gnu.org/wiki/Offloading
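Putting the two hints together, a minimal sketch of the corrected function (the declare target pair is not in the original code: it explicitly compiles the routine for the device as well as the host; newer GCC versions may also mark it implicitly when it is only called from a target region in the same translation unit):
#include <math.h> /* expf, float_t */

#pragma omp declare target
static inline int numeric_sigmoid(float_t t_x, float_t *pt_y)
{
    /* Single-precision literals keep the whole expression in float. */
    *pt_y = 1.0f / (1.0f + expf(-t_x));
    /* Return with success. */
    return 0;
}
#pragma omp end declare target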