
I'm working on a project accelerating the execution of a Genetic Algorithm both on multiple cores and on a GPU. The algorithm is specifically suited for the solution of the Traveling Salesman Problem in a 2D symmetric Euclidean space.

I've put all the functions in place, starting with the development of the CUDA version for the GPU and following with a simple single-core implementation. As expected, the speed-up of the GPU version was notable and consistent, scaling really well with the total number of individuals used in natural selection.

I then decided to compare both the versions with a Multi-core implementation.


Long story short, my multithreaded versions run really slowly on Ubuntu 22.04.4, but on Windows 10 the same code is almost perfect and scales surprisingly well with increasing threads. Here is a reproducible snippet of code (a 48_cities.txt file is needed):

#include <iostream>
#include <thread>
#include <chrono>
#include <cstdio>   // printf, fopen, fscanf, fclose
#include <cstdlib>  // malloc, rand, srand, exit
#include <ctime>    // time
#include <cmath>    // sqrt, pow, std::ceil

#define pathSize 48
#define popSize 64000
#define subPopSize 32
#define threads 4

void random_shuffle(int *population, int sub_pop_num){
    for (int i = 0; i < sub_pop_num * subPopSize; i++){
        // Avoid last thread to overflow the islands (sub-populations)
        if (i >= popSize) break;
        // shuffle the ith chromosome
        int gene;
        for (int j = 0; j < pathSize; j++){
            int idx1 = (int)((rand()/(RAND_MAX+1.0)) * pathSize);
            int idx2 = (int)((rand()/(RAND_MAX+1.0)) * pathSize);
            gene = population[i*pathSize + idx1];
            population[i*pathSize + idx1] = population[i*pathSize + idx2];
            population[i*pathSize + idx2] = gene;
        }
    }
}

double calculate_distance(int *chromosome, double *distance_matrix){
    double distance = 0.0;
    for (int i = 0; i < pathSize-1; i++){
        distance += distance_matrix[chromosome[i]*pathSize + chromosome[i+1]];
    }
    distance += distance_matrix[chromosome[pathSize-1]*pathSize + chromosome[0]];
    return distance;
}


void calculate_scores(int *population, double *distance_matrix, double *population_fitness, double *population_distances, int sub_pop_num){
    for (int i = 0; i < sub_pop_num * subPopSize; i++){
        // Avoid last thread to overflow the islands (sub-populations)
        if (i >= popSize) break;
        population_distances[i] = calculate_distance(population + i * pathSize, distance_matrix);
        population_fitness[i] = 1000/population_distances[i];
    }
}


void load_data(int *coordinates, const char *filename){
    // read filename
    FILE *file = fopen(filename, "r");
    if (file == NULL){
        printf("Error opening file %s\n", filename);
        exit(1);
    }
    int i = 0;
    while (fscanf(file, "%d %d", &coordinates[i*2], &coordinates[i*2+1]) == 2){
        i++;
    }
    fclose(file);
}

int main(){
    auto start = std::chrono::high_resolution_clock::now();

    //-------------------------------------------------
    // Load the coordinates of the cities from file
    //-------------------------------------------------
    int *path_coordinates = (int*)malloc(pathSize * 2 * sizeof(int));   //[pathSize][2];
    load_data(path_coordinates, "48_cities.txt");

    //-----------------------------------------
    // Allocate and fill the distance matrix
    //-----------------------------------------
    double *distance_matrix = (double*)malloc(pathSize*pathSize*sizeof(double));    //[pathSize][pathSize];
    for(int i = 0; i < pathSize; i++){
        for(int j = 0; j < pathSize; j++){
            distance_matrix[i*pathSize+j] = sqrt(pow(path_coordinates[i*2]-path_coordinates[j*2],2) + pow(path_coordinates[i*2+1]-path_coordinates[j*2+1],2));
        }
    }


    int *population = (int*)malloc(popSize * pathSize * sizeof(int));   //[popSize][pathSize]; - This represents the order of the cities for each chromosome in the population
    for (int i = 0; i < pathSize * popSize; i++){
        population[i] = i % pathSize;
        
    }

    auto load_checkpoint = std::chrono::high_resolution_clock::now();
    printf("Data loaded in %ld ms\n", (long)std::chrono::duration_cast<std::chrono::milliseconds>(load_checkpoint - start).count());


    //---------------------------------------------------------------
    // Random shuffle the population for the first time
    //---------------------------------------------------------------
    srand(time(NULL));  // seed the random number generator

    // measure time to shuffle the population
    auto shuffle_start = std::chrono::high_resolution_clock::now();

    std::thread t[threads];
    int sub_pop_num = (int)std::ceil((double)(popSize / subPopSize) / threads);  // number of subpopulations/islands per thread (cast before ceil, or the division truncates first)

    //-----------------------------------------------------------------------
    // Allocate fitness and distances for each individual and calculate them
    //-----------------------------------------------------------------------
    double *population_fitness = (double*)malloc(popSize*sizeof(double));   //[popSize];
    double *population_distances = (double*)malloc(popSize*sizeof(double)); //[popSize];

    for (int i = 0; i < threads; i++){
        int *pop_start = population + i * sub_pop_num * subPopSize * pathSize;
        double *pop_fit_start = population_fitness + i * sub_pop_num * subPopSize;  // offset must match pop_start (i * subPopSize skips most of the array)
        t[i] = std::thread([pop_start, pop_fit_start, distance_matrix, population_distances, sub_pop_num](){
            random_shuffle(pop_start, sub_pop_num);
            calculate_scores(pop_start, distance_matrix, pop_fit_start, population_distances, sub_pop_num);
        });
    }
    // join the threads
    for (int i = 0; i < threads; i++){
        t[i].join();
    }
    printf("Scores calculated\n");
    
    auto shuffle_end = std::chrono::high_resolution_clock::now();
    printf("Population shuffled in %ld ms\n", (long)std::chrono::duration_cast<std::chrono::milliseconds>(shuffle_end - shuffle_start).count());

    // do the same in a single thread
    shuffle_start = std::chrono::high_resolution_clock::now();
    random_shuffle(population, popSize/subPopSize);
    calculate_scores(population, distance_matrix, population_fitness, population_distances, popSize/subPopSize);
    shuffle_end = std::chrono::high_resolution_clock::now();
    printf("Population shuffled in %ld ms\n", (long)std::chrono::duration_cast<std::chrono::milliseconds>(shuffle_end - shuffle_start).count());

}

And the 48_cities.txt file:

6734 1453
2233   10
5530 1424
 401  841
3082 1644
7608 4458
7573 3716
7265 1268
6898 1885
1112 2049
5468 2606
5989 2873
4706 2674
4612 2035
6347 2683
6107  669
7611 5184
7462 3590
7732 4723
5900 3561
4483 3369
6101 1110
5199 2182
1633 2809
4307 2322
 675 1006
7555 4819
7541 3981
3177  756
7352 4506
7545 2801
3245 3305
6426 3173
4608 1198
  23 2216
7248 3779
7762 4595
7392 2244
3484 2829
6271 2135
4985  140
1916 1569
7280 4899
7509 3239
  10 2676
6807 2993
5185 3258
3023 1942

I expected the timing to decrease with an increasing number of threads, up to the maximum available on the system, but the opposite happens.

  • The next part of the post explains in detail what I tried.

1) My first iteration created a fixed number of std::threads and handed them tasks in the form of lambda functions on each iteration of the genetic algorithm. (Note that, given the structure of the code and the model of genetic algorithm, synchronization points occur on every iteration, allowing for consistent mutations and, periodically, migrations of individuals.)

Here is the part of the while loop that handles evolution:

    int generation = 1;
    printf("Starting the GA...\n");
    while (generation <= generations){
        auto start_gen = std::chrono::high_resolution_clock::now();
        // assign the threads to the genetic_step function and mutate function
        for (int i = 0; i < threads; i++){
            int *pop_start = population + i * sub_pop_num * subPopSize * pathSize;
            double *pop_fit_start = population_fitness + i * sub_pop_num * subPopSize;
            t[i] = std::thread([pop_start, pop_fit_start, sub_pop_num,  distance_matrix](){
                genetic_step(pop_start, pop_fit_start, distance_matrix, sub_pop_num);
                mutation(pop_start, sub_pop_num);
            });
        }
        // join the threads
        for (int i = 0; i < threads; i++){
            t[i].join();
        }

The performance was poor, and the time increased as the number of threads grew instead of decreasing as I expected. Searching for this on Stack Overflow and around the web, I concluded this had to be due to thread creation, assignment and joining overheads, so I decided to employ a thread pool.

2) The thread pool was implemented as follows:

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <pthread.h>  // pthread_setaffinity_np (Linux-specific)

class ThreadPool {
public:
    ThreadPool(int numThreads) : stop(false), completedTasks(0), totalTasks(0) {
        for (int i = 0; i < numThreads; ++i) {
            workers.emplace_back([this, i] {

                set_thread_affinity(i);

                //std::cout << "Thread " << i << " started" << std::endl;
                while (true) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(queueMutex);
                        condition.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty())
                            return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();
                    {
                        std::lock_guard<std::mutex> lock(completedTasksMutex);
                        ++completedTasks;
                        if (completedTasks == totalTasks) {
                            // Separate condition variable for completion: a
                            // std::condition_variable must not be waited on with
                            // two different mutexes concurrently.
                            doneCondition.notify_all();
                        }
                    }
                }
            });
        }
    }

    ~ThreadPool() {
        {
            std::unique_lock<std::mutex> lock(queueMutex);
            stop = true;
        }
        condition.notify_all();
        for (std::thread& thread : workers) {
            thread.join();
        }
    }

    template<class F>
    void enqueue(F&& f) {
        {
            std::unique_lock<std::mutex> lock(queueMutex);
            tasks.emplace(std::forward<F>(f));
        }
        condition.notify_one();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(completedTasksMutex);
        doneCondition.wait(lock, [this] { return completedTasks == totalTasks; });
        completedTasks = 0; // Reset completed tasks after they have all finished
    }

    void setTotalTasks(int numTasks) {
        totalTasks = numTasks;
    }

private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queueMutex;
    std::mutex completedTasksMutex;
    std::condition_variable condition;
    std::condition_variable doneCondition;
    bool stop;
    int completedTasks;
    int totalTasks;

    void set_thread_affinity(int core_id) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core_id, &cpuset);
        pthread_t current_thread = pthread_self();
        if (pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset) != 0) {
            std::cerr << "Error setting thread affinity for core " << core_id << std::endl;
        }
    }
};

And the tasks enqueued accordingly:

    pool.setTotalTasks(threads);
    for (int i = 0; i < threads; i++){
        int *pop_start = population + i * sub_pop_num * subPopSize * pathSize;
        double *pop_fit_start = population_fitness + i * sub_pop_num * subPopSize;  // offset must match pop_start
        pool.enqueue([pop_start, pop_fit_start, distance_matrix, population_distances, sub_pop_num](){
            random_shuffle(pop_start, sub_pop_num);
            calculate_scores(pop_start, distance_matrix, pop_fit_start, population_distances, sub_pop_num);
        });
    }
    
    // Wait for the threads to finish
    pool.wait();

(This implementation uses counters to keep track of the enqueued tasks so that the master thread can wait for their completion.)

But unfortunately, with and without the affinity mechanism described in the set_thread_affinity(int core_id) function, the performance didn't change much: the scalability issue remained, i.e. increasing the number of threads increased the execution time.

3) As a last hope I tried to use OpenMP to parallelize the workload; the same for loop above was then implemented as follows:

#pragma omp parallel for num_threads(threads)
        for (int i = 0; i < threads; i++){
            int *pop_start = population + i * sub_pop_num * subPopSize * pathSize;
            double *pop_fit_start = population_fitness + i * sub_pop_num * subPopSize;
            genetic_step(pop_start, pop_fit_start, distance_matrix, sub_pop_num);
            mutation(pop_start, sub_pop_num);
        }

Luckily, the code structure admitted this means of parallelization, as long as each thread had a defined indexing offset computed on each run. But the performance was still bad.

Investigating deeper into OpenMP directives, I set:

export OMP_PLACES=cores
export OMP_PROC_BIND=master

which mitigated the slowdown with a growing number of threads. Nonetheless, right now this whole code base can't scale in performance on my Ubuntu machine.


From this consideration stems the question: I tried the first two versions on another Windows 10 machine and they scaled surprisingly well (note that thread affinity wasn't used in the thread-pool version), obtaining up to a 10x acceleration on the workload with a sufficient number of threads. I'm now wondering if I'm missing a fundamental step in running a multithreaded application on Ubuntu.

The entire codebase (repo still in progress) can be found here.

PREVIOUSLY POSTED ON STACKOVERFLOW

  • You need to state what devices you test on. The change in performance might have been caused by different hardware. For instance, memory-heavy tasks do not benefit from multithreading unless you have a computer with a large memory bus. Modern laptops tend not to possess that, and the memory could be saturated by a single core.
    – ALX23z, commented Apr 21, 2024 at 11:39
  • @ALX23z Thank you for the correct feedback: my Ubuntu laptop runs on a PCI Express 5 architecture and the Windows 10 machine is on gen 3. Both machines have 32 GB of RAM, with the laptop's running at 5600 MHz against the 3000 MHz of the other. The performance issues were due to the random number generation, as suggested in the answer below! – Commented Apr 21, 2024 at 17:41

1 Answer

Use C++'s random number generators instead of rand()

Not only is this good advice in general, it's actually the main source of your performance problems. You use rand() a lot, and currently glibc, the standard library used on Linux, internally uses a mutex to prevent multiple simultaneous calls to rand(), to ensure the state of the pseudo-random number generator it uses does not get corrupted.

C++ comes with its own random number generator facilities. In particular, you should use a fast generator, like std::mt19937_64. Its quality is also vastly better than that of rand(), although that might not matter for your simulation. Make sure you create one generator object per thread! (Or you can cheat by making a thread_local one.)

Apart from that, generating a useful value is also much nicer in C++. If you want a double between 0 and 1, use std::uniform_real_distribution{0.0, 1.0}, and if you want a random path index, use std::uniform_int_distribution{0, pathSize - 1}.

In general, prefer C++ functions over C functions

So instead of printf(), use std::cout <<, possibly in combination with std::format(). Or coming soon to a compiler near you, C++23's std::print().

For reading and writing files, use std::ifstream and std::ofstream.

Avoid macros

Use static constexpr to define constants instead of using #define. Apart from avoiding some macro pitfalls, you also get to give the values a type.

  • The main performance issue, as another user also mentioned on Stack Overflow, was due to the random number generation; I solved it using std::mt19937! Thank you so much for the help. Regarding the use of C functions: unfortunately, this code is used along with some other versions for an overall performance comparison, so for similarity it's currently both ugly and mixed in syntax, but I appreciate the suggestions, and a definitive version will for sure be adapted to run the best way through C++. Thank you again! – Commented Apr 21, 2024 at 17:36
