How to optimize GNU parallel for this use?

Question

I wrote this script out of boredom, purely to try out GNU parallel, so I know it's not particularly useful or optimized. It calculates all prime numbers up to n:

#!/usr/bin/env bash

isprime () {
    local n=$1
    ((n==1)) && return 1
    for ((i=2;i<n;i++)); do
        if ((n%i==0)); then
            return 1
        fi
    done
    printf '%d\n' "$n"
}

for ((f=1;f<=$1;f++)); do
    isprime "$f"
done

When run with the loop:

$ time ./script.sh 5000 >/dev/null

real    0m28.875s
user    0m38.818s
sys     0m29.628s

I would expect replacing the for loop with GNU parallel would make this run significantly faster but that has not been my experience. On average it's only about 1 second faster:

#!/usr/bin/env bash

isprime () {
    local n=$1
    ((n==1)) && return 1
    for ((i=2;i<n;i++)); do
        if ((n%i==0)); then
            return 1
        fi
    done
    printf '%d\n' "$n"
}

export -f isprime

seq 1 $1 | parallel -j 20 -N 1 isprime {}

Run with parallel:

$ time ./script.sh 5000 >/dev/null

real    0m27.655s
user    0m38.145s
sys     0m28.774s

I'm not really interested in optimizing the isprime() function, I am just wondering if there is something I can do to optimize GNU parallel?

In my testing seq actually runs faster than for ((i=1...)) so I don't think that has much if anything to do with the runtime

Interestingly, if I modify the for loop to:

for ((f=1;f<=$1;f++)); do
    isprime "$f" &
done | sort -n

It runs even quicker:

$ time ./script.sh 5000 >/dev/null

real    0m5.995s
user    0m33.229s
sys     0m6.382s

CPU usage peaked at about 20% with the for loop and about 60% with GNU parallel. — jesse_b
– jesse_b, Commented Jan 25, 2020 at 15:10
So the conclusion is that parallel takes about 400ms to start a bash, run a function that should take a few ms to run on average, and reap the process. In general you should arrange for jobs to do more work and have fewer jobs, but I am surprised by the overhead. — icarus
– icarus, Commented Jan 25, 2020 at 15:40

Ole Tange · Accepted Answer · 2020-01-29 10:16:07Z

GNU Parallel spends 2-10 ms overhead per job. It can be lowered a bit by using -u, but that means you may get output from different jobs mixed.

GNU Parallel is not ideal if your jobs are in the ms range and runtime matters: The overhead will often be too big.
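
A rough way to see this overhead on your own machine is to time a batch of trivial jobs and divide by the job count. The sketch below is only an illustration: `elapsed_ms` is a hypothetical helper, it assumes GNU `date` (for `%N`), and the figure it prints includes process startup as well as GNU Parallel's own bookkeeping.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command and print its wall-clock time in ms.
# Assumes GNU date, which supports %N (nanoseconds).
elapsed_ms () {
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

njobs=200
if command -v parallel >/dev/null 2>&1; then
    # -j1 runs the jobs one at a time, so total/njobs approximates the
    # cost of a single job that itself does no work.
    total=$(seq "$njobs" | elapsed_ms parallel -j1 true)
    echo "about $(( total / njobs )) ms per job"
fi
```

If the number printed is close to the runtime of your real jobs, batching several inputs into each job is likely to help more than tuning flags.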

You can spread the overhead to multiple cores by running multiple GNU Parallels:

seq 5000 | parallel --pipe --round-robin -N100 parallel isprime

You still pay the overhead, but now you at least have more cores to pay with.

A better way would be to change isprime so that it takes multiple inputs and thus takes longer to run:

isprime() {
  _isprime () {
      local n=$1
      ((n==1)) && return 1
      for ((i=2;i<n;i++)); do
          if ((n%i==0)); then
              return 1
          fi
      done
      printf '%d\n' "$n"
  }
  for t in "$@"; do
    _isprime $t
  done
}
export -f isprime

seq 5000 | parallel -X isprime
# If you do not care about order, this is faster because higher numbers always take more time
seq 5000 | parallel --shuf -X isprime

With the new isprime() exported, this: seq 5000 |shuf| parallel -N1250 isprime | sort -n really runs faster: 1.9 s (compared to 5s or 2.5s in my answer). WITHOUT shuf | (or --shuf) it takes longer (2.7s). Thanks for that simple syntax parallel -X FUNC. Is it so hard to find a better example? Now that for once it is not IO-limited, it is a slowed-down algorithm. — user373503
– user373503, Commented Jan 29, 2020 at 12:01

ctrl-alt-delor · Accepted Answer · 2020-01-25 16:23:34Z

I will not dwell on optimising isprime itself, such as iterating only up to the square root of n.

I suspect that the version with parallel is spending a significant amount of time starting processes. Therefore, break the work up into bigger chunks: n/number_of_cpus numbers per chunk should be the fastest (if each chunk takes the same time). Try a few chunk sizes and see what happens.

You will have to adapt your script to take a lower bound and a count.

For example, if you have 5 cores, arrange for parallel to run:

./script    0 1000 &
./script 1000 1000 &
./script 2000 1000 &
./script 3000 1000 &
./script 4000 1000 &
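
One way to adapt the script is sketched below. The function name `primes_in_range` and the chunk layout are assumptions for illustration; the primality test is the question's unoptimized trial division, inlined. Each chunk runs as one background process, so the process-startup cost is paid 5 times rather than 5000 times.

```shell
#!/usr/bin/env bash

# Print the primes in [lower, lower+count), using the question's trial division.
primes_in_range () {
    local lower=$1 count=$2 n i
    for ((n=lower; n<lower+count; n++)); do
        ((n < 2)) && continue               # 0 and 1 are not prime
        for ((i=2; i<n; i++)); do
            ((n % i == 0)) && continue 2    # divisor found: try the next n
        done
        printf '%d\n' "$n"
    done
}

# Five chunks of 1000, one background process each (assumes about 5 cores).
# The pipe into sort stays open until every background job has finished.
for lower in 0 1000 2000 3000 4000; do
    primes_in_range "$lower" 1000 &
done | sort -n
```

Because larger numbers take longer to test, equal-sized ranges will not finish at the same time, so it may be worth trying more, smaller chunks.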

edited Jan 25, 2020 at 16:23

answered Jan 25, 2020 at 16:01


Interestingly iterating to the square root is actually slower (probably due to the fact that it has to spawn another process for awk or bc for each number).
– jesse_b, Commented Jan 25, 2020 at 16:03
@jesse_b How does for((i=2;i*i<=n;i++)) compare?
– icarus, Commented Jan 25, 2020 at 16:43
@icarus: That makes the for loop run in 0.6 seconds but parallel still takes 28-29.
– jesse_b, Commented Jan 25, 2020 at 16:57


user373503 · Accepted Answer · 2020-01-26 21:14:23Z

By changing the main for loop:

for ((f=1;f<=$1;f+=2)); do
    isprime $f &
    isprime $((f+1))
done

it runs a bit faster

]# time ./jj.sh 5000 |wc
    669     669    3148

real    0m2.537s
user    0m8.109s
sys     0m1.374s

than without the &:

real    0m5.758s
user    0m5.761s
sys     0m0.007s

or with only background calls &:

real    0m3.298s
user    0m10.743s
sys     0m1.869s

So while the ampersand took you from 28s to 5s, I went from 5s to 3s.

I also tried 2 with ampersand and 1 without, but this is already getting slower.

]# time ./jj.sh 5000 |wc
^C

real    0m17.668s
user    0m17.576s
sys     0m1.344s

The slowdown is dramatic (note the ^C in the run above) if the ampersand is on the second call only:

for ((f=1;f<=$1;f+=2)); do
    isprime $f
    isprime $((f+1)) &
done

This seems a bit confusing.
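
Mixing foreground and background calls is an indirect way of limiting how many jobs run at once. A more explicit variant is sketched below, as an assumption-laden illustration rather than a measured improvement: `run_bounded` is a hypothetical helper, and `wait -n` requires bash 4.3 or newer. It still forks once per number, so it does not remove the per-process cost.

```shell
#!/usr/bin/env bash

# The question's isprime, with the loop counter made local.
isprime () {
    local n=$1 i
    ((n==1)) && return 1
    for ((i=2;i<n;i++)); do
        ((n%i==0)) && return 1
    done
    printf '%d\n' "$n"
}

# Hypothetical helper: test 1..max with at most `limit` background jobs in flight.
run_bounded () {
    local max=$1 limit=$2 f running=0
    for ((f=1; f<=max; f++)); do
        isprime "$f" &
        if (( ++running >= limit )); then
            wait -n            # block until any one background job exits
            (( running-- ))
        fi
    done
    wait                       # collect the jobs that are still running
}

# Small demonstration; results arrive out of order, hence the sort.
run_bounded 100 4 | sort -n
```

Trying a few values for the limit, for example the number of cores, should show whether the slowdowns above come from too many or too few concurrent jobs.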

By using only the primes found so far as divisors, you can speed this up by a factor of 20:

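max=5000
max2=75            # just above sqrt(max): primes below this suffice as divisors
primes=(3)         # odd primes found so far; only odd n are tested, so 2 is not needed
echo 2; echo 3

for ((n=5; n<max; n+=2))
do  size=${#primes[@]}
    for ((pi=0; pi<size; pi++))
    do  p=${primes[pi]}
        if (( n % p == 0 ))           # divisible by a smaller prime: composite
        then break
        fi
        if (( p * p > n ))            # no prime divisor up to sqrt(n): n is prime
        then echo "$n"
             (( n < max2 )) && primes+=("$n")
             break
        fi
    done
done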

This gives:

]# time . prim.sh |wc
    669     669    3148

real    0m0.126s
user    0m0.142s
sys     0m0.001s

And the same thing in perl:

]# time perl prim.pl | wc
    668    1336    6486

real    0m0.008s
user    0m0.009s
sys     0m0.001s

(first line looks like *** 3, so wc output is normal)

But this algorithm is more difficult to parallelize: isprime() has to have access to the (growing) list of prime numbers (up to sqrt).

Maybe factor (a command fighting against section 6 :) would be useful as a standard functional unit. Then you can feed it with different "chunks".

]# time seq 2 5000 |factor |sed '/ .* /d' |cut -f1 -d':' |wc 
    669     669    3148

real    0m0.008s
user    0m0.014s
sys     0m0.005s

The sed deletes lines with more than one space (i.e. more than one factor).

But then again, it is too fast to be helped:

]# time seq 900000000002 900000005000 | factor  |wc
   4999   26848  163457

real    0m0.031s
user    0m0.035s
sys     0m0.003s
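
If `factor` were used as the unit of work, feeding it chunks through GNU parallel could look roughly like the sketch below. The helper name `primes_in_chunk` and the chunk size of 1000 are assumptions for illustration, and `awk` stands in for the `sed`/`cut` pair above. For ranges this small the splitting overhead will probably outweigh any gain.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read integers on stdin, print those that are prime.
# factor prints "n: p1 p2 ..."; a prime has exactly one factor, so two fields.
primes_in_chunk () {
    factor | awk 'NF == 2 { print $2 }'
}
export -f primes_in_chunk

if command -v parallel >/dev/null 2>&1; then
    # --pipe splits stdin into blocks of 1000 lines (-N1000), one job per block;
    # -k keeps the output in input order.
    seq 2 5000 | parallel --pipe -N1000 -k primes_in_chunk | wc -l
fi
```

Here each job handles 1000 numbers, so the per-job overhead is spread over far more work than with one number per job.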

The question is about optimizing GNU parallel, calculating prime numbers is only an example. — jesse_b
– jesse_b, Commented Jan 26, 2020 at 23:47
