Here is the solution to the much more general problem of parallelizing any loop cleanly.
Just let each thread pull items (pairs of a,b in your case) one after another and compute the partial result until all are consumed.
Main thread:
output = 0
iter = iterator_over(A,B)
// start threads and wait until done
answer = output / size(A) / size(B)
return answer
Each thread:
res = 0
while true:
synchronized:
if !iter.hasNext():
break
a,b = it.next()
else:
break
res += f(a, b, n)
synchronized:
output += res
For optimal performance, the amount of threads should be the same as the amount of cpu cores.