
I am testing a simple program with perf stat:

#include <cassert>
#include <cstdint>   // int64_t
#include <cstdlib>   // atoll
#include <iostream>

int main(int argc, const char* argv[]) {
    assert(argc == 3);
    int64_t iters = atoll(argv[1]);
    int64_t step = atoll(argv[2]);

    int64_t value = 0;

    for(int64_t i = 0; i < iters; ++i) {
        value += step;
    }
    std::cout << value << std::endl;

    return 0;
}

It is easy to see on godbolt (https://godbolt.org/z/d831q1c9W) that the loop contains only one repeated conditional branch (jge .LBB0_4 in the linked listing, I guess).

The build command I use is:

clang++ -std=c++2b bp.cpp tp2.cpp -o bp.exe -Wall -O0 -DNDEBUG

If I run

perf stat ./bp.exe 10000000 700 500 1013 2>&1 | grep branches | tee run4_1.txt
perf stat ./bp.exe 20000000 700 500 1013 2>&1 | grep branches | tee run4_2.txt
perf stat ./bp.exe 30000000 700 500 1013 2>&1 | grep branches | tee run4_3.txt

It outputs:

+ perf stat ./bp_arc.exe 10000000 700
+ grep branches
+ tee run4_1_arc.txt
          15516156      branches:u                #  483.498 M/sec                    (66.64%)
              4390      branch-misses:u           #    0.03% of all branches          (63.53%)
+ perf stat ./bp_arc.exe 20000000 700
+ grep branches
+ tee run4_2_arc.txt
          30860233      branches:u                #  534.704 M/sec                    (67.69%)
              6400      branch-misses:u           #    0.02% of all branches          (65.99%)
+ perf stat ./bp_arc.exe 30000000 700
+ grep branches
+ tee run4_3_arc.txt
          47616449      branches:u                #  535.042 M/sec                    (67.21%)
              6531      branch-misses:u           #    0.01% of all branches          (66.81%)

So the number of branches reported by perf stat is about 1.5 times the actual number of loop iterations.
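To make the ratio concrete, here is a small sketch that divides the reported branches:u counts by the iteration counts of the three runs above. The struct and function names are mine and purely illustrative; the numbers are copied from the perf stat output. It also computes the increment between consecutive runs, which cancels any fixed startup cost (assuming that cost is the same in every run).

```cpp
#include <cstdint>
#include <cstdio>

// One perf stat run: iterations passed as argv[1] and the reported branches:u count.
struct Run {
    std::int64_t iters;
    std::int64_t branches;
};

// Values copied from the perf stat output above.
constexpr Run kRuns[] = {
    {10000000, 15516156},
    {20000000, 30860233},
    {30000000, 47616449},
};

// Reported branches per loop iteration for a single run.
constexpr double branches_per_iter(const Run& r) {
    return static_cast<double>(r.branches) / static_cast<double>(r.iters);
}

// Branches per iteration between two runs; a constant startup cost cancels out.
constexpr double marginal_branches_per_iter(const Run& a, const Run& b) {
    return static_cast<double>(b.branches - a.branches) /
           static_cast<double>(b.iters - a.iters);
}

// Prints both ratios for all runs.
void print_ratios() {
    for (const Run& r : kRuns) {
        std::printf("%lld iters: %.3f branches/iter\n",
                    static_cast<long long>(r.iters), branches_per_iter(r));
    }
    std::printf("run 1 -> 2: %.3f branches per extra iteration\n",
                marginal_branches_per_iter(kRuns[0], kRuns[1]));
    std::printf("run 2 -> 3: %.3f branches per extra iteration\n",
                marginal_branches_per_iter(kRuns[1], kRuns[2]));
}
```

Calling print_ratios() from main gives per-run ratios of roughly 1.55, 1.54 and 1.59, and the increments between runs are also above 1.5, so a fixed startup overhead alone does not account for the difference.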

In another setup I see a multiplier of about 2.1.

So the question is: which kinds of branches does the perf stat branches counter include, and what explains the difference between the number of loop checks visible in the asm and the reported count?

  • I suspect this has something to do with hardware branch prediction. Commented Apr 25 at 17:14
  • I've found perf to be one of the buggiest parts of the kernel - it's extremely hardware-dependent, and I suspect something about it's not getting saved/restored properly when rescheduling happens (which is pretty frequent). It's quite possible that the recorded stats are actually a mix of your process and some other process (including the idle process) or kernel interrupts or ... Commented Apr 25 at 18:48
  • There is a lot more happening in the process than just the loop in your code. First ld has to load your binary and the shared libraries. Then you are also calling shared lib functions. So it makes sense that the count is much higher than the number of iterations in your code. P.S. As an answer in a related question suggested by SO says, use perf_event_open. Commented Apr 28 at 5:54
  • Yes, other activity can explain a constant difference, but not a multiplicative one. Commented 2 days ago
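Following the perf_event_open suggestion in the comments, a minimal Linux-only sketch of counting retired branch instructions around just the loop might look like this. The helper names open_branch_counter and count_loop_branches are hypothetical, and the volatile accumulator is my assumption to keep the loop from being folded away at higher optimization levels. The counter can be unavailable (for example in a VM, in a container, or under a restrictive perf_event_paranoid setting), in which case the sketch returns -1 instead of a count.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>

// Hypothetical helper: opens a user-space-only counter for retired branch
// instructions of the calling thread. Returns a file descriptor, or -1 on failure.
int open_branch_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
    attr.disabled = 1;        // start stopped, enable right before the loop
    attr.exclude_kernel = 1;  // user space only, like branches:u
    attr.exclude_hv = 1;
    // pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags.
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

// Hypothetical helper: runs the loop from the question and returns the number
// of branch instructions counted while it ran, or -1 if no counter is available.
// The loop result is always written to *value_out.
std::int64_t count_loop_branches(std::int64_t iters, std::int64_t step,
                                 std::int64_t* value_out) {
    int fd = open_branch_counter();
    if (fd >= 0) {
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    }

    volatile std::int64_t value = 0;  // assumption: volatile keeps the loop intact
    for (std::int64_t i = 0; i < iters; ++i) {
        value = value + step;
    }

    std::int64_t result = -1;
    if (fd >= 0) {
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t count = 0;
        // With the default read_format, read() yields a single 64-bit count.
        if (read(fd, &count, sizeof(count)) == static_cast<ssize_t>(sizeof(count))) {
            result = static_cast<std::int64_t>(count);
        }
        close(fd);
    }
    *value_out = value;
    return result;
}
```

Dividing the returned count by iters gives branches per iteration for the loop alone, without the loader, libc startup, or std::cout. Note that the hardware event counts every retired branch instruction, so an unconditional jump at the end of the loop body would be included along with the conditional jge.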
