Question

I witnessed the following weird behavior. I have two functions, which do almost the same - they measure the number of cycles it takes to do a certain operation. In one function, inside the loop I increment a variable; in the other nothing happens. The variables are volatile so they won't be optimized away. These are the functions:

unsigned int _osm_iterations=5000;

double osm_operation_time(){
    // volatile is used so that j will not be optimized, and ++ operation
    // will be done in each loop
    volatile unsigned int j=0;
    volatile unsigned int i;
    tsc_counter_t start_t, end_t;
    start_t = tsc_readCycles_C();
    for (i=0; i<_osm_iterations; i++){
       ++j;
    }
    end_t = tsc_readCycles_C();
    if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
         return -1;
    return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
}

double osm_empty_time(){
    volatile unsigned int i;
    volatile unsigned int j=0;
    tsc_counter_t start_t, end_t;
    start_t = tsc_readCycles_C();
    for (i=0; i<_osm_iterations; i++){
        ;
    }
    end_t = tsc_readCycles_C();
    if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
        return -1;
    return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
}

There are some non-standard functions there but I'm sure you'll manage.

The thing is, the first function returns 4, while the second function (which supposedly does less) returns 6, although the second one obviously does less than the first one.

Does that make any sense to anyone?

Actually I made the first function so I could reduce the loop overhead for my measurement of the second. Do you have any idea how to do that (as this method doesn't really cut it)?

I'm on Ubuntu (64 bit I think).

Thanks a lot.

Was it helpful?

Solution

I can see a couple of things here. One is that the code for the two loops looks identical. Secondly, the compiler will probably realise that the variable i and the variable j will always have the same value and optimise one of them away. You should look at the generated assembly and see what is really going on.

Another theory is that the change to the inner body of the loop has affected the cachability of the code - this could have moved it across cache lines or some other thing.

Since the code is so trivial, you may find it difficult to get an accuate timing value, even if you are doing 5000 iterations, you may find that the time is inside the margin for error for the timing code you are using. A modern computer can probably run that in far less than a millisecond - perhaps you should increase the number of iterations?

To see the generated assembly in gcc, specify the -S compiler option:

Q: How can I peek at the assembly code generated by GCC?

Q: How can I create a file where I can see the C code and its assembly translation together?

A: Use the -S (note: capital S) switch to GCC, and it will emit the assembly code to a file with a .s extension. For example, the following command:

gcc -O2 -S -c foo.c

will leave the generated assembly code on the file foo.s.

If you want to see the C code together with the assembly it was converted to, use a command line like this:

gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst

which will output the combined C/assembly listing to the file foo.lst.

OTHER TIPS

This kind of thing depends heavily on compiler optimizations and on timer resolution. The results you provide (4 and 6) are kind of low, regardless of the unit. To benchmark properly you should wrap both of these functions in a loop that executes them a few thousand times.

It's sometimes difficult to guess at this sort of thing, especially due to the small number of iterations. One thing that might be happening, though, is the increment could be executing on a free integer execution unit, gaining some slight degree of parallelism, since it has no dep on the value of i.

Since you mentioned this was 64 bit os, it's almost certain all these values are in registers, since there's more registers in the x86_64 architecture. Other than that, i'd say perform many more iterations, and see how stable the results are.

If you are truly trying to test the operation of a piece of code ("j++;" in this case), you're actually better off doing the following:

1/ Do it in two separate executables since there is a possibility that position within the executable may affect the code.

2/ Make sure you use CPU time rather than elapsed time (I'm not sure what "tsc_readCycles_C()" gives you). This is to avoid errant results from a CPU loaded up with other tasks.

3/ Turn off compiler optimization (e.g., "gcc -O0") to ensure gcc doesn't put in any fancy stuff that's likely to skew the results.

4/ You don't need to worry about volatile if you use the actual result, such as placing:

printf ("%d\n",j);

after the loop, or:

FILE *fx = fopen ("/dev/null","w");
fprintf (fx, "%d\n", j);
fclose (fx);

if you don't want any output at all. I can't remember whether volatile was a suggestion to the compiler or enforced.

5/ Iterations of 5,000 seem a little on the low side, where "noise" could affect the readings. Maybe a higher value would be better. This may not be an issue if you're timing a larger piece of code and you've just included "j++;" as a place-holder.

When I'm running tests similar to this, I normally:

  1. Ensure that the times are measured in at least seconds, preferably (small) tens of seconds.
  2. Have a single run of the program call the first function, then the second, then the first again, then the second again, and so on, just to see if there are weird cache warmup issues.
  3. Run the program multiple times to see how stable the timing is across runs.

I'm still at a loss to explain your observed results, but if you're sure you've got your functions identified properly (not self-evidently the case given that there were copy'n'paste errors earlier, for example), then looking at the assembler output is the main option left.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top