Why is if (variable1 % variable2 == 0) inefficient?

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP












162















I am new to java, and was running some code last night, and this really bothered me. I was building a simple program to display every X outputs in a for loop, and I noticed a MASSIVE decrease in performance, when I used modulus as variable % variable vs variable % 5000 or whatnot. Can someone explain to me why this is and what is causing it? So I can be better...



Here is the "efficient" code (sorry if I get a bit of syntax wrong I'm not on the computer with the code right now)



long startNum = 0;
long stopNum = 1000000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
System.out.println(i);




Here is the "inefficient code"



long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);




Mind you I had a date variable to measure the differences, and once it became long enough, the first one took 50ms while the other took 12 seconds or something like that. You may have to increase the stopNum or decrease the progressCheck if your PC is more efficient than mine or what not.



I looked for this question around the web, but I can't find an answer, maybe I'm just not asking it right.



EDIT:
I did not expect my question to be so popular, I appreciate all the answers. I did perform a benchmark on each half in time taken, and the inefficient code took considerably longer, 1/4 of a second vs 10 seconds give or take. Granted they are using println, but they are both doing the same amount, so I wouldn't imagine that would skew it much, especially since the discrepancy is repeatable. As for the answers, since I am new to Java, I will let votes decide for now which answer is best. I will try to pick one by Wednesday.



EDIT2:
I am going to make another test tonight, where instead of modulus, it just increments a variable, and when it reaches progressCheck, it will perform one, and then reset that variable to 0. for a 3rd option.



EDIT3.5:



I used this code, and below I'll show my results.. Thank you ALL for the wonderful help! I also tried comparing the short value of the long to 0, so all my new checks happen ever "65536" times making it equal in repeats.



public class Main 


public static void main(String args)

long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 65536;
final long finalProgressCheck = 50000;
long date;

// using a fixed value
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if (i % 65536 == 0)
System.out.println(i);


long final1 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
//using a variable
for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);


long final2 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();

// using a final declared variable
for (long i = startNum; i <= stopNum; i++)
if (i % finalProgressCheck == 0)
System.out.println(i);


long final3 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
// using increments to determine progressCheck
int increment = 0;
for (long i = startNum; i <= stopNum; i++)
if (increment == 65536)
System.out.println(i);
increment = 0;

increment++;



//using a short conversion
long final4 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if ((short)i == 0)
System.out.println(i);


long final5 = System.currentTimeMillis() - date;

System.out.println(
"nfixed = " + final1 + " ms " + "nvariable = " + final2 + " ms " + "nfinal variable = " + final3 + " ms " + "nincrement = " + final4 + " ms" + "nShort Conversion = " + final5 + " ms");




Results:



  • fixed = 874 ms (normally around 1000ms, but faster due to it being a power of 2)

  • variable = 8590 ms

  • final variable = 1944 ms (Was ~1000ms when using 50000)

  • increment = 1904 ms

  • Short Conversion = 679 ms

Not surprising enough, due to a lack of division, the Short Conversion was 23% faster than the "fast" way. This is interesting to note. If you need to show or compare something every 256 times (or about there) you can do this, and use



if ((byte)integer == 0) 'Perform progress check code here'


ONE FINAL INTERESTING NOTE, using modulus on the "Final declared Variable" with 65536 (not a pretty number) was half the speed of (slower) than the fixed value. Where before it was benchmarking near the same speed.










share|improve this question



















  • 27





    I got the same result actually. On my machine, the first loop runs in about 1,5 seconds and the second runs in about 9 seconds. If I add final in front of the progressCheck variable, both run at the same speed again. That leads me to believe that the compiler or the JIT manages to optimize the loop when it knows that progressCheck is constant.

    – marstran
    Jan 28 at 16:07






  • 6





    Related: How do I write a correct micro-benchmark in Java?

    – jaco0646
    Jan 28 at 16:21






  • 23





    Division by a constant can be easily converted to a multiplication by the multiplicative inverse. Division by a variable can't. And a 32-bit division is faster than a 64-bit division on x86

    – phuclv
    Jan 28 at 16:27






  • 2





    @phuclv note 32-bit division is not a issue here, it is a 64-bit remainder operation in both cases

    – Carlos Heuberger
    Jan 28 at 16:36






  • 4





    @RobertCotterman if you declare the variable as final, the compiler creates the same bytecode as using the constant (eclipse/Java 11) ((despite using one more memory slot for the variable))

    – Carlos Heuberger
    Jan 28 at 16:39
















162















I am new to java, and was running some code last night, and this really bothered me. I was building a simple program to display every X outputs in a for loop, and I noticed a MASSIVE decrease in performance, when I used modulus as variable % variable vs variable % 5000 or whatnot. Can someone explain to me why this is and what is causing it? So I can be better...



Here is the "efficient" code (sorry if I get a bit of syntax wrong I'm not on the computer with the code right now)



long startNum = 0;
long stopNum = 1000000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
System.out.println(i);




Here is the "inefficient code"



long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);




Mind you I had a date variable to measure the differences, and once it became long enough, the first one took 50ms while the other took 12 seconds or something like that. You may have to increase the stopNum or decrease the progressCheck if your PC is more efficient than mine or what not.



I looked for this question around the web, but I can't find an answer, maybe I'm just not asking it right.



EDIT:
I did not expect my question to be so popular, I appreciate all the answers. I did perform a benchmark on each half in time taken, and the inefficient code took considerably longer, 1/4 of a second vs 10 seconds give or take. Granted they are using println, but they are both doing the same amount, so I wouldn't imagine that would skew it much, especially since the discrepancy is repeatable. As for the answers, since I am new to Java, I will let votes decide for now which answer is best. I will try to pick one by Wednesday.



EDIT2:
I am going to make another test tonight, where instead of modulus, it just increments a variable, and when it reaches progressCheck, it will perform one, and then reset that variable to 0. for a 3rd option.



EDIT3.5:



I used this code, and below I'll show my results.. Thank you ALL for the wonderful help! I also tried comparing the short value of the long to 0, so all my new checks happen ever "65536" times making it equal in repeats.



public class Main 


public static void main(String args)

long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 65536;
final long finalProgressCheck = 50000;
long date;

// using a fixed value
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if (i % 65536 == 0)
System.out.println(i);


long final1 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
//using a variable
for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);


long final2 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();

// using a final declared variable
for (long i = startNum; i <= stopNum; i++)
if (i % finalProgressCheck == 0)
System.out.println(i);


long final3 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
// using increments to determine progressCheck
int increment = 0;
for (long i = startNum; i <= stopNum; i++)
if (increment == 65536)
System.out.println(i);
increment = 0;

increment++;



//using a short conversion
long final4 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if ((short)i == 0)
System.out.println(i);


long final5 = System.currentTimeMillis() - date;

System.out.println(
"nfixed = " + final1 + " ms " + "nvariable = " + final2 + " ms " + "nfinal variable = " + final3 + " ms " + "nincrement = " + final4 + " ms" + "nShort Conversion = " + final5 + " ms");




Results:



  • fixed = 874 ms (normally around 1000ms, but faster due to it being a power of 2)

  • variable = 8590 ms

  • final variable = 1944 ms (Was ~1000ms when using 50000)

  • increment = 1904 ms

  • Short Conversion = 679 ms

Not surprising enough, due to a lack of division, the Short Conversion was 23% faster than the "fast" way. This is interesting to note. If you need to show or compare something every 256 times (or about there) you can do this, and use



if ((byte)integer == 0) 'Perform progress check code here'


ONE FINAL INTERESTING NOTE, using modulus on the "Final declared Variable" with 65536 (not a pretty number) was half the speed of (slower) than the fixed value. Where before it was benchmarking near the same speed.










share|improve this question



















  • 27





    I got the same result actually. On my machine, the first loop runs in about 1,5 seconds and the second runs in about 9 seconds. If I add final in front of the progressCheck variable, both run at the same speed again. That leads me to believe that the compiler or the JIT manages to optimize the loop when it knows that progressCheck is constant.

    – marstran
    Jan 28 at 16:07






  • 6





    Related: How do I write a correct micro-benchmark in Java?

    – jaco0646
    Jan 28 at 16:21






  • 23





    Division by a constant can be easily converted to a multiplication by the multiplicative inverse. Division by a variable can't. And a 32-bit division is faster than a 64-bit division on x86

    – phuclv
    Jan 28 at 16:27






  • 2





    @phuclv note 32-bit division is not a issue here, it is a 64-bit remainder operation in both cases

    – Carlos Heuberger
    Jan 28 at 16:36






  • 4





    @RobertCotterman if you declare the variable as final, the compiler creates the same bytecode as using the constant (eclipse/Java 11) ((despite using one more memory slot for the variable))

    – Carlos Heuberger
    Jan 28 at 16:39














162












162








162


35






I am new to java, and was running some code last night, and this really bothered me. I was building a simple program to display every X outputs in a for loop, and I noticed a MASSIVE decrease in performance, when I used modulus as variable % variable vs variable % 5000 or whatnot. Can someone explain to me why this is and what is causing it? So I can be better...



Here is the "efficient" code (sorry if I get a bit of syntax wrong I'm not on the computer with the code right now)



long startNum = 0;
long stopNum = 1000000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
System.out.println(i);




Here is the "inefficient code"



long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);




Mind you I had a date variable to measure the differences, and once it became long enough, the first one took 50ms while the other took 12 seconds or something like that. You may have to increase the stopNum or decrease the progressCheck if your PC is more efficient than mine or what not.



I looked for this question around the web, but I can't find an answer, maybe I'm just not asking it right.



EDIT:
I did not expect my question to be so popular, I appreciate all the answers. I did perform a benchmark on each half in time taken, and the inefficient code took considerably longer, 1/4 of a second vs 10 seconds give or take. Granted they are using println, but they are both doing the same amount, so I wouldn't imagine that would skew it much, especially since the discrepancy is repeatable. As for the answers, since I am new to Java, I will let votes decide for now which answer is best. I will try to pick one by Wednesday.



EDIT2:
I am going to make another test tonight, where instead of modulus, it just increments a variable, and when it reaches progressCheck, it will perform one, and then reset that variable to 0. for a 3rd option.



EDIT3.5:



I used this code, and below I'll show my results.. Thank you ALL for the wonderful help! I also tried comparing the short value of the long to 0, so all my new checks happen ever "65536" times making it equal in repeats.



public class Main 


public static void main(String args)

long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 65536;
final long finalProgressCheck = 50000;
long date;

// using a fixed value
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if (i % 65536 == 0)
System.out.println(i);


long final1 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
//using a variable
for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);


long final2 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();

// using a final declared variable
for (long i = startNum; i <= stopNum; i++)
if (i % finalProgressCheck == 0)
System.out.println(i);


long final3 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
// using increments to determine progressCheck
int increment = 0;
for (long i = startNum; i <= stopNum; i++)
if (increment == 65536)
System.out.println(i);
increment = 0;

increment++;



//using a short conversion
long final4 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if ((short)i == 0)
System.out.println(i);


long final5 = System.currentTimeMillis() - date;

System.out.println(
"nfixed = " + final1 + " ms " + "nvariable = " + final2 + " ms " + "nfinal variable = " + final3 + " ms " + "nincrement = " + final4 + " ms" + "nShort Conversion = " + final5 + " ms");




Results:



  • fixed = 874 ms (normally around 1000ms, but faster due to it being a power of 2)

  • variable = 8590 ms

  • final variable = 1944 ms (Was ~1000ms when using 50000)

  • increment = 1904 ms

  • Short Conversion = 679 ms

Not surprising enough, due to a lack of division, the Short Conversion was 23% faster than the "fast" way. This is interesting to note. If you need to show or compare something every 256 times (or about there) you can do this, and use



if ((byte)integer == 0) 'Perform progress check code here'


ONE FINAL INTERESTING NOTE, using modulus on the "Final declared Variable" with 65536 (not a pretty number) was half the speed of (slower) than the fixed value. Where before it was benchmarking near the same speed.










share|improve this question
















I am new to java, and was running some code last night, and this really bothered me. I was building a simple program to display every X outputs in a for loop, and I noticed a MASSIVE decrease in performance, when I used modulus as variable % variable vs variable % 5000 or whatnot. Can someone explain to me why this is and what is causing it? So I can be better...



Here is the "efficient" code (sorry if I get a bit of syntax wrong I'm not on the computer with the code right now)



long startNum = 0;
long stopNum = 1000000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
System.out.println(i);




Here is the "inefficient code"



long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);




Mind you I had a date variable to measure the differences, and once it became long enough, the first one took 50ms while the other took 12 seconds or something like that. You may have to increase the stopNum or decrease the progressCheck if your PC is more efficient than mine or what not.



I looked for this question around the web, but I can't find an answer, maybe I'm just not asking it right.



EDIT:
I did not expect my question to be so popular, I appreciate all the answers. I did perform a benchmark on each half in time taken, and the inefficient code took considerably longer, 1/4 of a second vs 10 seconds give or take. Granted they are using println, but they are both doing the same amount, so I wouldn't imagine that would skew it much, especially since the discrepancy is repeatable. As for the answers, since I am new to Java, I will let votes decide for now which answer is best. I will try to pick one by Wednesday.



EDIT2:
I am going to make another test tonight, where instead of modulus, it just increments a variable, and when it reaches progressCheck, it will perform one, and then reset that variable to 0. for a 3rd option.



EDIT3.5:



I used this code, and below I'll show my results.. Thank you ALL for the wonderful help! I also tried comparing the short value of the long to 0, so all my new checks happen ever "65536" times making it equal in repeats.



public class Main 


public static void main(String args)

long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 65536;
final long finalProgressCheck = 50000;
long date;

// using a fixed value
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if (i % 65536 == 0)
System.out.println(i);


long final1 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
//using a variable
for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
System.out.println(i);


long final2 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();

// using a final declared variable
for (long i = startNum; i <= stopNum; i++)
if (i % finalProgressCheck == 0)
System.out.println(i);


long final3 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
// using increments to determine progressCheck
int increment = 0;
for (long i = startNum; i <= stopNum; i++)
if (increment == 65536)
System.out.println(i);
increment = 0;

increment++;



//using a short conversion
long final4 = System.currentTimeMillis() - date;
date = System.currentTimeMillis();
for (long i = startNum; i <= stopNum; i++)
if ((short)i == 0)
System.out.println(i);


long final5 = System.currentTimeMillis() - date;

System.out.println(
"nfixed = " + final1 + " ms " + "nvariable = " + final2 + " ms " + "nfinal variable = " + final3 + " ms " + "nincrement = " + final4 + " ms" + "nShort Conversion = " + final5 + " ms");




Results:



  • fixed = 874 ms (normally around 1000ms, but faster due to it being a power of 2)

  • variable = 8590 ms

  • final variable = 1944 ms (Was ~1000ms when using 50000)

  • increment = 1904 ms

  • Short Conversion = 679 ms

Not surprising enough, due to a lack of division, the Short Conversion was 23% faster than the "fast" way. This is interesting to note. If you need to show or compare something every 256 times (or about there) you can do this, and use



if ((byte)integer == 0) 'Perform progress check code here'


ONE FINAL INTERESTING NOTE, using modulus on the "Final declared Variable" with 65536 (not a pretty number) was half the speed of (slower) than the fixed value. Where before it was benchmarking near the same speed.







java performance






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Feb 3 at 13:06









Yvette Colomb

20.3k1470110




20.3k1470110










asked Jan 28 at 16:01









Robert CottermanRobert Cotterman

1,2662312




1,2662312







  • 27





    I got the same result actually. On my machine, the first loop runs in about 1,5 seconds and the second runs in about 9 seconds. If I add final in front of the progressCheck variable, both run at the same speed again. That leads me to believe that the compiler or the JIT manages to optimize the loop when it knows that progressCheck is constant.

    – marstran
    Jan 28 at 16:07






  • 6





    Related: How do I write a correct micro-benchmark in Java?

    – jaco0646
    Jan 28 at 16:21






  • 23





    Division by a constant can be easily converted to a multiplication by the multiplicative inverse. Division by a variable can't. And a 32-bit division is faster than a 64-bit division on x86

    – phuclv
    Jan 28 at 16:27






  • 2





    @phuclv note 32-bit division is not a issue here, it is a 64-bit remainder operation in both cases

    – Carlos Heuberger
    Jan 28 at 16:36






  • 4





    @RobertCotterman if you declare the variable as final, the compiler creates the same bytecode as using the constant (eclipse/Java 11) ((despite using one more memory slot for the variable))

    – Carlos Heuberger
    Jan 28 at 16:39













  • 27





    I got the same result actually. On my machine, the first loop runs in about 1,5 seconds and the second runs in about 9 seconds. If I add final in front of the progressCheck variable, both run at the same speed again. That leads me to believe that the compiler or the JIT manages to optimize the loop when it knows that progressCheck is constant.

    – marstran
    Jan 28 at 16:07






  • 6





    Related: How do I write a correct micro-benchmark in Java?

    – jaco0646
    Jan 28 at 16:21






  • 23





    Division by a constant can be easily converted to a multiplication by the multiplicative inverse. Division by a variable can't. And a 32-bit division is faster than a 64-bit division on x86

    – phuclv
    Jan 28 at 16:27






  • 2





    @phuclv note 32-bit division is not a issue here, it is a 64-bit remainder operation in both cases

    – Carlos Heuberger
    Jan 28 at 16:36






  • 4





    @RobertCotterman if you declare the variable as final, the compiler creates the same bytecode as using the constant (eclipse/Java 11) ((despite using one more memory slot for the variable))

    – Carlos Heuberger
    Jan 28 at 16:39








27




27





I got the same result actually. On my machine, the first loop runs in about 1,5 seconds and the second runs in about 9 seconds. If I add final in front of the progressCheck variable, both run at the same speed again. That leads me to believe that the compiler or the JIT manages to optimize the loop when it knows that progressCheck is constant.

– marstran
Jan 28 at 16:07





I got the same result actually. On my machine, the first loop runs in about 1,5 seconds and the second runs in about 9 seconds. If I add final in front of the progressCheck variable, both run at the same speed again. That leads me to believe that the compiler or the JIT manages to optimize the loop when it knows that progressCheck is constant.

– marstran
Jan 28 at 16:07




6




6





Related: How do I write a correct micro-benchmark in Java?

– jaco0646
Jan 28 at 16:21





Related: How do I write a correct micro-benchmark in Java?

– jaco0646
Jan 28 at 16:21




23




23





Division by a constant can be easily converted to a multiplication by the multiplicative inverse. Division by a variable can't. And a 32-bit division is faster than a 64-bit division on x86

– phuclv
Jan 28 at 16:27





Division by a constant can be easily converted to a multiplication by the multiplicative inverse. Division by a variable can't. And a 32-bit division is faster than a 64-bit division on x86

– phuclv
Jan 28 at 16:27




2




2





@phuclv note 32-bit division is not a issue here, it is a 64-bit remainder operation in both cases

– Carlos Heuberger
Jan 28 at 16:36





@phuclv note 32-bit division is not a issue here, it is a 64-bit remainder operation in both cases

– Carlos Heuberger
Jan 28 at 16:36




4




4





@RobertCotterman if you declare the variable as final, the compiler creates the same bytecode as using the constant (eclipse/Java 11) ((despite using one more memory slot for the variable))

– Carlos Heuberger
Jan 28 at 16:39






@RobertCotterman if you declare the variable as final, the compiler creates the same bytecode as using the constant (eclipse/Java 11) ((despite using one more memory slot for the variable))

– Carlos Heuberger
Jan 28 at 16:39













4 Answers
4






active

oldest

votes


















130














You are measuring the OSR (on-stack replacement) stub.



OSR stub is a special version of compiled method intended specifically for transferring execution from interpreted mode to compiled code while the method is running.



OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with interpreted frame. I showed this already in the following answers: 1, 2, 3.



A similar thing happens here, too. While "inefficient code" is running a long loop, the method is compiled specially for the on-stack replacement right inside the loop. The state is transferred from the interpreted frame to OSR-compiled method, and this state includes progressCheck local variable. At this point JIT cannot replace the variable with the constant, and thus cannot apply certain optimizations like strength reduction.



In particular this means JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)



However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):



@State(Scope.Benchmark)
public class Div

@Benchmark
public void divConst(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
blackhole.consume(i);




@Benchmark
public void divVar(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
blackhole.consume(i);






And the results:



# Benchmark: bench.Div.divConst

# Run progress: 0,00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration 1: 126,967 ms/op
# Warmup Iteration 2: 105,660 ms/op
# Warmup Iteration 3: 106,205 ms/op
Iteration 1: 105,620 ms/op
Iteration 2: 105,789 ms/op
Iteration 3: 105,915 ms/op
Iteration 4: 105,629 ms/op
Iteration 5: 105,632 ms/op


# Benchmark: bench.Div.divVar

# Run progress: 50,00% complete, ETA 00:00:09
# Fork: 1 of 1
# Warmup Iteration 1: 844,708 ms/op <-- much slower!
# Warmup Iteration 2: 105,893 ms/op <-- as fast as divConst
# Warmup Iteration 3: 105,601 ms/op
Iteration 1: 105,570 ms/op
Iteration 2: 105,475 ms/op
Iteration 3: 105,702 ms/op
Iteration 4: 105,535 ms/op
Iteration 5: 105,766 ms/op


The very first iteration of divVar is indeed much slower, because of inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new unconstrained version is executed which leverages all the available compiler optimizations.






share|improve this answer




















  • 5





    I hesitate to vote on this. On the one hand, it sounds like an elaborate way of saying "You messed up your benchmark, read something about JIT". On the other hand, I wonder why you seem to be so sure that OSR was the main relevant point here. I mean, doing a (micro) benchmark that involves System.out.println will nearly necessarily produce garbage results, and the fact that both versions are equally fast does not have to do anything with OSR in particular, as far as I can tell..

    – Marco13
    Jan 29 at 0:31






  • 2





    (I'm curious and like to understand this. I hope the comments are not disturbing, might delete them later, but: ) The link 1 is a bit dubious - the empty loop could also be optimized away completely. The second is more similar to that one. But again, it's not clear why you attribute the difference to the OSR specifically. I'd just say: At some point, the method is JITed and becomes faster. To my understanding, the OSR only causes the usage of the final, optimized code to (roughly) be ~"deferred to the next optimization pass". (continued...)

    – Marco13
    Jan 29 at 10:32






  • 1





    (continued:) Unless you are specifically analyzing the hotspot logs, you cannot say whether the difference is caused by comparing JITed and un-JITed code, or by comparing JITed and OSR-stub-code. And you certainly cannot say that for sure when the question does not contain the real code or a complete JMH benchmark. So arguing that the difference is caused by OSR sounds, for me, inappropriately specific (and "unjustified") compared to saying that it is caused by the JIT in general. (No offense - I'm just wondering...)

    – Marco13
    Jan 29 at 10:34







  • 4





    @Marco13 there’s a simple heuristic: without the JIT’s activity, each % operation would have the same weight, as an optimized execution is only possible, well, if an optimizer did actual work. So the fact that one loop variant is significantly faster than the other proves the presence of an optimizer and further proves that it failed to optimize one of the loops to the same degree as the other (within the same method!). As this answer proves the capability of optimizing both loops to the same degree, there must be something that hindered the optimization. And that’s OSR in 99.9% of all cases

    – Holger
    Jan 29 at 13:19






  • 4





    @Marco13 That was an "educated guess" based on the knowledge of HotSpot Runtime and the experience of analyzing similar issues before. Such a long loop could hardly be compiled in a way other than OSR, especially in a simple hand-made benchmark. Now, when OP has posted the complete code, I can only confirm the reasoning once again by running the code with -XX:+PrintCompilation -XX:+TraceNMethodInstalls.

    – apangin
    Jan 29 at 21:12



















40














In follow-up to @phuclv comment, I checked the code generated by JIT1, the results are as follows:



for variable % 5000 (division by constant):



mov rax,29f16b11c6d1e109h
imul rbx
mov r10,rbx
sar r10,3fh
sar rdx,0dh
sub rdx,r10
imul r10,rdx,0c350h ; <-- imul
mov r11,rbx
sub r11,r10
test r11,r11
jne 1d707ad14a0h


for variable % variable:



mov rax,r14
mov rdx,8000000000000000h
cmp rax,rdx
jne 22ccce218edh
xor edx,edx
cmp rbx,0ffffffffffffffffh
je 22ccce218f2h
cqo
idiv rax,rbx ; <-- idiv
test rdx,rdx
jne 22ccce218c0h


Because division always takes longer than multiplication, the last code snippet is less performant.



Java version:



java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)



1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main






share|improve this answer




















  • 13





    To give an order of magnitude on "slower", for x86_64: imul is 3 cycles, idiv is between 30 and 90 cycles. So integer division is between 10x and 30x slower than integer multiplication.

    – Matthieu M.
    Jan 29 at 13:19






  • 2





    Could you explain what all of that means for readers who are interested, but don't speak assembler?

    – Nico Haase
    Jan 29 at 14:00






  • 7





    @NicoHaase The two commented lines are the only important ones. In the first section, the code is performing an integer multiplication, whereas the second section, the code is performing an integer division. If you think about doing multiplication and division by hand, when you multiply you're usually doing a bunch of small multiplications and then one big set of additions, but division is a small division, a small multiplication, a subtraction, and repeat. Division is slow because you're essentially doing a bunch of multiplications.

    – MBraedley
    Jan 29 at 15:05







  • 4





    @MBraedley thanks for your input, but such explanation should be added to the answer itself and not be hidden in the comment section

    – Nico Haase
    Jan 29 at 15:16






  • 6





    @MBraedley: More to the point, multiplication in a modern CPU is fast because the partial products are independent and can thus be computed separately, while each stage of a division is dependent upon the preceding stages.

    – supercat
    Jan 29 at 20:34


















24














As others have noted, the general modulus operation requires a division to be done. In some cases, the division can be replaced (by the compiler) by a multiplication. But both can be slow compared to addition/subtraction. Hence, the best performance can be expected by something along these lines:



long progressCheck = 50000;

long counter = progressCheck;

for (long i = startNum; i <= stopNum; i++)
if (--counter == 0)
System.out.println(i);
counter = progressCheck;




(As a minor optmiziation attempt we use a pre-decrement down-counter here because on many architectures comparing to 0 after an arithmetic operation costs exactly 0 instructions/CPU cycles because the ALU's flags are already set appropriately by the preceeding operation. A decent optimizing compiler will, however, do that optimization automatically if you write if (counter++ == 50000) ... counter = 0; .)



Notice that often you don't really want/need modulus, because you know that your loop counter (i) or whatever is only ever incremented by 1, and you really don't care about the actual remainder the modulus will give you, just see if the incrementing-by-one counter hits some value.



Another 'trick' is to use power-of-two values/limits, e.g. progressCheck = 1024;. Modulus a power of two can be quickly calculated via bitwise and, i.e. if ( (i & (1024-1)) == 0 ) .... This should be pretty fast too, and may on some architectures outperform the explicit counter above.






share|improve this answer




















  • 3





    A smart compiler would invert the loops here. Or you could do that in the source. The if() body becomes an outer-loop body, and the stuff outside the if() becomes an inner loop body that runs for min(progressCheck, stopNum-i) iterations. So at the start, and every time counter reaches 0, you do long next_stop = i + min(progressCheck, stopNum-i); to set up for a for(; i< next_stop; i++) loop. In this case that inner loop is empty and should hopefully optimize away entirely, of you can do that in the source and make it easy for the JITer, reducing your loop to i+=50k.

    – Peter Cordes
    Jan 29 at 0:26






  • 2





    But yes, in general a down-counter is a good efficient technique for fizzbuzz / progresscheck type stuff.

    – Peter Cordes
    Jan 29 at 0:26











  • I added to my question, and did an increment, the --counter is just as fast as my increment version, but less code.also it was 1 lower than it should be, i'm curious if it should be counter-- to get the exact number you want, not that it's much of a difference

    – Robert Cotterman
    Jan 29 at 3:36











  • @PeterCordes A smart compiler would just print the numbers, no loop at all. (I think some only slightly more trivial benchmarks started to fail that way maybe 10 years ago.)

    – Peter A. Schneider
    Jan 29 at 8:19







  • 2





    @RobertCotterman Yes, --counter is off by one. counter-- will give you exactly the progressCheck number of iterations (or you could set progressCheck = 50001; of course).

    – JimmyB
    Jan 29 at 14:33


















3














I am also surprised by seeing the performance of the above codes. It's all about the time taken by the compiler for executing the program as per the declared variable. In the second (inefficient) example:



for (long i = startNum; i <= stopNum; i++) 
if (i % progressCheck == 0)
System.out.println(i)




You are performing the modulus operation between two variables. Here, compiler has to check the value of stopNum and progressCheck to go to the specific memory block located for these variables every time after each iteration because it is a variable and its value might be change.



That's why after each iteration compiler went to the memory location to check the latest value of the variables. Therefore at the compile time the compiler was not able to create efficient byte code.



In the first code example, you are performing modulus operator between a variable and a constant numeric value which is not going to change within execution and compiler no need to check the value of that numeric value from the memory location. That's why compiler was able to create efficient byte code. If you declare progressCheck as a final or as a final static variable then at the time of run-time/compile-time compiler know that it's a final variable and its value not going to change then compiler replace the progressCheck with 50000 in code:



for (long i = startNum; i <= stopNum; i++) 
if (i % 50000== 0)
System.out.println(i)




Now you can see that this code also looks like the first (efficient) code example. The performance of first code and as we mentioned above both code will work efficiently. There will not be much difference in execution time of either code example.






share|improve this answer




















  • 1





    There is a HUGE difference, albeit i was doing the operation a trillion times, so over 1 trillion operations it saved 89% time to do the "efficient" code. mind you if you're only doing it a few thousand times, were talking such a tiny difference, it's probably not a big deal. i mean over 1000 operations it would save you 1 millionth of 7 seconds.

    – Robert Cotterman
    Jan 29 at 3:38







  • 1





    @Bishal Dubey "There will not be much difference in execution time of both code." Did you read the question?

    – gfos
    Jan 29 at 16:27











  • "That's why after each iteration compiler went to the memory location to check the latest value of the variables" - Unless the variable was declared volatile the 'compiler' will not read its value from RAM over and over again.

    – JimmyB
    Feb 10 at 10:07










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54405842%2fwhy-is-if-variable1-variable2-0-inefficient%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























4 Answers
4






active

oldest

votes








4 Answers
4






active

oldest

votes









active

oldest

votes






active

oldest

votes









130














You are measuring the OSR (on-stack replacement) stub.



OSR stub is a special version of compiled method intended specifically for transferring execution from interpreted mode to compiled code while the method is running.



OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with interpreted frame. I showed this already in the following answers: 1, 2, 3.



A similar thing happens here, too. While "inefficient code" is running a long loop, the method is compiled specially for the on-stack replacement right inside the loop. The state is transferred from the interpreted frame to OSR-compiled method, and this state includes progressCheck local variable. At this point JIT cannot replace the variable with the constant, and thus cannot apply certain optimizations like strength reduction.



In particular this means JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)



However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):



@State(Scope.Benchmark)
public class Div

@Benchmark
public void divConst(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
blackhole.consume(i);




@Benchmark
public void divVar(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
blackhole.consume(i);






And the results:



# Benchmark: bench.Div.divConst

# Run progress: 0,00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration 1: 126,967 ms/op
# Warmup Iteration 2: 105,660 ms/op
# Warmup Iteration 3: 106,205 ms/op
Iteration 1: 105,620 ms/op
Iteration 2: 105,789 ms/op
Iteration 3: 105,915 ms/op
Iteration 4: 105,629 ms/op
Iteration 5: 105,632 ms/op


# Benchmark: bench.Div.divVar

# Run progress: 50,00% complete, ETA 00:00:09
# Fork: 1 of 1
# Warmup Iteration 1: 844,708 ms/op <-- much slower!
# Warmup Iteration 2: 105,893 ms/op <-- as fast as divConst
# Warmup Iteration 3: 105,601 ms/op
Iteration 1: 105,570 ms/op
Iteration 2: 105,475 ms/op
Iteration 3: 105,702 ms/op
Iteration 4: 105,535 ms/op
Iteration 5: 105,766 ms/op


The very first iteration of divVar is indeed much slower, because of inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new unconstrained version is executed which leverages all the available compiler optimizations.






share|improve this answer




















  • 5





    I hesitate to vote on this. On the one hand, it sounds like an elaborate way of saying "You messed up your benchmark, read something about JIT". On the other hand, I wonder why you seem to be so sure that OSR was the main relevant point here. I mean, doing a (micro) benchmark that involves System.out.println will nearly necessarily produce garbage results, and the fact that both versions are equally fast does not have to do anything with OSR in particular, as far as I can tell..

    – Marco13
    Jan 29 at 0:31






  • 2





    (I'm curious and like to understand this. I hope the comments are not disturbing, might delete them later, but: ) The link 1 is a bit dubious - the empty loop could also be optimized away completely. The second is more similar to that one. But again, it's not clear why you attribute the difference to the OSR specifically. I'd just say: At some point, the method is JITed and becomes faster. To my understanding, the OSR only causes the usage of the final, optimized code to (roughly) be ~"deferred to the next optimization pass". (continued...)

    – Marco13
    Jan 29 at 10:32






  • 1





    (continued:) Unless you are specifically analyzing the hotspot logs, you cannot say whether the difference is caused by comparing JITed and un-JITed code, or by comparing JITed and OSR-stub-code. And you certainly cannot say that for sure when the question does not contain the real code or a complete JMH benchmark. So arguing that the difference is caused by OSR sounds, for me, inappropriately specific (and "unjustified") compared to saying that it is caused by the JIT in general. (No offense - I'm just wondering...)

    – Marco13
    Jan 29 at 10:34







  • 4





    @Marco13 there’s a simple heuristic: without the JIT’s activity, each % operation would have the same weight, as an optimized execution is only possible, well, if an optimizer did actual work. So the fact that one loop variant is significantly faster than the other proves the presence of an optimizer and further proves that it failed to optimize one of the loops to the same degree as the other (within the same method!). As this answer proves the capability of optimizing both loops to the same degree, there must be something that hindered the optimization. And that’s OSR in 99.9% of all cases

    – Holger
    Jan 29 at 13:19






  • 4





    @Marco13 That was an "educated guess" based on the knowledge of HotSpot Runtime and the experience of analyzing similar issues before. Such a long loop could hardly be compiled in a way other than OSR, especially in a simple hand-made benchmark. Now, when OP has posted the complete code, I can only confirm the reasoning once again by running the code with -XX:+PrintCompilation -XX:+TraceNMethodInstalls.

    – apangin
    Jan 29 at 21:12
















130














You are measuring the OSR (on-stack replacement) stub.



OSR stub is a special version of compiled method intended specifically for transferring execution from interpreted mode to compiled code while the method is running.



OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with interpreted frame. I showed this already in the following answers: 1, 2, 3.



A similar thing happens here, too. While "inefficient code" is running a long loop, the method is compiled specially for the on-stack replacement right inside the loop. The state is transferred from the interpreted frame to OSR-compiled method, and this state includes progressCheck local variable. At this point JIT cannot replace the variable with the constant, and thus cannot apply certain optimizations like strength reduction.



In particular this means JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)



However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):



@State(Scope.Benchmark)
public class Div

@Benchmark
public void divConst(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
blackhole.consume(i);




@Benchmark
public void divVar(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
blackhole.consume(i);






And the results:



# Benchmark: bench.Div.divConst

# Run progress: 0,00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration 1: 126,967 ms/op
# Warmup Iteration 2: 105,660 ms/op
# Warmup Iteration 3: 106,205 ms/op
Iteration 1: 105,620 ms/op
Iteration 2: 105,789 ms/op
Iteration 3: 105,915 ms/op
Iteration 4: 105,629 ms/op
Iteration 5: 105,632 ms/op


# Benchmark: bench.Div.divVar

# Run progress: 50,00% complete, ETA 00:00:09
# Fork: 1 of 1
# Warmup Iteration 1: 844,708 ms/op <-- much slower!
# Warmup Iteration 2: 105,893 ms/op <-- as fast as divConst
# Warmup Iteration 3: 105,601 ms/op
Iteration 1: 105,570 ms/op
Iteration 2: 105,475 ms/op
Iteration 3: 105,702 ms/op
Iteration 4: 105,535 ms/op
Iteration 5: 105,766 ms/op


The very first iteration of divVar is indeed much slower, because of inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new unconstrained version is executed which leverages all the available compiler optimizations.






share|improve this answer




















  • 5





    I hesitate to vote on this. On the one hand, it sounds like an elaborate way of saying "You messed up your benchmark, read something about JIT". On the other hand, I wonder why you seem to be so sure that OSR was the main relevant point here. I mean, doing a (micro) benchmark that involves System.out.println will nearly necessarily produce garbage results, and the fact that both versions are equally fast does not have to do anything with OSR in particular, as far as I can tell..

    – Marco13
    Jan 29 at 0:31






  • 2





    (I'm curious and like to understand this. I hope the comments are not disturbing, might delete them later, but: ) The link 1 is a bit dubious - the empty loop could also be optimized away completely. The second is more similar to that one. But again, it's not clear why you attribute the difference to the OSR specifically. I'd just say: At some point, the method is JITed and becomes faster. To my understanding, the OSR only causes the usage of the final, optimized code to (roughly) be ~"deferred to the next optimization pass". (continued...)

    – Marco13
    Jan 29 at 10:32






  • 1





    (continued:) Unless you are specifically analyzing the hotspot logs, you cannot say whether the difference is caused by comparing JITed and un-JITed code, or by comparing JITed and OSR-stub-code. And you certainly cannot say that for sure when the question does not contain the real code or a complete JMH benchmark. So arguing that the difference is caused by OSR sounds, for me, inappropriately specific (and "unjustified") compared to saying that it is caused by the JIT in general. (No offense - I'm just wondering...)

    – Marco13
    Jan 29 at 10:34







  • 4





    @Marco13 there’s a simple heuristic: without the JIT’s activity, each % operation would have the same weight, as an optimized execution is only possible, well, if an optimizer did actual work. So the fact that one loop variant is significantly faster than the other proves the presence of an optimizer and further proves that it failed to optimize one of the loops to the same degree as the other (within the same method!). As this answer proves the capability of optimizing both loops to the same degree, there must be something that hindered the optimization. And that’s OSR in 99.9% of all cases

    – Holger
    Jan 29 at 13:19






  • 4





    @Marco13 That was an "educated guess" based on the knowledge of HotSpot Runtime and the experience of analyzing similar issues before. Such a long loop could hardly be compiled in a way other than OSR, especially in a simple hand-made benchmark. Now, when OP has posted the complete code, I can only confirm the reasoning once again by running the code with -XX:+PrintCompilation -XX:+TraceNMethodInstalls.

    – apangin
    Jan 29 at 21:12














130












130








130







You are measuring the OSR (on-stack replacement) stub.



OSR stub is a special version of compiled method intended specifically for transferring execution from interpreted mode to compiled code while the method is running.



OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with interpreted frame. I showed this already in the following answers: 1, 2, 3.



A similar thing happens here, too. While "inefficient code" is running a long loop, the method is compiled specially for the on-stack replacement right inside the loop. The state is transferred from the interpreted frame to OSR-compiled method, and this state includes progressCheck local variable. At this point JIT cannot replace the variable with the constant, and thus cannot apply certain optimizations like strength reduction.



In particular this means JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)



However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):



@State(Scope.Benchmark)
public class Div

@Benchmark
public void divConst(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
blackhole.consume(i);




@Benchmark
public void divVar(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
blackhole.consume(i);






And the results:



# Benchmark: bench.Div.divConst

# Run progress: 0,00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration 1: 126,967 ms/op
# Warmup Iteration 2: 105,660 ms/op
# Warmup Iteration 3: 106,205 ms/op
Iteration 1: 105,620 ms/op
Iteration 2: 105,789 ms/op
Iteration 3: 105,915 ms/op
Iteration 4: 105,629 ms/op
Iteration 5: 105,632 ms/op


# Benchmark: bench.Div.divVar

# Run progress: 50,00% complete, ETA 00:00:09
# Fork: 1 of 1
# Warmup Iteration 1: 844,708 ms/op <-- much slower!
# Warmup Iteration 2: 105,893 ms/op <-- as fast as divConst
# Warmup Iteration 3: 105,601 ms/op
Iteration 1: 105,570 ms/op
Iteration 2: 105,475 ms/op
Iteration 3: 105,702 ms/op
Iteration 4: 105,535 ms/op
Iteration 5: 105,766 ms/op


The very first iteration of divVar is indeed much slower, because of inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new unconstrained version is executed which leverages all the available compiler optimizations.






share|improve this answer















You are measuring the OSR (on-stack replacement) stub.



OSR stub is a special version of compiled method intended specifically for transferring execution from interpreted mode to compiled code while the method is running.



OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with interpreted frame. I showed this already in the following answers: 1, 2, 3.



A similar thing happens here, too. While "inefficient code" is running a long loop, the method is compiled specially for the on-stack replacement right inside the loop. The state is transferred from the interpreted frame to OSR-compiled method, and this state includes progressCheck local variable. At this point JIT cannot replace the variable with the constant, and thus cannot apply certain optimizations like strength reduction.



In particular this means JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)



However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):



@State(Scope.Benchmark)
public class Div

@Benchmark
public void divConst(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;

for (long i = startNum; i <= stopNum; i++)
if (i % 50000 == 0)
blackhole.consume(i);




@Benchmark
public void divVar(Blackhole blackhole)
long startNum = 0;
long stopNum = 100000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++)
if (i % progressCheck == 0)
blackhole.consume(i);






And the results:



# Benchmark: bench.Div.divConst

# Run progress: 0,00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration 1: 126,967 ms/op
# Warmup Iteration 2: 105,660 ms/op
# Warmup Iteration 3: 106,205 ms/op
Iteration 1: 105,620 ms/op
Iteration 2: 105,789 ms/op
Iteration 3: 105,915 ms/op
Iteration 4: 105,629 ms/op
Iteration 5: 105,632 ms/op


# Benchmark: bench.Div.divVar

# Run progress: 50,00% complete, ETA 00:00:09
# Fork: 1 of 1
# Warmup Iteration 1: 844,708 ms/op <-- much slower!
# Warmup Iteration 2: 105,893 ms/op <-- as fast as divConst
# Warmup Iteration 3: 105,601 ms/op
Iteration 1: 105,570 ms/op
Iteration 2: 105,475 ms/op
Iteration 3: 105,702 ms/op
Iteration 4: 105,535 ms/op
Iteration 5: 105,766 ms/op


The very first iteration of divVar is indeed much slower, because of inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new unconstrained version is executed which leverages all the available compiler optimizations.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 31 at 7:07









Ian Kemp

16.8k126999




16.8k126999










answered Jan 28 at 21:41









apanginapangin

51.4k8100130




51.4k8100130







  • 5





    I hesitate to vote on this. On the one hand, it sounds like an elaborate way of saying "You messed up your benchmark, read something about JIT". On the other hand, I wonder why you seem to be so sure that OSR was the main relevant point here. I mean, doing a (micro) benchmark that involves System.out.println will nearly necessarily produce garbage results, and the fact that both versions are equally fast does not have to do anything with OSR in particular, as far as I can tell..

    – Marco13
    Jan 29 at 0:31






  • 2





    (I'm curious and like to understand this. I hope the comments are not disturbing, might delete them later, but: ) The link 1 is a bit dubious - the empty loop could also be optimized away completely. The second is more similar to that one. But again, it's not clear why you attribute the difference to the OSR specifically. I'd just say: At some point, the method is JITed and becomes faster. To my understanding, the OSR only causes the usage of the final, optimized code to (roughly) be ~"deferred to the next optimization pass". (continued...)

    – Marco13
    Jan 29 at 10:32






  • 1





    (continued:) Unless you are specifically analyzing the hotspot logs, you cannot say whether the difference is caused by comparing JITed and un-JITed code, or by comparing JITed and OSR-stub-code. And you certainly cannot say that for sure when the question does not contain the real code or a complete JMH benchmark. So arguing that the difference is caused by OSR sounds, for me, inappropriately specific (and "unjustified") compared to saying that it is caused by the JIT in general. (No offense - I'm just wondering...)

    – Marco13
    Jan 29 at 10:34







  • 4





    @Marco13 there’s a simple heuristic: without the JIT’s activity, each % operation would have the same weight, as an optimized execution is only possible, well, if an optimizer did actual work. So the fact that one loop variant is significantly faster than the other proves the presence of an optimizer and further proves that it failed to optimize one of the loops to the same degree as the other (within the same method!). As this answer proves the capability of optimizing both loops to the same degree, there must be something that hindered the optimization. And that’s OSR in 99.9% of all cases

    – Holger
    Jan 29 at 13:19






  • 4





    @Marco13 That was an "educated guess" based on the knowledge of HotSpot Runtime and the experience of analyzing similar issues before. Such a long loop could hardly be compiled in a way other than OSR, especially in a simple hand-made benchmark. Now, when OP has posted the complete code, I can only confirm the reasoning once again by running the code with -XX:+PrintCompilation -XX:+TraceNMethodInstalls.

    – apangin
    Jan 29 at 21:12













  • 5





    I hesitate to vote on this. On the one hand, it sounds like an elaborate way of saying "You messed up your benchmark, read something about JIT". On the other hand, I wonder why you seem to be so sure that OSR was the main relevant point here. I mean, doing a (micro) benchmark that involves System.out.println will nearly necessarily produce garbage results, and the fact that both versions are equally fast does not have to do anything with OSR in particular, as far as I can tell..

    – Marco13
    Jan 29 at 0:31






  • 2





    (I'm curious and like to understand this. I hope the comments are not disturbing, might delete them later, but: ) The link 1 is a bit dubious - the empty loop could also be optimized away completely. The second is more similar to that one. But again, it's not clear why you attribute the difference to the OSR specifically. I'd just say: At some point, the method is JITed and becomes faster. To my understanding, the OSR only causes the usage of the final, optimized code to (roughly) be ~"deferred to the next optimization pass". (continued...)

    – Marco13
    Jan 29 at 10:32






  • 1





    (continued:) Unless you are specifically analyzing the hotspot logs, you cannot say whether the difference is caused by comparing JITed and un-JITed code, or by comparing JITed and OSR-stub-code. And you certainly cannot say that for sure when the question does not contain the real code or a complete JMH benchmark. So arguing that the difference is caused by OSR sounds, for me, inappropriately specific (and "unjustified") compared to saying that it is caused by the JIT in general. (No offense - I'm just wondering...)

    – Marco13
    Jan 29 at 10:34







  • 4





    @Marco13 there’s a simple heuristic: without the JIT’s activity, each % operation would have the same weight, as an optimized execution is only possible, well, if an optimizer did actual work. So the fact that one loop variant is significantly faster than the other proves the presence of an optimizer and further proves that it failed to optimize one of the loops to the same degree as the other (within the same method!). As this answer proves the capability of optimizing both loops to the same degree, there must be something that hindered the optimization. And that’s OSR in 99.9% of all cases

    – Holger
    Jan 29 at 13:19






  • 4





    @Marco13 That was an "educated guess" based on the knowledge of HotSpot Runtime and the experience of analyzing similar issues before. Such a long loop could hardly be compiled in a way other than OSR, especially in a simple hand-made benchmark. Now, when OP has posted the complete code, I can only confirm the reasoning once again by running the code with -XX:+PrintCompilation -XX:+TraceNMethodInstalls.

    – apangin
    Jan 29 at 21:12








5




5





I hesitate to vote on this. On the one hand, it sounds like an elaborate way of saying "You messed up your benchmark, read something about JIT". On the other hand, I wonder why you seem to be so sure that OSR was the main relevant point here. I mean, doing a (micro) benchmark that involves System.out.println will nearly necessarily produce garbage results, and the fact that both versions are equally fast does not have to do anything with OSR in particular, as far as I can tell..

– Marco13
Jan 29 at 0:31





I hesitate to vote on this. On the one hand, it sounds like an elaborate way of saying "You messed up your benchmark, read something about JIT". On the other hand, I wonder why you seem to be so sure that OSR was the main relevant point here. I mean, doing a (micro) benchmark that involves System.out.println will nearly necessarily produce garbage results, and the fact that both versions are equally fast does not have to do anything with OSR in particular, as far as I can tell..

– Marco13
Jan 29 at 0:31




2




2





(I'm curious and like to understand this. I hope the comments are not disturbing, might delete them later, but: ) The link 1 is a bit dubious - the empty loop could also be optimized away completely. The second is more similar to that one. But again, it's not clear why you attribute the difference to the OSR specifically. I'd just say: At some point, the method is JITed and becomes faster. To my understanding, the OSR only causes the usage of the final, optimized code to (roughly) be ~"deferred to the next optimization pass". (continued...)

– Marco13
Jan 29 at 10:32





(I'm curious and like to understand this. I hope the comments are not disturbing, might delete them later, but: ) The link 1 is a bit dubious - the empty loop could also be optimized away completely. The second is more similar to that one. But again, it's not clear why you attribute the difference to the OSR specifically. I'd just say: At some point, the method is JITed and becomes faster. To my understanding, the OSR only causes the usage of the final, optimized code to (roughly) be ~"deferred to the next optimization pass". (continued...)

– Marco13
Jan 29 at 10:32




1




1





(continued:) Unless you are specifically analyzing the hotspot logs, you cannot say whether the difference is caused by comparing JITed and un-JITed code, or by comparing JITed and OSR-stub-code. And you certainly cannot say that for sure when the question does not contain the real code or a complete JMH benchmark. So arguing that the difference is caused by OSR sounds, for me, inappropriately specific (and "unjustified") compared to saying that it is caused by the JIT in general. (No offense - I'm just wondering...)

– Marco13
Jan 29 at 10:34






(continued:) Unless you are specifically analyzing the hotspot logs, you cannot say whether the difference is caused by comparing JITed and un-JITed code, or by comparing JITed and OSR-stub-code. And you certainly cannot say that for sure when the question does not contain the real code or a complete JMH benchmark. So arguing that the difference is caused by OSR sounds, for me, inappropriately specific (and "unjustified") compared to saying that it is caused by the JIT in general. (No offense - I'm just wondering...)

– Marco13
Jan 29 at 10:34





4




4





@Marco13 there’s a simple heuristic: without the JIT’s activity, each % operation would have the same weight, as an optimized execution is only possible, well, if an optimizer did actual work. So the fact that one loop variant is significantly faster than the other proves the presence of an optimizer and further proves that it failed to optimize one of the loops to the same degree as the other (within the same method!). As this answer proves the capability of optimizing both loops to the same degree, there must be something that hindered the optimization. And that’s OSR in 99.9% of all cases

– Holger
Jan 29 at 13:19





@Marco13 there’s a simple heuristic: without the JIT’s activity, each % operation would have the same weight, as an optimized execution is only possible, well, if an optimizer did actual work. So the fact that one loop variant is significantly faster than the other proves the presence of an optimizer and further proves that it failed to optimize one of the loops to the same degree as the other (within the same method!). As this answer proves the capability of optimizing both loops to the same degree, there must be something that hindered the optimization. And that’s OSR in 99.9% of all cases

– Holger
Jan 29 at 13:19




4




4





@Marco13 That was an "educated guess" based on the knowledge of HotSpot Runtime and the experience of analyzing similar issues before. Such a long loop could hardly be compiled in a way other than OSR, especially in a simple hand-made benchmark. Now, when OP has posted the complete code, I can only confirm the reasoning once again by running the code with -XX:+PrintCompilation -XX:+TraceNMethodInstalls.

– apangin
Jan 29 at 21:12






@Marco13 That was an "educated guess" based on the knowledge of HotSpot Runtime and the experience of analyzing similar issues before. Such a long loop could hardly be compiled in a way other than OSR, especially in a simple hand-made benchmark. Now, when OP has posted the complete code, I can only confirm the reasoning once again by running the code with -XX:+PrintCompilation -XX:+TraceNMethodInstalls.

– apangin
Jan 29 at 21:12














40














In follow-up to @phuclv comment, I checked the code generated by JIT1, the results are as follows:



for variable % 5000 (division by constant):



mov rax,29f16b11c6d1e109h
imul rbx
mov r10,rbx
sar r10,3fh
sar rdx,0dh
sub rdx,r10
imul r10,rdx,0c350h ; <-- imul
mov r11,rbx
sub r11,r10
test r11,r11
jne 1d707ad14a0h


for variable % variable:



mov rax,r14
mov rdx,8000000000000000h
cmp rax,rdx
jne 22ccce218edh
xor edx,edx
cmp rbx,0ffffffffffffffffh
je 22ccce218f2h
cqo
idiv rax,rbx ; <-- idiv
test rdx,rdx
jne 22ccce218c0h


Because division always takes longer than multiplication, the last code snippet is less performant.



Java version:



java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)



1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main






share|improve this answer




















  • 13





    To give an order of magnitude on "slower", for x86_64: imul is 3 cycles, idiv is between 30 and 90 cycles. So integer division is between 10x and 30x slower than integer multiplication.

    – Matthieu M.
    Jan 29 at 13:19






  • 2





    Could you explain what all of that means for readers who are interested, but don't speak assembler?

    – Nico Haase
    Jan 29 at 14:00






  • 7





    @NicoHaase The two commented lines are the only important ones. In the first section, the code is performing an integer multiplication, whereas the second section, the code is performing an integer division. If you think about doing multiplication and division by hand, when you multiply you're usually doing a bunch of small multiplications and then one big set of additions, but division is a small division, a small multiplication, a subtraction, and repeat. Division is slow because you're essentially doing a bunch of multiplications.

    – MBraedley
    Jan 29 at 15:05







  • 4





    @MBraedley thanks for your input, but such explanation should be added to the answer itself and not be hidden in the comment section

    – Nico Haase
    Jan 29 at 15:16






  • 6





    @MBraedley: More to the point, multiplication in a modern CPU is fast because the partial products are independent and can thus be computed separately, while each stage of a division is dependent upon the preceding stages.

    – supercat
    Jan 29 at 20:34















40














In follow-up to @phuclv comment, I checked the code generated by JIT1, the results are as follows:



for variable % 5000 (division by constant):



mov rax,29f16b11c6d1e109h
imul rbx
mov r10,rbx
sar r10,3fh
sar rdx,0dh
sub rdx,r10
imul r10,rdx,0c350h ; <-- imul
mov r11,rbx
sub r11,r10
test r11,r11
jne 1d707ad14a0h


for variable % variable:



mov rax,r14
mov rdx,8000000000000000h
cmp rax,rdx
jne 22ccce218edh
xor edx,edx
cmp rbx,0ffffffffffffffffh
je 22ccce218f2h
cqo
idiv rax,rbx ; <-- idiv
test rdx,rdx
jne 22ccce218c0h


Because division always takes longer than multiplication, the last code snippet is less performant.



Java version:



java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)



1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main






share|improve this answer




















  • 13





    To give an order of magnitude on "slower", for x86_64: imul is 3 cycles, idiv is between 30 and 90 cycles. So integer division is between 10x and 30x slower than integer multiplication.

    – Matthieu M.
    Jan 29 at 13:19






  • 2





    Could you explain what all of that means for readers who are interested, but don't speak assembler?

    – Nico Haase
    Jan 29 at 14:00






  • 7





    @NicoHaase The two commented lines are the only important ones. In the first section, the code is performing an integer multiplication, whereas the second section, the code is performing an integer division. If you think about doing multiplication and division by hand, when you multiply you're usually doing a bunch of small multiplications and then one big set of additions, but division is a small division, a small multiplication, a subtraction, and repeat. Division is slow because you're essentially doing a bunch of multiplications.

    – MBraedley
    Jan 29 at 15:05







  • 4





    @MBraedley thanks for your input, but such explanation should be added to the answer itself and not be hidden in the comment section

    – Nico Haase
    Jan 29 at 15:16






  • 6





    @MBraedley: More to the point, multiplication in a modern CPU is fast because the partial products are independent and can thus be computed separately, while each stage of a division is dependent upon the preceding stages.

    – supercat
    Jan 29 at 20:34













40












40








40







In follow-up to @phuclv comment, I checked the code generated by JIT1, the results are as follows:



for variable % 5000 (division by constant):



mov rax,29f16b11c6d1e109h
imul rbx
mov r10,rbx
sar r10,3fh
sar rdx,0dh
sub rdx,r10
imul r10,rdx,0c350h ; <-- imul
mov r11,rbx
sub r11,r10
test r11,r11
jne 1d707ad14a0h


for variable % variable:



mov rax,r14
mov rdx,8000000000000000h
cmp rax,rdx
jne 22ccce218edh
xor edx,edx
cmp rbx,0ffffffffffffffffh
je 22ccce218f2h
cqo
idiv rax,rbx ; <-- idiv
test rdx,rdx
jne 22ccce218c0h


Because division always takes longer than multiplication, the last code snippet is less performant.



Java version:



java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)



1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main






share|improve this answer















In follow-up to @phuclv comment, I checked the code generated by JIT1, the results are as follows:



for variable % 5000 (division by constant):



mov rax,29f16b11c6d1e109h
imul rbx
mov r10,rbx
sar r10,3fh
sar rdx,0dh
sub rdx,r10
imul r10,rdx,0c350h ; <-- imul
mov r11,rbx
sub r11,r10
test r11,r11
jne 1d707ad14a0h


for variable % variable:



mov rax,r14
mov rdx,8000000000000000h
cmp rax,rdx
jne 22ccce218edh
xor edx,edx
cmp rbx,0ffffffffffffffffh
je 22ccce218f2h
cqo
idiv rax,rbx ; <-- idiv
test rdx,rdx
jne 22ccce218c0h


Because division always takes longer than multiplication, the last code snippet is less performant.



Java version:



java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)



1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 29 at 10:55

























answered Jan 28 at 21:01









OleksandrOleksandr

9,36544071




9,36544071







  • 13





    To give an order of magnitude on "slower", for x86_64: imul is 3 cycles, idiv is between 30 and 90 cycles. So integer division is between 10x and 30x slower than integer multiplication.

    – Matthieu M.
    Jan 29 at 13:19






  • 2





    Could you explain what all of that means for readers who are interested, but don't speak assembler?

    – Nico Haase
    Jan 29 at 14:00






  • 7





    @NicoHaase The two commented lines are the only important ones. In the first section, the code is performing an integer multiplication, whereas the second section, the code is performing an integer division. If you think about doing multiplication and division by hand, when you multiply you're usually doing a bunch of small multiplications and then one big set of additions, but division is a small division, a small multiplication, a subtraction, and repeat. Division is slow because you're essentially doing a bunch of multiplications.

    – MBraedley
    Jan 29 at 15:05







  • 4





    @MBraedley thanks for your input, but such explanation should be added to the answer itself and not be hidden in the comment section

    – Nico Haase
    Jan 29 at 15:16






  • 6





    @MBraedley: More to the point, multiplication in a modern CPU is fast because the partial products are independent and can thus be computed separately, while each stage of a division is dependent upon the preceding stages.

    – supercat
    Jan 29 at 20:34












  • 13





    To give an order of magnitude on "slower", for x86_64: imul is 3 cycles, idiv is between 30 and 90 cycles. So integer division is between 10x and 30x slower than integer multiplication.

    – Matthieu M.
    Jan 29 at 13:19






  • 2





    Could you explain what all of that means for readers who are interested, but don't speak assembler?

    – Nico Haase
    Jan 29 at 14:00






  • 7





    @NicoHaase The two commented lines are the only important ones. In the first section, the code is performing an integer multiplication, whereas the second section, the code is performing an integer division. If you think about doing multiplication and division by hand, when you multiply you're usually doing a bunch of small multiplications and then one big set of additions, but division is a small division, a small multiplication, a subtraction, and repeat. Division is slow because you're essentially doing a bunch of multiplications.

    – MBraedley
    Jan 29 at 15:05







  • 4





    @MBraedley thanks for your input, but such explanation should be added to the answer itself and not be hidden in the comment section

    – Nico Haase
    Jan 29 at 15:16






  • 6





    @MBraedley: More to the point, multiplication in a modern CPU is fast because the partial products are independent and can thus be computed separately, while each stage of a division is dependent upon the preceding stages.

    – supercat
    Jan 29 at 20:34







13




13





To give an order of magnitude on "slower", for x86_64: imul is 3 cycles, idiv is between 30 and 90 cycles. So integer division is between 10x and 30x slower than integer multiplication.

– Matthieu M.
Jan 29 at 13:19





To give an order of magnitude on "slower", for x86_64: imul is 3 cycles, idiv is between 30 and 90 cycles. So integer division is between 10x and 30x slower than integer multiplication.

– Matthieu M.
Jan 29 at 13:19




2




2





Could you explain what all of that means for readers who are interested, but don't speak assembler?

– Nico Haase
Jan 29 at 14:00





Could you explain what all of that means for readers who are interested, but don't speak assembler?

– Nico Haase
Jan 29 at 14:00




7




7





@NicoHaase The two commented lines are the only important ones. In the first section, the code is performing an integer multiplication, whereas the second section, the code is performing an integer division. If you think about doing multiplication and division by hand, when you multiply you're usually doing a bunch of small multiplications and then one big set of additions, but division is a small division, a small multiplication, a subtraction, and repeat. Division is slow because you're essentially doing a bunch of multiplications.

– MBraedley
Jan 29 at 15:05






@NicoHaase The two commented lines are the only important ones. In the first section, the code is performing an integer multiplication, whereas the second section, the code is performing an integer division. If you think about doing multiplication and division by hand, when you multiply you're usually doing a bunch of small multiplications and then one big set of additions, but division is a small division, a small multiplication, a subtraction, and repeat. Division is slow because you're essentially doing a bunch of multiplications.

– MBraedley
Jan 29 at 15:05





4




4





@MBraedley thanks for your input, but such explanation should be added to the answer itself and not be hidden in the comment section

– Nico Haase
Jan 29 at 15:16





@MBraedley thanks for your input, but such explanation should be added to the answer itself and not be hidden in the comment section

– Nico Haase
Jan 29 at 15:16




6




6





@MBraedley: More to the point, multiplication in a modern CPU is fast because the partial products are independent and can thus be computed separately, while each stage of a division is dependent upon the preceding stages.

– supercat
Jan 29 at 20:34





@MBraedley: More to the point, multiplication in a modern CPU is fast because the partial products are independent and can thus be computed separately, while each stage of a division is dependent upon the preceding stages.

– supercat
Jan 29 at 20:34











24














As others have noted, the general modulus operation requires a division to be done. In some cases, the division can be replaced (by the compiler) by a multiplication. But both can be slow compared to addition/subtraction. Hence, the best performance can be expected by something along these lines:



long progressCheck = 50000;

long counter = progressCheck;

for (long i = startNum; i <= stopNum; i++)
if (--counter == 0)
System.out.println(i);
counter = progressCheck;




(As a minor optmiziation attempt we use a pre-decrement down-counter here because on many architectures comparing to 0 after an arithmetic operation costs exactly 0 instructions/CPU cycles because the ALU's flags are already set appropriately by the preceeding operation. A decent optimizing compiler will, however, do that optimization automatically if you write if (counter++ == 50000) ... counter = 0; .)



Notice that often you don't really want/need modulus, because you know that your loop counter (i) or whatever is only ever incremented by 1, and you really don't care about the actual remainder the modulus will give you, just see if the incrementing-by-one counter hits some value.



Another 'trick' is to use power-of-two values/limits, e.g. progressCheck = 1024;. Modulus a power of two can be quickly calculated via bitwise and, i.e. if ( (i & (1024-1)) == 0 ) .... This should be pretty fast too, and may on some architectures outperform the explicit counter above.






share|improve this answer




















  • 3





    A smart compiler would invert the loops here. Or you could do that in the source. The if() body becomes an outer-loop body, and the stuff outside the if() becomes an inner loop body that runs for min(progressCheck, stopNum-i) iterations. So at the start, and every time counter reaches 0, you do long next_stop = i + min(progressCheck, stopNum-i); to set up for a for(; i< next_stop; i++) loop. In this case that inner loop is empty and should hopefully optimize away entirely, of you can do that in the source and make it easy for the JITer, reducing your loop to i+=50k.

    – Peter Cordes
    Jan 29 at 0:26






  • 2





    But yes, in general a down-counter is a good efficient technique for fizzbuzz / progresscheck type stuff.

    – Peter Cordes
    Jan 29 at 0:26











  • I added to my question, and did an increment, the --counter is just as fast as my increment version, but less code.also it was 1 lower than it should be, i'm curious if it should be counter-- to get the exact number you want, not that it's much of a difference

    – Robert Cotterman
    Jan 29 at 3:36











  • @PeterCordes A smart compiler would just print the numbers, no loop at all. (I think some only slightly more trivial benchmarks started to fail that way maybe 10 years ago.)

    – Peter A. Schneider
    Jan 29 at 8:19







  • 2





    @RobertCotterman Yes, --counter is off by one. counter-- will give you exactly the progressCheck number of iterations (or you could set progressCheck = 50001; of course).

    – JimmyB
    Jan 29 at 14:33















24














As others have noted, the general modulus operation requires a division to be done. In some cases, the division can be replaced (by the compiler) by a multiplication. But both can be slow compared to addition/subtraction. Hence, the best performance can be expected by something along these lines:



long progressCheck = 50000;

long counter = progressCheck;

for (long i = startNum; i <= stopNum; i++)
if (--counter == 0)
System.out.println(i);
counter = progressCheck;




(As a minor optmiziation attempt we use a pre-decrement down-counter here because on many architectures comparing to 0 after an arithmetic operation costs exactly 0 instructions/CPU cycles because the ALU's flags are already set appropriately by the preceeding operation. A decent optimizing compiler will, however, do that optimization automatically if you write if (counter++ == 50000) ... counter = 0; .)



Notice that often you don't really want/need modulus, because you know that your loop counter (i) or whatever is only ever incremented by 1, and you really don't care about the actual remainder the modulus will give you, just see if the incrementing-by-one counter hits some value.



Another 'trick' is to use power-of-two values/limits, e.g. progressCheck = 1024;. Modulus a power of two can be quickly calculated via bitwise and, i.e. if ( (i & (1024-1)) == 0 ) .... This should be pretty fast too, and may on some architectures outperform the explicit counter above.






share|improve this answer




















  • 3





    A smart compiler would invert the loops here. Or you could do that in the source. The if() body becomes an outer-loop body, and the stuff outside the if() becomes an inner loop body that runs for min(progressCheck, stopNum-i) iterations. So at the start, and every time counter reaches 0, you do long next_stop = i + min(progressCheck, stopNum-i); to set up for a for(; i< next_stop; i++) loop. In this case that inner loop is empty and should hopefully optimize away entirely, of you can do that in the source and make it easy for the JITer, reducing your loop to i+=50k.

    – Peter Cordes
    Jan 29 at 0:26






  • 2





    But yes, in general a down-counter is a good efficient technique for fizzbuzz / progresscheck type stuff.

    – Peter Cordes
    Jan 29 at 0:26











  • I added to my question, and did an increment, the --counter is just as fast as my increment version, but less code.also it was 1 lower than it should be, i'm curious if it should be counter-- to get the exact number you want, not that it's much of a difference

    – Robert Cotterman
    Jan 29 at 3:36











  • @PeterCordes A smart compiler would just print the numbers, no loop at all. (I think some only slightly more trivial benchmarks started to fail that way maybe 10 years ago.)

    – Peter A. Schneider
    Jan 29 at 8:19







  • 2





    @RobertCotterman Yes, --counter is off by one. counter-- will give you exactly the progressCheck number of iterations (or you could set progressCheck = 50001; of course).

    – JimmyB
    Jan 29 at 14:33













24












24








24







As others have noted, the general modulus operation requires a division to be done. In some cases, the division can be replaced (by the compiler) by a multiplication. But both can be slow compared to addition/subtraction. Hence, the best performance can be expected by something along these lines:



long progressCheck = 50000;

long counter = progressCheck;

for (long i = startNum; i <= stopNum; i++)
if (--counter == 0)
System.out.println(i);
counter = progressCheck;




(As a minor optmiziation attempt we use a pre-decrement down-counter here because on many architectures comparing to 0 after an arithmetic operation costs exactly 0 instructions/CPU cycles because the ALU's flags are already set appropriately by the preceeding operation. A decent optimizing compiler will, however, do that optimization automatically if you write if (counter++ == 50000) ... counter = 0; .)



Notice that often you don't really want/need modulus, because you know that your loop counter (i) or whatever is only ever incremented by 1, and you really don't care about the actual remainder the modulus will give you, just see if the incrementing-by-one counter hits some value.



Another 'trick' is to use power-of-two values/limits, e.g. progressCheck = 1024;. Modulus a power of two can be quickly calculated via bitwise and, i.e. if ( (i & (1024-1)) == 0 ) .... This should be pretty fast too, and may on some architectures outperform the explicit counter above.






share|improve this answer















As others have noted, the general modulus operation requires a division to be done. In some cases, the division can be replaced (by the compiler) by a multiplication. But both can be slow compared to addition/subtraction. Hence, the best performance can be expected by something along these lines:



long progressCheck = 50000;

long counter = progressCheck;

for (long i = startNum; i <= stopNum; i++)
if (--counter == 0)
System.out.println(i);
counter = progressCheck;




(As a minor optmiziation attempt we use a pre-decrement down-counter here because on many architectures comparing to 0 after an arithmetic operation costs exactly 0 instructions/CPU cycles because the ALU's flags are already set appropriately by the preceeding operation. A decent optimizing compiler will, however, do that optimization automatically if you write if (counter++ == 50000) ... counter = 0; .)



Notice that often you don't really want/need modulus, because you know that your loop counter (i) or whatever is only ever incremented by 1, and you really don't care about the actual remainder the modulus will give you, just see if the incrementing-by-one counter hits some value.



Another 'trick' is to use power-of-two values/limits, e.g. progressCheck = 1024;. Modulus a power of two can be quickly calculated via bitwise and, i.e. if ( (i & (1024-1)) == 0 ) .... This should be pretty fast too, and may on some architectures outperform the explicit counter above.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 29 at 14:19

























answered Jan 28 at 22:22









JimmyBJimmyB

9,47811738




9,47811738







  • 3





    A smart compiler would invert the loops here. Or you could do that in the source. The if() body becomes an outer-loop body, and the stuff outside the if() becomes an inner loop body that runs for min(progressCheck, stopNum-i) iterations. So at the start, and every time counter reaches 0, you do long next_stop = i + min(progressCheck, stopNum-i); to set up for a for(; i< next_stop; i++) loop. In this case that inner loop is empty and should hopefully optimize away entirely, of you can do that in the source and make it easy for the JITer, reducing your loop to i+=50k.

    – Peter Cordes
    Jan 29 at 0:26






  • 2





    But yes, in general a down-counter is a good efficient technique for fizzbuzz / progresscheck type stuff.

    – Peter Cordes
    Jan 29 at 0:26











  • I added to my question, and did an increment, the --counter is just as fast as my increment version, but less code.also it was 1 lower than it should be, i'm curious if it should be counter-- to get the exact number you want, not that it's much of a difference

    – Robert Cotterman
    Jan 29 at 3:36











  • @PeterCordes A smart compiler would just print the numbers, no loop at all. (I think some only slightly more trivial benchmarks started to fail that way maybe 10 years ago.)

    – Peter A. Schneider
    Jan 29 at 8:19







  • 2





    @RobertCotterman Yes, --counter is off by one. counter-- will give you exactly the progressCheck number of iterations (or you could set progressCheck = 50001; of course).

    – JimmyB
    Jan 29 at 14:33












  • 3





    A smart compiler would invert the loops here. Or you could do that in the source. The if() body becomes an outer-loop body, and the stuff outside the if() becomes an inner loop body that runs for min(progressCheck, stopNum-i) iterations. So at the start, and every time counter reaches 0, you do long next_stop = i + min(progressCheck, stopNum-i); to set up for a for(; i< next_stop; i++) loop. In this case that inner loop is empty and should hopefully optimize away entirely, of you can do that in the source and make it easy for the JITer, reducing your loop to i+=50k.

    – Peter Cordes
    Jan 29 at 0:26






  • 2





    But yes, in general a down-counter is a good efficient technique for fizzbuzz / progresscheck type stuff.

    – Peter Cordes
    Jan 29 at 0:26











  • I added to my question, and did an increment, the --counter is just as fast as my increment version, but less code.also it was 1 lower than it should be, i'm curious if it should be counter-- to get the exact number you want, not that it's much of a difference

    – Robert Cotterman
    Jan 29 at 3:36











  • @PeterCordes A smart compiler would just print the numbers, no loop at all. (I think some only slightly more trivial benchmarks started to fail that way maybe 10 years ago.)

    – Peter A. Schneider
    Jan 29 at 8:19







  • 2





    @RobertCotterman Yes, --counter is off by one. counter-- will give you exactly the progressCheck number of iterations (or you could set progressCheck = 50001; of course).

    – JimmyB
    Jan 29 at 14:33







3




3





A smart compiler would invert the loops here. Or you could do that in the source. The if() body becomes an outer-loop body, and the stuff outside the if() becomes an inner loop body that runs for min(progressCheck, stopNum-i) iterations. So at the start, and every time counter reaches 0, you do long next_stop = i + min(progressCheck, stopNum-i); to set up for a for(; i< next_stop; i++) loop. In this case that inner loop is empty and should hopefully optimize away entirely, of you can do that in the source and make it easy for the JITer, reducing your loop to i+=50k.

– Peter Cordes
Jan 29 at 0:26





A smart compiler would invert the loops here. Or you could do that in the source. The if() body becomes an outer-loop body, and the stuff outside the if() becomes an inner loop body that runs for min(progressCheck, stopNum-i) iterations. So at the start, and every time counter reaches 0, you do long next_stop = i + min(progressCheck, stopNum-i); to set up for a for(; i< next_stop; i++) loop. In this case that inner loop is empty and should hopefully optimize away entirely, of you can do that in the source and make it easy for the JITer, reducing your loop to i+=50k.

– Peter Cordes
Jan 29 at 0:26




2




2





But yes, in general a down-counter is a good efficient technique for fizzbuzz / progresscheck type stuff.

– Peter Cordes
Jan 29 at 0:26





But yes, in general a down-counter is a good efficient technique for fizzbuzz / progresscheck type stuff.

– Peter Cordes
Jan 29 at 0:26













I added to my question, and did an increment, the --counter is just as fast as my increment version, but less code.also it was 1 lower than it should be, i'm curious if it should be counter-- to get the exact number you want, not that it's much of a difference

– Robert Cotterman
Jan 29 at 3:36





I added to my question, and did an increment, the --counter is just as fast as my increment version, but less code.also it was 1 lower than it should be, i'm curious if it should be counter-- to get the exact number you want, not that it's much of a difference

– Robert Cotterman
Jan 29 at 3:36













@PeterCordes A smart compiler would just print the numbers, no loop at all. (I think some only slightly more trivial benchmarks started to fail that way maybe 10 years ago.)

– Peter A. Schneider
Jan 29 at 8:19






@PeterCordes A smart compiler would just print the numbers, no loop at all. (I think some only slightly more trivial benchmarks started to fail that way maybe 10 years ago.)

– Peter A. Schneider
Jan 29 at 8:19





2




2





@RobertCotterman Yes, --counter is off by one. counter-- will give you exactly the progressCheck number of iterations (or you could set progressCheck = 50001; of course).

– JimmyB
Jan 29 at 14:33





@RobertCotterman Yes, --counter is off by one. counter-- will give you exactly the progressCheck number of iterations (or you could set progressCheck = 50001; of course).

– JimmyB
Jan 29 at 14:33











3














I am also surprised by seeing the performance of the above codes. It's all about the time taken by the compiler for executing the program as per the declared variable. In the second (inefficient) example:



for (long i = startNum; i <= stopNum; i++) 
if (i % progressCheck == 0)
System.out.println(i)




You are performing the modulus operation between two variables. Here, compiler has to check the value of stopNum and progressCheck to go to the specific memory block located for these variables every time after each iteration because it is a variable and its value might be change.



That's why after each iteration compiler went to the memory location to check the latest value of the variables. Therefore at the compile time the compiler was not able to create efficient byte code.



In the first code example, you are performing modulus operator between a variable and a constant numeric value which is not going to change within execution and compiler no need to check the value of that numeric value from the memory location. That's why compiler was able to create efficient byte code. If you declare progressCheck as a final or as a final static variable then at the time of run-time/compile-time compiler know that it's a final variable and its value not going to change then compiler replace the progressCheck with 50000 in code:



for (long i = startNum; i <= stopNum; i++) 
if (i % 50000== 0)
System.out.println(i)




Now you can see that this code also looks like the first (efficient) code example. The performance of first code and as we mentioned above both code will work efficiently. There will not be much difference in execution time of either code example.






share|improve this answer




















  • 1





    There is a HUGE difference, albeit i was doing the operation a trillion times, so over 1 trillion operations it saved 89% time to do the "efficient" code. mind you if you're only doing it a few thousand times, were talking such a tiny difference, it's probably not a big deal. i mean over 1000 operations it would save you 1 millionth of 7 seconds.

    – Robert Cotterman
    Jan 29 at 3:38







  • 1





    @Bishal Dubey "There will not be much difference in execution time of both code." Did you read the question?

    – gfos
    Jan 29 at 16:27











  • "That's why after each iteration compiler went to the memory location to check the latest value of the variables" - Unless the variable was declared volatile the 'compiler' will not read its value from RAM over and over again.

    – JimmyB
    Feb 10 at 10:07















3














I am also surprised by seeing the performance of the above codes. It's all about the time taken by the compiler for executing the program as per the declared variable. In the second (inefficient) example:



for (long i = startNum; i <= stopNum; i++) 
if (i % progressCheck == 0)
System.out.println(i)




You are performing the modulus operation between two variables. Here, compiler has to check the value of stopNum and progressCheck to go to the specific memory block located for these variables every time after each iteration because it is a variable and its value might be change.



That's why after each iteration compiler went to the memory location to check the latest value of the variables. Therefore at the compile time the compiler was not able to create efficient byte code.



In the first code example, you are performing modulus operator between a variable and a constant numeric value which is not going to change within execution and compiler no need to check the value of that numeric value from the memory location. That's why compiler was able to create efficient byte code. If you declare progressCheck as a final or as a final static variable then at the time of run-time/compile-time compiler know that it's a final variable and its value not going to change then compiler replace the progressCheck with 50000 in code:



for (long i = startNum; i <= stopNum; i++) 
if (i % 50000== 0)
System.out.println(i)




Now you can see that this code also looks like the first (efficient) code example. The performance of first code and as we mentioned above both code will work efficiently. There will not be much difference in execution time of either code example.






share|improve this answer




















  • 1





    There is a HUGE difference, albeit i was doing the operation a trillion times, so over 1 trillion operations it saved 89% time to do the "efficient" code. mind you if you're only doing it a few thousand times, were talking such a tiny difference, it's probably not a big deal. i mean over 1000 operations it would save you 1 millionth of 7 seconds.

    – Robert Cotterman
    Jan 29 at 3:38







  • 1





    @Bishal Dubey "There will not be much difference in execution time of both code." Did you read the question?

    – gfos
    Jan 29 at 16:27











  • "That's why after each iteration compiler went to the memory location to check the latest value of the variables" - Unless the variable was declared volatile the 'compiler' will not read its value from RAM over and over again.

    – JimmyB
    Feb 10 at 10:07













3












3








3







I am also surprised by seeing the performance of the above codes. It's all about the time taken by the compiler for executing the program as per the declared variable. In the second (inefficient) example:



for (long i = startNum; i <= stopNum; i++) 
if (i % progressCheck == 0)
System.out.println(i)




You are performing the modulus operation between two variables. Here, compiler has to check the value of stopNum and progressCheck to go to the specific memory block located for these variables every time after each iteration because it is a variable and its value might be change.



That's why after each iteration compiler went to the memory location to check the latest value of the variables. Therefore at the compile time the compiler was not able to create efficient byte code.



In the first code example, you are performing modulus operator between a variable and a constant numeric value which is not going to change within execution and compiler no need to check the value of that numeric value from the memory location. That's why compiler was able to create efficient byte code. If you declare progressCheck as a final or as a final static variable then at the time of run-time/compile-time compiler know that it's a final variable and its value not going to change then compiler replace the progressCheck with 50000 in code:



for (long i = startNum; i <= stopNum; i++) 
if (i % 50000== 0)
System.out.println(i)




Now you can see that this code also looks like the first (efficient) code example. The performance of first code and as we mentioned above both code will work efficiently. There will not be much difference in execution time of either code example.






share|improve this answer















I am also surprised by seeing the performance of the above codes. It's all about the time taken by the compiler for executing the program as per the declared variable. In the second (inefficient) example:



for (long i = startNum; i <= stopNum; i++) 
if (i % progressCheck == 0)
System.out.println(i)




You are performing the modulus operation between two variables. Here, compiler has to check the value of stopNum and progressCheck to go to the specific memory block located for these variables every time after each iteration because it is a variable and its value might be change.



That's why after each iteration compiler went to the memory location to check the latest value of the variables. Therefore at the compile time the compiler was not able to create efficient byte code.



In the first code example, you are performing modulus operator between a variable and a constant numeric value which is not going to change within execution and compiler no need to check the value of that numeric value from the memory location. That's why compiler was able to create efficient byte code. If you declare progressCheck as a final or as a final static variable then at the time of run-time/compile-time compiler know that it's a final variable and its value not going to change then compiler replace the progressCheck with 50000 in code:



for (long i = startNum; i <= stopNum; i++) 
if (i % 50000== 0)
System.out.println(i)




Now you can see that this code also looks like the first (efficient) code example. The performance of first code and as we mentioned above both code will work efficiently. There will not be much difference in execution time of either code example.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 30 at 17:42









double-beep

2,3254924




2,3254924










answered Jan 28 at 19:55









Bishal DubeyBishal Dubey

1503




1503







  • 1





    There is a HUGE difference, albeit i was doing the operation a trillion times, so over 1 trillion operations it saved 89% time to do the "efficient" code. mind you if you're only doing it a few thousand times, were talking such a tiny difference, it's probably not a big deal. i mean over 1000 operations it would save you 1 millionth of 7 seconds.

    – Robert Cotterman
    Jan 29 at 3:38







  • 1





    @Bishal Dubey "There will not be much difference in execution time of both code." Did you read the question?

    – gfos
    Jan 29 at 16:27











  • "That's why after each iteration compiler went to the memory location to check the latest value of the variables" - Unless the variable was declared volatile the 'compiler' will not read its value from RAM over and over again.

    – JimmyB
    Feb 10 at 10:07












  • 1





    There is a HUGE difference, albeit i was doing the operation a trillion times, so over 1 trillion operations it saved 89% time to do the "efficient" code. mind you if you're only doing it a few thousand times, were talking such a tiny difference, it's probably not a big deal. i mean over 1000 operations it would save you 1 millionth of 7 seconds.

    – Robert Cotterman
    Jan 29 at 3:38







  • 1





    @Bishal Dubey "There will not be much difference in execution time of both code." Did you read the question?

    – gfos
    Jan 29 at 16:27











  • "That's why after each iteration compiler went to the memory location to check the latest value of the variables" - Unless the variable was declared volatile the 'compiler' will not read its value from RAM over and over again.

    – JimmyB
    Feb 10 at 10:07







1




1





There is a HUGE difference, albeit i was doing the operation a trillion times, so over 1 trillion operations it saved 89% time to do the "efficient" code. mind you if you're only doing it a few thousand times, were talking such a tiny difference, it's probably not a big deal. i mean over 1000 operations it would save you 1 millionth of 7 seconds.

– Robert Cotterman
Jan 29 at 3:38






There is a HUGE difference, albeit i was doing the operation a trillion times, so over 1 trillion operations it saved 89% time to do the "efficient" code. mind you if you're only doing it a few thousand times, were talking such a tiny difference, it's probably not a big deal. i mean over 1000 operations it would save you 1 millionth of 7 seconds.

– Robert Cotterman
Jan 29 at 3:38





1




1





@Bishal Dubey "There will not be much difference in execution time of both code." Did you read the question?

– gfos
Jan 29 at 16:27





@Bishal Dubey "There will not be much difference in execution time of both code." Did you read the question?

– gfos
Jan 29 at 16:27













"That's why after each iteration compiler went to the memory location to check the latest value of the variables" - Unless the variable was declared volatile the 'compiler' will not read its value from RAM over and over again.

– JimmyB
Feb 10 at 10:07





"That's why after each iteration compiler went to the memory location to check the latest value of the variables" - Unless the variable was declared volatile the 'compiler' will not read its value from RAM over and over again.

– JimmyB
Feb 10 at 10:07

















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54405842%2fwhy-is-if-variable1-variable2-0-inefficient%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown






Popular posts from this blog

How to check contact read email or not when send email to Individual?

How many registers does an x86_64 CPU actually have?

Nur Jahan