Faster Unsigned Integer Arithmetic

CPUs are able to perform bitwise operations on unsigned integers much faster than arithmetic operation – generally, a single bitwise operation in your source code will translate to a single machine code instruction, this instruction will be executed by a CPU core in a single clock cycle.

Here are some bitwise operations you can use in place of arithmetic:

I ÷ 2N  =  I >> N ‘right shift’ of N bits on I
(the result will always be the floor value)

I × 2N = I << N ‘left shift’ of N bits on I

I mod (N+1) = I & N bitwise ‘and’ of I and N

don’t trust anything without empirical evidence
– always test for actual performance gain in your own applications!

Microarchitecture-specific
performance optimization

It’s good practice to search the documentation of intended target architectures for SIMD (Single Instruction, Multiple Data) arithmetic blocks. These IP blocks implements logic for the sole purpose of significantly speeding up the execution of arithmetic operations on integers and floating point numbers.

These blocks will have their own CPU instruction set which you can take advantage of using the associated library provisioned by the CPU vendor.

Examples of SIMD instruction set ‘extentions’ include:

  • SSE
  • SSE2
  • SSE3
  • SSE4
  • SSE4.1
  • SSE4.2
  • SSE4a
  • POPCNT
  • LZCNT
  • AVX
  • AVX2
  • etc.

See:

  • https://www.cs.uaf.edu/2011/fall/cs301/intel_x64_megadoc_2011.pdf
  • http://www.intel.com/Assets/PDF/manual/248966.pdf
  • http://sci.tuomastonteri.fi/programming/sse
  • http://softpixel.com/~cwright/programming/simd/sse.php
  • https://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/

Note: One more thing to consider is the performance advantage gained through the particular compiler being used to compile your source code for the target architecture… and the tuning and optimization parameters being passed to it!

Even without specifying tuning/optimization parameters, the compiler performance-gain may not only vary between target architectures, but also between one program and another – because of how well the compiler logic optimizes important/most-often-executed segments of the code.

See:

https://software.intel.com/en-us/forums/intel-c-compiler/topic/327585
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=13712