Benchmarking random float functions.
Fast floating-point functions with a good distribution are essential in a number of applications from games to machine learning. In this article, I will be benchmarking the fastest floating-point random implementations for both uniform ranged randoms and Box-Muller normal distributions.
Introducing the random functions I will be testing in uniform 0–1 range:
That’s five different methods of producing a floating-point random that I will be testing. The most interesting is the last one, rand_float5(), which I obtained from the Ogre3D source code, where it was originally located in the asm_math.h file. This file is no longer included in recent distributions of the Ogre3D source code, but below I have a screenshot of its original form before I re-implemented it using intrinsic functions:
So let’s cut to the chase shall we? Here are the benchmarks of each random function when using the uniform range of -20 to +20:
Compiler Flags: none
rand_float1() Cycles: 24
rand_float2() Cycles: 4,740
rand_float3() Cycles: 37
rand_float4() Cycles: 36
rand_float5() Cycles: 18
Compiler Flags: -Ofast
rand_float1() Cycles: 23
rand_float2() Cycles: 4,762
rand_float3() Cycles: 34
rand_float4() Cycles: 31
rand_float5() Cycles: 17
We can see that the 23-bit range functions are always a little more expensive for the extra bits of range but it’s not a deal breaker.
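To make the 15-bit versus 23-bit trade-off concrete, here is a minimal sketch of both styles. These are not the exact rand_float functions from the gist: uniform_classic() and uniform_23bit() are hypothetical names, and the mantissa trick assumes IEEE 754 single-precision floats.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: classic rand() / RAND_MAX scaled into [min, max].
   The resolution is limited by RAND_MAX (only 15 bits on some platforms). */
static float uniform_classic(float min, float max)
{
    assert(max > min);
    return min + ((float)rand() / (float)RAND_MAX) * (max - min);
}

/* Hypothetical helper: fill all 23 mantissa bits of a float in [1, 2),
   then shift it down to [0, 1) and scale it into the requested range.
   Assumes IEEE 754 single-precision floats. */
static float uniform_23bit(float min, float max)
{
    assert(max > min);
    /* Combine two rand() calls, since one call may only give 15 bits. */
    uint32_t r = ((uint32_t)rand() << 15) ^ (uint32_t)rand();
    /* 0x3F800000 is the bit pattern of 1.0f; OR in 23 random mantissa bits. */
    uint32_t bits = 0x3F800000u | (r & 0x007FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f); /* well-defined way to reinterpret the bits */
    return min + (f - 1.0f) * (max - min);
}
```

Calling uniform_classic(-20.f, 20.f) or uniform_23bit(-20.f, 20.f) after srand() gives the -20 to +20 range used in the benchmarks above. The second version needs an extra rand() call and some bit work, which is the kind of overhead you pay for the extra bits of range.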
However, the pièce de résistance has to be the “magic MMX thing” that I originally fished out of the Ogre3D source code, and which had been documented on the internet as early as 2002 (19 years ago!!). It ranks in at a respectable ~30% faster than the classic rand_float1(). Not bad.
You can verify the results yourself; the source code is on gist here.
There is also a version of this code with a microsecond timer benchmark for those who would like to compare against the RDTSC benchmark here.
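For anyone who just wants the shape of the microsecond-timer approach, a minimal sketch could look like the following. This is not the gist code: now_us(), rand_unit() and bench_us() are hypothetical names, and it assumes a POSIX system where clock_gettime() with CLOCK_MONOTONIC is available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in microseconds (POSIX). */
static uint64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Hypothetical function under test: the classic rand() / RAND_MAX float. */
static float rand_unit(void)
{
    return (float)rand() / (float)RAND_MAX;
}

/* Time `iterations` calls of fn and return the elapsed microseconds.
   The results are summed into *sink so the calls have a visible effect. */
static uint64_t bench_us(float (*fn)(void), unsigned iterations, float *sink)
{
    assert(fn != NULL && sink != NULL);
    float acc = 0.0f;
    uint64_t start = now_us();
    for (unsigned i = 0; i < iterations; i++)
        acc += fn();
    uint64_t end = now_us();
    *sink = acc;
    return end - start;
}
```

Something like bench_us(rand_unit, 1000000, &sink) then gives a wall-clock figure to set next to the cycle counts. Wall-clock time includes scheduler noise, so it is worth repeating the run a few times before reading much into it.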
Benchmarking the Box-Muller transformation to produce a normal distribution:
---- with regular FPU sqrt
Compiler Flags: none
rand_normal_float1() Cycles: 80
rand_normal_float2() Cycles: 11,829
rand_normal_float3() Cycles: 121
rand_normal_float4() Cycles: 121
rand_normal_float5() Cycles: 94
Compiler Flags: -Ofast
rand_normal_float1() Cycles: 75
rand_normal_float2() Cycles: 11,951
rand_normal_float3() Cycles: 117
rand_normal_float4() Cycles: 110
rand_normal_float5() Cycles: 52
---- with intrinsic sqrtps
Compiler Flags: none
rand_normal_float1_sqrtps() Cycles: 82
rand_normal_float2_sqrtps() Cycles: 11,936
rand_normal_float3_sqrtps() Cycles: 128
rand_normal_float4_sqrtps() Cycles: 125
rand_normal_float5_sqrtps() Cycles: 97
Compiler Flags: -Ofast
rand_normal_float1_sqrtps() Cycles: 69
rand_normal_float2_sqrtps() Cycles: 11,934
rand_normal_float3_sqrtps() Cycles: 116
rand_normal_float4_sqrtps() Cycles: 112
rand_normal_float5_sqrtps() Cycles: 53
With -Ofast enabled, the MMX rand_normal_float5() is the clear-cut winner here; otherwise it’s rand_normal_float1(), also known as the “/ RAND_MAX” version. Using an intrinsic sqrtps over the regular sqrt didn’t really work out any better, although a performance improvement from sqrtps probably does exist and would just be more noticeable over a larger time frame: rsqrtss was 11% faster than 1.f/sqrt() over 16 seconds of sampling, so I am going to assume this will be a very similar case for sqrtps. It is a small margin, and it’s hard to say if even an 11% difference would be noticeable over time, since in real-world use you don’t tend to thrash the function like you would in a benchmark. That 11% probably really is the best margin you’d ever expect in the most extreme use cases as of 2021.
Below you can see the Box-Muller transform code that is used:
You can also test the Box-Muller results yourself with the source code on gist here.
So to wrap this up: if you want speed, are happy with a 15-bit range, and you have MMX, then use rand_float5(); otherwise stick to rand_float1(). That also applies to the Box-Muller normal variants.
As for the entropy, it is what you’d expect: on Windows rand() has a very small range of 32,767, but on Linux using GCC I get a range of 2,147,483,647. You could always try sampling /dev/urandom for a higher-entropy seed on some time-delay interval, as a lightweight method of injecting a bit more entropy; an example is provided:
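A minimal sketch of that reseeding idea, assuming a Unix-like system that exposes /dev/urandom. The helper names urandom_seed(), reseed_rand() and reseed_every() are hypothetical, and the time() fallback is my own assumption for platforms without the device.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: read one seed value from /dev/urandom.
   Returns 1 on success, 0 if the device could not be opened or read. */
static int urandom_seed(unsigned int *seed)
{
    assert(seed != NULL);
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return 0;
    size_t got = fread(seed, sizeof *seed, 1, f);
    fclose(f);
    return got == 1;
}

/* Hypothetical helper: reseed rand(), falling back to time() when
   /dev/urandom is unavailable (for example on Windows). */
static void reseed_rand(void)
{
    unsigned int seed;
    if (!urandom_seed(&seed))
        seed = (unsigned int)time(NULL);
    srand(seed);
}

/* Reseed at most once every `interval_s` seconds, so the file read
   stays off the hot path for almost every call. */
static void reseed_every(time_t interval_s)
{
    static time_t last = 0;
    time_t now = time(NULL);
    if (last == 0 || now - last >= interval_s) {
        reseed_rand();
        last = now;
    }
}
```

Calling reseed_every(60) before drawing a batch of randoms would refresh the seed roughly once a minute. Note that this only changes where the sequence starts; the range of rand() itself stays the same.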
That’s all folks.