Fast floating-point random functions with a good distribution are essential in a number of applications, from games to machine learning. In this article, I will be benchmarking the fastest floating-point random implementations for both uniform ranged randoms and Box-Muller normal distributions.

Introducing the random functions I will be testing in uniform 0–1 range:
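The implementations themselves are in the gist linked below; as a rough sketch, the classic `/ RAND_MAX` approach (what this article later refers to as `rand_float1()`) looks something like this, with `rand_range()` as a hypothetical helper for rescaling to an arbitrary range:

```c
#include <stdlib.h>

/* Sketch of the classic "/ RAND_MAX" approach (the article's
   rand_float1()); the full set of five variants is in the gist. */
static float rand_float1(void)
{
    return (float)rand() / (float)RAND_MAX;   /* uniform in [0, 1] */
}

/* Hypothetical helper: rescale a [0, 1] sample to [lo, hi],
   e.g. the -20 to +20 range used in the benchmarks below. */
static float rand_range(float lo, float hi)
{
    return lo + rand_float1() * (hi - lo);
}
```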

That’s five different methods of producing a floating-point random, and the most interesting is the last one, `rand_float5()`, a function I obtained from the Ogre3D source code, originally located in the asm_math.h file. This file is no longer included in recent distributions of the Ogre3D source code, but below I have a screenshot of its original form before I re-implemented it using intrinsic functions:
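I can’t reproduce the asm_math.h routine here, but a common bit-trick in the same spirit (an illustration, not the Ogre3D code) packs 23 random mantissa bits under a fixed exponent; the `xorshift32` generator below is an assumed stand-in bit source:

```c
#include <stdint.h>
#include <string.h>

/* Assumed stand-in bit source; the real Ogre3D routine used MMX. */
static uint32_t rng_state = 0x12345678u;

static uint32_t xorshift32(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Bit-trick float random: OR 23 random mantissa bits under the
   exponent of 1.0f to get a float in [1, 2), then subtract 1. */
static float rand_float_bits(void)
{
    uint32_t bits = 0x3F800000u | (xorshift32() >> 9);
    float f;
    memcpy(&f, &bits, sizeof f);  /* type-pun without aliasing UB */
    return f - 1.0f;              /* uniform in [0, 1) */
}
```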

So let’s cut to the chase, shall we? Here are the benchmarks of each random function when using the uniform range of -20 to +20:

Compiler Flags: none

rand_float1() Cycles: 24

rand_float2() Cycles: 4,740

rand_float3() Cycles: 37

rand_float4() Cycles: 36

rand_float5() Cycles: 18

Compiler Flags: -Ofast

rand_float1() Cycles: 23

rand_float2() Cycles: 4,762

rand_float3() Cycles: 34

rand_float4() Cycles: 31

rand_float5() Cycles: 17

We can see that the 23-bit range functions are always a little more expensive for the extra bits of range, but it’s not a deal breaker.

However, the pièce de résistance has to be the “magic MMX thing” that I originally fished out of the Ogre3D source code, which had been documented on the internet as early as 2002 *(19 years ago!)*, coming in at a respectable ~30% faster than the classic `rand_float1()`. Not bad.

You can verify the results yourself, the source code is on gist here.

There is also a version of this code with a microsecond timer benchmark for those who would like to compare against the RDTSC benchmark here.
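For context, an RDTSC cycle benchmark can be sketched roughly like this (my actual harness is in the gist; `cycles_per_call` and `example_rand` are hypothetical names, and `__rdtsc()` is the GCC/Clang x86 intrinsic):

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

/* Stand-in function under test. */
static float example_rand(void)
{
    return (float)rand() / (float)RAND_MAX;
}

/* Time iters calls and report the average cycle cost. */
static uint64_t cycles_per_call(float (*fn)(void), int iters)
{
    volatile float sink;             /* keep calls from being optimized out */
    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; i++)
        sink = fn();
    (void)sink;
    return (__rdtsc() - start) / (uint64_t)iters;
}
```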

Benchmarking the Box-Muller transformation to produce a normal distribution:

---- with regular FPU sqrt

Compiler Flags: none

rand_normal_float1() Cycles: 80

rand_normal_float2() Cycles: 11,829

rand_normal_float3() Cycles: 121

rand_normal_float4() Cycles: 121

rand_normal_float5() Cycles: 94

Compiler Flags: -Ofast

rand_normal_float1() Cycles: 75

rand_normal_float2() Cycles: 11,951

rand_normal_float3() Cycles: 117

rand_normal_float4() Cycles: 110

rand_normal_float5() Cycles: 52

---- with intrinsic sqrtps

Compiler Flags: none

rand_normal_float1_sqrtps() Cycles: 82

rand_normal_float2_sqrtps() Cycles: 11,936

rand_normal_float3_sqrtps() Cycles: 128

rand_normal_float4_sqrtps() Cycles: 125

rand_normal_float5_sqrtps() Cycles: 97

Compiler Flags: -Ofast

rand_normal_float1_sqrtps() Cycles: 69

rand_normal_float2_sqrtps() Cycles: 11,934

rand_normal_float3_sqrtps() Cycles: 116

rand_normal_float4_sqrtps() Cycles: 112

rand_normal_float5_sqrtps() Cycles: 53

With -Ofast enabled, the MMX-based `rand_normal_float5()` is the clear-cut winner here; otherwise it’s `rand_normal_float1()`, a.k.a. the `/ RAND_MAX` version. Using the intrinsic sqrtps over the regular sqrt didn’t really work out any better, although the performance improvement of sqrtps probably does exist and would be more noticeable over a larger time frame; rsqrtss was 11% faster than 1.f/sqrt() over 16 seconds of sampling, so I am going to assume sqrtps is a very similar case. It is a small margin, and it’s hard to say whether even an 11% difference would be noticeable in practice: in real-world use you don’t tend to thrash the function like you would in a benchmark, so that 11% is probably the best margin you’d ever expect, even in the most extreme use cases as of 2021.
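To illustrate the sqrtps swap, a minimal scalar wrapper around the packed intrinsic might look like the following (an illustration, not my benchmark code; note that `_mm_sqrt_ps` computes four square roots at once, so the real win only shows up when samples are batched):

```c
#include <immintrin.h>

/* Scalar wrapper around the packed SSE square root. */
static float sqrt_via_sqrtps(float x)
{
    __m128 v = _mm_set_ss(x);             /* x in lane 0, zeros elsewhere */
    return _mm_cvtss_f32(_mm_sqrt_ps(v)); /* sqrtps on all four lanes */
}
```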

Below you can see Box-Muller transform code that is used:

You can also test the Box-Muller results yourself with the source code on gist here.

So, to wrap this up: if you want speed, are happy with a 15-bit range, and you have MMX, then use `rand_float5()`; otherwise stick to `rand_float1()`. The same applies to the Box-Muller normal variants.

As for the entropy, it is what you’d expect: on Windows, rand() has a very small range of 32,767, but on Linux using GCC I get a range of 2,147,483,647. You could always sample /dev/urandom for a higher-entropy seed on some time-delay interval, as a lightweight method of injecting a bit more entropy; an example is provided:
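A minimal sketch of that idea, assuming a Linux-style /dev/urandom (the function name here is my own, not from the article’s code):

```c
#include <stdio.h>
#include <stdlib.h>

/* Pull a seed from /dev/urandom and feed it to srand(),
   e.g. on a timer interval. */
static void reseed_from_urandom(void)
{
    unsigned int seed = 1;   /* fallback if /dev/urandom is unavailable */
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(&seed, sizeof seed, 1, f) != 1)
            seed = 1;        /* keep the fallback on a short read */
        fclose(f);
    }
    srand(seed);
}
```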

That’s all folks.

**October 2022 Update:** SEIR and INIGO randoms are also performant.