Efficient Bit Manipulation in Go 1.9
Exploring the math/bits package
Go 1.9, released in August 2017, introduced the math/bits package. This package provides optimized functions for counting and manipulating the bits of unsigned integers. These are useful in low-level environments where the individual bits stored in an integer matter. Let's take a look at some of the functions.
First up, Len returns the minimum number of bits required to represent a value x.
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	var a uint = 31
	fmt.Printf("bits.Len(%d) = %d\n", a, bits.Len(a))
	a++
	fmt.Printf("bits.Len(%d) = %d\n", a, bits.Len(a))
}
Which outputs:
bits.Len(31) = 5
bits.Len(32) = 6
This is because 31 is 11111 in binary, and 32 is 100000. The most significant set bit is in position 5 and position 6, respectively, counting positions from the right.
This also allows us to easily illustrate the OnesCount function, which returns the number of bits set to 1 in a given number:
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	var a uint = 31
	fmt.Printf("bits.OnesCount(%d) = %d\n", a, bits.OnesCount(a))
	a++
	fmt.Printf("bits.OnesCount(%d) = %d\n", a, bits.OnesCount(a))
}
which outputs:
bits.OnesCount(31) = 5
bits.OnesCount(32) = 1
We also have LeadingZeros, TrailingZeros, Reverse and ReverseBytes. Let's use all of these in a single program to demonstrate how they work. In the next example, we use the uint16 versions of these functions to keep the output short and readable. We could just as easily use the 32-bit, 64-bit or unspecified versions.
package main

import (
	"fmt"
	"math/bits"
	"os"
	"text/tabwriter"
)

func main() {
	// read an unsigned integer from stdin
	var i uint16
	_, err := fmt.Scanf("%d", &i)
	if err != nil {
		i = 100
		fmt.Println("Defaulting to 100")
	}

	// create a tabwriter that formats the output in
	// tab-separated columns with a tab stop of 8
	w := new(tabwriter.Writer)
	w.Init(os.Stdout, 0, 8, 0, '\t', 0)

	fmt.Fprintf(w, "Binary:\t%0.16bb\n", i)
	fmt.Fprintf(w, "Len:\t%16d\n", bits.Len16(i))
	fmt.Fprintf(w, "Ones count:\t%16d\n", bits.OnesCount16(i))
	fmt.Fprintf(w, "Leading zeros:\t%16d\n", bits.LeadingZeros16(i))
	fmt.Fprintf(w, "Trailing zeros:\t%16d\n", bits.TrailingZeros16(i))
	fmt.Fprintf(w, "Reversed:\t%0.16bb\n", bits.Reverse16(i))
	fmt.Fprintf(w, "Reversed Bytes:\t%0.16bb\n", bits.ReverseBytes16(i))
	w.Flush()
}
An example run with 123 as the input gives us the following:
$ go run bits.go
123
Binary: 0000000001111011b
Len: 7
Ones count: 6
Leading zeros: 9
Trailing zeros: 0
Reversed: 1101111000000000b
Reversed Bytes: 0111101100000000b
It's not difficult to see what these functions do: LeadingZeros returns the number of zeros to the left of the most significant set bit. TrailingZeros returns the number of zeros to the right of the least significant set bit. Reverse returns the number with all bits in reverse order, and ReverseBytes reverses the number byte by byte, that is, in 8-bit blocks.
How does it perform?
This is all well and good, and it can be useful at times. But some of these functions are relatively simple to implement manually. How much more efficient are the library implementations, really?
Let's compare the performance of OnesCount64 with that of a naive algorithm for counting the number of set bits. Below is a simple benchmark that compares a naive implementation, which uses bit shifting to count the set bits one at a time, with the built-in bits.OnesCount64:
package main

import (
	"math/bits"
	"testing"
)

func NaiveOnesCount64(x uint64) int {
	var c, i uint64
	for i = 0; i < 64; i++ {
		c += (x >> i) & 1
	}
	return int(c)
}

var numbers = []uint64{
	0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
	100, 200, 300, 400, 500, 600, 700, 800, 900,
	11111, 22222, 33333, 44444, 55555, 66666, 77777, 88888, 99999,
	190239012390, 123901312, 6549654, 45904059,
}

// Output is a global sink variable so that the benchmark results
// do not get elided by the compiler.
var Output int

func TestNaiveOnesCount64(t *testing.T) {
	for _, x := range numbers {
		got := NaiveOnesCount64(x)
		want := bits.OnesCount64(x)
		if got != want {
			t.Errorf("NaiveOnesCount64(%d) = %d, want %d", x, got, want)
		}
	}
}

func BenchmarkNaiveOnesCount64(b *testing.B) {
	// run NaiveOnesCount64 b.N times
	var s int
	for n := 0; n < b.N; n++ {
		// use &31 to avoid a modulo operation
		s += NaiveOnesCount64(numbers[n&31])
	}
	Output = s
}

func BenchmarkBitsOnesCount64(b *testing.B) {
	// run bits.OnesCount64 b.N times
	var s int
	for n := 0; n < b.N; n++ {
		// use &31 to avoid a modulo operation
		s += bits.OnesCount64(numbers[n&31])
	}
	Output = s
}
Running the comparison on a 64-bit Linux machine, we get:
$ go test -bench=Count
goos: linux
goarch: amd64
BenchmarkNaiveOnesCount64-4 20000000 71.6 ns/op
BenchmarkBitsOnesCount64-4 2000000000 1.15 ns/op
PASS
ok _/home/code/bits 3.946s
The built-in OnesCount is orders of magnitude faster! This is because the built-in function uses an algorithm called parallel summing of adjacent bits. A full description of this is given in Chapter 5 of Hacker's Delight, Counting Bits. (Correction: on x86-64 systems it actually uses the POPCNT instruction; see the note at the bottom of this post.)
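To give a flavour of how parallel summing works, here is a sketch of the classic technique (not the standard library's actual source): each step treats the word as a vector of 2-, 4-, then 8-bit fields and sums adjacent fields in parallel, so the whole count takes a handful of operations instead of 64 iterations.

```go
package main

import "fmt"

// popCount64 counts set bits by summing adjacent bit fields in
// parallel: first 2-bit pairs, then 4-bit nibbles, then bytes.
func popCount64(x uint64) int {
	x -= (x >> 1) & 0x5555555555555555                             // sum adjacent pairs
	x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333) // sum adjacent nibbles
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f                        // sum adjacent bytes
	return int((x * 0x0101010101010101) >> 56)                     // add all byte sums
}

func main() {
	fmt.Println(popCount64(31)) // 5
	fmt.Println(popCount64(32)) // 1
}
```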
Below is another benchmark comparison, this time for TrailingZeros64:
package main

import (
	"math/bits"
	"testing"
)

// NaiveTrailingZeros64 returns the number of trailing zero bits in x;
// the result is 64 for x == 0.
func NaiveTrailingZeros64(x uint64) int {
	var c, i uint64
	for i = 0; i < 64; i++ {
		if (x>>i)&1 == 1 {
			break
		}
		c++
	}
	return int(c)
}

var nums = []uint64{
	0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
	100, 200, 300, 400, 500, 600, 700, 800, 900,
	11111, 22222, 33333, 44444, 55555, 66666, 77777, 88888, 99999,
	190239012390, 123901312, 6549654, 45904059,
}

// Output is a global sink variable so that the benchmark results
// do not get elided by the compiler.
var Output int

func TestNaiveTrailingZeros64(t *testing.T) {
	for _, x := range nums {
		got := NaiveTrailingZeros64(x)
		want := bits.TrailingZeros64(x)
		if got != want {
			t.Errorf("NaiveTrailingZeros64(%d) = %d, want %d", x, got, want)
		}
	}
}

func BenchmarkNaiveTrailingZeros64(b *testing.B) {
	// run NaiveTrailingZeros64 b.N times
	s := 0
	for n := 0; n < b.N; n++ {
		// use &31 to avoid a modulo operation
		s += NaiveTrailingZeros64(nums[n&31])
	}
	Output = s
}

func BenchmarkBitsTrailingZeros64(b *testing.B) {
	// run bits.TrailingZeros64 b.N times
	s := 0
	for n := 0; n < b.N; n++ {
		// use &31 to avoid a modulo operation
		s += bits.TrailingZeros64(nums[n&31])
	}
	Output = s
}
Running this we get:
$ go test -bench=Trailing
goos: linux
goarch: amd64
BenchmarkNaiveTrailingZeros64-4 300000000 5.86 ns/op
BenchmarkBitsTrailingZeros64-4 2000000000 1.00 ns/op
PASS
ok _/home/code/bits 4.460s
Again, we see that TrailingZeros64 is significantly more efficient than the naive solution, though not by as much in this case. This time the reason is that it implements an algorithm first described by Charles E. Leiserson, Harald Prokop and Keith H. Randall in 1998, using what are called de Bruijn sequences. It's a neat trick, and I recommend reading the original paper. (Correction: again, it was pointed out that this is not true on most 64-bit systems; see the note at the bottom of the post.)
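The heart of the de Bruijn trick can be sketched as follows. Isolating the lowest set bit with x & -x leaves a single power of two; multiplying the de Bruijn constant by it shifts the sequence left by that power, and because every 6-bit window of the sequence is unique, the top 6 bits of the product index a table that recovers the shift. The constant and table below are the well-known 64-bit ones (as used in Go's fallback path, to the best of my knowledge):

```go
package main

import "fmt"

// deBruijn64 is a 64-bit de Bruijn sequence: every 6-bit window
// of it, as it is shifted left, is unique.
const deBruijn64 = 0x03f79d71b4ca8b09

// deBruijn64tab maps each 6-bit window back to the shift that
// produced it.
var deBruijn64tab = [64]byte{
	0, 1, 56, 2, 57, 49, 28, 3, 61, 58, 42, 50, 38, 29, 17, 4,
	62, 47, 59, 36, 45, 43, 51, 22, 53, 39, 33, 30, 24, 18, 12, 5,
	63, 55, 48, 27, 60, 41, 37, 16, 46, 35, 44, 21, 52, 32, 23, 11,
	54, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6,
}

// trailingZeros64 isolates the lowest set bit with x & -x, then
// uses the de Bruijn multiplication to find its position.
func trailingZeros64(x uint64) int {
	if x == 0 {
		return 64
	}
	return int(deBruijn64tab[(x&-x)*deBruijn64>>58])
}

func main() {
	fmt.Println(trailingZeros64(1))  // 0
	fmt.Println(trailingZeros64(32)) // 5
}
```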
Example Use Case
Now that we have seen that the built-in functions are much faster, you may still be asking "why should I care?" Bit manipulation isn't all that common of a thing to need, and we still have to shift left and right with the << and >> operators when we do need it, so this is a fair question.
One example where this has already made an impact is in Will Fitzgerald's bitset package. A blog post by Daniel Lemire shows how switching to the new bits package made the Count function run twice as fast, even though the previous implementation relied on hand-tuned assembly code!
$ go test -run=XXX -bench Count >old.txt
$ git checkout go19
$ go get golang.org/x/tools/cmd/benchcmp
$ go test -run=XXX -bench Count >new.txt
$ benchcmp old.txt new.txt
benchmark old ns/op new ns/op delta
BenchmarkCount-4 2676 1352 -49.48%
BenchmarkLemireCount-4 2569228 1401137 -45.46%
Bitsets are commonly used in all sorts of memory- and space-conscious applications, from databases (e.g. bloom filters) to priority queues and stream processing. The use of the new bits package in bitsets is one example of the positive impact it will have on the performance of many Go applications.
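To make that concrete, here is a minimal sketch (not the bitset package's actual code) of how a bitset stored as a slice of uint64 words can count its elements with bits.OnesCount64. The countBits helper and the element numbering are illustrative assumptions:

```go
package main

import (
	"fmt"
	"math/bits"
)

// countBits returns the number of set bits across all words of a
// bitset stored as a slice of uint64.
func countBits(set []uint64) int {
	n := 0
	for _, w := range set {
		n += bits.OnesCount64(w)
	}
	return n
}

func main() {
	set := make([]uint64, 4) // a 256-bit set
	set[0] |= 1 << 3         // add element 3
	set[2] |= 1 << 10        // add element 138 (2*64 + 10)
	fmt.Println(countBits(set)) // 2
}
```

A real bitset Count would look much the same: one OnesCount64 per word, which is exactly the hot loop the benchmark above measures.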
Conclusion
The new bit manipulation package in Go 1.9 wasn't celebrated all that much, but it is pretty exciting if you do low-level programming in Go. It can provide quick efficiency gains if you are already using some of these common functions in a hot path in your code. Kudos to Robert Griesemer, Roman Budnikov and Lynn Boger, who worked on getting this into the standard library.
Update: some great feedback from giovannibajo and klauspost on r/golang. Apparently much of the efficiency of OnesCount and TrailingZeros on 64-bit systems comes from the compiler emitting the POPCNT and BSF instructions when the CPU supports them. The algorithms mentioned in this article are fallbacks for systems that don't.