Efficient Bit Manipulation in Go 1.9

Exploring the math/bits package

Go 1.9, released in August 2017, introduced the math/bits package. This package provides optimized functions for bit counting and manipulation of unsigned integers. These are useful in low-level code where the individual bits of an integer matter. Let’s take a look at some of the functions.

First up, Len returns the minimum number of bits required to represent a value x.

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	var a uint = 31
	fmt.Printf("bits.Len(%d) = %d\n", a, bits.Len(a))

	a++
	fmt.Printf("bits.Len(%d) = %d\n", a, bits.Len(a))
}

Which outputs:

bits.Len(31) = 5
bits.Len(32) = 6

This is because 31 is 11111 in binary, and 32 is 100000. The most significant set bit is in position 5 and 6, respectively, counting positions from the right.

This also allows us to easily illustrate the OnesCount function, which returns the number of bits set to 1 in a given number:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	var a uint = 31
	fmt.Printf("bits.OnesCount(%d) = %d\n", a, bits.OnesCount(a))

	a++
	fmt.Printf("bits.OnesCount(%d) = %d\n", a, bits.OnesCount(a))
}

which outputs:

bits.OnesCount(31) = 5
bits.OnesCount(32) = 1
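For comparison, a common way to count set bits by hand is Kernighan’s trick: x & (x-1) clears the lowest set bit, so the loop runs once per 1 bit. This is just a sketch for intuition, not how the library implements it:

```go
package main

import "fmt"

// kernighanOnesCount counts set bits by repeatedly clearing
// the lowest set bit; the loop body runs once per 1 bit.
func kernighanOnesCount(x uint) int {
	c := 0
	for x != 0 {
		x &= x - 1 // clear the lowest set bit
		c++
	}
	return c
}

func main() {
	fmt.Println(kernighanOnesCount(31)) // 11111  -> 5
	fmt.Println(kernighanOnesCount(32)) // 100000 -> 1
}
```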

We also have LeadingZeros, TrailingZeros, Reverse and ReverseBytes. Let’s use all of these in a single program to demonstrate how they work. In the next example, we use the uint16 version of these functions to keep the output short and readable. We could just as easily use the 32-bit, 64-bit or unspecified versions.

package main

import (
	"fmt"
	"math/bits"
	"os"
	"text/tabwriter"
)

func main() {
	// read an unsigned integer from stdin
	var i uint16
	_, err := fmt.Scanf("%d", &i)
	if err != nil {
		i = 100
		fmt.Println("Defaulting to 100")
	}

	// create a tabwriter for formatting the output
	w := new(tabwriter.Writer)

	// Format in tab-separated columns with a tab stop of 8.
	w.Init(os.Stdout, 0, 8, 0, '\t', 0)

	fmt.Fprintf(w, "Binary:\t%0.16bb\n", i)
	fmt.Fprintf(w, "Len:\t%16d\n", bits.Len16(i))
	fmt.Fprintf(w, "Ones count:\t%16d\n", bits.OnesCount16(i))
	fmt.Fprintf(w, "Leading zeros:\t%16d\n", bits.LeadingZeros16(i))
	fmt.Fprintf(w, "Trailing zeros:\t%16d\n", bits.TrailingZeros16(i))
	fmt.Fprintf(w, "Reversed:\t%0.16bb\n", bits.Reverse16(i))
	fmt.Fprintf(w, "Reversed Bytes:\t%0.16bb\n", bits.ReverseBytes16(i))

	w.Flush()
}

An example run with 123 as the input gives us the following:

$ echo 123 | go run bits.go
Binary:         0000000001111011b
Len:                           7
Ones count:                    6
Leading zeros:                 9
Trailing zeros:                0
Reversed:       1101111000000000b
Reversed Bytes: 0111101100000000b

It’s not difficult to see what these functions do: LeadingZeros returns the number of zeros to the left of the most significant set bit. TrailingZeros returns the number of zeros to the right of the least significant set bit. Reverse returns the number with all bits in reverse order, and ReverseBytes reverses the number byte by byte, that is, in 8-bit blocks.

How does it perform?

This is all well and good, and it can be useful at times. But some of these functions are relatively simple to implement manually. How much more efficient are the library implementations, really? Let’s compare the performance of OnesCount64 with that of a naive algorithm for counting the number of set bits. Below is a simple benchmark that compares a naive implementation, which uses bit shifting to count the number of set bits, to the built-in bits.OnesCount64:

package main

import (
	"math/bits"
	"testing"
)

func NaiveOnesCount64(x uint64) int {
	var c, i uint64
	for i = 0; i < 64; i++ {
		c += (x >> i) & 1
	}
	return int(c)
}

var numbers = []uint64{
	0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
	100, 200, 300, 400, 500, 600, 700, 800, 900,
	11111, 22222, 33333, 44444, 55555, 66666, 77777, 88888, 99999,
	190239012390, 123901312, 6549654, 45904059,
}

// Output is a global sink variable so that benchmark results
// do not get elided by the compiler.
var Output int

func TestNaiveOnesCount64(t *testing.T) {
	m := len(numbers)
	for n := 0; n < m; n++ {
		x := numbers[n%m]
		got := NaiveOnesCount64(x)
		want := bits.OnesCount64(x)
		if got != want {
			t.Errorf("NaiveOnesCount64(%d) = %d, want %d", x, got, want)
		}
	}
}

func BenchmarkNaiveOnesCount64(b *testing.B) {
	// run NaiveOnesCount64 b.N times
	var s int
	for n := 0; n < b.N; n++ {
		// use &31 to avoid a modulo operation
		s += NaiveOnesCount64(numbers[n&31])
	}
	Output = s
}

func BenchmarkBitsOnesCount64(b *testing.B) {
	// run bits.OnesCount64 b.N times
	var s int
	for n := 0; n < b.N; n++ {
		// use &31 to avoid a modulo operation
		s += bits.OnesCount64(numbers[n&31])
	}
	Output = s
}

Running the comparison on a 64-bit Linux machine, we get:

$ go test -bench=Count
goos: linux
goarch: amd64
BenchmarkNaiveOnesCount64-4   	20000000	        71.6 ns/op
BenchmarkBitsOnesCount64-4    	2000000000	         1.15 ns/op
PASS
ok  	_/home/code/bits	3.946s

The built-in OnesCount64 is over 60 times faster! This is because the built-in function uses a technique called parallel summing of adjacent bits. A full description is given in Chapter 5 of Hacker’s Delight, “Counting Bits”. (Correction: on x86-64 systems, it actually uses the POPCNT instruction; see the note at the bottom of this post.)
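The fallback algorithm can be sketched as follows. This mirrors the well-known SWAR popcount from Hacker’s Delight; the actual library code differs in its details:

```go
package main

import "fmt"

// popcount64 sums adjacent bits in parallel: first within 2-bit
// pairs, then 4-bit fields, then bytes; the final multiply folds
// all per-byte counts into the top byte.
func popcount64(x uint64) int {
	x = x - ((x >> 1) & 0x5555555555555555)
	x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f
	return int((x * 0x0101010101010101) >> 56)
}

func main() {
	fmt.Println(popcount64(31))  // 11111   -> 5
	fmt.Println(popcount64(32))  // 100000  -> 1
	fmt.Println(popcount64(123)) // 1111011 -> 6
}
```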

Below is another benchmark comparison, this time for TrailingZeros64:

package main

import (
	"math/bits"
	"testing"
)

// NaiveTrailingZeros64 returns the number of trailing zero bits in x;
// the result is 64 for x == 0.
func NaiveTrailingZeros64(x uint64) int {
	var c, i uint64
	for i = 0; i < 64; i++ {
		if (x>>i)&1 == 1 {
			break
		}
		c++
	}
	return int(c)
}

var nums = []uint64{
	0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
	100, 200, 300, 400, 500, 600, 700, 800, 900,
	11111, 22222, 33333, 44444, 55555, 66666, 77777, 88888, 99999,
	190239012390, 123901312, 6549654, 45904059,
}

// Output is a global sink variable so that benchmark results
// do not get elided by the compiler.
var Output int

func TestNaiveTrailingZeros64(t *testing.T) {
	m := len(nums)
	for n := 0; n < m; n++ {
		x := nums[n%m]
		got := NaiveTrailingZeros64(x)
		want := bits.TrailingZeros64(x)
		if got != want {
			t.Errorf("NaiveTrailingZeros64(%d) = %d, want %d", x, got, want)
		}
	}
}

func BenchmarkNaiveTrailingZeros64(b *testing.B) {
	// run NaiveTrailingZeros64 b.N times
	s := 0
	for n := 0; n < b.N; n++ {
		s += NaiveTrailingZeros64(nums[n&31])
	}
	Output = s
}

func BenchmarkBitsTrailingZeros64(b *testing.B) {
	// run bits.TrailingZeros64 b.N times
	s := 0
	for n := 0; n < b.N; n++ {
		s += bits.TrailingZeros64(nums[n&31])
	}
	Output = s
}

Running this we get:

$ go test -bench=Trailing
goos: linux
goarch: amd64
BenchmarkNaiveTrailingZeros64-4   	300000000	         5.86 ns/op
BenchmarkBitsTrailingZeros64-4    	2000000000	         1.00 ns/op
PASS
ok  	_/home/code/bits	4.460s

Again, we see that bits.TrailingZeros64 is significantly more efficient than the naive solution, though not by as much in this case. This time the reason is that it implements an algorithm first described by Charles E. Leiserson, Harald Prokop and Keith H. Randall in 1998, based on what are called De Bruijn sequences. It’s a neat trick, and I recommend reading the original paper. (Correction: again, it was pointed out that this is not true on most 64-bit systems; see the note at the bottom of this post.)

Example Use Case

Now that we have seen that the built-in functions are much faster, you may still be asking “why should I care?” Bit manipulation isn’t all that common a need, and when we do need it we still have to shift left and right with the << and >> operators, so this is a fair question. One example where the package has already made an impact is Will Fitzgerald’s bitset package. A blog post by Daniel Lemire shows how switching to the new bits package made the Count function run twice as fast, and the previous implementation relied on hand-tuned assembly code!

$ go test -run=XXX -bench Count > old.txt
$ git checkout go19
$ go get golang.org/x/tools/cmd/benchcmp
$ go test -run=XXX -bench Count > new.txt
$ benchcmp old.txt new.txt
benchmark                  old ns/op     new ns/op     delta
BenchmarkCount-4           2676          1352          -49.48%
BenchmarkLemireCount-4     2569228       1401137       -45.46%

Bitsets are commonly used in all sorts of memory- and space-conscious applications, from databases (e.g. Bloom filters) to priority queues and stream processing. The adoption of the new bits package in bitsets is an early example of the positive impact it will have on the performance of many Go applications.
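To make the connection concrete, here is a minimal sketch of how a bitset’s Count might lean on the package; the names here are illustrative, not the actual bitset API:

```go
package main

import (
	"fmt"
	"math/bits"
)

// count returns the number of set bits in a bitset stored
// as a slice of 64-bit words, one OnesCount64 call per word.
func count(set []uint64) int {
	c := 0
	for _, w := range set {
		c += bits.OnesCount64(w)
	}
	return c
}

func main() {
	set := []uint64{31, 32, 0} // 5 + 1 + 0 set bits
	fmt.Println(count(set))    // 6
}
```

Because each word is handled by a single optimized call (a POPCNT instruction on supporting CPUs), the whole loop compiles down to very little work per 64 bits.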

Conclusion

The new bit manipulation package in Go 1.9 wasn’t celebrated all that much, but it is pretty exciting if you do low-level programming in Go. It can provide quick efficiency gains if you are already using some of these common functions in a hot path in your code. Kudos to Robert Griesemer, Roman Budnikov and Lynn Boger, who worked on getting this into the standard library.

Update: some great feedback from giovannibajo and klauspost on r/golang - apparently a lot of the efficiency of the OnesCount and TrailingZeros on 64-bit systems comes from calling POPCNT and BSF instructions when the CPU supports it. The algorithms mentioned in the article are a fallback if the system does not support these methods.

I’m a London-based software engineer who likes writing (about) Go. If you like this article, feel free to reach out on Twitter; over there I’m known as @ironzeb. I’m currently writing Production Go, a book that walks through building a modern production-ready application in Go.

Have something you would like to comment or add to this post? Feel free to write me at hermanREMOVE.schaREMOVEaf@gmail.com.