Performance 2023-06-12

Improving Code Performance with Go

Learn practical Go performance optimization: pprof profiling, benchmark tests, escape analysis, memory allocation reduction, and real examples from optimizing a custom HTTP router.


This is the 9th article of the Makuake Advent Calendar 2022!

While working on improving the performance of goblin, my custom HTTP router, I tackled performance tuning in Go. In this article, I will write about the approaches and practices I applied along the way.

Prerequisite Knowledge

Deeper tuning requires deeper knowledge, but here I will cover only the minimum necessary background.

Approach to Performance Improvement

There is the prior question of whether a performance improvement is necessary at all (is it worth sacrificing readability? can the application definitively be identified as the bottleneck?), but I will proceed on the assumption that it is.

Several methods for improving code performance come to mind. However, before implementing any of them, you should first measure and analyze. (I will not touch on how to judge whether a performance improvement is necessary, as that varies with individual needs.)

I will introduce packages and tools for measurement and analysis in Go.

Benchmark

In Go, support for benchmarking code is included in the standard testing package as Benchmarks.

For example, executing the following code with the command go test -bench=. -benchmem will obtain benchmarks.

package main

import (
	"math/rand"
	"testing"
)

func BenchmarkRandIn(b *testing.B) {
	for i := 0; i < b.N; i++ { // b.N is adjusted automatically so the benchmark runs long enough to be statistically reliable
		rand.Int() // The function to be measured
	}
}

The output will look like this:

goos: darwin
goarch: amd64
pkg: bmf-san/go-perfomance-tips
cpu: VirtualApple @ 2.50GHz
BenchmarkRandIn-8       87550500                13.53 ns/op            0 B/op          0 allocs/op
PASS
ok      bmf-san/go-perfomance-tips      1.381s

From the benchmark results, we can read the following:

  1. the number of iterations executed (87550500)
  2. the time per operation (13.53 ns/op)
  3. the bytes allocated per operation (0 B/op)
  4. the number of heap allocations per operation (0 allocs/op)

Go makes it easy to obtain benchmarks in this way.

For more information on other Go benchmark features, refer to the Benchmarks section of the testing package documentation.

As a tool to compare benchmark results, the benchstat package can be used; it displays the delta between two sets of benchmark results along with its statistical significance.

In my managed repository, bmf-san/goblin, I have integrated it into CI to compare results before and after commits.

// This is an example where nothing has improved...
go test -bench . -benchmem -count 1 > new.out
benchstat old.out new.out
name           old time/op    new time/op    delta
Static1-36        248ns ± 0%     246ns ± 0%   ~     (p=1.000 n=1+1)
Static5-36        502ns ± 0%     495ns ± 0%   ~     (p=1.000 n=1+1)
Static10-36       874ns ± 0%     881ns ± 0%   ~     (p=1.000 n=1+1)
WildCard1-36     1.60µs ± 0%    1.50µs ± 0%   ~     (p=1.000 n=1+1)
WildCard5-36     4.49µs ± 0%    4.92µs ± 0%   ~     (p=1.000 n=1+1)
WildCard10-36    7.68µs ± 0%    7.58µs ± 0%   ~     (p=1.000 n=1+1)
Regexp1-36       1.38µs ± 0%    1.48µs ± 0%   ~     (p=1.000 n=1+1)
Regexp5-36       4.30µs ± 0%    4.11µs ± 0%   ~     (p=1.000 n=1+1)
Regexp10-36      7.66µs ± 0%    7.18µs ± 0%   ~     (p=1.000 n=1+1)

If you absolutely cannot allow a performance regression, it might be a good idea to add a mechanism that fails CI when one is detected.

If you want to see the actual memory allocation behavior after looking at benchmark results like these, you can confirm it by building with specific compiler flags. Increasing the number of -m flags passed to -gcflags produces more detailed output.

package main

import "fmt"

// Run build with go build -o example -gcflags '-m' gcflagsexample.go
func main() {
	a := "hello"
	b := "world"
	fmt.Println(a + b)
}

When you execute go build -o example -gcflags '-m' gcflagsexample.go, you will get the following output.

# command-line-arguments
./gcflagsexample.go:9:13: inlining call to fmt.Println
./gcflagsexample.go:9:13: ... argument does not escape
./gcflagsexample.go:9:16: a + b escapes to heap
./gcflagsexample.go:9:16: a + b escapes to heap

This is a simple example, so the result is obvious at a glance, but identifying heap allocations this way and then reducing them is a useful method of analysis for improving memory usage.

Profiling

Go has a tool called pprof for analyzing where the bottlenecks are at the function level.

package main

import (
	"sort"
	"testing"
)

func sortAlphabetically() {
	s := []string{"abc", "aba", "cba", "acb"}
	sort.Strings(s)
}

func BenchmarkSortAlphabetically(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sortAlphabetically()
	}
}

To see the CPU profile, execute the following:

go test -test.bench=BenchmarkSortAlphabetically -cpuprofile cpu.out && go tool pprof -http=":8888" cpu.out

cpu_profile

To see the memory profile, execute the following:

go test -test.bench=BenchmarkSortAlphabetically profilingexample_test.go -memprofile mem.out && go tool pprof -http=":8889" mem.out

memory_profile

By utilizing the UI of pprof, it becomes easier to identify where the bottlenecks are in the processing.

Practice

I will provide an example of improvements made to my custom HTTP Router, goblin.

The PR in question: Reduce the memory allocation by refactoring explodePath method #68

goblin is a trie-based HTTP router that is compatible with the net/http interface.

It has the minimum features that are thought necessary for routing. cf. goblin#features

Benchmark

First, I will run benchmark tests to measure performance.

go test -bench=. -cpu=1 -benchmem

The benchmark tests prepare three test cases each for static routing (e.g., /foo/bar), dynamic routing (e.g., /foo/:bar), and routing using regular expressions (e.g., /foo/:bar[^\d+$]).

The flow of routing processing is as follows:

  1. Insert data into the tree structure (i.e., define the routes)
  2. Search the tree structure for data (return data matching the requested path)

However, these test cases measure only the latter.

The output results are as follows.

goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1         5072353               240.1 ns/op           128 B/op          4 allocs/op
BenchmarkStatic5         2491546               490.0 ns/op           384 B/op          6 allocs/op
BenchmarkStatic10        1653658               729.6 ns/op           720 B/op          7 allocs/op
BenchmarkWildCard1       1602606               747.3 ns/op           456 B/op          9 allocs/op
BenchmarkWildCard5        435784              2716 ns/op            1016 B/op         23 allocs/op
BenchmarkWildCard10       246729              5033 ns/op            1680 B/op         35 allocs/op
BenchmarkRegexp1         1647258               733.2 ns/op           456 B/op          9 allocs/op
BenchmarkRegexp5          456652              2641 ns/op            1016 B/op         23 allocs/op
BenchmarkRegexp10         251998              4780 ns/op            1680 B/op         35 allocs/op
PASS
ok      github.com/bmf-san/goblin       14.304s

Trends can be observed in the number of executions, time per execution, memory size per execution, and the number of memory allocations.

I personally find it concerning that memory allocations occur even with static routing. (Looking at benchmarks of other HTTP Routers, some show 0 allocs.)

Profiling

Next, I will use pprof to obtain profiles.

This time, I will focus only on memory to obtain the profile.

go test -bench . -memprofile mem.out && go tool pprof -http=":8889" mem.out

The graph output is as follows.

pprof_graph

It is clear that the process consuming the most memory is explodePath.

Looking at Top (a list sorted by flat consumption), explodePath is at the top.

pprof_top

Flat is the amount consumed by the function itself, while Cum is the cumulative amount, which also includes the functions it calls.

Furthermore, I will check which part of the processing within the function is heavy using Source.

pprof_source

Since Search is the core processing responsible for the router's matching, I suspected it would be the biggest bottleneck, and it turned out that a specific piece of processing within it, explodePath, was the culprit.

Tuning

explodePath is a process that splits the received string by / and returns it as a []string type.

// explodePath removes an empty value in slice.
func explodePath(path string) []string {
	s := strings.Split(path, pathDelimiter)
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

For clarity, I also included test code.

func TestExplodePath(t *testing.T) {
	cases := []struct {
		actual   []string
		expected []string
	}{
		{
			actual:   explodePath(""),
			expected: nil,
		},
		{
			actual:   explodePath("/"),
			expected: nil,
		},
		{
			actual:   explodePath("//"),
			expected: nil,
		},
		{
			actual:   explodePath("///"),
			expected: nil,
		},
		{
			actual:   explodePath("/foo"),
			expected: []string{"foo"},
		},
		{
			actual:   explodePath("/foo/bar"),
			expected: []string{"foo", "bar"},
		},
		{
			actual:   explodePath("/foo/bar/baz"),
			expected: []string{"foo", "bar", "baz"},
		},
		{
			actual:   explodePath("/foo/bar/baz/"),
			expected: []string{"foo", "bar", "baz"},
		},
	}

	for _, c := range cases {
		if !reflect.DeepEqual(c.actual, c.expected) {
			t.Errorf("actual:%v expected:%v", c.actual, c.expected)
		}
	}
}

The variable r, defined as []string, has no capacity specified, so it is presumably inefficient in terms of memory: each time append exceeds the current capacity, a new backing array must be allocated.

Below is a benchmark test I prepared to verify the effect of slice capacity on append, along with its results.

package main

import "testing"

func BenchmarkSliceLen0Cap0(b *testing.B) {
	var s []int

	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLenN(b *testing.B) {
	var s = make([]int, b.N)

	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLen0CapN(b *testing.B) {
	var s = make([]int, 0, b.N)

	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkSliceLen0Cap0  100000000               11.88 ns/op           45 B/op          0 allocs/op
BenchmarkSliceLenN      78467056                23.60 ns/op           65 B/op          0 allocs/op
BenchmarkSliceLen0CapN  616491007                5.057 ns/op           8 B/op          0 allocs/op
PASS
ok      example.com     6.898s

From this result, it can be inferred that specifying the capacity will lead to more efficient code.

Therefore, I modified explodePath as follows.

func explodePath(path string) []string {
	s := strings.Split(path, "/")
	// var r []string
	r := make([]string, 0, strings.Count(path, "/")) // Specify capacity
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

Going a bit further, I improved the implementation.

func explodePath(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}

I will compare benchmarks of three patterns: the original explodePath, the implementation with specified slice capacity, and the implementation using strings.FieldsFunc.

package main

import (
	"strings"
	"testing"
)

func explodePath(path string) []string {
	s := strings.Split(path, "/")
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathCap(path string) []string {
	s := strings.Split(path, "/")
	r := make([]string, 0, strings.Count(path, "/"))
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathFieldsFunc(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}

func BenchmarkExplodePath(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePath(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathCap(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathCap(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathFieldsFunc(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathFieldsFunc(v)
		}
	}
	b.StopTimer()
}

goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkExplodePath             1690340               722.2 ns/op           432 B/op         12 allocs/op
BenchmarkExplodePathCap          1622161               729.5 ns/op           416 B/op         11 allocs/op
BenchmarkExplodePathFieldsFunc   4948364               239.5 ns/op            96 B/op          3 allocs/op
PASS
ok      example.com     5.685s

The implementation using strings.FieldsFunc seems to have the best performance, so I will adopt it.

Effect Measurement

Let’s check the results after improving the implementation of explodePath.

Benchmark

# Before Improvement
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1         5072353               240.1 ns/op           128 B/op          4 allocs/op
BenchmarkStatic5         2491546               490.0 ns/op           384 B/op          6 allocs/op
BenchmarkStatic10        1653658               729.6 ns/op           720 B/op          7 allocs/op
BenchmarkWildCard1       1602606               747.3 ns/op           456 B/op          9 allocs/op
BenchmarkWildCard5        435784              2716 ns/op            1016 B/op         23 allocs/op
BenchmarkWildCard10       246729              5033 ns/op            1680 B/op         35 allocs/op
BenchmarkRegexp1         1647258               733.2 ns/op           456 B/op          9 allocs/op
BenchmarkRegexp5          456652              2641 ns/op            1016 B/op         23 allocs/op
BenchmarkRegexp10         251998              4780 ns/op            1680 B/op         35 allocs/op
PASS
ok      github.com/bmf-san/goblin       14.304s

# After Improvement
go test -bench=. -cpu=1 -benchmem -count=1
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1        10310658               117.7 ns/op            32 B/op          1 allocs/op
BenchmarkStatic5         4774347               258.1 ns/op            96 B/op          1 allocs/op
BenchmarkStatic10        2816960               435.8 ns/op           176 B/op          1 allocs/op
BenchmarkWildCard1       1867770               653.4 ns/op           384 B/op          6 allocs/op
BenchmarkWildCard5        496778              2484 ns/op             928 B/op         13 allocs/op
BenchmarkWildCard10       258508              4538 ns/op            1560 B/op         19 allocs/op
BenchmarkRegexp1         1978704               608.4 ns/op           384 B/op          6 allocs/op
BenchmarkRegexp5          519240              2394 ns/op             928 B/op         13 allocs/op
BenchmarkRegexp10         280741              4309 ns/op            1560 B/op         19 allocs/op
PASS
ok      github.com/bmf-san/goblin       13.666s

Comparing before and after improvements, it seems that there is an overall trend of improvement.

Profiling

pprof Graph.

pprof_graph_after

pprof Top.

pprof_top_after

It can be seen that the bottleneck has moved to strings.FieldsFunc called within explodePath.

Further Improvements

Other improvements have also been made to goblin; the released tag is 6.0.0.

Since there were no major improvements to the data structures or algorithms, these are fairly minor optimizations, and unfortunately there are no remarkable results.

I feel that this is about the limit of the data structure and algorithms currently in use. (Looking at other routers, they seem to adopt more advanced trees, so that makes sense...)

While it deviates slightly from the main topic, I created a benchmarker to see if I could gain insights for improvements by comparing with other routers.

bmf-san/go-router-benchmark

It is interesting to compare, and it becomes clear that I am losing badly. I cried.

I want to study the implementations of other routers and improve my understanding of advanced tree structures, which I previously struggled with.


Tags: Golang