Golang reflection, a faster route?
Reflection can be very handy and makes it possible to achieve flexibility. One of its typical use-cases is serialization.
Let’s write the simplest example I can think of. This serializer will write:
- byte, [u]int8 and bools as a single byte;
- bigger integers as Varints (leb-128);
- arrays as the concatenation of its content;
- strings and slices as len + content;
- structs as its fields, without any structure;
- pointers as a bool + contents;
- maps as len + key-value stream;
- interfaces won’t be supported for now.
Let’s capture the basics of the writing in an interface:
type BinaryWriter interface {
WriteBool(bool)
WriteInt8(int8)
WriteUint8(uint8)
WriteInt(int64)
WriteUint(uint64)
WriteFloat32(float32)
WriteFloat64(float64)
WriteString(string)
}
The serializer is then just using reflection to route each time to the right method:1:
package fast_reflection
import (
"fmt"
"reflect"
)
func Serialize(writer BinaryWriter, val any) error {
if val == nil {
return fmt.Errorf("can't serialize nil")
}
value := reflect.ValueOf(val)
if value.Kind() != reflect.Pointer {
return fmt.Errorf("val must be a pointer")
}
return serializeValue(writer, value.Elem())
}
func serializeValue(writer BinaryWriter, value reflect.Value) error {
switch value.Kind() {
case reflect.Bool:
writer.WriteBool(value.Bool())
case reflect.Int8:
writer.WriteInt8(int8(value.Int()))
case reflect.Uint8:
writer.WriteUint8(uint8(value.Uint()))
case reflect.Int, reflect.Int16, reflect.Int32, reflect.Int64:
writer.WriteInt(value.Int())
case reflect.Uint, reflect.Uint16, reflect.Uint32, reflect.Uint64:
writer.WriteUint(value.Uint())
case reflect.Float32:
writer.WriteFloat32(float32(value.Float()))
case reflect.Float64:
writer.WriteFloat64(value.Float())
case reflect.Complex64:
cpx := value.Complex()
writer.WriteFloat32(float32(real(cpx)))
writer.WriteFloat32(float32(imag(cpx)))
case reflect.Complex128:
cpx := value.Complex()
writer.WriteFloat64(real(cpx))
writer.WriteFloat64(imag(cpx))
case reflect.Array:
len := value.Len()
for i := 0; i < len; i++ {
serializeValue(writer, value.Index(i))
}
case reflect.Slice:
len := value.Len()
writer.WriteInt(int64(len))
for i := 0; i < len; i++ {
serializeValue(writer, value.Index(i))
}
case reflect.String:
writer.WriteString(value.String())
case reflect.Map:
if value.IsNil() {
writer.WriteInt(0)
} else {
writer.WriteInt(int64(value.Len()))
for iter := value.MapRange(); iter.Next(); {
serializeValue(writer, iter.Key())
serializeValue(writer, iter.Value())
}
}
case reflect.Pointer:
if value.IsNil() {
writer.WriteBool(false)
} else {
writer.WriteBool(true)
serializeValue(writer, value.Elem())
}
case reflect.Struct:
// Let's ignore the complexity deriving from exported vs non-exported fields
for i := 0; i < value.NumField(); i++ {
serializeValue(writer, value.Field(i))
}
case reflect.Uintptr, reflect.Chan, reflect.Func, reflect.Interface, reflect.UnsafePointer:
return fmt.Errorf("Unsupported kind %s", value.Kind())
}
return nil
}
A bit longer than I wanted, but simple to read (and sure enough not suitable for production!).
There’s some weirdness though. For example, you must pass a pointer, but the public interface of the method doesn’t seem to be aware of that, so:
var value MyAwesomeType
fast_reflection.Serialize(reader, value) // run-time error!
If you are lucky enough to use go1.18+, then you can use the new shiny Generics to work around this akwardness:
func Serialize[T any](writer BinaryWriter, val *T) error {
if val == nil {
return fmt.Errorf("can't serialize nil")
}
value := reflect.ValueOf(val)
// Don't need to check if it's a pointer anymore
return serializeValue(writer, value.Elem())
}
How much does it costs?
Evaluating a technology or a solution needs to take into account the cost. In this case we are not talking about cash, but execution speed.
So, let’s write a benchmark for that, as a baseline.
I will use a simple, yet representative, struct:
type simpleStruct struct {
I int
U uint
S string
L []int
P *int
M map[int]int
}
And a very naive hand-written serialization function:
func serializeSimpleStruct(writer io.Writer, s simpleStruct) error {
writer.WriteInt(int64(s.I))
writer.WriteUint(uint64(s.U))
writer.WriteString(s.S)
writer.WriteInt(int64(len(s.L)))
for _, item := range s.L {
writer.WriteInt(int64(item))
}
if s.P == nil {
writer.WriteBool(false)
} else {
writer.WriteBool(true)
writer.WriteInt(int64(*s.P))
}
if s.M == nil {
writer.WriteInt(0)
} else {
writer.WriteInt(int64(len(s.M)))
for k, v := range s.M {
writer.WriteInt(int64(k))
writer.WriteInt(int64(v))
}
}
return nil
}
To banchmark our reflection code, it is necessary to remove as much as possible things not related to the access method. For this, we can write a very simple NilBinaryWriter
which will basically do nothing. It’s also useful to mark all its methods as go:noinline
, otherwise the hand-written solution would probably be able to optimize too much, making the comparrison unfair.
type NilBinaryWriter struct{}
var _ BinaryWriter = &NilBinaryWriter{}
//go:noinline
func (*NilBinaryWriter) WriteBool(bool) {}
//go:noinline
func (*NilBinaryWriter) WriteFloat32(float32) {}
//go:noinline
func (*NilBinaryWriter) WriteFloat64(float64) {}
//go:noinline
func (*NilBinaryWriter) WriteInt(int64) {}
//go:noinline
func (*NilBinaryWriter) WriteInt8(int8) {}
//go:noinline
func (*NilBinaryWriter) WriteString(string) {}
//go:noinline
func (*NilBinaryWriter) WriteUint(uint64) {}
//go:noinline
func (*NilBinaryWriter) WriteUint8(uint8) {}
The benchmark is then straightforward:
func BenchmarkSerialize(b *testing.B) {
s := simpleStruct{
I: 1,
U: 2,
S: "3",
L: []int{4, 5},
P: new(int),
M: map[int]int{6: 7, 8: 9},
}
b.Run("hand-written", func(b *testing.B) {
for i := 0; i < b.N; i++ {
writer := &bytes.Buffer{}
serializeSimpleStruct(writer, s)
}
})
b.Run("reflect", func(b *testing.B) {
for i := 0; i < b.N; i++ {
writer := &bytes.Buffer{}
fast_reflection.Serialize(writer, s)
}
})
}
Which gives:
> go test . -bench=. -benchmem
goos: linux
goarch: amd64
pkg: fast_reflection
cpu: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
BenchmarkSerialize/hand-written-4 8458156 130.5 ns/op 0 B/op 0 allocs/op
BenchmarkSerialize/reflect-4 2211645 515.6 ns/op 32 B/op 4 allocs/op
So, from this we can see:
- reflection is around 4x slower;
- reflection allocates while the hand written one doesn’t.
In search for faster route
Optimization is about compromise. In the benchmark the trade-off is clear: to gain performance we are sacrificing maintainability 2 (by hand-writing the serialization code).
Of course, one could use tools like go generate
to remove the “hand written” part. While this is viable, that’s not what this article is about.
Let’s introduce the unsafe
package. As the name implies, this package can be used to defy the normal rules and if not used carefully can result in a malformed program. There are, however, usage patterns that are accepted and considered safe (a safe usage of an unsafe
feature…). Quoting the documentation about unsafe.Pointer
:
The following patterns involving Pointer are valid. Code not using these patterns is likely to be invalid today or to become invalid in the future. Even the valid patterns below come with important caveats.
(1) Conversion of a *T1 to Pointer to *T2.
Provided that T2 is no larger than T1 and that the two share an equivalent memory layout, this conversion allows reinterpreting data of one type as data of another type.
...
(3) Conversion of a Pointer to a uintptr and back, with arithmetic.
If p points into an allocated object, it can be advanced through the object by conversion to uintptr, addition of an offset, and conversion back to Pointer.
p = unsafe.Pointer(uintptr(p) + offset)
The most common use of this pattern is to access fields in a struct or elements of an array:
// equivalent to f := unsafe.Pointer(&s.f)
f := unsafe.Pointer(uintptr(unsafe.Pointer(&s)) + unsafe.Offsetof(s.f))
// equivalent to e := unsafe.Pointer(&x[i])
e := unsafe.Pointer(uintptr(unsafe.Pointer(&x[0])) + i*unsafe.Sizeof(x[0]))
It is valid both to add and to subtract offsets from a pointer in this way. It is also valid to use &^ to round pointers, usually for alignment. In all cases, the result must continue to point into the original allocated object.
Interesting. Go is saying that you can use pointer arithmetics to:
- access array elements
- access struct fields
as long as those addresses don’t outlive the object itself (clearly).
This means that we can compose a function to perform serialization/deserialization. Something like:
func MakeSerializer(reflect.Type) func(BinaryWriter, any) error
Sinc this function must deal with the “non-unsafe” world and we are using “unsafe” version inside, it is better to demand the real composition of the writer to a non-exported function and wrap it like:
func makeSerializer(typ reflect.Type) (func(BinaryWriter, unsafe.Pointer), error)
func MakeSerializer(typ reflect.Type) (func(BinaryWriter, any) error, error) {
serializer, err := makeUsafeWriter(typ)
if err != nil {
return nil, err
}
return func(writer io.Writer, val any) error {
if val == nil {
return fmt.Errorf("can't serialize nil")
}
value := reflect.ValueOf(val)
if value.Kind() != reflect.Pointer {
return fmt.Errorf("val must be a pointer")
}
serializer(writer, value.UnsafePointer())
return nil
}, nil
}
Just like in the other case, it is possible to use generics to improve the usage pattern:
func MakeSerializer[T any]() (func(BinaryWriter, *T) error, error) {
var _t T
serializer, err := makeSerializer(reflect.TypeOf(_t))
if err != nil {
return nil, err
}
return func(writer BinaryWriter, val *T) error {
if val == nil {
return fmt.Errorf("can't serialize nil")
}
serializer(writer, unsafe.Pointer(val))
return nil
}, nil
}
It is convenient first to define some helpers. Most of them are as trivial as casting the pointer to the right type and calling the right writeXXXUnsafe
function.
func writeBoolUnsafe(writer BinaryWriter, pVal unsafe.Pointer) {
value := *(*bool)(pVal)
writer.WriteBool(value)
}
This way, makeUsafeWriter
will just return the right function for the type at hand, eventually composing them for aggregates. Something like:
func makeUsafeWriter(typ reflect.Type) (func(BinaryWriter, unsafe.Pointer), error) {
switch typ.Kind() {
case reflect.Bool:
return writeBoolUnsafe, nil
case reflect.Int8:
return writeInt8Unsafe, nil
case reflect.Int16:
return writeInt16Unsafe, nil
...
case reflect.Uintptr, reflect.Chan, reflect.Func, reflect.Interface, reflect.UnsafePointer:
return nil, fmt.Errorf("Unsupported kind %s", typ.Kind())
}
panic("can't be here")
}
Arrays, slices, struct, pointers and maps are trickier.
For arrays we will need to capture the length with a 2nd order function:
func makeArrayUnsafeWriter(typ reflect.Type) (func(BinaryWriter, unsafe.Pointer), error) {
len := typ.Len()
elemTyp := typ.Elem()
elemSize := elemTyp.Size()
elemSerializer, err := makeUsafeWriter(elemTyp)
if err != nil {
return nil, err
}
return func(writer io.Writer, pVal unsafe.Pointer) {
for i := 0; i < len; i++ {
pItem := unsafe.Pointer(uintptr(pVal) + uintptr(i)*elemSize) // This is accepted!
elemSerializer(writer, pItem)
}
}, nil
}
For slices there’s a bit more of work as the length is not know at “composition time”.
func makeSliceUnsafeWriter(typ reflect.Type) (func(BinaryWriter, unsafe.Pointer), error) {
// Len is fixed!
elemTyp := typ.Elem()
elemSize := elemTyp.Size()
elemSerializer, err := makeSerializer(elemTyp)
if err != nil {
return nil, err
}
return func(writer BinaryWriter, pVal unsafe.Pointer) {
// in go1.20 we will have `SliceData` for this...
slice := (*reflect.SliceHeader)(pVal)
writer.WriteInt(int64(slice.Len))
for i := 0; i < slice.Len; i++ {
pItem := unsafe.Pointer(slice.Data + uintptr(i)*elemSize)
elemSerializer(writer, pItem)
}
}, nil
}
That conversion is considered by Go’s linter as unsafe (e.g. not part of that “good use” list). It seems to me that slice.Data
points to the slice’s backing array, and arrays can be used this way, it is logical to believe that this is ok (and that that warning is a false positive).
Struct can be analyzed beforehand, saving a list of functions to call each time:
func makeStructUnsafeWriter(typ reflect.Type) (func(BinaryWriter, unsafe.Pointer), error) {
fields := make([]func(BinaryWriter, unsafe.Pointer), typ.NumField())
for i := 0; i < typ.NumField(); i++ {
field := typ.Field(i)
fieldOffset := field.Offset
fieldWriter, err := makeSerializer(field.Type)
if err != nil {
return nil, err
}
fields[i] = func(writer BinaryWriter, pVal unsafe.Pointer) {
fieldWriter(writer, unsafe.Pointer(uintptr(pVal)+fieldOffset))
}
}
return func(writer BinaryWriter, pVal unsafe.Pointer) {
for _, field := range fields {
field(writer, pVal)
}
}, nil
}
Now, pointers can be dereferenced by converting the pointer to *unsafe.Pointer
as the rule (1) from unsafe.Pointer
says that I can convert between types that share the same memory layout.
func makePointerUnsafeWriter(typ reflect.Type) (func(BinaryWriter, unsafe.Pointer), error) {
elemWriter, err := makeSerializer(typ.Elem())
if err != nil {
return nil, err
}
return func(writer BinaryWriter, pVal unsafe.Pointer) {
value := *(*unsafe.Pointer)(pVal)
if value == nil {
writer.WriteBool(false)
} else {
writer.WriteBool(true)
elemWriter(writer, value)
}
}, nil
}
Last, maps. Unfortunately, for maps, there is simply no way to iterate using unsafe.Pointer
s. If you used Go for long enough, it is probably clear to you that maps are a somewhat 2nd-class citizen in Go. I might write an article in the future on why.
So, how can we write something working? Well… We could go back to reflection! Thanks to reflect.NewAt
we can get a value from an existing pointer (given that we know the type).
Let’s try:
func makeMapUnsafeWriter(typ reflect.Type) (func(io.Writer, unsafe.Pointer), error) {
keyWriter, err := makeUsafeWriter(typ.Key())
if err != nil {
return nil, err
}
elemWriter, err := makeUsafeWriter(typ.Elem())
if err != nil {
return nil, err
}
return func(writer io.Writer, pVal unsafe.Pointer) {
if pVal == nil {
writeInt(writer, 0)
} else {
value := reflect.NewAt(typ, pVal).Elem()
writeInt(writer, int64(value.Len()))
for iter := value.MapRange(); iter.Next(); {
keyWriter(writer, iter.Key().UnsafePointer())
elemWriter(writer, iter.Value().UnsafePointer())
}
}
}, nil
}
Seems neat! Except that it crashes 💥
The problem lies in the iter.Key()
and iter.Value()
which are both non addressable. This is the contrived way of Go to solve the lack of const
(or equivalent) functionality (e.g. you cannot take the address of a const value otherwise you could be changing it… Except for strings 😕 ).
If we want to stay within the “safe” usage of the unsafe
package, the only way around that is to copy those values outside. This will cost few cicles and makes the inner loop ugly:
key := reflect.New(typ.Key())
elem := reflect.New(typ.Elem())
for iter := value.MapRange(); iter.Next(); {
key.Elem().Set(iter.Key())
elem.Elem().Set(iter.Value())
keyWriter(writer, key.UnsafePointer())
elemWriter(writer, elem.UnsafePointer())
}
So, let’s benchmark what we have.
goos: linux
goarch: amd64
pkg: fast_reflection
cpu: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
BenchmarkSerialize/hand-written-4 8856024 115.1 ns/op 0 B/op 0 allocs/op
BenchmarkSerialize/reflect-4 2778475 496.1 ns/op 32 B/op 4 allocs/op
BenchmarkSerialize/unsafe-4 1998032 599.4 ns/op 48 B/op 6 allocs/op
Wait what? It became slower! We can now run to our manager saying that in a day of work we managed to:
- make the code much less readable;
- make it slower;
- make it consume more memory.
Or we can understand where the problem is.
First of all, that strange map
code is making copies at each step of the loop. Can we make that better? For now, we can use the “normal” reflect code so that it’s at least in par with the rest of the code.
func makeMapUnsafeWriter(typ reflect.Type) (func(BinaryWriter, unsafe.Pointer), error) {
return func(writer BinaryWriter, pVal unsafe.Pointer) {
if pVal == nil {
writer.WriteInt(0)
} else {
m := reflect.NewAt(typ, pVal).Elem()
writer.WriteInt(int64(m.Len()))
for iter := m.MapRange(); iter.Next(); {
serializeValue(writer, iter.Key())
serializeValue(writer, iter.Value())
}
}
}, nil
}
And another round of tests gives:
goos: linux
goarch: amd64
pkg: fast_reflection
cpu: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
BenchmarkSerialize/hand-written-4 12570352 99.26 ns/op 0 B/op 0 allocs/op
BenchmarkSerialize/reflect-4 2666583 448.8 ns/op 32 B/op 4 allocs/op
BenchmarkSerialize/unsafe-4 2668941 422.2 ns/op 32 B/op 4 allocs/op
Oh yeah! It’s this 5% that will get your salary up!
No, honestly, the result is somewhat better, but clearly not the way it’s intended. We are supposedly doing less work (as we are not inspecting the structure of our data, but just acting on it), but the result might very well some testing fluctation.
What’s happening? Is combining lambdas this costly? Maybe!
First of all, as far as I know, go (as of v1.20) can’t inline closures properly, not even not-escaping ones.
But there is another hidden cost of functions. Go have a curious way to represent goroutines, in the sense that it will allocate a very small stack and it will grow on demand. This basically means that each function call will need first to check if the stack is enough for its stack-frame, and, if not, it will resize it. That’s some overhead.
It is possible to remove that (at your own risk) with the directive go:nosplit
3.
goos: linux
goarch: amd64
pkg: fast_reflection
cpu: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
BenchmarkSerialize/hand-written-4 13102244 99.27 ns/op 0 B/op 0 allocs/op
BenchmarkSerialize/reflect-4 2841808 458.0 ns/op 32 B/op 4 allocs/op
BenchmarkSerialize/unsafe-4 2621190 409.8 ns/op 32 B/op 4 allocs/op
Ok, then the problem really is how bad go is at inlining/calling functions. Also, fluctuations in the result of a microbenchmark like this clearly point to the fact that the benchmarking library in Go needs some serious improvements.
How could we enable more inlining? Can we have a switch (again), but lighter than all that reflection code?
What about writing a very simple interpreter of a rudimental “instruction set”?
A simple serialization language
Ok, so let’s say that we want to make a language to save our decisions about what to serialize. Just like a normal instruction set, we will have some opcode:
type opcode int
Each opcode
will possibly have arguments inside of the “bytecode”. Each opcode
is a serialization operation.
const (
invalid opcode = iota
writeBool
writeInt8
writeInt16
writeInt32
writeInt64
writeInt
writeUint8
writeUint16
writeUint32
writeUint64
writeUint
writeFloat32
writeFloat64
writeComplex64
writeComplex128
writeString
writeSlice
writeMap
writePointer
)
Each of them will sure enough need an offset for the field (in case we are not serializing a struct, it will be 0).
The execution loop will be something like:
func runSerializerCode(writer BinaryWriter, pVal unsafe.Pointer, code []int, mapTypes []reflect.Type) {
pc := 0 // program counter
for pc < len(code) {
op := opcode(code[pc])
pc++
offset := uintptr(code[pc])
pc++
fieldPtr := unsafe.Pointer(uintptr(pVal) + offset)
switch op {
case writeBool:
writeBoolUnsafe(writer, fieldPtr)
case writeInt8:
writeInt8Unsafe(writer, fieldPtr)
...
}
}
Turns out that arrays can be inlined easily by unrolling the loop.
In the case of pointers and slices we need to be able to jump instruction (forward for pointers and backwards for slices).
We can think in terms of “subroutines”:
writePointer <offset> <subroutine code length>
<subroutine code>
...
writeSlice <offset> <item size> <subroutine code length>
<subroutine code>
In this way we can call runSerializerCode
recursively on the subroutine code. Something like:
case writePointer:
subroutine := code[pc]
pc++
value := *(*unsafe.Pointer)(fieldPtr)
if value == nil {
writer.WriteBool(false)
} else {
writer.WriteBool(true)
runSerializerCode(writer, value, code[pc:(pc+subroutine)], mapTypes)
}
pc += int(subroutine)
Just like before, maps are a bit akward, so we will just call the original routine for now with:
func writeMapUnsafe(writer BinaryWriter, pVal unsafe.Pointer, typ reflect.Type) {
if pVal == nil {
writer.WriteInt(0)
} else {
m := reflect.NewAt(typ, pVal).Elem()
writer.WriteInt(int64(m.Len()))
for iter := m.MapRange(); iter.Next(); {
serializeValue(writer, iter.Key())
serializeValue(writer, iter.Value())
}
}
}
This will need to have a type. We will save the type of the map in the mapTypes
argument. Finally, the wrapper (again using generics):
func MakeSerializer[T any]() (func(BinaryWriter, *T) error, error) {
code := make([]int, 0, 2)
typsMap := make(map[reflect.Type]int)
var _t T
makeSerializerCode(reflect.TypeOf(_t), 0, &code, typsMap)
typs := make([]reflect.Type, len(typsMap))
for k, v := range typsMap {
typs[v] = k
}
return func(writer BinaryWriter, val *T) error {
if val == nil {
return fmt.Errorf("can't serialize nil")
}
runSerializerCode(writer, unsafe.Pointer(val), code, typs)
return nil
}, nil
}
Testing all of that gives almost the same results of the last time:
goos: linux
goarch: amd64
pkg: fast_reflection
cpu: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
BenchmarkSerialize/hand-written-4 13660764 100.3 ns/op 0 B/op 0 allocs/op
BenchmarkSerialize/reflect-4 2661003 470.7 ns/op 32 B/op 4 allocs/op
BenchmarkSerialize/unsafe-4 2637398 406.2 ns/op 32 B/op 4 allocs/op
The big difference there is that I removed all the unsafe go:nosplit
, so the stack is safe this time.
I feel like this test is leading to a big performance hit becouse we are using serializeValue
for maps, so let’s compare the case where maps aren’t there. I will use this struct this time:
type innerStruct struct {
a [4]int
b byte
}
type nestedStruct struct {
I int
U uint
S string
L []int
P *int
N innerStruct
}
Aaand the result is:
goos: linux
goarch: amd64
pkg: fast_reflection
cpu: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
BenchmarkNestedStruct/hand-written-4 37589709 33.42 ns/op 0 B/op 0 allocs/op
BenchmarkNestedStruct/reflect-4 6968936 178.0 ns/op 0 B/op 0 allocs/op
BenchmarkNestedStruct/unsafe-4 15621717 72.92 ns/op 0 B/op 0 allocs/op
OH YEAH! The reflect
code is consistently 4-5x slower while our new shiny code is 2-2.5x only!
You can find the complete listing here
Wrapping up
So, we discussed how difficult it is to beat “boring” reflection code in Go. We saw how bad it is not to inline closures and that the cost of having a resizable stack is not zero.
Also it is clear, given the length of this article, that I need to come a long way to learn how to write concise, yet useful articles.