Endianness madness
A lot of C
code in the wild does something on the lines of:
void write32(char *buffer, uint32_t n) {
memcpy(buffer, (char*)&n, 4);
}
This code, however, is clearly not portable as the order of the bytes of n
is architecture-dependent. This is especially clear in lazy-oriented programming, when an entire structure is just written to a file:
struct state {
...
// Many fields here
};
int write_state(FILE* f, struct state* s) {
return fwrite(s, sizeof(*s), 1, f);
}
Which is even a worse idea since struct state
might be:
- different in size, depending on the architecture;
- different in size and layout on the same architecture, but under a different data model 1.
So, the Good ProgrammerTM will understand this and will:
- define the size of each field beforehand (or use a stricter type like
uint32_t
); - swap the bytes, if needed, before writing to disk/network.
However, the challenge with byte swapping arises from the necessity to determine the system’s endianness. If the byte order of your architecture aligns with the output order, swapping becomes unnecessary.
There are plenty of bad ways to do that, but what I found most intriguing was:
// Pre-C99, invalid
#define IS_BIG_ENDIAN ((*(uint16_t*)("\1\2")) == 0x0102)
// C99-compliant
#define IS_BIG_ENDIAN (!*(unsigned char *)&(uint16_t){1})
The first is not valid neither C nor in C++, the second is valid only in C99
and above. However, they still work on all major compilers. Speaking of which, relying on compiler-specific stuff is also an option. gcc
has had __BYTE_ORDER__
macro for quite a while now. In C++, you could use std::endian
. Another viable option is to rely on OS-specific stuff (like POSIX’s <endian.h>
header).
Nevertheless, writing conditional logic to consider this factor is necessary. I remember I once read an article talking about this exact problem and providing a simple solution: writing each individual byte as you expect it to be. It seemed like very solid advice to be honest. In detail, he was proposing something on the lines of:
void write32le(char* buffer, int value) {
buffer[0] = (char)(value);
buffer[1] = (char)(value >> 8);
buffer[2] = (char)(value >> 16);
buffer[3] = (char)(value >> 24);
}
void write32be(char* buffer, int value) {
buffer[0] = (char)(value >> 24);
buffer[1] = (char)(value >> 16);
buffer[2] = (char)(value >> 8);
buffer[3] = (char)(value);
}
Now, this is what I like: simple code, always work, no compiler-specific, no undefined-behavior.
C++ to the rescue
Admittedly, this article might seem too boring and simple. I might as well wrote the same in Go, where those properties are idolatrised.
But not on my whatch (or on my personal blog). Let’s embrace some C++ madness and write the generic version of that.
namespace little_endian {
template <size_t S, std::integral T>
requires (0 < S && S <= sizeof(T))
constexpr void write(std::span<char> output, T value) {
if constexpr (std::is_signed_v<T>) {
using unsigned_T = std::make_unsigned_t<T>;
return write<S, unsigned_T>(output, static_cast<unsigned_T>(value));
}
for (size_t i = 0; i < S; i++) {
output[i] = static_cast<char>(value >> i*8);
}
}
}
This interface is more complex compared to the original C version, but adds some nice features:
- it lets you specify the byte count to write. This way it is possible to write a number with bit-count which is not part of the C++ type system (e.g.
write<24>(buffer, 10)
) - it supports both signed and unsigned numbers;
- it supports all sizes, so if by chanche you have access to
int128_t
, you can use it.
Still, it bothers me that for simple cases I can’t just write write(buffer, 10)
. To account for that, it is possible to add an overload to simplify things:
namespace little_endian {
...
template <std::integral T>
constexpr void write(std::span<char> output, T value) {
write<sizeof(T), T>(output, value);
}
}
Sweet! Now I can have my cake and eat it, too.
Performance
How about the performance of the solution above? Some might assume that byte-order swapping would execute faster due to being a compiler intrinsic, especially when compared to the C++ version, which contains a loop.
So, let’s inspect the assembly generated by:
GCC 13
Using -std=c++20 -O3
2:
void little_endian::write<unsigned short>(std::span<char, 18446744073709551615ul>, unsigned short):
mov WORD PTR [rdi], dx
ret
void little_endian::write<unsigned int>(std::span<char, 18446744073709551615ul>, unsigned int):
mov DWORD PTR [rdi], edx
ret
void little_endian::write<unsigned long>(std::span<char, 18446744073709551615ul>, unsigned long):
mov QWORD PTR [rdi], rdx
ret
Just perfect.
Clang 17
Using -std=c++20 -O2
.
void little_endian::write<unsigned short>(std::span<char, 18446744073709551615ul>, unsigned short): # @void little_endian::write<unsigned short>(std::span<char, 18446744073709551615ul>, unsigned short)
mov word ptr [rdi], dx
ret
void little_endian::write<unsigned int>(std::span<char, 18446744073709551615ul>, unsigned int): # @void little_endian::write<unsigned int>(std::span<char, 18446744073709551615ul>, unsigned int)
mov dword ptr [rdi], edx
ret
void little_endian::write<unsigned long>(std::span<char, 18446744073709551615ul>, unsigned long): # @void little_endian::write<unsigned long>(std::span<char, 18446744073709551615ul>, unsigned long)
mov qword ptr [rdi], rdx
ret
On par with GCC. Awesome.
MSVC
Using /std:c++20 /O2
(and tried with /std:c++20 /Os
).
output$ = 8
value$ = 16
void little_endian::write<unsigned __int64>(std::span<char,-1>,unsigned __int64) PROC ; little_endian::write<unsigned __int64>, COMDAT
mov rcx, QWORD PTR [rcx]
mov rax, rdx
shr rax, 8
mov BYTE PTR [rcx], dl
mov BYTE PTR [rcx+1], al
mov rax, rdx
shr rax, 16
mov BYTE PTR [rcx+2], al
mov rax, rdx
shr rax, 24
mov BYTE PTR [rcx+3], al
mov rax, rdx
shr rax, 32 ; 00000020H
mov BYTE PTR [rcx+4], al
mov rax, rdx
shr rax, 40 ; 00000028H
mov BYTE PTR [rcx+5], al
mov rax, rdx
shr rax, 48 ; 00000030H
shr rdx, 56 ; 00000038H
mov BYTE PTR [rcx+6], al
mov BYTE PTR [rcx+7], dl
ret 0
void little_endian::write<unsigned __int64>(std::span<char,-1>,unsigned __int64) ENDP ; little_endian::write<unsigned __int64>
output$ = 8
value$ = 16
void little_endian::write<unsigned int>(std::span<char,-1>,unsigned int) PROC ; little_endian::write<unsigned int>, COMDAT
mov rcx, QWORD PTR [rcx]
mov eax, edx
shr eax, 8
mov BYTE PTR [rcx], dl
mov BYTE PTR [rcx+1], al
mov eax, edx
shr eax, 16
shr edx, 24
mov BYTE PTR [rcx+2], al
mov BYTE PTR [rcx+3], dl
ret 0
void little_endian::write<unsigned int>(std::span<char,-1>,unsigned int) ENDP ; little_endian::write<unsigned int>
$T1 = 0
output$ = 32
value$ = 40
void little_endian::write<unsigned short>(std::span<char,-1>,unsigned short) PROC ; little_endian::write<unsigned short>, COMDAT
$LN16:
sub rsp, 24
movups xmm0, XMMWORD PTR [rcx]
xor eax, eax
movaps XMMWORD PTR $T1[rsp], xmm0
mov r9, QWORD PTR $T1[rsp]
$LL6@write:
movzx ecx, al
movzx r8d, dx
shl cl, 3
shr r8w, cl
mov BYTE PTR [r9+rax], r8b
inc rax
cmp rax, 2
jb SHORT $LL6@write
add rsp, 24
ret 0
void little_endian::write<unsigned short>(std::span<char,-1>,unsigned short) ENDP ; little_endian::write<unsigned short>
output$ = 8
value$ = 16
void little_endian::write<2,unsigned short>(std::span<char,-1>,unsigned short) PROC ; little_endian::write<2,unsigned short>, COMDAT
mov r9, QWORD PTR [rcx]
xor eax, eax
npad 11
$LL4@write:
movzx ecx, al
movzx r8d, dx
shl cl, 3
shr r8w, cl
mov BYTE PTR [r9+rax], r8b
inc rax
cmp rax, 2
jb SHORT $LL4@write
ret 0
void little_endian::write<2,unsigned short>(std::span<char,-1>,unsigned short) ENDP ; little_endian::write<2,unsigned short>
output$ = 8
value$ = 16
void little_endian::write<4,unsigned int>(std::span<char,-1>,unsigned int) PROC ; little_endian::write<4,unsigned int>, COMDAT
mov r8, QWORD PTR [rcx]
mov eax, edx
shr eax, 8
mov BYTE PTR [r8+1], al
mov eax, edx
mov BYTE PTR [r8], dl
shr eax, 16
shr edx, 24
mov BYTE PTR [r8+2], al
mov BYTE PTR [r8+3], dl
ret 0
void little_endian::write<4,unsigned int>(std::span<char,-1>,unsigned int) ENDP ; little_endian::write<4,unsigned int>
output$ = 8
value$ = 16
void little_endian::write<8,unsigned __int64>(std::span<char,-1>,unsigned __int64) PROC ; little_endian::write<8,unsigned __int64>, COMDAT
mov r8, QWORD PTR [rcx]
mov rax, rdx
shr rax, 8
mov BYTE PTR [r8+1], al
mov rax, rdx
shr rax, 16
mov BYTE PTR [r8+2], al
mov rax, rdx
shr rax, 24
mov BYTE PTR [r8+3], al
mov rax, rdx
shr rax, 32 ; 00000020H
mov BYTE PTR [r8+4], al
mov rax, rdx
shr rax, 40 ; 00000028H
mov BYTE PTR [r8+5], al
mov rax, rdx
mov BYTE PTR [r8], dl
shr rax, 48 ; 00000030H
shr rdx, 56 ; 00000038H
mov BYTE PTR [r8+6], al
mov BYTE PTR [r8+7], dl
ret 0
void little_endian::write<8,unsigned __int64>(std::span<char,-1>,unsigned __int64) ENDP ; little_endian::write<8,unsigned __int64>
Beautiful as well.
No, wait! What? MSVC apparently is not understanding this simple piece of code.
To be honest, this is not the first time I see MSVC generating sub-par assembly. As sometimes it gets confused when using templates. I attempted the same with the C version (write32le
), but unfortunately had no luck there either.
Nevertheless, I would pretty much prefer this code, even if slower on Windows, as it is portable and simple. Sometimes performance is not the first aim. At times, if performance is the main priority, considering a change in the compiler could be a viable option.
Reading
Reading follows the same ideas.
namespace byte_order::little_endian {
template <std::integral T, size_t S = sizeof(T)>
requires (0 < S && S <= sizeof(T))
constexpr T read(std::span<const char> output) {
if constexpr (std::is_signed_v<T>) {
return static_cast<T>(read<S, std::make_unsigned_t<T>>(output));
} else {
T result = 0;
for (size_t i = 0; i < S; i++) {
result |= static_cast<T>(static_cast<unsigned char>(output[i])) << (i*8);
}
return result;
}
}
}
Notice how it’s required to first cast to unsigned char
and then to T
to properly handle char
signedness3.
This time, results are a bit different though: Clang and MSVC don’t change their behavior, leading to similar code.
Clang
Using -std=c++20 -O2
.
unsigned short byte_order::little_endian::read<unsigned short, 2ul>(std::span<char const, 18446744073709551615ul>): # @unsigned short byte_order::little_endian::read<unsigned short, 2ul>(std::span<char const, 18446744073709551615ul>)
movzx eax, word ptr [rdi]
ret
unsigned int byte_order::little_endian::read<unsigned int, 4ul>(std::span<char const, 18446744073709551615ul>): # @unsigned int byte_order::little_endian::read<unsigned int, 4ul>(std::span<char const, 18446744073709551615ul>)
mov eax, dword ptr [rdi]
ret
unsigned long byte_order::little_endian::read<unsigned long, 8ul>(std::span<char const, 18446744073709551615ul>): # @unsigned long byte_order::little_endian::read<unsigned long, 8ul>(std::span<char const, 18446744073709551615ul>)
mov rax, qword ptr [rdi]
ret
MSVC
output$ = 8
unsigned __int64 byte_order::little_endian::read<unsigned __int64,8>(std::span<char const ,-1>) PROC ; byte_order::little_endian::read<unsigned __int64,8>, COMDAT
mov rdx, QWORD PTR [rcx]
movzx eax, BYTE PTR [rdx+6]
movzx ecx, BYTE PTR [rdx+2]
movzx r8d, BYTE PTR [rdx+7]
shl r8, 8
or r8, rax
movzx eax, BYTE PTR [rdx+5]
shl r8, 8
or r8, rax
movzx eax, BYTE PTR [rdx+4]
shl r8, 8
or r8, rax
movzx eax, BYTE PTR [rdx+3]
shl r8, 8
or rax, r8
shl rax, 8
or rax, rcx
movzx ecx, BYTE PTR [rdx+1]
shl rax, 8
or rax, rcx
movzx ecx, BYTE PTR [rdx]
shl rax, 8
or rax, rcx
ret 0
unsigned __int64 byte_order::little_endian::read<unsigned __int64,8>(std::span<char const ,-1>) ENDP ; byte_order::little_endian::read<unsigned __int64,8>
output$ = 8
unsigned short byte_order::little_endian::read<unsigned short,2>(std::span<char const ,-1>) PROC ; byte_order::little_endian::read<unsigned short,2>, COMDAT
mov rdx, QWORD PTR [rcx]
movzx eax, BYTE PTR [rdx+1]
movzx ecx, BYTE PTR [rdx]
shl ax, 8
or ax, cx
ret 0
unsigned short byte_order::little_endian::read<unsigned short,2>(std::span<char const ,-1>) ENDP ; byte_order::little_endian::read<unsigned short,2>
output$ = 8
unsigned int byte_order::little_endian::read<unsigned int,4>(std::span<char const ,-1>) PROC ; byte_order::little_endian::read<unsigned int,4>, COMDAT
mov rdx, QWORD PTR [rcx]
movzx ecx, BYTE PTR [rdx+2]
movzx eax, BYTE PTR [rdx+3]
shl eax, 8
or eax, ecx
movzx ecx, BYTE PTR [rdx+1]
shl eax, 8
or eax, ecx
movzx ecx, BYTE PTR [rdx]
shl eax, 8
or eax, ecx
ret 0
unsigned int byte_order::little_endian::read<unsigned int,4>(std::span<char const ,-1>) ENDP ; byte_order::little_endian::read<unsigned int,4>
This time, however, GCC doesn’t seem to handle the generation properly, generating results similar to MSVC. However, hand-writing the routine, gets things better.
uint32_t read32le(unsigned char* data) {
return (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
}
Generates (in GCC, not in MSVC):
read32le(unsigned char*):
mov eax, DWORD PTR [rdi]
ret
So, most likely it’s the loop confusing GCC. What if I remove it and merge everything in a single embeddable expression? It’s possible to get there by recusively loading the high part and the low part, each of them half the size of the result.
namespace byte_order::little_endian {
template <std::integral T, size_t S = sizeof(T)>
requires (0 < S && S <= sizeof(T))
constexpr T read(std::span<const char> buffer) {
if constexpr (std::is_signed_v<T>) {
return static_cast<T>(read<S, std::make_unsigned_t<T>>(buffer));
} else if constexpr (S == 1) {
return static_cast<uint8_t>(buffer[0]);
} else {
const auto low_part = read<T, S/2>(buffer);
const auto high_part = read<T, (S+1)/2>(buffer.subspan(S/2));
return low_part | (high_part << ((S/2)*8));
}
}
}
And this solves for GCC as well, leaving (again) MSVC the only one generating sub-par assembly.
unsigned short byte_order::little_endian::read<unsigned short, 2ul>(std::span<char const, 18446744073709551615ul>):
movzx eax, WORD PTR [rdi]
ret
unsigned int byte_order::little_endian::read<unsigned int, 4ul>(std::span<char const, 18446744073709551615ul>):
mov eax, DWORD PTR [rdi]
ret
unsigned long byte_order::little_endian::read<unsigned long, 8ul>(std::span<char const, 18446744073709551615ul>):
mov rax, QWORD PTR [rdi]
ret
You can find the complete code here.
-
an example for this is x86-64: LLP64 data model uses 4-byte
long
s (used on Windows) while in LP64 (*nix) the same type is twice that size (i.e. 8 bytes). ↩ -
Notice that the
64-bit
version is not this beautiful whith-O2
. ↩ -
directly converting
char
toT
would have sign-extended the register leading to a wrong result. ↩