Writing Efficient Endian-Independent Code in C++

Once upon a time, there was an article published on gamedev.net [Roy2013], which described a way (as he says, mostly taken from Quake2 engine) to deal with Little-Endian/Big-Endian issues in games. While this approach is mostly sound (“mostly” because of unaligned-read issues which will be discussed below), it is not the most efficient one. Better (simpler, faster, and more general) approaches do exist, and they will be discussed below.

What is Endianness

Endianness itself has been described in many different works, including [[Roy2013] and [WikipediaEndianness]. Basically, it is a way that CPU stores multi-byte data in memory; little-endian systems store least significant byte first, and big-endian ones store most-significant byte first. So, if you have

uint16_t x = 1234;

then x will look as {0xD2, 0x04} on a little-endian system, and as {0x04, 0xD2} on a big-endian system. As a result, code such as

send(socket,&x,2);

will send different data over the wire depending on system endianness (little-endian systems will send {0xD2,0x04}, and big-endian ones will send {0x04,0xD2}).

It is important to note that endianness effects cannot be observed unless we have some kind of cast between pointers on data of different sizes (in the example above, there was an implicit cast from &x, which is uint16_t*, to void*, which is actually treated as byte pointer by the send() function). In other words, as long as we keep away from casts and stay within arithmetical and bitwise operations without pointer casts, the result is always the same regardless of endianness. 2+3 is always 5, and (((uint16_t)0xAA)<<3)^0xCCCC is always 0xC99C, regardless of the system where our code is running. Let's name such calculations endianness-agnostic.

Scope

First of all, where do we need to deal with little-endian/big-endian issues? In fact, there are only two scenarios of which I know, where it is important. First one is reading files (which might have been written on a different machine), and another one is network communication. From our perspective, both of these cases are essentially the same: we're transferring data from one machine to another one.

Serialization/Marshalling

One thing which should be noted for both these data-transfer-between-machines scenarios, is that you should never transfer data as a C structure; instead, you should serialize/marshal it. Putting C structure in a file (which may be read on another machine) or over the network, is a Really Bad Idea for several reasons.

First of all, when writing/reading C structure to external storage, you're becoming a hostage of implicit alignment rules of the compiler you're using. In general, when you have a structure such as

struct X {
uint8_t a;
uint32_t b;
};

then sizeof(X) won't be 5 bytes as some might expect; in many cases sizeof(X) will be 8 bytes (1 byte of a, then 3 unused bytes of padding just to make b aligned on a 4-byte boundary, and then 4 bytes of b), but this is not guaranteed at all. To make things worse, the amount of alignment is not specified by standards, so when you're switching from one compiler to another one, it may change (not to mention switching between CPUs); to make things even worse, it can be affected by compiler switches and on a struct-by-struct basis by things such as #pragma pack.

If you are using types such as int or long (rather than guaranteed-size-types such as uint8_t and uint32_t), things worsen even further (yes, this is possible) – due to different sizes of these types on different platforms. Oh, and don't forget that variable-length strings and C++ containers are clearly off-limits.

There are other (rather minor) reasons for avoiding writing C structures directly: you'll write more data then necessary, the data written will include garbage (which will affect the ability to compress it), and so on. However, the most important issue is (lack of) inter-platform and inter-compiler compatibility mentioned above.

These issues are so important, that in the networking world sending C structures over the network is universally considered a Big No-No.

So, what you should do when you need to send a C structure over the network (or to save it to the file)? You should serialize it first (in network world term “marshal” is generally preferred, though it is essentially the same thing).

Implementing Serialization

The idea behind serialization is simple: for the struct X above you write one byte of a, and 4 bytes of b, avoiding alignment issues. In fact, you can go further and use, for example, VLQ [WikipediaVLQ] variable-length encoding, or put null-terminated strings into your serialized data.

One way of serializing data (the one I prefer), is to have serialize/deserialize functions such as

void serialize_uint16(DataBlock&, uint16_t);//DataBlock should grow as the data is serialized
uint16_t deserialize_uint16(Parser&);//there is constructor Parser(DataBlock&)

When we have these functions implemented, then serializing our struct X will look as

DataBlock data;
serialize_uint8(data,x.a);
serialize_uint32(data,x.b);

(and deserializing will look similar).

So far so good, now let's see how we can implement our serialize_uint16() function. If we will implement it according to [Roy2013], it would look like:

void serialize_uint16(DataBlock& data,uint16_t u16) {
  void* ptr = data.grow(2);//add 2 bytes to the end of data, and return pointer to these 2 bytes
  u16 =  LittleShort(u16);//calling ShortSwap on big-endian systems, and ShortNoSwap on little-endian systems
  *(uint16_t*)ptr = u16; //(*)
}

This would work fine on x86 and x86-64, but on the other platforms the line marked as (*) may run into problems. The problem is that our ptr might be either even or odd; and if it is odd – some CPUs will refuse to read 2-byte data from it (also they will usually refuse to read 4-byte data unless its address is a multiple of 4, and so on). This never happens on x86/x86-64, but happens on SPARC, and may or may not happen on ARM (unless we specify __packed qualifier for uint16_t*, but it is not universally available).

Another Popular Alternative

Another popular alternative (thanks to Servant of the Lord for reminding me of it) is based on LITTLE_ENDIAN and BIG_ENDIAN macros. In some sense, it can be seen as the same serialize_uint16() as above, but using different implementation for BigShort() etc.:

//code courtesy of Servant of the Lord
#ifdef LITTLE_ENDIAN
    #define BigShort(x)     ShortSwap(x)
    #define LittleShort(x)  (x) //Do nothing, just 'return' the same variable.
    #define BigLong(x)      LongSwap(x)
    #define LittleLong(x)   (x)
    #define BigFloat(x)     FloatSwap(x)
    #define LittleFloat(x)  (x)
#elif defined(BIG_ENDIAN)
    #define BigShort(x)     (x)
    #define LittleShort(x)  ShortSwap(x)
    #define BigLong(x)      (x)
    #define LittleLong(x)   LongSwap(x)
    #define BigFloat(x)     FloatSwap(x)
    #define LittleFloat(x)  (x)
#else
    #error No idea about endianness
#endif

While it is faster and less bulky than a previous one (see "Performance Analysis" section below), it has the same problem with unaligned read/writes on non-x86 platforms :-(. In other words, for serialization purposes it won't work, for example, on SPARC (and it working for ARM is not guaranteed).

What Is to be Done?

What is to be done?
-- name of the novel by Nikolay Chernyshevsky, 1863 --

The answer is quite simple, actually. Instead of writing two bytes as one chunk, we can always write it byte-by-byte:

void serialize_uint16(DataBlock& data,uint16_t u16) {
  uint8_t* ptr = data.grow(2);
  *ptr++ = (uint8_t)u16;
  *ptr = (uint8_t)(u16 >> 8);
}//deserializing is very similar

This technique is well-known (see, for example, [Pike2012], but the idea is known for years before); a Really Good Thing about it is that we don't need to care about endianness at all(!). This stands because the code above doesn't perform any casts, and calculates all the bytes in a completely endianness-agnostic manner (see “What is Endianness” section above); and all writes are exactly 1 byte in size, so there is no chance for endianness to manifest itself.

While [Pike2012] argues that all the other marshalling methods represent a fallacy, I'm not so sure about it, and will describe an improvement over byte-by-byte marshalling in a moment.

Further Optimization

When I really care about performance (which I usually do, as server-side handling of billions of network messages per day is quite a significant load), I often add a special handling for those platforms of the most interest, for example (taken from [NoBugs2015]):

void serialize_uint16(DataBlock& data, uint16_t u16) { //assuming little-endian order on the wire 
  uint8_t* ptr = data.grow(2);
#if defined(__i386) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64) 
  *(uint16_t*)ptr = u16; // safe and fast as x86/x64 are ok with unaligned writes
#else 
  *ptr++ = (uint8_t)u16;
  *ptr = (uint8_t)(u16 >> 8);
#endif 
}

With this approach, we have the best of both worlds: (a) universal version (the one under #else) which works everywhere, and (b) optimized version for specific platforms which we know for sure will work efficiently there.

Performance analysis

Now, let's analyze relative performance of all three approaches: (a) the one from [Roy2013], (b) LITTLE_ENDIAN/BIG_ENDIAN based, © endianness-agnostic one (see [Pike2012]), and (d) the one from [NoBugs2015]. For the purposes of our analysis let's assume that all the data is already in L1 cache (it is the most common case for continuous reading, if it isn't, penalties will be the same for all the methods). Also, let's assume that L1 reads cost 2-3 clocks (L1 latencies with modern x86 CPUs are around 4-5 clocks, but latency isn't exactly translated to overall execution time, so 2-3 clocks works as a reasonable estimate); writes are usually around 1 clock. Also, we won't count costs of data.grow() for writing and of parser offset management for reading; it can be done in a manner that ensures amortized costs are quite low (of the order of single-digit clocks), and it will be the same regardless of the endianness handling method.

LITTLE_ENDIAN/BIG_ENDIAN-based marshalling is certainly not bad performance-wise: it can be inlined easily, and has a cost of 1 clock for writing and 2-3 clocks for reading.

[Roy2013] causes a function-call-by-pointer on each conversion; most importantly, such calls cannot possibly be inlined. From my experience, on x86 function calls with such parameters usually cost around 15-20 CPU clocks (while CALL/RET instructions are cheap, all the required PUSH/POPs and creating/destroying stack frame, taken together, are not), compared to inlined version. In addition, it will need roughly 1 clock to write the data (and 2-3 clocks to read), making the total for writing around 16-21 clocks and total for reading around 17-23 clocks.

Endianness-agnostic approach can be inlined easily; however, it causes 2 writes/reads instead of 1 (and compilers don't combine them, at least I've never seen a compiler combine 2 writes/reads), which is normally translated into 2 clocks for writing and 4-6 clocks for reading; also, it requires shifts and casts, which also cost around 2 additional clocks, making totals for writing around 4 clocks and total for reading around 6-8 clocks.

[NoBugs2015] is optimized for x86, and can be inlined too; same as LITTLE_/BIG_ENDIAN one, it has cost of 1 clock for writing and 2-3 clocks for reading.

Of course, the analysis above is very approximate and there are other things not taken into account (such as larger size of inline functions, which in general may affect caches), but I feel that these other considerations in most cases won't affect the overall picture.

Implementation Notes

One thing to note when implementing marshaling, is that in most cases it is simpler to do it using unsigned integers rather than signed ones; while using signed types isn't formally a bad thing, in practice it tends to cause trouble. Not that it isn't possible to implement marshalling with signed ints - it is just simpler to implement it with unsigned ints, with one less thing to worry about. For a list of troubles which can be caused by using signed stuff for marshalling - see a comment by SICrane below; however, you don't really need to care about them - just use unsigned and you'll be fine :-).

Another thing to keep in mind is to use those guaranteed-size types such as uint32_t and uint16_t (rather than int and short). You never know where your code will be compiled in 5 years from now, and just recently I've seen a guy who needed to fix his code because when compiling for AVR8, sizeof(int) is 2 (but sizeof(uint32_t) is always 4, regardless of the platform).

Summary

Properties of the three ways to handle endianness can be summarized in the following table:

	Applicability	Clocks on x86, write uint16_t	Clocks on x86, read uint16_t	Clocks on x86, write uint32_t	Clocks on x86, write uint32_t Clocks on x86, read uint32_t
[Roy2013]	Only C-structure based	16-21	17-23	16-21	17-23
LITTLE_/BIG_ENDIAN	Only C-structure based	1	2-3	1	2-3
Endianness-agnostic	Both C-structure based and serialization	4	6-8	8	12-16
[NoBugs2015]	Both C-structure based and serialization	1	2-3	1	2-3

Of course, on non-x86 platforms the picture won't be as good for [Nobugs2015] as it is written, but it will still perform exactly as the endianness-agnostic one, and there will be an option to optimize it for a specific platform (in the manner similar to x86 optimization) if necessary and possible.

References

[Roy2013] Promit Roy, Writing Endian Independent Code in C++, 2013, http://www.gamedev.net/page/resources/_/technical/general-programming/writing-endian-independent-code-in-c-r3301
[WikipediaEndianness] https://en.wikipedia.org/wiki/Endianness
[WikipediaVLQ] https://en.wikipedia.org/wiki/Variable-length_quantity
[Pike2012] Rob Pike, The Byte Order Fallacy, 2012, http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
[NoBugs2015] 'No Bugs' Hare, 64 Network DO's and DON'Ts for Game Engine Developers, Part IIa, Protocols and APIs, 2015 http://ithare.com/64-network-dos-and-donts-for-game-engine-developers-part-iia-protocols-and-apis/

Article Update Log

Jun 5, 2015: Minor change to explanation "why signed is bad" in Implementation Detail section
Jun 4, 2015: Added LITTLE_/BIG_ENDIAN approach to those being analyzed
Jun 4, 2015: Added 'Implementation Notes' section
Jun 2, 2015: Initial release