Beyond JSON — Introduction to FlatBuffers
This article is a “voice over” to the slides which you can find here:
Let’s jump directly to slide 7
Why do we need data serialisation?
This is a very basic question, but in my experience it is important to think about the WHY before we start analysing the HOW.
I came up with three main purposes for data serialisation:
- Persisting State
- Machine to Machine Communication
- Representation of Configuration
How can we persist data on mobile?
I am starting with the most inconvenient way of data serialisation — Custom binary representation.
Nobody (except for some hardcore game developers) does it this way, and there are a thousand reasons why. However, those game developers don’t do it just because they are hardcore. They do it because they have to. A friend of mine worked on a mobile real-time synchronous PvP game, which had to work with a latency of around 100 ms on a cellular network. In such a restricted scenario you need your machine to machine communication to be as efficient as possible. After trying out different things they went with a mix of TCP and UDP and a custom binary representation. You are probably not in their shoes, but I still wanted to mention this technique.
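To give an idea of what a custom binary representation means in practice, here is a minimal Python sketch. The message layout (player id, position, health) is a made-up assumption for illustration and not the format used in the game mentioned above.

```python
import struct

# Hypothetical wire layout for one player update (not from the article):
# little-endian player id (uint16), x and y (float32), health (uint8).
PLAYER_UPDATE = struct.Struct("<HffB")

def encode_update(player_id, x, y, health):
    """Pack one update into a fixed 11-byte message."""
    return PLAYER_UPDATE.pack(player_id, x, y, health)

def decode_update(payload):
    """Unpack the message back into a tuple of Python values."""
    return PLAYER_UPDATE.unpack(payload)

message = encode_update(7, 12.5, -3.0, 100)
print(len(message))            # 11
print(decode_update(message))  # (7, 12.5, -3.0, 100)
```

The message is tiny and fast to read, but both sides have to agree on the exact byte layout, and any change to it breaks every client that still uses the old one.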
Most OOP languages provide built in binary serialisation:
- Java: http://docs.oracle.com/javase/7/docs/api/java/io/Serializable.html
- C#: https://msdn.microsoft.com/de-de/library/mt656716.aspx
- ObjC: https://developer.apple.com/reference/foundation/nscoding
It is more convenient than rolling your own binary format, but it also has some problems. I will come back to them later, when we talk about the benefits of FlatBuffers.
Text based representations are very popular, specifically when it comes to representation of configurations. They are a human readable and writable way of storing information. Here is a list of the most popular formats:
CSV is mostly used for large data dumps and for exports of databases or spreadsheets.
JSON is the current replacement for XML. It is widely used for configuration, machine to machine communication and persisting data (it is the way to represent data in many document based NoSQL databases).
To be honest I am not really familiar with YAML, but as far as I know it is widely used for configurations.
All these text based representations have one weakness when it comes to performance: they have to be parsed. But more on that later.
Embedded SQL or NoSQL databases are the poison of choice when it comes to persisting state. The most used one is SQLite, but if you google for NoSQL DB for mobile, you will find a few alternatives. However as I mentioned before, they are only practical for one out of three use cases, so I will not spend more time on them.
Last but not least is the binary cross platform serialisation library. The list states:
- Cap’n Proto
- Apache Thrift
There are many more, but I will concentrate on FlatBuffers anyway, because I am most familiar with it and it also fits best with all three use cases. This is because the FlatBuffers IDL (interface definition language) describes the data representation only, whereas the others concentrate on machine to machine communication and therefore let you describe the endpoints.
Important criteria for persisting data
Now that we know how we can persist data, let’s think about how we can compare those different strategies. Here I listed five points which I find important.
Size on disk
This is the outcome of the process. We would like it to be as small as possible. This might not be the most important criterion, but in my opinion it is the most obvious one.
Speed of read & write / partial read / memory consumption
This covers the process itself. How fast can we convert in-memory data into something that we can put on disk, and the other way around? Is it possible to read only the relevant portion? For example, if we have a 200 MB CSV file with 3 million entries, can we read just 100 entries starting from entry 123000? How much transient memory will the serialisation and deserialisation process generate? In scenarios where memory is scarce, this can decide whether your App crashes or not.
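The partial read question can be sketched in Python. The sketch assumes a hypothetical binary file of fixed-size records (the record layout is invented for illustration); with a text format like CSV the lines have different lengths, so there is no way to compute where entry 123000 starts without scanning everything before it.

```python
import io
import struct

# Hypothetical fixed-size record: entry id (uint32) and a value (float64).
RECORD = struct.Struct("<Id")

def write_records(stream, count):
    """Write `count` records back to back."""
    for i in range(count):
        stream.write(RECORD.pack(i, i * 0.5))

def read_slice(stream, start, count):
    """Seek straight to record `start` and read `count` records."""
    stream.seek(start * RECORD.size)
    data = stream.read(count * RECORD.size)
    return [RECORD.unpack_from(data, i * RECORD.size) for i in range(count)]

buffer = io.BytesIO()          # stands in for a file on disk
write_records(buffer, 200_000)
entries = read_slice(buffer, 123_000, 100)
print(entries[0], len(entries))  # (123000, 61500.0) 100
```

Only the 100 requested records are read and decoded; everything else in the buffer is never touched.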
Human readable and writable
When we persist data, there are cases where we want to be able to read what we just persisted, specifically during the development and debug phase. It is also nice to write some data by hand in order to test a use case. If your serialisation technique does not support this, you will inflict a lot of pain on developers, and this might lead to slower development cycles.
Support of OO language type
If you are working with a typed programming language, you want your data to be typed. Typed data eliminates a certain class of bugs and keeps your development experience smooth.
Data versioning / evolution / migration
Persisting data means you will read this data later. This implies that you might read the data with a newer version of your App. Because of this, we need a strategy for backward compatibility and migration.
FlatBuffers vs. JSON
Now that we know the Why, the How and what’s important, let’s make some comparisons.
Here we see the listed criteria with my assessment of how they apply to JSON as a data persistence format.
Size on disk 👍
For size on disk I give only one thumb up. JSON is not that bad when it comes to size on disk, specifically if we minify the text (making it less human readable). However, it still has lots of repetition and can be verbose.
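A small Python sketch of the minification trade-off, using made-up sample data:

```python
import json

# Made-up sample data for illustration.
people = [{"name": f"max{i}", "age": 30 + i} for i in range(1, 4)]

pretty = json.dumps(people, indent=2)
minified = json.dumps(people, separators=(",", ":"))

# Minifying removes whitespace only.
print(len(pretty), len(minified))
# The key names are still repeated once per object.
print(minified.count('"name"'))  # 3
```

Minifying shrinks the text, but the property names are still written out for every single object, which is the repetition mentioned above.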
Speed of read & write / partial read / memory consumption 👎
Here I give one thumb down. The efficiency of serialisation and deserialisation is not that great, because the text has to be parsed. This also means lots of transient memory, and there is no way to do a partial read. You have to process the whole JSON file before you can ask for a specific portion of the tree.
Human readable and writable 👍
This is where JSON shines in my opinion. It is very much readable and writeable. However as mentioned in “Size on disk” you would have to make it less readable in order to make the outcome much smaller.
Support of OO language type 👍👎
This is a mixed bag. You will find a library for every typed OO language which will give you support for types, but it comes at a price. If you use such a library, I guess I would have to put a second thumb down on the second point of our list (serialisation/deserialisation efficiency).
Data versioning / evolution / migration 👍👎
This is again a mixed bag. Because JSON is not typed, it is fairly easy to make sure that your App is backwards compatible, and you can write some migration code. However, this is not a feature of the format; it is more of a coincidence. It is also easy to shoot yourself in the foot if you are not careful.
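As a sketch of what "by coincidence" means here, assume two hypothetical versions of an App that write slightly different JSON documents. The reading code below only stays compatible because every optional key is looked up defensively.

```python
import json

# Written by a hypothetical old app version: it only knows "name" and "age".
old_document = '{"name": "max1", "age": 42}'
# Written by a newer version: "age" is gone, "birthday" is new.
new_document = '{"name": "max1", "birthday": "1980-01-01"}'

def load_person(text):
    """Read a person defensively; every optional key needs an explicit default."""
    raw = json.loads(text)
    return {
        "name": raw["name"],
        "age": raw.get("age"),            # None when written by new code
        "birthday": raw.get("birthday"),  # None when written by old code
    }

print(load_person(old_document))
print(load_person(new_document))
# raw["age"] instead of raw.get("age") would raise KeyError on new_document.
```

Nothing in the format enforces this discipline; a single direct key access is enough to crash on data written by another version.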
Now let’s talk about FlatBuffers and how this format withstands our criteria.
Size on disk 👍👎
This is a mixed bag. If you have lots of structurally repeating data, FlatBuffers is great. It even lets you avoid repetition (more on that later). But if you want to persist a small amount of data, like one small object, it can be overkill.
Speed of read & write / partial read / memory consumption 👍👍
Two thumbs up! This is what this format was designed for. You can read and write with zero transient memory, and it is blazing fast. It is also possible to do partial reads.
Human readable and writable 👍👎
Again a mixed bag. The format itself is binary and therefore not human readable. However, there are tools which let you translate the binary file to a JSON file and vice versa.
Support of OO language type 👍👍
Again a solid two thumbs up. The code generator generates a typed API from the given fbs (FlatBuffers schema) file.
Data versioning / evolution / migration 👍👍
FlatBuffers is designed to be backwards and forwards compatible. This means not only that a new version of the App can read files which were persisted with an old version of the App, but also that an old App version can still handle data persisted by new code. This is built into the format, and there is no need to be paranoid about App crashes because of unexpected data representation.
FlatBuffers vs. JSON
Here we see a table which compares JSON and FlatBuffers side by side. We can see that the only obvious benefit of using JSON is the human readable and writable part. Size on disk is questionable; however, in most real world examples FlatBuffers can eat JSON’s lunch on this.
But overall we must admit that JSON is not that good a choice for persisting data. But what about machine to machine communication? Almost all the cool kids use JSON for machine to machine communication. Maybe we missed something there?
OK, let’s bring up our five points again, specifically with regard to sending data.
Important criteria for sending data
Size on wire
We still should strive for the smallest outcome. Small size means more throughput.
Speed of read & write / memory consumption
While sending data, we don’t care much about partial reads, because if we receive something we will mostly have to read it completely anyway, but read & write speed is very important, specifically on the backend. If your backend is bombarded with thousands of requests with a payload, you want this payload to be readable fast and without transient memory overhead.
Human readable
I guess human writable is not that important, because it is machines that write the messages, but human readable could be practical, specifically if we want to eavesdrop on the conversation.
Support of OO language type
Is as important as for persisting data.
Data versioning / evolution
The migration part is not there any more, because you really don’t want to migrate an old message, but versioning and evolution are even more important, as it could happen that you have two different client versions talking to each other, or a new backend has to serve an old version of an App.
Enough theory let’s do some practice
So I guess I brought up some compelling points why FlatBuffers might be a better choice for data serialisation. Now I want to tell you what it takes to use FlatBuffers in practice.
When we talk about FlatBuffers, we have to start with the interface definition language. This is the way to describe data you want to serialise.
Here we see a tiny example fbs file. It describes that the only thing we will serialise is a person. A person has two properties, name and age. Name is a string and age is a 4-byte integer.
Person is defined as a table. A table is like a class in your typical OO language.
./flatc -n person.fbs
Given this fbs file we can call the flat buffers compiler (flatc) with a parameter -n and it will generate C# classes needed to serialise and deserialise Person. (-n stands for C#, other languages are supported as well)
FlatBuffers Schema Editor
I personally like to have an editor with content assist and live error highlighting. This is why I created a FlatBuffers Schema editor based on Xtext.
It can generate code for you as well.
FlatBuffers IDL 1/4
But let’s go back to our small example.
Can you spot a problem with this example?
An age property is a very bad idea for persisting state. The data will become stale pretty quickly. It would be a much better idea to store a birthday instead.
FlatBuffers IDL 2/4
Here we can see that we introduced a new property birthday of type Date and declared age as deprecated. This is the evolution strategy of FlatBuffers in action. By deprecating age, we make sure that new clients will not be able to store it. However, new clients will still be able to read the data stored by an old client. And if an old client tries to read data stored by a new client, that is still OK: the age will be null, but the client will still be able to access the name.
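The idea behind this evolution strategy can be sketched with a toy model in plain Python. This is not the FlatBuffers API or its binary format; it only assumes that every field keeps a stable slot number and that readers treat a missing slot as "no value".

```python
# Toy model: slot numbers follow field order in the schema.
# name=0, age=1, birthday=2. Not the real FlatBuffers encoding.
OLD_SCHEMA = {"name": 0, "age": 1}
NEW_SCHEMA = {"name": 0, "birthday": 2}   # age is deprecated, slot 1 stays reserved

def write(schema, **values):
    """Store values keyed by slot number; fields never set are simply absent."""
    return {schema[field]: value for field, value in values.items()}

def read(schema, stored, field):
    """Look a field up by its slot; a missing slot reads as None."""
    return stored.get(schema[field])

by_old = write(OLD_SCHEMA, name="max1", age=42)
by_new = write(NEW_SCHEMA, name="max1", birthday=(1, 1, 1980))

# New reader, old data: name is there, birthday is missing.
print(read(NEW_SCHEMA, by_old, "name"), read(NEW_SCHEMA, by_old, "birthday"))
# Old reader, new data: name is there, age is missing.
print(read(OLD_SCHEMA, by_new, "name"), read(OLD_SCHEMA, by_new, "age"))
```

Because the slot of a deprecated field is never reused, old and new readers always agree on what each slot means, and unknown slots are simply ignored.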
Let’s have a closer look at the Date. We see that Date is declared as a struct. This means that Date is a value type, whereas Person is a table and therefore a reference type. Value types / structs can only contain scalars or other structs, and they have no evolution strategy. I will talk about this in more detail later, when we have a peek at how the data is stored under the hood.
FlatBuffers IDL 3/4
Now let’s add some more fields to the person table and explore some more capabilities of FlatBuffers
In this step we added gender. Gender is defined as an enum type. So we can have only Female, Male and Other. Internally gender value is a byte, where Female is equal to 0, Male to 1 and Other is 2.
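In Python terms, the enum described here behaves roughly like the following sketch. It only mirrors the value assignment and the one-byte size; the generated FlatBuffers code will look different.

```python
import struct
from enum import IntEnum

class Gender(IntEnum):
    Female = 0
    Male = 1
    Other = 2

encoded = struct.pack("<b", Gender.Other)           # one signed byte
decoded = Gender(struct.unpack("<b", encoded)[0])   # back to the enum member
print(len(encoded), encoded, decoded.name)          # 1 b'\x02' Other
```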
FlatBuffers IDL 4/4
Last but not least, we are adding the friends property
Friends is a vector of Persons. Do you see how important this small addition is?
Now we have turned our small useless example into something very sophisticated. We can store complex graphs of relationships. The mind blowing part, however, is that by adding the friends field we made Person a recursive structure. This means that we can serialise a complex graph structure and not only a tree, as is the case with JSON.
Let’s have a look at a small example.
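Since the slides are not reproduced here, the following is a sketch of what the complete schema described in the four steps might look like. The fields of Date are an assumption (the text only says it is a struct); everything else follows the description. The schema is wrapped in a small Python script that writes it to a file, so that it can be passed to flatc.

```python
from pathlib import Path
import tempfile

PERSON_FBS = """\
// Assumed field layout for Date; the article only says it is a struct.
struct Date {
  day: ubyte;
  month: ubyte;
  year: short;
}

enum Gender : byte { Female, Male, Other }

table Person {
  name: string;
  age: int (deprecated);
  birthday: Date;
  gender: Gender;
  friends: [Person];
}

root_type Person;
"""

# Write the schema to a temporary directory and show the compiler call.
schema_path = Path(tempfile.mkdtemp()) / "person.fbs"
schema_path.write_text(PERSON_FBS)
print(f"flatc -n {schema_path}")
```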
Here we see a small graph where we have three persons: max1, max2 and max3. max1 has two friends, max2 has one friend. This kind of structure would be impossible to represent in JSON without building up an adjacency matrix. Otherwise we would have to duplicate max3 by putting it into both the friends array of max1 and the friends array of max2.
It is possible in FlatBuffers, because tables are reference types and the friends array stores only the reference (relative offset) to the Person, not the Person itself. You can also see the difference between struct and table. You see that birthday is a Date, which is a struct. This is why, even though every person has the same birthday, it is not shared. Value types are not sharable.
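The difference between nesting and referencing can be sketched in Python. List indices stand in for the relative offsets FlatBuffers uses; the friendship edges are an assumption that matches the description (max1 knows max2 and max3, max2 knows max3).

```python
import json

max3 = {"name": "max3", "friends": []}
max2 = {"name": "max2", "friends": [max3]}
max1 = {"name": "max1", "friends": [max2, max3]}

# JSON nests objects, so the shared max3 is written out twice.
as_tree = json.dumps(max1)
print(as_tree.count('"max3"'))  # 2

# Reference-based layout: each person is stored once, friends are indices.
people = [max1, max2, max3]
index_of = {id(p): i for i, p in enumerate(people)}
as_graph = [
    {"name": p["name"], "friends": [index_of[id(f)] for f in p["friends"]]}
    for p in people
]
print(as_graph)
```

In the second form max3 exists exactly once, and both max1 and max2 point to it.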
Now the representation that you see on this slide is not the real representation of the data stored under the hood.
It will take me a couple of slides to gradually introduce you to the true representation.
Here we are getting a bit closer to the real representation. In reality FlatBuffers does not store the names of the properties. The properties are retrieved by index, and the names are reflected only in the generated API. As you can see, this makes the stored data much more compact.
On this slide we see that the tables store only fixed size values. Strings and vectors are not fixed size, this is why they are stored separately and are referenced inside of the table. This enables us to reuse vectors and strings if multiple tables contain the same instances.
This is the last piece of the puzzle. Tables contain only non-null values. Look at the previous slide. In all three tables we stored “-”, which is supposed to symbolise a non-existing value or null. Now on this slide the property is gone, but we have a reference to a virtual table which tells us where to find each property. max1 and max2 share a virtual table because they have the same internal structure. max3 has its own virtual table, because its friends property is null as well.
I am sorry it all became so technical, but I included those details in order to explain why reading and writing FlatBuffers is so much faster than JSON.
The buffer as you can see it above doesn’t have to be parsed. It contains all the information we need for partial reads. And it can represent a complex graph.
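The virtual table idea can be illustrated with a toy byte layout in Python. This is deliberately simplified and is not the real FlatBuffers wire format: all fields are assumed to be 32-bit integers, and the field names and layout are invented. It only demonstrates the two properties described above: absent fields take no space in the table, and tables with the same shape share one virtual table.

```python
import struct

SLOTS = ("age", "gender", "friend_count")  # hypothetical all-int32 fields

def add_table(buf, vtable_cache, values):
    """Append a table holding only the non-null values; return its position."""
    offsets, payload = [], b""
    for name in SLOTS:
        if values.get(name) is None:
            offsets.append(0)                  # 0 marks "field absent"
        else:
            offsets.append(4 + len(payload))   # skip the 4-byte vtable link
            payload += struct.pack("<i", values[name])
    vtable = struct.pack(f"<{len(SLOTS)}H", *offsets)
    if vtable not in vtable_cache:             # identical shapes share one vtable
        vtable_cache[vtable] = len(buf)
        buf += vtable
    table_pos = len(buf)
    buf += struct.pack("<I", vtable_cache[vtable]) + payload
    return table_pos

def get(buf, table_pos, name):
    """Read one field straight from the buffer, or None if it is absent."""
    (vtable_pos,) = struct.unpack_from("<I", buf, table_pos)
    (offset,) = struct.unpack_from("<H", buf, vtable_pos + 2 * SLOTS.index(name))
    if offset == 0:
        return None
    return struct.unpack_from("<i", buf, table_pos + offset)[0]

buf, cache = bytearray(), {}
max1 = add_table(buf, cache, {"age": 30, "friend_count": 2})
max2 = add_table(buf, cache, {"age": 31, "friend_count": 1})
max3 = add_table(buf, cache, {"age": 32})
print(len(cache), get(buf, max2, "friend_count"), get(buf, max3, "friend_count"))
# 2 1 None
```

Three tables produce only two virtual tables, and reading a field is two offset lookups in the buffer with no parsing step.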
What does it all mean in practice?
Now we can go through the points we defined in the theoretical parts and with our in-depth knowledge of FlatBuffers, see how they apply.
Size on disk/wire
As I mentioned before, if you want to send a simple message which contains a simple object, FlatBuffers can be a bit bulky. All the virtual tables and indirections need space. However, if you have a large amount of data with repeating structure and values, FlatBuffers is a great choice.
Speed of read & write / partial read / memory consumption
As you saw, the structure is perfect for fast and partial reading. There is no need for parsing, decoding or producing transient memory. We can read values from the buffer directly. This is also why the FlatBuffers Benchmark Page can shine with its numbers.
I also did some experimenting of my own to try out partial reads. Here are the results:
The file used in this demo is 200 MB in size. In this demo I load the file completely into memory. However, I have another demo where I read directly from the file. Reading from the file is 10x slower, but it is still very fast. The slowest query takes about 40 ms.
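The two variants (load everything into memory versus read directly from the file) can be sketched in Python as follows. The record layout and the small sample file are assumptions for illustration; this is not the demo code, and it makes no claim about the timings quoted above.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<Id")  # hypothetical record: uint32 id, float64 score

# Write a small sample file; it stands in for the 200 MB file from the demo.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(10_000):
        f.write(RECORD.pack(i, i / 4))

# Variant 1: load the whole file into memory, then read one record.
with open(path, "rb") as f:
    whole = f.read()
in_memory = RECORD.unpack_from(whole, 1234 * RECORD.size)

# Variant 2: map the file and let the OS page in only what is touched.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
    from_file = RECORD.unpack_from(view, 1234 * RECORD.size)

print(in_memory, from_file)  # (1234, 308.5) (1234, 308.5)
```

Both variants decode the record with the same call; the only difference is whether the bytes already sit in memory or are fetched from the file on demand.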
Human readable and writable
As already mentioned, there is a possibility to translate a binary to JSON and vice versa. This is however only possible if the data is a tree. I am considering building some tooling which will display the graph as a diagram, close to the one you saw in this presentation.
Support of OO language type
The code generated by flatc (the original code generator by Google) is fully typed. To make the experience even better I am generating classes which represent the tables one to one. This makes serialisation and deserialisation less efficient, but makes working with FlatBuffers even easier. Here is our person example in Swift Playgrounds:
Data versioning / evolution
Thanks to the virtual tables, the data we are storing with FlatBuffers is backwards and forwards compatible. This is a great benefit compared to the built-in binary serialisation we can find in OO languages. Needless to say, FlatBuffers is programming language agnostic.