Data Serialization Comparison: JSON, YAML, BSON, MessagePack

JSON is the de facto standard for data exchange on the web, but it has its drawbacks, and there are other formats that may be more suitable for certain scenarios. I’ll compare the pros and cons of the alternatives, including ease of use and performance.

Note: I won’t cover implementation details here, but if you’re a Ruby programmer, check out this article, where Dhaivat writes about implementing some serialization formats in Ruby.

Key Takeaways

- JSON (JavaScript Object Notation) is the most widespread format for data serialization, offering human-readable text, a simple specification, and broad support. However, it has limitations, particularly when encoding binary data.
- BSON (Binary JSON) is a binary-encoded serialization of JSON-like documents. It offers convenient storage of binary information, is designed for fast in-memory manipulation, and is the primary data representation for MongoDB. However, it can be more expensive than JSON to serialize.
- MessagePack is a binary serialization format designed for efficient transmission over the wire. It often outperforms BSON in both speed and size, and offers better JSON compatibility.
- YAML (YAML Ain’t Markup Language) is a plaintext serialization format that is both human-readable and compact. It is particularly suited to viewing and editing data structures. However, its specification is much larger than JSON’s, which makes it more complex.

What Is Data Serialization

According to Wikipedia, serialization is:

the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.

Let’s say you want to collect certain data about a group of people: name, last name, nickname, date of birth, and the instruments they play. You could easily set up a spreadsheet, define some columns, and make every row an entry. You could go just a little further and define that the date of birth column must be a number, and that the instruments column must be a list of options. It’d look like this:

name	last name	dob	nickname	instruments
William	Bailey	1962	Axl Rose	vocals, piano
Saul	Hudson	1965	Slash	guitar

More or less, what you did there was to define a data structure, and you’ll do just fine if you only need this in a spreadsheet format. The problem is that, if you ever want to exchange this information with a database or a website, the mechanics by which these data structures are implemented on those other platforms will be dramatically different, even if the underlying semantics are broadly the same. You can’t just plug-n-play a spreadsheet into a web application, unless the application has been specifically designed for it. And you can’t transfer that info from the website to the database unless you have some sort of export tool or gateway for it.

Let’s assume that our website already has these data structures implemented in its internal logic, and that it just cannot deal with a spreadsheet format. To solve these problems, you can translate these data structures into a format that can be easily shared across different applications, architectures, or what have you: you serialize them. By doing so, you ensure not only that you can transfer this data across platforms, but also that it can be reconstructed in the reverse process, called deserialization. Furthermore, if the data is exchanged back from the website to the spreadsheet, you’ll get a semantically identical clone of the original object, that is, a row that looks exactly the same as the one you originally sent.

In short: serializing data is finding some sort of universal format that can be easily shared across different applications.
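
To make that round trip concrete, here’s a minimal Python sketch. It uses JSON (the first format covered below) and Python’s standard json module purely as an example. The rows variable is just my stand-in for the spreadsheet above, not part of any real application.

```python
import json

# The two spreadsheet rows from above, as plain Python data structures.
rows = [
    {"name": "William", "last name": "Bailey", "dob": 1962,
     "nickname": "Axl Rose", "instruments": ["vocals", "piano"]},
    {"name": "Saul", "last name": "Hudson", "dob": 1965,
     "nickname": "Slash", "instruments": ["guitar"]},
]

# Serialize: turn the in-memory structure into text that can be stored or sent.
serialized = json.dumps(rows)

# Deserialize: rebuild an equivalent structure from that text.
restored = json.loads(serialized)

print(type(serialized).__name__)  # str
print(restored == rows)           # True
```

The restored list is a new object, but it compares equal to the original, which is the “semantically identical clone” described above.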

The Formats

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It’s easy for humans to read and write; it’s easy for machines to parse and generate.

JSON is the most widespread format for data serialization, and it has the following features:

- (Mostly) human-readable code: even if the code has been obscured or minified, you can always indent it with tools such as JSONLint and make it readable again.
- Very simple and straightforward specification: a summary of the whole spec fits on a single page (as displayed on the JSON site).
- Widespread support: virtually every programming language and IDE comes with JSON support, and many web service APIs offer JSON as a means of data interchange.
- As a subset of JavaScript, it supports the following JavaScript data types:
  - string
  - number
  - object
  - array
  - true and false
  - null

This is how our previous spreadsheet would look, after being serialized in JSON:

[
  {
    "name": "William",
    "last name": "Bailey",
    "dob": 1962,
    "nickname": "Axl Rose",
    "instruments": [
      "vocals",
      "piano"
    ]
  },
  {
    "name": "Saul",
    "last name": "Hudson",
    "dob": 1965,
    "nickname": "Slash",
    "instruments": [
      "guitar"
    ]
  }
]

BSON

BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. … It also contains extensions that allow representation of data types that are not part of the JSON spec.

JSON is a plain text format, and while binary data can be encoded in text, this has certain limitations and can make JSON files very big. BSON comes in to deal with these problems.
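
To get a feel for that size penalty, here’s a small Python sketch that embeds binary data in JSON using Base64, one common text encoding for this purpose. The 3,000-byte random blob and the "attachment" key are assumptions of mine, standing in for a real image or file.

```python
import base64
import json
import os

# 3,000 random bytes stand in for an image or attachment (hypothetical payload).
blob = os.urandom(3000)

# JSON has no binary type, so the bytes must first be turned into text.
encoded = base64.b64encode(blob).decode("ascii")
document = json.dumps({"attachment": encoded})

print(len(blob))     # 3000
print(len(encoded))  # 4000: Base64 emits 4 characters for every 3 bytes

# Reading it back needs the extra decoding step as well.
restored = base64.b64decode(json.loads(document)["attachment"])
print(restored == blob)  # True
```

That’s roughly a one-third increase in size before the JSON syntax itself is counted, plus an extra encode/decode step on each side.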

It has the following features:

- convenient storage of binary information: better suited to exchanging images and attachments
- designed for fast in-memory manipulation
- simple specification: like JSON, BSON has a very short and simple spec
- primary data representation for MongoDB: BSON is designed to be traversed easily
- extra data types:
  - double (64-bit IEEE 754 floating point number)
  - date (integer number of milliseconds since the Unix epoch)
  - byte array (binary data)
  - BSON object and BSON array
  - JavaScript code
  - MD5 binary data
  - regular expressions
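
As a rough illustration of how BSON lays out a document, here’s a hand-rolled Python sketch based on my reading of the public BSON spec. The encode_bson helper is hypothetical and handles only strings and 32-bit integers. In practice you’d use a real library (such as the one that ships with your MongoDB driver) instead.

```python
import json
import struct

def encode_bson(document):
    """Encode a flat dict of str and int32 values as a BSON document.

    Illustrative only: real BSON defines many more element types.
    """
    body = b""
    for key, value in document.items():
        name = key.encode("utf-8") + b"\x00"      # element name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # 0x02 = UTF-8 string: int32 byte length (incl. terminator), then bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # 0x10 = 32-bit integer, little-endian
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported value in this sketch: {value!r}")
    # Document = int32 total size (counting itself), elements, 0x00 terminator
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

doc = {"hello": "world"}
as_bson = encode_bson(doc)
as_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")

print(as_bson.hex())
print(len(as_bson), len(as_json))  # 22 17
```

The explicit length prefixes are what make a BSON document quick to traverse, since a reader can skip over a value without parsing it. They’re also one reason a BSON document holding only text can end up larger than the equivalent JSON.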

MessagePack

It’s like JSON. But fast and small.

MessagePack (also msgpack) is another binary format for serialization. It’s not as well known as BSON, but it’s worth a look.

Among its features:

- designed for efficient transmission over the wire
- better JSON compatibility than BSON: as explained by Sadayuki Furuhashi in this Stack Overflow post
- smaller than BSON: it has a smaller overhead than BSON, and can serialize smaller objects most of the time
- type checking: some implementations offer type-checking APIs that support static typing
- streaming API: support for streaming deserializers, which is useful for network communication
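
To show where the size savings come from, here’s a hand-rolled Python sketch of a tiny subset of the MessagePack encoding, based on my reading of the format’s spec. The pack helper is hypothetical and covers only nil, booleans, small non-negative integers, short strings, and small maps. A real project would use an actual MessagePack library.

```python
import json

def pack(value):
    """Encode a tiny subset of MessagePack: nil, bool, small ints, short str, small maps."""
    if value is None:
        return b"\xc0"
    if value is True:
        return b"\xc3"
    if value is False:
        return b"\xc2"
    if isinstance(value, int) and 0 <= value <= 127:
        return bytes([value])                      # positive fixint: the byte is the value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr: length lives in the type byte
    if isinstance(value, dict) and len(value) <= 15:
        out = bytes([0x80 | len(value)])           # fixmap: entry count lives in the type byte
        for key, item in value.items():
            out += pack(key) + pack(item)
        return out
    raise TypeError(f"unsupported value in this sketch: {value!r}")

doc = {"compact": True, "schema": 0}
packed = pack(doc)
as_json = json.dumps(doc, separators=(",", ":"))

print(packed.hex())
print(len(packed), len(as_json))  # 18 27
```

Small values carry their type and their length (or even their whole value) in a single byte, so there are no quotes, colons, or commas to pay for.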

YAML

YAML: YAML Ain’t Markup Language.
What It Is: YAML is a human friendly data serialization standard for all programming languages.

Back to plaintext formats, YAML is an alternative to JSON:

- (truly) human-readable code: YAML is so readable that even its front-page content is displayed in YAML to make this point
- compact code: whitespace indentation is used to denote structure, with no need for quotes or brackets
- syntax for relational data: internal references are possible with anchors (&) and aliases (*)
- especially suited for viewing/editing of data structures: such as configuration files, dumping during debugging, and document headers
- a rich set of language-independent types:
  - collections:
    - unordered set of key: value pairs (!!map)
    - ordered sequence of key: value pairs without duplicates (!!omap)
    - ordered sequence of key: value pairs allowing duplicates (!!pairs)
    - unordered set of non-equal values (!!set)
    - sequence of arbitrary values (!!seq)
  - scalar types:
    - null values (~, null)
    - decimal (1234), hexadecimal (0x4D2) and octal (02333) integers
    - fixed (1_230.15) and exponential (12.3015e+02) floats
    - infinity (.inf, -.Inf) and not-a-number (.NAN)
    - true (Y, true, Yes, ON) and false (n, FALSE, No, off)
    - binary (!!binary) with base64 encoding
    - timestamps (!!timestamp)

This is how our little spreadsheet looks when serialized in YAML:

- name: William
  last name: Bailey
  dob: 1962
  nickname: Axl Rose
  instruments:
    - vocals
    - piano

- name: Saul
  last name: Hudson
  dob: 1965
  nickname: Slash
  instruments:
    - guitar

Other Formats

There are a number of other formats for serialization, such as Protocol Buffers (protobuf, also binary), that I’ve (in a rather discretionary manner) left out. If you just want to know every possible format, go and have a look at Wikipedia’s comparison of data serialization formats.

… HDF5?

We’ll get a bit off-topic here, but just slightly. The Hierarchical Data Format version 5 (HDF5) isn’t really for serialization, but rather for storage, and it’s taking data science and other industries by storm. It’s a very fast and versatile format that can be used not only to store a number of data structures, but even as a replacement for relational databases.

To conclude this intermission, let’s just mention that if you’re into binary formats such as BSON and MessagePack for storing/exchanging big volumes of information, you may very well want to have a look at HDF5.

Benchmarks and Comparisons

A pattern that emerges from the benchmarks linked below is that BSON can be more expensive than JSON when serializing but faster when deserializing, and that MessagePack tends to be faster than both for either operation. Also, because of its overhead and in spite of being a binary format, BSON files can occasionally be bigger than JSON ones when storing non-binary data. Some links to have a look at:

- Serialization Performance comparison (C#/.NET) by Maxim Novak on M@X on DEV.
- Protocol Buffers, Avro, Thrift & MessagePack by Ilya Grigorik on igvita.com.
- Binary Serialization Tour Guide by Karlin Fox at Atomic Object.
- Efficiently Store Pandas DataFrames by Matthew Rocklin.
- MessagePack vs JSON vs BSON by Wesley Tanaka.

It’s also worth noting that the performance could change depending on the serializer and the parser you choose, even for the same format.
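
If you’d rather measure than take anyone’s word for it, a small timing harness is easy to sketch in Python. The payload and the measure helper below are my own assumptions. The sketch only exercises the standard json module with two different settings, but any other pair of encode/decode functions could be passed in the same way.

```python
import json
import timeit

# Hypothetical payload: 1,000 small records shaped like the rows used earlier.
payload = [
    {"name": f"person-{i}", "dob": 1960 + i % 40, "instruments": ["guitar", "piano"]}
    for i in range(1000)
]

def measure(label, dumps, loads, repeat=20):
    """Time one serializer/parser pair and report the encoded size."""
    encoded = dumps(payload)
    assert loads(encoded) == payload          # only compare codecs that round-trip
    ser = timeit.timeit(lambda: dumps(payload), number=repeat) / repeat
    de = timeit.timeit(lambda: loads(encoded), number=repeat) / repeat
    print(f"{label:<14} {len(encoded):>7} chars  dump {ser * 1e3:.3f} ms  load {de * 1e3:.3f} ms")
    return len(encoded), ser, de

default = measure("json default", json.dumps, json.loads)
compact = measure("json compact", lambda o: json.dumps(o, separators=(",", ":")), json.loads)
```

Even within a single format and a single library, the output size changes with the serializer’s settings, and the timings will vary from machine to machine, so treat any benchmark (including this one) as indicative only.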

Remarks and Commentary

As silly as it may sound, BSON has the advantage of the name: people automatically link the format developed by MongoDB (BSON) to the standard (JSON), even though the two are not formally associated. So when searching for a binary alternative to JSON, you may also want to consider other options.

In fact, MessagePack seems to beat BSON in every possible aspect: it’s faster, smaller, and even more compatible with JSON than BSON is. (If you’re already working with JSON, MessagePack is almost a drop-in optimization.) Maybe as a “reporter” I should be more balanced, but as a developer, this is a no-brainer.

Still, BSON is MongoDB’s format to store and represent data, so if you’re working with this NoSQL DB, that’s a reason to stick with it.

Of course, serialization is not all about storing binary data. Admittedly, JSON has a different goal in mind — that of being “human readable”. But it doesn’t take much effort to notice that YAML does a significantly better job at it.

However, the YAML spec is awfully big, especially when compared to JSON’s. But arguably, it must be, as it comes with more data types and features.

On the other hand, it can’t be ignored that the simplicity of JSON played a key role in its adoption over other serialization formats. It relies on an already existing and widespread language, JavaScript, and if you know or are exposed to JS (which, if you are in the web development industry, you are), you already know JSON.

Then why not adopt YAML right now? In many cases it isn’t that easy. JSON still has a place in web APIs, as you can easily embed JSON in HTTP requests (both for GET, as in URLs, and POST, as in sending a form). The format will also let you know if the transmission was suddenly cut, as the truncated code will no longer be valid, which may not be the case with YAML and other competing plaintext formats. Also, you’ll still need to interact at one point or another with JSON-based APIs and legacy code, and it’s always a pain maintaining two pieces of code (JSON and YAML methods) for the same purpose (data serialization).
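
The point about cut transmissions is easy to check. Here’s a minimal Python sketch using the standard json module. The is_complete_json helper and the sample document are hypothetical, and the “transfer cut halfway” is simulated by slicing the string.

```python
import json

complete = '[{"name": "Saul", "instruments": ["guitar"]}]'
truncated = complete[: len(complete) // 2]   # simulate a transfer cut halfway

def is_complete_json(text):
    """Return True if text parses as a whole JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_complete_json(complete))   # True
print(is_complete_json(truncated))  # False
```

Because the array or object must be closed by its final bracket, any cut leaves the text unparseable. A block-style YAML document cut at a line boundary, by contrast, can still parse, just with fewer entries.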

But then again, these are partly the same arguments that hold us back and prevent us from adopting newer and more efficient technologies (for example, Python 3 over Python 2). And I thought for a minute that we programmers and entrepreneurs were innovators. Aren’t we?

Frequently Asked Questions on Data Serialization and JSON Alternatives

What are the main differences between JSON and YAML?

JSON and YAML are both data serialization formats, but they have some key differences. JSON is a subset of JavaScript and is often used in web applications due to its compatibility with JavaScript. It uses a simple syntax and is easy to read and write. However, it lacks some features like comments and multi-line strings. On the other hand, YAML is a superset of JSON and has a more human-friendly syntax. It supports comments and multi-line strings, making it easier to use for configuration files. However, it is more complex and less widely supported than JSON.

How does BSON compare to JSON and YAML?

BSON, or Binary JSON, is a binary representation of JSON-like documents. It is designed to be lightweight and quick to traverse, encode, and decode, although its overhead means a BSON document is not always smaller than the equivalent JSON. BSON can store more data types than JSON, including binary and date data types. However, it is not as human-readable as JSON or YAML, and it is primarily used in MongoDB for storing and retrieving data.

What is MessagePack and how does it compare to other data serialization formats?

MessagePack is a binary serialization format that is similar to JSON but more efficient. It is compact, fast, and supports a wide range of data types. It is often used in applications where performance is critical, such as real-time streaming applications. However, like BSON, it is not as human-readable as JSON or YAML.

Are there any other alternatives to JSON?

Yes, there are several other alternatives to JSON, including XML, Protobuf, and Avro. XML is a markup language that is human-readable and supports complex data structures, but it is more verbose than JSON. Protobuf, or Protocol Buffers, is a binary serialization format developed by Google that is compact and fast, but not human-readable. Avro is a binary serialization format developed by Apache that supports schema evolution, making it suitable for long-term data storage.

Which data serialization format should I use?

The choice of data serialization format depends on your specific needs. If you need a format that is human-readable and easy to use, JSON or YAML might be the best choice. If you need a format that is compact and fast, MessagePack or BSON might be more suitable. If you need a format that supports schema evolution, Avro might be the best choice. It’s important to understand the strengths and weaknesses of each format before making a decision.

Can I use multiple data serialization formats in the same application?

Yes, it is possible to use multiple data serialization formats in the same application. For example, you might use JSON for data interchange between the client and server, and BSON for storing data in MongoDB. However, using multiple formats can add complexity to your application, so it’s important to carefully consider the trade-offs.

How can I convert data between different serialization formats?

There are several libraries and tools available that can convert data between different serialization formats. For example, in Python you can use the built-in json module to convert data between JSON and Python objects, or the yaml module (provided by the third-party PyYAML package) to convert data between YAML and Python objects. There are also online tools like json2yaml that can convert data between JSON and YAML.

What are the performance implications of using different data serialization formats?

The performance implications of using different data serialization formats can vary depending on the specific use case. Binary formats like BSON and MessagePack are often faster and more compact than text-based formats like JSON and YAML, though not in every case: BSON, for instance, can be slower to serialize and occasionally larger than JSON. Binary formats are also not as human-readable, which can make debugging more difficult. It’s also important to consider the performance of the libraries and tools you are using to serialize and deserialize data.

Are there any security considerations when using data serialization formats?

Yes, there are several security considerations when using data serialization formats. The formats themselves are just data, but careless handling can lead to code execution: for example, evaluating JSON as JavaScript with eval() instead of parsing it, or loading YAML with a parser configured to construct arbitrary objects. It’s important to use trusted libraries and their safe parsing modes to serialize and deserialize data, and to validate any user-supplied data.
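
As a minimal sketch of that advice in Python, the snippet below parses untrusted text with a data-only parser and then checks its shape. The parse_untrusted helper and the required "name" field are assumptions of mine, chosen just for illustration.

```python
import json

def parse_untrusted(text):
    """Parse untrusted input with a data-only JSON parser, then validate its shape."""
    data = json.loads(text)            # never eval(): that would run the text as code
    if not isinstance(data, dict) or not isinstance(data.get("name"), str):
        raise ValueError("expected an object with a string 'name' field")
    return data

print(parse_untrusted('{"name": "Slash"}'))   # {'name': 'Slash'}

hostile = "__import__('os').getcwd()"   # valid Python, but not valid JSON
try:
    parse_untrusted(hostile)
except ValueError as error:            # json.JSONDecodeError subclasses ValueError
    print("rejected:", type(error).__name__)  # rejected: JSONDecodeError
```

The hostile string would have run as code if passed to eval(), but a real JSON parser simply refuses it.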

How can I learn more about data serialization formats?

There are many resources available online to learn more about data serialization formats. You can start by reading the official documentation for each format, which often includes tutorials and examples. There are also many tutorials and articles available on websites like Stack Overflow and Medium. Finally, you can experiment with different formats in your own projects to gain hands-on experience.