Choosing the Right Serialization Format

Dhaivat Pandya

When saving or communicating some kind of information, we often use serialization. Serialization takes a Ruby object and converts it into a string of bytes and vice versa. For example, if you have an object representing information about a user and need to send it over the network, it has to be serialized into a set of bytes that can be pushed over a socket. Then, at the other end, the receiver has to unserialize the object, converting it back into something that Ruby (or another language) can understand.

It turns out that there are lots of ways to serialize Ruby objects. I’ll cover YAML, JSON, and MessagePack in this article, exploring their pianos and fortes to see them in action with Ruby. At the end, we’ll put together a modular serialization approach using some metaprogramming tricks.

Let’s jump in!

YAML

YAML is a recursive acronym that stands for “YAML Ain’t Markup Language”. It is a serialization format, but it is also (easily) human readable, meaning that it can be used as a configuration language. In fact, Rails uses YAML to do all sorts of configuration, e.g. database connectivity.

Let’s check out an example:

name: "David"
height: 124
age: 28
children:
  "John":
    age: 1
    height: 10
  "Adam":
    age: 2
    height: 20
  "Robert":
    age: 3
    height: 30
traits:
  - smart
  - nice
  - caring

The format of YAML is incredibly easy to understand. The quickest way to make it click is to transform it into a Ruby hash or Javascript object. We’ll go with the former (saving the above YAML in test.yaml):

require 'yaml'

YAML.load File.read('test.yaml')

Running the above in Pry will give you a nicely formatted result that looks like:

{"name"=>"David",
 "height"=>124,
 "age"=>28,
 "children"=>{"John"=>{"age"=>1, "height"=>10},
             "Adam"=>{"age"=>2, "height"=>20},
             "Robert"=>{"age"=>3, "height"=>30}},
 "traits"=>["smart", "nice", "caring"]}

As you can see, the colons represent “key-value” pairings, and indentation creates a new nested hash (YAML only allows spaces for indentation, not tab characters). The little hyphens tell YAML that we want a list rather than a hash. This easy translation between YAML and Ruby hashes is one of the primary benefits of YAML.
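
To make the mapping concrete, here is a minimal sketch that parses a small inline document instead of a file. The document contents and the variable names are made up for illustration; only YAML.load comes from the example above.

```ruby
require 'yaml'

# Hypothetical inline document: indentation (spaces only) nests hashes,
# and hyphens build an array.
doc = <<~DOC
  name: David
  children:
    John:
      age: 1
  traits:
    - smart
    - nice
DOC

data = YAML.load(doc)

p data["children"]["John"]["age"] # value reached through two nested hashes
p data["traits"]                  # array built from the hyphenated items
```

Each extra level of indentation becomes one more level of hash lookup on the Ruby side.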

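YAML can also serialize our own objects. Here is a Person class that converts itself to and from YAML:

require 'yaml'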

class Person
  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end

  def to_yaml
    YAML.dump ({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  def self.from_yaml(string)
    data = YAML.load string
    p data
    self.new(data[:name], data[:age], data[:gender])
  end
end

p = Person.new "David", 28, "male"
p p.to_yaml

p = Person.from_yaml(p.to_yaml)
puts "Name #{p.name}"
puts "Age #{p.age}"
puts "Gender #{p.gender}"

Let’s break down the code. We have the to_yaml method:

def to_yaml
  YAML.dump ({
    :name => @name,
    :age => @age,
    :gender => @gender
  })
end

We are making a Ruby hash and turning it into a YAML string using modules provided by the standard library. To go the other direction and convert a YAML string into a Ruby Object:

def self.from_yaml(string)
  data = YAML.load string
  p data
  self.new(data[:name], data[:age], data[:gender])
end

Here, we take the string, convert it into a Ruby hash, then pass the contents of that hash to the constructor to build a new instance of Person.
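
The version above dumps symbol keys, which show up in the YAML text as Ruby-specific keys such as :name. If the file might be read by other tools, or loaded with the stricter YAML.safe_load, string keys are a possible alternative. The following is only a sketch under that assumption; PortablePerson is a hypothetical name, not part of the original code.

```ruby
require 'yaml'

# Hypothetical variant of Person that writes plain string keys and
# reads them back with YAML.safe_load.
class PortablePerson
  attr_reader :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end

  def to_yaml
    YAML.dump({ "name" => @name, "age" => @age, "gender" => @gender })
  end

  def self.from_yaml(string)
    data = YAML.safe_load(string) # only basic types are allowed here
    new(data["name"], data["age"], data["gender"])
  end
end

yaml = PortablePerson.new("David", 28, "male").to_yaml
puts yaml

copy = PortablePerson.from_yaml(yaml)
puts "Name #{copy.name}"
puts "Age #{copy.age}"
```

The trade-off is that the reader has to index with strings instead of symbols, which is the same convention the JSON version uses.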

Now, let’s see how YAML compares with the heavyweight from the land of Javascript.

JSON

In some ways, JSON is very similar to YAML. It is meant to be a human-readable format that often serves as a configuration format. Both are widely adopted in the Ruby community. However, JSON differs in that it draws its roots from Javascript. In fact, JSON actually stands for Javascript Object Notation. The syntax for JSON is nearly the same as the syntax for defining Javascript objects (which are somewhat analogous to Ruby hashes). Let’s see an example:

{
  "name": "David",
  "height": 124,
  "age": 28,
  "children": {"John": {"age": 1, "height": 10},
             "Adam": {"age": 2, "height": 20},
             "Robert": {"age": 3, "height": 30}},
  "traits": ["smart", "nice", "caring"]
}

That looks really similar to the good old Ruby hash. The only difference seems to be that the key-value relation is expressed by “:” in JSON instead of the => we find in Ruby.

Let’s see exactly what the example looks like in Ruby:

require 'json'
JSON.load File.read("test.json")

{"name"=>"David",
 "height"=>124,
 "age"=>28,
 "children"=>{"John"=>{"age"=>1, "height"=>10},
             "Adam"=>{"age"=>2, "height"=>20},
             "Robert"=>{"age"=>3, "height"=>30}},
 "traits"=>["smart", "nice", "caring"]}

We can add a set of methods to the Person class developed earlier, making it JSON-serializable:

require 'json'

class Person
  ...
  def to_json
    JSON.dump ({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  def self.from_json(string)
    data = JSON.load string
    self.new(data['name'], data['age'], data['gender'])
  end
  ...
end

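The underlying code is nearly the same, except that the methods use JSON instead of YAML! The one difference is in from_json: JSON has no symbol type, so the keys come back as strings and we index with 'name' rather than :name.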
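
Two details are worth a quick sketch. First, JSON.parse accepts symbolize_names: true if symbol keys are preferred on the way back in. Second, Ruby's JSON generator passes an argument to to_json when an object sits inside an array or hash, so accepting *args lets a Person be embedded in a larger payload. NestablePerson and its to_h helper are hypothetical names used only for this illustration.

```ruby
require 'json'

# Hypothetical variant of Person for illustration.
class NestablePerson
  attr_reader :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end

  def to_h
    { "name" => @name, "age" => @age, "gender" => @gender }
  end

  # The generator calls to_json with a state argument for nested objects,
  # so accept and forward whatever it passes.
  def to_json(*args)
    to_h.to_json(*args)
  end

  def self.from_json(string)
    data = JSON.parse(string, symbolize_names: true) # keys become symbols
    new(data[:name], data[:age], data[:gender])
  end
end

people = [NestablePerson.new("David", 28, "male"),
          NestablePerson.new("Ann", 31, "female")]

text = JSON.generate(people) # each element is serialized via to_json
puts text

copy = NestablePerson.from_json(people.first.to_json)
puts "Name #{copy.name}"
```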

What sets JSON apart from the rest is its similarity to Ruby and Javascript syntax. It takes some mental energy to switch between YAML and Ruby when writing code. There is no such problem with JSON, since the syntax is nearly identical to that of Ruby. In addition, many modern browsers have a Javascript implementation of JSON by default, making it the lingua franca of AJAX communication.

On the other hand, parsing YAML in Javascript requires an extra library, and YAML simply does not have much of a following in the Javascript community. If your primary objective for a serialization method is to communicate with Javascript, look at JSON first.

MessagePack

So far, we haven’t paid much attention to how much space a serialized object consumes. It turns out that small serialized size is a very important characteristic, especially for systems that require low latency and high throughput. This is where MessagePack steps in.

Unlike JSON and YAML, MessagePack is not meant to be human readable! It is a binary format, which means that it represents its information as arbitrary bytes, not necessarily bytes that represent the alphabet. The benefit of doing so is that its serializations often take up significantly less space than their YAML and JSON counterparts. Although this does rule out MessagePack as a configuration file format, it makes it very attractive to those building fast, distributed systems.
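
Size claims are easy to check for your own data. The sketch below measures the serialized size of one sample hash in each format; the sample data is made up, and the MessagePack measurement only runs if the msgpack gem happens to be installed. The numbers will differ from payload to payload, so treat this as a way to measure rather than as a benchmark.

```ruby
require 'json'
require 'yaml'

# Hypothetical sample payload.
data = {
  "name" => "David",
  "age" => 28,
  "traits" => ["smart", "nice", "caring"]
}

sizes = {
  "JSON" => JSON.dump(data).bytesize,
  "YAML" => YAML.dump(data).bytesize
}

begin
  require 'msgpack'
  sizes["MessagePack"] = MessagePack.pack(data).bytesize
rescue LoadError
  # msgpack gem is not installed, so only the text formats are measured
end

sizes.each do |name, size|
  puts "#{name}: #{size} bytes"
end
```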

Let’s see how to use it with Ruby. Unlike YAML and JSON, MessagePack does not come bundled with Ruby (yet!). So, let’s get ourselves a copy:

gem install msgpack

We can mess around with it a bit:

require 'msgpack'
msg = {:height => 47, :width => 32, :depth => 16}.to_msgpack

#prints out mumbo-jumbo
p msg

obj = MessagePack.unpack(msg)
p obj

First, create a standard Ruby hash and call to_msgpack on it. This returns the MessagePack serialized version of the hash. Then, unserialize the serialized hash with MessagePack.unpack. We get the original contents back, with one difference: the symbol keys come back as strings, because MessagePack serializes symbols as strings by default. Of course, we can use our good old converter methods (notice the similar API):

class Person
  ...
  def to_msgpack
    MessagePack.dump ({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  def self.from_msgpack(string)
    data = MessagePack.load string
    self.new(data['name'], data['age'], data['gender'])
  end
  ...
end

Okay, so MessagePack should be used when we feel the need for speed, JSON for when we need to communicate with Javascript, and YAML is for configuration files. But, you’re usually not going to be sure of which one to pick when you start a large project, so how do we keep our options open?

Modularizing with Mixins

Ruby is a dynamic language with some pretty awesome metaprogramming features. Let’s use them to make sure that we don’t pigeonhole ourselves into an approach we might later regret. First of all, notice that the Person serialization/unserialization methods created earlier seem awfully similar.

Let’s turn that into a mixin:

require 'json'

#mixin
module BasicSerializable

  #should point to a module that responds to dump and load;
  #change it (e.g. to MessagePack, JSON, YAML) to get a different
  #serialization
  @@serializer = JSON

  def serialize
    obj = {}
    instance_variables.each do |var|
      obj[var] = instance_variable_get(var)
    end

    @@serializer.dump obj
  end

  def unserialize(string)
    obj = @@serializer.load(string)
    obj.keys.each do |key|
      instance_variable_set(key, obj[key])
    end
  end
end

First of all, notice that @@serializer is set to the serializing module. This means that we can change our serialization method in one place, as long as our serializable classes include this module and the new serializer responds to dump and load (JSON, YAML, and MessagePack all do).

Taking a closer look at the code, both methods work by walking the object’s instance variables. In the serialize method:

def serialize
  obj = {}
  instance_variables.each do |var|
    obj[var] = instance_variable_get(var)
  end

  @@serializer.dump obj
end

It loops over the instance_variables and constructs a Ruby hash of the variable names and their values. Then, it simply uses the @@serializer to dump out the hash. If the serializing mechanism does not have a dump method, we can wrap it in a small adapter module that provides one!

We use a similar approach with the unserialize method:

def unserialize(string)
  obj = @@serializer.load(string)
  obj.keys.each do |key|
    instance_variable_set(key, obj[key])
  end
end

Here, use the serializer to get a Ruby hash out of the string and set the object’s instance variables to the values of the hash.
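
If a serializer does not expose the method names the mixin expects, one option is a thin adapter. The sketch below assumes a hypothetical PrettyJSON module that offers dump and load on top of JSON.pretty_generate and JSON.parse; AdapterSerializable and Point are likewise illustrative names, and the mixin body is a copy of the idea above rather than the article's exact code.

```ruby
require 'json'

# Hypothetical adapter: gives pretty-printed JSON the dump/load interface.
module PrettyJSON
  def self.dump(obj)
    JSON.pretty_generate(obj)
  end

  def self.load(string)
    JSON.parse(string)
  end
end

# Copy of the mixin idea, pointed at the adapter instead of JSON itself.
module AdapterSerializable
  @@serializer = PrettyJSON

  def serialize
    obj = {}
    instance_variables.each do |var|
      obj[var] = instance_variable_get(var)
    end
    @@serializer.dump(obj)
  end

  def unserialize(string)
    @@serializer.load(string).each do |key, value|
      instance_variable_set(key, value)
    end
  end
end

class Point
  include AdapterSerializable
  attr_accessor :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end

text = Point.new(1, 2).serialize
puts text # multi-line JSON with "@x" and "@y" keys

copy = Point.new(0, 0)
copy.unserialize(text)
p [copy.x, copy.y]
```

The mixin never needs to know which library sits behind the adapter; it only relies on the two method names.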

This makes our Person class really easy to implement:

class Person
  include BasicSerializable

  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end
end

Notice, we’re just adding the include BasicSerializable line! Let’s test it out:

p = Person.new "David", 28, "male"
p p.serialize

p.unserialize(p.serialize)
puts "Name #{p.name}"
puts "Age #{p.age}"
puts "Gender #{p.gender}"
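
To see the payoff of keeping the format in one place, here is a sketch that runs the same object through two formats. It departs from the code above in one way that should be read as an assumption: the serializer is held in a module-level accessor instead of a class variable, so it can be swapped at runtime. SwappableSerializable and Contact are hypothetical names.

```ruby
require 'json'
require 'yaml'

# Hypothetical variation of the mixin with a swappable serializer.
module SwappableSerializable
  class << self
    attr_accessor :serializer
  end
  self.serializer = JSON

  def serialize
    obj = {}
    instance_variables.each do |var|
      obj[var.to_s] = instance_variable_get(var) # string keys in every format
    end
    SwappableSerializable.serializer.dump(obj)
  end

  def unserialize(string)
    SwappableSerializable.serializer.load(string).each do |key, value|
      instance_variable_set(key, value)
    end
  end
end

class Contact
  include SwappableSerializable
  attr_accessor :name, :age

  def initialize(name, age)
    @name = name
    @age = age
  end
end

results = {}
{ "JSON" => JSON, "YAML" => YAML }.each do |label, backend|
  SwappableSerializable.serializer = backend
  text = Contact.new("David", 28).serialize

  copy = Contact.new("", 0)
  copy.unserialize(text)
  results[label] = [copy.name, copy.age]
  puts "#{label}: #{copy.name}, #{copy.age}"
end
```

Any object that responds to dump and load can be dropped into the hash of backends, including MessagePack once the gem is installed.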

Now, if you comb through the code carefully (or just understand the underlying concepts), you might notice that the BasicSerializable methods work very well for objects that only have serializable instance variables (i.e. integers, strings, floats, etc. or arrays and hashes of them). However, the approach breaks down for an object that holds other BasicSerializable objects in its instance variables, since those nested objects are not serialized through their own serialize method.

The easy way to fix this problem is to override the serialize and unserialize methods in such classes, like so:

class People
  include BasicSerializable

  attr_accessor :persons

  def initialize
    @persons = []
  end

  def serialize
    obj = @persons.map do |person|
      person.serialize
    end

    @@serializer.dump obj
  end

  def unserialize(string)
    obj = @@serializer.load string
    @persons = []
    obj.each do |person_string|
      person = Person.new "", 0, ""
      person.unserialize(person_string)
      @persons << person
    end
  end

  def <<(person)
    @persons << person
  end
end
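
A short round trip shows the override in action. To keep the sketch self-contained it repeats condensed copies of the mixin, Person, and People; it uses load for reading (an assumption of this sketch, chosen because JSON, YAML, and MessagePack all provide it), and the sample people are made up.

```ruby
require 'json'

module BasicSerializable
  @@serializer = JSON

  def serialize
    obj = {}
    instance_variables.each { |var| obj[var] = instance_variable_get(var) }
    @@serializer.dump(obj)
  end

  def unserialize(string)
    @@serializer.load(string).each do |key, value|
      instance_variable_set(key, value)
    end
  end
end

class Person
  include BasicSerializable
  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end
end

class People
  include BasicSerializable
  attr_accessor :persons

  def initialize
    @persons = []
  end

  # Serialize each person first, then serialize the list of strings.
  def serialize
    @@serializer.dump(@persons.map(&:serialize))
  end

  def unserialize(string)
    @persons = @@serializer.load(string).map do |person_string|
      person = Person.new("", 0, "")
      person.unserialize(person_string)
      person
    end
  end

  def <<(person)
    @persons << person
  end
end

people = People.new
people << Person.new("David", 28, "male")
people << Person.new("John", 1, "male")

text = people.serialize
restored = People.new
restored.unserialize(text)
restored.persons.each { |person| puts "#{person.name} (#{person.age})" }
```

Note that each person is serialized to a string before the list is serialized, so the inner documents end up escaped inside the outer one. That keeps the override simple at the cost of a somewhat larger payload.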

Finishing up

Serialization is a pretty important topic that often goes overlooked. Choosing the right serialization method can make your life much easier when optimization time comes around. Along with our coverage of serialization methods, the modular approach (it may need to be modified for particular applications) can help you change your decision at a later date.

  • Karol JG

    YAML can (de)serialize BigDecimal, JSON cannot.

  • jokeyrhyme

    Had a look at TOML yet? https://github.com/toml-lang/toml

    • dhaivatpandya

      It definitely looks very promising. YAML has some nonintuitive portions of its syntax that TOML seems to fix.