Choosing the Right Serialization Format
When saving or communicating some kind of information, we often use serialization. Serialization takes a Ruby object and converts it into a string of bytes and vice versa. For example, if you have an object representing information about a user and need to send it over the network, it has to be serialized into a set of bytes that can be pushed over a socket. Then, at the other end, the receiver has to unserialize the object, converting it back into something that Ruby (or another language) can understand.
It turns out that there are lots of ways to serialize Ruby objects. I’ll cover YAML, JSON, and MessagePack in this article, exploring their fortes and weak points and seeing each of them in action with Ruby. At the end, we’ll put together a modular serialization approach using some metaprogramming tricks.
Let’s jump in!
YAML
YAML is a recursive acronym that stands for “YAML Ain’t Markup Language”. It is a serialization format, but it is also (easily) human readable, meaning that it can be used as a configuration language. In fact, Rails uses YAML to do all sorts of configuration, e.g. database connectivity.
Let’s check out an example:
name: "David"
height: 124
age: 28
children:
  "John":
    age: 1
    height: 10
  "Adam":
    age: 2
    height: 20
  "Robert":
    age: 3
    height: 30
traits:
  - smart
  - nice
  - caring
The format of YAML is incredibly easy to understand. The quickest way to make it click is to transform it into a Ruby hash or JavaScript object. We’ll go with the former (saving the above YAML in test.yaml):
require 'yaml'
YAML.load File.read('test.yaml')
Running the above in Pry will give you a nicely formatted result that looks like:
{"name"=>"David",
 "height"=>124,
 "age"=>28,
 "children"=>{"John"=>{"age"=>1, "height"=>10},
              "Adam"=>{"age"=>2, "height"=>20},
              "Robert"=>{"age"=>3, "height"=>30}},
 "traits"=>["smart", "nice", "caring"]}
As you can see, the colons represent “key-value” pairings, and each level of indentation creates a nested hash (YAML requires spaces for indentation; tabs are not allowed). The little hyphens tell YAML that we want a list rather than a hash. This easy translation between YAML and Ruby hashes is one of the primary benefits of YAML.
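The translation also works in the other direction. As a small sketch (the hash contents here are made up for illustration), we can turn a nested Ruby hash into YAML with to_yaml and load it back to confirm that nothing was lost:

```ruby
require 'yaml'

# Illustrative data: a nested hash plus a list, mirroring the example above.
person = {
  "name" => "David",
  "children" => { "John" => { "age" => 1, "height" => 10 } },
  "traits" => ["smart", "nice", "caring"]
}

# Hash -> YAML text. Nested hashes become indented blocks,
# arrays become hyphenated lists.
yaml_text = person.to_yaml
puts yaml_text

# YAML text -> Hash again.
restored = YAML.load(yaml_text)
puts restored == person
```

Printing yaml_text is a quick way to see how any Ruby structure would look as YAML before committing it to a configuration file.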
We can also use YAML to serialize our own objects. Here is a Person class that can convert itself to and from YAML:
require 'yaml'

class Person
  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end

  # Object -> YAML string
  def to_yaml
    YAML.dump({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  # YAML string -> Object
  def self.from_yaml(string)
    data = YAML.load string
    p data # debug: show the parsed hash
    self.new(data[:name], data[:age], data[:gender])
  end
end

person = Person.new "David", 28, "male"
p person.to_yaml
person = Person.from_yaml(person.to_yaml)
puts "Name #{person.name}"
puts "Age #{person.age}"
puts "Gender #{person.gender}"
Let’s break down the code. We have the to_yaml method:
def to_yaml
  YAML.dump({
    :name => @name,
    :age => @age,
    :gender => @gender
  })
end
We are making a Ruby hash and turning it into a YAML string using the YAML module provided by the standard library. To go the other direction and convert a YAML string into a Ruby object:
def self.from_yaml(string)
  data = YAML.load string
  p data # debug: show the parsed hash
  self.new(data[:name], data[:age], data[:gender])
end
Here, we take the string, convert it into a Ruby hash, then pass the contents of the hash to the constructor to build a new instance of Person.
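One caveat when the YAML comes from somewhere you don't control: loading can instantiate Ruby classes, so it is worth being explicit about what is allowed. The following is a minimal sketch using YAML.safe_load, assuming a Ruby recent enough (2.6 or later) to accept the permitted_classes keyword. Our hash uses symbol keys, so Symbol has to be permitted explicitly:

```ruby
require 'yaml'

yaml_text = YAML.dump({ :name => "David", :age => 28 })

# Only plain data types plus the classes we list are allowed.
data = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
puts data[:name]

# Without the permission, the symbol keys are rejected.
rejected = false
begin
  YAML.safe_load(yaml_text)
rescue Psych::DisallowedClass => e
  rejected = true
  puts "Rejected: #{e.message}"
end
```

If the hash used string keys instead of symbols, the plain YAML.safe_load(yaml_text) call would be enough.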
Now, let’s see how YAML compares with the heavyweight from the land of JavaScript.
JSON
In some ways, JSON is very similar to YAML. It is meant to be a human-readable format and often serves as a configuration format. Both are widely adopted in the Ruby community. However, JSON differs in that it draws its roots from JavaScript. In fact, JSON stands for JavaScript Object Notation. The syntax for JSON is nearly the same as the syntax for defining JavaScript objects (which are somewhat analogous to Ruby hashes). Let’s see an example:
{
  "name": "David",
  "height": 124,
  "age": 28,
  "children": {"John": {"age": 1, "height": 10},
               "Adam": {"age": 2, "height": 20},
               "Robert": {"age": 3, "height": 30}},
  "traits": ["smart", "nice", "caring"]
}
That looks really similar to the good old Ruby hash. The only difference seems to be that the key-value relation is expressed by “:” in JSON instead of the => we find in Ruby.
Let’s see exactly what the example looks like in Ruby:
require 'json'
JSON.load File.read("test.json")
# => {"name"=>"David",
#     "height"=>124,
#     "age"=>28,
#     "children"=>{"John"=>{"age"=>1, "height"=>10},
#                  "Adam"=>{"age"=>2, "height"=>20},
#                  "Robert"=>{"age"=>3, "height"=>30}},
#     "traits"=>["smart", "nice", "caring"]}
We can add a set of methods to the Person class developed earlier, making it JSON-serializable:
require 'json'

class Person
  # ...
  def to_json
    JSON.dump({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  def self.from_json(string)
    data = JSON.load string
    self.new(data['name'], data['age'], data['gender'])
  end
  # ...
end
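The underlying code is nearly the same, except that the methods use JSON instead of YAML! The one real difference is in from_json: JSON has no symbol type, so the symbol keys we dumped come back as strings, and we look them up with data['name'] rather than data[:name].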
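The string-versus-symbol detail is easy to trip over, so here is a short sketch of the round trip. It uses JSON.parse with the symbolize_names option from the standard json library; the hash contents are just sample data:

```ruby
require 'json'

original = { :name => "David", :age => 28 }
json_text = JSON.dump(original)

# By default, keys come back as strings.
default_keys = JSON.parse(json_text)

# symbolize_names: true converts the keys back to symbols.
symbol_keys = JSON.parse(json_text, symbolize_names: true)

p default_keys
p symbol_keys
puts symbol_keys == original
```

Which form to prefer is mostly a matter of taste; the important part is to pick one and use the same kind of key when reading the hash.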
What sets JSON apart from the rest is its similarity to Ruby and JavaScript syntax. It takes some mental energy to switch between YAML and Ruby when writing code. There is much less of a problem with JSON, since the syntax is so close to that of a Ruby hash. In addition, modern browsers ship with native JSON support (JSON.parse and JSON.stringify), making it the lingua franca of AJAX communication.
On the other hand, YAML requires an extra library in the browser and simply does not have much of a following in the JavaScript community. If your primary objective for a serialization method is to communicate with JavaScript, look at JSON first.
MessagePack
So far, we haven’t paid much attention to how much space a serialized object consumes. It turns out that small serialized size is a very important characteristic, especially for systems that require low latency and high throughput. This is where MessagePack steps in.
Unlike JSON and YAML, MessagePack is not meant to be human readable! It is a binary format, which means that it represents its information as raw bytes rather than as printable text. The benefit of doing so is that its serializations often take up significantly less space than their YAML and JSON counterparts. Although this does rule out MessagePack as a configuration file format, it makes it very attractive to those building fast, distributed systems.
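To get a feel for where the savings come from, here is a toy, hand-written encoder for a very small subset of the MessagePack format (maps of up to 15 entries, short string keys, integers from 0 to 127). It is only an illustration built on the standard library, not the msgpack gem and not something to use in real code; tiny_msgpack is a made-up name. In this subset, the map header, each string header, and each small integer take a single byte:

```ruby
require 'json'

# Toy encoder for a tiny subset of MessagePack (illustration only).
def tiny_msgpack(hash)
  raise ArgumentError, "too many entries" if hash.size > 15

  bytes = [0x80 | hash.size]               # fixmap header: 1000xxxx
  hash.each do |key, value|
    key = key.to_s
    raise ArgumentError, "key too long" if key.bytesize > 31
    unless value.is_a?(Integer) && (0..127).cover?(value)
      raise ArgumentError, "value out of range"
    end

    bytes << (0xa0 | key.bytesize)         # fixstr header: 101xxxxx
    bytes.concat(key.bytes)                # the key's raw bytes
    bytes << value                         # positive fixint: 0xxxxxxx
  end
  bytes.pack('C*')                         # array of byte values -> binary string
end

dimensions = { "height" => 47, "width" => 32, "depth" => 16 }
packed = tiny_msgpack(dimensions)
json = JSON.generate(dimensions)

puts "MessagePack: #{packed.bytesize} bytes"
puts "JSON:        #{json.bytesize} bytes"
```

JSON spends bytes on quotes, colons, commas, and braces, and writes numbers as decimal text; the binary format folds type and length into a single header byte, which is why the packed version comes out smaller.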
Let’s see how to use it with Ruby. Unlike YAML and JSON, MessagePack does not come bundled with Ruby (yet!). So, let’s get ourselves a copy:
gem install msgpack
We can mess around with it a bit:
require 'msgpack'
msg = {:height => 47, :width => 32, :depth => 16}.to_msgpack
# prints out mumbo-jumbo (the raw binary string)
p msg
obj = MessagePack.unpack(msg)
# the keys come back as strings
p obj
First, we create a standard Ruby hash and call to_msgpack on it. This returns the MessagePack-serialized version of the hash. Then, we unserialize the serialized hash with MessagePack.unpack. We get the original data back, except that the symbol keys come back as strings, which is why the converter looks up data['name'] rather than data[:name]. Of course, we can use our good old converter methods (notice the similar API):
class Person
  # ...
  def to_msgpack
    MessagePack.dump({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  def self.from_msgpack(string)
    data = MessagePack.load string
    self.new(data['name'], data['age'], data['gender'])
  end
  # ...
end
Okay, so MessagePack should be used when we feel the need for speed, JSON when we need to communicate with JavaScript, and YAML for configuration files. But you’re usually not going to be sure which one to pick when you start a large project, so how do we keep our options open?
Modularizing with Mixins
Ruby is a dynamic language with some pretty awesome metaprogramming features. Let’s use them to make sure that we don’t pigeonhole ourselves into an approach we might later regret. First of all, notice that the Person serialization/unserialization methods created earlier seem awfully similar.
Let’s turn that into a mixin:
require 'json'

# mixin
module BasicSerializable
  # Should point to a serializer that responds to dump and load;
  # change it (e.g. to MessagePack, JSON, or YAML) to get a
  # different serialization.
  @@serializer = JSON

  def serialize
    obj = {}
    instance_variables.each do |var|
      obj[var] = instance_variable_get(var)
    end
    @@serializer.dump obj
  end

  def unserialize(string)
    # load (unlike parse) works with JSON, YAML, and MessagePack alike
    obj = @@serializer.load(string)
    obj.each do |key, value|
      instance_variable_set(key, value)
    end
  end
end
First of all, notice that @@serializer is set to the serializing class or module. This means that we can change our serialization method in one place, and every class that includes this module picks up the change.
Taking a closer look at the code, it walks through the object's instance variables to serialize and unserialize it. In the serialize method:
def serialize
  obj = {}
  instance_variables.each do |var|
    obj[var] = instance_variable_get(var)
  end
  @@serializer.dump obj
end
It loops over instance_variables and constructs a Ruby hash of the variable names and their values. Then, it simply uses @@serializer to dump out the hash. If the serializing mechanism does not have a dump method, we can wrap it (or subclass it, if it is a class) to give it that method!
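As a rough sketch of that idea, here is a small adapter module that puts a dump method and a matching method for reading the string back in front of another API. PrettyJSON is a hypothetical name chosen for this example; it wraps JSON.pretty_generate and JSON.parse from the standard library:

```ruby
require 'json'

# Hypothetical adapter: gives pretty-printed JSON a dump/load pair.
module PrettyJSON
  # Hash (or array) -> indented JSON text
  def self.dump(obj)
    JSON.pretty_generate(obj)
  end

  # JSON text -> Ruby hash (keys come back as strings)
  def self.load(string)
    JSON.parse(string)
  end
end

text = PrettyJSON.dump({ "@name" => "David", "@age" => 28 })
puts text

restored = PrettyJSON.load(text)
p restored
```

Because the adapter is just an object that responds to the expected methods, the mixin does not need to know which library is doing the work behind it.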
The unserialize method takes a similar approach:
def unserialize(string)
  # load (unlike parse) works with JSON, YAML, and MessagePack alike
  obj = @@serializer.load(string)
  obj.each do |key, value|
    instance_variable_set(key, value)
  end
end
Here, we use the serializer to get a Ruby hash out of the string and set the object’s instance variables to the values of the hash.
This makes our Person class really easy to implement:
class Person
  include BasicSerializable
  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end
end
Notice, we’re just adding the include BasicSerializable line! Let’s test it out:
person = Person.new "David", 28, "male"
p person.serialize
person.unserialize(person.serialize)
puts "Name #{person.name}"
puts "Age #{person.age}"
puts "Gender #{person.gender}"
Now, if you comb through the code carefully (or just understand the underlying concepts), you might notice that the BasicSerializable methods work very well for objects that only have directly serializable instance variables (i.e. integers, strings, floats, etc. or arrays and hashes of them). However, they will fail for an object that holds other BasicSerializable objects in its instance variables.
The easy way to fix this problem is to override the serialize and unserialize methods in such classes, like so:
class People
  include BasicSerializable
  attr_accessor :persons

  def initialize
    @persons = []
  end

  # Serialize each person on its own, then dump the list of strings.
  def serialize
    obj = @persons.map do |person|
      person.serialize
    end
    @@serializer.dump obj
  end

  # load (unlike parse) works with JSON, YAML, and MessagePack alike
  def unserialize(string)
    obj = @@serializer.load string
    @persons = []
    obj.each do |person_string|
      person = Person.new "", 0, ""
      person.unserialize(person_string)
      @persons << person
    end
  end

  def <<(person)
    @persons << person
  end
end
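To check that the nested round trip behaves as intended, here is a self-contained sketch. It repeats trimmed-down versions of the mixin and both classes so that it can run by itself, and it assumes JSON as the serializer with a dump/load pair; the sample people are made up:

```ruby
require 'json'

# Trimmed-down copy of the mixin so this sketch runs on its own.
module BasicSerializable
  @@serializer = JSON

  def serialize
    obj = {}
    instance_variables.each { |var| obj[var] = instance_variable_get(var) }
    @@serializer.dump(obj)
  end

  def unserialize(string)
    @@serializer.load(string).each do |key, value|
      instance_variable_set(key, value)
    end
  end
end

class Person
  include BasicSerializable
  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name, @age, @gender = name, age, gender
  end
end

class People
  include BasicSerializable
  attr_accessor :persons

  def initialize
    @persons = []
  end

  # Serialize each person separately, then dump the list of strings.
  def serialize
    @@serializer.dump(@persons.map(&:serialize))
  end

  # Rebuild each person from its own serialized string.
  def unserialize(string)
    @persons = @@serializer.load(string).map do |person_string|
      person = Person.new("", 0, "")
      person.unserialize(person_string)
      person
    end
  end

  def <<(person)
    @persons << person
  end
end

group = People.new
group << Person.new("David", 28, "male")
group << Person.new("Anna", 31, "female")

restored = People.new
restored.unserialize(group.serialize)
restored.persons.each { |person| puts "#{person.name} (#{person.age})" }
```

Note that each person is serialized to a string first and the list of strings is serialized again, so the inner data ends up double-encoded. That keeps the override short, at the cost of some extra size.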
Finishing up
Serialization is a pretty important topic that often gets overlooked. Choosing the right serialization method can make your life much easier when optimization time comes around. Along with our coverage of serialization methods, the modular approach (it may need to be modified for particular applications) can help you change your decision at a later date.