Projects That Are Making Blazing Fast Ruby a Reality

Ruby is still not big in the scientific computing or artificial intelligence communities. However, that could soon change as there are two major projects on the horizon that could give Ruby a pick-up in 2016.

OMR

The OMR project is an initiative, soon to be open source, to develop reusable components from the IBM J9 virtual machine. These components can be used to build any language runtime, including Ruby. The hope is that this will lower the barrier to entry for implementing languages.

Here are some things the project plans to bring:

JIT compiler
Enterprise-grade garbage collector
Method profiler

Right now speedups are modest, in the 1x to 3x range, because IBM is focusing on compatibility rather than performance to ease adoption. They claim to already be able to run Rails.

IBM has posted a technology preview of their OMR Ruby implementation to GitHub.

Ruby+Truffle+Graal

While OMR looks very promising, the real star could be an experimental JRuby backend being worked on by Oracle Labs. It implements Ruby with the Truffle AST framework and the Graal JIT VM. The claimed performance gains are already phenomenal – up to around 30x MRI in non-synthetic benchmarks. The results of this project could make us rethink what is possible with dynamic languages, and it may open Ruby to the worlds of scientific computing and artificial intelligence.

Aside from performance, another goal of Truffle+Graal is high interoperability between implemented languages. These currently include R, JavaScript, and, of course, Ruby. This will make it easier to use the best library for the job rather than settling for whatever is available in the primary language.

As for maturity, according to Chris Seaton, the project currently “passes 93% of the RubySpec language specs and 90% of the core library specs”.

Truffle

Truffle is a Java framework for writing self-optimizing AST (Abstract Syntax Tree) interpreters. ASTs are data structures that contain a tree representation of the source code, generated by the parser. Typically with interpreters, the AST is used to generate bytecode that runs on a virtual machine. However, Truffle instead uses the AST nodes to directly control Graal’s emission of machine code.
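To make the idea of a self-optimizing AST interpreter more concrete, here is a toy sketch in plain Ruby. It is not Truffle's API (Truffle is a Java framework); the class names `LiteralNode` and `AddNode` and the `state` attribute are invented for illustration. The node records which operand types it has seen and switches to a generic path only when its guard fails.

```ruby
# Toy self-specializing AST nodes. Hypothetical names, not Truffle's API.
class LiteralNode
  def initialize(value)
    @value = value
  end

  def execute
    @value
  end
end

class AddNode
  attr_reader :state

  def initialize(left, right)
    @left = left
    @right = right
    @state = :uninitialized # no operand types observed yet
  end

  def execute
    a = @left.execute
    b = @right.execute
    both_integers = a.is_a?(Integer) && b.is_a?(Integer)

    case @state
    when :uninitialized
      # First execution: specialize on the operand types actually seen.
      @state = both_integers ? :integer : :generic
    when :integer
      # Guard failed: give up the specialization permanently.
      @state = :generic unless both_integers
    end

    # A real implementation would run a cheaper integer-only path when
    # @state is :integer; here both paths use the same Ruby operator.
    a + b
  end
end

int_tree = AddNode.new(LiteralNode.new(1),
                       AddNode.new(LiteralNode.new(2), LiteralNode.new(3)))
int_result = int_tree.execute

str_tree = AddNode.new(LiteralNode.new("a"), LiteralNode.new("b"))
str_result = str_tree.execute
```

After execution, `int_tree.state` is `:integer` and `str_tree.state` is `:generic`, which is the kind of run-time type feedback a compiler can then exploit.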

When implementing a language using this new approach, we are mainly concerned with the Truffle “layer”. This is why you may see references to “Truffle languages”.

Graal

Graal is a new JIT compiler implementation in the OpenJDK JVM. Graal is written in Java, so it exposes a Java API to the running program. This allows the language to directly control the compiler and thus go from AST -> machine code with no bytecode generation step.

Method Calls

One of the reasons Ruby+Truffle is so fast is that it removes method calls for common operations. In Ruby, just about everything is a method call, including 1+1, which evaluates to 1.+(1), and attribute assignment such as obj.foo=1, which evaluates to obj.foo=(1). These calls can also involve allocating objects.
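The "operators are method calls" point is easy to check in any Ruby interpreter. The snippet below is a small illustration; the `Point` class is a made-up example. Note that attribute assignment dispatches to a setter method, whereas assigning to a plain local variable does not.

```ruby
# Operators are ordinary method calls on the receiver.
sum_infix    = 1 + 1
sum_explicit = 1.+(1)
sum_send     = 1.public_send(:+, 1)

# Attribute assignment dispatches to a setter method named "x=".
class Point # hypothetical example class
  attr_accessor :x
end

point = Point.new
point.x = 5                # sugar for point.x=(5)
first_x = point.x
point.public_send(:x=, 7)  # the same setter, called explicitly
second_x = point.x

puts [sum_infix, sum_explicit, sum_send].inspect
puts [first_x, second_x].inspect
```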

Ruby+Truffle does away with these method calls and allocations altogether by inlining the operations instead, enabling the backend to be much faster. Allocations that turn out to be unnecessary are removed by a technique known as partial escape analysis. The end result is that Ruby+Truffle produces artisanal machine code that looks as if it were hand-crafted by a demoscene wizard.

Aggressive Caching

You’ve probably heard of Ruby’s “method cache” and how it is often invalidated. Ruby method lookup generally involves 2 cache levels:

Global method cache – associates a function pointer with a class and method name combination.
Inline cache – caches the function pointer at the call site to avoid hitting the global cache.

Ruby+Truffle adds a new level of caching:

Argument cache – generalized inline cache that includes arguments.
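The inline cache idea can be sketched in a few lines of Ruby. This is a toy model only, not how MRI or Ruby+Truffle implement it: `CallSite` is a hypothetical class, it ignores singleton classes and cache invalidation, and it does not model the argument cache. It remembers the method found for the last receiver class and skips the lookup when the same class shows up again.

```ruby
# Toy monomorphic inline cache. Hypothetical class for illustration only.
class CallSite
  attr_reader :hits, :misses

  def initialize(method_name)
    @method_name = method_name
    @cached_class = nil
    @cached_method = nil
    @hits = 0
    @misses = 0
  end

  def call(receiver, *args)
    klass = receiver.class
    if klass.equal?(@cached_class)
      @hits += 1 # same receiver class as last time: reuse the lookup
    else
      @misses += 1 # new receiver class: do the slow lookup and cache it
      @cached_class = klass
      @cached_method = klass.instance_method(@method_name)
    end
    @cached_method.bind(receiver).call(*args)
  end
end

site = CallSite.new(:+)
results = [site.call(1, 2), site.call(3, 4), site.call("a", "b")]
```

In this run the first and third calls miss (a new receiver class each time) and the second call hits, so a call site that keeps seeing the same class pays for the lookup only once.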

Optimized Data Structures

When you create an array like [1,2,3] in MRI, the object is represented in C as an array of VALUE pointers. This means each array element doesn’t have to be the same type, but it has to be boxed and unboxed which takes time.

The Ruby+Truffle backend is smart enough to recognize when it can store collections in an unboxed manner. It does this through a system of specializations. It even has specializations for empty collections which don’t allocate storage at all.
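The specialization idea can be sketched as a collection that tracks a storage strategy. This is a simplified illustration, not the actual Ruby+Truffle code: `StrategyArray` and its strategy names are invented here, and a plain Ruby array stands in for what would be a primitive integer array in a real implementation.

```ruby
# Toy storage-strategy array. Hypothetical class and strategy names.
class StrategyArray
  attr_reader :strategy

  def initialize
    @strategy = :empty # empty collections allocate no backing store
    @store = nil
  end

  def <<(value)
    wanted = value.is_a?(Integer) ? :integer : :object
    if @strategy == :empty
      @strategy = wanted
      @store = [] # stand-in for an unboxed int[] when strategy is :integer
    elsif @strategy == :integer && wanted == :object
      @strategy = :object # one non-integer forces generic (boxed) storage
    end
    @store << value
    self
  end

  def to_a
    @store ? @store.dup : []
  end
end

list = StrategyArray.new
before = list.strategy
list << 1 << 2
ints_only = list.strategy
list << "three"
mixed = list.strategy
```

The strategy moves from `:empty` to `:integer` to `:object` as the contents become less uniform, and never moves back in this sketch.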

For the first time, it might become possible to perform intense math in Ruby without depending on C extensions. However, users of C extensions are expected to see a significant speed boost as well.

GPU Backends

A couple of GPU backends have been taking shape in the Graal repository.

PTX (Parallel Thread Execution) – Enables general-purpose computing on CUDA (NVIDIA) hardware
HSAIL (Heterogeneous System Architecture Intermediate Language) – AMD’s solution for integrating CPUs and GPUs

If these mature, it could create an environment where almost any language can be used for advanced machine learning tasks – and at a speed that is very close to native.

Trying it Out

JRuby currently has the experimental Truffle/Graal backend available on GitHub. Here are some instructions on how to load it with popular Ruby version managers:

rbenv

$ rbenv install jruby-master+graal-dev
$ rbenv shell jruby-master+graal-dev
$ ruby -X+T -e 'puts Truffle.graal?'
true

RVM

$ rvm mount -r http://lafo.ssw.uni-linz.ac.at/graalvm/jruby-dist-master+graal-macosx-x86_64-bin.tar.gz -n jruby-dev-graal
$ rvm use jruby-dev-graal
$ ruby -X+T -e 'puts Truffle.graal?'
true

When using the Graal backend, you should get something like:

$ ruby --version
jruby 9.0.5.0-SNAPSHOT (2.2.3) 2015-12-22 36276d3 OpenJDK 64-Bit Server VM 25.40-b25-internal-graal-0.7 on 1.8.0-internal-b128 +jit [darwin-x86_64]

Keep in mind that these are development releases. I ran into some installation problems the first time I tried the RVM route, but it worked a few days later.

If you decide to run some benchmarks, keep in mind that you will still need to deal with the JVM’s boot time and warm-up time. So far, it seems tricky to reproduce the kind of results the project’s team claims. With Truffle enabled (-X+T), I typically got MRI-level performance or worse, even after many runs, for a variety of simple cases.

If you can get the 30x+ gains for any code, please let us know in the comments.

Note: to help deal with the warm-up issue, the developers use benchmark-ips.
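To show what accounting for warm-up means in practice, here is a standard-library-only sketch of the idea: run the block for a while and discard those iterations, then count iterations per second over a measured window. The helper name `measure_ips` and its default durations are assumptions made for this example; the benchmark-ips gem does this with more care and statistical reporting.

```ruby
# Minimal warm-up-aware timing loop using only the standard library.
# measure_ips is a hypothetical helper, not part of benchmark-ips.
def measure_ips(warmup_seconds: 0.1, measure_seconds: 0.2)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  # Warm-up phase: run the block but throw the timings away, giving a
  # JIT compiler time to compile the hot code.
  deadline = clock.call + warmup_seconds
  yield while clock.call < deadline

  # Measurement phase: count how many iterations fit in the window.
  iterations = 0
  start = clock.call
  deadline = start + measure_seconds
  while clock.call < deadline
    yield
    iterations += 1
  end
  iterations / (clock.call - start)
end

ips = measure_ips { (1..1_000).reduce(:+) }
puts format("%.1f iterations/second", ips)
```

On a JIT-compiled runtime the warm-up window usually needs to be much longer than the fraction of a second used here.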

Conclusion

These projects do not only affect Ruby. Any implemented language will see the speed benefit provided by these approaches. They still have a ways to go, but they have been progressing nicely, so there’s a possibility we will see them in general use sometime this year.