Ruby - - By David Bush

Lexers, Parsers, and ASTs, OH MY!: How Ruby Executes

We take for granted that ruby my_app.rb will execute our code. But how does it work? What’s happening under the hood? Let’s walk through the internals and find out.

Lexer

The first port of call is the lexer. The lexer reads your code and breaks it up into ‘tokens’ (for this reason, you may hear Tokenizer and Lexer used interchangeably). What’s a token you ask? Great question! Let’s take a look at an example:

5.times do |x|
  puts x
end

When you run ruby my_app.rb, Ruby opens the file and reads in each character sequentially. The first is an integer (5), so the lexer knows that the type it is dealing is an integer object. The 0 is also an integer (interestingly, . also counts as an integer because it could be denoting a floating point) but once it hits the t, the lexer recognizes the numerical part of this process is over, backtracks, and assigns type tINTEGER to the token.

After the ., the lexer encounters times. This is classified as an identifier which is a technical way of denoting a class, variable, or method name. Next, it encounters do which is a reserved word as it is a keyword in the language. How does the lexer differentiate between a reserved word and an identifier? Well, Ruby itself must be aware of what is a keyword and what isn’t, so internally it keeps a list of reserved words. Here is an example – you can see this imported atop the parsing file we’ll dig into in a moment.

In reality, I am slightly simplifying things. The lexing and parsing process has been streamlined in modern Ruby to be more efficient. What actually happens is that the lexing function is called as part of the parsing process. This is done in a grammar rules file called parse.y – 11,000 lines of complex logic and low level wrangling of your source file. See this 600+ line case statement? That’s your lexer.

Enough theory. Lets get to something practical! Since Ruby 1.9, Ruby ships with an API to interface with its Lexer. It’s called Ripper, and it’s an interesting exercise. So, let’s try it. Fire up the excellent Pry or irb and follow along:

[1] pry(main)> require 'Ripper'
=> true
[2] pry(main)> to_lex = <<STR
[2] pry(main)* 5.times do |n|
[2] pry(main)* puts n end
[2] pry(main)* STR
=> "5.times do |n|\nputs n end\n"
[3] pry(main)> pp Ripper.lex(to_lex)
[[[1, 0], :on_int, "5"],
 [[1, 2], :on_period, "."],
 [[1, 3], :on_ident, "times"],
 [[1, 8], :on_sp, " "],
 [[1, 9], :on_kw, "do"],
 [[1, 11], :on_sp, " "],
 [[1, 12], :on_op, "|"],
 [[1, 13], :on_ident, "n"],
 [[1, 14], :on_op, "|"],
 [[1, 15], :on_ignored_nl, "\n"],
 [[2, 0], :on_ident, "puts"],
 [[2, 4], :on_sp, " "],
 [[2, 5], :on_ident, "n"],
 [[2, 6], :on_sp, " "],
 [[2, 7], :on_kw, "end"],
 [[2, 10], :on_nl, "\n"]]
=> [[[1, 0], :on_int, "5"],
 [[1, 2], :on_period, "."],
 [[1, 3], :on_ident, "times"],
 [[1, 8], :on_sp, " "],
 [[1, 9], :on_kw, "do"],
 [[1, 11], :on_sp, " "],
 [[1, 12], :on_op, "|"],
 [[1, 13], :on_ident, "n"],
 [[1, 14], :on_op, "|"],
 [[1, 15], :on_ignored_nl, "\n"],
 [[2, 0], :on_ident, "puts"],
 [[2, 4], :on_sp, " "],
 [[2, 5], :on_ident, "n"],
 [[2, 6], :on_sp, " "],
 [[2, 7], :on_kw, "end"],
 [[2, 10], :on_nl, "\n"]]

Straightforward enough – we require Ripper, pass in some source as a heredoc and call Ripper’s to_lex method. The output is our tokenizer in action; the numbers on the left are in the format [line, position] and the prefix :on_ denotes the type of token. Let’s look at a similar example:

[5] pry(main)> pp Ripper.lex(to_lex)
[[[1, 0], :on_int, "5"],
 [[1, 2], :on_period, "."],
 [[1, 3], :on_ident, "times"],
 [[1, 8], :on_sp, " "],
 [[1, 9], :on_kw, "do"],
 [[1, 11], :on_ignored_nl, "\n"],
 [[2, 0], :on_ident, "puts"],
 [[2, 4], :on_sp, " "],
 [[2, 5], :on_ident, "n"],
 [[2, 6], :on_sp, " "],
 [[2, 7], :on_kw, "end"],
 [[2, 10], :on_nl, "\n"]]
=> [[[1, 0], :on_int, "5"],
 [[1, 2], :on_period, "."],
 [[1, 3], :on_ident, "times"],
 [[1, 8], :on_sp, " "],
 [[1, 9], :on_kw, "do"],
 [[1, 11], :on_ignored_nl, "\n"],
 [[2, 0], :on_ident, "puts"],
 [[2, 4], :on_sp, " "],
 [[2, 5], :on_ident, "n"],
 [[2, 6], :on_sp, " "],
 [[2, 7], :on_kw, "end"],
 [[2, 10], :on_nl, "\n"]]

In this example, I’ve removed the block (|x|) from the heredoc so it reads:

"5.times do\nputs n end\n"

This would be a syntax error within the bounds of the Ruby runtime. The lexer doesn’t care though, because its job is simply to tokenize the input as syntax checks are enforced at a different level. Enter: The Parser

The Parser

We have our tokens, but we still need to execute them. How do we do that? Well, we need something to understand them, and this is the job of a Parser Generator. A Parser Generator takes a series of defined rules (not unlike grammar in language) and checks the input for validity against this. Ruby uses a Parser Generator called Bison and the rules against which the language is defined can be found in the previously mentioned parse.y. This is literally where Ruby as a language is defined.

Parsing Algorithms

A Parsing Algorithm is the process used to parse the tokens. There are many different types, but the one we will be concerning ourselves with is the Look Ahead Lefr-to-right Rightmost Deviation (or LALR for short). My word count constraint won’t permit me to go into the depths of compiler engineering, but here are some resources if you’d like to better acquaint yourself with some of the terminology.

In a nutshell, this is what the parser is doing:

  • Moves from left to right processing the tokens
  • Takes the current token and pushes it onto a stack
  • Checks the grammar rules for a match; adds matching rules to a table of states and removes rules that are no longer possible
  • Pushes the next token onto the stack and applies the same process
  • Performs a process similar to reduce the stack if it’s applicable
  • Looks ahead to check that the next token is compatible with the remaining rules in the state table. If it is, then it continues, and if not then this portion of the parsing is complete and the process is restarted. This is the LR portion of LALR.
  • It repeats this process until the reduced content of the stack matches on a single grammatical rule (or errors out because your syntax is wrong!)

The parser records both a list of applied states and a list of potential states to which it can move. In other words, the parser is essentially just an admittedly complex Finite State Machine which Bison is maintaining for you. If, for some reason, you would like to see Ruby’s state transitions, you can run your program with the -y flag, which prints out debug information:

Starting parse
Entering state 0
Reducing stack by rule 1 (line 939):
lex_state: EXPR_NONE -> EXPR_BEG at line 940
-> $$ = nterm $@1 ()
Stack now 0
Entering state 2
lex_state: EXPR_BEG -> EXPR_END at line 7386
Reading a token: Next token is token tINTEGER ()
Shifting token tINTEGER ()
Entering state 41
...

Hopefully some of this output makes sense now: tINTEGER is the token type the lexer assigned to the token, states are handled by the parser and Reducing and Shifting are the parser traversing the token stream.

What is the result of the parsing? An Abstract Syntax Tree

Abstract Syntax Trees

Lisp) users will be on familiar ground here. The Abstract Syntax Tree (AST) is comprised of a tree of ‘Symbolic Expressions’ (sexpressions or sexprs for short) and it exists as the final stage before our Ruby program is converted to bytecode. The AST lends itself well to this, because the possible paths through the program are all mapped out which makes generating bytecode a derivative excerise. It is at this point in the compilation progress that our program is checked for correctness semantic analysis), and if you were writing a transpiler to change your Ruby code into a different target language (or a templating engine for your Rack app perhaps), this is where you would hook in.

Let’s take a look at our code represented as an AST:

[1] pry(main)> require 'Ripper'
=> true
[2] pry(main)> to_ast = <<STR
[2] pry(main)* 5.times do |n|
[2] pry(main)* puts n end
[2] pry(main)* STR
=> "5.times do |n|\nputs n end\n"
[3] pry(main)> pp Ripper.sexp(to_ast)

This will yield:

[:program,
 [[:method_add_block,
   [:call, [:@int, "5", [1, 0]], :".", [:@ident, "times", [1, 3]]],
   [:do_block, nil, [[:command, [:@ident, "puts", [2, 0]], [:args_add_block, [[:vcall, [:@ident, "n", [2, 5]]]], false]]]]]]]

You can also retrieve this information is by running ruby --dump parsetree your_program.rb.

Similar to before, the numbers are the line/character numbers. This is how your stack trace can tell you exactly where in your source that you made a mistake. ident is of course short for identifier and is the reference to a method or variable’s identifier which we covered in the Lexer section.

From this AST, we can now run our code! Ruby will traverse the AST, converting each node into instructions for its virtual machine (Yet Another Ruby VM, or YARV). There are some slight optimizations at this point, like custom instructions for commonly used operations such as addition. How YARV works is an entire post in itselfe. Suffice to say, Ruby’s pleasing syntax has been changed into code our computers understand and runs as written. The build process really isn’t as complicated as it may initially appear. It’s simply a pipeline that takes some input, applies a set of rules and returns an object – in this case, executable code that we can take for granted as working.

Sponsors