Parsing is…complicated. And yet, we wouldn’t have many facets of the modern world without it. Our ability to get behavior out of a computer is limited by the expressiveness of our programming languages. And the programming languages available to us are limited by our parsing techniques.
The construction of a programming language consists of at least three components:
- lexer
- parser
- interpreter / compiler
The lexer, or scanner as it is often known, is generally a simple routine that understands the input as a stream of characters. The lexer converts the character stream into a token stream that it sends to the parser.
The parser consists of rules that determine whether a stream of tokens is arranged in an acceptable order. In addition, the parser is responsible for assembling a graph of nodes, in many cases an Abstract Syntax Tree (commonly AST), that represent the input in a meaningful way.
The syntax graph generated by the parser is generally converted into bytecode that can run on a virtual machine or compiled into machine code. For a small DSL, it can be handled by a simple runtime implemented in the host language.
Writing these components by hand can be a challenge, particularly for the newcomer. There have been many tools to assist in their creation, initially with lexer and parser generators, and recently with code generators like LLVM.
The Parslet gem provides a DSL for creating programming languages in Ruby. In this article we will be looking at some of the history of parsing, where Parslet fits in, and how you can use it to create your own programming languages.
The source code for this tutorial is available at github.
Time and Space
In 1965, Donald Knuth invented the LR parser. LR is short for Left-to-right, Rightmost derivation. The LR parser can recognize any deterministic Context-Free-Grammar in linear time. Due to maintaining state-transition tables, LR parsers were fast but had inordinate memory requirements (for the 1960s), and were thus dismissed as too impractical.
In 1969, Frank DeRemer proposed simplified versions of the LR parser: Look-Ahead LR (LALR) and Simple LR (SLR). These parsers reduced the memory requirements at the cost of language recognition. The most popular variation became LALR(1), LALR with 1 look-ahead terminal.
In 1970, Stephen C. Johnson created YACC (Yet Another Compiler Compiler) while working on the B language at AT&T. He and Al Aho initially tried implementing an LALR table by hand, but it took 2 days just to get 30 grammar rules and produced quite a few errors, so he decided to invest time into the LALR(1) parser generator that we now know as YACC.
Initially, YACC was rather slow, but after years of work Johnson was able to speed it up by several orders of magnitude and make it the leading tool for parser generation. By the time computers had improved significantly and alternative parsing algorithms were available, YACC and LALR(1) had become entrenched in computer science culture.
Writing a usable grammar for YACC isn’t easy. The gotchas of LALR(1) (namely shift/reduce and reduce/reduce conflicts) present a high barrier of entry for those interested in parsing a non-trivial language. As a result, the YACCs of the world are increasingly competing with alternative approaches, including writing parsers from scratch.
The New Era
Today, YACC lives on in GNU Bison (and other variants) and remains quite popular due to its speed and history. However, the consistent computer science theme of performance taking a back seat to maintainability has been playing itself out.
Initially, the GCC project used Bison to generate parsers for C, C++, and Objective C, but it gradually shifted to hand-written recursive descent parsers in versions 3.4-4.1. Their basis was that LALR(1) parsers required too many hacks to conform to any of the C languages.
The Clang maintainers have decided to stick with a similar direction, with the goal of eventually having a unified recursive descent parser for the C language family and its many dialects.
The ANTLR project generates recursive descent parsers using LL(*) grammars. In v4, ANTLR’s creator Terence Parr decided to focus on grammar flexibility rather than performance, aptly naming the release “honey badger”.
Recursive descent isn’t the only modern answer to parsing demanding languages. With more RAM on the table (no pun intended), and increased clock cycles, there has been renewed interest in some of the more expensive algorithms that have been proposed since the mid 20th century. These include the CYK Algorithm and the Earley parser.
Bison now supports Generalized LR (GLR) in addition to LALR(1). GLR works similarly to LR except that it forks parse stacks upon encountering a conflict. Each parse stack reads tokens until it finishes the input or reaches an invalid state. This enables GLR to avoid shift/reduce and reduce/reduce conflicts at the cost of increased memory consumption.
Parsing Expression Grammar
In order to understand why Parsing Expression Grammar is interesting, it helps to know the alternative. In the 1950s, Noam Chomsky proposed his now-famous Chomsky Hierarchy which consisted of formal grammars, the kinds of languages they can generate, and what kind of machine (automaton) would be needed to recognize the language.
Most parser-generators expect parsers to be described by a Context-Free Grammar, the second most complicated in Chomsky’s hierarchy. The problem with Chomsky’s grammars is that they are all generative. They were designed to express ambiguity in language because Chomsky was most interested in applying them to natural languages. Historically, the ambiguity expressed by CFGs for parser-generators has not meshed well with the parsing algorithms behind them (like LALR(1)).
In 2004, Bryan Ford published a paper on Parsing Expression Grammars. He argued that the role of Context-Free-Grammars as language generators was misaligned with the role of parsers as language recognizers. In the paper, he proposed the alternative, recognition-based, formalism that we now call PEG.
Parsing Expression Grammar (PEG) brings two important innovations:
- lexical analysis and parsing are the same activity (rather than requiring separate tools like lex and yacc)
- ambiguity is replaced with greed
The choice operator in CFGs is unordered (commutative). This has the pitfall of assuming multiple potential parse trees (when there can be only one). However, the choice operator in PEGs is greedy. If the first option is valid, the second is ignored, just like in most programming languages. It should be noted that this is a not a magical fix, and it is quite possible to get yourself into trouble with it.
Unlike CFGs, PEGs are not particularly suited to parsing natural language. However, they are much more suited to parsing programming languages. Whereas a CFG is like a specification for generating strings in a language, a PEG is like a collection of rules for creating a recursive descent parser.
Parslet
Parslet provides a DSL for defining parsers with a Parsing Expression Grammar.
First, install the parslet gem.
gem install parslet
This is what a simple parser in Parslet looks like.
#demo_parser.rb
require 'parslet'
class DemoParser < Parslet::Parser
rule(:identifier) { match('[a-z]').repeat(1) }
rule(:number) { match('[0-9]').repeat(1) }
rule(:space) { match('\s').repeat(0) }
rule(:line) { identifier >> space >> number }
root(:line)
end
demo_parser = DemoParser.new
result = demo_parser.parse("foo 1")
puts result.inspect
Rules
A Parslet parser is established through a set of rules and definition blocks.
rule(name) { definition_block }
The definition block contains the expected syntax for that rule, including: – the order in which elements are expected to arrive. – the number of times any element is expected to appear – any choices between elements
Here is how you would expect one or more digits using #repeat.
rule(:number) { match('[0-9]').repeat(1) }
Rules can be referenced in definition blocks by their names.
rule(:number) { match('[0-9]').repeat(1) }
rule(:zipcode) { number }
Every parser needs a root rule. The root is the first rule that the parser will expect.
rule(:line) { match('[a-zA-Z0-9]').repeat >> '\n' }
rule(:lines) { line.repeat }
root(:lines)
Atoms
Parslet refers to the elements of language syntax as atoms or parslets (interestingly, ‘atom’ is usually used in the documentation).
Atoms come in the following varieties:
- Regular expressions – ex:
match('[a-zA-Z0-9]')
- Strings – ex:
str('foo')
- any character – corresponds to the regexp /./ ex:
any.repeat(0)
The behavior of the parser is driven by relationships between its atoms and rules.
atom1 >> atom2
– atom2 is expected after atom1atom.repeat(x[,y])
– atom is expected x or more times (or up to y times)atom.maybe
– equivalent to atom.repeat(0,1)atom1 | atom2
– either atom is expected, but if atom1 is seen then atom2 is ignored
Although it seems implied, the Parslet documentation doesn’t appear to make clear that atoms are purely terminals (and thus rules, being non-terminals, don’t count as atoms), but it appears that anything you can apply to an atom, you can apply to a rule.
Expect ‘a’ or rule b:
rule(:aorb) { str('a') | b }
Expect a word followed by a colon:
rule(:wordcolon) { match('[a-zA-Z]') >> str(':') }
Tricky Whitespace
It’s very tempting to use #repeat(0)
to make your parser flexible, but care must be taken. When dealing with whitespace, It’s generally a good idea to have rules like these:
rule(:space) { match('\s').repeat(1) }
rule(:space?) { space.maybe }
If you just use one all-encompassing whitespace rule, you are at risk of getting entangled in ambiguity. While playing around with whitespace, I managed to create a parser that hangs.
# tricky_whitespace.rb
# Be careful with #repeat(0)
require 'parslet'
class BadWhitespaceParser < Parslet::Parser
rule(:space) { match('\s').repeat(0) }
rule(:identifier) { match('[a-zA-Z0-9]').repeat(1) }
rule(:line) { identifier | space }
rule(:lines) { (line >> space).repeat(0) }
root(:lines)
end
doc = <<-eof
east
west
eof
parser = BadWhitespaceParser.new
# This will hang. Use ctrl-c to interrupt.
puts parser.parse(doc).inspect
Error Reporting
Parslet provides excellent error reporting if you use it. If you just let Parslet fail, it will tell you what sequence it failed to match. However, if you rescue Parslet::ParseFailed
, you can get a failure tree.
# parse_failed.rb
require 'parslet'
class FailParser < Parslet::Parser
rule(:alpha) { match('[a-zA-Z]').repeat(1) }
rule(:number) { digit.repeat(1) >> (str('.') >> digit.repeat(1)).maybe }
rule(:digit) { match('[0-9]').repeat(1) }
rule(:space) { str(' ') }
rule(:line) { alpha >> space >> number >> space >> alpha }
root(:line)
end
input = "ab 1.2 d1"
parser = FailParser.new
# not so great error reporting
puts "--STANDARD ERROR REPORTING--"
begin
parser.parse(input)
rescue Exception => error
puts error
end
puts ("\n")
# better error reporting
puts "--ASCII TREE ERROR REPORTING--"
begin
parser.parse(input)
rescue Parslet::ParseFailed => error
puts error.cause.ascii_tree
end
This prints out:
--STANDARD ERROR REPORTING--
Failed to match sequence (ALPHA SPACE NUMBER SPACE ALPHA) at line 1 char 8.
--ASCII TREE ERROR REPORTING--
Failed to match sequence (ALPHA SPACE NUMBER SPACE ALPHA) at line 1 char 8.
`- Extra input after last repetition at line 1 char 9.
`- Failed to match [a-zA-Z] at line 1 char 9.
Parslet provides a convenience method so that you don’t need to write out the exception handling every time.
require 'parslet/convenience'
parser.parse_with_debug(input)
Generating a Tree
By default, Parslet just checks the input. It doesn’t generate a syntax tree. The #as
method creates a node on the tree.
rule(:assignment) { left.as(:left) >> str('=') >> right.as(:right) }
Here is a more complete example.
# assignment_parser.rb
require 'parslet'
class AssignmentParser < Parslet::Parser
rule(:identifier) { match('[a-zA-Z0-9_]').repeat(1) }
rule(:value) { match('[0-9]').repeat(1) }
rule(:assignment) {
identifier.as(:left) >>
str('=') >>
value.as(:right) >>
str("\n").maybe
}
rule(:assignments) { assignment.as(:assignment).repeat }
root(:assignments)
end
doc = <<-eof
a=23
b=56
eof
parser = AssignmentParser.new
puts parser.parse(doc).inspect
If you run assignment_parser.rb, instead of getting the input back, you get an array of assignment hashes.
[{:assignment=>{:left=>”a”@0, :right=>”23″@2}}, {:assignment=>{:left=>”b”@5, :right=>”56″@7}}]
Transformation
Parslet provides Parslet::Transform for making sense of syntax graphs produced by Parslet::Parser
. You could use it to make an interpreter or bytecode generator.
Like Parslet::Parser, a Parslet::Transform is made of rules and blocks.
rule(nodes => capture_method) { action_block }
The capture method determines what the rule will match and is one of:
- simple – any object but hashes or arrays
- sequence – hashes or arrays
- subtree – everything
The result of the action block will replace the node pattern specified in the rule. If only the leaf node is specified, only the leaf node will be replaced.
# transform_replace.rb
require 'parslet'
ast = {:document => {:words => "hello world"}}
# replaces the words node
class LeafTransform < Parslet::Transform
rule(:words => simple(:x)) { x.upcase }
end
# replaces document and words nodes
class TreeTransform < Parslet::Transform
rule(:document => {:words => simple(:x)}) { x.upcase }
end
leaf_transform = LeafTransform.new
puts leaf_transform.apply(ast).inspect
tree_transform = TreeTransform.new
puts tree_transform.apply(ast).inspect
Here, LeafTransform just replaces {:words => “hello world”}, while TreeTransform replaces the entire tree.
{:document=>"HELLO WORLD"}
"HELLO WORLD"
Let’s transform the tree produced by the assignment parser (Note: the Parslet “@” decorations are removed).
# assignment_transform.rb
require 'parslet'
tree = [{:assignment=>{:left=>"a", :right=>"23"}}, {:assignment=>{:left=>"b", :right=>"56"}}]
class Assignment
def initialize(left, right)
@left = left
@right = right
end
def generate_code
# spit out bytecode/machinecode associated with assignment operation
end
end
class AssignmentTransform < Parslet::Transform
rule(:left => simple(:left), :right => simple(:right)) do
Assignment.new(left, Integer(right))
end
end
transform = AssignmentTransform.new
puts transform.apply(tree).inspect
The #apply method returns the transformed tree with Assignment objects instead of hashes.
[{:assignment=>#<Assignment:0x000000019df520 @left="a",@right=23>},
{:assignment=>#<Assignment:0x000000019c9018 @left="b", @right=56>}]
For more on Parlset::Transform, check out the documentation.
Conclusion
If you are new to creating programming languages, Parslet is an excellent way to get started. It provides more grammatical flexibility than the typical LALR(1) generator, and it can use your Ruby code. However, Parslet is not the only PEG game in town for Ruby. Here are some others:
Here are some additional resources if you would like to learn more about creating languages with Parslet:
- Parsing TOML with Parslet
- Using Parslet to Write Your Own Language
- Writing DSLs with Parslet – Wicked Good Ruby 2013
Frequently Asked Questions about Parsing with Parslet Gem
What is the Parslet gem and why is it useful?
The Parslet gem is a Ruby library that allows you to parse complex data structures with ease. It’s a powerful tool for developers who need to extract information from text, XML, JSON, or other data formats. The Parslet gem is particularly useful because it provides a simple and intuitive interface for defining grammars and parsing rules. This makes it easier to handle complex parsing tasks, and reduces the amount of code you need to write.
How do I install the Parslet gem?
To install the Parslet gem, you need to have Ruby installed on your system. Once you have Ruby, you can install the Parslet gem by running the command gem install parslet
in your terminal. This will download and install the latest version of the Parslet gem.
How do I define a grammar using the Parslet gem?
Defining a grammar with the Parslet gem involves creating a subclass of Parslet::Parser
and defining rules for each element of your grammar. Each rule is defined as a method that returns a Parslet::Atoms
instance. These instances define how to match and transform the input data.
How do I parse data using the Parslet gem?
Once you have defined your grammar, you can parse data by creating an instance of your parser class and calling the parse
method with your input data. The parse
method will return a tree of Parslet::Slice
instances that represent the parsed data.
What are the common errors I might encounter when using the Parslet gem and how can I handle them?
When using the Parslet gem, you might encounter errors such as Parslet::ParseFailed
if the input data does not match your grammar. You can handle these errors by using the rescue
keyword in your code to catch the exception and handle it appropriately.
How can I transform the parsed data using the Parslet gem?
The Parslet gem provides a Parslet::Transform
class that you can use to transform the parsed data. You define transformation rules as methods that match patterns in the parsed data and return the transformed data.
Can I use the Parslet gem to parse XML or JSON data?
Yes, you can use the Parslet gem to parse XML or JSON data. However, you will need to define a grammar that matches the structure of the XML or JSON data.
How can I debug my grammar or transformation rules in the Parslet gem?
The Parslet gem provides several methods for debugging your grammar or transformation rules. For example, you can use the parse_with_debug
method to get detailed information about the parsing process.
Can I use the Parslet gem with Rails or other Ruby frameworks?
Yes, you can use the Parslet gem with Rails or other Ruby frameworks. You can include the Parslet gem in your Gemfile and use it in your application code just like any other gem.
Are there any resources or tutorials I can use to learn more about the Parslet gem?
Yes, there are several resources and tutorials available online that can help you learn more about the Parslet gem. For example, the official Parslet documentation provides a comprehensive guide to using the gem, and there are several blog posts and tutorials that provide practical examples and tips.
Robert is a voracious reader, Ruby aficionado, and other big words. He is currently looking for interesting projects to work on and can be found at his website.