Packrat Parsing: a Practical LinearTime Algorithm with Backtracking
Bryan Ford
Master's Thesis
Massachusetts Institute of Technology
Abstract
Packrat parsing is a novel and practical method
for implementing lineartime parsers
for grammars defined in TopDown Parsing Language (TDPL).
While TDPL was originally created as a formal model
for topdown parsers with backtracking capability,
this thesis extends TDPL
into a powerful generalpurpose notation for describing language syntax,
providing a compelling alternative
to traditional contextfree grammars (CFGs).
Common syntactic idioms
that cannot be represented concisely in a CFG
are easily expressed in TDPL,
such as longestmatch disambiguation and "syntactic predicates,"
making it possible
to describe the complete lexical and grammatical syntax
of a practical programming language
in a single TDPL grammar.
Packrat parsing is an adaptation of a 30yearold tabular parsing algorithm
that was never put into practice until now.
A packrat parser can recognize
any string defined by a TDPL grammar in linear time,
providing the power and flexibility
of a backtracking recursive descent parser
without the attendant risk of exponential parse time.
A packrat parser can recognize any LL(k) or LR(k) language,
as well as many languages requiring unlimited lookahead
that cannot be parsed by shift/reduce parsers.
Packrat parsing also provides better composition properties than LL/LR parsing,
making it more suitable for dynamic or extensible languages.
The primary disadvantage of packrat parsing is its storage cost,
which is a constant multiple of the total input size
rather than being proportional to the nesting depth
of the syntactic constructs appearing in the input.
Monadic combinators and lazy evaluation
enable elegant and direct implementations of packrat parsers
in recent functional programming languages such as Haskell.
Three different packrat parsers for the Java language
are presented here,
demonstrating the construction of packrat parsers in Haskell
using primitive pattern matching, using monadic combinators,
and by automatic generation from a declarative parser specification.
The prototype packrat parser generator developed for the third case
itself uses a packrat parser to read its parser specifications,
and supports full TDPL notation
extended with "semantic predicates,"
allowing parsing decisions to depend on the semantic values
of other syntactic entities.
Experimental results show that all of these packrat parsers
run reliably in linear time,
efficiently support "scannerless" parsing
with integrated lexical analysis,
and provide the userfriendly errorhandling facilities
necessary in practical applications.
Full Thesis:
PDF
PS
Pappy: a Parser Generator for Haskell
The full source code for Pappy,
the prototype packrat parser generator described in the thesis,
is available for browsing in this directory,
or for downloading as a gzipped tar file.
A brief breakdown of the source files follows:
 Pos.hs:
Library module for keeping track of line/column position
in a text file.
 Parse.hs:
Library of support functions and monadic combinators
for use in constructing packrat parsers.
Inspired by Daan Leijen's
Parsec library,
which was designed for traditional predictive parsers
and mostlypredictive parsers with specialcase backtracking.
 ReadGrammar.hs:
Monadic packrat parser for Pappy parser specifications.
 ReduceGrammar.hs:
Grammar reduction module,
which rewrites leftrecursive rules
and repetition operators ('*' and '+')
into primitive rightrecursive form.
 SimplifyGrammar.hs:
Grammar simplification module,
which optimizes the grammar
and eliminates as many nonterminals as possible.
 MemoAnalysis.hs:
Memoization analysis modle,
which determines the set of nonterminals
to be memoized by the packrat parser.
 WriteParser.hs:
Code generation module.
 Main.hs:
Toplevel control module,
which links all the compiler stages together.
Example Arithmetic Expression Parsers
Following are complete versions of the example parsers
for the trivial arithmetic expression language used in the thesis:

ArithRecurse.hs:
Recursive descent parser described in Section 3.1.1,
for the trivial arithmetic expression language of Figure 1.

ArithPackrat.hs:
Equivalent packrat parser for the same trivial language,
Section 3.1.4.

ArithLeft.hs:
Left recursion example for Section 3.2.1,
which extends the above packrat parser with
properly leftassociative subraction, division, and modulo operators.

ArithLex.hs:
Integrated lexical analysis example for Section 3.2.2,
which extends the previous packrat parser
with support for multipledigit decimal literals
and optional whitespace padding
between literals, operators, and punctuation.

ArithMonad.hs:
Example packrat parser, equivalent to ArithLex.hs,
but using monadic combinators to express the parsing functions more succinctly
and provide support for userfriendly error detection and reporting.
Discussed in Section 3.2.3 and 3.2.4 of the thesis.
The following two library modules from Pappy are required:

Arith.pappy:
Pappy parser specification for a parser
equivalent to ArithLex.hs and ArithMonad.hs.
The resulting automaticallygenerated parser
is available as Arith.hs.
 Pos.hs:
Keeps track of line and column position while scanning input text.
 Parse.hs:
Monadic combinator library for packrat parsers.
Example Java Language Parsers
The three complete and working parsers for the Java language,
which are described in the paper and used
for analysis and comparison purposes,
are available here:
 JavaMonad.hs:
A packrat parser for the Java language
that exclusively uses monadic combinators
to define the parsing functions making up the parser.
Both "safe", constanttime combinators
and "unsafe" combinators with hidden recursion
are used in this parser,
meaning that it is not quite a lineartime parser
although it appears to come pretty close in practice.
 JavaPat.hs:
A version of the above parser
modified to use direct Haskell patternmatching
for some of the performancecritical lexical analysis functions:
whitespace, identifiers, keywords, operators,
and integer, character, and string literals.
The rest of the parser is monadic just as before,
and likewise uses "unsafe" combinators.
 Java.pappy:
Pappy parser specification for the Java language.
The resulting automaticallygenerated parser
is available as Java.hs.
Since Pappy rewrites repetition operators,
this parser uses only constanttime primitives
and therefore should be a strictly lineartime parser 
at least to the extent that memory access is constanttime
(which is not quite the case
in the presence of garbage collection and cache effects and such).
The test suite of Java source files
used to obtain the experimental results in the thesis
are available in this gzipped tar file.
All of these Java source files
were taken from Cryptix version 3.2.0.
Enjoy!
Bryan Ford

Updated Monday, 24Nov2014 16:44:31 EST
