Creating a Recursive-Descent Parser

A Grammar, G, is a structure <N,T,P,S> where N is a set of non-terminals, T is a set of terminals, P is a set of productions, and S is a special non-terminal called the start symbol of the grammar. Grammars are used to formally specify the syntax of a language. For example, consider the language of calculator expressions where we can add, subtract, multiply, and divide. We can also store a value in a single memory location and recall the value in the memory location. For example, (5S+4)* R EOF is a sentence in this grammar. When evaluated, this expression would yield the value 45. A grammar that specifies the syntax of this language is:

G=<N,T,P,Prog> where N={Prog,Expr,Term,Storable,Factor}, T={number,+,-,*,/,S,R,(,),EOF} the start symbol is Prog, and the set of productions is given by these rules

Prog -> Expr EOF

Expr -> Expr + Term | Expr - Term | Term

Term -> Term * Storable | Term / Storable | Storable

Storable -> Factor S | Factor

Factor -> number | R | ( Expr )

Using this grammar we can check to see that the example expression given above is valid by constructing a derivation of the sentence. A derivation starts with the start symbol and replace one non-terminal at each step to generate the sentence. For instance, a derivation that generates the example above is:

Prog => Expr EOF =>Term EOF =>Term * Storable EOF => Storable * Storable EOF =>Factor * Storable EOF => ( Expr ) * Storable =>EOF ( Expr + Term ) * Storable EOF => ( Term + Term ) * Storable EOF => ( Storable + Term ) * Storable EOF => ( Factor S + Term ) * Storable EOF => ( 5 S + Term ) * Storable EOF => ( 5 S + Storable ) * Storable EOF => ( 5 S + Factor ) * Storable EOF => ( 5 S + Factor ) * Storable EOF => ( 5 S + 4 ) * Storable EOF => ( 5 S + 4 ) * Factor EOF => ( 5 S + 4 ) * R EOF

This derivation proves that (5S+4)*R EOF is a syntactically valid expression in the language of the grammar. The grammar presented above is what is called an LALR(1) grammar. This kind of grammar can be used by a program called a parser generator that given the grammar, will automatically generate a program called a parser. A parser is a program that given a sentence in its language, will construct a derivation of that sentence to check it for syntactic correctness. The derivation can also be expressed in tree form, called a parse tree. There may be many different derivations for a sentence in a language, but only one parse tree (if the grammar is unambiguous). Parse trees are outside the scope of this document. Take CS67 if you want to learn about parse trees.

Although parsers can be generated by parser generators, it is still sometimes convenient to write a parser by hand. However, LALR(1) grammars are not easy to use to manually construct parsers. Instead, we want an LL(1) grammar if we are going to manually construct a parser. An LL(1) grammar can be used to construct a top-down or recursive descent parser where an LALR(1) grammar is typically used to construct a bottom-up parser. A top-down parser constructs (or at least traverses) the parse tree starting at the root of the tree and proceeding downward. A bottom-up parser constructs or traverses the parse tree in a bottom-up fashion. Again, for more details, take CS67.

In a recursive descent parser, each non-terminal in the grammar becomes a function in a program. The right hand side of the productions become the bodies of the functions. An LALR(1) grammar is not appropriate for constructing a recursive descent parser. To create a recursive-descent parser (the topic of this page) we must convert the LALR(1) grammar above to an LL(1) grammar. Typically, there are two steps involved.

Eliminate left recursion
Perform left factorization where appropriate

Eliminate Left Recursion

Eliminating left recursion means eliminating rules like Expr Expr + Term. Rules like this are left recursive because the Expr function would first call the Expr function in a recursive descent parser. Without a base case first, we are stuck in infinite recursion (a bad thing). To eliminate left recursion we look to see what Expr can be rewritten as. In this case, Expr can be only be replaced by a Term so we replace Expr with Term in the productions. The usual way to eliminate left recursion is to introduce a new non-terminal to handle all but the first part of the production. So we get

Expr -> Term RestExpr

RestExpr -> + Term RestExpr | - Term RestExpr | <null>

We must also eliminate left recursion in the Term Term * Factor | Term / Factor productions in the same way. We end up with an LL(1) grammar that looks like this:

Prog -> Expr EOF

Expr -> Term RestExpr

RestExpr -> + Term RestExpr | - Term RestExpr | <null>

Term -> Storable RestTerm

RestTerm -> * Storable RestTerm | / Storable RestTerm | <null>

Storable -> Factor S | Factor

Factor -> number | R | ( Expr )

Perform Left Factorization

Left factorization isn't needed on this grammar so this step is skipped. Left factorization is needed when the first part of two or more productions is the same and the rest of the similar productions are different. Left factorization is important in languages like Prolog because without it the parser is inefficient. However, it isn't needed and does not improve efficiency when writing a parser in an imperative language like Java, for instance.

Translating the LL(1) Grammar to Java

Typically, a parser returns an abstract syntax tree (or expression tree) representing the expression or sentence being parsed. An abstract syntax is defined by a grammar that is likely ambiguous. The ambiguity is not a problem for abstract syntax since this grammar will not be used for parsing.

Once you have an LL(1) grammar you use it to build a parser as follows. The following construction causes the parser to return an abstract syntax tree or expression tree for the sentence being parsed.

Construct a function for each non-terminal. Each of these functions should return a node in the abstract syntax tree.
Depending on your grammar, some non-terminal functions may require an input parameter of an abstract syntax tree (ast) to be able to complete a partial ast that is recognized by the non-terminal function.
Each non-terminal function should call a function to get the next token as needed. If the next token is not needed, the non-terminal function should call another function to put back the token. Your parser (if it is based on an LL(1) grammar, should never have to get or put back more than one token at a time.
The body of each non-terminal function should be be a series of if statements that choose which production right-handside to expand depending on the value of the next token.

The construction above is very simple, but can be confusing without an example. Consider the LL(1) grammar given above.

Assume that you have a binary tree class in Java called BTNode that can hold an Object and that operators in the expression are Characters and numbers in the expression are Doubles. Also, assume there are two functions that get the next Token in the input (getToken) and push the last token back on the input (pushBackToken). Then, we can write the first part of the parser as follows:

 private BTNode Prog() {
   BTNode tmp = Expr();

   if (!getNextToken().toString().equals("$")) 
      throw new IllegalStateException("Error in expression");

   return tmp;
 }

 private BTNode Expr() {        
   return RestExpr(Term());
 }
    
 private BTNode RestExpr(BTNode E1) {
   Object on = getNextToken();

   if (on.toString().equals("+") || on.toString().equals("-"))
     return RestExpr(new BTNode(on,E1,Term()));
        
   pushBackToken();
        
   return E1;
 }

The code given here throws an exception if an error is discovered during parsing. You should of course take appropriate action during error conditions. The BTNode constructor for the tree above takes a value for the new node and references to the left and right subtrees. Calling the function Prog() returns a pointer to an abstract syntax tree representing the expression.

Kent Lee

Last modified: Mon Apr 28 21:18:05 CDT