2. Syntax

2.1. Terminology

a=b+c;

2.2. Backus Naur Form (BNF)

<syntactic category> ::= a string of terminals and nonterminals

2.2.1. BNF Examples

<primitive-type> ::= boolean
<primitive-type> ::= char
<primitive-type> ::= boolean | char | byte | short | int | long | float | ...
<argument-list> ::= <expression> | <argument-list> , <expression>
<selection-statement> ::=
  if ( <expression> ) <statement> |
  if ( <expression> ) <statement> else <statement> |
  switch ( <expression> ) <block>
<method-declaration> ::=
  <modifiers> <type-specifier> <method declarator> <throws-clause> <method-body> |
  <modifiers> <type-specifier> <method-declarator> <method-body> |
  <type-specifier> <method-declarator> <throws-clause> <method-body> |
  <type-specifier> <method-declarator> <method-body>

2.2.2. Extended BNF (EBNF)

  1. item? or [item] means the item is optional.
  2. item* or {item} means zero or more occurrences of an item are allowable.
  3. item+ means one or more occurrences of an item are allowable.
  4. Parentheses may be used for grouping

2.3. Context-Free Grammars

G = (N,T,P,S)

where

  • N is a set of symbols called nonterminals or syntactic categories.
  • T is a set of symbols called terminals or tokens.
  • P is a set of productions of the form n \rightarrow \alpha where n \in N and \alpha \in (N \cup T)^*
  • S \in N is a special nonterminal called the start symbol of the grammar.

2.3.1. The Infix Expression Grammar

A context-free grammar for infix expressions can be specified as G=(N,T,P,E) where


N = \{E,T,F\}
T = \{identifier,number,+,-,*,/,(,)\}
P is defined by the set of productions

E \rightarrow E~+~T \mid E~-~T \mid T
T \rightarrow T~*~F \mid T~/~F \mid F
F \rightarrow (~E~) \mid identifier \mid number

2.4. Derivations

2.4.1. A Derivation

& \underline{E} \Rightarrow  \underline{E} + T \Rightarrow \underline{T} + T \Rightarrow \underline{F} + T \Rightarrow (\underline{ E }) + T \Rightarrow ( \underline{T} ) + T \Rightarrow ( \underline{T} * F ) + T  \\
& \Rightarrow ( \underline{F} * F ) + T  \Rightarrow ( 5 * \underline{F} ) + T \Rightarrow ( 5 * x ) + \underline{T} \Rightarrow ( 5 * x ) + \underline{F} \Rightarrow ( 5 * x ) + y

Practice 2.1

Construct a derivation for the infix expression 4+(a-b)*x.

You can check your answer(s) at the end of the chapter.

2.4.2. Types of Derivations

& E  \Rightarrow E + T  \Rightarrow E + F  \Rightarrow E + y  \Rightarrow T + y  \Rightarrow F + y  \Rightarrow ( E ) + y  \Rightarrow ( T ) + y
\Rightarrow ( T * F) + y  \\
&\Rightarrow ( T * x ) + y  \Rightarrow ( F * x ) + y  \Rightarrow ( 5 * x ) + y

Practice 2.2

Construct a right-most derivation for the expression x*y+z.

You can check your answer(s) at the end of the chapter.

2.4.3. Prefix Expressions

2.4.4. The Prefix Expression Grammar

A context-free grammar for prefix expressions can be specified as G=(N,T,P,E) where


N = \{E\}
T = \{identifier,number,+,-,*,/\}
P is defined by the set of productions

E \rightarrow +~E~E \mid -~E~E \mid *~E~E \mid /~E~E \mid identifier \mid number

Practice 2.3

Construct a left-most derivation for the prefix expression +4*-a b x.

You can check your answer(s) at the end of the chapter.

2.5. Parse Trees

_images/parsetree.png

Fig 2.1 A Parse Tree

Practice 2.4

What does the parse tree look like for the right-most derivation of (5*x)+y?

You can check your answer(s) at the end of the chapter.

Practice 2.5

Construct a parse tree for the infix expression 4+(a-b)*x.

HINT: What has higher precedence, “+” or “*”? The given grammar automatically makes “*” have higher precedence. Try it the other way and see why!

You can check your answer(s) at the end of the chapter.

Practice 2.6

Construct a parse tree for the prefix expression +4*-a b x.

You can check your answer(s) at the end of the chapter.

2.6. Abstract Syntax Trees

_images/abstree.png

Fig 2.2 An AST

Practice 2.7

Construct an abstract syntax tree for the expression 4+(a-b)*x.

You can check your answer(s) at the end of the chapter.

2.7. Lexical Analysis

2.7.1. The Language of Regular Expressions

The language of regular expression is defined by a context-free grammar. The context-free grammar for regular expressions is RE=(N,T,P,E) where


N = \{ E, T, K, F \}
T = \{character, *, +, ., (, ) \}
P is defined by the set of productions

E \rightarrow E + T \mid T
T \rightarrow T . K \mid K
K \rightarrow F * \mid F
F \rightarrow character \mid (~E~)
letter.letter* + digit.digit* + ‘+’ + ‘-‘ + ‘*’ + ‘/’ + ‘(‘ + ‘)’

Practice 2.8

Define a regular expression so that negative and non-negative integers can both be specified as tokens of the infix expression language.

You can check your answer(s) at the end of the chapter.

2.7.2. Finite State Machines

Formally a finite state automata is defined as follows.

M=(\Sigma, S, F, s_0, \delta)
where \Sigma (pronounced sigma) is the input alphabet (the characters understood by the machine), S is a set of states, F is a subset of S usually written as F \subseteq S, s_0 is a special state called the start state, and \delta (pronounced delta) is a function that takes as input an alphabet symbol and a state and returns a new state. This is usually written as \delta : \Sigma \times S \rightarrow S.
_images/lexerfsm.png

Fig 2.3 A Finite State Machine

2.7.3. Lexer Generators

2.8. Parsing

_images/parser.png

Fig 2.4 Parser Data Flow

2.9. Top-Down Parsers

2.9.1. An LL(1) Grammar

The grammar for prefix expressions is LL(1). Examine the prefix expression grammar G=(N,T,P,E) where


N = \{E\}
T = \{identifier,number,+,-,*,/\}
P is defined by the set of productions

E \rightarrow +~E~E \mid -~E~E \mid *~E~E \mid /~E~E \mid identifier \mid number

2.9.2. A Non-LL(1) Grammar

Some grammars are not LL(1). The grammar for infix expressions is not LL(1). Examine the infix expression grammar G=(N,T,P,E) where


N = \{E,T,F\}
T = \{identifier,number,+,-,*,/,(,)\}
P is defined by the set of productions

E \rightarrow E~+~T \mid E~-~T \mid T
T \rightarrow T~*~F \mid T~/~F \mid F
F \rightarrow (~E~) \mid identifier \mid number

E  \Rightarrow T  \Rightarrow T * F \Rightarrow F * F  \Rightarrow 5 * F  \Rightarrow 5 * 4

2.9.3. An LL(1) Infix Expression Grammar

The following grammar is an LL(1) grammar for infix expressions. G=(N,T,P,E) where


N = \{E,RestE, T, RestT, F\}
T = \{identifier,number,+,-,*,/,(,)\}
P is defined by the set of productions

E \rightarrow T~RestE
RestE \rightarrow +~T~RestE \mid -~T~RestE \mid \epsilon
T \rightarrow F~RestT
RestT \rightarrow *~F~RestT \mid /~F~RestT \mid \epsilon
F \rightarrow (~E~) \mid identifier \mid number

Practice 2.9

Construct a left-most derivation for the infix expression 4+(a-b)*x using the grammar in An LL(1) Infix Expression Grammar, proving that this infix expression is in L(G) for the given grammar.

You can check your answer(s) at the end of the chapter.

2.10. Bottom-Up Parsers

_images/generator.png

Fig 2.5 Parser Generator Data Flow

2.10.1. PDA

_images/pda.png

Fig 2.6 A Pushdown Automaton Stack

2.10.2. Parsing an Infix Expression

Consider the grammar for infix expressions as G=(N,T,P,E) where


N = \{E,T,F\}
T = \{identifier,number,+,-,*,/,(,)\}
P is defined by the set of productions

(1)~E \rightarrow E~+~T
(2)~E \rightarrow T
(3)~T \rightarrow T~*~F
(4)~T \rightarrow F
(5)~F \rightarrow number
(6)~F \rightarrow ( E )

E  \Rightarrow E + T  \Rightarrow E + F  \Rightarrow E + 3  \Rightarrow T + 3  \Rightarrow T * F + 3 \Rightarrow T * 4 + 3 \Rightarrow F * 4 + 3 \Rightarrow 5 * 4 + 3

Practice 2.10

For each step in PDA, is there a shift or reduce operation being performed? If it is a reduce operation, then what production is being reduced? If it is a shift operation, what token is being shifted onto the stack?

You can check your answer(s) at the end of the chapter.

Practice 2.11

Consider the expression (6+5)*4. What are the contents of the pushdown automaton’s stack as the expression is parsed using a bottom-up parser? Show the stack after each shift and each reduce operation.

You can check your answer(s) at the end of the chapter.

2.11. Ambiguity in Grammars

if a<b then
  if b<c then
    writeln("a<c")
else
  writeln("?")
<statement> ::= if <expression> then <statement> else <statement>
              | if <expression> then <statement>
              | writeln ( <expression> )

2.12. Other Forms of Grammars

2.13. Limitations of Syntactic Definitions

2.14. Chapter Summary

2.15. Review Questions

  1. What does the word syntax refer to? How does it differ from semantics?
  2. What is a token?
  3. What is a nonterminal?
  4. What does BNF stand for? What is its purpose?
  5. What kind of derivation does a top-down parser construct?
  6. What is another name for a top-down parser?
  7. What does the abstract syntax tree for 3*(4+5) look like for infix expressions?
  8. What is the prefix equivalent of the infix expression 3*(4+5)? What does the prefix expression’s abstract syntax tree look like?
  9. What is the difference between lex and yacc?
  10. Why aren’t all context-free grammars good for top-down parsing?
  11. What kind of machine is needed to implement a bottom-up parser?
  12. What is a context-sensitive issue in a language? Give an example in Java.
  13. What do the terms shift and reduce apply to?

2.16. Exercises

  1. Rewrite the BNF in BNF Examples using EBNF.
  2. Given the grammar in The Infix Expression Grammar, derive the sentence 3*(4+5) using a right-most derivation.
  3. Draw a parse tree for the sentence 3*(4+5).
  4. Describe how you might evaluate the abstract syntax tree of an expression to get a result? Write out your algorithm in English that describes how this might be done.
  5. Write a regular expression to describe identifier tokens which must start with a letter and then can be followed by any number of letters, digits, or underscores.
  6. Draw a finite state machine that would accept identifier tokens as specified in the previous exercise.
  7. For the expression 3*(4+5) show the sequence of shift and reduce operations using the grammar in Parsing an Infix Expression. Be sure to say what is shifted and which rule is being used to reduce at each step. See the solution to practice problem 2.11 for the proper way to write the solution to this problem.
  8. Construct a left-most derivation of 3*(4+5) using the grammar in An LL(1) Infix Expression Grammar.

2.17. Solutions to Practice Problems

These are solutions to the practice problems. You should only consult these answers after you have tried each of them for yourself first. Practice problems are meant to help reinforce the material you have just read so make use of them.

2.17.1. Solution to Practice Problem 2.1

This is a left-most derivation of the expression. There are other derivations that would be correct as well.

& E \Rightarrow E + T \Rightarrow T + T \Rightarrow F + T \Rightarrow 4 + T \Rightarrow 4 + T * F \Rightarrow 4 + F * F \Rightarrow 4 + ( E ) * F \\
& \Rightarrow 4 + ( E - T ) * F \Rightarrow 4 + ( T - T ) * F \Rightarrow 4 + ( F - T ) * F \Rightarrow 4 + ( a - T ) * F \Rightarrow \\
& 4 + ( a - F ) * F \Rightarrow 4 + ( a - b ) * F \Rightarrow 4 + ( a - b ) * x \\

2.17.2. Solution to Practice Problem 2.2

This is a right-most derivation of the expression x*y+z. There is only one correct right-most derivation.

E \Rightarrow E + T \Rightarrow E + F \Rightarrow E + z \Rightarrow T + z \Rightarrow T * F + z \Rightarrow T * y + z \Rightarrow F * y + z \Rightarrow x * y + z

2.17.3. Solution to Practice Problem 2.3

This is a left-most derivation of the expression +4*-a b x.

& E \Rightarrow + E E \Rightarrow + 4 E \Rightarrow + 4 * E E \Rightarrow + 4 * - E E E  \Rightarrow + 4 * - a E E \Rightarrow + 4 * - a b E \Rightarrow + 4 * - a b x

2.17.4. Solution to Practice Problem 2.4

Exactly like the parse tree for any other derivation of (5*x)+y. There is only one parse tree for the expression given this grammar.

2.17.5. Solution to Practice Problem 2.5

_images/parsetree2ex.png

Fig 2.7 The parse tree for practice problem 2.5

2.17.6. Solution to Practice Problem 2.6

_images/parsetreeprefix.png

Fig 2.8 The parse tree for practice problem 2.6

2.17.7. Solution to Practice Problem 2.7

_images/abstreeex.png

Fig 2.9 The abstract syntax tree for practice problem 2.7

2.17.8. Solution to Practice Problem 2.8

In order to define both negative and positive numbers, we can use the choice operator.

letter.letter* + digit.digit* + ‘-‘.digit.digit* ‘+’ + ‘-‘ + ‘*’ + ‘/’ + ‘(‘ + ‘)’

2.17.9. Solution to Practice Problem 2.9

& E \Rightarrow T~RestE \Rightarrow F~RestT~RestE \Rightarrow 4~RestT~RestE \Rightarrow 4~RestE \Rightarrow \\
& 4 + T~RestE \Rightarrow 4 + F~RestT~RestE \Rightarrow 4 + ( E )~RestT~RestE \Rightarrow 4 + ( T~RestE ) RestT~RestE \\
& \Rightarrow 4 + ( F~RestT~RestE )~RestT~RestE \Rightarrow 4 + ( a~RestT~RestE ) RestT~RestE~\Rightarrow \\
& 4 + ( a~RestE )~RestT~RestE  \Rightarrow 4 + ( a - T~RestE )~RestT~RestE \Rightarrow \\
& 4 + ( a - F~RestE )~RestT~RestE \Rightarrow 4 + ( a - b~RestE ) \Rightarrow 4 + ( a - b )~RestT~RestE \\
& \Rightarrow 4 + ( a - b ) * F~RestT~RestE \Rightarrow 4 + ( a - b ) * x~RestT~RestE \Rightarrow 4 + ( a - b ) * x~RestE \\
& \Rightarrow 4 + ( a - b ) * x

2.17.10. Solution to Practice Problem 2.10

In the parsing of 5*4+3 the following shift and reduce operations: step A initial condition, step B shift, step C reduce by rule 5, step D reduce by rule 4, step E shift, step F shift, step G reduce by rule 5, step H reduce by rule 3, step I reduce by rule 2, step J shift, step K shift, step L reduce by rule 5, step M reduce by rule 4, step N reduce by rule 1, step O finished parsing with dot on right side and E on top of stack so pop and complete with success.

2.17.11. Solution to Practice Problem 2.11

To complete this problem it is best to do a right-most derivation of (6+5)*4 first. Once that derivation is complete, you go through the derivation backwards. The difference in each step of the derivation tells you whether you shift or reduce. Here is the result.

& E \Rightarrow T \Rightarrow T * F \Rightarrow T * 4 \Rightarrow F * 4 \Rightarrow ( E ) * 4 \Rightarrow ( E + T ) * 4 \Rightarrow ( E + F ) * 4 \Rightarrow ( E + 5 ) * 4 \\
&\Rightarrow ( T + 5 ) * 4 \Rightarrow ( F + 5 ) * 4 \Rightarrow ( 6 + 5 ) * 4

We get the following operations from this. Stack contents have the top on the right up to the dot. Everything after the dot has not been read yet. We shift when we must move through the tokens to get to the next place we are reducing. Each step in the reverse derivation provides the reduce operations. Since there are seven tokens there should be seven shift operations.

  1. Initially: . ( 6 + 5 ) * 4
  2. Shift: ( . 6 + 5 ) * 4
  3. Shift: ( 6 . + 5 ) * 4
  4. Reduce by rule 5: ( F . + 5 ) * 4
  5. Reduce by rule 4: ( T . + 5 ) * 4
  6. Reduce by rule 2: ( E . + 5 ) * 4
  7. Shift: ( E + . 5 ) * 4
  8. Shift: ( E + 5 . ) * 4
  9. Reduce by rule 5: ( E + F . ) * 4
  10. Reduce by rule 4: ( E + T . ) * 4
  11. Shift: ( E + T ) . * 4
  12. Reduce by rule 1: ( E ) . * 4
  13. Reduce by rule 6: F . * 4
  14. Reduce by rule 4: T . * 4
  15. Shift: T * . 4
  16. Shift: T * 4 .
  17. Reduce by rule 5: T * F .
  18. Reduce by rule 3: T .
  19. Reduce by rule 2: E .