1 min read

Parsing Expressions

by Vihan Bhargava
  • Compilers
  • Grammars

So you’re writing a parser? Expressions are probably one of the more difficult items you’ll come upon to parse but hopefully this article will explain how to handle them.

So you’re writing a parser? Expressions are probably one of the more difficult items you’ll come upon to parse but hopefully this article will explain how to handle them.


So here’s the grammar from my The grammar to end all grammars! post:

EEα;EβEα(;E;)γσpre;EEβσin;Eσpost;ϵ\begin{array}{rl} E \rightarrow & E_\alpha;E_\beta \\ E_\alpha \rightarrow & (;E;) \\ & \gamma \\ & \sigma_{pre};E \\ E_\beta \rightarrow& \sigma_{in};E\\ & \sigma_{post};\\ & \epsilon \end{array}

σ\sigma represents an expression “item”. What is an expression item? I’ll explain


So an expression item is basically what goes inside the expression, this can be any of your literals, for example I might define γ\gamma as:

N{E}ΣγσSσ\begin{aligned} N &\in \lbrace{E\rbrace} \\ \Sigma &\in \gamma \cup \sigma \\ S &\in \sigma \end{aligned}

This would work for a pretty standard expression allowing for basic literals but what if you want each item to be allowed to have certain suffixes/prefixes? What do I mean by a “suffix” or “prefix”? I mean:

foo + bar? + baz

Where ? could be applied upon any expression item:

foo + (bar + baz)?

Now this may seem completely unnecessary for your parser but realize these are what properties are:

foo.bar.baz

The variable is “foo” but it’s suffix is .baz.bar. Having . as an operator would make no sense as;

  1. . is not an operator
  2. . does not allow entire expressions between them, 1 + 1 . 1 + 1 makes no sense.

Here’s a typical C-like expression item written out:

γ=Terminal literalsσ={apre=Prefix operatorsain=Infix operatorsapost=Postfix operators\begin{aligned} \gamma &= \text{Terminal literals} \\ \sigma &= \begin{cases} \begin{array}{ll} a_{pre} &= \text{Prefix operators} \\ a_{in} &= \text{Infix operators} \\ a_{post} &= \text{Postfix operators} \end{array} \end{cases} \\ \end{aligned}

Where ρβ\rho_{\beta} is the suffix and ρα\rho_{\alpha} is the root. In this case you’d want to remove the (;E;)(;E;) from the EαE_\alpha rule


So hopefully that made sense and cleared things up. You’ll want to allow whitespace between all the above tokens listed. (You usually want to allow whitespace between all tokens in a non-terminal rule).