curve

Writing a small parser / interpreter (Part 1: Scanner)

Writing a small parser / interpreter (Part 1: Scanner)

Perhaps some time in your career you’ve been asked to write a small “conversion tool”, one which “shall perform a simple translation” of some text input and produce a text output. So, you grab the compiler or scripting tool of your choice and hack away at some regex parser thingy, right?!

There is an alternative that at first glance seems daunting, but for most cases is pretty simple and definitely worthwhile.

In the following three part sequel I will introduce the technique of writing a simple parser for a subset of the LOGO programming language in C#.

LOGO(small) syntax

LOGO is mostly known to be a toy language aimed at introducing kids and newcomers to programming. LEGO programs directs a small turtle (or arrow or whatever symbol kids find amusing) around the screen.

Here’s a small sample drawing a quadrant 100 units wide.

REPEAT 4 [FORWARD 100 LEFT 90]

Expressing the legal sentences in our small subset of LOGO programs can be done using a CFG (Context-Free-Grammar) and (Backus-Naur-Form), or EBNF (Extended-BNF) where appropriate.

A CFG consists of rules transforming nonterminal symbols into a sequence of nonterminal or terminal symbols. “No context” is allowed on the left hand side (the nonterminal “definition”). Deriving legal sentences consists of replacing nonterminals with their right hand side “productions” until only terminal symbols are left.

Consider the following grammar for our LOGO subset:

<logo-program>  ::= <logo-sentence> { <logo-sentence> } EOF 
<logo-sentence> ::= FORWARD <integer> | 
                    BACK <integer> | 
                    LEFT <integer> | 
                    RIGHT <integer> | 
                    REPEAT <integer> [ <logo-sentence> { <logo-sentence> } ]

<integer> are just your normal integer representation, no 0 prefixes allowed (besides 0 of course). { .. } is EBNF notation for “optional”. “|” means “or”. EOF means no more input characters (Aka “End-Of-File”).

The above subset excels when it comes to the absence of complexities (ambiguities), “left recursion” or “similar right hand side prefixes” and is therefore suitable for “recursive decent parsing”.

Boring? Don’t worry, if you have made it this far, from now on no more theory.

Transforming the simple grammar above into a parser / interpreter involves three major steps:

  1. Write a scanner capable of recognizing the keywords and integers on  the input stream
  2. Write a parser that matches the recognized terminal symbols as part of the production rules
  3. Decorate the parser with semantic methods interpreting the LOGO language

Step 1: Scanner

Writing a scanner consists of recognizing and matching chars in a string – it’s that simple.

An initial test (cough, a bit too elaborate to qualify for an actual TDD step, but you will get the idea) might look something along the lines of:

[Test]
public void CanRecognizeTokens()
{
  const string input = "REPEAT 4 [ FORWARD 100 LEFT 90 ] ";
  var expectedTokens = new[] { Token.REPEAT, Token.NUMBER, Token.LBRACKET, 
Token.FORWARD, Token.NUMBER, Token.LEFT, 
Token.NUMBER, Token.RBRACKET, Token.EOF };
  var scanner = new LogoScanner(input);
  foreach (var expectedToken in expectedTokens)
  {
    var token = scanner.Scan();
    Assert.AreEqual(expectedToken, token);
  }
}

And the accompanying code could look like:

public enum Token
{
  [TokenAsText("REPEAT")]
  REPEAT,
  [TokenAsText("FORWARD")]
  FORWARD,
  [TokenAsText("BACK")]
  BACK,
  [TokenAsText("LEFT")]
  LEFT,
  [TokenAsText("RIGHT")]
  RIGHT,
  [TokenAsText("NUMBER")]
  NUMBER,
  [TokenAsText("[")]
  LBRACKET,
  [TokenAsText("]")]
  RBRACKET,
  EOF
}
public class LogoScanner
{
  readonly Token[] reserved = new[] { Token.FORWARD, Token.BACK, Token.RIGHT, Token.LEFT, Token.REPEAT };

  private readonly string rawContents;
  private string scanBuffer;
  private int idx;
  private char ch;

  public LogoScanner(string input)
  {
    rawContents = input;
  }

  public Token Scan()
  {
    while (idx < rawContents.Length)
    {
      ch = rawContents[idx];
      if (ch == '[')
      {
        idx++;
        return Token.LBRACKET;
      }
      else if (ch == ']')
      {
        idx++;
        return Token.RBRACKET;
      }
      else if (char.IsDigit(ch))
      {
        scanBuffer = ch.ToString();
        idx++;
        while (idx < rawContents.Length)
        {
          ch = rawContents[idx];
          if (char.IsDigit(ch))
          {
            scanBuffer += ch;
            idx++;
          }
          else break;
        }
        return Token.NUMBER;
      }
      else if (char.IsLetter(ch))
      {
        scanBuffer = ch.ToString();
        idx++;
        while (idx < rawContents.Length)
        {
          ch = rawContents[idx];
          if (char.IsLetter(ch))
          {
            scanBuffer += ch;
            idx++;
          }
          else break;
        }
        Token lookup;
        if (LookupReserved(scanBuffer, out lookup))
        {
          return lookup;
        }
        LexicalError();
      }
      else if (char.IsWhiteSpace(ch))
      {
        idx++;
      }
      else
      {
        LexicalError();
      }
    }
    return Token.EOF;
  }

  private void LexicalError()
  {
    throw new LogoScannerException(String.Format("Lexical error at '{0}' ('{1}')", ch, scanBuffer));
  }

  private bool LookupReserved(string s, out Token lookup)
  {
    lookup = TokenHelper.LookupAsText(s);
    return reserved.Contains(lookup);
  }
}

Next we’ll look at writing the parser, utilizing the scanners token recognizing capabilities. Please, see the articles below.

We would like to help you

Morten Hoffmann

CEO

T: (+45) 3095 6416