Contents
Parsing utilities

Overview

Balau includes a set of classes which can help in the construction of language scanners and parsers. The aim of these classes is to facilitate the creation of hand written language scanners and recursive descent parsers written in pure C++.

Given that there are many mature scanner/parser generator tools available, one pertinent question to ask may be: why write a scanner and parser in pure C++?

Firstly, there is no right or wrong approach to writing a scanner/parser pair. If a generator tool works well for a particular use case, then such an approach is admirable. However, given that the parsers of prominent mainstream compilers are hand written (Clang - unified parser, GCC - new C parser), perhaps there are valid reasons for doing so.

One anecdotal reason often quoted is performance, that is to say that a hand written parser will supposedly be much faster than a generated one. The evidence does not appear to back this up (the GCC wiki indicates a negligible 1.5% speed increase). There may thus be a small percentage speed up, but it is unlikely that a speed up measured in orders of magnitude will occur.

So removing performance from the argument, some benefits are proposed below.

The overriding reason to use scanner/parser generator tools is that the amount of code you need to write is much less than the code required for a hand written scanner and parser. This is most likely true. However, a language specification is typically created once and then subsequently modified over time with only small incremental improvements. The overhead of creating the larger code base of a hand written scanner and parser is thus likely to be less than the benefits reaped during fine tuning of the initial implementation and the ease of adding incremental improvements later on.

Approach

Having read the introductory words in the overview section of this page, it could be natural to imagine that the parsing utilties in the Balau library are large and complex. This is actually not the case. The parsing utilities total less than 1000 lines of code and a handful of classes.

As the aim is to hand write a scanner and parser implementation pair, the value of the utilities lies in the approach rather than the amount of code provided. The parser tools do not actually provide any classes for parsing at all. The code only provides an abstract scanner base class and a common contract between the scanner and the to be written parser.

The overall approach to the creation of a new pure C++ based language parser using the Balau parsing utilities classes is to:

There are currently two language parser implementations in the Balau library that follow this approach:

Before embarking on the creation of a language parser based on the Balau parser utility classes, it could be useful to take a look at these parser implementations.

Architecture

The overall architecture supported by the Balau parser utilities is one where:

The advantages of such an approach are the separation of concerns between the scanning and parsing stages, the infinite look-ahead and look-back, and the ease of dynamically changing the whitespace handling strategy during parsing.

The disadvantage of such an approach is the necessity of specifying a single token set which covers all possible tokens of the language, i.e. it is not possible to define multiple token sets, the active set then being selected according to context during the parsing stage.

There are a number of ways of mitigating this disadvantage when a parser cannot be trivially implemented with a complete separation between scanner and parser for a particular language grammar. One solution could be to define a single token for the contextual part of the language and then define a second scanner which is triggered separately from within the parser. Another solution could be to define a set of tokens that represent the input text and then, under certain exceptional code paths in the parser, translate subsets of the input tokens on the fly into a modified set of tokens for further parsing.

Scanned tokens

#include <Balau/Lang/Common/ScannedTokens.hpp>

An instance of the ScannedTokens<TokenT> class is generated by the scanner and consumed by the parser. This class is an efficient data structure which contains the input text as a UTF-8 string and two additional arrays:

In order to use the scanned tokens data, instantiation of an adaptor class is required. There are three adaptor classes, destined for different functionalities:

Scanner Api

With the ScannerApiScannedTokens adaptor class, a standard scanner API is provided for use by a traditional downstream parsing stage. This is the adaptor that is used if traditional parsing is required.

The ScannerApiScannedTokens adaptor class provides the following methods.

Method Description
get Get the current scanned token, optionally consuming blanks, line breaks, and/or comments as determined by the current whitespace mode.
consume Consume the current token.
expect(token) Consume the current token if it is equal to the supplied token, otherwise register an error report or throw an exception.
expect(tokens) Consume the current token if it is equal one of the supplied tokens, otherwise register an error report or throw an exception
putBack Put back the current token, optionally restoring blanks, line breaks, and/or comments as determined by the current whitespace mode.
mark Record a token position marker for a subsequent multiple putBack call.
putBack(marker) Put back to the specified marker.
pushWhitespaceMode Push the specified whitespace mode onto the whitespace mode stack.
popWhitespaceMode Pop the top of the whitespace mode stack.
getCurrentCodeSpan Get the code span for the current token's text.
reset Reset the scanner state to the beginning of the token list and with an empty whitespace mode stack.
size Get the number of tokens produced from the scanned text.
moveTextOut Move the text of the scanner out (prevents further scanning calls).

The expect methods that register an error report do so via a supplied error report function that produces an error report. The report is added to the supplied container. These expect methods should be used if an error recovery type parser is required, as opposed to a fail on first error parser.

Due to the compact representation of the scanned token data, obtaining the code span of a token's text must be performed by iterating from the start of the arrays. When using the ScannerApiScannedTokens adaptor class, the current code span is maintained for immediate use. Due to this, the current code span may be accessed without any performance penalty.

The scanner API also provides classification methods for the current token and token specified by a marker:

Random access

If random access to the scanned token data and code spans is required for other uses, the RandomAccessScannedTokens adaptor class may be used to pre-calculate the code spans. When instantiated, this class iterates over the token set and calculates all the code spans. Due to this, the RandomAccessScannedTokens class has a higher memory overhead than the ScannerApiScannedTokens class.

Iteration

When iteration is required over the scanned tokens, this third adaptor class is provided for such applications. The IterativeScannedTokens adaptor class provides standard C++ iterators onto the scanned tokens.

Scanning

#include <Balau/Lang/Common/AbstractScanner.hpp>

In order to create a ScannedTokens data structure from some input source code text, a scanner implementation is required. Balau provides an AbstractScanner base class. Concrete implementations of this class implement the getNextToken abstract method in order to provide the scanning logic.

The AbstractScanner class provides the public scanner api, consisting of the single scan method, and a set of protected utility methods used by implementing classes.

The fundamental utility methods used by implementing classes are the readNextChar and putBackCurrentChar methods. These manage character reading and end of file handling.

Other utility methods including string and whitespace extraction. Also included is a calculateCurrentCodeSpan method which can be used for error reporting.

Parsing

To create a recursive descent parser in pure C++ using the Balau parser utility classes, it is sufficient to pass in an instance of the ScannerApiScannedTokens class into the parser constructor and access the scanner API from within the production methods.

The PropertyParser class is an example of a parser in the Balau library. A single public method parse is provided in each class. This method calls the root production methods in order to initiate the recursive descent parsing.

Classes

The following classes are defined in the parsing utilties.

Class name Description
ScannedTokens<TokenT> The data structure produced by scanners. Contains the input source text, the scanned tokens, and the start offsets.
ScannerApiScannedTokens<TokenT> A scanner API adaptor over the ScannedTokens data structure.
RandomAccessScannedTokens<TokenT> A random access adaptor over the ScannedTokens data structure.
IterativeScannedTokens<TokenT> An iterative adaptor over the ScannedTokens data structure.
AbstractScanner<TokenT> The base class of scanner implementations.
ScannedToken<TokenT> A single scanned token returned from the scanner API, containing the token, the text, and the code span.
CodeSpan Contains two code positions which together indicate a span of code.
CodePosition Represents a line/column position in the source code text.