Error capturing

Parsers generated by parser generators aren’t usually very good at handling errors. However, a Citron-generated parser can be configured to handle errors intelligently, almost as good as parsers written by hand, using Citron’s error capturing capability.

Motivation
Concepts
- Capturing an error
- Synchronization points
Usage
Strategy
- Choosing non-terminals for error capturing
- Making semantic types support error states
Examples

Motivation

When Citron encounters an error during parsing, the default behaviour is to throw an error, thereby aborting the parsing process. There are a few limitations with this approach:

No error recovery: The parser can detect just one error per input. In case there are multiple errors, the subsequent errors can be detected only after the previous errors are resolved.
No error diagnostics: The thrown error doesn’t have much information for coming up with a helpful error message capable of guiding a human in correcting the input.
No partial parse tree: There’s no way to create a partial parse tree that contains correctly parsed data where available, and placeholder data or error data where there were parse errors.

Citron’s error capturing helps in addressing the above limitations in the following ways:

Based on synchronization points in the grammar, Citron can automatically figure out how to get to a stable state after an error, from where normal parsing can resume.
When an error is encountered, Citron can call into our error handling code with information about the context around the error.

Our error handling code then use that information to:
- report a meaningful error message
- build a partial parse tree

Concepts

Capturing an error

In Citron, error capturing can be turned on for specific non-terminals.

If an error occurs while an error-capturing non-terminal is being parsed, Citron will try to incorporate the error onto the semantic value of that non-terminal, thereby “capturing” the error.

To be able to incorporate an error, the semantic type of an error-capturing non-terminal should be modified to accomodate an error state.

Synchronization points

If an error occurs during parsing of a non-terminal, Citron can no longer use the grammar rules to figure out where the non-terminal ends. Instead, Citron uses the concept of synchronization points put forth by Joe Groff in his blog post on Constructing human-grade parsers:

Language grammars tend to have natural synchronization points, intentional or not, that unambiguously delimit (or at least have a high likelihood of delimiting) independent subcomponents. For instance, C always ends statements with semicolons, and it groups expressions in matching pairs of parentheses, square brackets, or curly braces. It also reserves keywords like “struct”, “if”, “while”, and so on that always introduce a new declaration or statement. Absent any more specific recovery, a good fallback strategy when a parse error is encountered is to scan ahead to one of these synchronization point tokens.

In Citron, when we turn on error capturing for a non-terminal, we also specify the synchronization points for that non-terminal. Synchronization points are specified as a set of tokens that can be used to locate the end of the non-terminal.

Once a synchronizing token is found, the parser can unwind to the level of syntax the token indicates, filling in the incomplete tree underneath as best as possible. For instance, if a C parser encounters an error in the middle of an expression, it can start scanning, and if it finds a closing paren “)”, it can match that to the previous “(“ and complete the surrounding paren expression or argument list and continue parsing the expression grammar from there. Otherwise, if the parser finds a semicolon, it can complete the current partial statement and start parsing a new statement. If it finds a closing curly brace, it can match that to the previous opening brace, complete the compound statement, and start parsing a new statement or declaration afterward.

On finding a matching synchronization point, Citron automatically figures out the correct stack level to unwind to, and asks our code for the error-captured semantic value for that non-terminal. If in our code, we create and return that semantic value, Citron then actually unwinds the stack to the appropriate stable state, pushes the obtained semantic value onto the stack, and continues parsing.

Our code can, at run time, also choose to not capture the error at that particular synchronization point, in which case Citron will try again at a subsequent synchronization point. For example, when trying to error capture on a parenthesized expression (like the condition of an “if” statement), we can choose to not capture errors on sub-expressions and capture errors only when the top-level close parenthesis is encountered.

Usage

To use error capturing in Citron, we should:

Enable error capturing for specific non-terminals
Specify the synchronization points for each of those non-terminals
Make the the semantic type of these non-terminals support an error state
Write code for handling errors

Each of these steps is explained in detail below.

1. Enable error capturing

We use %capture_errors directives in a Citron grammar file to enable error capturing for a non-terminal.

For example, the following enables error capturing on a non-terminal called statement.

%capture_errors statement.

2. Specify synchronization points

We use end_before and end_after clauses in %capture_errors directives to specify the synchronization points.

For instance, consider Joe Groff’s example of using synchronization points while parsing a C statement. In case of an error, we should look for the next semicolon, and if we find one, we can assume that the statement ends with that semicolon. If we don’t find a semicolon, but instead find a reserved keyword like “struct”, “if” or “while”, we can assume that the statement ends just before that keyword.

In a Citron grammar file, we can specify this as:

%capture_errors statement
    end_after(SemiColon)
    end_before(KeywordStruct | KeywordIf | KeywordWhile).

where statement is the non-terminal representing a C statement, and SemiColon, KeywordStruct, KeywordIf and KeywordWhile are tokens.

As another example, to turn on error capturing for the param non-terminal representing a parameter in our function header grammar, we could say:

%capture_errors param
    end_before(Comma | CloseBracket).

This means that, in case there was an error, the param can be considered to end just before the next Comma or CloseBracket token.

3. Make semantic types support error states

We need to make sure that the semantic type of each error-capturing non-terminal supports an error state.

A simple way to do that is to use an optional version of the type as the semantic type for that non-terminal. In this case, the nil value would represent the error state.

A more comprehensive way to modify a semantic type for error capturing would be to:

Change contained member variable types to accomodate an error state
Add a member variable to store the captured error

For example, let’s try to modify the FunctionParameter structure we used to represent the param non-terminal, originally defined as:

struct FunctionParameter {
    let localName: String
    let externalName: String?
    let type: String
    let isInout: Bool
}

A simple way to make this semantic type capable of supporting error capturing is to make it optional. If the semantic value is nil, it indicates that an error occurred during the parsing of the param; if it is non-nil, it contains an actual function parameter.

However, a nil value cannot contain any additional information.

It might be useful to, for example, store any partially parsed data, or the error itself, in the parse tree. To do that, we should modify the FunctionParameter structure to accomodate the error state. Here’s one possible way to do that:

struct FunctionParameter {
    let localName: String? // Is nil only if there's an error
    let externalName: String?
    let type: String? // Is nil only if there's an error
    let isInout: Bool
    let syntaxError: Error?

    // Normal initializer, for when there's no error.
    // Takes non-optional localName and type.

    init(localName: String, externalName: String?, type: String, isInout: Bool) {
        self.localName = localName
        self.externalName = externalName
        self.type = type
        self.isInout = isInout
        self.syntaxError = nil
    }

    // Error-capturing initializer, for when there's an error.
    // Takes optional localName and type.

    init(syntaxError: Error, localName: String?, externalName: String?, type: String?, isInout: Bool) {
        self.localName = localName
        self.externalName = externalName
        self.type = type
        self.isInout = isInout
        self.syntaxError = syntaxError
    }
}

4. Write code for handling errors

We should designate a class in our code to handle the errors, make that class adopt the CitronErrorCaptureDelegate protocol and set an instance of the class as the errorCaptureDelegate of the parser.

The definition of the CitronErrorCaptureDelegate protocol is auto-generated by Citron from the grammar file. For each error-capturing non-terminal, the protocol contains a method declaration that looks like this:

func shouldCaptureErrorOnNameOfNonTerminal(
        state: ParserClass.CitronErrorCaptureState,
        error: Error) -> CitronErrorCaptureResponse<SemanticTypeOfNonTerminal>

Our class that adopts the protocol must provide the implementation for these methods.

The state parameter provides information on the context in which the error occurred, which we can use to:

Diagonize the cause for the error and show an appropriate error message
Build a partial parse tree (or sub-tree) for this non-terminal

The return value, CitronErrorCaptureResponse, is a generic enumeration that can have one of these values:

.captureAs(SemanticTypeOfNonTerminal):

We use this to return to Citron the error-captured semantic value for that non-terminal.

If we return this, Citron will unwind the stack appropriately, push the semantic value we return onto the stack, thereby capturing the error, and continue parsing.
.dontCapture:

We use this to ask Citron to abort error capturing at this particular token.

If we return this, Citron will not capture the error at this synchronization point, and will try again at a subsequent synchronization point.

Strategy

Choosing non-terminals for error capturing

Not every non-terminal in the grammar might need error capturing enabled.

A non-terminal picked for error capturing should, ideally:

Have one or more natural synchronizing points consistent with the grammar (or it should be a grammar-ending non-terminal)
Represent a user-facing entity that error messages can talk about

Some other points to consider:

It might be useful to make sure that all parsing errors go through the errorCaptureDelegate. One way to do that is to ensure that the start symbol has error capturing enabled.
If a non-terminal is contained within another non-terminal, with both being error-capturing non-terminals, in case an error occurs while parsing the inner non-terminal, Citron will try to capture the error only on the inner non-terminal.

Making semantic types support error states

In case we’d like to be able to get a partial parse tree from erraneous input, we should update the type definitions of our tree nodes to accomodate that.

In case we don’t want a partial parse tree, but are looking at error capturing to provide better error messages, and multiple errors in one pass, we can just use an optional type and use the nil value to represent an error state.

Examples

A few examples of how Citron is used for parsing can be found in the “examples” folder in the project repository. The examples with a “_ec” suffix support error capturing.

These are the two examples with error capturing enabled:

expr_ec:

This is an arithmetic expression parser that parses an expression in infix notation, parses it and outputs it in lisp-style prefix notation.

For example:

$ ./expr "(1 + 2) * (3 + 4)"
Prefix notation: (* (+ 1 2) (+ 3 4))

It detects multiple errors in one go, gives out contextual error messages and can produce a partial parse tree (and therefore a partial prefix notation output):

$ ./expr "1 + ($ * 4"
Error: Expecting an integer or parenthesized expression
1 + ($ * 4
     ^
Error: Parenthesis not closed
1 + ($ * 4
          ^
Prefix notation: (+ 1 (*  ? 4)?)

$ ./expr "1 + + 2 3 - 4 5 * ("
Error: Expecting an integer or parenthesized expression
1 + + 2 3 - 4 5 * (
    ^
Error: Expecting an operator: +, -, *, or /
1 + + 2 3 - 4 5 * (
              ^
Error: Expecting an integer or parenthesized expression
1 + + 2 3 - 4 5 * (
                   ^
Prefix notation: (- (+ 1 2?) (* 4?  ?))

functype_ec:

This parses Swift function headers and produces the type of the function.