Error capturing
Parsers generated by parser generators aren’t usually very good at handling errors. However, a Citron-generated parser can be configured to handle errors intelligently, almost as good as parsers written by hand, using Citron’s error capturing capability.
Motivation
When Citron encounters an error during parsing, the default behaviour is to throw an error, thereby aborting the parsing process. There are a few limitations with this approach:
-
No error recovery: The parser can detect just one error per input. In case there are multiple errors, the subsequent errors can be detected only after the previous errors are resolved.
-
No error diagnostics: The thrown error doesn’t have much information for coming up with a helpful error message capable of guiding a human in correcting the input.
-
No partial parse tree: There’s no way to create a partial parse tree that contains correctly parsed data where available, and placeholder data or error data where there were parse errors.
Citron’s error capturing helps in addressing the above limitations in the following ways:
-
Based on synchronization points in the grammar, Citron can automatically figure out how to get to a stable state after an error, from where normal parsing can resume.
-
When an error is encountered, Citron can call into our error handling code with information about the context around the error.
Our error handling code then use that information to:
- report a meaningful error message
- build a partial parse tree
Concepts
Capturing an error
In Citron, error capturing can be turned on for specific non-terminals.
If an error occurs while an error-capturing non-terminal is being parsed, Citron will try to incorporate the error onto the semantic value of that non-terminal, thereby “capturing” the error.
To be able to incorporate an error, the semantic type of an error-capturing non-terminal should be modified to accomodate an error state.
Synchronization points
If an error occurs during parsing of a non-terminal, Citron can no longer use the grammar rules to figure out where the non-terminal ends. Instead, Citron uses the concept of synchronization points put forth by Joe Groff in his blog post on Constructing human-grade parsers:
Language grammars tend to have natural synchronization points, intentional or not, that unambiguously delimit (or at least have a high likelihood of delimiting) independent subcomponents. For instance, C always ends statements with semicolons, and it groups expressions in matching pairs of parentheses, square brackets, or curly braces. It also reserves keywords like “struct”, “if”, “while”, and so on that always introduce a new declaration or statement. Absent any more specific recovery, a good fallback strategy when a parse error is encountered is to scan ahead to one of these synchronization point tokens.
In Citron, when we turn on error capturing for a non-terminal, we also specify the synchronization points for that non-terminal. Synchronization points are specified as a set of tokens that can be used to locate the end of the non-terminal.
Once a synchronizing token is found, the parser can unwind to the level of syntax the token indicates, filling in the incomplete tree underneath as best as possible. For instance, if a C parser encounters an error in the middle of an expression, it can start scanning, and if it finds a closing paren “)”, it can match that to the previous “(“ and complete the surrounding paren expression or argument list and continue parsing the expression grammar from there. Otherwise, if the parser finds a semicolon, it can complete the current partial statement and start parsing a new statement. If it finds a closing curly brace, it can match that to the previous opening brace, complete the compound statement, and start parsing a new statement or declaration afterward.
On finding a matching synchronization point, Citron automatically figures out the correct stack level to unwind to, and asks our code for the error-captured semantic value for that non-terminal. If in our code, we create and return that semantic value, Citron then actually unwinds the stack to the appropriate stable state, pushes the obtained semantic value onto the stack, and continues parsing.
Our code can, at run time, also choose to not capture the error at that particular synchronization point, in which case Citron will try again at a subsequent synchronization point. For example, when trying to error capture on a parenthesized expression (like the condition of an “if” statement), we can choose to not capture errors on sub-expressions and capture errors only when the top-level close parenthesis is encountered.
Usage
To use error capturing in Citron, we should:
- Enable error capturing for specific non-terminals
- Specify the synchronization points for each of those non-terminals
- Make the the semantic type of these non-terminals support an error state
- Write code for handling errors
Each of these steps is explained in detail below.
1. Enable error capturing
We use %capture_errors directives in a Citron grammar file to enable error capturing for a non-terminal.
For example, the following enables error capturing on a
non-terminal called statement
.
%capture_errors statement.
2. Specify synchronization points
We use end_before
and end_after
clauses in
%capture_errors directives to specify the synchronization points.
For instance, consider Joe Groff’s example of using synchronization points while parsing a C statement. In case of an error, we should look for the next semicolon, and if we find one, we can assume that the statement ends with that semicolon. If we don’t find a semicolon, but instead find a reserved keyword like “struct”, “if” or “while”, we can assume that the statement ends just before that keyword.
In a Citron grammar file, we can specify this as:
%capture_errors statement
end_after(SemiColon)
end_before(KeywordStruct | KeywordIf | KeywordWhile).
where statement
is the non-terminal representing a C statement, and
SemiColon
, KeywordStruct
, KeywordIf
and KeywordWhile
are tokens.
As another example, to turn on error capturing for the param
non-terminal representing a parameter in our function header grammar,
we could say:
%capture_errors param
end_before(Comma | CloseBracket).
This means that, in case there was an error, the param
can be
considered to end just before the next Comma
or CloseBracket
token.
3. Make semantic types support error states
We need to make sure that the semantic type of each error-capturing non-terminal supports an error state.
A simple way to do that is to use an optional version of the type as the
semantic type for that non-terminal. In this case, the nil
value would
represent the error state.
A more comprehensive way to modify a semantic type for error capturing would be to:
- Change contained member variable types to accomodate an error state
- Add a member variable to store the captured error
For example, let’s try to modify the FunctionParameter
structure we
used to represent the param
non-terminal, originally defined as:
struct FunctionParameter {
let localName: String
let externalName: String?
let type: String
let isInout: Bool
}
A simple way to make this semantic type capable of supporting error
capturing is to make it optional. If the semantic value is nil
, it
indicates that an error occurred during the parsing of the param
;
if it is non-nil, it contains an actual function parameter.
However, a nil
value cannot contain any additional information.
It might be useful to, for example, store any partially parsed data, or
the error itself, in the parse tree. To do that, we should modify the
FunctionParameter
structure to accomodate the error state. Here’s one
possible way to do that:
struct FunctionParameter {
let localName: String? // Is nil only if there's an error
let externalName: String?
let type: String? // Is nil only if there's an error
let isInout: Bool
let syntaxError: Error?
// Normal initializer, for when there's no error.
// Takes non-optional localName and type.
init(localName: String, externalName: String?, type: String, isInout: Bool) {
self.localName = localName
self.externalName = externalName
self.type = type
self.isInout = isInout
self.syntaxError = nil
}
// Error-capturing initializer, for when there's an error.
// Takes optional localName and type.
init(syntaxError: Error, localName: String?, externalName: String?, type: String?, isInout: Bool) {
self.localName = localName
self.externalName = externalName
self.type = type
self.isInout = isInout
self.syntaxError = syntaxError
}
}
4. Write code for handling errors
We should designate a class in our code to handle the errors, make that
class adopt the CitronErrorCaptureDelegate
protocol and set an
instance of the class as the errorCaptureDelegate
of the parser.
The definition of the CitronErrorCaptureDelegate
protocol is
auto-generated by Citron from the grammar file. For each error-capturing
non-terminal, the protocol contains a method declaration that looks like
this:
func shouldCaptureErrorOnNameOfNonTerminal(
state: ParserClass.CitronErrorCaptureState,
error: Error) -> CitronErrorCaptureResponse<SemanticTypeOfNonTerminal>
Our class that adopts the protocol must provide the implementation for these methods.
The state
parameter provides information on the
context in which the error occurred, which we can use to:
- Diagonize the cause for the error and show an appropriate error message
- Build a partial parse tree (or sub-tree) for this non-terminal
The return value, CitronErrorCaptureResponse
, is a generic
enumeration that can have one of these values:
-
.captureAs(SemanticTypeOfNonTerminal)
:We use this to return to Citron the error-captured semantic value for that non-terminal.
If we return this, Citron will unwind the stack appropriately, push the semantic value we return onto the stack, thereby capturing the error, and continue parsing.
-
.dontCapture
:We use this to ask Citron to abort error capturing at this particular token.
If we return this, Citron will not capture the error at this synchronization point, and will try again at a subsequent synchronization point.
Strategy
Choosing non-terminals for error capturing
Not every non-terminal in the grammar might need error capturing enabled.
A non-terminal picked for error capturing should, ideally:
- Have one or more natural synchronizing points consistent with the grammar (or it should be a grammar-ending non-terminal)
- Represent a user-facing entity that error messages can talk about
Some other points to consider:
-
It might be useful to make sure that all parsing errors go through the
errorCaptureDelegate
. One way to do that is to ensure that the start symbol has error capturing enabled. -
If a non-terminal is contained within another non-terminal, with both being error-capturing non-terminals, in case an error occurs while parsing the inner non-terminal, Citron will try to capture the error only on the inner non-terminal.
Making semantic types support error states
In case we’d like to be able to get a partial parse tree from erraneous input, we should update the type definitions of our tree nodes to accomodate that.
In case we don’t want a partial parse tree, but are looking at error
capturing to provide better error messages, and multiple errors in one
pass, we can just use an optional type and use the nil
value to
represent an error state.
Examples
A few examples of how Citron is used for parsing can be found in the “examples” folder in the project repository. The examples with a “_ec” suffix support error capturing.
These are the two examples with error capturing enabled:
-
This is an arithmetic expression parser that parses an expression in infix notation, parses it and outputs it in lisp-style prefix notation.
For example:
$ ./expr "(1 + 2) * (3 + 4)" Prefix notation: (* (+ 1 2) (+ 3 4))
It detects multiple errors in one go, gives out contextual error messages and can produce a partial parse tree (and therefore a partial prefix notation output):
$ ./expr "1 + ($ * 4" Error: Expecting an integer or parenthesized expression 1 + ($ * 4 ^ Error: Parenthesis not closed 1 + ($ * 4 ^ Prefix notation: (+ 1 (* ? 4)?)
$ ./expr "1 + + 2 3 - 4 5 * (" Error: Expecting an integer or parenthesized expression 1 + + 2 3 - 4 5 * ( ^ Error: Expecting an operator: +, -, *, or / 1 + + 2 3 - 4 5 * ( ^ Error: Expecting an integer or parenthesized expression 1 + + 2 3 - 4 5 * ( ^ Prefix notation: (- (+ 1 2?) (* 4? ?))
-
This parses Swift function headers and produces the type of the function.
For example:
$ ./functype "func add(a: Int, b: Int) -> Int" Function type is: (Int, Int) -> Int
It detects multiple errors in one go and gives out contextual error messages:
$ ./functype "func add(a: Int, b, c) ->" 1: error: expected ':' after parameter name. func add(a: Int, b, c) -> ^ 1: error: expected ':' after parameter name. func add(a: Int, b, c) -> ^ 1: error: expected result type. func add(a: Int, b, c) -> ^