Evaluating the CommonMark spec

September 05, 2014

The promise of “CommonMark”, or the erstwhile “Standard Markdown”, or just stmd, is to present a spec that describes the Markdown syntax unambiguously.

Their homepage says:

… one of our major goals is to make Markdown easier to parse, and to eliminate the many old inconsistencies and ambiguities that made writing a Markdown parser so difficult.

But after looking at the stmd spec, it looks unlikely that their current spec can evolve into a specification that can unambiguously specify Markdown.

Why do Markdown parsers differ?

Undoubtedly, John Gruber’s original syntax description for Markdown is ambiguous. Some of the ambiguity is in the description of the syntax constructs itself, like how should sublists be written.

But a lot of the ambiguity in parsing Markdown is in handling the interplay of different syntax constructs. Some of them are silly and are possibly rare in real world documents¹, like whether *emph or [a link*?](url) has an emphasis construct or a link construct. But some of them are practical too, like whether the second line in the following snippet is part of the list or a separate header:

* Paper
* Pencil
---

The above text is something that a writer could really write in a Markdown document, and it does get interpreted differently by different parsers. Why this disagreement? Because, while the original Markdown syntax was clear about what a list looks like, and also what a header looks like, it didn’t specify what a parser should do if something in the input looks like it could be both a list and a header. So different implementations took different decisions on how to handle this.

The above is just one example and there are quite a lot of permutations in which different syntax constructs could interleave in a document. It is important that a spec that aims to unambiguously specify the syntax addresses all of these interplays.

Where does stmd fall short?

The stmd spec assumes a declarative style, similar to the original Markdown syntax description, and gets into more detail for each syntax construct. So, the ambiguities limited to a particular syntax construct are addressed, or can be potentially addressed.

But stmd is silent on what should be done when a piece of text matches multiple syntax constructs, in much the same way as the original Markdown syntax page is.

Block-level ambiguity

To illustrate, if we take the above example of lists vs. setext header, stmd is equally ambiguous on how it should be handled. According to sections 5.2 and 5.3 of the stmd spec (covering list items and lists), the second line is part of a list. But according to section 4.3 (covering setext headers), the second line is a header. Going by different parts of the spec, we can interpret the input differently.

One could argue that whatever comes first should take precedence. If that’s the case, a spec that aims to be unambiguous should be explicit about it. And should include an exception in the definition of setext headers requiring that the first line of text should not be part of any other construct (except link reference definitions, if you want to match the stmd implementation). But that makes it only murkier to determine from the spec what constitutes a setext header (because now the definition of setext headers is dependant on the definition of the other constructs).

And even then, consider the case where the setext underline is indented, like:

* Paper
* Pencil
  ---

Is the header now inside the list, or outside? Going by the stmd spec, it could be interpreted either way. Neither clarifying the setext header description, nor adding a note about precedence will help in clearing this up.

Span-level ambiguity

Similarly, when span level constructs interleave, we again have ambiguity. For example, the text run *emph or __strong*__ can be parsed either considering * as em or __ as strong. We’re not any wiser after reading the spec how this should be handled. Admittedly, this is a pedantic example, but a specification aiming to be unambiguous cannot afford to leave these undefined.

This argument can be extended to talk about other interleaving of span elements, like code spans and emph, links and other links, links and images, links and emphasis², well, you get the idea.

Same old story

The stmd spec takes the approach of declaratively describing a construct and providing examples to illustrate corner cases. Even though the spec has a lot of examples, they barely scratch the surface of describing the corner cases. That’s not surprising given that the number of permutations of interleaving of syntax constructs is just too large to handle with examples.

This is why the current design of the stmd spec is inadequate to cover all the corner cases and evolve into an unambiguous spec. I know that it is possible to write a completely unambiguous algorithm-based spec (à la the HTML5 spec) for Markdown, because I’ve written one myself. There might also be other ways to do it that I haven’t come across. But a declarative spec like this cannot get rid of ambiguity resulting from the interleaving of multiple syntax constructs, and hence adopters of this spec will likely interpret the same spec in different ways, exactly like before.

Not so rare if you’re writing a Markdown editor and need to parse the text as it’s being written. In that case, a lot of these not-likely-in-a-document scenarios can be encountered while the document gets edited. ↩
The current link definition in the stmd spec disallows unescaped asterisks in the link label, which is wrong and doesn’t match the stmd implementation. Once that is fixed, we will have the problem of whether *a[b*](url) is a[ in an em or b* in a link. ↩

This post started with a HN conversation with John MacFarlane (primary author of CommonMark) and forms the basis of this post in the CommonMark forum.

If you liked this post, you might also like: Why isn't there a formal grammar for Markdown?