
December 28, 2022

MinML: concise but general markup syntax

Could you use a markup syntax that supports the full expressive power and richness of HTML or XML, but is terser, easier to type, and frankly less ugly? To emphasize text, for example, wouldn't it be nice just to write em[emphasize] instead of <em>emphasize</em>? If so, please read on.

The tussle between generality and writer-friendliness

Markup languages derived from SGML, like HTML and XML, are powerful and have many uses but are verbose and often a pain to write or edit manually. While XML was substantially a reaction to the complexity and bloat of SGML, terseness was always considered of minimal importance in XML.

Reactions to the verbosity and awkwardness of SGML-style markup brought us formats like JSON and Markdown. But while JSON is useful for automated data interchange, it is not a markup language. Its strict and minimal syntax demanded further extensions like YAML, TOML, or JSON5 even to get, say, a way to write comments.

Markdown is a markup language, and vastly improves terseness and quick typeability for the most common and simple markup constructs. But its expressiveness is limited to a small subset of HTML. Further, the quirky special-case syntax it uses for each construct makes it difficult to “scale” to richer functionality without getting into a mess of syntax conflicts and ambiguities. It is not easy to standardize, or even to specify rigorously – which is not to say that this hasn't been tried. To see just how fragile Markdown syntax is, try to understand – or correctly implement – the 17 rules for parsing emphasis and the 131 associated examples in CommonMark.

There are numerous extensions and alternative variants of Markdown-style syntax to choose from, of course: e.g., GitHub flavor, reStructuredText, POD, Org Mode, AsciiDoc, Textile, Markua, txt2tags, etc. Each of these variants supports a different small subset of HTML, each with its own syntactic quirks for the markup author to learn afresh. Further, each flavor's limitations present expressiveness barriers that an author may encounter at any moment: “oh, but now can I do that?” These barriers can lead the frustrated author to seek escape routes – back to HTML, or to another existing Markdown flavor, or to create yet another new flavor themselves with ever-more-devilishly clever and brittle syntax with another new and different set of limitations.

On my web site, I used to embed HTML tags in .md files in order to escape Markdown's limitations. But when an “upgrade” to Hugo silently corrupted the entire website by suddenly disabling all markdown-embedded HTML, I realized the essential fragility of this solution. Even if markdown-embedded HTML can be re-enabled, I do not want all my past writing to be silently corrupted on a regular basis by the latest evolution in the markdown parser or its default configuration. Markdown and all its flavors are risky dead-ends in the long term. There is real value in relying on stable, highly-standardized, general-purpose markup formats like HTML or XML. But do I really have to keep typing all those stupid start and end tags?

Introducing MinML

MinML (which I pronounce like “minimal”) is a more concise or “minified” syntax for markup languages like HTML and XML. It is designed to be automatically cross-convertible both to and from the base markup syntax, and to preserve the full expressiveness of the underlying markup language. Unlike Markdown, there is nothing you can write in HTML but not in MinML.

In effect, MinML might be described as merely a new “skin” for a general markup language like HTML or XML. It changes only the way you write element tags, attributes, or character references, without generally affecting (or even knowing or caring about) which element tags, attributes, or references you use. MinML therefore not only supports the expressive richness of HTML now, but its expressiveness will continue growing as HTML evolves in the future.

Let us start with a brief tour of MinML syntax.

Basic markup elements

In place of start/end tag pairs, MinML uses the basic syntax tag[content], as illustrated in the following table:
HTML | MinML | Output
<em>emphasis</em> | em[emphasis] | emphasis
<kbd>typewriter</kbd> | kbd[typewriter] | typewriter
<var>x</var><sup>2</sup> | var[x]sup[2] | x²
An element with no content, like <hr> in HTML or <hr/> in XML, becomes hr[] in MinML.
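
To make the tag[content] mapping concrete, here is a minimal Python sketch of the conversion to XML-style tags. It is only an illustration, not the actual implementation: the names to_xml and find_close are hypothetical, and the sketch assumes elements without attributes and ignores character references, quotes, comments, and space-suckers.

```python
import re

# A tag name immediately followed by an open square bracket.
TAG_OPEN = re.compile(r"([A-Za-z][A-Za-z0-9]*)\[")

def find_close(src, start):
    """Return the index of the ']' matching the '[' at src[start]."""
    depth = 0
    for i in range(start, len(src)):
        if src[i] == "[":
            depth += 1
        elif src[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unmatched '['")

def to_xml(src):
    """Convert simple tag[content] elements to <tag>content</tag>."""
    out = []
    pos = 0
    while True:
        m = TAG_OPEN.search(src, pos)
        if m is None:
            out.append(src[pos:])          # no more elements: copy the rest
            return "".join(out)
        out.append(src[pos:m.start()])     # plain text before the element
        close = find_close(src, m.end() - 1)
        tag = m.group(1)
        inner = to_xml(src[m.end():close]) # nested elements convert recursively
        out.append(f"<{tag}>{inner}</{tag}>" if inner else f"<{tag}/>")
        pos = close + 1

print(to_xml("var[x]sup[2]"))  # <var>x</var><sup>2</sup>
```

Because the only thing the sketch needs to know is where the matching close bracket is, it never has to know which tags exist.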

Elements with attributes

In MinML, we attach attributes to elements by inserting them in curly braces between the tag and square-bracketed content, like this:
MinML | Output
hr{width=100%}[] | (rendered horizontal rule)
img{src=cat.jpg height=40}[] | (rendered image)
a{href=http://bford.info/}[my home page] | my home page (as a link)
If an attribute value in an element needs to contain spaces, we quote the value with square brackets, like this:
	img{src=cat.jpg alt=[a cute cat photo]}[]
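
As a rough sketch of how the text between the curly braces might be read, the following Python splits an attribute list into name/value pairs and renders them in HTML form. The names parse_attrs and render_attrs are hypothetical, and the sketch assumes that bracket-quoted values contain no nested brackets.

```python
import html
import re

# name=value, where value is either [bracket quoted] or a run of non-spaces.
ATTR = re.compile(r"([A-Za-z_:][\w:.-]*)=(\[[^\[\]]*\]|\S+)")

def parse_attrs(text):
    """Parse the text found between { and } into a dict of attributes."""
    attrs = {}
    for name, value in ATTR.findall(text):
        if value.startswith("[") and value.endswith("]"):
            value = value[1:-1]            # drop the quoting brackets
        attrs[name] = value
    return attrs

def render_attrs(attrs):
    """Render attributes in HTML syntax with double-quoted values."""
    return " ".join(
        f'{name}="{html.escape(value, quote=True)}"'
        for name, value in attrs.items()
    )

print(render_attrs(parse_attrs("src=cat.jpg alt=[a cute cat photo]")))
# src="cat.jpg" alt="a cute cat photo"
```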

Character references

MinML uses square brackets in place of SGML's bizarre &; syntax to delimit character references. Thus, you write [reg] in MinML instead of &reg; in HTML to get a registered trademark sign ®.

You can use numeric character references too, of course. For example, [#174] in decimal or [#x00AE] in hexadecimal are alternative representations for the character ®.
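
For illustration, here is a small Python sketch that expands bracketed character references into the characters they denote, using the HTML5 entity table from the standard library. The function name expand_refs is hypothetical, and the sketch makes a simplifying assumption: a bracket pair counts as a character reference only when it is not directly preceded by a letter or digit (which would make it element content instead). Other prefixed forms are not handled.

```python
import re
from html.entities import html5

# [name], [#decimal], or [#xhex], not directly preceded by a tag name.
CHAR_REF = re.compile(
    r"(?<![A-Za-z0-9])\[(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*)\]"
)

def expand_refs(src):
    """Replace MinML-style character references with actual characters."""
    def repl(m):
        ref = m.group(1)
        if ref.startswith("#x"):
            return chr(int(ref[2:], 16))   # hexadecimal reference
        if ref.startswith("#"):
            return chr(int(ref[1:]))       # decimal reference
        # Named reference: unknown names are left untouched.
        return html5.get(ref + ";", m.group(0))
    return CHAR_REF.sub(repl, src)

print(expand_refs("MinML [reg] [#174] [#x00AE]"))  # MinML ® ® ®
```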

Quoted strings

You can still use the directed (left and right) single- and double-quote character references to typeset quoted strings properly. Writing [ldquo]quote[rdquo] in MinML, as opposed to &ldquo;quote&rdquo; in XML, already seems like a slightly-improved way to express a quoted “string”.

Because quoted strings are such an important common case, however, MinML provides an even more concise alternative for matching quotes. You can write "[string] to express a “string” delimited by matching double quotes, or '[string] for a ‘string’ delimited by matching single quotes.
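
A minimal Python sketch of this shorthand, assuming the quoted content contains no nested square brackets (the name expand_quotes is hypothetical):

```python
import re

# Assumption: the quoted content contains no nested square brackets.
DOUBLE = re.compile(r'"\[([^\[\]]*)\]')
SINGLE = re.compile(r"'\[([^\[\]]*)\]")

def expand_quotes(src):
    """Turn "[text] and '[text] into text wrapped in directed quotes."""
    src = DOUBLE.sub(lambda m: "\u201c" + m.group(1) + "\u201d", src)
    return SINGLE.sub(lambda m: "\u2018" + m.group(1) + "\u2019", src)

print(expand_quotes('a "[string] in quotes'))
```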

Comments in markup

You can include comments in MinML markup with -[c], like this:
HTML | MinML | Output
<!-- comment --> | -[comment] | (nothing)

Managing whitespace

Because an element tag is outside (just before) an open bracket or curly brace in MinML, we often need whitespace to separate an element from preceding text:
bee em[yoo] tiful | bee yoo tiful
Without the whitespace before the em tag, it would look like the incorrect tag beeem. If you don't actually want whitespace around an element, however, you can use less-than < and greater-than > signs to consume or “suck” the surrounding whitespace:
bee <em[yoo]> tiful | beeyootiful
These space-sucking symbols are not delimiters as in SGML, however, and need not appear in matched pairs. You can use them to suck space on one side but not the other:
mark <em[up] now | markup now
now em[mark]> up | now markup
You can also use space-suckers within an element's content, to suck space at the beginning and/or end of the content:
a <b[> b <]> c | abc
If you need literal square brackets or curly braces immediately after what could otherwise be an element name, you can separate them with whitespace and a space-sucker:
b[1 <[hellip]> 10] | 1…10
b <[1 <[hellip]> 10] | b[1…10]
set <{a,b,c} | set{a,b,c}
The same is true if you need a literal square-bracket pair surrounding what could be mistaken for a character reference:
[star] | ☆
[> star <] | [star]

Raw matchertext sequences

MinML builds on the matchertext syntactic discipline. Matchertext makes it possible to embed one text string into another unambiguously – within a language or even across languages – without having to “escape” or otherwise transform the embedded text. The cost of this syntactic discipline is that the ASCII matcher characters – namely the parentheses (), square brackets [], and curly braces {} – must appear only in properly-nesting matched pairs throughout matchertext.
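
The matching requirement is easy to check mechanically. As a small illustration, the following Python function (the name is_matchertext is hypothetical) uses a stack to test whether the three kinds of matchers nest properly in a string:

```python
CLOSE_TO_OPEN = {")": "(", "]": "[", "}": "{"}
OPENERS = set(CLOSE_TO_OPEN.values())

def is_matchertext(text):
    """Return True if (), [], and {} appear only in properly nested pairs."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in CLOSE_TO_OPEN:
            # A closer must match the most recently opened matcher.
            if not stack or stack.pop() != CLOSE_TO_OPEN[ch]:
                return False
    return not stack                       # no opener may be left dangling

print(is_matchertext("example b[bold] in MinML"))  # True
print(is_matchertext("smile :)"))                  # False
```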

Let's first look at one of the benefits of matchertext in MinML. You can use the sequence +[m] to include any matchertext string m into the markup as raw literal text, which is completely uninterpreted except to find its end. No character sequences are disallowed in the embedded text as long as matchers match.

You can use raw matchertext sequences to include verbatim examples of markup or other code in your text, for example. A +[m] sequence is thus a more concise analog to XML's clunky CDATA sections:
XML | MinML | Output
<![CDATA[example <b>bold</b> in XML]]> | +[example <b>bold</b> in XML] | example <b>bold</b> in XML
<![CDATA[example b[bold] in MinML]]> | +[example b[bold] in MinML] | example b[bold] in MinML
Unlike CDATA sections, raw matchertext sequences nest cleanly. Including a literal example of a CDATA section in XML markup, for example, is mind-meltingly painful:
XML: <![CDATA[example <![CDATA[character data]]]]><![CDATA[> section]]>
Output: example <![CDATA[character data]]> section
Expressing a literal example of a raw matchertext sequence +[…] in MinML is straightforward in contrast:
MinML: +[example +[matchertext] literal]
Output: example +[matchertext] literal
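
The difference shows up clearly if you try to generate the two forms programmatically. In this Python sketch (the function names cdata and raw are hypothetical), embedding text in a CDATA section requires splitting every occurrence of the ]]> terminator across two sections, whereas embedding matchertext in a raw sequence requires no transformation at all, assuming the embedded text really is matchertext:

```python
def cdata(text):
    """Wrap text in XML CDATA, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def raw(text):
    """Wrap text in a MinML raw sequence.

    Assumption: text is already matchertext, so it is embedded verbatim.
    """
    return "+[" + text + "]"

print(cdata("example <![CDATA[character data]]> section"))
# <![CDATA[example <![CDATA[character data]]]]><![CDATA[> section]]>
print(raw("example +[matchertext] literal"))
# +[example +[matchertext] literal]
```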

Literal unmatched matchers

The matchertext discipline has a cost, of course. If you want to include an unmatched literal parenthesis, bracket, or curly brace in your MinML markup, you must “escape” it with a character reference. You can use standard named or numeric character references, such as [lpar] or [#x28] for an unmatched left parenthesis.

MinML also provides an alternative, more visual syntax for unmatched matchers: [(<)] and [(>)] for an open and close parenthesis, respectively, [[<]] and [[>]] for a square bracket, and [{<}] and [{>}] for a curly brace. You might think of the < or > symbol in this context as a stand-in for the unmatched matcher that “points” left or right at the matcher you actually want. The following table summarizes these various ways to express literal unmatched matchers.
Matchers | Open | Close
Parentheses () | [lpar] [#x28] [(<)] | [rpar] [#x29] [(>)]
Square brackets [] | [lbrack] [#x5B] [[<]] | [rbrack] [#x5D] [[>]]
Curly braces {} | [lbrace] [#x7B] [{<}] | [rbrace] [#x7D] [{>}]
While having to replace unmatched matchers with character references might seem cumbersome, they tend not to be used often anyway in most text – mainly just in text that is talking about such characters.

Independent of the text embedding benefits discussed above, there is another compensation for this small bother. While editing MinML, or any matchertext language, you may find that your highlighting text editor or integrated development environment (IDE) no longer ever guesses wrong about which parenthesis, bracket, or brace character matches which other one in your source file.
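
Escaping unmatched matchers can also be automated. The following Python sketch (the name escape_unmatched is hypothetical) finds the matchers that have no properly nested partner and replaces each with its visual form from the table above. It adopts one simple policy as an assumption: a closer that does not match the most recent unclosed opener is treated as unmatched, as is any opener left over at the end.

```python
VISUAL = {
    "(": "[(<)]", ")": "[(>)]",
    "[": "[[<]]", "]": "[[>]]",
    "{": "[{<}]", "}": "[{>}]",
}
CLOSE_TO_OPEN = {")": "(", "]": "[", "}": "{"}

def escape_unmatched(text):
    """Replace unmatched (), [], {} characters with visual references."""
    stack = []                  # indices of openers still awaiting a closer
    unmatched = set()
    for i, ch in enumerate(text):
        if ch in "([{":
            stack.append(i)
        elif ch in CLOSE_TO_OPEN:
            if stack and text[stack[-1]] == CLOSE_TO_OPEN[ch]:
                stack.pop()     # properly matched pair
            else:
                unmatched.add(i)
    unmatched.update(stack)     # openers that were never closed
    return "".join(
        VISUAL[ch] if i in unmatched else ch for i, ch in enumerate(text)
    )

print(escape_unmatched("a) first, b) second"))
# a[(>)] first, b[(>)] second
```

Since every replacement is itself a balanced bracket sequence, the result satisfies the matchertext discipline.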

Metasyntax and processing instructions

SGML-derived markup can contain metasyntactic declarations of the form <!…>, and processing instructions of the form <?…?>. MinML provides the syntax ![…] and ?[…], respectively, for expressing these constructs if needed.

Since these constructs are typically used in only a few lines at the beginning of most markup files, if at all, improving their syntax is not a high-priority goal for MinML. Further, the syntax of – and processing rules for – document type definitions are frighteningly complex, even in the “simplified” XML standard.

MinML therefore leaves the legacy syntax of the underlying markup language unmodified within the context of these directives. Only the outermost “wrapper” syntax changes. For example, a MinML document based on XML with a document type declaration might look like:
	?[xml version="1.0"]
	![DOCTYPE greeting SYSTEM "hello.dtd"]
	greeting[Hello, world!]
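
Since only the wrapper changes, converting such a directive back to XML syntax is a matter of swapping delimiters. Here is a minimal Python sketch, assuming each directive occupies a line of its own (the name convert_directive is hypothetical):

```python
import re

# A whole line of the form ?[...] or ![...].
DIRECTIVE = re.compile(r"^([?!])\[(.*)\]$")

def convert_directive(line):
    """Convert ?[...] to <?...?> and ![...] to <!...>; leave other lines."""
    m = DIRECTIVE.match(line.strip())
    if m is None:
        return line
    kind, body = m.groups()
    return f"<?{body}?>" if kind == "?" else f"<!{body}>"

print(convert_directive('?[xml version="1.0"]'))
# <?xml version="1.0"?>
print(convert_directive('![DOCTYPE greeting SYSTEM "hello.dtd"]'))
# <!DOCTYPE greeting SYSTEM "hello.dtd">
```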

Give MinML a try

There is an experimental implementation in Go that supports parsing MinML into an abstract syntax tree (AST) and conversion to classic HTML or XML syntax. This repository also includes a simple command-line tool to convert MinML to HTML or XML.

With this experimental fork of the Hugo website builder, you can use MinML source files with extension .minml or .m in your website. This blog post was written in MinML and published using Hugo this way. Feel free to check out the MinML source for this post.

If you implement MinML in other languages or applications, please let me know and I will collect and consolidate links.

Conclusion

MinML is a new “skin” or outer syntax for SGML-derived markup languages such as HTML and XML. MinML preserves all of the base language's power and expressiveness, unlike the numerous flavors of Markdown. MinML's syntax just makes markup a bit more concise and – at least in this author's opinion – less annoying to write, read, or edit. Elements never need end tags, only a final close bracket. Enjoy!


Topics: Syntax Programming Languages Bryan Ford