December 28, 2022
MinML: concise but general markup syntax
Could you use a markup syntax
that supports the full expressive power and richness of HTML or XML,
but is more terse, easier to type, and less frankly ugly?
To emphasize text, for example,
wouldn't it be nice just to write em[emphasize]
instead of <em>emphasize</em>?
If so, please read on.
The tussle between generality and writer-friendliness
Markup languages derived from SGML, like HTML and XML,
are powerful and have many uses
but are verbose and often a pain to write or edit manually.
While XML was substantially a reaction to the complexity and bloat of SGML,
terseness was always considered of minimal importance in XML.
Reactions to the verbosity and awkwardness of SGML-style markup
brought us formats like JSON and Markdown.
But while JSON is useful for automated data interchange,
it is not a markup language.
Its strict and minimal syntax demanded further extensions like
YAML, TOML, or JSON5
even to get, say, a way to write comments.
Markdown is a markup language,
and vastly improves terseness and quick typeability
for the most common and simple markup constructs.
But its expressiveness is limited to a small subset of HTML.
Further,
the quirky special-case syntax it uses for each construct
makes its syntax difficult to “scale” to richer functionality
without getting into a mess of syntax conflicts and ambiguities.
It is not easy to standardize, or even to specify rigorously –
which is not to say that this hasn't been tried.
To see just how fragile Markdown syntax is,
try to understand – or correctly implement –
the 17 rules for parsing emphasis
and the 131 associated examples in CommonMark.
There are numerous extensions and alternative variants of Markdown-style syntax
to choose from, of course: e.g.,
GitHub flavor, reStructuredText, POD, Org Mode,
AsciiDoc, Textile, Markua, txt2tags, etc.
Each of these variants supports a
different small subset of HTML,
each with its own syntactic quirks for the markup author to learn afresh.
Further, each flavor's limitations present expressiveness barriers
that an author may encounter at any moment:
“oh, but now can I do
that?”
These barriers can lead the frustrated author
to seek escape routes –
back to HTML,
or to another existing Markdown flavor,
or to create
yet another new flavor themselves
with ever-more-devilishly clever and brittle syntax
with another new and different set of limitations.
On my web site,
I used to embed HTML tags in .md files
in order to escape Markdown's limitations.
But when an “upgrade” to Hugo
silently corrupted the entire website
by suddenly disabling all markdown-embedded HTML,
I realized the essential fragility of this solution.
Even if markdown-embedded HTML
can be re-enabled,
I do not want all my past writing being silently corrupted on a regular basis
by the latest evolution in the markdown parser or its default configuration.
Markdown and all its flavors are risky dead-ends in the long term.
There is real value in relying on stable, highly-standardized,
general-purpose markup formats like HTML or XML.
But do I really have to keep typing all those stupid start and end tags?
Introducing MinML
MinML
(which I pronounce like “minimal”)
is a more concise or “minified” syntax
for markup languages like HTML and XML.
It is designed to be automatically cross-convertible
both to and from the base markup syntax,
and to preserve the full expressiveness of the underlying markup language.
Unlike Markdown, there is nothing you can write in HTML but not in MinML.
In effect,
MinML might be described as
merely a new “skin” for a general markup language like HTML or XML.
It changes
only the way you write element tags,
attributes, or character references,
without generally affecting (or even knowing or caring about)
which element tags, attributes, or references you use.
MinML therefore not only supports the expressive richness of HTML now,
but its expressiveness will continue growing
as HTML evolves in the future.
Let us start with a brief tour of MinML syntax.
Basic markup elements
In place of start/end tag pairs,
MinML uses the basic syntax tag[content],
as illustrated in the following table:
HTML | MinML | Output |
<em>emphasis</em> | em[emphasis] | emphasis |
<kbd>typewriter</kbd> | kbd[typewriter] | typewriter |
<var>x</var><sup>2</sup> | var[x]sup[2] | x² |
An element with no content,
like <hr> in HTML or <hr/> in XML,
becomes hr[] in MinML.
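To make the tag[content] rule concrete, here is a minimal Python sketch of the conversion back to start/end tags. It is not the experimental Go implementation: the names TAG_OPEN, find_close, and to_html are my own, and the sketch handles only plain tag[content] elements, ignoring attributes, character references, and everything else described later. Emitting <tag/> for empty content is also just a choice made for this illustration.

```python
import re

# Hypothetical helper names; a simplified sketch, not the author's Go parser.
# Handles only tag[content]: no attributes, references, or space-suckers.
TAG_OPEN = re.compile(r"[A-Za-z][A-Za-z0-9]*\[")


def find_close(text: str, open_pos: int) -> int:
    """Return the index of the ']' that matches the '[' at open_pos."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unmatched '['")


def to_html(text: str) -> str:
    """Rewrite every tag[content] in text as <tag>content</tag>."""
    out = []
    pos = 0
    while True:
        m = TAG_OPEN.search(text, pos)
        if m is None:
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:m.start()])
        tag = m.group()[:-1]                  # drop the trailing '['
        close = find_close(text, m.end() - 1)
        inner = to_html(text[m.end():close])  # elements may nest
        # Empty content becomes an XML-style empty element in this sketch.
        out.append(f"<{tag}>{inner}</{tag}>" if inner else f"<{tag}/>")
        pos = close + 1
```

Because the closing bracket is found by counting nesting depth, an element such as b[bold em[both]] converts correctly without any end-tag names in the source.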
Elements with attributes
In MinML, we attach attributes to elements
by inserting them in curly braces between the tag
and square-bracketed content, like this:
MinML | Output |
hr{width=100%}[] | |
img{src=cat.jpg height=40}[] | |
a{href=http://bford.info/}[my home page] | my home page |
If an attribute value in an element needs to contain spaces,
we quote the value with square brackets, like this:
img{src=cat.jpg alt=[a cute cat photo]}[]
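As a rough illustration of how such an attribute block could be read, the following Python sketch parses the text between the curly braces into name/value pairs and renders them as conventional HTML attributes. The function names parse_attrs and render_attrs are hypothetical, and the sketch assumes every attribute is written name=value as in the examples above; it says nothing about how the real implementation handles other forms.

```python
from html import escape


def parse_attrs(body: str) -> dict:
    """Parse the text between '{' and '}' into a name -> value mapping.

    Assumes every attribute is written name=value, as in the examples above.
    """
    attrs = {}
    i, n = 0, len(body)
    while i < n:
        if body[i].isspace():
            i += 1
            continue
        eq = body.index("=", i)
        name = body[i:eq]
        i = eq + 1
        if i < n and body[i] == "[":
            # Bracket-quoted value: scan to the matching ']'.
            depth, j = 1, i + 1
            while j < n and depth > 0:
                if body[j] == "[":
                    depth += 1
                elif body[j] == "]":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("unmatched '[' in attribute value")
            value = body[i + 1:j - 1]
            i = j
        else:
            # Bare value: runs up to the next whitespace.
            j = i
            while j < n and not body[j].isspace():
                j += 1
            value = body[i:j]
            i = j
        attrs[name] = value
    return attrs


def render_attrs(attrs: dict) -> str:
    """Render the mapping as HTML attributes with double-quoted values."""
    return " ".join(f'{k}="{escape(v)}"' for k, v in attrs.items())
```

For example, render_attrs(parse_attrs("src=cat.jpg height=40")) yields src="cat.jpg" height="40".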
Character references
MinML uses square brackets
in place of SGML's bizarre &…; syntax
to delimit character references.
Thus, you write [reg] in MinML
instead of &reg; in HTML
to get a registered trademark sign ®.
You can use numeric character references too, of course.
For example, [#174] in decimal or [#x00AE] in hexadecimal
are alternative representations for the character ®.
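Here is a small Python sketch of how these bracketed references might be resolved to characters, using the HTML5 entity table from Python's standard library. The name expand_refs is my own, and the rule that a reference is any bracket pair not directly preceded by a tag name or attribute block is a simplification for this sketch, not the full MinML grammar.

```python
import re
from html.entities import html5  # maps names like "reg;" to characters

# Simplification: a reference is a bracket pair NOT directly preceded by a
# tag name or attribute block, so em[reg] stays an element.
REF = re.compile(
    r"(?<![A-Za-z0-9}])\[(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*)\]"
)


def expand_refs(text: str) -> str:
    """Replace [name], [#decimal], and [#xhex] references with characters."""
    def replace(m):
        ref = m.group(1)
        if ref.startswith("#x"):
            return chr(int(ref[2:], 16))  # hexadecimal code point
        if ref.startswith("#"):
            return chr(int(ref[1:]))      # decimal code point
        # Unknown names are left untouched rather than guessed at.
        return html5.get(ref + ";", m.group(0))
    return REF.sub(replace, text)
```

All three spellings of the registered trademark sign, [reg], [#174], and [#x00AE], expand to the same character.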
Quoted strings
You can still use the directed (left and right)
single- and double-quote character references
to typeset quoted strings properly.
Writing [ldquo]quote[rdquo] in MinML,
as opposed to &ldquo;quote&rdquo; in XML,
already seems like a slightly-improved way to express
a quoted “string”.
Because quoted strings are such an important common case, however,
MinML provides an even more concise alternative for matching quotes.
You can write "[string] to express
a “string” delimited by matching double quotes,
or '[string] for a ‘string’ delimited by matching single quotes.
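The quote shorthand can be sketched in a few lines of Python. The names QUOTES, find_close, and expand_quotes are hypothetical, and the sketch looks only at the quote syntax, passing all other markup through unchanged.

```python
# Straight quote character -> (left, right) directed quotes.
QUOTES = {'"': ("\u201c", "\u201d"), "'": ("\u2018", "\u2019")}


def find_close(text: str, open_pos: int) -> int:
    """Return the index of the ']' that matches the '[' at open_pos."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unmatched '['")


def expand_quotes(text: str) -> str:
    """Turn "[...] and '[...] into text wrapped in directed quotes."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in QUOTES and text[i + 1:i + 2] == "[":
            close = find_close(text, i + 1)
            left, right = QUOTES[ch]
            # Recurse so that quotes can nest inside quotes.
            out.append(left + expand_quotes(text[i + 2:close]) + right)
            i = close + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A quote character that is not immediately followed by an open bracket, such as the apostrophe in it's, is left alone.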
Comments in markup
You can include comments in MinML markup
with -[c], like this:
HTML | MinML | Output |
<!-- comment --> | -[comment] | |
Managing whitespace
Because an element tag is outside (just before)
an open bracket or curly brace in MinML,
we often need whitespace to separate an element from preceding text:
bee em[yoo] tiful | bee yoo tiful |
Without the whitespace before the em tag,
it would look like the incorrect tag beeem.
If you don't actually want whitespace around an element, however,
you can use less-than < and greater-than > signs
to consume or “suck” the surrounding whitespace:
bee <em[yoo]> tiful | beeyootiful |
These space-sucking symbols are
not delimiters as in SGML, however,
and need not appear in matched pairs.
You can use them to suck space on one side but not the other:
mark <em[up] now | markup now |
now em[mark]> up | now markup |
You can also use space-suckers
within an element's content,
to suck space at the beginning and/or end of the content:
If you need literal square brackets or curly braces
immediately after what could otherwise be an element name,
you can separate them with whitespace and a space-sucker:
b[1 <[hellip]> 10] | 1…10 |
b <[1 <[hellip]> 10] | b[1…10] |
set <{a,b,c} | set{a,b,c} |
The same is true if you need a literal square-bracket pair
surrounding what could be mistaken for a character reference:
[star] | ☆ |
[> star <] | [star] |
Raw matchertext sequences
MinML builds on the matchertext syntactic discipline.
Matchertext makes it possible
to embed one text string into another unambiguously –
within a language or even across languages –
without having to “escape”
or otherwise transform the embedded text.
The cost of this syntactic discipline
is that the ASCII matcher characters –
namely the parentheses (), square brackets [], and curly braces {} –
must appear only in properly-nesting matched pairs throughout matchertext.
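The matching rule is simple enough to check with a stack. Here is a minimal Python sketch of such a check; the name is_matchertext is my own, and the function tests only the nesting rule as stated above, nothing else about any matchertext specification.

```python
# Each closing matcher mapped to the opener it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def is_matchertext(s: str) -> bool:
    """Return True if (), [], and {} occur only in properly nested pairs."""
    stack = []
    for ch in s:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened matcher.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # anything left open is unmatched
```

Ordinary prose usually passes this check, while something like a text smiley with a lone close parenthesis does not.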
Let's first look at one of the benefits of matchertext in MinML.
You can use the sequence +[m]
to include any matchertext string m into the markup
as raw literal text,
which is completely uninterpreted except to find its end.
No character sequences are disallowed in the embedded text
as long as matchers match.
You can use raw matchertext sequences
to include verbatim examples of markup or other code
in your text, for example.
A +[m] sequence
is thus a more concise analog to XML's clunky CDATA sections:
XML | MinML | Output |
<![CDATA[example <b>bold</b> in XML]]> | +[example <b>bold</b> in XML] | example <b>bold</b> in XML |
<![CDATA[example b[bold] in MinML]]> | +[example b[bold] in MinML] | example b[bold] in MinML |
Unlike CDATA sections,
raw matchertext sequences nest cleanly.
Including a literal example of a CDATA section in XML markup,
for example, is mind-meltingly painful:
XML: | <![CDATA[example <![CDATA[character data]]]]><![CDATA[> section]]> |
Output: | example <![CDATA[character data]]> section |
Expressing a literal example
of a raw matchertext sequence +[…] in MinML
is straightforward in contrast:
MinML: | +[example +[matchertext] literal] |
Output: | example +[matchertext] literal |
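The reason nesting is painless is that the end of a raw sequence is found purely by bracket matching. The following Python sketch shows the idea: it copies the content of each +[…] verbatim and escapes it for HTML output. The name expand_raw is hypothetical, and counting only square brackets is a shortcut that relies on the content already being valid matchertext.

```python
from html import escape


def expand_raw(text: str) -> str:
    """Replace each +[m] with m as literal text, escaped for HTML output."""
    out = []
    pos = 0
    while True:
        start = text.find("+[", pos)
        if start < 0:
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:start])
        # Find the ']' matching the '[' right after the '+'. The content is
        # matchertext, so counting square brackets is enough to find its end.
        depth = 0
        end = -1
        for j in range(start + 1, len(text)):
            if text[j] == "[":
                depth += 1
            elif text[j] == "]":
                depth -= 1
                if depth == 0:
                    end = j
                    break
        if end < 0:
            raise ValueError("unterminated +[ sequence")
        # Nothing inside is interpreted, not even a nested +[...].
        out.append(escape(text[start + 2:end], quote=False))
        pos = end + 1
```

Note that the inner +[matchertext] in the example above comes through untouched, because the scan skips straight to the outer closing bracket.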
Literal unmatched matchers
The matchertext discipline has a cost, of course.
If you want to include an unmatched literal
parenthesis, bracket, or curly brace in your MinML markup,
you must “escape” it with a character reference.
You can use standard named or numeric character references,
like [lpar] or [#x28] for an unmatched left parenthesis, for example.
MinML also provides an alternative, more visual syntax for unmatched matchers:
[(<)] and [(>)] for an open and close parenthesis, respectively,
[[<]] and [[>]] for a square bracket, and
[{<}] and [{>}] for a curly brace.
You might think of the < or > symbol in this context
as a stand-in for the unmatched matcher that “points” left or right
at the matcher you actually want.
The following table summarizes these various ways to express
literal unmatched matchers.
 | Open | Close |
Parentheses () | [lpar] [#x28] [(<)] | [rpar] [#x29] [(>)] |
Square brackets [] | [lbrack] [#x5B] [[<]] | [rbrack] [#x5D] [[>]] |
Curly braces {} | [lbrace] [#x7B] [{<}] | [rbrace] [#x7D] [{>}] |
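The visual forms in the table amount to a fixed six-entry lookup. Here is a small Python sketch of that lookup; the names VISUAL and expand_visual are my own, and the sketch treats the six sequences as plain text substitutions rather than as part of a full parser.

```python
import re

# The six visual escapes from the table, mapped to the matcher they denote.
VISUAL = {
    "[(<)]": "(", "[(>)]": ")",
    "[[<]]": "[", "[[>]]": "]",
    "[{<}]": "{", "[{>}]": "}",
}
# One alternation, so all substitutions happen in a single left-to-right pass.
VISUAL_RE = re.compile("|".join(re.escape(k) for k in VISUAL))


def expand_visual(text: str) -> str:
    """Replace each visual escape with the single matcher character."""
    return VISUAL_RE.sub(lambda m: VISUAL[m.group(0)], text)
```

A half-open interval is a handy test case: [[<]]0, 1[(>)] expands to [0, 1), and the source itself remains valid matchertext.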
While having to replace unmatched matchers with character references
might seem cumbersome,
they tend not to be used often anyway in most text –
mainly just in text that is
talking about such characters.
Independent of the text embedding benefits discussed above,
there is another compensation for this small bother.
While editing MinML, or any matchertext language,
you may find that your highlighting text editor
or integrated development environment (IDE)
no longer
ever guesses wrong
about which parenthesis, bracket, or brace character
matches which other one in your source file.
Metasyntax and processing instructions
SGML-derived markup can contain metasyntactic declarations
of the form <!…>,
and processing instructions
of the form <?…?>.
MinML provides the syntax ![…] and ?[…], respectively,
for expressing these constructs if needed.
Since these constructs are typically used
in only a few lines at the beginning of most markup files,
if at all,
improving their syntax is not a high-priority goal for MinML.
Further,
the syntax of – and processing rules for –
document type definitions are frighteningly complex,
even in the “simplified” XML standard.
MinML therefore leaves the legacy syntax of the underlying markup language
unmodified within the context of these directives.
Only the outermost “wrapper” syntax changes.
For example, a MinML document based on XML
with a document type declaration might look like:
?[xml version="1.0"]
![DOCTYPE greeting SYSTEM "hello.dtd"]
greeting[Hello, world!]
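Since only the wrapper changes, converting these directives back is nearly mechanical. A minimal Python sketch follows, assuming each directive sits alone on its own line as in the example above; the name convert_directive is hypothetical, and real input would need proper bracket matching rather than a simple end-of-line check.

```python
def convert_directive(line: str) -> str:
    """Rewrite a one-line ?[...] or ![...] directive in classic syntax.

    Assumes the directive occupies the whole line; other lines pass through.
    """
    if line.startswith("?[") and line.endswith("]"):
        return "<?" + line[2:-1] + "?>"  # processing instruction
    if line.startswith("![") and line.endswith("]"):
        return "<!" + line[2:-1] + ">"   # declaration, e.g. DOCTYPE
    return line
```

The text inside the brackets is copied through unmodified, which mirrors the decision to leave the legacy syntax alone within these directives.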
Give MinML a try
There is an experimental implementation in Go
that supports parsing MinML into an abstract syntax tree (AST)
and conversion to classic HTML or XML syntax.
This repository also includes a simple command-line tool
to convert MinML to HTML or XML.
With this experimental fork of the Hugo website builder,
you can use MinML source files
with extension .minml or .m
in your website.
This blog post was written in MinML and published using Hugo this way.
Feel free to check out
the MinML source for this post.
If you implement MinML in other languages or applications,
please let me know and I will collect and consolidate links.
Conclusion
MinML is a new “skin” or outer syntax
for SGML-derived markup languages such as HTML and XML.
MinML preserves all of the base language's power and expressiveness,
unlike the numerous flavors of Markdown.
MinML's syntax just makes markup a bit more concise
and – at least in this author's opinion –
less annoying to write, read, or edit.
Elements never need end tags,
only a final close bracket.
Enjoy!