January 2, 2023
Matchertext: an escape route from language-embedding hell?
We often need to embed strings written in one programming language
into code written in another.
For example,
we routinely embed
regular expressions
and
SQL queries
within shell scripts
or string literals in C-like languages.
HTML pages
routinely contain embedded
JavaScript
and
CSS
code fragments.
We often need to embed one
URI
into another,
such as to formulate a query to a Web service
that
validates,
archives,
translates,
or otherwise refers to other websites.
Whenever we embed strings, however,
we traditionally encounter numerous flavors
of a general problem I call
language-embedding hell.
The host language must have a way to find the end of the embedded string,
which implies constraining the embedded string's syntax.
Quoted string literals in C-like languages, for example,
require any quote or backslash characters within the embedded string to be
backslash-escaped.
Escaping increases the length of the embedded string,
is annoying and error-prone when we must do it manually,
and yields critical security flaws
like
SQL injection
and
cross-site scripting
when automated incorrectly.
The ideal of verbatim cross-language embedding
Imagine for the moment an ideal world in which all programming languages were
embeddable
verbatim within each other.
In this fictional world,
any valid string in
any programming language
could be embedded within the same or any other language,
without any escaping, obfuscation, or length expansion.
To embed a string, you ideally would just “copy-and-paste” it
into any suitable context,
such as a quoted string in the host language.
Wouldn't that be nice?
Our tradition of expressing programming languages as
plain text strings —
arbitrary “flat” sequences of
ASCII or
UCS
characters —
unfortunately renders this ideal of verbatim embedding
impossible in its pure form.
To find the end of an embedded plain text string unambiguously,
a host language must either escape or outright forbid
certain character sequences (e.g., the designated closing quote or end tag),
or else must prefix each embedded string with its precise length.
The latter practice works fine in machine-to-machine protocols,
but would be unbearably tedious and error-prone in a language
intended for humans to write.
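To make the length-prefix alternative concrete, here is a minimal sketch assuming a netstring-like framing (decimal length, a colon, then the payload). The function names `embed` and `extract` and the framing itself are illustrative assumptions, not something this post or the preprint specifies.

```python
def embed(payload: str) -> str:
    """Frame a payload as '<length>:<payload>' (hypothetical framing)."""
    return f"{len(payload)}:{payload}"


def extract(host: str, start: int = 0):
    """Read one framed payload beginning at `start`.

    Returns (payload, index just past the payload).
    """
    colon = host.index(":", start)          # end of the decimal length
    length = int(host[start:colon])
    begin = colon + 1
    end = begin + length
    if end > len(host):
        raise ValueError("declared length runs past the end of the host string")
    return host[begin:end], end


# Any payload embeds verbatim, even one full of quotes and colons,
# because the reader never inspects the payload's characters.
inner = 'say "hi": \\ (unbalanced'
twice = embed(embed(inner))
outer, _ = extract(twice)
recovered, _ = extract(outer)
```

The payload is never escaped, but every edit to it forces the writer to recount and update the prefix, which is the tedium described above.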
Matchertext: a pragmatic path out of embedding hell
What if we could design and gradually deploy
a not-too-painful “upgrade” to our traditional plain text foundation
for programming languages,
which would eventually allow us to achieve a reasonable approximation
to the above ideal of verbatim, “copy-and-paste” embedding?
This goal is the essence of
matchertext,
an idea detailed in
a new preprint.
The pragmatic essence of the matchertext idea is simple.
First, we define six particular ASCII characters as
matchers:
namely the open and close parentheses (),
the square brackets [],
and the curly braces {}.
We call these characters matchers
because their traditional, already-ubiquitous purpose
is to be used in matching pairs to surround and delimit other text.
Now we define
matchertext as any plain text string
conforming to one additional rule or “syntactic discipline”:
namely that
matchers must match,
throughout any matchertext string, without exception.
Nesting is allowed, but must use corresponding matchers.
For example, the string ‘([{foo}])’ is valid matchertext,
but strings like ‘(foo’, ‘bar}’, or ‘(]’ are not matchertext.
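The rule is simple enough to check mechanically. The following is a minimal sketch of such a check using a stack; the function name `is_matchertext` is my own choice for illustration and is not taken from the preprint.

```python
# Map each close matcher to the open matcher it must pair with.
OPENER_FOR = {")": "(", "]": "[", "}": "{"}
OPENERS = set(OPENER_FOR.values())


def is_matchertext(text: str) -> bool:
    """Return True if every matcher in `text` is properly paired and nested."""
    stack = []  # open matchers seen so far that are still awaiting a closer
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in OPENER_FOR:
            # A closer must pair with the most recent unclosed opener.
            if not stack or stack.pop() != OPENER_FOR[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


for sample in ["([{foo}])", "(foo", "bar}", "(]"]:
    print(repr(sample), is_matchertext(sample))
```

All other characters, including quotes and backslashes, are ignored, which reflects the point that the discipline knows nothing about any particular language's quoting or escaping.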
Most of today's programming languages are already “matchertext-compatible”,
in the sense that many (perhaps most) of the strings we already tend to write,
and might want to embed,
will already happen to conform to this syntactic discipline.
Strings in today's languages that aren't matchertext —
i.e., strings containing unmatched matchers —
can usually be rewritten with a bit of effort
to conform to this discipline.
Why should we bother?
Because by following a few simple language-neutral rules,
any “matchertext-aware” language can host embedded strings
written in
any other matchertext-compatible syntax,
without escaping, expansion, or other obfuscation of embedded strings.
Embedded matchertext strings can nest to any depth,
and there are no disallowed character sequences or other constraints,
other than the basic rule that matchers must match.
As long as you know (or your tooling can check) that a string is matchertext,
you can just “copy-and-paste” it into any matchertext embedding context,
with no escaping or other fuss, as in the ideal world above.
While the matchertext idea was motivated by the embedding problem,
I have already noticed at least one side benefit —
or “icing on the cake” —
to adopting this discipline.
When writing matchertext,
I find it convenient that highlighting text editors like
Vim no longer guess wrong
about which open matcher is associated with which close matcher,
even when the editor has no specific knowledge of the language in question.
Matchertext is merely about
strictly enforcing an existing discipline
that we already
mostly follow anyway,
and rigorously enforcing a useful programming discipline
often has unanticipated benefits of this kind.
Incremental language extensions for matchertext
Of course, matchertext will never be useful
unless at least some language designers and developers
are willing to “take the plunge” and try implementing and using it.
The
matchertext preprint
explores what this could mean in the context of various popular languages,
including C-like languages,
SGML-derived languages such as HTML and XML,
and embedding-oriented “little languages”
like regular expressions and URIs.
In general, today's languages can be evolved gracefully and incrementally
to become more “matchertext-aware”,
via two main classes of language extensions:
hosting extensions and
embedding extensions.
Both classes of extensions can readily be designed and deployed
so as to preserve backward compatibility with existing code.
Hosting extensions make matchertext useful
by allowing the verbatim embedding of any matchertext string
in a suitable host-language construct such as a string literal.
In C-like languages, for example,
I suggest new backslash-escape sequences like \m[matchertext],
where matchertext is an arbitrary matchertext string.
SGML-derived markup languages might be enhanced
with backward-compatible syntax extensions such as
<name attributes [matchertext]>
for a tag containing arbitrary matchertext content.
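To illustrate how a host language might find the end of an embedded string, here is a rough sketch of a scanner for the suggested \m[matchertext] form. The helper name `read_embedded` and the sample C line are hypothetical; the only thing the scanner relies on is that the embedded text obeys the matchers-must-match rule.

```python
from typing import Tuple

CLOSER_FOR = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(CLOSER_FOR.values())


def read_embedded(host: str, open_index: int) -> Tuple[str, int]:
    """Given the index of an open matcher in `host`, return the text between
    it and its matching closer, plus the index just past that closer."""
    if host[open_index] not in CLOSER_FOR:
        raise ValueError("expected an open matcher at open_index")
    expected = []  # stack of closers we still need to see
    for i in range(open_index, len(host)):
        ch = host[i]
        if ch in CLOSER_FOR:
            expected.append(CLOSER_FOR[ch])
        elif ch in CLOSERS:
            if expected.pop() != ch:
                raise ValueError(f"mismatched matcher at index {i}")
            if not expected:  # the outermost opener has just been closed
                return host[open_index + 1:i], i + 1
    raise ValueError("embedded string is not terminated")


# Hypothetical C source line using the suggested \m[...] escape.
host = r'const char *re = "\m[^(a|b)[0-9]{2}"quoted"$]";'
start = host.index(r"\m[") + 2          # index of the '[' after \m
embedded, end = read_embedded(host, start)
```

Note that the scanner never looks at quotes or backslashes inside the embedded text, so the double quotes in the sample regular expression need no escaping.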
Embedding extensions reduce the (hopefully already-modest) pain
of writing matchertext-compliant code,
particularly by providing new alternative ways to escape unmatched matchers.
A statement like printf("[")
is invalid matchertext,
for example,
and simple backslash-escaping as in printf("\[")
does not work
because the matchertext discipline
is oblivious to language-specific escaping rules.
Numeric escaping like printf("\x5B")
solves the problem
and already works fine,
but is a bit cumbersome.
Thus, I suggest potential new backward-compatible escapes
that are more visual while being matchertext-compliant:
e.g., \o[]
and \c[]
for unmatched open and close brackets
in C-like languages,
or [<]
and [>]
for unmatched brackets
in regular expressions or SGML-style markup languages.
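Tooling could help locate the matchers that need such escapes. The sketch below reports the positions of unmatched matchers in a string; the function name `unmatched_matchers` and its recovery heuristic (a closer pairs with the nearest enclosing opener of its own kind, and any openers skipped over are reported) are my own assumptions, not part of the preprint.

```python
OPENER_FOR = {")": "(", "]": "[", "}": "{"}
OPENERS = set(OPENER_FOR.values())


def unmatched_matchers(text: str) -> list:
    """Return sorted indices of matchers that have no proper partner."""
    unmatched = []
    stack = []  # (index, opener) pairs still awaiting a closer
    for i, ch in enumerate(text):
        if ch in OPENERS:
            stack.append((i, ch))
        elif ch in OPENER_FOR:
            want = OPENER_FOR[ch]
            # Look for the nearest enclosing opener of the right kind.
            depth = next(
                (d for d in range(len(stack) - 1, -1, -1) if stack[d][1] == want),
                None,
            )
            if depth is None:
                unmatched.append(i)  # closer with no opener at all
            else:
                # Openers skipped over can no longer be closed properly.
                unmatched.extend(idx for idx, _ in stack[depth + 1:])
                del stack[depth:]
    unmatched.extend(idx for idx, _ in stack)  # openers never closed
    return sorted(unmatched)


for stmt in ['printf("[")', r'printf("\[")', r'printf("\x5B")']:
    print(stmt, unmatched_matchers(stmt))
```

Because the check ignores language-specific escaping, the backslash-escaped bracket is still reported, while the numeric escape contains no matcher character and passes.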
Practical experimentation with matchertext
As of this writing,
most of the work of implementing and experimenting with the matchertext idea
remains to be done.
To succeed it will need plenty of time (years at least, if not decades),
and will eventually require effort from interested people across
many programming languages.
The current preprint is intended to be only a modest starting point,
to be updated occasionally if and when interest materializes
and we gain further experience worth reflecting in the paper.
As a concrete initial experiment,
I have made the matchertext idea central to the design of
MinML,
a more concise alternative to HTML or XML syntax.
MinML attempts to bring HTML-style syntax
closer to the convenience of Markdown and its many variants,
while preserving the full power and generality of HTML or XML.
MinML embodies the matchertext discipline
and uses it to handle nested “example code” constructs
more cleanly and sanely than in existing markup languages.
Since MinML is only a single matchertext-aware language,
however,
it can by no means unlock the full potential of the matchertext idea alone.
I hope that others — maybe you —
will try proposing, implementing, and using the matchertext idea
in new or existing programming languages of your choice.
If you do,
then well-considered pull requests to the
matchertext paper source
are welcome,
and significant contributions to future versions of the paper
will be acknowledged appropriately.
Thanks for reading!