January 2, 2023

Matchertext: an escape route from language-embedding hell?

We often need to embed strings written in one programming language into code written in another. For example, we routinely embed regular expressions and SQL queries within shell scripts or string literals in C-like languages. HTML pages routinely contain embedded JavaScript and CSS code fragments. We often need to embed one URI into another, such as to formulate a query to a Web service that validates, archives, translates, or otherwise refers to other websites.

Whenever we embed strings, however, we traditionally encounter numerous flavors of a general problem I call language-embedding hell. The host language must have a way to find the end of the embedded string, which implies constraining the embedded string's syntax. Quoted string literals in C-like languages, for example, require any quote or backslash characters within the embedded string to be backslash-escaped. Escaping increases the length of the embedded string, is annoying and error-prone when we must do it manually, and yields critical security flaws like SQL injection and cross-site scripting when automated incorrectly.

The ideal of verbatim cross-language embedding

Imagine for the moment an ideal world in which all programming languages were embeddable verbatim within each other. In this fictional world, any valid string in any programming language could be embedded within the same or any other language, without any escaping, obfuscation, or length expansion. To embed a string, you ideally would just “copy-and-paste” it into any suitable context, such as a quoted string in the host language. Wouldn't that be nice?

Our tradition of expressing programming languages as plain text strings — arbitrary “flat” sequences of ASCII or UCS characters — unfortunately renders this ideal of verbatim embedding impossible in its pure form. To find the end of an embedded plain text string unambiguously, a host language must either escape or outright forbid certain character sequences (e.g., the designated closing quote or end tag), or else must prefix each embedded string with its precise length. The latter practice works fine in machine-to-machine protocols, but would be unbearably tedious and error-prone in a language intended for humans to write.
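
To make the length-prefix alternative concrete, here is a minimal sketch of how a machine-to-machine format might frame an embedded string. The function names and the “count, colon, payload” layout are assumptions chosen for illustration, not something any particular protocol or the matchertext proposal prescribes.

```python
def embed_with_length(payload: str) -> str:
    """Prefix payload with its character count and a colon, e.g. '5:hello'."""
    return f"{len(payload)}:{payload}"


def extract_with_length(host: str, start: int = 0) -> tuple[str, int]:
    """Read one length-prefixed string from host, beginning at index start.

    Returns the payload and the index just past it.
    """
    colon = host.index(":", start)        # end of the decimal length field
    length = int(host[start:colon])
    begin = colon + 1
    end = begin + length
    if end > len(host):
        raise ValueError("declared length runs past the end of the host text")
    return host[begin:end], end


if __name__ == "__main__":
    framed = embed_with_length('he said "hi" \\ and left)')
    print(framed)
    print(extract_with_length(framed + " trailing host text"))
```

The payload needs no escaping at all, but a human editing it by hand would have to recount the characters after every change, which is exactly the tedium described above.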

Matchertext: a pragmatic path out of embedding hell

What if we could design and gradually deploy a not-too-painful “upgrade” to our traditional plain text foundation for programming languages, which would eventually allow us to achieve a reasonable approximation to the above ideal of verbatim, “copy-and-paste” embedding? This goal is the essence of matchertext, an idea detailed in a new preprint.

The pragmatic essence of the matchertext idea is simple. First, we define six particular ASCII characters as matchers: namely the open and close parentheses (), the square brackets [], and the curly braces {}. We call these characters matchers because their traditional, already-ubiquitous purpose is to be used in matching pairs to surround and delimit other text.

Now we define matchertext as any plain text string conforming to one additional rule or “syntactic discipline”: namely that matchers must match, throughout any matchertext string, without exception. Nesting is allowed, but must use corresponding matchers. For example, the string ‘([{foo}])’ is valid matchertext, but strings like ‘(foo’, ‘bar}’, or ‘(]’ are not matchertext.
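
The rule above can be checked mechanically with a small stack. The following is a minimal sketch rather than a reference implementation; the name is_matchertext is an assumption made for this example.

```python
# Map each close matcher to the open matcher it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def is_matchertext(text: str) -> bool:
    """Return True if every matcher in text is properly matched and nested."""
    stack = []  # open matchers seen so far that are still waiting for a close
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A close matcher must pair with the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


if __name__ == "__main__":
    for sample in ["([{foo}])", "(foo", "bar}", "(]"]:
        print(repr(sample), is_matchertext(sample))
```

Note that the check looks only at the six matcher characters. Quotes, backslashes, and every other character are ignored, which is what keeps the discipline language-neutral.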

Most of today's programming languages are already “matchertext-compatible”, in the sense that many (perhaps most) of the strings we already tend to write, and might want to embed, will already happen to conform to this syntactic discipline. Strings in today's languages that aren't matchertext — i.e., strings containing unmatched matchers — can usually be rewritten with a bit of effort to conform to this discipline.

Why should we bother? Because by following a few simple language-neutral rules, any “matchertext-aware” language can host embedded strings written in any other matchertext-compatible syntax, without escaping, expansion, or other obfuscation of embedded strings. Embedded matchertext strings can nest to any depth, and there are no disallowed character sequences or other constraints, other than the basic rule that matchers must match. As long as you know (or your tooling can check) that a string is matchertext, you can just “copy-and-paste” it into any matchertext embedding context, with no escaping or other fuss, as in the ideal world above.
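
To illustrate why no escaping is needed, here is a sketch of how a host-language scanner might find the end of an embedded matchertext string: it simply reads forward from the open matcher until that matcher is closed. The function name embedded_span and the host syntax in the demo are hypothetical, chosen only for this example.

```python
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPEN_TO_CLOSE.values())


def embedded_span(host: str, open_index: int) -> str:
    """Return the text between the opener at open_index and its matching closer.

    Raises ValueError if the text after the opener is not valid matchertext.
    """
    if host[open_index] not in OPEN_TO_CLOSE:
        raise ValueError("open_index must point at an open matcher")
    expected = []  # close matchers we still expect, innermost last
    for i in range(open_index, len(host)):
        ch = host[i]
        if ch in OPEN_TO_CLOSE:
            expected.append(OPEN_TO_CLOSE[ch])
        elif ch in CLOSERS:
            if expected.pop() != ch:
                raise ValueError(f"mismatched matcher at index {i}")
            if not expected:
                # The opener at open_index has just been closed.
                return host[open_index + 1 : i]
    raise ValueError("embedded string is not terminated")


if __name__ == "__main__":
    # Hypothetical host syntax: the word 'query' followed by a bracketed string.
    host = 'query[SELECT a[1] FROM t WHERE s = "it\'s"] rest of host code'
    print(embedded_span(host, host.index("[")))
```

The scanner never inspects quotes or backslashes inside the embedded text, so it needs no knowledge of the embedded language, and nested embeddings are handled by the same counting.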

While the matchertext idea was motivated by the embedding problem, I have already noticed at least one side benefit (“icing on the cake”) of adopting this discipline. When writing matchertext, I find it convenient that highlighting text editors like Vim never guess wrong about which open matcher is associated with which close matcher, even when the editor has no specific knowledge of the language in question. Matchertext merely enforces strictly a discipline that we already mostly follow anyway, and rigorously enforcing a useful programming discipline often has unanticipated benefits of this kind.

Incremental language extensions for matchertext

Of course, matchertext will never be useful unless at least some language designers and developers are willing to “take the plunge” and try implementing and using it. The matchertext preprint explores what this could mean in the context of various popular languages, including C-like languages, SGML-derived languages such as HTML and XML, and embedding-oriented “little languages” like regular expressions and URIs.

In general, today's languages can be evolved gracefully and incrementally to become more “matchertext-aware”, via two main classes of language extensions: hosting extensions and embedding extensions. Both classes of extensions can readily be designed and deployed so as to preserve backward compatibility with existing code.

Hosting extensions make matchertext useful by allowing the verbatim embedding of any matchertext string in a suitable host-language construct such as a string literal. In C-like languages, for example, I suggest new backslash-escape sequences like \m[matchertext], where matchertext is an arbitrary matchertext string. SGML-derived markup languages might be enhanced with backward-compatible syntax extensions such as <name attributes [matchertext]> for a tag containing arbitrary matchertext content.

Embedding extensions reduce the (hopefully already-modest) pain of writing matchertext-compliant code, particularly by providing new alternative ways to escape unmatched matchers. A statement like printf("[") is invalid matchertext, for example, and simple backslash-escaping as in printf("\[") does not work because the matchertext discipline is oblivious to language-specific escaping rules. Numeric escaping like printf("\x5B") solves the problem and already works fine, but is a bit cumbersome. Thus, I suggest potential new backward-compatible escapes that are more visual while being matchertext-compliant: e.g., \o[] and \c[] for unmatched open and close brackets in C-like languages, or [<] and [>] for unmatched brackets in regular expressions or SGML-style markup languages.
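
As an illustration of the numeric-escaping workaround that already works today, the sketch below rewrites the contents of a C string literal so that every unmatched matcher becomes a numeric escape. The function name escape_unmatched is hypothetical. It emits three-digit octal escapes (\133 for an open bracket) rather than the \x5B form, because a C hex escape absorbs any hex digits that follow it, whereas an octal escape stops after three digits. The sketch assumes its input is the raw literal content, and it does not interpret any backslash escapes already present.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def escape_unmatched(text: str) -> str:
    """Replace every unmatched matcher in text with a three-digit octal escape."""
    unmatched = set()  # indices of matchers that have no partner
    stack = []         # (index, opener) pairs still waiting for a closer
    for i, ch in enumerate(text):
        if ch in OPENERS:
            stack.append((i, ch))
        elif ch in PAIRS:
            if stack and stack[-1][1] == PAIRS[ch]:
                stack.pop()       # properly matched pair: leave both alone
            else:
                unmatched.add(i)  # closer that does not match the latest opener
    unmatched.update(i for i, _ in stack)  # openers that were never closed
    return "".join(
        f"\\{ord(ch):03o}" if i in unmatched else ch
        for i, ch in enumerate(text)
    )


if __name__ == "__main__":
    for literal in ["[", "a[0] and ]", "{ok}", ":-)"]:
        print(repr(literal), "->", escape_unmatched(literal))
```

This greedy pass is conservative: in a string such as ([) it escapes all three matchers, although escaping only the bracket would suffice. The result is still valid matchertext, just less readable than it could be, which is the kind of awkwardness the more visual escapes suggested above aim to reduce.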

Practical experimentation with matchertext

As of this writing, most of the work of implementing and experimenting with the matchertext idea remains to be done. To succeed, it will need plenty of time (years at least, if not decades), and will eventually require effort from interested people across many programming languages. The current preprint is intended to be only a modest starting point, to be updated occasionally if and when interest materializes and we gain further experience worth reflecting in the paper.

As a concrete initial experiment, I have made the matchertext idea central to the design of MinML, a more concise alternative to HTML or XML syntax. MinML attempts to bring HTML-style syntax closer to the convenience of Markdown and its many variants, while preserving the full power and generality of HTML or XML. MinML embodies the matchertext discipline and uses it to handle nested “example code” constructs more cleanly and sanely than in existing markup languages. Since MinML is only a single matchertext-aware language, however, it can by no means unlock the full potential of the matchertext idea alone.

I hope that others — maybe you — will try proposing, implementing, and using the matchertext idea in new or existing programming languages of your choice. If you do, then well-considered pull requests to the matchertext paper source are welcome, and significant contributions to future versions of the paper will be acknowledged appropriately. Thanks for reading!