Escaping Escaping Hell with Matchertext Resource Identifiers

Uniform resource identifiers or URIs were a genuinely great idea and have rightfully become the ubiquitous way to name things on the Internet. As the basis of web addresses or URLs, they are human readable (to varying degrees), manually transcribable, cut-and-pasteable, and have proven incrementally extensible to a vast multitude of schemes. Their later extension to internationalized resource identifiers or IRIs allows people whose native language is not English to type and view non-ASCII Unicode characters in web addresses.

URIs and IRIs have two major usability flaws, however. First, you can't reliably tell where they end in surrounding text. Suppose I type or paste into an E-mail:

	My site is https://bford.info/index.html.

Rich text editors often notice the URI and helpfully turn it into a hyperlink – but is the final period part of the link or the surrounding sentence? It could be either, since periods are valid and unreserved URI characters, and a wrong heuristic guess yields a broken link. Suppose I try to “armor” the URI using parentheses, like this:

	See my website (https://bford.info/).

This doesn't really help because parentheses are valid in URIs too, and heavily used by Wikipedia for example – as in this cruel joke of a URI:

	https://en.wikipedia.org/wiki/URL_(disambiguation)

The URI specification recommends, in an obscure appendix, delimiting URIs with angle brackets, as in <https://bford.info/>. But how many users are even aware of this suggestion, let alone consistently follow it?

A second major usability flaw is that URIs are not cleanly composable without sucking users and developers alike into what I call Escaping Hell. When we need to compose a URI that contains other URIs – or bits of text with any other nontrivial syntactic structure, for that matter – we must escape reserved or forbidden characters, by percent-encoding the embedded string into an ugly, obfuscated, unfriendly mess like https%3A%2F%2Fbford.info%2F.

This post suggests a possible fix to both flaws and a path out of Escaping Hell: an incrementally-deployable alternative syntax I call a matchertext resource identifier or MRI. Matchertext is, in brief, a language-neutral syntactic discipline that allows any compliant string in one syntax to be embedded verbatim into any other compliant syntax, via simple "copy-and-paste" – with no escaping or other transformation.

To create a matchertext resource identifier, we in essence simply rewrite any URI of the form scheme:body into MRI syntax of the form scheme[body]. Any URI like https://bford.info/ has an equivalent MRI syntax like https[//bford.info/], and vice versa. (“MRI” may be pronounced like “murrie” to avoid confusion with medical procedures.)
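
As a minimal sketch of this rewriting, assuming only the scheme grammar from the generic URI syntax, the two directions might look like the following. The function names uri_to_mri and mri_to_uri are hypothetical, and the reverse direction shown here is only safe for bodies that need no re-escaping, such as bodies with no nested brackets.

```python
import re

# Scheme syntax from the generic URI grammar: a letter followed by
# letters, digits, "+", "-", or ".".
_SCHEME = r"[A-Za-z][A-Za-z0-9+.-]*"


def uri_to_mri(uri: str) -> str:
    """Rewrite scheme:body as scheme[body] (hypothetical helper)."""
    m = re.fullmatch(rf"({_SCHEME}):(.*)", uri, flags=re.DOTALL)
    if m is None:
        raise ValueError(f"not a URI: {uri!r}")
    return f"{m.group(1)}[{m.group(2)}]"


def mri_to_uri(mri: str) -> str:
    """Rewrite scheme[body] as scheme:body without any re-escaping.

    Only valid for simple bodies; nested MRIs need real down-conversion.
    """
    m = re.fullmatch(rf"({_SCHEME})\[(.*)\]", mri, flags=re.DOTALL)
    if m is None:
        raise ValueError(f"not an MRI: {mri!r}")
    return f"{m.group(1)}:{m.group(2)}"


print(uri_to_mri("https://bford.info/"))   # https[//bford.info/]
print(mri_to_uri("https[//bford.info/]"))  # https://bford.info/
```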

The key benefit of this alternative syntax is that any valid MRI can be embedded into any other MRI, verbatim, via copy-and-paste or manual transcription for example, with no mangling. A query to translate my home page, for example, might look like this:

	https[//translator.org/web?page=https[//bford.info/];lang=fr]

Just copy the target MRI into the query field and we're done.

There's no free lunch, of course. There is one key syntactic “cost”: MRIs must conform to the matchertext discipline, which in brief means that the ASCII matcher characters – namely parentheses (), square brackets [], and curly braces {} – must strictly match up in pairs. We will examine the implications of this syntactic discipline below.
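
The matching rule is easy to check mechanically. The sketch below is one possible checker, assuming that "match up in pairs" means the three kinds of ASCII matchers must be properly nested with respect to one another; the name matchers_match is made up for illustration.

```python
# Map each closing matcher to the opener it must pair with.
_CLOSERS = {")": "(", "]": "[", "}": "{"}
_OPENERS = set(_CLOSERS.values())


def matchers_match(text: str) -> bool:
    """Return True if all ASCII matchers in text pair up and nest properly."""
    stack = []
    for ch in text:
        if ch in _OPENERS:
            stack.append(ch)
        elif ch in _CLOSERS:
            # A closer must pair with the most recently opened matcher.
            if not stack or stack.pop() != _CLOSERS[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


print(matchers_match("https[//translator.org/web?page=https[//bford.info/];lang=fr]"))  # True
print(matchers_match("//search.org/page?string=["))  # False
```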

Just as IRIs already liberalized resource identifiers to admit non-ASCII characters, MRIs similarly liberalize resource identifiers to allow all graphical Unicode characters within the body – provided only that matchers match. Any MRI may be converted to an equivalent URI or IRI and vice versa. New applications, protocols, formats, and versions can start accepting MRIs from users alongside URIs and IRIs while preserving backward compatibility through automatic conversions. Let's de-obfuscate the web – or at least web addresses.

Welcome to Escaping Hell

Since the end-finding flaw mentioned above is simple and well-known, let's focus a bit more on the problem of Escaping Hell.

We often wish to embed URIs within URIs for many reasons. Many web-based services operate on other web pages, for example, such as archival, translation, performance testing, syntax checking, and link checking. Such services need to accept URIs containing arbitrary other URIs, either in their pathname components (like https://archiver.org/date/target-URI), or in query strings (like https://checker.org/check?page=target-URI). But because the outer URI treats many characters as special, those characters must be escaped or percent-encoded in the nested URI. Entering a simple target URI like https://bford.info/ into the W3C's official markup validation service, for example, mangles the target URI into this obscure percent-encoded query string:

	https://validator.w3.org/nu/?doc=https%3A%2F%2Fbford.info%2F

So much for URIs being “human-readable”, at least in these contexts. Of course interactive web users don't need to do this mangling themselves because the service's front-page web form tooling does it for them, and developers who generate queries to the service programmatically are expected to percent-encode the target URI as part of the process. This is why we have managed to slog through Escaping Hell so far. But from a usability perspective, it's sad that target URIs must become nearly unrecognizable when embedded into other URIs. It's also sad that less-sophisticated users can't reasonably be expected to write or edit such a query string manually, or to copy-and-paste a target URI into a query field of another URI template, without thoroughly understanding URI syntax and poring over an ASCII table.

Besides embedded URIs, many other bits of text we often want to embed in URI pathname components or queries must commonly be mangled via percent-encoding, because they contain characters that are either reserved or forbidden entirely within URIs. For example, a query to a calculator service to evaluate a simple expression like 6/2+1 gets mangled into a query like https://calc.org/eval?expr=6%2F2%2B1. A query for a C++ or Java-style parameterized type like Box<Integer> gets mangled like https://dev-docs.org/search?type=Box%3CInteger%3E. A query involving a Unix-style pathname like /a/b/c gets mangled like https://unarchiver.org/extract?path=%2Fa%2Fb%2Fc.
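
To see this mangling first-hand, the short sketch below uses Python's standard urllib.parse.quote with safe="" so that every reserved character is escaped. The embed helper is hypothetical, and the service URLs are the illustrative ones used above.

```python
from urllib.parse import quote


def embed(prefix: str, value: str) -> str:
    """Append value to a URI prefix, percent-encoding everything reserved."""
    return prefix + quote(value, safe="")


print(embed("https://calc.org/eval?expr=", "6/2+1"))
# https://calc.org/eval?expr=6%2F2%2B1
print(embed("https://dev-docs.org/search?type=", "Box<Integer>"))
# https://dev-docs.org/search?type=Box%3CInteger%3E
print(embed("https://unarchiver.org/extract?path=", "/a/b/c"))
# https://unarchiver.org/extract?path=%2Fa%2Fb%2Fc
```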

Experience suggests that one of the most critical components of URIs – the authority field containing the host name – really “wants” the flexibility and modularity of an embedded URI. The authority component internally needs multiple “schemes” to name hosts in different ways, such as DNS names like bford.info versus IP addresses like 1.2.3.4. It has non-trivial internal syntactic structure, such as optional user@ prefix and :port suffix. It has repeatedly proven to need extensibility, to handle IPv6 addresses like [aaaa:bbbb:cccc:dddd::] for example. In short, the authority component of the base URI syntax has grown complex and syntaxy with many details and variations, exacerbating security issues. Supporting different schemes, internal structure, and extensibility are precisely what the whole URI naming paradigm is supposed to handle – but it can't do so for authority naming because of broken syntax.

Extending IPv6 addresses to support zone indexes like fe80::1234%1 or fe80::1234%eth0 created a further syntactic conflict between this use of the percent (‘%’) character and its use in URIs for percent-encoded escapes. This conflict then required a whole additional RFC just to say that the percent sign in a zoned IPv6 address needs to be escaped as %25 in a URI host field, producing outstandingly-clear URIs like http://[fe80::1234%251]/. (Is the zone index 1 or 251?)

The multiaddr project, while well-intentioned, threatens to dig us even deeper into Escaping Hell as we try to figure out how to shoehorn a pathname-like address like /ip4/127.0.0.1/udp/9090/quic into the hostname field of a URI. How would a parser figure out what's what in a string like http://ip4/127.0.0.1/udp/9090/quic/index.html? Alternatively, how is a user to understand an escaped embedding like http://%2Fip4%2f127.0.0.1%2fudp%2f9090%2fquic/index.html? Current multiaddr examples suggest restructuring URIs to move the scheme or protocol name into the multiaddr itself, as in /ip4/127.0.0.1/tcp/80/http/baz.jpg – but is such a fundamental reworking of how URIs are structured and parsed really going to fly on the established web?

Matchertext Resource Identifiers (MRIs)

As a composition-friendly and backward-compatible approach to making URIs more usable, I propose matchertext resource identifiers or MRIs. To a first approximation, a MRI merely replaces the colon : following the scheme name in a URI or IRI with an open bracket [, and adds a close bracket ] at the end. The URI https://bford.info/ is equivalent to the MRI https[//bford.info/], for example.

Everything within a MRI's outermost pair of square brackets we will call its body, i.e., //bford.info/ in this example. A MRI may be viewed abstractly as a key/value pair whose scheme name is the key and whose body is the value.

This use of matched brackets gives MRIs two fundamental advantages over URIs.

First, MRIs are cleanly self-delimiting, meaning it is always clear and unambiguous where any valid MRI ends when embedded in surrounding text such as a document. Consider these two contrasting URI and corresponding MRI examples:

	My site is https://bford.info/index.html.
	My site is https[//bford.info/index.html].

As mentioned earlier, the first example (URI) leaves it ambiguous and only heuristically guessable whether the final period is part of the URI or the surrounding English sentence. The second example (MRI) leaves no such ambiguity.
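
A recognizer for MRIs in running text then needs no heuristics about trailing punctuation. The following is a rough sketch of such a recognizer, not a complete one: it assumes the URI scheme grammar, treats any whitespace as ending a candidate, and uses the made-up name extract_mris.

```python
import re

_CLOSERS = {")": "(", "]": "[", "}": "{"}
# A scheme name immediately followed by the opening bracket of the body.
_START = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\[")


def extract_mris(text: str) -> list[str]:
    """Return every well-matched scheme[body] span found in text."""
    found = []
    pos = 0
    while (m := _START.search(text, pos)) is not None:
        stack = ["["]          # the body's opening bracket
        i = m.end()
        while i < len(text) and stack:
            ch = text[i]
            if ch in "([{":
                stack.append(ch)
            elif ch in _CLOSERS:
                if stack[-1] != _CLOSERS[ch]:
                    break      # mismatched closer: not a valid MRI
                stack.pop()
            elif ch.isspace():
                break          # spaces are not allowed inside an MRI
            i += 1
        if not stack:
            found.append(text[m.start():i])
            pos = i
        else:
            pos = m.end()      # give up on this candidate and move on
    return found


print(extract_mris("My site is https[//bford.info/index.html]."))
# ['https[//bford.info/index.html]']
```

The final period is left out of the match because the body ends exactly where its opening bracket is closed.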

The real power of MRIs, however, comes from their composability. Provided that matchers like square brackets match, it is syntactically unproblematic for a valid MRI to appear anywhere in the body of another MRI, without requiring escaping or percent-encoding. To form an MRI to a web service that operates on other web pages, such as translation and checking services for example, a user merely needs to copy-and-paste the target MRI into a suitable template to form a clean and readable query MRI such as:

	https[//validator.w3.org/nu/?doc=https[//bford.info/]]

This property extends to multiple nesting levels, with no percent-encoding or progressive text expansion for successive levels. If the web page checker I wish to use is available only in English and I would like to translate its results to French, for example, then I currently end up with a delightfully-obfuscated URI with two levels of percent-encoding, like this:

	https://translate.google.com/translate?hl=en&sl=auto&tl=fr&u=https%3A%2F%2Fvalidator.w3.org%2Fnu%2F%3Fdoc%3Dhttps%253A%252F%252Fbford.info%252F

With MRIs, in contrast, such a multi-level query becomes both shorter and more comprehensible:

	https[//translate.google.com/translate?hl=en&sl=auto&tl=fr&u=https[//validator.w3.org/nu/?doc=https[//bford.info/]]]
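
The contrast can be reproduced in a few lines. In this sketch the helper names nest_uri and nest_mri are hypothetical: the URI version must percent-encode the target at every nesting level, while the MRI version is plain string concatenation.

```python
from urllib.parse import quote


def nest_uri(prefix: str, target: str) -> str:
    """Embed a target URI in a query, escaping all reserved characters."""
    return prefix + quote(target, safe="")


def nest_mri(prefix: str, target: str) -> str:
    """Embed a target MRI verbatim and close the outer body's bracket."""
    return prefix + target + "]"


check_uri = nest_uri("https://validator.w3.org/nu/?doc=", "https://bford.info/")
translate_uri = nest_uri(
    "https://translate.google.com/translate?hl=en&sl=auto&tl=fr&u=", check_uri
)

check_mri = nest_mri("https[//validator.w3.org/nu/?doc=", "https[//bford.info/]")
translate_mri = nest_mri(
    "https[//translate.google.com/translate?hl=en&sl=auto&tl=fr&u=", check_mri
)

print(translate_uri)  # second level turns each "%" into "%25"
print(translate_mri)  # the inner MRIs appear unchanged
```

Each additional URI nesting level re-escapes the percent signs introduced by the previous level, which is where the %253A and %252F sequences come from.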

Composability is not limited to query strings, but may also be used in other MRI components. A web-based service for browsing Zip files and other archives hosted on other websites, for example, might wish to treat the target MRI as a pathname component in the service's MRI. To use the service unarchiver.org to open a Zip file hosted at https[//bford.info/stuff.zip] and browse into the subdirectory /dest/path within the archive, for example, we might use a MRI like this:

	https[//unarchiver.org/unzip/https[//bford.info/stuff.zip]/dest/path/]

Finally, the ugly and problematic need for the base URI format to know about IP addresses and port number syntax in their many variants can be deprecated and replaced with use of suitable embedded MRIs in authority and host fields. For example, http://1.2.3.4/ might be deprecated in favor of http[//ip4[1.2.3.4]/], and http://[aaaa:bbbb:cccc:dddd::]/ might become http[//ip6[aaaa:bbbb:cccc:dddd::]/]. Each case generically uses a nested MRI instead of special-case base syntax.

This approach could in principle be extended even to DNS host names via the already-existing dns URI scheme, so that https://bford.info/ becomes https[//dns[bford.info]/]. The fact that DNS names are by far the common case on the web, however, probably justifies retaining their special-case base syntax.

Percent Encoding in MRIs

MRI syntax reduces but does not eliminate the need for percent-encoding, which presents some manageable subtleties. Embedding spaces or raw binary data containing control characters into components, for example, still requires percent-encoding in MRIs exactly as in URIs. In addition, any unmatched matcher characters in a MRI's body must similarly be escaped. A query to search for a literal open bracket [ – ASCII code 0x5B – on a target web page, for example, must still be percent-encoded like this:

	https[//search.org/page?string=%5B;page=https[//bford.info/]]

Within any inner pair of square brackets nested within a MRI's body, all reserved URI characters including the percent character % lose their special meanings, becoming ordinary literal characters from the perspective of the outer MRI – again provided only that matchers match. For example, while the MRI https[//bford.info/~user/] is equivalent to https[//bford.info/%7Euser/], the following two MRIs are not equivalent because they represent queries for different literal query strings (which happen to be MRIs in this case but need not necessarily be):

	https[//checker.org/web?page=https[//bford.info/~user/]]
	https[//checker.org/web?page=https[//bford.info/%7Euser/]]

The inner bracket pair essentially quotes special characters such as percent signs, protecting them as literals and ensuring that they are “owned” exclusively by the inner MRI. Viewed another way, MRI syntax protects the syntactic territorial integrity of each level of embedding. Percent-escapes outside the embedded MRI belong exclusively to the host MRI, while escapes within it belong exclusively to the embedded MRI.

For the same reason, the percent character denoting an IPv6 zone index need not and must not be percent-encoded in a MRI's host field, in contrast with standard practice for legacy URIs. For example, the IPv6 address fe80::1234%eth0 appears in a URI percent-encoded as http://[fe80::1234%25eth0], but in a MRI becomes http[//ip6[fe80::1234%eth0]]. The inner percent sign is none of the (outer) MRI's business because it is protected by the nested square brackets.

Because nested brackets make percent signs literal characters from the outer MRI's perspective, if a query string or other MRI component needs to use percent-encoding to represent control characters that occur between an inner literal bracket pair, then those bracket characters must be percent-encoded too. For example, to encode a search query for the three-character string consisting of ‘[’, NUL, and ‘]’, all three characters must be percent-encoded as %5B%00%5D. Percent-encoding only the inner NUL byte as [%00] would yield a search for the five-character string [%00], because the brackets protect the percent sign as a literal.
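
One way to picture this rule is a decoder that interprets percent-escapes only at the top level of a component and copies anything inside a bracket pair through untouched. The sketch below is such a decoder under that reading of the rule; the name decode_top_level is hypothetical and the input is assumed to have matching brackets.

```python
import re

_HEX2 = re.compile(r"[0-9A-Fa-f]{2}")


def decode_top_level(component: str) -> str:
    """Percent-decode escapes that are not inside any square brackets."""
    out = bytearray()
    depth = 0
    i = 0
    while i < len(component):
        ch = component[i]
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "%" and depth == 0 and _HEX2.fullmatch(component, i + 1, i + 3):
            # An escape owned by the outer MRI: decode it to one byte.
            out.append(int(component[i + 1:i + 3], 16))
            i += 3
        else:
            # A literal character, including "%" protected by brackets.
            out += ch.encode("utf-8")
            i += 1
    return out.decode("utf-8", errors="replace")


print(repr(decode_top_level("%5B%00%5D")))  # '[\x00]'
print(repr(decode_top_level("[%00]")))      # '[%00]'
```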

In URIs, only the unreserved characters are considered equivalent in their percent-encoded and unencoded forms and thus may be encoded or decoded at any time. In MRIs, bracketed substrings nested within the MRI's body may also be percent-encoded and decoded at any time, provided that the substring contains no forbidden characters, and provided that all reserved characters in the substring including brackets are encoded or decoded together. For example, the substring '[a[b]c]' in a MRI's body is equivalent to '%5Ba%5Bb%5Dc%5D', but is not equivalent to '[a%5Bb%5Dc]' or '%5Ba[b]c%5D'. Further, this equivalence applies only to complete substrings nested immediately within the MRI's body, not to substrings more deeply nested.

To illustrate, because both unreserved characters and complete bracketed substrings may be percent-encoded and decoded in MRIs without changing the meaning of a MRI, the following URI and three MRIs are all equivalent:

	https://checker.org/web?page=https%5B%2F%2Fbford%2Einfo%2F%5D
	https[//checker.org/web?page=https%5B%2F%2Fbford%2Einfo%2F%5D]
	https[//checker.org/web?page=https%5B%2F%2Fbford.info%2F%5D]
	https[//checker.org/web?page=https[//bford.info/]]

The following MRIs, however, are not equivalent to those above but represent different literal query strings:

	https[//checker.org/web?page=https[//bford%2Einfo/]]
	https[//checker.org/web?page=https[%2F%2Fbford%2Einfo%2F]]

URIs and IRIs already constrain scheme names to an extremely restricted character set that does not include percent-encoded characters, and MRIs retain this rule. This means that percent-encoding is “a thing” only within the bracketed body of a MRI. Percent-encodings can never appear at all in the scheme name before the opening square bracket.

The MRI Character Set

The original URI format allows only a subset of the ASCII character set to appear directly in URIs without percent-encoding, forbidding spaces and punctuation characters such as:

	< > " { } | \ ^ `

The introduction of internationalized resource identifiers or IRIs, which most browsers now support, extended the limited URI character set to allow most Unicode characters representable in UTF-8 encoding – but IRIs still conservatively require the above forbidden ASCII characters to be percent-encoded.

The successful introduction of IRIs demonstrated empirically that the character set available to resource identifiers can be expanded smoothly in a backward-compatible fashion. In the interest of further extricating ourselves from Escaping Hell, MRIs build on the IRI experience by further liberalizing the allowed character set to include all valid UTF-8 characters apart from spaces and control codes. This means that MRI queries and other components could contain such strings as arithmetic expressions like 2^8, logical expressions like (x<y)&(y<z), C++ or Java-style parameterized types like Box<Integer>, or even code blocks in C-like languages such as {printf("hello\n");}, without percent-encoding.

The URI specification recommends delimiting URIs with angle brackets in surrounding text, which is why angle brackets were defined as “unsafe” in the original specification. Because this usage of angle brackets was never a formally required part of URL syntax, however, this recommendation was never reliably followed, and relatively few ordinary users even seem to be aware of it. MRIs never need to be delimited with surrounding angle brackets because MRI syntax solves the delimiting problem more reliably within the MRI syntax itself, via a mandatory syntactic rule (matchers must match) rather than an optional and easily-neglected recommendation in an appendix. It's always syntactically clear where a MRI embedded in running text ends; otherwise it's not a (valid) MRI. And because MRIs never need to be surrounded by angle brackets in text, it is no longer problematic for them to contain unescaped angle brackets.

Embedding MRIs in popular protocols and data formats

Further character set liberalization presents tradeoffs of course. Specific applications, protocols, and formats wishing to support MRIs will either have to support the expanded MRI character set, or provide their own escaping mechanisms to handle problematic characters. A fallback solution always available is simply to down-convert MRIs to traditional URIs for embedding in legacy formats as described later.

Consider the common case of embedding MRIs into XML or HTML documents, for example. XML and HTML escaping rules require literal uses of the angle brackets < and >, the ampersand &, and sometimes single and double quotes, to be escaped using XML entity codes (&lt;, &gt;, &amp;) whenever they appear in text or attributes.

Since URIs forbid angle brackets while MRIs do not, embedding MRIs containing angle brackets within an XML or HTML file requires either percent-encoding the angle brackets in the MRI or escaping them at XML level using entity codes. This is not a fundamentally new burden, however, because URIs already allowed ampersand & characters, which already must be entity-coded in order to insert a URI into XML or HTML safely.

Why not spaces and control codes?

We could in principle liberalize the MRI character set “the rest of the way” by allowing all Unicode characters including spaces and control codes. Just as in URIs, however, allowing spaces, line breaks, or tabs to be significant characters in MRIs would compromise the basic philosophy that resource identifiers should be readily transcribable. Resource identifiers are often long and need to be wrapped when printed; if spaces were significant then it would be impossible to tell on reading or transcription whether there were zero, one, or more spaces originally at the position where the MRI was wrapped. Allowing other control codes would break languages like C that have trouble with NUL bytes in strings, would make MRIs impossible to embed reliably in text formats like XML that offer no escape codes for control characters, and would generally seem to undermine the basic philosophy of resource identifiers as moderately-compact, single-line, and nominally human-readable strings.

Why square brackets?

Why do MRIs use square brackets to delimit the body, instead of some other matching pair of punctuation characters? Square brackets are the only matching punctuation already in the gen-delims category used as generic delimiters in the base URI syntax (for IPv6 addresses), which means that most URI parsing software is already prepared to see brackets in URIs, if only rarely. Further, because brackets are reserved for delimiting URI base syntax components and are formally not allowed to appear within components, we limit the risk that particular URI schemes might assign conflicting uses to these characters.

Parentheses () seem likely to be confused, by users and URL recognizers alike, with parenthetical expressions in surrounding natural-language text. Further, since parentheses are in the sub-delims category, pre-existing uses in particular URI schemes may conflict with any new base syntax use. Angle brackets <> and curly braces {} were forbidden in URIs and IRIs, so transitioning them directly from forbidden to required seems like a bit of a leap. Finally, using non-ASCII matching delimiters just seems like even more of a leap.

Converting MRIs to URIs or IRIs

To enable incremental deployment while preserving backward compatibility with the existing URI and IRI ecosystem, applications that support MRIs must be able to “down-convert” them to URIs or IRIs in contexts not known to support MRI syntax. Since IRIs already liberalized URIs to support Unicode characters, it is easiest simply to define a conversion from MRIs to IRIs, then use the existing mapping of IRIs to URIs as needed.

A MRI may be mapped to an IRI as follows.

1. As with IRIs, if the MRI is not already represented as a Unicode-based character sequence, first convert it to one using normalization form C.
2. If the MRI's host component is an IP address using a nested ip4[…] or ip6[…] MRI, convert it to legacy URI syntax for IP addresses.
3. Percent-encode all square-bracketed substrings in the MRI's body apart from any IPv6 host address, by UTF-8 encoding then percent-encoding all characters in the substring that are reserved or forbidden in an IRI.
4. For each remaining character in the MRI that is forbidden in an IRI, convert it to a UTF-8 byte sequence, then percent-encode each byte.
5. Replace the MRI body's opening square bracket with a colon : and drop its closing square bracket.

Here are a few examples before and after conversion from MRI to IRI/URI:

	https[//bford.info/]
	https://bford.info/

	https[//ip4[1.2.3.4]/find?set={a,b,c};page=https[//bford.info/]]
	https://1.2.3.4/find?set=%7Ba%2Cb%2Cc%7D;page=https%5B%2F%2Fbford.info%2F%5D

	https[//ip6[fe80::1234%eth0]/unzip/https[//bford.info/stuff.zip]/dest/path/]
	https://[fe80::1234%25eth0]/unzip/https%5B%2F%2Fbford.info%2Fstuff.zip%5D/dest/path/
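
The conversion steps above might be sketched as follows. This is an illustrative reading of the steps rather than a reference implementation: the name mri_to_iri and the character tables are assumptions, the host is assumed to follow the leading // directly (no user@ prefix or :port suffix), and non-ASCII characters are left as-is because IRIs already allow them.

```python
import re
import string
import unicodedata

_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_IRI_FORBIDDEN = set('<>"{}|\\^`')   # ASCII punctuation forbidden in IRIs
_IP_HOST = re.compile(r"//ip([46])\[([^\[\]]*)\]")


def _pct(ch: str) -> str:
    """UTF-8 encode one character, then percent-encode each byte."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def mri_to_iri(mri: str) -> str:
    mri = unicodedata.normalize("NFC", mri)             # step 1
    scheme, rest = mri.split("[", 1)
    if not rest.endswith("]"):
        raise ValueError("MRI body must end with ']'")
    body, out = rest[:-1], []
    host = _IP_HOST.match(body)                         # step 2
    if host:
        addr = host.group(2)
        if host.group(1) == "4":
            out.append("//" + addr)
        else:  # a zone index "%" must become "%25" in the legacy form
            out.append("//[" + addr.replace("%", "%25") + "]")
        body = body[host.end():]
    depth = 0
    for ch in body:
        if ch == "[":
            depth += 1
        if depth > 0:                                   # step 3
            keep = ch in _UNRESERVED or ord(ch) > 127
            out.append(ch if keep else _pct(ch))
        else:                                           # step 4
            out.append(_pct(ch) if ch in _IRI_FORBIDDEN else ch)
        if ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ']' in MRI body")
    if depth != 0:
        raise ValueError("unmatched '[' in MRI body")
    return scheme + ":" + "".join(out)                  # step 5


print(mri_to_iri("https[//ip6[fe80::1234%eth0]/unzip/https[//bford.info/stuff.zip]/dest/path/]"))
# https://[fe80::1234%25eth0]/unzip/https%5B%2F%2Fbford.info%2Fstuff.zip%5D/dest/path/
```

Note that this sketch leaves top-level sub-delimiters such as commas unencoded, since they are not forbidden in IRIs.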

Converting URIs and IRIs to MRIs

The reverse conversion is possible too, of course. If MRIs retain support for legacy IP address syntax, then a minimal IRI to MRI conversion is simply to replace the colon after the scheme name with an open bracket and add a closing bracket at the end. Since brackets are forbidden in IRIs apart from IPv6 addresses, and MRIs handle the already-allowed characters and percent-encodings in the same way as IRIs, a percent-encoded IRI body also serves as a valid MRI body.

To improve usability and readability, however, we would like an IRI-to-MRI encoding that simplifies the resulting MRIs and extricates us from Escaping Hell as well as feasible. To this end, we can convert IRIs as follows:

1. Decode any unreserved characters that are percent-encoded in the IRI.
2. If the IRI's host component is a legacy IP address, rewrite it into a nested ip4[…] or ip6[…] MRI.
3. For each maximum-length substring that starts and ends with a matching pair of percent-encoded square brackets, and contains no percent-encoded characters forbidden in MRIs, decode all percent-encoded characters in the substring.
4. Replace the colon following the scheme name with an opening bracket, and append a closing bracket.

These rules will always successfully percent-decode all nested MRIs contained in the outer MRI's body, because valid nested MRIs percent-encoded in the original URI will have matching brackets and will not contain forbidden characters. If the original URI contains unmatched brackets, or matched brackets containing forbidden characters, these sequences are left percent-encoded in the resulting MRI.

Here is an illustrative example of a conversion from IRI/URI to MRI:

	https://search.org/find?chr=%5B;str=%5Ba%5B%5Bb%5D%5D%20%5D;page=https%5B%2F%2Fbford.info%2F%5D
	https[//search.org/find?chr=%5B;str=%5Ba[[b]]%20%5D;page=https[//bford.info/]]

Notice that while the nested MRI in the page parameter is fully decoded, the unbalanced open bracket passed to the chr parameter is not decoded. Further, only the inner two pairs of matched brackets in the str parameter are decoded, because the outer bracket pair contains a space character (%20), which is forbidden in MRIs.
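
The bracket-matching part of this conversion is the subtle one, so here is a partial sketch of it. It covers only the decoding of bracketed substrings and the final scheme rewrite; decoding stray unreserved characters and rewriting legacy IP hosts are omitted. The names are hypothetical, "forbidden" is taken to mean percent-encoded spaces and ASCII control codes, and only square brackets are matched, so a decoded span with unbalanced parentheses or braces is not detected.

```python
import re
from urllib.parse import unquote

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
_BRACKET = re.compile(r"%5[BbDd]")


def _has_forbidden(span: str) -> bool:
    """True if span has an escaped space or control code (not allowed raw)."""
    values = (int(h, 16) for h in _ESCAPE.findall(span))
    return any(v <= 0x20 or v == 0x7F for v in values)


def decode_nested(body: str) -> str:
    """Decode each maximal, decodable span delimited by matched %5B ... %5D."""
    stack, pairs = [], []
    for m in _BRACKET.finditer(body):
        if m.group().upper() == "%5B":
            stack.append(m.start())
        elif stack:                      # a close with no open stays encoded
            pairs.append((stack.pop(), m.end()))
    out, cursor = [], 0
    for start, end in sorted(pairs):     # outermost spans come first
        if start < cursor or _has_forbidden(body[start:end]):
            continue                     # already decoded, or not decodable
        out.append(body[cursor:start])
        out.append(unquote(body[start:end]))
        cursor = end
    out.append(body[cursor:])
    return "".join(out)


def iri_to_mri(iri: str) -> str:
    scheme, body = iri.split(":", 1)
    return f"{scheme}[{decode_nested(body)}]"


print(iri_to_mri(
    "https://search.org/find?chr=%5B;str=%5Ba%5B%5Bb%5D%5D%20%5D;"
    "page=https%5B%2F%2Fbford.info%2F%5D"
))
# https[//search.org/find?chr=%5B;str=%5Ba[[b]]%20%5D;page=https[//bford.info/]]
```

Sorting the matched pairs by start position visits outer spans before the spans they contain, so an outer span that cannot be decoded (here, the one containing %20) is skipped while its inner spans still get their chance.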

Mutual Embedding of URIs, IRIs, and MRIs

It should be clear already that MRIs may be embedded in URIs and IRIs, and vice versa. This will be essential to enabling incremental deployment of MRIs, and presents many potential short-term opportunities.

Operators of web services that take URIs in query strings, for example, can unilaterally upgrade their sites to accept MRIs in query strings, and can upgrade their web forms and client-side JavaScript to produce such MRI-encoded queries, without asking anyone's permission or waiting for web browsers or protocols like HTTP to catch up. These MRIs will for now be embedded within legacy URIs on their way from browser to web server, of course, and hence will still be embedded in one level of percent-encoding, just as embedded URIs would be. With a small amount of tooling, however, server-side web logging and statistics scripts can be upgraded to convert the logged outer URIs into MRIs as described above, which will render both nesting levels de-obfuscated in script outputs.

Web browsers can start supporting MRI syntax at any time as an alternative syntax in their address bars, even initially only as an experimental option. This way, early adopters and developers of web sites like those discussed above can obtain the convenience of being able to cut-and-paste one MRI directly into another without having to mess with so much percent-encoding. Ideally the user would be able to select whether to paste a MRI or down-convert it to a URI or IRI first. Firefox already approximates such a choice by down-converting to a URI when copying the whole web address, but preserving international characters from IRIs when copying only part. The browser will for now always convert the MRI to a URI for transmission over HTTP, but that's fine.

In short, since MRIs are both readily convertible to and from URIs and IRIs, and easily mutually embeddable with them, it should not be difficult to get them to “play along” in many useful ways even in the short term.

Conclusion

This post is intended merely to outline and explore a possible solution to some of the usability problems of URIs and IRIs. We have not, of course, defined a fully-precise syntax for MRIs or answered all the detailed questions needed to deploy them. However, I hope this preliminary exploration is adequate to offer an impression of the potential benefits of delimited resource identifiers, as well as the realistic challenges of implementing them and making them interoperable and backward-compatible with URIs and IRIs. For now, this sketch hopefully provides a basis from which we may experiment with MRIs and see if they can fulfill their potential in practice.

Bryan Ford