URIs and IRIs have two major usability flaws, however. First, you can't reliably tell where they end in surrounding text. Suppose I type or paste into an E-mail:
My site is https://bford.info/index.html.
Rich text editors often notice the URI and helpfully turn it into a hyperlink – but is the final period part of the link or the surrounding sentence? It could be either, since periods are valid and unreserved URI characters, and a wrong heuristic guess yields a broken link. Suppose I try to “armor” the URI using parentheses, like this:
See my website (https://bford.info/).
This doesn't really help because parentheses are valid in URIs too, and heavily used by Wikipedia for example – as in this cruel joke of a URI:
https://en.wikipedia.org/wiki/URL_(disambiguation)
The URI specification recommends,
in an obscure appendix,
delimiting URIs with angle brackets,
as in <https://bford.info/>.
But how many users are even aware of this suggestion,
let alone consistently follow it?
A second major usability flaw is that URIs are not cleanly composable
without sucking users and developers alike into what I call Escaping Hell.
When we need to compose a URI that contains other URIs –
or bits of text with any other nontrivial syntactic structure,
for that matter –
we must escape reserved or forbidden characters,
by percent-encoding
the embedded string into an ugly, obfuscated, unfriendly mess
like https%3A%2F%2Fbford.info%2F.
This post suggests a possible fix to both flaws and a path out of Escaping Hell: an incrementally-deployable alternative syntax I call a matchertext resource identifier or MRI. Matchertext is, in brief, a language-neutral syntactic discipline that allows any compliant string in one syntax to be embedded verbatim into any other compliant syntax, via simple "copy-and-paste" – with no escaping or other transformation.
To create a matchertext resource identifier,
we in essence simply rewrite any URI of the form
scheme:body
into MRI syntax of the form
scheme[body].
Any URI like https://bford.info/
has an equivalent MRI syntax like https[//bford.info/],
and vice versa.
(“MRI” may be pronounced
like “murrie” to avoid confusion with medical procedures.)
The key benefit of this alternative syntax is that any valid MRI can be embedded into any other MRI, verbatim, via copy-and-paste or manual transcription for example, with no mangling. A query to translate my home page, for example, might look like this:
https[//translator.org/web?page=https[//bford.info/];lang=fr]
Just copy the target MRI into the query field and we're done.
There's no free lunch, of course.
There is one key syntactic “cost”:
MRIs must conform to the
matchertext discipline,
which in brief means that the ASCII matcher characters –
namely parentheses (), square brackets [], and curly braces {} –
must strictly match up in pairs.
We will examine the implications of this syntactic discipline below.
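As a minimal sketch of what this discipline demands, the following Python function checks that the three kinds of ASCII matchers pair up and nest properly. The name matchers_match is purely illustrative and not part of any specification; the sketch assumes only the strict pairing rule just stated.

```python
# Illustrative sketch: check the matchertext rule that (), [] and {} pair up.
OPENER_FOR = {")": "(", "]": "[", "}": "{"}
OPENERS = set(OPENER_FOR.values())

def matchers_match(text: str) -> bool:
    """Return True if every matcher in text is properly paired and nested."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in OPENER_FOR:
            # A closer must match the most recently opened matcher.
            if not stack or stack.pop() != OPENER_FOR[ch]:
                return False
    return not stack  # leftover openers mean an unmatched matcher

print(matchers_match("https[//translator.org/web?page=https[//bford.info/];lang=fr]"))
print(matchers_match("https[//search.org/page?string=[]"))
```

The first call reports a compliant string, while the second is rejected because one open bracket never closes.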
Just as IRIs already liberalized resource identifiers to admit non-ASCII characters, MRIs similarly liberalize resource identifiers to allow all graphical Unicode characters within the body – provided only that matchers match. Any MRI may be converted to an equivalent URI or IRI and vice versa. New applications, protocols, formats, and versions can start accepting MRIs from users alongside URIs and IRIs while preserving backward compatibility through automatic conversions. Let's de-obfuscate the web – or at least web addresses.
Since the end-finding flaw mentioned above is simple and well-known, let's focus a bit more on the problem of Escaping Hell.
We often wish to embed URIs within URIs for many reasons.
Many web-based services operate on other web pages,
for example,
such as
archival,
translation,
performance testing,
syntax checking, and
link checking.
Such services need to accept URIs containing arbitrary other URIs,
either in their pathname components
(like https://archiver.org/date/target-URI),
or in query strings
(like https://checker.org/check?page=target-URI).
But because the outer URI treats many characters as special,
those characters must be
escaped or
percent-encoded
in the nested URI.
Entering a simple target URI like https://bford.info/
into the W3C's official
markup validation service,
for example,
mangles the target URI into this
obscure percent-encoded query string:
https://validator.w3.org/nu/?doc=https%3A%2F%2Fbford.info%2F
So much for URIs being “human-readable”, at least in these contexts. Of course interactive web users don't need to do this mangling themselves because the service's front-page web form tooling does it for them, and developers who generate queries to the service programmatically are expected to percent-encode the target URI as part of the process. This is why we have managed to slog through Escaping Hell so far. But from a usability perspective, it's sad that target URIs must become nearly unrecognizable when embedded into other URIs. It's also sad that less-sophisticated users can't reasonably be expected to write or edit such a query string manually, or to copy-and-paste a target URI into a query field of another URI template, without thoroughly understanding URI syntax and poring over an ASCII table.
Besides embedded URIs,
many other bits of text we often want to embed
in URI pathname components or queries
must commonly be mangled via percent-encoding,
because they contain characters that are either reserved
or forbidden entirely within URIs.
For example, a query to a calculator service to evaluate
a simple expression like 6/2+1
gets mangled into a query
like https://calc.org/eval?expr=6%2F2%2B1.
A query for a C++ or Java-style parameterized type
like Box<Integer>
gets mangled
like https://dev-docs.org/search?type=Box%3CInteger%3E.
A query involving a Unix-style pathname like /a/b/c
gets mangled
like https://unarchiver.org/extract?path=%2Fa%2Fb%2Fc.
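To make the mangling concrete, here is a small sketch that reproduces the three examples above with Python's standard urllib.parse.quote. The helper name embed is hypothetical, and passing safe="" is my assumption about how a careful developer would encode a value destined for a query field.

```python
from urllib.parse import quote

def embed(template: str, value: str) -> str:
    """Append value to a URI template, percent-encoding all reserved characters."""
    return template + quote(value, safe="")

print(embed("https://calc.org/eval?expr=", "6/2+1"))
print(embed("https://dev-docs.org/search?type=", "Box<Integer>"))
print(embed("https://unarchiver.org/extract?path=", "/a/b/c"))
```

Every slash, plus sign, and angle bracket in the embedded value turns into a three-character escape, which is exactly the readability cost described above.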
Experience suggests that one of the most critical components of URIs –
the authority field containing the host name –
really “wants” the flexibility and modularity of an embedded URI.
The authority component
internally needs multiple “schemes”
to name hosts in different ways,
such as DNS names like bford.info
versus IP addresses like 1.2.3.4.
It has non-trivial internal syntactic structure,
such as an optional user@ prefix
and :port suffix.
It has repeatedly proven to need extensibility,
to handle IPv6 addresses
like [aaaa:bbbb:cccc:dddd::]
for example.
In short, the authority component of the base URI syntax has grown
complex and syntaxy
with many details and variations,
exacerbating
security issues.
Supporting different schemes, internal structure, and extensibility
are precisely what the whole URI naming paradigm is supposed to handle –
but it can't do so for authority naming because of broken syntax.
Extending IPv6 addresses to support
zone indexes
like fe80::1234%1
or fe80::1234%eth0
created a further syntactic conflict between this use of
the percent (‘%’) character
and its use in URIs for percent-encoded escapes.
This conflict then required
a whole additional RFC
just to say that the percent sign in a zoned IPv6 address
needs to be escaped as %25
in a URI host field,
producing outstandingly-clear URIs like
http://[fe80::1234%251]/.
(Is the zone index 1 or 251?)
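A tiny sketch shows how the escaped form arises and why a reader has to decode it mentally to recover the zone. The function name zone_to_uri_host is made up for illustration; only urllib.parse.unquote is a real library call here.

```python
from urllib.parse import unquote

def zone_to_uri_host(address: str) -> str:
    """Wrap a zoned IPv6 address for a legacy URI host field, escaping '%' as '%25'."""
    return "[" + address.replace("%", "%25") + "]"

host = zone_to_uri_host("fe80::1234%1")
print(host)                      # the bracketed, escaped host field
print(unquote(host.strip("[]"))) # decoding recovers the original zone index
```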
The multiaddr project,
while well-intentioned,
threatens to dig us even deeper into Escaping Hell
as we try to figure out how to shoehorn a pathname-like address
like /ip4/127.0.0.1/udp/9090/quic
into the hostname field of a URI.
How would a parser figure out what's what in a string like
http://ip4/127.0.0.1/udp/9090/quic/index.html?
Alternatively, how is a user to understand an escaped embedding like
http://%2Fip4%2f127.0.0.1%2fudp%2f9090%2fquic/index.html?
Current multiaddr examples suggest restructuring URIs
to move the scheme or protocol name
into the multiaddr itself, as in
/ip4/127.0.0.1/tcp/80/http/baz.jpg –
but is such a fundamental reworking of how URIs are structured and parsed
really going to fly on the established web?
As a composition-friendly and backward-compatible approach
to making URIs more usable,
I propose matchertext resource identifiers or MRIs.
To a first approximation,
a MRI merely replaces the colon :
following the scheme name
in a URI or IRI with an open bracket [,
and adds a close bracket ] at the end.
The URI https://bford.info/
is equivalent
to the MRI https[//bford.info/], for example.
Everything within a MRI's outermost pair of square brackets
we will call its body,
i.e., //bford.info/
in this example.
A MRI may be viewed abstractly as a
key/value pair
whose scheme name is the key
and whose body is the value.
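The first-approximation rewrite and the key/value view can be sketched in a few lines of Python. Both function names are hypothetical, and the sketch deliberately skips validation such as checking that matchers in the body match.

```python
def uri_to_mri_minimal(uri: str) -> str:
    """Rewrite scheme:body as scheme[body], with no decoding or validation."""
    scheme, sep, body = uri.partition(":")
    if not sep:
        raise ValueError("no scheme separator found")
    return scheme + "[" + body + "]"

def split_mri(mri: str) -> tuple:
    """Return the (scheme, body) pair, i.e. the key and value of an MRI."""
    scheme, sep, rest = mri.partition("[")
    if not sep or not rest.endswith("]"):
        raise ValueError("not in scheme[body] form")
    return scheme, rest[:-1]

print(uri_to_mri_minimal("https://bford.info/"))
print(split_mri("https[//bford.info/]"))
```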
This use of matched brackets gives MRIs two fundamental advantages over URIs.
First, MRIs are cleanly self-delimiting, meaning it is always clear and unambiguous where any valid MRI ends when embedded in surrounding text such as a document. Consider these two contrasting URI and corresponding MRI examples:
My site is https://bford.info/index.html.
My site is https[//bford.info/index.html].
As mentioned earlier, the first example (URI) leaves it ambiguous and only heuristically guessable whether the final period is part of the URI or the surrounding English sentence. The second example (MRI) leaves no such ambiguity.
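To illustrate the self-delimiting property, here is a rough sketch of how a text editor might pull MRIs out of running text without any heuristics. The function find_mris and its scheme pattern are my own assumptions; the sketch simply scans from the opening bracket until the matchers balance, and gives up on whitespace or a mismatched closer.

```python
import re

OPENER_FOR = {")": "(", "]": "[", "}": "{"}
# Assumed scheme pattern: a letter followed by letters, digits, '+', '.', or '-'.
SCHEME_OPEN = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\[")

def find_mris(text: str) -> list:
    """Return every well-formed MRI found in text, in order of appearance."""
    found, pos = [], 0
    while (m := SCHEME_OPEN.search(text, pos)):
        stack, i = ["["], m.end()
        while i < len(text) and stack:
            ch = text[i]
            if ch in "([{":
                stack.append(ch)
            elif ch in OPENER_FOR:
                if stack[-1] != OPENER_FOR[ch]:
                    break          # mismatched closer: not a valid MRI
                stack.pop()
            elif ch.isspace():
                break              # spaces are forbidden inside an MRI
            i += 1
        if stack:
            pos = m.end()          # no balanced end found; keep scanning
        else:
            found.append(text[m.start():i])
            pos = i
    return found

print(find_mris("My site is https[//bford.info/index.html]."))
```

The trailing period is never considered part of the identifier, because the scan stops at the bracket that closes the body.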
The real power of MRIs, however, comes from their composability. Provided that matchers like square brackets match, it is syntactically unproblematic for a valid MRI to appear anywhere in the body of another MRI, without requiring escaping or percent-encoding. To form an MRI to a web service that operates on other web pages, such as translation and checking services for example, a user merely needs to copy-and-paste the target MRI into a suitable template to form a clean and readable query MRI such as:
https[//validator.w3.org/nu/?doc=https[//bford.info/]]
This property extends to multiple nesting levels, with no percent-encoding or progressive text expansion for successive levels. If the web page checker I wish to use is available only in English and I would like to translate its results to French, for example, then I currently end up with a delightfully-obfuscated URI with two levels of percent-encoding, like this:
https://translate.google.com/translate?hl=en&sl=auto&tl=fr&u=https%3A%2F%2Fvalidator.w3.org%2Fnu%2F%3Fdoc%3Dhttps%253A%252F%252Fbford.info%252F
With MRIs, in contrast, such a multi-level query becomes both shorter and more comprehensible:
https[//translate.google.com/translate?hl=en&sl=auto&tl=fr&u=https[//validator.w3.org/nu/?doc=https[//bford.info/]]]
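The contrast between the two nesting styles can be reproduced with a short sketch. The URI side uses Python's standard urllib.parse.quote with safe="" at each level; the MRI side is plain string concatenation. The variable names are illustrative only.

```python
from urllib.parse import quote

# URI style: every nesting level re-encodes everything beneath it.
target_uri = "https://bford.info/"
checker_uri = "https://validator.w3.org/nu/?doc=" + quote(target_uri, safe="")
translate_uri = (
    "https://translate.google.com/translate?hl=en&sl=auto&tl=fr&u="
    + quote(checker_uri, safe="")
)

# MRI style: each level is pasted verbatim and then closed with a bracket.
target_mri = "https[//bford.info/]"
checker_mri = "https[//validator.w3.org/nu/?doc=" + target_mri + "]"
translate_mri = (
    "https[//translate.google.com/translate?hl=en&sl=auto&tl=fr&u="
    + checker_mri + "]"
)

print(translate_uri)
print(translate_mri)
```

Note how the innermost percent signs themselves get escaped as %25 on the URI side, so the text grows with every level, whereas the MRI grows by exactly one closing bracket per level.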
Composability is not limited to query strings,
but may also be used in other MRI components.
A web-based service for browsing
Zip files
and other archives hosted on other websites, for example,
might wish to treat the target MRI as a pathname component
in the service's MRI.
To use the service unarchiver.org
to open a Zip file hosted at
https[//bford.info/stuff.zip]
and browse into the subdirectory /dest/path
within the archive, for example,
we might use a MRI like this:
https[//unarchiver.org/unzip/https[//bford.info/stuff.zip]/dest/path/]
Finally, the ugly and problematic need for the base URI format
to know about IP addresses and port number syntax in their many variants
can be deprecated and replaced with use of suitable embedded MRIs
in authority and host fields.
For example,
http://1.2.3.4/
might be deprecated
in favor of http[//ip4[1.2.3.4]/],
and http://[aaaa:bbbb:cccc:dddd::]/
might become http[//ip6[aaaa:bbbb:cccc:dddd::]/].
Each case generically uses a nested MRI instead of special-case base syntax.
This approach could in principle be extended even to DNS host names
via the already-existing
dns URI scheme,
so that https://bford.info/
becomes
https[//dns[bford.info]/].
The fact that DNS names are by far the common case on the web, however,
probably justifies retaining their special-case base syntax.
MRI syntax reduces but does not eliminate the need for percent-encoding,
which presents some manageable subtleties.
Embedding spaces or raw binary data containing
control characters into components, for example,
still requires percent-encoding in MRIs exactly as in URIs.
In addition, any unmatched matcher characters in a MRI's body
must similarly be escaped.
A query to search for a literal open bracket [ –
ASCII code 0x5B –
on a target web page, for example,
must still be percent-encoded like this:
https[//search.org/page?string=%5B;page=https[//bford.info/]]
Within any inner pair of square brackets nested within a MRI's body,
all reserved URI characters including the percent character %
lose their special meanings, becoming ordinary literal characters
from the perspective of the outer MRI –
again provided only that matchers match.
For example, while the MRI
https[//bford.info/~user/]
is equivalent to
https[//bford.info/%7Euser/],
the following two MRIs are not equivalent
because they represent queries for different literal query strings
(which happen to be MRIs in this case but need not necessarily be):
https[//checker.org/web?page=https[//bford.info/~user/]]
https[//checker.org/web?page=https[//bford.info/%7Euser/]]
The inner bracket pair essentially quotes special characters such as percent signs, protecting them as literals and ensuring that they are “owned” exclusively by the inner MRI. Viewed another way, MRI syntax protects the syntactic territorial integrity of each level of embedding. Percent-escapes outside the embedded MRI belong exclusively to the host MRI, while escapes within it belong exclusively to the embedded MRI.
For the same reason, the percent character denoting an IPv6 zone index
need not and must not be percent-encoded in a MRI's host field,
in contrast with
standard practice for legacy URIs.
For example, the IPv6 address
fe80::1234%eth0
appears in a URI percent-encoded as
http://[fe80::1234%25eth0],
but in a MRI becomes
http[//ip6[fe80::1234%eth0]].
The inner percent sign is none of the (outer) MRI's business
because it is protected by the nested square brackets.
Because nested brackets make percent signs literal characters
from the outer MRI's perspective,
if a query string or other MRI component needs to use percent-encoding
to represent control characters that occur between
an inner literal bracket pair,
then those bracket characters must be percent-encoded too.
For example, to encode a search query for the three-character string
‘[’,NUL,‘]’,
all three characters must be percent-encoded as
%5B%00%5D.
Percent-encoding only the inner NUL byte as
[%00]
would yield a search for the 5-character string
[%00],
because the brackets protect the percent sign as a literal.
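The following sketch shows one way a receiver might decode a single MRI component under this rule: percent-escapes are decoded only outside nested square brackets, and bracketed substrings pass through verbatim. The function decode_component is hypothetical, and tracking only square brackets is a simplifying assumption on my part.

```python
from urllib.parse import unquote

def decode_component(component: str) -> str:
    """Percent-decode a component, leaving bracketed substrings untouched."""
    out, start, depth = [], 0, 0
    for i, ch in enumerate(component):
        if ch == "[":
            if depth == 0:
                # Text before the bracket belongs to the outer MRI: decode it.
                out.append(unquote(component[start:i]))
                start = i
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched close bracket")
            if depth == 0:
                # The bracketed substring is kept literally, '%' included.
                out.append(component[start:i + 1])
                start = i + 1
    if depth != 0:
        raise ValueError("unmatched open bracket")
    out.append(unquote(component[start:]))
    return "".join(out)

print(len(decode_component("%5B%00%5D")))  # three characters: '[', NUL, ']'
print(len(decode_component("[%00]")))      # five literal characters
```

The same function leaves https[//bford.info/%7Euser/] untouched when it appears as a query value, which is why the two checker.org queries shown earlier are not equivalent.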
In URIs, only the
unreserved characters
are considered equivalent in their percent-encoded and unencoded forms
and thus may be
encoded or decoded at any time.
In MRIs, bracketed substrings nested within the MRI's body
may also be percent-encoded and decoded at any time,
provided that the substring contains no forbidden characters,
and provided that all reserved characters
in the substring including brackets
are encoded or decoded together.
For example, the substring '[a[b]c]' in a MRI's body
is equivalent to '%5Ba%5Bb%5Dc%5D',
but is not equivalent to '[a%5Bb%5Dc]'
or '%5Ba[b]c%5D'.
Further, this equivalence applies only to complete substrings nested
immediately within the MRI's body, not to substrings more deeply nested.
To illustrate, because both unreserved characters and complete bracketed substrings may be percent-encoded and decoded in MRIs without changing the meaning of a MRI, the following URI and three MRIs are all equivalent:
https://checker.org/web?page=https%5B%2F%2Fbford%2Einfo%2F%5D
https[//checker.org/web?page=https%5B%2F%2Fbford%2Einfo%2F%5D]
https[//checker.org/web?page=https%5B%2F%2Fbford.info%2F%5D]
https[//checker.org/web?page=https[//bford.info/]]
The following MRIs, however, are not equivalent to those above but represent different literal query strings:
https[//checker.org/web?page=https[//bford%2Einfo/]]
https[//checker.org/web?page=https[%2F%2Fbford%2Einfo%2F]]
URIs and IRIs already constrain scheme names to an extremely restricted character set that does not include percent-encoded characters, and MRIs retain this rule. This means that percent-encoding is “a thing” only within the bracketed body of a MRI. Percent-encodings can never appear at all in the scheme name before the opening square bracket.
The original URI format allows only a subset of the ASCII character set to appear directly in URIs without percent-encoding, forbidding spaces and punctuation characters such as:
< > " { } | \ ^ `
The introduction of internationalized resource identifiers or IRIs, which most browsers now support, extended the limited URI character set to allow most Unicode characters representable in UTF-8 encoding – but IRIs still conservatively require the above forbidden ASCII characters to be percent-encoded.
The successful introduction of IRIs demonstrated empirically that
the character set available to resource identifiers can be expanded
smoothly in a backward-compatible fashion.
In the interest of
further extricating ourselves from Escaping Hell,
MRIs build on the IRI experience by further liberalizing
the allowed character set to include
all valid UTF-8 characters apart from spaces and
control codes.
This means that MRI queries and other components
could contain such strings as
arithmetic expressions like 2^8,
logical expressions like (x<y)&(y<z),
C++ or Java-style parameterized types like Box<Integer>,
or even code blocks in C-like languages
such as {printf("hello\n");},
without percent-encoding.
The URI specification recommends delimiting URIs with angle brackets in surrounding text, which is why angle brackets were defined as “unsafe” in the original specification. Because this usage of angle brackets was never a formally required part of URL syntax, however, this recommendation was never reliably followed, and relatively few ordinary users even seem to be aware of it. MRIs never need to be delimited with surrounding angle brackets because MRI syntax solves the delimiting problem more reliably within the MRI syntax itself, via a mandatory syntactic rule (matchers must match) rather than an optional and easily-neglected recommendation in an appendix. It's always syntactically clear where a MRI embedded in running text ends, otherwise it's not a (valid) MRI. And because MRIs never need be surrounded by angle brackets in text, it is no longer problematic for them to contain unescaped angle brackets.
Further character set liberalization presents tradeoffs of course. Specific applications, protocols, and formats wishing to support MRIs will either have to support the expanded MRI character set, or provide their own escaping mechanisms to handle problematic characters. A fallback solution always available is simply to down-convert MRIs to traditional URIs for embedding in legacy formats as described later.
Consider the common case of embedding MRIs into
XML or
HTML documents,
for example.
XML and HTML escaping rules
require literal uses of the
angle brackets < and >,
the ampersand &,
and sometimes single and double quotes,
to be escaped using XML entity codes
(&lt;, &gt;, &amp;)
whenever they appear in text or attributes.
Since URIs forbid angle brackets while MRIs do not,
embedding MRIs containing angle brackets within an XML or HTML file
requires either percent-encoding the angle brackets in the MRI
or escaping them at XML level using entity codes.
This is not a fundamentally new burden, however,
because URIs already allowed
ampersand & characters,
which already must be entity-coded
in order to insert a URI into XML or HTML safely.
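As a small sketch of the XML-level option, Python's standard html.escape can entity-code an MRI before it is placed in an attribute. The example MRI and the anchor markup are invented for illustration.

```python
from html import escape

# A hypothetical MRI whose query contains angle brackets and an ampersand.
mri = "https[//calc.org/eval?expr=(x<y)&(y<z)]"

# Entity-code the MRI so it can sit safely inside an HTML or XML attribute.
anchor = '<a href="' + escape(mri, quote=True) + '">evaluate</a>'
print(anchor)
```

The MRI itself stays free of percent-encoding; only the surrounding markup layer applies its own escapes, just as it already must for ampersands in ordinary URIs.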
We could in principle liberalize the MRI character set “the rest of the way” by allowing all Unicode characters including spaces and control codes. Just as in URIs, however, allowing spaces, line breaks, or tabs to be significant characters in MRIs would compromise the basic philosophy that resource identifiers should be readily transcribable. Resource identifiers are often long and need to be wrapped when printed; if spaces were significant then it would be impossible to tell on reading or transcription whether there were zero, one, or more spaces originally at the position where the MRI was wrapped. Allowing other control codes would break languages like C that have trouble with NUL bytes in strings, would make MRIs impossible to embed reliably in text formats like XML that offer no escape codes for control characters, and would generally seem to undermine the basic philosophy of resource identifiers as moderately-compact, single-line, and nominally human-readable strings.
Why do MRIs use square brackets to delimit the body,
instead of some other matching pair of punctuation characters?
Square brackets are the only matching punctuation already in the
gen-delims
category
used as generic delimiters in the base URI syntax (for IPv6 addresses),
which means that most URI parsing software is already prepared
to see brackets in URIs, if only rarely.
Further, because brackets are reserved for delimiting URI base syntax components
and are formally not allowed to appear within components,
we limit the risk that particular URI schemes
might assign conflicting uses to these characters.
Parentheses ()
seem likely to be confused, by users and URL recognizers alike,
with parenthetical expressions in surrounding natural-language text.
Further, since parentheses are in the
sub-delims
category,
pre-existing uses in particular URI schemes may conflict
with any new base syntax use.
Angle brackets (<>)
and curly braces {}
were forbidden in URIs and IRIs,
so transitioning them directly from forbidden to required
seems like a bit of a leap.
Finally, using non-ASCII matching delimiters
just seems like even more of a leap.
To enable incremental deployment while preserving backward compatibility with the existing URI and IRI ecosystem, applications that support MRIs must be able to “down-convert” them to URIs or IRIs in contexts not known to support MRI syntax. Since IRIs already liberalized URIs to support Unicode characters, it is easiest simply to define a conversion from MRIs to IRIs, then use the existing mapping of IRIs to URIs as needed.
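A MRI may be mapped to an IRI as follows.
1. If the host field is an ip4[…] or ip6[…] MRI,
convert it to legacy URI syntax for IP addresses.
2. Percent-encode each remaining bracketed substring in the body,
brackets and reserved characters together,
along with any other characters that IRIs forbid.
3. Replace the MRI's opening square bracket with a colon :
and drop its closing square bracket.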
Here are a few examples before and after conversion from MRI to IRI/URI:
MRI: https[//bford.info/]
IRI/URI: https://bford.info/
MRI: https[//ip4[1.2.3.4]/find?set={a,b,c};page=https[//bford.info/]]
IRI/URI: https://1.2.3.4/find?set=%7Ba%2Cb%2Cc%7D;page=https%5B%2F%2Fbford.info%2F%5D
MRI: https[//ip6[fe80::1234%eth0]/unzip/https[//bford.info/stuff.zip]/dest/path/]
IRI/URI: https://[fe80::1234%25eth0]/unzip/https%5B%2F%2Fbford.info%2Fstuff.zip%5D/dest/path/
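A down-converter consistent with these examples might look like the sketch below. It is only a sketch under my own assumptions, inferred from the examples rather than from any precise specification: top-level groups in square brackets or curly braces are percent-encoded whole, parentheses are left alone because URIs already allow them, and the names mri_to_uri, HOST, and FORBIDDEN are hypothetical.

```python
import re
from urllib.parse import quote

HOST = re.compile(r"^//(ip4|ip6)\[([^\[\]]*)\]")  # assumed host-field form
FORBIDDEN = set('<>"|\\^`')                       # ASCII that URIs forbid
CLOSER_FOR = {"[": "]", "{": "}"}

def mri_to_uri(mri: str) -> str:
    scheme, sep, rest = mri.partition("[")
    if not sep or not rest.endswith("]"):
        raise ValueError("not in scheme[body] form")
    body = rest[:-1]

    # Convert an ip4[...] or ip6[...] host to legacy IP address syntax.
    prefix = ""
    m = HOST.match(body)
    if m:
        kind, addr = m.groups()
        if kind == "ip4":
            prefix = "//" + addr
        else:
            prefix = "//[" + addr.replace("%", "%25") + "]"
        body = body[m.end():]

    # Percent-encode each top-level bracketed group as a unit.
    out, stack, start = [], [], 0
    for i, ch in enumerate(body):
        if ch in CLOSER_FOR:
            if not stack:
                start = i
            stack.append(CLOSER_FOR[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
            if not stack:
                out.append(quote(body[start:i + 1], safe=""))
        elif ch in "]}":
            raise ValueError("unmatched matcher in body")
        elif not stack:
            out.append(quote(ch, safe="") if ch in FORBIDDEN else ch)
    if stack:
        raise ValueError("unmatched matcher in body")

    # Swap the outer brackets for the colon of legacy syntax.
    return scheme + ":" + prefix + "".join(out)

print(mri_to_uri("https[//ip4[1.2.3.4]/find?set={a,b,c};page=https[//bford.info/]]"))
```

Because quote emits UTF-8 percent-escapes for non-ASCII characters inside encoded groups, the result is already a plain URI for those groups; characters outside them are passed through as an IRI would allow.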
The reverse conversion is possible too, of course. If MRIs retain support for legacy IP address syntax, then a minimal IRI to MRI conversion is simply to replace the colon after the scheme name with an open bracket and add a closing bracket at the end. Since brackets are forbidden from IRIs apart from IPv6 addresses, and MRIs handle the already-allowed characters and percent-encodings in the same way as IRIs, a percent-encoded IRI body also serves as a valid MRI body.
To improve usability and readability, however, we would like an IRI-to-MRI encoding that simplifies the resulting MRIs and extricates us from Escaping Hell as well as feasible. To this end, we can convert IRIs as follows:
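1. Replace the colon after the scheme name with an open bracket
and add a closing bracket at the end.
2. If the host field is a legacy IP address,
convert it to an ip4[…] or ip6[…] MRI.
3. Percent-decode each percent-encoded bracketed substring
whose brackets match and which contains no forbidden characters.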
These rules will always successfully percent-decode all nested MRIs contained in the outer MRI's body, because valid nested MRIs percent-encoded in the original URI will have matching brackets and will not contain forbidden characters. If the original URI contains unmatched brackets, or matched brackets containing forbidden characters, these sequences are left percent-encoded in the resulting MRI.
Here is an illustrative example of a conversion from IRI/URI to MRI:
IRI/URI: https://search.org/find?chr=%5B;str=%5Ba%5B%5Bb%5D%5D%20%5D;page=https%5B%2F%2Fbford.info%2F%5D
MRI: https[//search.org/find?chr=%5B;str=%5Ba[[b]]%20%5D;page=https[//bford.info/]]
Notice that while the nested MRI in the page parameter
is fully decoded,
the unbalanced open bracket passed to the chr parameter
is not decoded.
Further, only the inner two pairs of matched brackets
in the str parameter are decoded,
because the outer bracket pair contains a space character (%20),
which is forbidden in MRIs.
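The selective decoding in this example can be sketched as follows. This covers only the percent-decoding step and omits the IP address host conversion; it also assumes the rest of the body already satisfies the matchertext rule. The names uri_to_mri, _acceptable, and ENCODED_BRACKET are my own.

```python
import re
from urllib.parse import unquote

ENCODED_BRACKET = re.compile(r"%5[BbDd]")
OPENER_FOR = {")": "(", "]": "[", "}": "{"}

def _acceptable(text: str) -> bool:
    """True if text has no spaces or control codes and its matchers match."""
    stack = []
    for ch in text:
        if ch.isspace() or ord(ch) < 0x20 or ord(ch) == 0x7F:
            return False
        if ch in "([{":
            stack.append(ch)
        elif ch in OPENER_FOR and (not stack or stack.pop() != OPENER_FOR[ch]):
            return False
    return not stack

def uri_to_mri(uri: str) -> str:
    scheme, sep, body = uri.partition(":")
    if not sep:
        raise ValueError("no scheme separator found")

    # Pair up encoded brackets, remembering where each group ends.
    opens, group_end = [], {}
    for m in ENCODED_BRACKET.finditer(body):
        if m.group().upper() == "%5B":
            opens.append(m.start())
        elif opens:
            group_end[opens.pop()] = m.end()

    # Decode the outermost groups that are safe; otherwise look inside them.
    out, i = [], 0
    while i < len(body):
        end = group_end.get(i)
        if end is not None:
            plain = unquote(body[i:end])
            if _acceptable(plain):
                out.append(plain)
                i = end
                continue
        out.append(body[i])
        i += 1
    return scheme + "[" + "".join(out) + "]"

print(uri_to_mri("https://search.org/find?chr=%5B;str=%5Ba%5B%5Bb%5D%5D%20%5D;page=https%5B%2F%2Fbford.info%2F%5D"))
```

When an outer group cannot be decoded, the scan simply steps past its encoded open bracket, which gives any inner groups their own chance to be decoded.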
It should be clear already that MRIs may be embedded in URIs and IRIs, and vice versa. This will be essential to enabling incremental deployment of MRIs, and presents many potential short-term opportunities.
Operators of web services that take URIs in query strings, for example, can unilaterally upgrade their sites to accept MRIs in query strings, and can upgrade their web forms and client-side JavaScript to produce such MRI-encoded queries, without asking anyone's permission or waiting for web browsers or protocols like HTTP to catch up. These MRIs will for now be embedded within legacy URIs on their way from browser to web server, of course, and hence will still be embedded in one level of percent-encoding, just as embedded URIs would be. With a small amount of tooling, however, server-side web logging and statistics scripts can be upgraded to convert the logged outer URIs into MRIs as described above, which will render both nesting levels de-obfuscated in script outputs.
Web browsers can start supporting MRI syntax at any time as an alternative syntax in their address bars, even initially only as an experimental option. This way, early adopters and developers of web sites like those discussed above can obtain the convenience of being able to cut-and-paste one MRI directly into another without having to mess with so much percent-encoding. Ideally the user would be able to select whether to paste a MRI or down-convert it to a URI or IRI first. Firefox already approximates such a choice by down-converting to a URI when copying the whole web address, but preserving international characters from IRIs when copying only part. The browser will for now always convert the MRI to a URI for transmission over HTTP, but that's fine.
In short, since MRIs are both readily convertible to and from URIs and IRIs, and easily mutually embeddable with them, it should not be difficult to get them to “play along” in many useful ways even in the short term.
This post is intended merely to outline and explore
a possible solution to some of the usability problems of URIs and IRIs.
We have not, of course, defined a fully-precise syntax for MRIs
or answered all the detailed questions needed to deploy them.
However, I hope this preliminary exploration is adequate
to offer an impression of the potential benefits
of delimited resource identifiers,
as well as the realistic challenges of implementing them
and making them interoperable and backward-compatible with URIs and IRIs.
For now, this sketch hopefully provides a basis
from which we may experiment with MRIs
and see if they can fulfill their potential in practice.
Bryan Ford |