
Composable Text Markup Language (CTML)

CTML is simply an alternate character-level syntax for HTML. It otherwise hews closely to HTML (currently HTML5 in particular), and is designed to be easily convertible both to and from HTML.

The main goal of CTML is to satisfy the CTS metasyntactic discipline, so that CTML can be readily composed with other CTS-compliant languages without escaping. Secondary goals are to be more concise, readable, and easily typeable than HTML, while keeping the syntax simple and readily cross-convertible with HTML. CTML does not try to be as “intuitively” readable as Markdown, for example, but is much simpler syntactically (e.g., no complex layout-dependent rules), and much easier to generate automatically (e.g., via conversion from HTML).

CTML is also syntactically much simpler than HTML. Like XML/XHTML, CTML is more rigorous in enforcing the rules it does specify (e.g., that matchers must match). But it avoids many of the syntactic quirks and complexities of HTML syntax derived from SGML tradition and decades of gradual evolution.

Syntax

CTML uses only three primitive syntactic structures: text, markups, and character references.

Text is arbitrary free-form text, whose content is mostly unrestricted, provided that it obeys CTS’s global “matchers must match” rule and cannot be confused with the other syntactic structures.
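As a rough illustration of the “matchers must match” rule, the following Python sketch checks whether a piece of text could pass it. It assumes the rule means conventional proper nesting of the three matcher pairs, which this page does not spell out in detail; the function name matchers_match is hypothetical.

```python
# Sketch only: assumes "matchers must match" means proper nesting of
# (), [] and {}. The helper name is illustrative, not part of CTML.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def matchers_match(text: str) -> bool:
    """Return True if every matcher in text is closed by its partner, in nesting order."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must pair with the most recently opened matcher.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Anything still open at the end is unmatched.
    return not stack
```

Under this reading, an ordinary parenthetical remark passes, while a smiley such as :) does not and would need one of the escape forms described later.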

Markups are hierarchically composable structures consisting of a fixed tag, optional attributes, and a required content string, as follows:

tag[content]
tag(attributes)[content]

…

CTML document structure

ctml[ head[ … ] body[ … ] ]

CTML document identification

CTML does not specify or allow any analog of the HTML DOCTYPE preamble.

This does not mean CTML documents are not self-identifying, however. The entire semantic content of a CTML document must be contained in a top-level ctml entity, directly mirroring HTML’s top-level html entity. Unlike HTML, which allows the start tag to be omitted in some situations, the ctml marker is mandatory, and thus may be used to identify a CTML document.

Entities

An entity with no attributes is simply:

tag[content]

where tag is the entity tag name and content is the text it contains, if any. CTML does not use start and end tags like HTML: all delimiting is done with the start tag and the bracket pair.

An entity’s content can be empty. This is typical for entities like p (paragraph) and br (break), which in CTML become:

p[]
br[]

The brackets are never optional in CTML: a p alone does not denote a paragraph marker, for example. In addition, the open bracket must immediately follow the last character of the tag name, with no intervening whitespace.
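To make the attribute-less form concrete, here is a minimal Python sketch of converting tag[content] markup to HTML. It is an illustration under stated assumptions, not a reference converter: it assumes tag names are ASCII letters followed by letters or digits (the exact character set is still unchecked on this page), it emits HTML void elements such as br without an end tag, and it ignores attributes, comments, literal text, the o and c forms, and character references. The names to_html and find_close are hypothetical.

```python
import html
import re

# Assumption: a tag name is an ASCII letter followed by letters or digits,
# immediately followed by the open bracket (no intervening whitespace).
TAG_OPEN = re.compile(r"[A-Za-z][A-Za-z0-9]*\[")

# HTML void elements take no end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

def find_close(text: str, open_idx: int) -> int:
    """Return the index of the ']' matching the '[' at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unmatched '['")

def to_html(text: str) -> str:
    """Convert attribute-less tag[content] markup to HTML (sketch only)."""
    out = []
    pos = 0
    while True:
        m = TAG_OPEN.search(text, pos)
        if m is None:
            # No more markup: the rest is plain text, which HTML needs escaped.
            out.append(html.escape(text[pos:], quote=False))
            return "".join(out)
        out.append(html.escape(text[pos:m.start()], quote=False))
        tag = m.group()[:-1]                      # drop the trailing '['
        close = find_close(text, m.end() - 1)
        inner = to_html(text[m.end():close])      # content may nest markup
        if tag in VOID and inner == "":
            out.append(f"<{tag}>")
        else:
            out.append(f"<{tag}>{inner}</{tag}>")
        pos = close + 1
```

For example, to_html("Here is some em[emphasized] text.") yields "Here is some <em>emphasized</em> text.", and br[] becomes a bare <br>. Note that the sketch also escapes <, > and & in running text on the way out, since CTML leaves them unescaped but HTML does not.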

CTML tag names use the same character set as HTML (XXX check).

Entity Attributes

Entities can have attributes, by placing the full attribute string in parentheses between the tag name and the bracketed content, like this:

(img src="pic.jpg")[]
(a href="target.html")[link text]

Entity attribute syntax itself is exactly as in HTML. There may be any number of attributes, each with an optional value. Attribute values can be either unquoted (attr=value), single-quoted (attr='value'), or double-quoted (attr="value"), exactly as in HTML.

Entities without attributes can also use the parenthesized form, as in:

(em)[emphasized text]

This form can be useful if the markup needs to follow other literal text immediately with no intervening whitespace. To emphasize just part of a word, for example:

abso(em)[friggin]lutely.

Comments

A comment markup in CTML is much simpler than in HTML, consisting only of a hyphen (-) immediately followed by bracketed text, the contents of which are ignored:

-[comment text]

Comments may be arbitrarily long and contain any characters, subject only to the all-pervasive “matchers must match” rule.

Literal text

While most normal running text does not need to be in any markup, the special . markup encloses uninterpreted literal text and prevents any of its content from being confused as markup:

.[literal text]

Within the literal text, all characters are allowed and none are considered to have special meaning, apart from the “matchers must match” rule. As an example:

Here is some em[emphasized] text.
You emphasize text using .[em[emphasized]] markup.

Another situation needing the . markup form is when literal text ending in alphanumeric characters needs to be immediately followed by a markup tag with no intervening whitespace. To emphasize only part of a word, for example, we can either protect the preceding literal text with a . markup or simply divide the two with an empty .[] markup:

.[Abso]em[friggin]lutely.
Abso.[]em[friggin]lutely.

Literal matchers

Sometimes we need running literal text to contain unmatched matcher characters: for example, when we are writing about those literal characters. We cannot include unmatched matchers directly, however, without violating the global “matchers must match” rule. For this reason, CTML provides the following markup forms to represent unmatched open or close punctuation characters:

o()	open parenthesis `(`
c()	close parenthesis `)`
o[]	open bracket `[`
c[]	close bracket `]`
o{}	open brace `{`
c{}	close brace `}`

While literal matchers could alternatively be escaped using character references as described below, the above markup forms are more concise and arguably clearer: each uses the desired matcher characters themselves, written as a matching pair, with the o or c tag indicating whether the open or close member of the pair is wanted as literal text.

Character references

Like HTML, CTML allows both regular text and attributes to contain character references, or ASCII sequences denoting rich Unicode characters that often cannot be typed directly on most keyboards. Character references may use either standard entity names or numeric code points, but CTML surrounds the character reference in brackets instead of with an ampersand (&) and semicolon (;) pair. Numeric code points can be in decimal or hexadecimal, as in HTML. Some examples, with both named and numeric equivalents:

é	[eacute]	[#233]		[#xE9]
©	[copy]		[#169]		[#xA9]
→	[rightarrow]	[#8594]		[#x2192]

Since UTF-8 has become the standard text file encoding and directly supports all of these code points, smart text editors or other tools might recognize and convert non-ASCII character references to their equivalent Unicode characters under some mode under the user’s control. In this way, bracketed character references may be used for ephemeral text-entry rather than markup. Such conversion should not generally be applied to references to ASCII characters, however, since it may turn deliberately escaped (and hence insensitive) characters such as matchers back into sensitive ones.
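The kind of editor-side conversion described above might look like the following Python sketch. It assumes reference names are taken from the HTML5 named character reference table (here, Python's html.entities.html5), and that a bracket immediately preceded by a letter, digit, period, hyphen, or close parenthesis belongs to a markup rather than a character reference; both are assumptions of this example, and the function names are hypothetical. References that resolve to ASCII characters are deliberately left untouched.

```python
import re
from html.entities import html5

# A bracketed reference: [name], decimal [#233], or hexadecimal [#xE9].
# Assumption: a '[' right after a tag-like character (letter, digit, '.',
# '-', ')') opens markup content instead, so it is skipped.
REF = re.compile(
    r"(?<![A-Za-z0-9.)\-])\[(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*)\]"
)

def resolve(body: str):
    """Return the character(s) a reference body denotes, or None if unknown."""
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
    except (ValueError, OverflowError):
        return None                      # code point out of range
    return html5.get(body + ";")         # html5 keys carry the ';' suffix

def expand_non_ascii(text: str) -> str:
    """Replace non-ASCII character references with the characters themselves."""
    def repl(m):
        value = resolve(m.group(1))
        # Leave unknown names and ASCII references (such as escaped
        # matchers) exactly as written.
        if value is None or value.isascii():
            return m.group(0)
        return value
    return REF.sub(repl, text)
```

With these assumptions, [eacute], [#233], and [#xE9] all become é, while [lbrack] and [#91] stay as references so that the escaped bracket does not turn back into a live matcher.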

Escaping sensitive characters

Also like HTML, character references in CTML may be used to escape literal ASCII characters that would otherwise be sensitive. The set of characters that (sometimes) needs escaping is different, however. CTML never needs less-than, greater-than, or ampersand characters (<, >, &) to be escaped as in HTML. CTML does require matchers to be escaped ((, ), [, ], {, }), but only when they would violate the global “matchers must match” rule. A normal English parenthetical expression, curly-brace-delimited C code block, and the like, are all fine and need no escaping provided the matchers match.

If running text or an attribute’s value needs to contain an unmatched matcher, then a character reference may be used to refer to the literal character, as an alternative to the o and c markup forms discussed above. Thus, the forms in each row below are equivalent:

o()		[lpar]		[#40]		[#x28]
c()		[rpar]		[#41]		[#x29]
o[]		[lbrack]	[#91]		[#x5B]
c[]		[rbrack]	[#93]		[#x5D]
o{}		[lbrace]	[#123]		[#x7B]
c{}		[rbrace]	[#125]		[#x7D]
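As a sketch of how a tool might apply these escapes automatically, the Python function below replaces only the unmatched matchers in a string with their named character references and leaves matched pairs alone. The policy for mismatched pairs (escape a closer that does not pair with the most recently opened matcher, then escape any openers still pending at the end) is one plausible choice made for this example, not something this page specifies. The sketch also does not address how a reference should be separated from an immediately preceding word or hyphen. The function name escape_unmatched is hypothetical.

```python
# Named references for each matcher, as in the table above.
OPEN_REF = {"(": "[lpar]", "[": "[lbrack]", "{": "[lbrace]"}
CLOSE_REF = {")": "[rpar]", "]": "[rbrack]", "}": "[rbrace]"}
PARTNER = {")": "(", "]": "[", "}": "{"}

def escape_unmatched(text: str) -> str:
    """Replace unmatched matchers with character references (sketch only)."""
    out = list(text)
    stack = []  # (index, opener) for matchers that are currently open
    for i, ch in enumerate(text):
        if ch in OPEN_REF:
            stack.append((i, ch))
        elif ch in CLOSE_REF:
            if stack and stack[-1][1] == PARTNER[ch]:
                stack.pop()              # properly matched pair: keep both
            else:
                out[i] = CLOSE_REF[ch]   # stray closer: escape it
    for i, ch in stack:
        out[i] = OPEN_REF[ch]            # openers never closed: escape them
    return "".join(out)
```

For instance, escape_unmatched("a smiley :)") gives "a smiley :[rpar]", while "f(x) is fine" is returned unchanged because its parentheses already match.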

XXX in retrospect this is probably suitable for cmark but not ctml.

Bryan Ford