CTML is simply an alternate character-level syntax for HTML. It otherwise corresponds closely to HTML (currently HTML5 in particular), and is designed to be easily convertible both to and from HTML.
The main goal of CTML is to satisfy the CTS metasyntactic discipline, so that CTML can be readily composed with other CTS-compliant languages without escaping. Secondary goals are to be more concise, readable, and easily typeable than HTML, while keeping the syntax simple and readily cross-convertible with HTML. CTML does not try to be as “intuitively” readable as Markdown, for example, but is much simpler syntactically (e.g., no complex layout-dependent rules), and much easier to generate automatically (e.g., via conversion from HTML).
CTML is also syntactically much simpler than HTML. Like XML/XHTML, CTML is more rigorous in enforcing the rules it does specify (e.g., that matchers must match). But it avoids many of the syntactic quirks and complexities of HTML syntax derived from SGML tradition and decades of gradual evolution.
CTML uses only three primitive syntactic structures: text, markups, and character references.
Text is arbitrary free-form text, whose content is mostly unrestricted, provided that it obeys CTS’s global “matchers must match” rule and cannot be confused with the other syntactic structures.
Markups are hierarchically composable structures consisting of a fixed tag, optional attributes, and a required content string, as follows:
tag[content]
tag(attributes)[content]
…
ctml[ head[ … ] body[ … ] ]
CTML does not specify or allow any analog of the HTML DOCTYPE preamble.
This does not mean CTML documents are not self-identifying, however.
The entire semantic content of a CTML document must be contained
in a top-level ctml entity,
directly mirroring HTML’s top-level html entity.
Unlike HTML, which allows the html start tag to be omitted in some situations,
the ctml marker is mandatory,
and thus may be used to identify a CTML document.
An entity with no attributes is simply:
tag[content]
where tag is the entity tag name and content is the text it contains, if any.
CTML does not use start and end tags like HTML:
all delimiting is done with the start tag and the bracket pair.
An entity’s content can be empty.
This is typical for entities like p (paragraph) and br (break),
which in CTML become:
p[]
br[]
The brackets are never optional in CTML:
a p alone does not denote a paragraph marker, for example.
In addition, the open bracket must immediately follow
the last character of the tag name,
with no intervening whitespace.
CTML tag names use the same character set as HTML (XXX check).
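The attribute-less form described above is regular enough to convert mechanically. The following is a minimal sketch in Python, not a reference implementation: the names ctml_to_html, find_close, TAG_OPEN, and VOID are hypothetical, the tag-name pattern (an ASCII letter followed by letters or digits) is an assumption, and attributes, comments, literal markups, and character references are deliberately ignored.

```python
import re

# Assumption: tag names are an ASCII letter followed by letters or digits.
TAG_OPEN = re.compile(r"([A-Za-z][A-Za-z0-9]*)\[")

# HTML elements that take no end tag (a partial, illustrative list).
VOID = {"br", "hr", "img", "wbr"}


def find_close(text, start):
    """Index of the ']' closing a '[' whose content begins at text[start]."""
    depth = 1
    for i in range(start, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unmatched '['")


def ctml_to_html(text):
    """Convert attribute-less tag[content] markups to HTML, recursively."""
    out, pos = [], 0
    while True:
        m = TAG_OPEN.search(text, pos)
        if m is None:
            out.append(text[pos:])
            return "".join(out)
        close = find_close(text, m.end())
        tag = m.group(1)
        inner = ctml_to_html(text[m.end():close])
        out.append(text[pos:m.start()])
        if tag in VOID and not inner:
            out.append(f"<{tag}>")
        else:
            out.append(f"<{tag}>{inner}</{tag}>")
        pos = close + 1
```

Because the open bracket must immediately follow the tag name, a single regular expression suffices to find each markup, and counting bracket depth locates its end as long as the input obeys the “matchers must match” rule.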
Entities can have attributes, by placing the full attribute string in parentheses between the tag name and the bracketed content, like this:
(img src="pic.jpg")[]
(a href="target.html")[link text]
Entity attribute syntax itself is exactly as in HTML.
There may be any number of attributes,
each with an optional value.
Attribute values can be either unquoted (attr=value),
single-quoted (attr='value'),
or double-quoted (attr="value"),
exactly as in HTML.
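As an illustration of the three value forms, here is a simplified sketch in Python of splitting an attribute string into name/value pairs. The function name parse_attributes and the regular expression are hypothetical, and the pattern is a loose approximation of HTML’s attribute rules rather than a faithful implementation of them.

```python
import re

# Simplified approximation of HTML attribute syntax (an assumption, not the spec).
ATTR = re.compile(
    r"""([^\s=]+)            # attribute name
        (?:\s*=\s*
           (?:"([^"]*)"      # double-quoted value
             |'([^']*)'      # single-quoted value
             |([^\s"']+)     # unquoted value
           ))?""",
    re.VERBOSE,
)


def parse_attributes(attr_string):
    """Map each attribute name to its value, or to None if it has no value."""
    attrs = {}
    for m in ATTR.finditer(attr_string):
        name = m.group(1)
        # At most one of the three value groups is set for any one match.
        value = next((g for g in m.groups()[1:] if g is not None), None)
        attrs[name] = value
    return attrs
```

For example, parse_attributes('src="pic.jpg" width=40 hidden') maps src to pic.jpg, width to 40, and hidden to None.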
Entities without attributes can also use the parenthesized form, as in:
(em)[emphasized text]
This form can be useful if the markup needs to follow other literal text immediately with no intervening whitespace. To emphasize just part of a word, for example:
abso(em)[friggin]lutely.
A comment markup in CTML is much simpler than in HTML,
consisting only of a hyphen (-) immediately followed by bracketed text,
the contents of which are ignored:
-[comment text]
Comments may be arbitrarily long and contain any characters, subject only to the all-pervasive “matchers must match” rule.
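Since a comment ends at the bracket matching its opening one, removing comments only requires tracking bracket depth. The sketch below (Python, with the hypothetical name strip_comments) assumes well-formed input and treats every -[ sequence as a comment, so it does not account for a -[ that appears inside literal text.

```python
def strip_comments(text):
    """Remove every -[...] comment, honoring nested brackets inside it."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("-[", i):
            depth = 1
            j = i + 2
            # Scan forward to the bracket that closes the comment.
            while j < len(text) and depth:
                if text[j] == "[":
                    depth += 1
                elif text[j] == "]":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unterminated comment")
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```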
While most normal running text does not need to be in any markup,
the special . markup encloses uninterpreted literal text
and prevents any of its content from being confused as markup:
.[literal text]
Within the literal text, all characters are allowed and none are considered to have special meaning, apart from the “matchers must match” rule. As an example:
Here is some em[emphasized] text.
You emphasize text using .[em[emphasized]] markup.
Another situation needing the . markup form
is when literal text ending in alphanumeric characters
needs to be immediately followed by a markup tag
with no intervening whitespace.
To emphasize only part of a word, for example,
we can either protect the preceding literal text with a . markup
or simply divide the two with an empty .[] markup:
.[Abso]em[friggin]lutely.
Abso.[]em[friggin]lutely.
Sometimes we need running literal text to contain unmatched matcher characters: for example, when we are writing about those literal characters. We cannot include unmatched matchers directly, however, without violating the global “matchers must match” rule. For this reason, CTML provides the following markup forms to represent unmatched open or close punctuation characters:
o() open parenthesis `(`
c() close parenthesis `)`
o[] open bracket `[`
c[] close bracket `]`
o{} open brace `{`
c{} close brace `}`
While literal matchers could alternatively be escaped
using character references as described below,
the above markup forms are more concise and have the clarity benefit
of using exactly the desired matcher characters via matching pairs,
with the o or c tag merely indicating whether
it is the open or close member of the punctuation pair
that is desired as literal text.
Like HTML, CTML allows both regular text and attributes
to contain character references,
or ASCII sequences denoting rich Unicode characters
that often cannot be typed directly on most keyboards.
Character references may use either standard entity names
or numeric code points,
but CTML surrounds the character reference in brackets
instead of with an ampersand (&) and semicolon (;) pair.
Numeric code points can be in decimal or hexadecimal, as in HTML.
Some examples, with both named and numeric equivalents:
é [eacute] [#233] [#xE9]
© [copy] [#169] [#xA9]
→ [rightarrow] [#8594] [#x2192]
Since UTF-8 has become the standard text file encoding and directly supports all of these code points, smart text editors or other tools might recognize and convert non-ASCII character references to their equivalent Unicode characters in some mode under the user’s control. In this way, bracketed character references may be used for ephemeral text entry rather than markup. Such conversion should not generally be applied to references to ASCII characters, however, since it may turn escaped, insensitive characters such as matchers into sensitive ones.
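The kind of editor-side conversion described above might look like the following Python sketch. It relies on the standard library’s html.entities.html5 table for entity names; the names CHAR_REF, resolve_ref, and expand_refs are hypothetical, and the pattern for what counts as a reference body is an assumption. References that resolve to ASCII characters, and bracketed words that are not known entity names, are left untouched.

```python
import re
from html.entities import html5  # standard HTML5 named-entity table

# Bracketed reference body: decimal, hexadecimal, or an entity name.
CHAR_REF = re.compile(r"\[(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*)\]")


def resolve_ref(body):
    """Return the character(s) a reference body denotes, or None if unknown."""
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
    except (ValueError, OverflowError):
        return None  # code point outside the Unicode range
    return html5.get(body + ";")


def expand_refs(text):
    """Replace non-ASCII character references with the characters themselves."""
    def repl(m):
        ch = resolve_ref(m.group(1))
        if ch is None or ch.isascii():
            return m.group(0)  # unknown name or ASCII: leave untouched
        return ch
    return CHAR_REF.sub(repl, text)
```

Leaving ASCII references alone means an escaped matcher such as [lpar] survives the conversion unchanged.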
Also like HTML, character references in CTML may be used to
escape literal ASCII characters that would otherwise be sensitive.
The set of characters that (sometimes) needs escaping is different, however.
CTML never needs less-than, greater-than, or ampersand characters
(<, >, &) to be escaped as in HTML.
CTML does require the matchers (, ), [, ], {, and } to be escaped,
but only when they would violate the global “matchers must match” rule.
A normal English parenthetical expression, curly-brace-delimited C code block,
and the like, are all fine and need no escaping provided the matchers match.
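To make the “matchers must match” condition concrete, here is a small Python sketch of a checker. It assumes the rule means ordinary balanced nesting across all three pairs; the precise CTS definition is not given in this document, and the name matchers_match is hypothetical.

```python
# Assumption: "matchers must match" means balanced, properly nested pairs.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def matchers_match(text):
    """True if every (, [, and { in text is closed in properly nested order."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must pair with the most recently opened matcher.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack
```

Under this reading, a C fragment such as int f() { return a[0]; } passes as-is, while text ending in a smiley :) contains an unmatched close parenthesis and would need escaping.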
If running text or an attribute’s value needs to contain an unmatched matcher,
then a character reference may be used to refer to the literal character,
as an alternative to the o and c markup forms discussed above.
Thus, the forms in each row below are equivalent:
o() [lpar] [#40] [#x28]
c() [rpar] [#41] [#x29]
o[] [lbrack] [#91] [#x5B]
c[] [rbrack] [#93] [#x5D]
o{} [lbrace] [#123] [#x7B]
c{} [rbrace] [#125] [#x7D]
XXX in retrospect this is probably suitable for cmark but not ctml.
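The named and numeric forms in the table of matcher references can be derived mechanically from each character’s code point, which is a convenient way to cross-check the rows. This Python sketch uses the standard html.entities.html5 table; MATCHER_NAMES, reference_forms, and names_agree are hypothetical names introduced for illustration.

```python
from html.entities import html5  # standard HTML5 named-entity table

# Entity names for the six matcher characters, as listed in the table.
MATCHER_NAMES = {
    "(": "lpar", ")": "rpar",
    "[": "lbrack", "]": "rbrack",
    "{": "lbrace", "}": "rbrace",
}


def reference_forms(ch):
    """Named, decimal, and hexadecimal bracketed references for a matcher."""
    cp = ord(ch)
    return [f"[{MATCHER_NAMES[ch]}]", f"[#{cp}]", f"[#x{cp:X}]"]


def names_agree():
    """Check each entity name against the HTML5 table's character for it."""
    return all(html5[name + ";"] == ch for ch, name in MATCHER_NAMES.items())
```

For instance, reference_forms("(") yields [lpar], [#40], and [#x28].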
Bryan Ford