Home - Topics - Papers - Talks - Theses - Blog - CV - Photos - Funny

Delimited Text and Binary Syntax

or Composable Text/Binary Syntax (CTS/CBS)?

Goals:

Text format self-describing and human-readable
Binary format for compactness and efficiency
- Either self-describing or schema-dependent
Easy cross-conversion between text and binary formats
- Easy for schemas to define both text and binary equivalents at once
Streamable: encoder can start sending long composites before knowing length
Embeddable:

for embeddability, see for example JSON Schema’s desire to be able to embed arbitrary non-JSON data with a MIME Media Type: https://json-schema.org/understanding-json-schema/reference/non_json_data.html Any DTS-encoded content may be embedded in DTS-encoded JSON data with no additional quoting or escaping, provided the inner DTS context uses the same or a superset of the outer DTS context’s set of sensitive bracket pairs.

Comparison:

Self-describing formats like HTML, XML, JSON
- With optional schema like XML
Self-describing binary formats like BSON, BJSON, CBOR, UBJSON, etc.
Schema-based coded binary formats like Protobuf

Three levels of interpretability:

Delimiting: just finding the end of a data blob
Hierarchy: being able to tell when a blob contains other blobs
Basic typing: recognizing encoded values of common types like JSON types
Extended typing: recognizing encoded values of more specialized types

Encoding types versus schema types: There are only a few encoding types, which are a subset of the types appearing in schema. The encoding types are only those required in defining and interpreting the structural metadata of the format: basically, just integers and strings. Schema types are richer, including booleans, floating-point, dates, etc.

An item is a structural component that a recipient can find the end of regardless of whether the recipient understands its semantic meaning or content. That is, an item is skippable if necessary.

An item may be atomic or composite. An atomic item like an integer or string contains no sub-structure defined by this encoding or reliably interpretable via this encoding’s rules. An integer or string may in fact contain information encoded in some arbitrary syntax (even a recursive instantiation of this one), but any such recursive interpretability is neither guaranteed nor suggested.

A composite item like a vector or map, in contrast, has hierarchical sub-structure defined by this encoding and traversable by an interpreter of this encoding.

Mathematical intervals

A secondary problem is the use of unmatched combinations of square brackets and parentheses in mathematical half-open interval notation…

Potential alternatives:

An ASCII-based notation that uses only one type of matcher for both open and closed endpoints. For example:
- We could use simple brackets like [a..b] for closed intervals, and brackets with adjacent periods like [.a..b.] to represent open intervals, with [a..b.] representing an interval that’s open at a and closed at b, and [.a..b] representing an interval that’s closed at a and closed at b.
- We could incorporate the mathematical inequality signs into the notation, since they are heavily used in non-matching fashion anyway. For example, we could use [>a..b<] to represent a closed interval containing all real numbers strictly greater than a and strictly less than b, with [>a..b] being the half-open interval at a and [a..b<] being the half-open interval at b.
We could move beyond ASCII and use the Unicode character set to express intervals in more visually-suggestive fashion. For example:
- From a typography perspective, one attractive approach is to adopt/abuse the lenticular brackets that Asian languages traditionally use as quotation marks, and for which Unicode already provides both a filled black form, “【】”, and an unfilled white form, “〖〗”.
  
  The lenticular bracket shape is suitable in that it combines the traditional square brackets [] and parentheses () traditionally used to denote intervals. Further, we can take the filled form to denote inclusion of the endpoint and the unfilled form to denote exclusion of the endpoint.
  
  Thus, a half-open interval from including a but excluding b would be 【a..b〗.
  
  The main downside of this approach is that the lenticular bracket characters are defined as part of the Unicode CJK character blocks. As such, they are commonly rendered as full-width characters, inconsistently with the half-width rendering of Western scripts and traditional mathematical notation.
- An alternative that avoids the abuse of CJK characters is to use the mathematical left and right tortoise shell brackets, which similarly come in a filled black form “⦗⦘” and an unfilled white form “⟬⟭”. These symbols could similarly be viewed as a visual “compromise” between square brackets and parentheses, with the black version representing an inclusive interval endpoint and the white version representing an exclusive endpoint.
  
  Thus, a half-open interval from including a but excluding b would be ⦗a..b⟭.

It would be nice to be able to have calculator expressions and programming language syntax in which one might write interval-membership tests that look something like,

if (r ∈ [a,b]) { ... }

where a and b are real or floating-point numbers and this statement is a shorthand for:

if (r >= a && r <= b) { ... }

But since mathematical intervals can be open or closed, it would be nice, and seem natural, if we could also test in shorthand whether r is in the open interval between a and b. But the obvious approach…

if (r ∈ (a,b)) { ... }

immediately lands us in the syntactic disaster of being unable to distinguish mathematical open-interval notation from already-ubiquitous tuple notation.

The syntactic disaster gets even worse if we even conceive of representing tests against half-open or half-closed intervals. Just imagine how to write a parser that can deal with tests like this, especially embedded deeper in inequality expressions involving other uses of parentheses for grouping and brackets for vectors, array indices, etc.:

if (r ∈ [a,b)) { ... }
if (r ∈ (a,b]) { ... }

In general, traditional mathematical interval notation violates the convention that open and close punctuation are otherwise typically used in designated matching pairs.

CBS

Optionally-Schematized Self-Describing Object Notation

Delimited Binary Structured Object Notation or DBSON?

We wish to provide a hierarchical format similar to JSON that is extensible and self-describing by default, in that a reader can derive the full hierarchical structure without necessarily knowing anything about the actual data it contains, and so that a reader that can interpret some but not all embedded fields can safely ignore and skip the ones it does not understand.

We like the fact that self-describing formats like XML and JSON typically give their fields human-readable keys like “name” and “address”. But on the other hand we also like the compactness of “schematized” formats such as Protobuf where fields are typically identified by small integers that usually have one-byte encodings in the common case. Can we get the best of both worlds?

Principle: a schema is basically a dictionary. It can also be used for validity-checking as XML does, but this is a separate role. The receiver might not know the full or latest dictionary that the sender used; this is fine in cases for applications like RPC where the receiver is simply going to ignore fields it does not understand.

But it would also be nice if, for development and debugging purposes for example, a receiver could somehow “go find” the dictionary a schema-compressed object file was compressed with, so as to expand all the codes into human-readable names and not just the ones the receiver already knows the mappings for. Including a schema URL in a header towards the beginning of the encoded file, as XML does, is a standard-practice solution that we adopt for this purpose.

Hierarchy

The goal of this level is to sepearate type information from value information, and enable a receiver to tell which values do and don’t have sub-structure, completely independent of the receiver’s ability to understand specific types.

Types are either integer schematic codes or human-readable text names. At this level, types are merely opaque integers or text strings.

ttttttt0: data with 7-bit schematic type code ttttttt, content follows
lllllll1: next lllllll bytes (>=2) contains type name, content follows
00000001: data is type code (integer) blob followed by content
00000011: data is type name (string) blob followed by content

The last two encodings are the most general, the first being just shorthand equivalents of them.

Type codes

Type codes are variable-length integers, which are categorized according to their three least-significant bits:

cse:

c: 1 if data contains composite structure encoded in DRS/BRS, 0 if atomic s: 1 if data takes the form of a UTF-8 text string, 0 if binary e: 1 if type is an extension type defined by a scheme, 0 if a base type

A more specific scheme defines which standard types are actually allowed.

Atomic binary type codes currently assigned in base format:

000000: raw atomic binary data with no further encoded type information
001000: IEEE 16/32/64/128-bit binary floating-point number
010000: unsigned binary integer with value encoded in little-endian
011000: negative binary integer with -value-1 encoded in little-endian

Atomic text type codes currently assigned:

000010: generic text string with no further encoded type information
001010: text-encoded number with optional fraction, decimal (other bases?)
010010: text-encoded date/time in ISO 8601, RFC 3339, or similar formats

Composite binary type codes currently assigned:

000100: array/vector/sequence of consecutive BRS-encoded values
001100: map/dictionary/object BRS-encoded as sequence of key-value pairs

Composite text type codes currently assigned:

000110: array/vector/sequence of consecutive DRS-encoded values
001110: map/dictionary/object DRS-encoded as sequce of key-value pairs

To consider adding:

Boolean (true, false), null, undefined
Raw text/binary content with MIME content-type tag
Dates in text/binary formats
Packed binary vectors of 8/16/32/64-byte signed or unsigned integers
vvvvv000: little-endian unsigned integer, vvvvv is 5 least-significant bits
vvvvv100: little-endian negative integer encoding -value-1, vvvvv is 5 lsbs
ttttt000: typed atomic value: schematic type ttttt, further bytes are content
00000010: text string, remaining bytes contain UTF-8 encoding
00000011: binary data contained in remaining bytes
00000110: boolean false value
00000111: boolean true value
00001110: JSON null value
00001111: undefined value (e.g., JavaScript)
000000bb: data is vector of 2^bb-byte unsigned integers (8,16,32,64-bit)
000000bb: data is vector of 2^bb-byte signed integers (8,16,32,64-bit)
00000000: composite vector of variable-length blob-encoded values
00000000: composite map of blob-encoded key/value pairs
00000bbb: IEEE binary floating-point value of size 8*2^bbb bits
000000bb: typed atomic value: 2^bb-byte schematic type followed by content
000000bb: typed composite value: 2^bb-byte schematic type followed by content

Maps can encode keys and values with arbitrary types, although a particular scheme may restrict the range of valid keys and/or values.

A part is a varint code followed by additional data that depends on the low X bits of the code.

x0: this part is a final literal
Bit 0 - More: indicates this item contains more part(s)
Bit 1 - Composite: indicates this part may contain sub-parts
Bit 2 -

For items representing keys:

If the item is an integer, it represents a code referring to a schema. The receiver must know and understand the applicable schema in order to interpret the code properly.
If the item is a string, it represents a textual scheme name.
Keys can potentially be composite too, depending on the scheme…

CT85 encoding for binary data

Inspired by Ascii85 encoding, which is more efficient than uuencode or base64 encoding.

We use this 85-character set, in this order:

! # $ % & * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~

left unused:

( ) { } because they’re sensitive matchers
" ’ \ because they’re widely used for quoting and excaping in C-like languages

Unlike Ascii85, the CT85-encoded data encodes the exact byte length of the binary data and does not require it first to be padded to a multiple of four. We treat a full 4-byte sequence as an unsigned integer from 0 to 2^32-1, just as in Ascii85. However, if the binary stream ends has only 3 bytes in the last frame, we encode it as an unsigned integer from 0 to 2^24-1 and add 2^32. If the binary stream ends with a 2-byte frame, we encode it as an integer from 0 to 2^16-1 and add 2^32+2^24. If the last frame contains only a single byte, we treat it as an integer from 0 to 2^8-1 and add 2^32+2^24+2^16.

Thus, for all but the last frame, the valid unsigned integer codes range from 0 to 2^23-1 - but for the last frame the valid integer codes range from 0 to 2^32+2^24+2^16+2^8-1, or 4,311,810,303. This is still comfortably less than 85^5 or 4,437,053,124, the largest unsigned integer encodable in base85.

Bryan Ford