
Delimited Text and Binary Syntax

or Composable Text/Binary Syntax (CTS/CBS)?

Goals:

For embeddability, see for example JSON Schema’s desire to be able to embed arbitrary non-JSON data with a MIME media type: https://json-schema.org/understanding-json-schema/reference/non_json_data.html. Any DTS-encoded content may be embedded in DTS-encoded JSON data with no additional quoting or escaping, provided the inner DTS context uses the same set of sensitive bracket pairs as the outer DTS context, or a superset of it.

Comparison:

Three levels of interpretability:

Encoding types versus schema types: There are only a few encoding types, which are a subset of the types appearing in schema. The encoding types are only those required in defining and interpreting the structural metadata of the format: basically, just integers and strings. Schema types are richer, including booleans, floating-point, dates, etc.

An item is a structural component that a recipient can find the end of regardless of whether the recipient understands its semantic meaning or content. That is, an item is skippable if necessary.

An item may be atomic or composite. An atomic item like an integer or string contains no sub-structure defined by this encoding or reliably interpretable via this encoding’s rules. An integer or string may in fact contain information encoded in some arbitrary syntax (even a recursive instantiation of this one), but any such recursive interpretability is neither guaranteed nor suggested.

A composite item like a vector or map, in contrast, has hierarchical sub-structure defined by this encoding and traversable by an interpreter of this encoding.
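
To make the distinction concrete, here is a minimal in-memory sketch in C; all names are hypothetical and no wire-format details are implied:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An item is either atomic or composite. Only the composite flag and the
   sub-item list mean anything to a generic interpreter; an atomic payload
   remains an opaque byte string at this level. */
struct item {
    uint64_t type;        /* opaque type code or text-name reference */
    bool composite;       /* true if this encoding defines sub-structure */
    const uint8_t *bytes; /* atomic payload (opaque, possibly recursive) */
    size_t len;
    struct item *sub;     /* composite sub-items, traversable generically */
    size_t nsub;
};

/* Count every item in a tree without interpreting a single type: possible
   precisely because structure is marked independently of type. */
static size_t count_items(const struct item *it) {
    size_t n = 1;
    if (it->composite)
        for (size_t i = 0; i < it->nsub; i++)
            n += count_items(&it->sub[i]);
    return n;
}

The traversal never consults the type field, which is exactly the skippability property defined above.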

Mathematical intervals

A secondary problem is the use of unmatched combinations of square brackets and parentheses in mathematical half-open interval notation…

Potential alternatives:

It would be nice to have calculator expressions and programming-language syntax in which one might write interval-membership tests that look something like this:

if (r ∈ [a,b]) { ... }

where a and b are real or floating-point numbers and this statement is a shorthand for:

if (r >= a && r <= b) { ... }

But since mathematical intervals can be open or closed, it would be nice, and would seem natural, if we could also test in shorthand whether r is in the open interval between a and b. But the obvious approach…

if (r ∈ (a,b)) { ... }

immediately lands us in the syntactic disaster of being unable to distinguish mathematical open-interval notation from already-ubiquitous tuple notation.

The syntactic disaster gets even worse if we conceive of representing tests against half-open or half-closed intervals. Just imagine how to write a parser that can deal with tests like these, especially when they are embedded deeper in inequality expressions involving other uses of parentheses for grouping and brackets for vectors, array indices, etc.:

if (r ∈ [a,b)) { ... }
if (r ∈ (a,b]) { ... }
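
Expanded into comparisons as before, these two tests would read:

if (r >= a && r < b) { ... }
if (r > a && r <= b) { ... }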

In general, traditional mathematical interval notation violates the convention that open and close punctuation are otherwise typically used in designated matching pairs.


CBS

Optionally-Schematized Self-Describing Object Notation

Delimited Binary Structured Object Notation or DBSON?

We wish to provide a hierarchical format similar to JSON that is extensible and self-describing by default: a reader can derive the full hierarchical structure without necessarily knowing anything about the actual data it contains, and a reader that can interpret some but not all embedded fields can safely ignore and skip the ones it does not understand.

We like the fact that self-describing formats like XML and JSON typically give their fields human-readable keys like “name” and “address”. On the other hand, we also like the compactness of “schematized” formats such as Protobuf, where fields are identified by small integers that have one-byte encodings in the common case. Can we get the best of both worlds?

Principle: a schema is basically a dictionary. It can also be used for validity checking, as in XML, but that is a separate role. The receiver might not know the full or latest dictionary that the sender used; this is fine for applications like RPC, where the receiver simply ignores fields it does not understand.
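
As a sketch of this principle (the names and table layout are hypothetical), a schema in this role is little more than a code-to-name table:

#include <stddef.h>
#include <stdint.h>

/* A schema as a dictionary: small integer field codes map to the
   human-readable keys they abbreviate. */
struct schema_entry { uint64_t code; const char *name; };

static const struct schema_entry example_schema[] = {
    { 1, "name" },
    { 2, "address" },
};

/* Returns NULL for codes this receiver does not know, which is not an
   error: the corresponding field is simply skipped. */
static const char *schema_lookup(uint64_t code) {
    for (size_t i = 0; i < sizeof example_schema / sizeof example_schema[0]; i++)
        if (example_schema[i].code == code)
            return example_schema[i].name;
    return NULL;
}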

But it would also be nice if, for development and debugging purposes for example, a receiver could somehow “go find” the dictionary that a schema-compressed object file was compressed with, so as to expand all the codes into human-readable names, not just the ones whose mappings the receiver already knows. Including a schema URL in a header near the beginning of the encoded file, as XML does, is a standard-practice solution that we adopt for this purpose.

Hierarchy

The goal of this level is to separate type information from value information, and to enable a receiver to tell which values do and do not have sub-structure, completely independently of the receiver’s ability to understand specific types.

Types are either integer schematic codes or human-readable text names. At this level, types are merely opaque integers or text strings.

The last two encodings are the most general; the earlier ones are just shorthand equivalents of them.

Type codes

Type codes are variable-length integers, which are categorized according to their three least-significant bits:

c: 1 if data contains composite structure encoded in DRS/BRS, 0 if atomic
s: 1 if data takes the form of a UTF-8 text string, 0 if binary
e: 1 if type is an extension type defined by a scheme, 0 if a base type

A more specific scheme defines which standard types are actually allowed.
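
A decoding sketch follows; the function names are hypothetical, the Protobuf-style base-128 varint format is an assumption (the text does not pin it down), and so is the assignment of c, s, e to bits 0, 1, 2:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a base-128 varint, least-significant 7-bit group first (format
   assumed). Returns the number of bytes consumed. */
static size_t read_varint(const uint8_t *p, uint64_t *out) {
    uint64_t v = 0;
    size_t i = 0;
    do
        v |= (uint64_t)(p[i] & 0x7f) << (7 * i);
    while (p[i++] & 0x80);
    *out = v;
    return i;
}

/* Split a type code into its three category bits and the type number. */
static void split_type_code(uint64_t code,
                            bool *c, bool *s, bool *e, uint64_t *num) {
    *c = code & 1;        /* composite vs. atomic (assumed bit 0) */
    *s = (code >> 1) & 1; /* UTF-8 text vs. binary (assumed bit 1) */
    *e = (code >> 2) & 1; /* extension vs. base type (assumed bit 2) */
    *num = code >> 3;     /* remaining bits select the specific type */
}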

Atomic binary type codes currently assigned in base format:

Atomic text type codes currently assigned:

Composite binary type codes currently assigned:

Composite text type codes currently assigned:

To consider adding:

Maps can encode keys and values with arbitrary types, although a particular scheme may restrict the range of valid keys and/or values.
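
Continuing the hypothetical data-model sketch from earlier, a map entry is simply a pair of items, which is what permits arbitrary key and value types:

/* A map entry pairs two arbitrary items; nothing restricts the key's type
   at this level, though a particular scheme may. */
struct map_entry {
    struct item key;
    struct item value;
};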

A part is a varint code followed by additional data that depends on the low X bits of the code.

For items representing keys:

CT85 encoding for binary data

CT85 is inspired by Ascii85 encoding, which is more efficient than uuencode or base64 encoding.

We use this 85-character set, in this order:

! # $ % & * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~

left unused (of the 94 printable non-space ASCII characters): " ' ( ) [ \ ] { }

Unlike Ascii85, CT85 encodes the exact byte length of the binary data and does not require it first to be padded to a multiple of four bytes. We treat a full 4-byte frame as an unsigned integer from 0 to 2^32-1, just as in Ascii85. However, if the binary stream ends with a 3-byte final frame, we encode that frame as an unsigned integer from 0 to 2^24-1 and add 2^32. If the stream ends with a 2-byte frame, we encode it as an integer from 0 to 2^16-1 and add 2^32+2^24. If the last frame contains only a single byte, we treat it as an integer from 0 to 2^8-1 and add 2^32+2^24+2^16.

Thus, for all but the last frame, the valid unsigned integer codes range from 0 to 2^32-1, but for the last frame they range from 0 to 2^32+2^24+2^16+2^8-1, or 4,311,810,303. This is still comfortably less than 85^5-1, or 4,437,053,124, the largest unsigned integer encodable in five base-85 digits.
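
A minimal encoder sketch in C (the function name and buffer convention are hypothetical; big-endian byte order within a frame and most-significant-digit-first output follow the Ascii85 convention, and each frame, including a short final one, is assumed to emit exactly five characters, which the range arithmetic above permits):

#include <stddef.h>
#include <stdint.h>

static const char ct85[] =
    "!#$%&*+,-./0123456789:;<=>?@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`"
    "abcdefghijklmnopqrstuvwxyz|~";

/* Encode len bytes into out, which must hold 5 * ceil(len/4) characters.
   Returns the number of characters written. */
static size_t ct85_encode(const uint8_t *in, size_t len, char *out) {
    size_t o = 0;
    while (len > 0) {
        size_t n = len < 4 ? len : 4;
        uint64_t v = 0;
        for (size_t i = 0; i < n; i++)
            v = (v << 8) | in[i];           /* big-endian, as in Ascii85 */
        /* Mark a short final frame by adding the offsets described above. */
        if (n == 3) v += (uint64_t)1 << 32;
        if (n == 2) v += ((uint64_t)1 << 32) + ((uint64_t)1 << 24);
        if (n == 1) v += ((uint64_t)1 << 32) + ((uint64_t)1 << 24)
                       + ((uint64_t)1 << 16);
        for (int i = 4; i >= 0; i--) {      /* five base-85 digits, MSD first */
            out[o + i] = ct85[v % 85];
            v /= 85;
        }
        o += 5;
        in += n;
        len -= n;
    }
    return o;
}

A decoder inverts this five characters at a time: any frame value of 2^32 or more marks the final frame and, by which offset range it falls in, reveals the exact number of trailing bytes.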



Bryan Ford