Home - Topics - Papers - Theses - Blog - CV - Photos - Funny

Binary Encoded Schematic Objects (BESO)

(or JBIN? or…?)

BESO provides a generic representation for any JSON, using JSON Schema information to provide a compact binary encoding. Unlike schema-free binary encodings like BSON and CBOR, BESO encoding leverages the schema to optimize the encoded representation, for example by replacing object tags specified in the schema with small integers, by representing numbers in binary integer or floating-point forms, etc.

BESO Generic does not modify the JSON Schema language at all. It takes only the information already specified in a JSON schema and provides a more compact schema-optimized binary representation.

The BESO Generic encoding ensures that all valid JSON structures can still be represented, including those that extend or fail to conform to the schema. For example, if a JSON object contains elements not defined in the schema, then those elements can still be converted to and from BESO encoding but will just not benefit from the schema’s ability to compress known keys. If a JSON object has values entirely non-conformant with the schema, such as a value of the wrong type, out of the specified numeric range, etc., then that value can still be represented if necessary and converted correctly between text-based JSON and binary BESO encodings, although the value will fail a schema compliance check of course.

This concise list of goals distinguishes BESO

Untyped schema

When the schema specifies neither a type nor enum constraint, and hence allows any JSON value type, the binary encoding of each of the JSON value types is as described here. The first byte of the binary encoding determines the JSON value type it encodes, as summarized in the following list and detailed below:

Integer values

A JSON number whose value is a signed integer is first zigzag-encoded into an unsigned integer with the least-significant bit serving as the sign flag. A non-negative value v becomes 2v, while a negative value v becomes -2v+1. The resulting unsigned integer is then serialized into a variable number of bytes in network (big-endian) byte order. Finally, if the first (most-significant) byte of the encoding is not in the range 0x00 through 0x0f, then a leading zero byte is prepended.

Some example encodings of integers are as follows:

	+0   -> [0x00]
	-0   -> [0x01]
	+1   -> [0x02]
	-1   -> [0x03]
	+2   -> [0x04]
	-2   -> [0x05]
	+7   -> [0x0e]
	-7   -> [0x0f]
	+8   -> [0x00,0x10]
	-8   -> [0x00,0x11]
	+128 -> [0x01,0x00]
	-128 -> [0x01,0x01]
	+256 -> [0x02,0x00]
	-256 -> [0x02,0x01]

In this way, integers close to zero remain small after zigzag-encoding, and thus require few bytes to serialize. Integers of up to four bits including sign require one byte, integers up to twelve bits including sign require two bytes, etc.

Notice that unlike the zigzag encoding used in protobufs, this encoding has representations for positive and negative zero. While negative zero is not representable in most native integer formats, the integer encoding is also usable to encode floating-point numbers that have no fractional component, and the IEEE floating-point formats do have both positive and negative zero. Thus, writers are typically expected to produce -0 only when serializing a floating-point number as an integer, with readers treating +0 and -0 as identical when converting to standard internal integer formats.

For compactness writers should generally avoid encoding unnecessary leading zero bytes in the serialized representation, but readers must be prepared to accept them. Writers may sometimes have reason to produce non-essential leading zero bytes: for example, to pad numbers representing cryptographic secrets to a constant time to minimize the risk of side-channel leaks.

Floating-point numbers

Non-integral JSON numbers are encoded into binary floating-point representation using either base-2 (binary) or base-10 (decimal) fractions. The first byte of the value encoding is 0x10 for floating-point with a binary fraction, and 0x11 for floating-point with a decimal fraction.

Encoders converting from decimal string representation (e.g., JSON text) should generally encode the number using a base-10 fraction in order to avoid losing precision in the conversion. Encoders converting from internal binary floating-point representations, however, can use base-2 fractions to encode numbers more space-efficiently without loss of precision.

The binary encoding of the floating-point number consists of the 0x10 (binary fraction) or 0x11 (decimal fraction) prefix, followed by a CBE-encoded exponent, and finally a mantissa comprising the rest of the encoded byte sequence.

The exponent consists of a variable-length signed integer, which is first zigzag-encoded into a binary unsigned integer, then serialized into a big-endian byte string, and finally CBE-encoded to delimit its length.

In the exponent’s zigzag encoding, the encoded value 1 is special and indicates infinity or not-a-number (NaN) values, rather than a negative zero exponent.

Some example exponents are as follows after both zigzag-encoding and CBE-encoding:

	e = 0        -> [0x00]            1-byte CBE-encoding for
	infinity/NaN -> [0x01]            1-byte exponent 0x00-0x7f
	e = +1       -> [0x02]
	e = -1       -> [0x03]
	e = +2       -> [0x04]
	e = -2       -> [0x05]
	e = +63      -> [0x7e]
	e = -63      -> [0x7f]
	e = +64      -> [0x81,0x80]       2-byte CBE-encoding for
	e = -64      -> [0x81,0x81]       1-byte exponent 0x80-0xff
	e = +127     -> [0x81,0xfe]
	e = -127     -> [0x81,0xff]
	e = +128     -> [0x82,0x01,0x00]  3-byte CBE-encoding for
	e = -128     -> [0x82,0x01,0x01]  2-byte exponents
	e = +32767   -> [0x82,0xff,0xfe]
	e = -32767   -> [0x82,0xff,0xff]

Finally, the mantissa uses the same zigzag encoding as for integers, with encoded values 0 and 1 representing positive and negative zero. The mantissa is then serialized in network (big-endian) byte order without CBE-encoding, which is unnecessary as the mantissa always accounts for the remainder of the encoded floating-point number.

When representing zeros and infinities, the mantissa field can be empty, i.e., zero bytes long. This is the preferred representation since it is the most compact.

Some example mantissa encodings are as follows:

	m = +0     -> []
	m = -0     -> [0x01]
	m = +1     -> [0x02]
	m = -1     -> [0x03]
	m = +2     -> [0x04]
	m = -2     -> [0x05]
	m = +127   -> [0xfe]
	m = -127   -> [0xff]
	m = +128   -> [0x01,0x00]
	m = -128   -> [0x01,0x01]
	m = +32767 -> [0xff,0xfe]
	m = -32767 -> [0xff,0xff]

Readers must be prepared to accept leading zero bytes in the exponent and/or mantissa. Writers should normally avoid producing leading zero bytes when compactness is desired, but may produce leading zeros in order to ensure that encoded values are some fixed (but explicitly-encoded) length.

For NaNs, consistent with IEEE 754, the most-significant bit of the (first byte of the) mantissa indicates whether the NaN is a quiet or signaling NaN. Since the variable-length floating-point mantissa can be any number of bytes long, this means that the quiet or signaling NaN flag can be mantissa bit 7, 15, 23, etc.

Arrays

A binary-encoded JSON array consists of the prefix byte 0x12 followed by a sequence of any number of items. Each item is first binary encoded using the untyped schema encoding defined in this section, then CBE-encoded to delimit its length, and finally all the CBE-encoded items are concatenated.

Key/value maps or objects

A binary-encoded JSON object consists of the prefix byte 0x13 followed by a sequence of any number of key/value pairs. The keys and values are each first binary encoded using the untyped schema encoding defined in this section, then CBE-encoded to delimit their lengths, and finally all the key/value pairs are concatenated.

Boolean values

A boolean value is represented by either the single byte 0x14, denoting true, or the single byte 0x15, denoting false. Writers should ensure that the encoding is only one byte, and readers should ignore any bytes beyond the first.

Null values

A null value is represented by the single byte 0x16. Writers should ensure that the encoding is only one byte, and readers should ignore any bytes beyond the first.

String values

The normal binary encoding of a JSON string value is simply the UTF-8 encoding of the string, provided that the string is nonempty and its first character is not an ASCII control character, namely 0x00-0x1f or 0x7f.

In this way, the binary encodings of most strings incur no space overhead. Further, because the CBE encoding of a string consisting of only one ASCII character remains only one byte long after encoding, the common space-saving practice of using single-character property names in JSON data translates to single-byte CBE-encoded property keys in the binary representation of objects as well.

If the JSON string to be encoded is empty or its first character is a control character, then its BESO untyped encoding is a single byte 0x7f (ASCII DEL) followed by the string’s UTF-8 encoding. While writers generally should only prepend the 0x7f prefix only to strings that start with control characters, readers should be prepared to accept any UTF-8 string escaped this way.

Strings containing base64-encoded binary data

It is common practice to embed binary data in JSON strings using base64 encoding. To reduce the space overhead incurred by base64 encoding, JSON string values that strictly conform to the base64 specification can be encoded in a compact binary form, consisting of a single 0x1f byte followed by the raw decoded binary content of the base64-encoded string.

Enumerated schema

When a JSON schema specifies an enum constraint, regardless of the presence or absence of a type constraint, the corresponding binary encoding uses compact unsigned integers to represent indices in the list of allowed values.

When the value to be encoded is one of those listed in the enum constraint, the binary encoding is its zero-based position in the enum list, serialized to a variable-length byte string in network byte order. If the first byte of this serialization is 0xff, then a leading zero byte is prepended.

The preferred encoding of the first value in the enum list is the empty byte string. Since a schema with a const property is synonymous with a single-value enum, this implies that a constant schema may be encoded as the empty string.

If the value to be encoded is not one of those allowed by the enum, i.e., a value non-conformant to the schema, then the binary encoding is a single 0xff byte followed by the untyped encoding of the value.

Consider this schema, for example:

	{
	  "enum": ["red", "amber", "green"]
	}

The following illustrates how four example values are encoded:

	"red"   -> []
	"amber" -> [0x01]
	"green" -> [0x02]
	"blue"  -> [0xff,0x62,0x6c,0x75,0x65]

Schema of integer type

When the scheme specifies a type of integer, the binary representation is specialized for maximum space-efficiency in representing binary integers, while still allowing values non-conformant to the schema to be represented.

An integer value is first zigzag-encoded and serialized into a variable-length big-endian byte string, in exactly the same way as for integers encoded with untyped schema. Then, only if the serialized integer’s first byte is 0xff, a leading zero byte is prepended.

A non-integer value not conforming to the schema is first encoded as for untyped schema, then a single 0xff byte is prepended to that encoding.

An integer with value +0 may be encoded as the empty string, and this is the preferred encoding.

With this encoding, integers between -126 and +127 require at most one byte to represent, integers between -32639 and +32639 require at most two bytes, and so on, as the following encoding examples illustrate:

	+0      -> []
	-0      -> [0x01]
	+1      -> [0x02]
	-1      -> [0x03]
	+126    -> [0xfc]
	-126    -> [0xfd]
	+127    -> [0xfe]
	-127    -> [0x00,0xff]
	+128    -> [0x01,0x00]
	-128    -> [0x01,0x01]
	+32639  -> [0xfe,0xfe]
	-32639  -> [0xfe,0xff]
	+32640  -> [0x00,0xff,0x00]
	-32640  -> [0x00,0xff,0x01]
	+32767  -> [0x00,0xff,0xfe]
	-32767  -> [0x00,0xff,0xff]
	+32768  -> [0x01,0x00,0x00]
	-32768  -> [0x01,0x00,0x01]
	null    -> [0xff,0x1e]
	"a"     -> [0xff,0x61]

Schema of number type

XXX

0x00-0x7f for first byte of integer 0x80-0xbf for first byte of base-2 exponent? 0x80-0xbf for first byte of base-10 exponent? but may be a bit complex…

alternately:

0x00-0xfc for first byte of integer 0xfd followed by floating-point with base-2 exponent 0xfe followed by floating-point with base-10 exponent 0xff for non-conformant value

alternatively:

0x00-0xff for first byte of zigzag-encoded exponent shifted left another 1 bit to select base 2 vs 10?

alternatively:

just keep it simple and use the untyped schema representation?

Schema of array type

To encode a JSON array value with a schema of type array, the array’s individual elements are first binary encoded using the appropriate sub-schema for the array’s individual items, as specified in items and additionalItems properties if present. These binary-encoded items are then individually CBE-encoded then concatenated in order. Finally, if the first byte of the resulting array encoding is either 0xfe or 0xff, then a single 0xfe byte is prepended.

To encode a JSON value other than an array, the value is first encoded using the untyped encoding, then a single 0xff byte is prepended.

Schema of object type

To encode a JSON object with a schema of type object, the properties in the object to be encoded are first classified into three categories: required, known, and unknown. Required properties are those listed in the schema’s required clause, if any. Known properties are those listed in the object schema’s properties clause. Unknown properties are any appearing in neither clause.

The binary encoding of such a JSON object consists of an optional 0xfe prefix byte, followed by an array encoding of the object’s required properties, and finally by a sequence of key/value pairs representing the known and unknown optional properties. Once the object’s properties have been serialized, if the first byte of their encoding is either 0xfe or 0xff, then a single 0xfe byte is prepended.

To encode a JSON object conforming to the schema, each of the object’s required properties is first encoded using the schema specified for that property, if any. Each required property is then individually CBE-encoded in order, just as for arrays. The number of CBE-encoded blobs in this sequence must exactly match the number of properties the schema lists as `required’. If any required properties are not actually present in the object to be encoded (implying that the object is non-conformant with the schema), then the object must be encoded as for an untyped schema with a 0xff prefix byte prepended as specified below.

Following the encodings of the required properties is a variable-length sequence of key/value pairs representing the known and unknown optional properties, in arbitrary order.

For known properties, the key is encoded as an unsigned integer index into the schema’s properties clause. Zero represents the first property the schema lists in its properties object, one represents the second property listed, and so on. This implies that while for schema validation the order in which properties are listed is not important, for JSBE encoding using a schema the order of properties is important: reordering the properties list will change the schema’s binary encoding. The unsigned integer key is then serialized into a big-endian byte string. If the first byte of the resulting string is not in the range 0x00 through 1e, then a leading zero byte is prepended. Finally, the serialized key is CBE-encoded.

For unknown properties, the property name string is first encoded as for untyped schema, resulting in either a regular string or base64 string encoding whose first byte is in the range 0x1f through 0xff. This encoded property name string is then CBE-encoded.

The value of a known property is first binary encoded using the appropriate schema for that property, then this binary encoding is CBE-encoded. The value of an unknown property is binary encoded as for untyped schema, then CBE-encoded.

For non-conformant JSON values that are not objects or that do not have all the properties the schema lists as required, the entire object is first encoded as for untyped schema, then a single 0xff byte is prepended to the encoding.

XXX some examples

Schema of other types

For schema of type boolean, null, or string, the binary encoding is identical to that of untyped schema. That is, the encodings of these types of values are not specialized and do not receive any space-efficiency benefit from having a schema. This is mainly because the untyped encodings of these JSON value types are already extremely space-efficient, and the complexity cost of specialization for these types would probably not be worth the small opportunity for further space saving.

Evolution of schema-based encodings

Backwards-compatible extensibility

Adding new optional properties to an object doesn’t break existing encodings, provided the new fields are added to the end of the properties list.

Schema negotiation

XXX

JBS Custom

JBS Custom allows the specification of customized binary representations via extensions to the JSON Schema language.

Variable-width unsigned integers

When the scheme type is integer, an encoding property of unsigned indicates that values greater than or equal to zero are to be encoded in a variable-length binary representation:

{
	"type": "integer",
	"encoding": "binary"
}

A byteOrder property may specify the byte order for the encoded integer, either bigEndian (most-significant byte first) or littleEndian (least-significant byte first).

{
	"type": "integer",
	"encoding": "binary",
	"byteOrder": "littleEndian"
}

In this encoding, the integer 65 is encoded as the one-byte stream [0x41], while the integer 256 is encoded as the byte-stream [0x00,0x01].

In strict schema validation, the byteOrder property is required. In permissive scheme validation, the byteOrder property is optional and defaults to bigEndian (network byte order).

Variable-width signed integers

When the scheme type is integer, an encoding property of zigzag indicates that signed integer values are to be zigzag-encoded first into unsigned integer values, then the latter encoded as a binary unsigned integer as described above.

{
	"type": "integer",
	"encoding": "zigzag",
}

In zigzag encoding, a positive integer value v is encoded as the unsigned integer 2v, while negative integer values v are encoded as -2v - 1. That is:

0	0
-1	1
1	2
-2	3
2	4
-2	5
etc.

Unsigned integers

When the scheme type is integer and has a minimum value greater than or equal to zero, the scheme represents an unsigned integer or natural number.

Fixed-length unsigned integers

An encoding of unsigned indicates that an integer is to be encoded in unsigned binary representation a given number of bits wide. For example, this type indicates an integer encoded in exactly one byte as an unsigned integer:

{
	"type": "integer",
	"encoding": "unsigned",
	"bits": 8
}

The `bits’ property’s value must be a nonnegative integer indicating the bit-width of this representation.

If bits is greater than 8, then the encoding property must also have a byteOrder property specifying the endianness of the representation: either bigEndian (network byte order) or littleEndian.

XXX or should encoding default to network byte order (bigEndian)? Perhaps it should for purposes of defining new formats, but a strict-specification mode for schema validators should require endianness to be specified?

An example of a 32-bit unsigned integer encoding in network byte-order:

{
	"type": "integer",
	"encoding": "unsigned",
	"byteOrder": "bigEndian",
	"bits": 32
}

The bits field need not be a power of two. For example, this integer is encoded in three bytes:

{
	"type": "integer",
	"encoding": "unsigned",
	"byteOrder": "littleEndian",
	"bits": 24
}

Note that the above examples did not specify a semantic value range via minimum or maximum, although the specified encoding can represent only a certain range of values. This is not necessarily an error: it simply means that not all semantically permitted values are representable in the specified binary representation. This corresponds to the arguably lazy but common practice in designing software, programming languages, and binary formats to postulate that some number of bits “should be enough” for all expected values even though a specific value range has not been clearly defined. Semantically allowing unrepresentable values may arguably be considered bad practice for designing new formats and schemas, however, so schema validators may want to provide a configuration option that yields warnings or errors when a fixed-length representation is specified without a specified minimum and maximum that is representable.

The following variant of the above 24-bit integer example restricts the semantic value to a range of 1 through 16,000,000, a subset of the representable value range of 0 through 16,777,215 (2^24-1).

{
	"type": "integer",
	"minimum": 1,
	"maximum": 16000000
	"encoding": "unsigned",
	"byteOrder": "littleEndian",
	"bits": 24
}

For now, bits must be a multiple of 8. We might later want to relax this to support packed bit fields and bit streams. But then we will probably also need to define a bitOrder property specifying the bit order within bytes (i.e., whether we start filling bytes from the most-significant or the least-significant bit).

Fixed-length signed integers

An encoding property of unsigned indicates that an integer is to be encoded as a fixed-width integer in standard binary two’s-complement encoding, with the most-significant bit serving as the sign bit. As with unsigned integers, a bits property is required and must be a multiple of 8 for now, and a byteOrder property is required if bits is greater than 8.

An example of an 8-bit signed integer:

{
	"type": "integer",
	"encoding": "signed",
	"bits": 8
}

Similarly, a 64-bit little-endian signed integer:

{
	"type": "integer",
	"encoding": "signed",
	"bits": 64,
	"byteOrder": "littleEndian"
}

Offset-encoded integers

Integers may have an offset property to indicate that the semantic value is the offset plus the encoded value. For example, this schema type represents semantic values between 200 through 300 inclusive, encoded as single-byte values from 0 through 100:

{
	"type": "integer",
	"minimum": 200,
	"maximum": 300,
	"offset": 200,
	"encoding": "unsigned/bigEndian",
	"bits": 8
}

When an offset is specified and is less than or equal to the minimum, an unsigned-integer representation may be used even if minimum is less than or equal to zero. This example encodes values from -100 through 100 in one byte, with byte value 0 representing semantic value -100, 100 representing 0, and 200 representing 100:

{
	"type": "integer",
	"minimum": -100,
	"maximum": 100,
	"offset": -100,
	"encoding": "unsigned",
	"bits": 8
}

Floating-point numbers

Fixed-length binary floating-point numbers

When the scheme type is number, an associated `encoding’ property may name one of the IEEE 754 standard fixed-length floating-point number representations to indicate that representation be used as the binary encoding of this number. For example:

{
	"type": "number",
	"encoding": "binary32",
}

The floating-point representations currently defined in IEEE 754-2008 are binary16, binary32, binary64, and binary128 for binary floating-point representations, and decimal32, decimal64, and decimal128 for decimal floating-point representations. More standard floating-point encodings may be defined in the future, of course.

Variable-length binary floating-point numbers

Variable-length strings

Packed arrays with fixed-length items

A schema of type array may specify an encoding of packed to indicate a packed array of fixed-length elements. This example expresses a packed array of 8-bit unsigned integers:

{
	"type": "array",
	"items": {
		"type": "integer",
		"encoding": "unsigned",
		"bits": 8
	}
}

The array element schema specified in the items property must have a fixed-length binary representation for packed array encoding.

If the array also has a fixed length, i.e., minItems and maxItems values both specified and equal, then the resulting packed array has a fixed-length binary representation, which is exactly the element type length times the array length.

CBE-encoded arrays with variable-length items

A schema of type array with an encoding of cbe indicates that the array has a binary representation consisting of a sequence of variable-length items, each individually CBE-encoded.

Comparison with other popular binary formats

Avro, Protobuf, Thrift.

Non-schematic: CBOR, BSON, UBJSON, etc.



Bryan Ford