Initial proposal for binary protocol (#2)

SergeyKanzhelev · web-flow · commit 571cafae5636 · 2019-04-22T15:30:11.000-07:00
* move from https://github.com/w3c/trace-context/pull/215/files * added MUST for ordering * added a link to 256 limit and explained why spec is taking about byte arrays * use trace-flags to describe the field * addressed PRfeedback and more * added note about endianess of bytes * Fix typo (#3) * Update spec/20-binary-format.md
diff --git a/index.html b/index.html
@@ -54,6 +54,7 @@
         <section id='abstract' data-include="spec/01-abstract.md" data-include-format='markdown'></section>
         <section id='sotd' data-include="spec/02-sotd.md" data-include-format='markdown'></section>
 
-        <section data-include="spec/20-BINARY_FORMAT.md" data-include-format='markdown'></section>
+        <section data-include="spec/20-binary-format.md" data-include-format='markdown'></section>
+        <section data-include="spec/31-parsing-algoritm.md" data-include-format='markdown' class="informative"></section>
   </body>
 </html>
diff --git a/spec/20-BINARY_FORMAT.md b/spec/20-BINARY_FORMAT.md
diff --git a/spec/20-binary-format.md b/spec/20-binary-format.md
@@ -0,0 +1,106 @@
+# Binary format
+
+Binary format document describes how to encode each field - `traceparent` and
+`tracestate`. The binary format should be used to encode the values of these
+fields. This specification does not specify how these fields should be stored
+and sent as a part of a binary payload. The basic implementation may serialize
+those as size of the field followed by the value.
+
+Specification operates with bytes - unsigned 8-bit integer values
+representing values from `0` to `255`. Byte representation as a set of
+bits (big or little endian) MUST be defined by underlying platform and
+out of scope of this specification.
+
+## `Traceparent` binary format
+
+The field `traceparent` encodes the version of the protocol and fields
+`trace-id`, `parent-id` and `trace-flags`. Each field starts with the one byte
+field identifier with the field value following immediately after it. Field
+identifiers are used as markers for additional verification of the value
+consistency and may be used in future for the versioning of the `traceparent`
+field.
+
+``` abnf
+traceparent     = version version_format  
+version         = 1BYTE                   ; version is 0 in the current spec
+version_format  = "{ 0x0 }" trace-id "{ 0x1 }" parent-id "{ 0x2 }" trace-flags
+trace-id        = 16BYTES
+parent-id       = 8BYTES
+trace-flags     = 1BYTE  ; only the least significant bit is used
+```
+
+Unknown field identifier (anything beyond `0`, `1` and `2`) should be treated as
+invalid `traceparent`. All zeroes in `trace-id` and `parent-id` invalidates the
+`traceparent` as well.
+
+## Serialization of `traceparent`
+
+Implementation MUST serialize fields into the field ordering sequence.
+In other words, `trace-id` field should be serialized first, `parent-id`
+second and `trace-flags` - third.
+
+Field identifiers should be treated as unsigned byte numbers and should be
+encoded in big-endian bit order.
+
+Fields `trace-id` and `parent-id` are defined as a byte arrays, NOT a
+long numbers. First element of an array MUST be copied first. When array is
+represented as a memory block of 16 bytes - serialization of `trace-id`
+would be identical to `memcpy` method call on that memory block. This
+may be a concern for implementations casting these fields to integers -
+protocol is NOT defining whether those byte arrays are ordered as big
+endian or little endian and have a sign bit.
+
+If padding of the field is required (`traceparent` needs to be serialized into
+the bigger buffer) - any number of bytes can be appended to the end of the
+serialized value.
+
+## `traceparent` example
+
+``` js
+{0,
+  0, 75, 249, 47, 53, 119, 179, 77, 166, 163, 206, 146, 157, 0, 14, 71, 54,
+  1, 52, 240, 103, 170, 11, 169, 2, 183,
+  2, 1}
+```
+
+This corresponds to:
+
+- `trace-id` is
+  `{75, 249, 47, 53, 119, 179, 77, 166, 163, 206, 146, 157, 0, 14, 71, 54}` or
+  `4bf92f3577b34da6a3ce929d000e4736`.
+- `parent-id` is `{52, 240, 103, 170, 11, 169, 2, 183}` or `34f067aa0ba902b7`.
+- `trace-flags` is `1` with the meaning `recorded` is true.
+
+## `tracestate` binary format
+
+List of up to 32 name-value pairs. Each list member starts with the 1 byte field
+identifier `0`. The format of list member is a single byte key length followed
+by the key value and single byte value length followed by the encoded
+value. Note, single byte length field allows keys and values up to 256
+bytes long. This limit is defined by [trace
+context](https://w3c.github.io/trace-context/#header-value)
+specification. Strings are transmitted in ASCII encoding.
+
+``` abnf
+tracestate      = list-member 0*31( list-member )
+list-member     = "0" key-len key value-len value
+key-len         = 1BYTE ; length of the key string
+value-len       = 1BYTE ; length of the value string
+```
+
+Zero length key (`key-len == 0`) indicates the end of the `tracestate`. So when
+`tracestate` should be serialized into the buffer that is longer than it
+requires - `{ 0, 0 }` (field id `0` and key-len `0`) will indicate the end of
+the `tracestate`.
+
+## `tracestate` example
+
+``` js
+{ 0,  3,  102, 111, 111,  16,  51, 52, 102, 48, 54, 55, 97, 97, 48, 98, 97, 57, 48, 50, 98, 55,
+  0,  3,   98,  97, 114,  4,   48, 46, 50, 53,  }
+
+```
+
+This corresponds to 2 tracestate entries:
+
+`foo=34f067aa0ba902b7,bar=0.25`
diff --git a/spec/21-binary-format-rationale.md b/spec/21-binary-format-rationale.md
@@ -0,0 +1,25 @@
+# Rationale for decision on binary format
+
+Binary format is similar to proto encoding without any reference on
+protobuf project. It uses field identifiers in bytes in front of field
+values.
+
+## Field identifiers
+
+Protocol uses field identifiers for fields like `trace-id`, `parent-id`,
+`trace-flags` and tracestate entries. The purpose of the field
+identifiers is two-fold. First, allow to remove existing fields or add
+new ones going forward. Second, provides an additional layer of
+validation of the format.
+
+## How can we add new fields
+
+If we follow the rules that we always append the new ids at the end of the
+buffer we can add up to 127. After that we can either use varint encoding or
+just reserve 255 as a continuation byte. Assumption at the moment is
+that specification will never get to this point.
+
+## Why custom binary protocol
+
+We didn't find non-proprietary wide used binary protocol that can be
+used in this specification.
diff --git a/spec/31-parsing-algoritm.md b/spec/31-parsing-algoritm.md
@@ -0,0 +1,86 @@
+# De-serialization algorithms
+
+This is non-normative section that describe de-serialization algorithm
+that may be used to parse `traceparent` and `tracestate` field values.
+
+## De-serialization of `traceparent`
+
+Let's assume the algorithm takes a buffer - bytes array - and can set
+and shift cursor in the buffer as well as validate whether the end of
+the buffer was reached or will be reached after reading the given number
+of bytes. This algorithm can work on stream of bytes. De-serialization
+of `traceparent` MAY be done in the following sequence:
+
+1. If buffer is empty - RETURN invalid status `BUFFER_EMPTY`. Set a cursor to
+   the first byte.
+2. Read the `version` byte at the cursor position. Shift cursor to `1` byte.
+3. If at the end of the buffer RETURN invalid status `TRACEPARENT_INCOMPLETE`.
+4. **Parse `trace-id`**. Read the field identifier byte at the cursor
+   position. If NOT `0` - go to step `8. Report invalid field`.
+   Otherwise - check that remaining buffer size is more or equal to `16`
+   bytes. If shorter - RETURN invalid status `TRACE_ID_TOO_SHORT`.
+   Otherwise read the next `16` bytes for `trace-id` and shift cursor to
+   the end of those `16` bytes.
+5. **Parse `trace-id`**. Read the field identifier byte at the cursor
+   position. If NOT `1` - go to step `8. Report invalid field`.
+   Otherwise - check that remaining buffer size is more or equal to `8`
+   bytes. If shorter - RETURN invalid status `PARENT_ID_TOO_SHORT`.
+   Otherwise read the next `8` bytes for `parent-id` and shift cursor
+   to the end of those `8` bytes.
+6. **Parse `trace-id`**. Read the field identifier byte at the cursor
+   position. If NOT `2` - go to step `8. Report invalid field`.
+   Otherwise - check the remaining size of the buffer. If at the end of
+   the buffer - RETURN invalid status. Otherwise - read the
+   `trace-flags` byte. Least significant bit will represent `recorded`
+   value.
+7. RETURN status `OK` if `version` is `0` or status `DOWNGRADED_TO_ZERO`
+   otherwise.
+8. **Report invalid field**.  If `version` is `0` RETURN invalid status
+   `INVALID_FIELD_ID`. If `version` has any other value -
+   `INCOMPATIBLE_VERSION`
+
+_Note_, that invalid status names are given for readability and not part of the
+specification.
+
+_Note_, that parsing should not treat any additional bytes in the end of the
+buffer as an invalid status. Those fields can be added for padding purposes.
+Optionally implementation can check that the buffer is longer than `29` bytes as
+a very first step if this check is not expensive.
+
+## De-serialization of `tracestate`
+
+Let's assume the algorithm takes a buffer - bytes array - and can set
+and shift cursor in the buffer as well as validate whether the end of
+the buffer was reached or will be reached after reading the given number
+of bytes. Algorithm also uses `version` value parsed from `traceparent`.
+If `version` was not given - value `0` SHOULD be used. This algorithm
+can work on stream of bytes. De-serialization of `tracestate` MAY be
+done in the following sequence:
+
+1. If at the end of the buffer - RETURN status `OK`. Otherwise set a
+   cursor to the first byte.
+2. **Parse `list-member` field identifier**. Read the field identifier
+   byte at the cursor position and shift cursor to `1` byte. If NOT `0`
+   and `version` is `0` RETURN invalid status `INVALID_FIELD_ID`. If NOT
+   `0` and `version` has any other value - `INCOMPATIBLE_VERSION`.
+3. **Parse key**.
+   1. If at the end of the buffer - RETURN status `OK`. This situation
+      indicates that `tracestate` value was padded with `0`.
+   2. Read the `key-len` byte. Shift cursor to `1` byte. If the value of
+      `key-len` is `0` - RETURN status `OK`. This situation indicates an
+      explicit end of a key.
+   3. Check that buffer has `key-len` more bytes. If not - RETURN
+      `KEY_TOO_SHORT`.
+   4. Read `key-len` bytes as `key`. Shift cursor to `key-len` bytes.
+4. **Parse value**.
+   1. If at the end of the buffer - RETURN status `INCOMPLETE_LIST_MEMBER`.
+   2. Read the `value-len` byte. Shift cursor to `1` byte. If the value of
+      `value-len` is `0` - add `list-member` with the `key` and empty
+      `value` to the `tracestate` list. RETURN status `OK`.
+   3. Check that buffer has `value-len` more bytes. If not - RETURN
+      `VALUE_TOO_SHORT`.
+   4. Read `value-len` bytes as `value`. Shift cursor to `value-len`
+      bytes.
+   5. Add `list-member` with the `key` and `value` to the `tracestate`
+      list.
+5. Go to step `2. Parse list-member field identifier`.