Canonicalization

mvmforge makes a strong promise: given the same Workload IR, every layer of the toolchain produces the same bytes. Two SDKs declaring the same workload emit byte-identical canonical IR. Two compiles over the same IR + source produce byte-identical artifact directories.

That promise rests on a single canonicalization algorithm: RFC 8785 (JSON Canonicalization Scheme, JCS).

What canonicalization buys us

Without it, we couldn’t tell whether two SDK outputs were “the same”. JSON is forgiving: {"a":1,"b":2} and {"b":2, "a":1} are equivalent objects, but they’re different bytes. SHA-256 over those gives two different hashes. Cross-SDK conformance tests would devolve into structural-equivalence comparators that drift over time.

By canonicalizing first, we collapse equivalent JSON to the same bytes. From there, byte-equality is a real test you can run in CI without reasoning about JSON edge cases.
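As a small illustration of the point above, the following Python sketch hashes the two example documents before and after canonicalization. It reuses the `json.dumps` approach shown for the Python SDK later on this page; the variable names are illustrative only and not part of mvmforge.

```python
import hashlib
import json


def canonicalize(obj) -> str:
    # Sorted keys, compact separators, non-ASCII characters left unescaped.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Two equivalent JSON objects that differ only in key order and whitespace.
a = '{"a":1,"b":2}'
b = '{"b":2, "a":1}'

# Hashing the raw text: different bytes, so different digests.
raw_a = hashlib.sha256(a.encode("utf-8")).hexdigest()
raw_b = hashlib.sha256(b.encode("utf-8")).hexdigest()

# Hashing after canonicalization: both collapse to the same bytes.
canon_a = canonicalize(json.loads(a))
canon_b = canonicalize(json.loads(b))

print(raw_a == raw_b)      # False
print(canon_a == canon_b)  # True
print(canon_a)             # {"a":1,"b":2}
```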

What mvmforge canonicalize does

For our use case, canonicalization is a recursive walk:

  1. Objects. Sort keys lexicographically by their UTF-8 byte representation. (RFC 8785 itself orders keys by UTF-16 code units; the two orderings agree for ASCII-only keys.) Emit each key, a colon, the canonicalized value, and a comma between entries.
  2. Arrays. Preserve order. Emit each value canonicalized, comma-separated.
  3. Strings. Standard JSON escape rules: \", \\, control characters as \b, \f, \n, \r, \t, or \uXXXX.
  4. Numbers. v0 IR uses only integer numerics. Each integer is emitted as its decimal representation with no leading zeros.
  5. Booleans / null. Literal true, false, null.
  6. No whitespace. Compact output throughout.

Output is deterministic: same input bytes → same output bytes, on every host OS.
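The six rules above can be sketched as an explicit recursive function. This is a minimal Python illustration under the integer-only assumption, not the Rust implementation; the name `canonicalize_walk` is hypothetical, and string escaping is delegated to `json.dumps`.

```python
import json


def canonicalize_walk(value) -> str:
    # Rule 5: booleans and null. Check bool before int, because in Python
    # bool is a subclass of int.
    if value is None:
        return "null"
    if value is True:
        return "true"
    if value is False:
        return "false"
    # Rule 4: integers as plain decimal (floats are rejected below).
    if isinstance(value, int):
        return str(value)
    # Rule 3: strings use the standard JSON escapes.
    if isinstance(value, str):
        return json.dumps(value, ensure_ascii=False)
    # Rule 2: arrays preserve order.
    if isinstance(value, list):
        return "[" + ",".join(canonicalize_walk(v) for v in value) + "]"
    # Rule 1: objects sort keys by their UTF-8 bytes (string keys assumed).
    if isinstance(value, dict):
        items = sorted(value.items(), key=lambda kv: kv[0].encode("utf-8"))
        body = ",".join(
            json.dumps(k, ensure_ascii=False) + ":" + canonicalize_walk(v)
            for k, v in items
        )
        return "{" + body + "}"
    raise TypeError(f"cannot canonicalize {type(value).__name__}")


doc = {"b": [2, True, None], "a": "x\n"}
print(canonicalize_walk(doc))  # {"a":"x\n","b":[2,true,null]}
```

Rule 6 (no whitespace) falls out of the string concatenation: nothing is ever emitted between tokens.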

Where canonicalization happens

The toolchain's canonicalization implementation lives in crates/mvmforge-ir/src/canonicalize.rs. Every layer that produces canonical output routes through it, with the SDKs as the one exception noted below:

  • mvmforge canonicalize — direct user-facing command for normalizing fixtures.
  • mvmforge emit — SDKs emit canonical IR (each SDK has its own implementation; cross-language byte-identity is verified by the golden corpus).
  • mvmforge validate — error envelopes are canonical JSON.
  • mvmforge compile — launch.json is canonical; the schema file schema/workload-ir-v0.json is canonical; tree_hash over the source bundle uses a precise byte format defined in ADR-0008 §6.

The Python SDK’s canonicalize

For reference, the Python SDK’s implementation is roughly:

import json


def canonicalize(obj) -> str:
    # Sorted keys, compact separators, non-ASCII characters left unescaped.
    return json.dumps(
        obj,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )

This works because v0 IR has only integer numerics — Python’s json module gets ECMA-262 number serialization right for ints but not for floats. If the IR ever admits floats, every SDK’s canonicalize will need a real RFC 8785 numeric canonicalization step.
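Because the `json.dumps` approach is only safe for integers, one defensive option is to reject floats before serializing. The sketch below is an assumption about how such a guard could look, not the SDK's actual code; `assert_integer_only` and `canonicalize_strict` are hypothetical names.

```python
import json


def assert_integer_only(obj, path="$"):
    """Raise TypeError if obj contains a float or an unsupported type."""
    # bool, int, str, and None are all safe to serialize as-is.
    if obj is None or isinstance(obj, (bool, int, str)):
        return
    if isinstance(obj, float):
        raise TypeError(f"{path}: float {obj!r} is not allowed in integer-only IR")
    if isinstance(obj, list):
        for i, item in enumerate(obj):
            assert_integer_only(item, f"{path}[{i}]")
        return
    if isinstance(obj, dict):
        for key, item in obj.items():
            assert_integer_only(item, f"{path}.{key}")
        return
    raise TypeError(f"{path}: unsupported type {type(obj).__name__}")


def canonicalize_strict(obj) -> str:
    # Validate first, then serialize exactly as the plain version does.
    assert_integer_only(obj)
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


print(canonicalize_strict({"memory_mb": 512, "cpu_cores": 2}))
# {"cpu_cores":2,"memory_mb":512}
```

With a guard like this, a float that slips into the IR fails loudly at canonicalization time instead of silently producing bytes that another SDK might format differently.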

The TypeScript SDK’s canonicalize

function canonicalize(value: unknown): string {
  if (value === null) return "null";
  if (typeof value === "boolean") return value ? "true" : "false";
  if (typeof value === "number") {
    if (!Number.isFinite(value)) {
      throw new Error("cannot canonicalize non-finite number");
    }
    return JSON.stringify(value);
  }
  if (typeof value === "string") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return (
      "{" +
      keys
        .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]))
        .join(",") +
      "}"
    );
  }
  throw new Error(`cannot canonicalize ${typeof value}`);
}

JSON.stringify of strings and integers is RFC 8785-compatible for our subset.

ir_hash and tree_hash

Two derived values use canonicalization as their input:

  • ir_hash — SHA-256 hex of the canonical IR. Carried in launch.json and the mvmforge.metadata.ir_hash flake attribute. Two semantically equivalent IR documents produce the same ir_hash.
  • tree_hash — SHA-256 hex over a precise byte sequence describing the bundled source tree. Format: per-entry <kind> <oct_mode> <relpath>\x00<content_record>\n, sorted by relative path. Defined in ADR-0008 §6.

Both are content-addressed: change the IR or the source, and the hash changes deterministically.
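A minimal sketch of the ir_hash computation, assuming the canonical string is encoded as UTF-8 before hashing (RFC 8785 specifies UTF-8 output). The function name `ir_hash` and the sample documents are illustrative; tree_hash is omitted because its record format is defined in ADR-0008 §6 and not reproduced here.

```python
import hashlib
import json


def canonicalize(obj) -> str:
    # Same json.dumps-based approach as the Python SDK sketch on this page.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def ir_hash(ir) -> str:
    # SHA-256 hex digest over the canonical bytes (UTF-8 encoding assumed).
    return hashlib.sha256(canonicalize(ir).encode("utf-8")).hexdigest()


# Same content, different key order: identical hash.
ir_a = {"cpu_cores": 2, "memory_mb": 512}
ir_b = {"memory_mb": 512, "cpu_cores": 2}

# A single changed value: different hash.
ir_c = {"cpu_cores": 4, "memory_mb": 512}

print(ir_hash(ir_a) == ir_hash(ir_b))  # True
print(ir_hash(ir_a) == ir_hash(ir_c))  # False
```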

Why integer-only numerics matter

RFC 8785’s hardest part is ECMA-262 number canonicalization (e.g., 1e10, 10000000000, and 10000000000.0 all denote the same value, but only one of them is the canonical form). v0 IR sidesteps the entire problem by admitting only integer numeric types: cpu_cores: u16, memory_mb: u32, rootfs_size_mb: u32, guest: u16, host: u16, size_mb: u32. Decimal integer formatting is trivial and identical across every JSON library.
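To make the float problem concrete, this short Python sketch round-trips the three spellings through the standard `json` module. It only demonstrates Python's default behavior; the variable names are illustrative.

```python
import json

# Three spellings of the same numeric value.
spellings = ["1e10", "10000000000", "10000000000.0"]

# Parse each spelling and serialize it again with Python's json module.
outputs = [json.dumps(json.loads(text)) for text in spellings]

for text, out in zip(spellings, outputs):
    print(f"{text} -> {out}")
# 1e10 -> 10000000000.0
# 10000000000 -> 10000000000
# 10000000000.0 -> 10000000000.0
```

Python parses the first and third spellings as floats and writes them back with a trailing `.0`, whereas ECMA-262 number formatting (and therefore RFC 8785) yields `10000000000` for all three. Restricting the IR to integers avoids this divergence.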

Any future IR change that admits floats requires extending the canonicalize implementations (per ADR-0002) and is treated as a MAJOR IR bump.