JSON Schema has been in development since ca. 2009 and has gone through several iterations. Yet, there is still no IETF RFC anyone could really lean on as a standard. Practitioners are largely using "Draft 7" of JSON Schema and the subsequent releases have seen comparatively little adoption.
The clarity and precision of the specification text have been a persistent problem, to the point that the JSON Schema project has stood up a website that explains the spec in more detail than the spec itself.
Structured metadata is rapidly becoming more important in the world of APIs and LLMs. Large language models work better with structured data when they are given rich context information, and schema documents can provide that context.
JSON Schema should play a big role in this context, but its complexity, its ambiguity, and the lack of a finalized standard are major obstacles. Efforts like OpenAPI lean on subsets of JSON Schema and had to invent their own extensions to cover gaps. No two JSON Schema code generators agree on how conditional JSON Schema structures should map to code, including the generators built for OpenAPI.
Worse yet, all of these tools must give up on complex JSON Schema constructs at some point, and that point differs from implementation to implementation. It is close to impossible to write JSON Schemas that can be used reliably for code generation across platforms and languages unless you scope out a substantial portion of the JSON Schema language.
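To make the ambiguity concrete, here is a minimal sketch. The schema and both candidate mappings are illustrative assumptions, not the output of any particular generator, but real tools diverge along exactly these lines when they encounter `oneOf`:

```typescript
// A hypothetical schema using oneOf to describe two alternative shapes.
const petSchema = {
  oneOf: [
    {
      type: "object",
      properties: { bark: { type: "boolean" } },
      required: ["bark"],
    },
    {
      type: "object",
      properties: { meow: { type: "boolean" } },
      required: ["meow"],
    },
  ],
} as const;

// Generator A might emit a union of two distinct types:
type PetAsUnion = { bark: boolean } | { meow: boolean };

// Generator B might flatten the alternatives into a single type
// with optional members, losing the either/or semantics:
interface PetAsFlattened {
  bark?: boolean;
  meow?: boolean;
}
```

Neither mapping is wrong; the spec simply does not say which one is intended, so every toolchain picks its own.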
There are two major use cases for JSON Schema and for schema languages in general:
- Users want to validate JSON data against a schema to ensure that the data conforms to a specific structure and set of constraints.
- Users want to declare data types and structures in a machine-readable format that can be used to generate code, documentation, or other artifacts in a cross-platform, cross-language way.
JSON Schema has enormously powerful facilities for the first use case right in its core. All of that power comes at the expense of the second.
The existing drafts of JSON Schema define a pattern-matching language for schema processors that is applied to JSON data as it is being validated. It is not a data definition language. It is a validation language that embeds elements that only look like data definition capabilities. An object declaration in JSON Schema is a matching expression for a JSON object that contains the properties defined in the schema; it does not define an object type.
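A small illustration of that distinction, using an assumed `person` schema: by default, a schema with `properties` only constrains the members it names, so an instance carrying extra data still matches.

```typescript
// An assumed schema that "looks like" a type definition.
const personSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
  },
  required: ["name"],
} as const;

// This instance matches the schema even though it carries extra members,
// because the schema is a constraint on what must be present, not a
// closed type definition. Opting into closed-world behavior requires an
// explicit "additionalProperties": false on every object schema.
const instance = {
  name: "Ada",
  shoeSize: 42,
  favoriteColor: "green",
};
```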
Conditional composition constructs like `allOf`, `anyOf`, `oneOf`, `not`, and `if`/`then`/`else` are defined in the core schema language. As powerful as they are, conditional composition of data types is generally not a thing in databases or programming languages, which means that any use of these constructs makes mapping from and to code and databases hard and in many cases impossible while preserving the schema semantics.
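As a sketch of why these constructs resist mapping, consider an assumed schema that uses `if`/`then`/`else`. Most type systems have no construct where the constraints on one field depend on the runtime value of another, so a generator has to approximate and drop information:

```typescript
// An assumed schema: if country is "US", postalCode must match a ZIP
// pattern; otherwise postalCode is free-form, non-empty text.
const addressSchema = {
  type: "object",
  properties: {
    country: { type: "string" },
    postalCode: { type: "string" },
  },
  if: {
    properties: { country: { const: "US" } },
  },
  then: {
    properties: { postalCode: { pattern: "^[0-9]{5}(-[0-9]{4})?$" } },
  },
  else: {
    properties: { postalCode: { minLength: 1 } },
  },
} as const;

// A typical generated type silently discards the conditional logic:
interface Address {
  country?: string;
  postalCode?: string; // the pattern/minLength distinction is lost
}
```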
JSON Schema allows `$ref` to reference arbitrary JSON nodes (each of which is then treated as a schema) in the same or an external document, which adds to the complexity. A single JSON Schema might have dozens of external links to content, strewn across the document, making it very difficult to understand as well as hard and potentially unsafe to process.
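The assumed snippet below shows the kind of deep, cross-document reference the spec permits; the URI and JSON Pointer are made up for illustration.

```typescript
// An assumed schema whose property resolves to a node buried deep inside
// a different document. Understanding "street" now requires fetching and
// navigating that external file.
const orderSchema = {
  type: "object",
  properties: {
    street: {
      $ref: "https://example.com/schemas/customer.json#/properties/address/properties/street",
    },
    // ...more properties, each potentially pointing somewhere else
  },
} as const;
```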
There are also confusing conflicts and overlaps. For instance, JSON Schema has a concept of a type union that can only be used for primitive types. Users frequently side-step that limitation with a `oneOf` construct that behaves like a type union for validation purposes, so there are, in effect, two type-union constructs.
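Both illustrative forms below accept exactly the same instances, a string or a number, from a validator's point of view, which is what makes the overlap confusing:

```typescript
// Variant 1: the built-in union of primitive type names.
const idSchemaA = {
  type: ["string", "number"],
} as const;

// Variant 2: the same acceptance behavior expressed through oneOf.
const idSchemaB = {
  oneOf: [{ type: "string" }, { type: "number" }],
} as const;
```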
Enumerations are first-class constructs in many programming languages and generally map symbols to values of a single type. In JSON Schema, `enum` is just another constraint applied to a schema; its values are not restricted to the schema's declared type and may be of mixed types.
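For contrast, an illustrative sketch: the first schema below is legal JSON Schema with three different value types in one `enum`, and in the second, `type` and `enum` are evaluated as independent constraints rather than one coherent enumeration.

```typescript
// Legal JSON Schema: an enum mixing a string, a number, and null.
const mixedEnumSchema = {
  enum: ["red", 1, null],
} as const;

// Also legal: type and enum are separate constraints. The value 1 can
// never validate here because it fails the type check, but nothing in
// the schema language flags the inconsistency.
const inconsistentSchema = {
  type: "string",
  enum: ["red", "green", 1],
} as const;

// Compare with a programming-language enum, which binds symbols to
// values of a single type:
enum Color {
  Red = "red",
  Green = "green",
}
```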
There are further issues with the JSON Schema spec, such as the confusing existence of embedded subschemas, or the question of why "vocabularies" for meta-schemas are special-cased instead of simply being schemas themselves. This document is not meant to be an exhaustive list of problems.
JSON Schema, as it stands, is a powerful JSON validation language, but a very poor data definition language.
For APIs, databases, LLMs, and code generation, the industry needs a great data definition language that can also be used for validation. The current spec's priorities are upside down relative to what the vast majority of practical future applications of JSON Schema will need.
This set of documents, a proposal for a "new JSON Schema", completely refactors JSON Schema into:
- a core data definition language that maps from and to code and databases in a straightforward way and also takes typical type reuse and extensibility patterns into account.
- a set of optional extensions that provide the powerful composition and validation capabilities that JSON Schema is known for.
In addition, new JSON Schema has a vastly expanded built-in type system that is not limited to the JSON primitives and includes many extended types that are relevant for modern data processing. The type system also directly addresses common pitfalls of the JSON primitives, such as the limited range and precision of numbers.
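One concrete pitfall of the JSON primitives that an expanded type system can address: JSON's single number type is commonly parsed into an IEEE-754 double, which cannot represent all 64-bit integers. The snippet below only demonstrates the precision loss; it does not show the proposal's own syntax.

```typescript
// 2^53 + 1 cannot be represented exactly as an IEEE-754 double.
const asDouble = Number("9007199254740993");
console.log(asDouble); // 9007199254740992 -- off by one

// Preserving the full value requires a dedicated integer representation.
const asBigInt = BigInt("9007199254740993");
console.log(asBigInt.toString()); // "9007199254740993"
```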
Optional extensions directly support multi-language documentation and alternate names and descriptions for properties and types as well as annotations for scientific units and currencies based on international standards.
Start with the "primer" document from which this post is an excerpt.