In earlier blog posts we have looked at some of the popular schema frameworks that are commonly (although not exclusively) associated with Kafka - Avro, Protobuf and JSON Schema.
This blog post will consider some aspects of choosing a schema framework (part one), and look at the Schema Registry and the role it plays in organising, communicating and enforcing your schemas (part two).
This blog post is supported by a working example with source code that can be found in this GitHub repository.
The application collects weather reports from a number of weather stations and publishes them to Kafka according to Avro, Protobuf and JSON Schema schemas. The application also consumes the same messages, an unlikely use case in reality but useful to illustrate deserialization.
The example is built using Quarkus and the focus of this blog is heavily skewed towards Java and the Java tooling. Although there is support for schemas and schema registry in other languages (libserdes for example) this is considered out of scope.
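As a rough sketch of what producing such messages can look like in a Quarkus application, the example below uses SmallRye Reactive Messaging's Emitter API. The channel name, the WeatherReport stand-in type and the jakarta imports (recent Quarkus versions) are assumptions for illustration; the repository may wire its producers differently.

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.reactive.messaging.Channel;
import org.eclipse.microprofile.reactive.messaging.Emitter;

// Stand-in domain type for illustration; the real domain classes live in the repository.
record WeatherReport(String stationId, double precipitationRate) {}

@ApplicationScoped
public class WeatherReportProducer {

    // "weather-json" is an assumed channel name; the real channel names and
    // serializer configuration live in the application's configuration.
    @Inject
    @Channel("weather-json")
    Emitter<WeatherReport> emitter;

    public void publish(WeatherReport report) {
        // SmallRye hands the payload to the configured serializer and Kafka client.
        emitter.send(report);
    }
}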
When data is structured there is a need to be able to express that structure, for example using COBOL copybooks, SOAP/XSD, or OpenAPI specifications. A schema is one way to express a data structure, providing context about the data itself and constraints for that data.
In our weather report example the schemas express a structure that contains information regarding the location of the weather station and observations of the weather. The schema provides context where, for example, a precipitationRate observation is described as "Volume of rain in the last hour (mm)". Constraints can be provided, for example a visibility reading is expressed as an enum to restrict what could otherwise be quite a subjective reading.
Using JSON Schema to produce a Weather Schema results in a human readable schema document. This is very similar to how most public APIs are expressed (using OpenAPI for example) where one aspect is accessibility for those who will consume the API.
A quick look at the schema shows that it contains an identifier:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://se.martin/weather-schema.json",
  "title": "Weather",
  "description": "A schema for weather readings"
}
The data structure is expressed in a verbose manner and it is easy to add a description and examples to aid the consumer.
"location": {
"description": "location of reading",
"type": "object",
"properties": {
"name": {
"description": "Common name for station",
"type": "string",
"examples": ["Stockholm Waether Station 1"]
},
"required": [
"stationId", "latitude", "longitude"
]
}
If your need is to express your data structure to a wide audience then JSON Schema may well be your best fit. Most developers, and even non-technical readers, can follow a JSON schema. For example, if you are defining schemas for an event that is shared by multiple systems or units in a large organisation, it may be preferable to express this using a JSON schema.
JSON Schema is not associated with, and does not provide, tooling for serialization or deserialization. In this example Quarkus is using the well known and loved Jackson library to map between Java objects and JSON. In part 2 we will look in more depth at the role of the schema in serialization and deserialization, but note here that the Jackson library itself does not provide any validation of the JSON against the JSON schema.
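A minimal sketch of this point is shown below: Jackson happily serializes whatever the object contains and never consults weather-schema.json. The WeatherReport record and its fields are stand-ins for the domain class in the repository.

import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonNoValidationExample {

    // Stand-in domain type; visibility is a plain String here, so nothing stops
    // a value that is outside the enum defined in the JSON Schema.
    public record WeatherReport(String stationId, String visibility) {}

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Serialization succeeds even though "crystal_clear" violates the schema's
        // visibility enum - Jackson never validates against the schema.
        String json = mapper.writeValueAsString(new WeatherReport("SE-001", "crystal_clear"));
        System.out.println(json);
    }
}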
In addition, the Quarkus framework uses the SmallRye Reactive Messaging framework, which in turn uses Jackson and a Kafka client to serialize data. This complicates matters should you wish to write simple unit tests to ensure schema validation is being performed (more in part 2).
One aspect of serializing to JSON is that the serialized data is also human readable - if a user has access to a topic through tools such as KafkaCat or Confluent Control Center the data will be easy to access (see the screenshot below). If you are storing confidential data you may need to consider further steps to ensure the data is protected.
Using Apache Avro to express a Weather Reading schema results in another human readable document. The schema itself contains an identifier:
{
  "namespace": "se.martin.weather.avro",
  "name": "WeatherReading"
}
Fields are expressed in a similar way, providing doc fields for documentation and providing constraints, as in the example below where an enum constrains the values of a field and the type can be used to distinguish between optional and mandatory fields:
{
  "name": "visibility",
  "type": [
    "null",
    {
      "namespace": "se.martin.weather.avro",
      "name": "Visibility",
      "type": "enum",
      "symbols": [
        "good",
        "average",
        "poor",
        "total_utter_darkness"
      ],
      "doc": "Perceived visibility measurement"
    }
  ]
}
Quarkus has been configured to generate Java classes from the schema (once the project is built the generated Java classes can be found in the /build/classes/java/quarkus-generated-sources/avsc folder).
One useful feature of these generated classes is that they perform validation against the Avro schema. In this example the WeatherReadingMapper class encapsulates the mapping from the domain object to the Avro WeatherReading class, and this class is unit testable (see WeatherReadingMapperTest). This allows us to find potential errors at build time, which is preferable to runtime, especially in production when running at scale.
Should the mapping fail there will be hints as to why (see below):
Field stationId type:STRING pos:1 does not accept null values
org.apache.avro.AvroRuntimeException: Field stationId type:STRING pos:1 does not accept null values
at org.apache.avro.data.RecordBuilderBase.validate(RecordBuilderBase.java:91)
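As a minimal sketch, this behaviour can also be asserted directly against the generated builder in a unit test. The Location record and its setStationId method are assumptions based on the error above and the structure of the schemas; the exact generated class names in the repository may differ.

import org.apache.avro.AvroRuntimeException;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertThrows;

class AvroValidationTest {

    @Test
    void nullRequiredFieldIsRejectedByTheGeneratedBuilder() {
        // The Avro-generated builder validates fields against the schema,
        // so a null in a required string field fails fast in a unit test.
        assertThrows(AvroRuntimeException.class, () ->
                se.martin.weather.avro.Location.newBuilder()
                        .setStationId(null)
                        .build());
    }
}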
In this example Avro is serializing to a binary format (though JSON could be used), which, as illustrated below, makes it more difficult, but not impossible, to read the data from the topic. This offers some protection against unwanted access but will also make debugging more difficult.
Expressing our weather readings using Protobuf results in a more technical format. The schema contains messages which are namespaced by a domain name (or package). Fields are assigned a field number which should not be changed (more on this in part 2). Some constraints can be applied, such as optional fields and enumerations. Documentation is in the form of comments in the schema.
// A schema for weather reports
message WeatherReport {
  string recordingId = 1;                   // A unique id for the recording
  Location location = 2;                    // location of reading
  string observationTimeUtc = 3;            // Time of recording (UTC)
  optional Observations observations = 4;   // Measurements taken in the reading
}
Quarkus has been used to generate Java classes from the Protobuf schema (after the build these can be found in the /build/classes/java/quarkus-generated-sources/grpc folder). As in the Avro case it is possible to encapsulate the mapping to and from the Protobuf-generated object and to write unit tests (see the WeatherReportMapper and WeatherReportMapperTest classes).
There is some validation against the schema; however, it can surface as the dreaded NPE:
java.lang.NullPointerException
at se.martin.weather.proto.Location$Builder.setStationId(Location.java:788)
at se.martin.weather.proto.WeatherReportMapper.from(WeatherReportMapper.java:26)
at se.martin.weather.proto.WeatherReportMapper.from(WeatherReportMapper.java:19)
at se.martin.weather.proto.WeatherReportMapperTest.testMapping(WeatherReportMapperTest.java:22)
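As a sketch, the same failure mode can be pinned down in a unit test: Protobuf-generated builders reject null eagerly, so the NullPointerException is raised at the setter call. Location and setStationId are taken from the stack trace above; everything else is illustrative.

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertThrows;

import se.martin.weather.proto.Location;

class ProtobufNullHandlingTest {

    @Test
    void settingNullOnAGeneratedBuilderThrows() {
        // The generated setter performs the null check, so the error surfaces
        // immediately rather than deep inside the serialization path.
        assertThrows(NullPointerException.class, () ->
                Location.newBuilder().setStationId(null));
    }
}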
Protobuf uses a binary format for serialization, similar to Avro, which again makes it somewhat difficult, but not impossible, to read the data straight from the topic.
Usability and tooling are important, but other aspects should also be considered.
Sizing your topic is important and it can be a good idea to perform some form of calculation on how much data will be retained. For this you will need to know the retention period of the topic, the expected volume of messages in the retention period and the size of the messages.
To illustrate this I have taken a snapshot of each topic size after 10 minutes of execution, in which 4800 messages were produced. The same message data is sent to each topic so differences in topic size can be attributed to the method of serialization. The snapshot shows:
Although the size of the individual messages is trivial, this does illustrate that the choice of schema will affect the size of your topics when running at scale.
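To put rough numbers on the calculation described above, here is a back-of-the-envelope sketch using the message rate from the example run (4800 messages in 10 minutes). The retention period and average serialized message size are assumptions for illustration only.

public class TopicSizeEstimate {

    public static void main(String[] args) {
        double messagesPerSecond = 4800 / 600.0;    // 8 msg/s, as in the example run
        long retentionSeconds = 7L * 24 * 60 * 60;  // assumed 7-day retention
        long avgMessageBytes = 300;                 // assumed average serialized size

        double retainedBytes = messagesPerSecond * retentionSeconds * avgMessageBytes;
        System.out.printf("Approximate retained data: %.2f GB%n", retainedBytes / 1e9);
    }
}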
This blog post has taken a surface look at three schema frameworks that are frequently associated with Kafka. In part two we will take a deeper look at how schemas can be enforced and evolved using a schema registry.