AWS MSK Producer data validation with AWS Glue Schema using Python

Debadatta Panda
Jun 21, 2021


With a streaming architecture, you can decouple data producers from the applications that process the data. Because development on these processing applications moves quickly, it is hard to coordinate and evolve data schemas (for example, adding or removing a field), which can create data quality issues and errors in downstream applications.

This article explains how to use the AWS Glue Schema Registry with a Kafka producer to prevent data quality issues in consumer applications.

I will integrate AWS MSK data streaming with the AWS Glue Schema Registry using Avro-encoded messages. AWS provides native libraries for using the Glue Schema Registry from Java; if you are using Java, refer to the AWS documentation.

In this example, a Kafka producer sends an Avro-encoded message to an MSK topic (KafkaSample). Before sending, the producer reads the registered schema from AWS Glue and validates the message against it: the serializer uses the schema to check that the record being produced is structured with matching fields and data types. If the record does not match the registered schema, the serializer raises an exception and the application fails to deliver the record to its destination.
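To make that failure mode concrete, here is a minimal, self-contained sketch (the schema and field name are invented for illustration). Avro's DatumWriter raises AvroTypeException when a record does not match the schema; this is the validation error the producer application sees:

```python
import io

import avro.io
import avro.schema

# A minimal, invented schema for illustration only.
schema = avro.schema.parse(
    '{"type": "record", "name": "Sample",'
    ' "fields": [{"name": "emp_id", "type": "int"}]}'
)

writer = avro.io.DatumWriter(schema)
try:
    # emp_id is a string instead of an int, so serialization fails.
    writer.write({"emp_id": "oops"}, avro.io.BinaryEncoder(io.BytesIO()))
except avro.io.AvroTypeException:
    print("Record does not match the schema; nothing is sent to the topic.")
```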

AWS MSK: a fully managed service for building and running applications that use Apache Kafka.

AWS Glue Schema Registry: lets you centrally discover, control, and evolve data stream schemas.

Avro: an open-source data serialization system that helps with data exchange between systems and streams.

Before developing the application, install the required libraries: kafka-python, avro, and boto3 (for example, with pip install kafka-python avro boto3). For this example, I am using a very simple schema in Glue:

Example Schema
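A minimal Avro schema of this kind might look like the following (the record and field names below are illustrative assumptions, not an exact schema):

```json
{
  "type": "record",
  "namespace": "com.example",
  "name": "Employee",
  "fields": [
    { "name": "emp_id",   "type": "int"    },
    { "name": "emp_name", "type": "string" }
  ]
}
```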

Here is sample code in Python that uses the boto3 library to read the schema from Glue, uses avro to validate the data against the schema, and sends the validated record with a Kafka producer.
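The following is a minimal sketch of that flow, not a production implementation. The registry name, schema name, and broker endpoints are illustrative assumptions; the topic name KafkaSample comes from the example above. It fetches the latest schema definition with boto3's get_schema_version call, parses it with avro, and relies on Avro's DatumWriter to reject any record whose fields or types do not match:

```python
import io

import avro.io
import avro.schema
import boto3
from kafka import KafkaProducer

# Illustrative names: replace with your own registry, schema, and brokers.
REGISTRY_NAME = "my-registry"
SCHEMA_NAME = "Employee"
TOPIC = "KafkaSample"
BOOTSTRAP_SERVERS = ["b-1.example.kafka.us-east-1.amazonaws.com:9094"]

# Read the latest registered schema definition from the Glue Schema Registry.
glue = boto3.client("glue")
response = glue.get_schema_version(
    SchemaId={"RegistryName": REGISTRY_NAME, "SchemaName": SCHEMA_NAME},
    SchemaVersionNumber={"LatestVersion": True},
)
schema = avro.schema.parse(response["SchemaDefinition"])


def serialize(record):
    # DatumWriter.write validates the record against the schema and raises
    # AvroTypeException when fields or types do not match.
    buffer = io.BytesIO()
    avro.io.DatumWriter(schema).write(record, avro.io.BinaryEncoder(buffer))
    return buffer.getvalue()


producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    security_protocol="SSL",  # MSK's TLS listener
)

record = {"emp_id": 1, "emp_name": "John"}  # must match the registered schema

try:
    producer.send(TOPIC, value=serialize(record))
    producer.flush()
except avro.io.AvroTypeException as err:
    # Validation failed: the record does not conform to the registered
    # schema, so it is never delivered to the topic.
    print("Record rejected by schema validation:", err)
```

Note that validation happens client-side at serialization time: a bad record is rejected before any bytes reach MSK, which is what keeps malformed data out of the topic.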

Conclusion:

Kafka and Avro are very useful when you want to stream your data to different systems. Schemas in streaming improve data governance, ensure good data quality, and make data consumers more resilient to compatible upstream changes.
