Amazon MSK data validation with AWS Glue Schema using AVRO

Debadatta Panda
3 min read · Nov 8, 2022


With a streaming architecture, you can decouple data-processing applications from data producers. This decoupling enables rapid development of the data-processing applications. With such rapid development, it becomes hard to coordinate and evolve data schemas, such as adding or removing a data field, which can create data quality issues and errors in downstream applications.

This article explains how to use the AWS Glue Schema Registry with an Amazon Managed Streaming for Apache Kafka (MSK) data producer to prevent data quality issues in the consumer application.

Data validation flow

As shown in the diagram above, I will show how to integrate Amazon MSK data streaming with AWS Glue Schema using Avro-encoded messages. AWS provides native libraries for the Glue Schema Registry in Java; if you are using Java, refer to the AWS documentation. In this example, a Kafka producer sends an Avro-encoded message to the MSK topic (KafkaSample). The message is validated against the Glue schema: the Avro serializer uses the Glue schema to validate the record. If the record does not match the registered schema, the Avro serializer raises an exception and the application does not publish the record to the topic. I will run through a sample to show how this flow works, using the following solution components:

Amazon MSK topic : In this sample we will use an Amazon Managed Streaming for Apache Kafka (MSK) cluster and create the topic to which the data will be published for the consumer application (see the topic-creation sketch after this list).

AWS Glue Schema : We will use the AWS Glue Schema Registry to hold the schema used for data validation.

AVRO : An open source data serialization system that helps with data exchange between systems and streams.
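
The article does not show how the KafkaSample topic is created. As an illustration only, here is a minimal sketch that creates it with the kafka-python admin client; the bootstrap address, partition count, and replication factor are placeholders, not values from the article.

from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the MSK cluster using its bootstrap broker string (placeholder value)
admin = KafkaAdminClient(bootstrap_servers='PROVIDEKAFKABROKERBOOTSTRAP')

# Create the topic the producer will publish to; sizing values are illustrative
admin.create_topics(new_topics=[
    NewTopic(name='KafkaSample', num_partitions=3, replication_factor=2)
])
admin.close()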

Before developing the application you need to install the following libraries: kafka-python and avro (for example, with pip install kafka-python avro). For this example, I am using a very simple schema, shown below:

{
  "type": "record",
  "namespace": "My_Organization",
  "name": "Employee",
  "fields": [
    {
      "name": "Name",
      "type": "string"
    },
    {
      "name": "Age",
      "type": "int"
    }
  ]
}
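
The sample assumes this schema is already registered in the Glue Schema Registry. If it is not, a sketch along the following lines could register it with boto3; the registry name, schema name, compatibility mode, and region are placeholders, not values from the article.

import json
import boto3

glue = boto3.client(service_name='glue', region_name='AWSREGION')

employee_schema = {
    "type": "record",
    "namespace": "My_Organization",
    "name": "Employee",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Age", "type": "int"}
    ]
}

# Register the Avro schema under an existing registry (names are placeholders)
response = glue.create_schema(
    RegistryId={'RegistryName': 'PROVIDEYOURREGISTRYNAME'},
    SchemaName='PROVIDESCHEMANAME',
    DataFormat='AVRO',
    Compatibility='BACKWARD',
    SchemaDefinition=json.dumps(employee_schema)
)
print(response['SchemaVersionId'])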

Here is sample code in Python that uses the boto3 library to read the schema from Glue, uses avro to validate the data against the schema, and sends the serialized data to the Kafka producer:

from kafka import KafkaProducer
import boto3
import io
import avro.schema
import avro.io

topic = 'KafkaSample'
# Kafka producer initialization
kproducer = KafkaProducer(bootstrap_servers='PROVIDEKAFKABROKERBOOTSTRAP')
# Get the Glue schema using boto3
client = boto3.client(service_name='glue', region_name='AWSREGION')
schema_res = client.get_schema_version(
    SchemaId={
        'SchemaName': "PROVIDESCHEMANAME",
        'RegistryName': "PROVIDEYOURREGISTRYNAME"
    },
    SchemaVersionNumber={
        'VersionNumber': PROVIDEVERSIONNUMBER
    }
)
schema_definition = schema_res['SchemaDefinition']
# Use the avro library to serialize the data
producerschema = avro.schema.parse(schema_definition)
writer = avro.io.DatumWriter(producerschema)
bytes_writer = io.BytesIO()
encoder = avro.io.BinaryEncoder(bytes_writer)
# Write the data with a valid schema, otherwise this will throw an error
writer.write({"Name": "John", "Age": 30}, encoder)
raw_bytes = bytes_writer.getvalue()
# Send the data to the Kafka producer
kproducer.send(topic, raw_bytes)
kproducer.flush()
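
For clarity, here is a minimal sketch (my own addition, not from the article) of how the producer could catch the serialization failure and skip publishing a record that does not conform to the registered schema. It assumes a recent avro release in which the type error is raised as avro.errors.AvroTypeException; older releases raise avro.io.AvroTypeException instead.

import io
import avro.io
import avro.errors  # assumes avro >= 1.10, where AvroTypeException lives in avro.errors

def publish_if_valid(producer, topic, record, schema):
    # Serialize the record with the Glue-backed schema and publish it only if it validates
    bytes_writer = io.BytesIO()
    encoder = avro.io.BinaryEncoder(bytes_writer)
    writer = avro.io.DatumWriter(schema)
    try:
        # Raises AvroTypeException when the record does not match the registered schema
        writer.write(record, encoder)
    except avro.errors.AvroTypeException as err:
        print(f"Record rejected, does not match schema: {err}")
        return False
    producer.send(topic, bytes_writer.getvalue())
    return True

# Example usage with the objects created above (kproducer, topic, producerschema)
publish_if_valid(kproducer, topic, {"Name": "Jane", "Age": 28}, producerschema)    # published
publish_if_valid(kproducer, topic, {"Name": "Jane", "Age": "28"}, producerschema)  # rejected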

Conclusion:

Kafka and Avro are very useful when you would like to stream your data to different systems. Using schemas in streaming improves data governance, ensures good data quality, and makes data consumers more resilient to compatible upstream changes.
