Amazon Bedrock Knowledge Bases for structured data
Amazon Bedrock Knowledge Bases is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation. For RAG-based applications, the accuracy of the generated responses from FMs depends on the context provided to the model. Context is retrieved from vector stores based on user queries. In this post, we discuss the custom metadata filtering feature in Amazon Bedrock Knowledge Bases, which you can use to improve search results by pre-filtering your retrievals from vector stores for structured data.
Solution Flow:
Step 1: Prepare the dataset for the knowledge base
With the new enhancement, you can separate content and metadata within your CSV files, transforming the way you handle data ingestion. This feature allows you to designate specific columns as content fields while marking others as metadata fields. For the demonstration, I will use a sample movie dataset to illustrate how to ingest and retrieve metadata using Amazon Bedrock Knowledge Bases. You can use the following sample code to pre-process the data:
import pandas as pd
import os, json, tqdm, boto3

# Replace the paths below with your values
df = pd.read_csv('path/movie_data.csv')

for i in tqdm.trange(100):
    # Content field: the movie description that will be embedded
    desc = str(df['label'][i])

    # Metadata fields: written to a companion .metadata.json file
    meta = {
        "metadataAttributes": {
            "title": str(df['title'][i]),
            "runningtime": str(df['running_time'][i]),
            "actors": str(df['actors'][i]),
            "directors": str(df['directors'][i]),
            "genre": str(df['genre'][i]),
        }
    }

    # Write the content file
    filename = 'path/outputs' + '/' + str(df['title'][i]) + '.txt'
    with open(filename, 'w') as f:
        f.write(desc)

    # Write the matching metadata file
    metafilename = filename + '.metadata.json'
    with open(metafilename, 'w') as f:
        json.dump(meta, f)
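Each movie row ends up as a pair of files: a .txt file with the description that gets embedded, and a companion .metadata.json file with the attributes used for filtering. As a quick sanity check, you can print one generated pair (a minimal sketch, assuming the df and output folder from the snippet above):

# Inspect the first generated content/metadata pair
sample_title = str(df['title'][0])
with open('path/outputs/' + sample_title + '.txt') as f:
    print('content:', f.read()[:200])
with open('path/outputs/' + sample_title + '.txt.metadata.json') as f:
    print('metadata:', json.dumps(json.load(f), indent=2))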
Upload the files to Amazon S3: the files created above need to be uploaded to an S3 bucket before the Amazon Bedrock knowledge base can be created. As a reference, you can use the following sample code to upload all of the files created in the data folder to S3:
# Upload data to Amazon S3
s3_client = boto3.client("s3")
bucket_name = "bucketname"  # replace with your bucket name
data_root = "path"  # replace with the data directory created above

def uploadFiles(path, bucket_name):
    # Walk the local folder and upload every file to the bucket root
    for root, dirs, files in os.walk(path):
        for file in tqdm.tqdm(files):
            s3_client.upload_file(os.path.join(root, file), bucket_name, file)
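The helper can then be called with the output folder from Step 1 and the target bucket, assuming the variables defined above:

# Upload everything generated in Step 1 to the bucket root
uploadFiles(data_root, bucket_name)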
Step 2: Create the knowledge base
Amazon Bedrock Knowledge Bases enables you to ingest data sources and store them in a central repository of information. In this case, we will ingest the transformed data from the previous step so that the knowledge base treats certain columns as content fields and others as metadata fields. The metadata fields provided in your .metadata.json files correspond to columns in your CSV.
I will use the Amazon Bedrock console to configure the knowledge base. This option can be found under Builder tools in the Amazon Bedrock console, as shown below:
The knowledge base data source for this S3 bucket is defined with the No chunking strategy, which means each uploaded file is treated as a single chunk and no further chunking is performed. If the content in the embedded fields is large, a chunking strategy can be adopted; in that case, the metadata for the chunks remains the same, but more fine-grained documents are created in the vector database.
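If you prefer to script this step instead of using the console, the data source can also be created with the boto3 bedrock-agent client. The sketch below is illustrative; the knowledge base ID, data source name, and bucket ARN are placeholders you would replace with your own values:

import boto3

bedrock_agent = boto3.client('bedrock-agent')

# Create an S3 data source with no additional chunking:
# each uploaded file is ingested as a single chunk.
response = bedrock_agent.create_data_source(
    knowledgeBaseId='your-knowledge-base-id',  # placeholder
    name='movie-data-source',                  # placeholder
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::bucketname'  # placeholder
        }
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'NONE'
        }
    }
)
print(response['dataSource']['dataSourceId'])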
Once the knowledge base is created, you need to sync the data. Select the data source and choose Sync:
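The sync can also be triggered programmatically with the StartIngestionJob API; a minimal sketch with placeholder IDs:

import boto3

bedrock_agent = boto3.client('bedrock-agent')

# Start an ingestion job to sync the data source into the vector store
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId='your-knowledge-base-id',  # placeholder
    dataSourceId='your-data-source-id'         # placeholder
)
print(job['ingestionJob']['status'])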
Step 3: Retrieve data from the knowledge base using metadata filtering
We will create a conversational chatbot with Streamlit that queries the knowledge base. For retrieval, I will use the single metadata filter shown below to return only movies whose genre attribute has the value “nan”, and I will use the RetrieveAndGenerate API to test the knowledge base with the metadata filter.
# genre = nan
single_filter = {
    "equals": {
        "key": "genre",
        "value": "nan"
    }
}
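Multiple conditions can also be combined with the andAll or orAll operators; the following illustrative filter requires both a genre and a directors value to match (the keys must match the metadataAttributes written in Step 1, and the values here are placeholders):

# Illustrative combined filter: both conditions must match
combined_filter = {
    "andAll": [
        {"equals": {"key": "genre", "value": "nan"}},
        {"equals": {"key": "directors", "value": "directorname"}}
    ]
}

The Streamlit app below passes single_filter into the RetrieveAndGenerate call: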
import boto3
import streamlit as st

st.subheader('RAG Using Knowledge Base from Amazon Bedrock')

# Keep the conversation history across Streamlit reruns
if 'chat_history' not in st.session_state:
    st.session_state.chat_history = []

for message in st.session_state.chat_history:
    with st.chat_message(message['role']):
        st.markdown(message['text'])

bedrockClient = boto3.client('bedrock-agent-runtime')

def getResponse(query):
    # The metadata filter is passed inside vectorSearchConfiguration
    knowledgeBaseResponse = bedrockClient.retrieve_and_generate(
        input={'text': query},
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': 'your-knowledge-base-id',  # replace with your knowledge base ID
                'modelArn': 'your-model-arn',  # replace with your model ARN
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'filter': single_filter  # the metadata filter defined above
                    }
                }
            }
        })
    return knowledgeBaseResponse

query = st.chat_input('Enter your query here...')
if query:
    with st.chat_message('user'):
        st.markdown(query)
    st.session_state.chat_history.append({"role": 'user', "text": query})

    response = getResponse(query)
    # st.write(response)  # uncomment to inspect the full API response
    answer = response['output']['text']

    with st.chat_message('assistant'):
        st.markdown(answer)
    st.session_state.chat_history.append({"role": 'assistant', "text": answer})

    # Show the context and source document used for the latest answer
    if len(response['citations'][0]['retrievedReferences']) != 0:
        context = response['citations'][0]['retrievedReferences'][0]['content']['text']
        doc_url = response['citations'][0]['retrievedReferences'][0]['location']['s3Location']['uri']
        st.markdown(f"<span style='color:#FFDA33'>Context used: </span>{context}", unsafe_allow_html=True)
        st.markdown(f"<span style='color:#FFDA33'>Source Document: </span>{doc_url}", unsafe_allow_html=True)
    else:
        st.markdown("<span style='color:red'>No Context</span>", unsafe_allow_html=True)
Conclusion:
The metadata filtering feature in Amazon Bedrock Knowledge Bases improves RAG output by pre-filtering retrievals with metadata filters. This approach makes query responses more relevant while reducing the cost of querying the vector database.