While helping our support team resolve a client query, and joking about how AI will replace developers and the support team will no longer need us, I found out that they also use ChatGPT to draft replies to make them more professional and formal.
On digging deeper, I found a few problems they face consistently:
- Repetitive replies
- Time wasted on formatting responses
- Inconsistent tone and writing style across responses
We decided to streamline this so they don’t have to leave the support panel, and build a chat assistant for the team.
📕 TL;DR: For testing purposes, ₹500 was more than enough. Most of the cost went into generating embeddings, which we optimized by using bulk creation and limiting the chat data to conversations no older than 3 years.
MVP Proposition
The goal was simple: generate an appropriate response based on the ongoing chat in our support panel, while still giving the support agent the ability to review and edit it before sending. The review step was necessary because we don’t know how the client will react, and we can’t let an unreviewed reply reach a client.
To achieve the desired result, we needed to:
- Leverage LLMs (Large Language Models) to generate a response that stays in the context of the ongoing conversation.
- Utilize our existing ticket data, FAQs, articles, and blogs to help the LLM craft meaningful replies.
This led us to explore RAG (Retrieval-Augmented Generation), a powerful technique that uses vector search to retrieve similar results from the knowledge base and an LLM to craft the response.
To keep things simple and save time, we decided to go with dense search only. If it works out, we can try out hybrid search for better results.
Solution Overview
Our implementation is broken into two main parts:
1️⃣ Data Embedding & Storage
We began by collecting all relevant data sources, including:
- Previous support tickets
The raw data wasn’t fit for embedding right away, so we sanitized it by removing HTML tags, extra spaces, etc. Then we structured each chat conversation in this format:
[ { "role": "client", "chat": "I have an issue creating an order" }, { "role": "agent", "chat": "You can use the xyz solution to resolve this." } ]
Once the ticket chat was sanitized, we stringified the JSON into a single string for embedding.
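As a rough sketch of that step (the function and field names here are illustrative, not our exact code):

```python
import json
import re


def sanitize(text: str) -> str:
    """Strip HTML tags and collapse extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse extra spaces


def ticket_to_string(messages: list[dict]) -> str:
    """Sanitize each message and stringify the whole chat as JSON."""
    cleaned = [{"role": m["role"], "chat": sanitize(m["chat"])} for m in messages]
    return json.dumps(cleaned, ensure_ascii=False)
```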
To embed the data into vectors, we leveraged OpenAI’s text-embedding-3-small model to generate the vectors and saved them in Pinecone DB.
Pinecone has a generous free tier, so if you want to try a few things out, go for it.
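Here’s a minimal sketch of the bulk embedding and upsert step, assuming a Pinecone index already exists (the index name, data shape, and environment variables are assumptions for illustration):

```python
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("support-tickets")  # hypothetical index name


def embed_and_store(tickets: list[dict]) -> None:
    """tickets: [{"id": "...", "text": "<stringified chat>"}, ...] (shape assumed)."""
    # Bulk embedding: one API call for the whole batch instead of one per ticket.
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[t["text"] for t in tickets],
    )
    # Pair each ticket with its embedding and upsert into Pinecone.
    index.upsert(
        vectors=[
            {"id": t["id"], "values": d.embedding, "metadata": {"text": t["text"]}}
            for t, d in zip(tickets, resp.data)
        ]
    )
```

Batching the embedding calls is what kept the cost low, as mentioned in the TL;DR.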
2️⃣ Contextual Response Generation
This part comes into play when an agent clicks the button to generate a response.
- The current chat is sanitized and formatted.
- Then, using OpenAI’s text-embedding-3-small, we create a vector for the chat.
- We find the top k similar results in the vector DB.
- The similar results, plus the prompt and the current chat, are given to the LLM to generate a response.
This approach ensures that the generated response stays relevant and draws on our existing knowledge base.
To generate the response, we are using OpenAI’s GPT-3.5 Turbo; a rough sketch of the whole flow is below.
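This sketch assumes the same Pinecone index as above; the system prompt, function name, and top_k value are placeholders, not our production code:

```python
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("support-tickets")

SYSTEM_PROMPT = (
    "You are a support agent. Using the similar past tickets provided as context, "
    "draft a professional, formal reply to the ongoing conversation."
)


def generate_reply(current_chat: str, top_k: int = 5) -> str:
    # 1. Embed the sanitized current chat with the same embedding model.
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=current_chat,
    ).data[0].embedding

    # 2. Dense search: fetch the top-k most similar tickets from Pinecone.
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)

    # 3. Similar results + prompt + current chat go to the LLM.
    completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Similar tickets:\n{context}\n\nCurrent chat:\n{current_chat}",
            },
        ],
    )
    return completion.choices[0].message.content
```

The generated draft is then shown to the agent for review and editing before it is sent.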
Outcome & Next Steps
On testing things out, we found that when the vector search returned good, similar results, the assistant worked great; we barely needed to tinker with the response. We also found that a good prompt matters a lot, as it changes how the LLM uses the retrieved knowledge and the current context.
We have largely solved the problem for the team, but there is still room to improve and streamline things further.
Next, we’re exploring:
- Adding feedback loops for response quality
- Improving data refresh cycles
- Expanding our knowledge sources to FAQs, blogs, AMC, and scheme data.
