Create a super fast AI assistant with Groq (Without a database)
Last week, I tried to build a voice AI assistant using OpenAI's Assistants API. It took a while to generate each response, which isn't suitable for a voice assistant. So I went looking for an alternative to make my assistant faster, and that's how I found Groq. This post covers how I built an AI assistant using Groq.
Pros and Cons summary
Pros:
- Easy to implement with only one API (the Groq API).
- Responses are fast.
Cons:
- The longer we chat, the higher the chance that we lose some context along the way.
What is Groq?
Groq is a service that provides a super fast inference engine for running AI applications. It's not an AI model itself! With it, we can run different AI models like Llama, Mixtral, Gemma, and more.
How I built a fast AI assistant
Many AI models exist, but only OpenAI offers an easy way to implement a chat-like experience, via the Assistants API. By default, these models don't know or remember the context of our previous chat, so we have to re-explain everything if we want the AI to understand the context of each message.
Some alternatives exist, such as LangChain's chat history or ConversationBufferMemory. But I prefer a simpler approach (*with a caveat, of course). Luckily, I found some ideas on the internet (thank you, Internet!).
The idea below can be implemented for any AI model/engine, not just Groq. You can try this with OpenAI itself, Mixtral, Claude, and so on.
Here is the flow:
- The user sends the initial message
- The AI responds to the message
- We ask AI to summarize the conversation
- We send the response and summary back to the user
- The user sends the summary back later alongside a new message
- The AI now replies based on the fresh message, using the conversation summary to provide context.
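The steps above can be sketched as a small loop. This is just an illustration of the flow, with the model calls stubbed out (`fakeChat` and `fakeSummarize` are placeholders, not real Groq calls):

```javascript
// Stub for the chat completion call (the real app would call Groq here)
function fakeChat(message, latestReply, summary) {
  return `reply to: ${message}`;
}

// Stub for the summarization call: fold the new exchange into the summary
function fakeSummarize(message, reply, summary) {
  const turn = `user: ${message} / ai: ${reply}`;
  return summary ? `${summary} | ${turn}` : turn;
}

function sendMessage(state, message) {
  // 1-2. The user sends a message; the AI replies using the running summary
  const reply = fakeChat(message, state.latestReply, state.summary);
  // 3. Ask the model to summarize the conversation so far
  const summary = fakeSummarize(message, reply, state.summary);
  // 4-5. Both go back to the client, which sends them again next turn
  return { latestReply: reply, summary };
}

let state = { latestReply: "", summary: "" };
state = sendMessage(state, "Hello");
state = sendMessage(state, "Tell me more");
console.log(state.summary);
```

The key point is that the only state the client keeps between turns is the latest reply and the summary, not the full history.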
The caveat of this method
By summarizing a conversation, we may lose some information along the way. That's why, in certain cases, it's a good idea to store the message history in a database (e.g. a vector database).
One way to mitigate this shortcoming is to also attach the AI's most recent reply. I've also read an article suggesting we keep the latest 2-3 exchanges and provide them as additional context.
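That "keep the last few exchanges" idea could look something like this. This is a sketch, not part of the final code below; the cutoff of 3 turns is an arbitrary choice from that article, and `buildContext` is a hypothetical helper:

```javascript
// Keep the last few turns verbatim alongside the summary,
// so recent details survive the lossy summarization step.
const MAX_RECENT = 3; // arbitrary cutoff, tune to your token budget

function buildContext(summary, recentTurns) {
  const recent = recentTurns.slice(-MAX_RECENT);
  return [
    {
      role: "system",
      content: `Conversation summary so far: """${summary}"""`,
    },
    // Replay the recent turns verbatim as user/assistant pairs
    ...recent.flatMap((t) => [
      { role: "user", content: t.user },
      { role: "assistant", content: t.ai },
    ]),
  ];
}

const turns = [
  { user: "Hi", ai: "Hello!" },
  { user: "Plan a trip", ai: "Where to?" },
  { user: "Indonesia", ai: "Great choice." },
  { user: "Beaches?", ai: "Try Bali." },
];
const messages = buildContext("User is planning a trip.", turns);
console.log(messages.length); // 1 system message + 3 turns x 2 = 7
```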
Code implementation
I'll use Node.js for this tutorial. Feel free to use any language you want. The final code is available on GitHub:
- Install dependencies
npm i express groq-sdk dotenv --save
- express for creating a route for the endpoint
- groq-sdk, the official package for using Groq in JavaScript
- dotenv to store our API key safely
- Add API Key
Create a new .env file. Add your Groq API key to this file like this:
GROQ_API_KEY=YOUR_GROQ_API_KEY
Make sure to sign up for Groq and get your API key here.
- Basic Setup
Let's create a new index.js file; we'll write everything in this file. We'll prepare one endpoint called /chat, where we'll send these parameters:
- message: user's message
- latestReply: The latest reply from AI
- messageSummary: The conversation summary so far
In this endpoint, we'll do two things:
- Respond to the new user message (with latestReply and messageSummary as context)
- Create a new conversation summary that includes the fresh reply from the AI.
const express = require('express');
require("dotenv").config();
const { GROQ_API_KEY } = process.env;

// Express setup
const app = express();
app.use(express.json());
const port = 3000;

// Groq setup
const Groq = require("groq-sdk");
const groq = new Groq({
  apiKey: GROQ_API_KEY
});

async function chatWithGroq() { } // soon
async function summarizeConversation() { } // soon

app.post('/chat', async (req, res) => {
  const { message, latestReply, messageSummary } = req.body;

  // Request a chat completion
  const reply = await chatWithGroq(message, latestReply, messageSummary);

  // Request a conversation summary
  const summary = await summarizeConversation(message, reply, messageSummary);

  // Always return the reply and the updated summary
  res.send({
    reply,
    summary
  });
});

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`);
});
- Chat with Groq method
Here is the chatWithGroq method implementation:
async function chatWithGroq(userMessage, latestReply, messageSummary) {
  const messages = [{
    role: "user",
    content: userMessage
  }];

  // Only add context once we have a summary (i.e. not on the first message)
  if (messageSummary != '') {
    messages.unshift({
      role: "system",
      content: `Our conversation's summary so far: """${messageSummary}""".
And this is the latest reply from you: """${latestReply}"""`
    });
  }

  console.log('original message', messages);

  const chatCompletion = await groq.chat.completions.create({
    messages,
    model: "llama3-8b-8192"
  });

  return chatCompletion.choices[0]?.message?.content || "";
}
- We only provide the conversation summary when we have one (see the if statement), so it won't be included with our first message.
- Conversation summary method
Here is the summarizeConversation method implementation:
async function summarizeConversation(message, reply, messageSummary) {
  let content = `Summarize this conversation
user: """${message}""",
you(AI): """${reply}"""
`;

  // For the N+1 message, fold the previous summary in as well
  if (messageSummary != '') {
    content = `Summarize this conversation: """${messageSummary}"""
and last conversation:
user: """${message}""",
you(AI): """${reply}"""
`;
  }

  const chatCompletion = await groq.chat.completions.create({
    messages: [
      {
        role: "user",
        content
      }
    ],
    model: "llama3-8b-8192"
  });

  const summary = chatCompletion.choices[0]?.message?.content || "";
  console.log('summary: ', summary);
  return summary;
}
In this method, we ask the AI to create a new summary based on the previous summary and the latest exchange.
Demo Time!
You can use any API client, like Postman, Thunder Client (VS Code), etc.
Don't forget to run your program with node index.js.
Create a POST request to the /chat endpoint and provide the first message parameter.
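For the first request, the body only needs the message field; the other two parameters can be empty strings, since the code checks for ''. Here is a hypothetical example (my actual first question was about Indonesia, as the follow-ups below show):

```json
{
  "message": "Give me a list of 5 places to visit in Indonesia",
  "latestReply": "",
  "messageSummary": ""
}
```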
We can display the reply from the response in our user interface; this is the actual reply to our message. We'll save the summary for the next request.
Now, this is what the JSON looks like for the N+1 message:
The next messages should include latestReply and messageSummary as parameters.
- message: Don't forget to add a new message; this is you talking to the AI. Notice that I use "here" in my question, to validate that the AI knows the previous context.
- latestReply: the latest reply from the AI (from the previous response)
- messageSummary: the conversation summary so far (from the previous response)
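Putting the three parameters together, an N+1 request body might look like this (the latestReply and messageSummary values are placeholders for whatever the previous response returned):

```json
{
  "message": "What is the best time of year to visit here?",
  "latestReply": "<reply from the previous response>",
  "messageSummary": "<summary from the previous response>"
}
```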
Here is the result of this request:
As you can see, the AI knows that when I said "here", I was talking about Indonesia. You can try sending a follow-up message (create a new request) asking something like "Can you tell me more about number 4?". But don't forget that we always need to update latestReply and messageSummary on each request.
To get the response and summarize the conversation, I only need to wait around 2s. This is much faster than using OpenAI's Assistants API.
FAQ
Why don't we store the whole conversation history?
The longer we talk with the assistant, the more tokens we'd need, so summarizing keeps us from paying a lot for the service. Storing the full history might work for an open-source model that you run on your own server.
Reference:
- Build a smart AI voice assistant
- Basic tutorial: Assistants API by OpenAI