Create a super fast AI assistant with Groq (Without a database)

Last week, I tried to build a voice AI assistant using OpenAI's Assistants API. It took a while to generate a response, which isn't suitable for a voice assistant, so I looked for an alternative to make my assistant faster. That's how I found out about Groq. This post covers how I built an AI assistant using Groq.

Build a fast AI assistant with Groq and llama model illustration

Pros and Cons summary
Pros: Easy to implement with only one API (the Groq API).
Responses are fast.

Con: The longer we chat, the higher the chance that we lose some context along the way.

What is Groq?

Groq is a service that provides a super fast engine to run AI applications. It's not an AI model itself! Groq runs models on its LPU™ (Language Processing Unit) Inference Engine, and we can run different AI models like Llama, Mixtral, Gemma, and more!

How I built a fast AI assistant

Many AI models exist, but only OpenAI offers an easy way to implement a chat-like experience through its Assistants API. By default, chat models won't know or remember the context of our previous messages. So, we have to re-explain everything if we want the AI to understand the context of each message.
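
For illustration, here is a minimal sketch of that statelessness, using the groq-sdk client we'll set up later in this post (the message wording is just an example):

const Groq = require("groq-sdk");
const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

async function demo() {
    // First call: introduce some context
    await groq.chat.completions.create({
        messages: [{ role: "user", content: "My name is Hilman." }],
        model: "llama3-8b-8192"
    });

    // Second call: a completely independent request.
    // The model has no memory of the first call, so it can't answer
    // this question unless we resend the context ourselves.
    const second = await groq.chat.completions.create({
        messages: [{ role: "user", content: "What is my name?" }],
        model: "llama3-8b-8192"
    });
    console.log(second.choices[0]?.message?.content);
}

demo();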

Some alternatives exist, such as using LangChain's chat history or ConversationBufferMemory. But I preferred to find a simpler way (*with a caveat, of course). Luckily, I found some ideas on the internet (thank you, Internet!).

The idea below can be implemented for any AI model/engine, not just Groq. You can try this with OpenAI itself, Mixtral, Claude, and so on.
chat flow illustration

Here is the flow:

  • The user sends the initial message
  • The AI responds to the message
  • We ask the AI to summarize the conversation
  • We send the response and the summary back to the user
  • The user later sends the summary back alongside a new message
  • The AI now replies based on the fresh message, with the conversation summary providing context (see the example payloads below)
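
To make the flow concrete, here is roughly what the request and response bodies of the endpoint we'll build look like across two turns (all values are illustrative):

// Turn 1: the user's first request
POST /chat
{ "message": "Recommend 5 places to visit in Indonesia" }

// Turn 1: the response carries the reply plus a conversation summary
{ "reply": "1. Bali ...", "summary": "The user asked for travel recommendations ..." }

// Turn 2: the client sends the summary (and latest reply) back as context
POST /chat
{
    "message": "What food should I try here?",
    "latestReply": "1. Bali ...",
    "messageSummary": "The user asked for travel recommendations ..."
}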

The caveat of this method
By summarizing a conversation, we may lose some information along the way. That's why, in certain cases, it's a good idea to store the message history in a database (e.g., a vector database) instead.

One way to reduce this loss is by also attaching the most recent reply from the AI. I've also read an article that suggests keeping the latest 2-3 exchanges and providing them as additional context; here's a quick sketch of that idea.
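
This is a minimal sketch of that variation (the names recentTurns and MAX_TURNS are my own for this illustration, not part of the final code below):

// Keep a small rolling window of recent exchanges as extra context
const MAX_TURNS = 3;
let recentTurns = []; // each item: { user: "...", ai: "..." }

function remember(userMessage, aiReply) {
    recentTurns.push({ user: userMessage, ai: aiReply });
    if (recentTurns.length > MAX_TURNS) {
        recentTurns.shift(); // drop the oldest exchange
    }
}

function buildContext(summary) {
    const recent = recentTurns
        .map(t => `user: ${t.user}\nyou(AI): ${t.ai}`)
        .join("\n");
    return `Summary so far: ${summary}\nRecent exchanges:\n${recent}`;
}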

Code implementation

I'll use Node.js for this tutorial. Feel free to use any language you want. The final code is available on GitHub:

GitHub - hilmanski/assistants-api-with-groq-ai
  1. Install dependencies
npm i express groq-sdk dotenv --save
  • express for creating a route for the endpoint
  • groq-sdk, the official package for using Groq in JavaScript
  • dotenv to store our API key safely
  2. Add API Key

Create a new .env file. Add your Groq API key in this file like this:

GROQ_API_KEY=YOUR_GROQ_API_KEY

Make sure to sign up for Groq and get your API key here.

  3. Basic Setup

Let's create a new index.js file, and we'll write everything in this file. We'll prepare one endpoint called /chat, where we'll send these parameters:
- message: user's message
- latestReply: The latest reply from AI
- messageSummary: The conversation summary so far

In this endpoint, we'll do two things:
- Respond to the new user message (with latestReply and messageSummary as context)
- Create a new conversation summary that includes the fresh reply from the AI.

const express = require('express');

// Express Setup
const app = express();
app.use(express.json());
const port = 3000;

require("dotenv").config();
const { GROQ_API_KEY } = process.env;

// GROQ Setup
const Groq = require("groq-sdk");
const groq = new Groq({
    apiKey: GROQ_API_KEY
});

async function chatWithGroq() { } // implemented below
async function summarizeConversation() { } // implemented below

app.post('/chat', async (req, res) => {
    const { message, latestReply, messageSummary } = req.body;

    // Request a chat completion (the actual reply to the user)
    const reply = await chatWithGroq(message, latestReply, messageSummary);

    // Request an updated conversation summary
    const summary = await summarizeConversation(message, reply, messageSummary);

    // Always return the reply together with the summary,
    // so the client can send them back on the next request
    res.send({
        reply,
        summary
    });
});

app.listen(port, () => {
    console.log(`Example app listening on port ${port}`);
});
  4. Chat with Groq method

Here is the chatWithGroq method implementation:

async function chatWithGroq(userMessage, latestReply, messageSummary) {
    let messages = [{
        role: "user",
        content: userMessage
    }];

    // Only add context when a summary exists (i.e., not on the first message)
    if (messageSummary) {
        messages.unshift({
            role: "system",
            content: `Our conversation's summary so far: """${messageSummary}""". 
                     And this is the latest reply from you """${latestReply}"""`
        });
    }

    console.log('original message', messages);

    const chatCompletion = await groq.chat.completions.create({
        messages,
        model: "llama3-8b-8192"
    });

    const reply = chatCompletion.choices[0]?.message?.content || "";
    return reply;
}
  • We only provide the conversation summary when we have one (see the if statement), so it won't be included with our first message. An example of what the messages array looks like on a follow-up turn is shown below.
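
For illustration, on a follow-up turn the logged messages array would look roughly like this (contents shortened, values illustrative):

[
    {
        role: 'system',
        content: 'Our conversation\'s summary so far: """The user asked for travel recommendations in Indonesia ...""". And this is the latest reply from you """1. Bali ..."""'
    },
    { role: 'user', content: 'What food should I try here?' }
]
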
  5. Conversation summary method

Here is the summarizeConversation method implementation:

async function summarizeConversation(message, reply, messageSummary) {
    let content = `Summarize this conversation 
                    user: """${message}""",
                    you(AI): """${reply}"""
                  `

    // For the N+1 message, include the previous summary as well
    if (messageSummary) {
        content = `Summarize this conversation: """${messageSummary}"""
                    and last conversation: 
                    user: """${message}""",
                    you(AI): """${reply}"""
                `
    }

    const chatCompletion = await groq.chat.completions.create({
        messages: [
            {
                role: "user",
                content: content
            }
        ],
        model: "llama3-8b-8192"
    });

    const summary = chatCompletion.choices[0]?.message?.content || ""
    console.log('summary: ', summary)
    return summary
}

In this method, we ask the AI to create a summary based on the latest summary and recent reply.

Demo Time!

You can use any API client, like Postman, Thunder Client (VS Code), etc.

Don't forget to run your program with node index.js

Create a POST request to the /chat endpoint and provide the first message parameter.
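
If you'd rather test from code, here is a quick sketch using Node's built-in fetch (Node 18+), assuming the server is running locally on port 3000:

// test.js - a quick test for the first message (no summary yet)
(async () => {
    const res = await fetch("http://localhost:3000/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ message: "Recommend 5 places to visit in Indonesia" })
    });

    const { reply, summary } = await res.json();
    console.log(reply);   // show this to the user
    // keep `reply` and `summary` around for the next request
})();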

initial message illustration

We can display the reply from the response on our user interface. This is the actual reply to our message.

We'll save the summary for the next request.

Now, this is what the JSON looks like for the N+1 message:

N+1 message parameters

The next messages should include the latestReply and messageSummary as parameters.

  • message: Don't forget to add a new message; this is you talking to the AI. Notice that I use the word "here" in my question, to validate that the AI knows the previous context.
  • latestReply: Send the latest reply from AI (from previous response)
  • messageSummary: Send the conversation summary so far (from previous response)
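
Put together, the JSON body for the follow-up request looks something like this (the actual reply and summary text come from the previous response):

{
    "message": "What traditional food should I try here?",
    "latestReply": "<the reply text from the previous response>",
    "messageSummary": "<the summary text from the previous response>"
}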

Here is the result of this request:

Summary conversation and reply example

As you can see, the AI knows that when I said "here" I was talking about Indonesia. You can try sending a follow-up message (create a new request) by asking something like "Can you tell me more about number 4?". But don't forget that we always need to update the latestReply and messageSummary on each request.

To get the response and the conversation summary, I only need to wait around 2 seconds. This is much faster than using OpenAI's Assistants API.

FAQ

Why don't we store all the conversation history?
The longer we talk with the assistant, the more tokens each request needs, and summarizing keeps us from paying a lot of money for the service. Storing everything might work fine for an open-source model that you run on your own server.
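
For comparison, storing everything just means appending to the messages array on every turn. Here is a rough sketch (reusing the groq client from the tutorial), not something I'd recommend on a paid API:

// Naive full-history approach: token usage grows with every turn
let history = []; // lives in memory for the whole session, for illustration

async function chatWithFullHistory(userMessage) {
    history.push({ role: "user", content: userMessage });

    const chatCompletion = await groq.chat.completions.create({
        messages: history, // the entire conversation, sent every time
        model: "llama3-8b-8192"
    });

    const reply = chatCompletion.choices[0]?.message?.content || "";
    history.push({ role: "assistant", content: reply });
    return reply;
}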

References:
- Build a smart AI voice assistant
- Basic tutorial: Assistants API by OpenAI