My Genie Got It Wrong: Evaluating LLMs for a RAG Chatbot

How do you choose the right LLM for a RAG chatbot? I compared Llama 70B, 405B, and GPT-4 across 100 iterations. The AI agent's recommendation was wrong.


Not all LLMs are alike

I recently attended the Yow! Conference, where two talks on the sustainability and security of LLMs stood out to me. The first was Charles Humble's Green AI: Making Machine Learning Environmentally Sustainable; the second was Katharine Jarmul's Hacking AI Systems: How to (Still) Trick Artificial Intelligence.

Both speakers discussed the concept of "Using the right model for the job."

From a sustainability perspective, right-sizing the LLM reduces energy consumption. One speaker suggested using open source models where possible.

From a security perspective, an over-parametrised LLM can expose your system to abuse.

Quality Coach Research Tool

I've created a research tool that lets you explore the Quality Coach's Handbook based on your specific role and questions. Using retrieval-augmented generation (RAG), it provides tailored summaries and points you to relevant chapters for deeper reading. It's like having a personalised guide through the handbook that adapts to whether you're a test lead, engineering manager, or quality coach.
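For context, the tool follows the standard RAG pattern: embed the question, retrieve the most relevant handbook passages, and ask the model to answer for the reader's persona. Here's a minimal sketch of that flow, assuming the OpenAI Python client; the handbook chunks, persona string, and prompt wording are placeholders rather than the tool's actual implementation.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical pre-chunked handbook content; the real tool indexes whole chapters.
handbook_chunks = [
    "Chapter 3: A quality coach enables teams to own their own testing...",
    "Chapter 7: Test leads should focus on risk-based coverage...",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vectors = embed(handbook_chunks)

def answer(question: str, persona: str, model: str = "gpt-4") -> str:
    # Retrieve the chunks most similar to the question (cosine similarity).
    q = embed([question])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    context = "\n".join(handbook_chunks[i] for i in scores.argsort()[-2:])

    # Ask the model to answer for the reader's persona, grounded in the context.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": f"Answer for a {persona}, using only this context:\n"
                        f"{context}\nPoint the reader to the relevant chapters."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I coach a team on exploratory testing?", "test lead"))
```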

With greater awareness from the talks, I realised I needed to be more particular about the model I used. Sustainability, security, and content accuracy are high priorities for me, so I decided to evaluate the models against these criteria. It turns out there's a name for this type of testing: evals.

Eval Criteria

Accuracy of responses also matters. This is my book. There is no way I'm going to allow a chatbot to authoritatively make up content in my name.

The criteria for the evaluation were:

  1. Reasoning Ability: Can it respond effectively depending on the persona (CEO vs. Test Lead)?
  2. Energy Efficiency: Can we minimise the carbon cost without sacrificing quality?
  3. Open Source: Can we move away from proprietary providers to use open-source?

Potential Models

I used GPT-4 as the baseline, since that is what I used in my MVP.

The two models I decided to evaluate against GPT-4 were Llama 3.1 70B and Llama 3.1 405B. In particular, I wanted to test the 405B model because the genie (Kent Beck's term for an agent) highly recommended it as the preferred reasoning model. It confidently explained that smaller models would fail the accuracy test.

This recommendation immediately made me suspicious.
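So I put the recommendation to the test. Below is a minimal sketch of the kind of eval harness I mean: run the same persona-tagged questions against each candidate over repeated iterations and score the answers against a reference. The Llama endpoint assumes an OpenAI-compatible provider, the model identifiers are placeholders, and score_answer is a stand-in for a real grader (a rubric or an LLM judge); energy efficiency and openness are assessed outside the harness.

```python
from collections import defaultdict
from openai import OpenAI

# Assumed: the Llama models sit behind an OpenAI-compatible endpoint,
# e.g. a hosted inference provider or a local vLLM server.
openai_client = OpenAI()
llama_client = OpenAI(base_url="https://example-provider.com/v1", api_key="...")

CANDIDATES = {
    "gpt-4": openai_client,                             # baseline from the MVP
    "meta-llama/Llama-3.1-70B-Instruct": llama_client,  # placeholder model IDs
    "meta-llama/Llama-3.1-405B-Instruct": llama_client,
}

EVAL_SET = [  # hypothetical eval cases: (persona, question, reference answer)
    ("CEO", "Why invest in quality coaching?", "Quality coaching reduces..."),
    ("test lead", "How should I plan regression testing?", "Use risk-based..."),
]

def score_answer(answer: str, reference: str) -> float:
    """Stand-in scorer; replace with a rubric or LLM-as-judge grader."""
    return float(any(w in answer.lower() for w in reference.lower().split()[:5]))

def run_evals(iterations: int = 100) -> dict[str, float]:
    # Collect a score per model per iteration per eval case, then average.
    totals = defaultdict(list)
    for model, client in CANDIDATES.items():
        for _ in range(iterations):
            for persona, question, reference in EVAL_SET:
                resp = client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": f"Answer for a {persona}."},
                        {"role": "user", "content": question},
                    ],
                )
                answer = resp.choices[0].message.content
                totals[model].append(score_answer(answer, reference))
    return {m: sum(s) / len(s) for m, s in totals.items()}

print(run_evals())
```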