LLM Evals—One of The Most Crucial Parts of Artificial Intelligence Apps

Apr 30 / AI Degree
Evaluating how well Large Language Models (LLMs) and AI agents perform is one of the most crucial parts of making sure they work correctly, especially once they are used in real-world applications. As these AI systems become more complex and widespread, understanding their performance becomes even more vital.

This discussion will draw primarily from insights provided by Aparna Dhinakaran, CEO of Arize, as presented in her YouTube video on evaluating AI agents and her Medium article on building and benchmarking LLM evaluations. Let's take a closer look.


Why Evaluation is Absolutely Critical

When you put AI agents into production, it's super important to know how they are actually performing and to evaluate them so they work reliably in the real world. The complexity of these applications often remains hidden until problems crop up. Think of it like this: even a single line of code can set off a chain of many calls, and each of those calls needs its own specific evaluation.

This is true for standard text-based chatbots, but it's even more vital for multimodal agents, like voice AI, which are already transforming call centres. For example, applications like the Priceline Pennybot let people book entire holidays hands-free, just by using their voice.

For multimodal agents, especially those involving voice, you need extra types of evaluations beyond just checking the text. This means you must evaluate:

  • The audio chunk itself, not just the written transcript.
  • User sentiment during the conversation.
  • The accuracy of the speech-to-text transcription.
  • The consistency of the tone throughout the conversation.
  • Audio-specific aspects such as the caller's intent and overall speech quality.
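
To make one of these checks concrete, speech-to-text accuracy is commonly scored with word error rate (WER). The sketch below is a minimal, self-contained illustration; the two transcripts are made-up examples rather than output from any real system.

```python
# Minimal word error rate (WER) sketch for scoring speech-to-text accuracy.
# Both transcripts below are made-up examples, not output from a real system.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard word-level edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("book me a flight to london tomorrow",
                      "book me flight to london tomorrow"))  # one dropped word -> ~0.14
```

In practice you would compute this between a human reference transcript and the agent's transcript for each audio chunk, alongside the LLM-based checks for sentiment and tone.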

The Key Parts of an AI Agent and How to Evaluate Them

To properly evaluate an AI agent, you need to understand its main components:
Router: Think of this as the "boss" of the agent. Its job is to decide what the agent should do next and which "skill" to call based on what the user asks. 
  • Example: In an online shopping agent, if you ask about returns or discounts, the router decides whether to direct you to customer service, product suggestions, or discount information.
  • Evaluation: The most important thing to evaluate here is whether the router called the right skill with the right parameters. If it picks the wrong skill (e.g., sending you to customer service when you asked for leggings), the whole experience can go wrong. You need to assess the router's "control flow" to ensure it's making the correct choices.
Skills: These are the actual "logical chains" that do the work. A skill might involve calling an LLM, making an API call, or both. 
  • Evaluation: Evaluating skills can be complex because they have many parts. For example, in a "Retrieval Augmented Generation" (RAG) skill, you need to check: 
  • The relevance of the information ("chunks") that were retrieved.
  • The correctness of the final answer that was generated.
  • A particularly tricky evaluation for skills is "path convergence". This looks at how reliably and efficiently (succinctly) the agent takes the right number of steps to complete a task. Different LLMs can lead to very different numbers of steps for the same job.
Memory: This part stores what the agent previously discussed and keeps track of the conversation's state. 
  • Importance: Since agent interactions usually involve multiple turns, memory is vital to ensure the agent doesn't "forget" what was said earlier.
An agent's operation can involve multiple router calls and skill executions, with memory keeping track of everything. Every single step in an agent's journey is a place where something can go wrong. This is why you need to have evaluations running throughout your application to help you figure out if an issue happened at the router level, the skill level, or somewhere else in the process.
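
As a hedged sketch of the router check described above, the snippet below compares the skill and parameters the router actually chose against a small set of labelled test cases. The `route()` function, the skill names, and the test cases are hypothetical placeholders standing in for your own agent, not Arize's implementation.

```python
# Hedged sketch: did the router call the right skill with the right parameters?
# `route()`, the skill names, and the test cases are hypothetical placeholders.

test_cases = [
    {"query": "I want to return the leggings I bought last week",
     "expected_skill": "customer_service", "expected_params": {"topic": "returns"}},
    {"query": "Show me black leggings under $50",
     "expected_skill": "product_search", "expected_params": {"category": "leggings"}},
]

def route(query: str) -> dict:
    """Placeholder for your LLM-backed router; returns {'skill': ..., 'params': {...}}."""
    return {"skill": "customer_service", "params": {"topic": "returns"}}  # dummy output

def router_accuracy(cases) -> float:
    correct = 0
    for case in cases:
        decision = route(case["query"])
        skill_ok = decision["skill"] == case["expected_skill"]
        # Only check the parameters the test case explicitly cares about.
        params_ok = all(decision["params"].get(k) == v
                        for k, v in case["expected_params"].items())
        correct += skill_ok and params_ok
    return correct / len(cases)

print(router_accuracy(test_cases))  # fraction of queries routed correctly
```

The same pattern extends to skill-level checks: store the expected outcome for each step of the trace and score every step, not just the final answer.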

For instance, at Arize, they have their own agent (Copilot) where they run evaluations at every single step of its operation. This includes checking the overall response, whether the router picked the correct path, whether the right arguments were passed, and whether the task was completed correctly.

LLM Model Evaluation vs. LLM System Evaluation: What’s the Difference?

The term "LLM evals" can be confusing because it's used in different ways:
LLM Model Evaluation:
  • What it is: This focuses on the overall performance of the foundational LLMs themselves (like GPT-4 or Llama).
  • Who uses it: Primarily model developers (e.g., at OpenAI, Google, Meta) who create these base models. They use these evals to measure how effective their models are across various tasks.
  • Examples: Metrics like HellaSwag (sentence completion), TruthfulQA (truthfulness), and MMLU (multitasking ability).
  • For you (ML practitioner): This is usually a one-time step when you're deciding which foundational model to use for your application.
Dhinakaran, A. (2023, October 13). The Guide To LLM Evals: How To Build and Benchmark Your Evals. TDS Archive. https://medium.com/data-science/llm-evals-setup-and-the-metrics-that-matter-2cc27e8e35f3
LLM System Evaluation:
  • What it is: This involves evaluating the complete system that you control, especially the prompt (or prompt template) and the context you provide. It assesses how well your inputs lead to the desired outputs.
  • Who uses it: ML practitioners building on top of LLMs; this kind of evaluation covers the majority of an application's life cycle.
  • Example: An LLM can evaluate your chatbot's responses for "usefulness" or "politeness," and this same evaluation can track performance changes over time in production.
  • Key Concept: The heart of LLM system evaluations is AI evaluating AI. LLMs are used to create "synthetic ground truth" data, which helps evaluate another system. This is essential because getting enough human feedback is incredibly difficult and expensive, especially if you need to evaluate every single LLM sub-call.
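
To illustrate the "AI evaluating AI" idea, here is a minimal sketch of an LLM-as-judge call. It happens to use the OpenAI Python client purely as an example; the model name, the template wording, and the "good"/"bad" labels are all assumptions you would adapt to your own use case.

```python
# Hedged sketch of an LLM-as-judge eval: a second LLM labels the chatbot's output.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment; the model
# name and template wording are illustrative choices, not prescriptions.
from openai import OpenAI

client = OpenAI()

EVAL_TEMPLATE = """You are evaluating a chatbot answer.
Question: {question}
Answer: {answer}
Is the answer useful and polite? Respond with exactly one word: "good" or "bad"."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judgement as deterministic as possible
        messages=[{"role": "user",
                   "content": EVAL_TEMPLATE.format(question=question, answer=answer)}],
    )
    return response.choices[0].message.content.strip().lower()

# Example usage:
# label = judge("How do I reset my password?",
#               "Click 'Forgot password' on the login page.")
```

The same template, once benchmarked, can run both offline before release and in production to track performance changes over time.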

Key Metrics and How to Build LLM Evals

The specific metric you use for LLM system evaluations depends entirely on your use case.

  • For extracting structured information, you might look at completeness.
  • For question answering, you might check accuracy, politeness, or brevity.
  • For RAG, you'd assess the relevance of retrieved documents and the final answer.
  • For applications involving children, age-appropriateness and toxicity are crucial.
  • You'll also want to look for hallucinations.
Here are the steps to build an LLM evaluation:
  1. Start with a Benchmark: Before you rely on an eval, benchmark the eval itself so you know how well it actually performs.
  2. Choose Your Metric: Pick the one that best suits your specific use case.
  3. Use a "Golden Dataset": This dataset should represent the kind of data your LLM eval will see, and it must have "ground truth" labels (often from human feedback). Standardised datasets are available for common tasks.
  4. Decide on an Evaluation LLM: This can be a different LLM than the one in your main application (e.g., GPT-4 for evaluation and Llama for the application). Your choice will balance cost and accuracy.
  5. Develop or Adjust an Eval Template: If you're using a library, start with an existing template and modify it for any specific needs, or build one from scratch. The template needs a clear structure, defining: 
  • What the input is.
  • What is being asked of the evaluation LLM.
  • The possible output formats (e.g., "relevant" or "irrelevant").
  6. Run the Eval and Get Metrics: Execute the evaluation on your golden dataset and generate key metrics like overall accuracy, precision, recall, and F1 score.
  7. Iterate and Improve: If the performance isn't good enough, change the prompt template iteratively. Be careful not to "overfit" the template to your golden dataset.
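
Pulling steps 3 to 6 together, a minimal sketch might look like the following. The golden-dataset rows are made up, and the predictions are hard-coded where, in practice, you would collect them by running your eval template (for example, a judge like the one sketched earlier) over every row.

```python
# Hedged sketch of steps 3 and 6: a tiny golden dataset with human labels,
# scored against the labels an eval template produced for the same rows.
# The rows and predictions are made up for illustration.
from sklearn.metrics import classification_report

golden_dataset = [
    {"query": "What's your return policy?",
     "retrieved_chunk": "Items can be returned within 30 days of purchase.",
     "label": "relevant"},
    {"query": "What's your return policy?",
     "retrieved_chunk": "Our CEO founded the company in 2012.",
     "label": "irrelevant"},
    # ... in practice, hundreds of human-labelled rows ...
]

y_true = [row["label"] for row in golden_dataset]
# Pretend output of the eval LLM; normally you'd call it once per row here.
y_pred = ["relevant", "relevant"]

print(classification_report(y_true, y_pred,
                            labels=["relevant", "irrelevant"], zero_division=0))
```

If the report shows weak precision or recall on the class you care about, iterate on the template (step 7) before trusting it in production.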

The Importance of Precision and Recall

While overall accuracy is often used, it's not enough for evaluating LLM prompt templates, especially when there's a significant imbalance in your data.
  • Example: Imagine a chatbot that's so good it returns "relevant" results 99.99% of the time. An evaluation template that always outputs "relevant" would appear 99.99% accurate, but it would completely miss the crucial cases where the model failed and gave an "irrelevant" result.
  • In such situations, precision and recall (or their combination, the F1 score) are much better measures of performance.
  • The confusion matrix is also a useful way to visually see the percentages of correct and incorrect predictions.
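
The point is easy to see with synthetic numbers: below, only one response in ten thousand is actually irrelevant, and a lazy eval that always answers "relevant" still scores 99.99% accuracy while catching none of the failures.

```python
# Synthetic illustration of why accuracy misleads under heavy class imbalance.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["relevant"] * 9999 + ["irrelevant"]  # ground truth: one real failure
y_pred = ["relevant"] * 10000                  # an eval that always says "relevant"

print(accuracy_score(y_true, y_pred))                                            # 0.9999
print(precision_score(y_true, y_pred, pos_label="irrelevant", zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, pos_label="irrelevant", zero_division=0))     # 0.0
```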

Running LLM Evals on Your Application

Once your LLM evaluation is set up and proven reliable, you can use it to measure and improve your actual LLM application.
  1. You give your application the input, and it generates an answer.
  2. Then, you feed both the original prompt and the application's answer to your evaluation LLM (the one you built in the previous steps) to assess, for example, the answer's relevance. It's a best practice to use a library with built-in prompt templates for this, as it makes the process repeatable and flexible.
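
For instance, a simple production loop might sample a slice of recent traffic and feed each prompt/answer pair through the eval you benchmarked earlier. Everything named here is a hypothetical stand-in: `fetch_recent_traces()` for your own trace store and `judge()` for your validated eval LLM call.

```python
# Hedged sketch: running a benchmarked eval over a sample of production traffic.
# `fetch_recent_traces()` and `judge()` are hypothetical stand-ins for your own
# logging store and the eval LLM call you validated against the golden dataset.
import random

def fetch_recent_traces():
    """Placeholder: return recent (prompt, answer) pairs from your application logs."""
    return [("How do I reset my password?",
             "Click 'Forgot password' on the login page and follow the email link.")]

def judge(prompt: str, answer: str) -> str:
    """Placeholder for the eval LLM; returns 'good' or 'bad'."""
    return "good"

traces = fetch_recent_traces()
sample = random.sample(traces, k=min(100, len(traces)))  # sample rather than eval every call
labels = [judge(prompt, answer) for prompt, answer in sample]
print("share judged good:", labels.count("good") / len(labels))
```

Tracking that share over time, broken down by router path and skill, is what lets you see where in the flow a regression crept in.
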
These evaluations are crucial in three different environments:
  • Pre-production (Benchmarking): When you're initially setting up and validating your evaluations.
  • Pre-production (Testing): To understand how your system performs before you release it to customers (like offline evaluation in traditional machine learning).
  • Production: For continuously understanding your system's performance after deployment. Real-world data, users, and even models can change unpredictably. Evals integrated throughout your application's "trace" help you pinpoint exactly where an issue occurred—at the router level, skill level, or elsewhere in the flow.
When planning your evaluation strategy, consider these questions:
  • How much data to sample? While LLM evals are cheaper than human labeling, you can't evaluate every single example. Sample enough data to be representative.
  • Which evals to use? Choose evals specific to your use case (e.g., relevancy for search, toxicity for safety). You might need multiple metrics to diagnose why an overall metric is underperforming (e.g., bad information retrieval affecting question-answering accuracy).
  • Which model for your application? Run model evaluations to determine the best LLM for your specific application, weighing trade-offs like recall versus precision.
In short, the ability to evaluate your AI application's performance is absolutely essential for production code. While LLMs introduce new complexities, they also give us the powerful capability of AI evaluating AI to help run these assessments. By focusing on thorough LLM system evaluation, understanding the agent's parts, and using strong metrics like precision and recall, teams can ensure their AI applications consistently perform well in the real world.

Your Role in The Era of AI Evaluation

As AI systems become more complex and agentic workflows become the norm, the ability to measure, debug, and optimize LLM-based applications is quickly becoming one of the most valuable skills in AI development.

If you’re excited by the challenges of AI evaluation and want to learn how to build, assess, and improve real-world LLM systems, AI Degree offers the perfect starting point. With hands-on projects, expert-led modules, and a curriculum grounded in production-level AI use cases, AI Degree helps you develop the technical and strategic thinking required in this emerging space. Begin your journey at aidegree.org.

Start Your AI Journey Today!

If you’re ready to become an elite AI developer, now is the time to take action.
To truly harness the power of AI, you need more than just curiosity—you need expertise. The AI Degree program offers a comprehensive, flexible curriculum that lets you learn at your own pace.

From foundational topics to advanced AI development, you’ll gain the skills needed to excel in this dynamic field. Scholarships make it accessible to everyone, and optional ECTS credits provide global recognition.

Start your journey today. Explore free courses, apply for scholarships, and begin building the future of AI—your future. Learn More Here.