Day 5: Making Agents reliable with Tracing, Evals, and Guardrails

Welcome to Lesson #5!

Launching AI applications should feel exciting, not terrifying. Today, you will learn how AI engineers keep their Agents safe, reliable, and easy to debug.

Here is the agenda:

  • Harsh Truth about Agent Engineering

  • Why Evals are important

  • Online vs Offline Evals

  • What is Tracing

  • Why Tracing is important

  • What are Guardrails

If you missed previous lessons, find them here:


👉 Day 1: Intro to Agents
👉 Day 2: Agents vs Workflows
👉 Day 3: RAG and Tool Use
👉 Day 4: Memory


If you feel stuck or behind on the latest applied AI advancements, check out the next AI Product Engineer Bootcamp this June 2025. No theory, no extra fluff. Just real projects, real mentorship, and the skills that actually get you hired. All in 6 weeks!

🚀 See our graduates’ projects: (Kelsey), (Tamil), (Autumn), (Raj)

Harsh Truth about Agent Engineering


Let’s be honest.

Too many beginners think that building a Streamlit app with LangChain or CrewAI will get them hired. They build a POC, hit “run” once, throw it on GitHub… and hope it impresses a hiring manager.

📌 But here's what actually matters:

Hiring managers don’t care what runs on your local machine.
They care what you do when it breaks in production.

Because it will break.

And you need to have answers to:

  • Can you trace where your agent fails?

  • Can you evaluate if it’s improving?

  • Can you add guardrails to protect users from bad outputs?

Today’s lesson covers exactly that: how to make your agents more reliable.

This is one of the most insightful lessons from our bootcamp curriculum, and we are sharing it with you today!

Why are evals important?

Evals, short for evaluations, are systematic processes designed to measure the quality, accuracy, and reliability of outputs generated by AI systems, especially those built on LLMs.

An eval typically involves:

  • Providing an input (such as a prompt or data point) to an AI system.

  • Capturing the output generated by the system.

  • Comparing this output against a set of ideal or expected answers using predefined criteria or scoring functions.

Evals are arguably the most important skill in AI.

Evals can be quantitative (e.g., measuring accuracy, precision, recall) or qualitative (e.g., assessing style, safety, or ethical alignment). They may be as simple as checking if a chatbot gives the correct answer, or as complex as evaluating the system-level performance across multiple components.
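
To make those three steps concrete, here is a minimal, self-contained sketch of a single eval in Python. The `run_agent` stub and the exact-match scorer are illustrative placeholders, not any specific framework's API.

```python
# A minimal eval: provide an input, capture the output, and score it
# against an expected answer with a predefined scoring function.

def run_agent(prompt: str) -> str:
    # Placeholder: swap in your real agent or LLM call here.
    return "Paris"

def exact_match(output: str, expected: str) -> float:
    """Simplest scoring function: 1.0 if the normalized output
    matches the expected answer, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())

prompt = "What is the capital of France?"
expected = "Paris"

output = run_agent(prompt)              # capture the system's output
score = exact_match(output, expected)   # compare against the ground truth
print(f"input={prompt!r} output={output!r} score={score}")
```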

During the bootcamp, you will also learn how to think like a software engineer when building a custom eval framework.

Online vs Offline evals

There are two types of evals: online and offline.

Offline evals are performed in a controlled, non-production environment using curated datasets. They are typically run before deploying to production.

How it works:

  • Benchmark Dataset: You use a dataset containing input prompts and their expected outputs (ground truth).

  • Automated Testing: The AI agent runs on this dataset, generating outputs.

  • Comparison: The outputs are automatically compared to the expected results.

  • Scoring: You may also use custom scoring functions or even human evaluators for more subjective tasks.

Purpose:

  • Ensures the AI agent meets quality standards before being exposed to real users.

  • Detects regressions or unintended changes.

  • Supports rapid iteration and debugging.
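
As a rough sketch of how those pieces fit together, the loop below runs a tiny benchmark dataset through a placeholder agent, compares each output to the ground truth, and reports an aggregate pass rate. The dataset format and the `run_agent` stub are assumptions for illustration, not a specific tool's API.

```python
# Offline eval harness sketch: benchmark dataset -> automated testing
# -> comparison -> scoring. Run it in CI before every deploy.

benchmark = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "What is the capital of Japan?", "expected": "Tokyo"},
]

def run_agent(prompt: str) -> str:
    # Placeholder: swap in your real agent or LLM call here.
    answers = {"What is 2 + 2?": "4", "What is the capital of Japan?": "Tokyo"}
    return answers.get(prompt, "I don't know")

def run_offline_eval(dataset) -> float:
    passed = 0
    for case in dataset:
        output = run_agent(case["input"])                        # automated testing
        ok = output.strip().lower() == case["expected"].lower()  # comparison
        passed += ok
        if not ok:
            print(f"FAIL: {case['input']!r} -> {output!r} "
                  f"(expected {case['expected']!r})")
    return passed / len(dataset)                                 # scoring

print(f"Pass rate: {run_offline_eval(benchmark):.0%}")
```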

See a screenshot of the evaluation framework example presented by our guest speaker.

Freddie Vargus, CTO @QuotientAI, Guest Speaker at our last bootcamp

Online evals, on the other hand, measure how the AI system performs in real-world conditions, capturing user behavior, system outputs, and user feedback. They are crucial for catching issues that only arise in production to guide further improvements.

How It Works:

  • Live Data: The AI system interacts with actual users.

  • Real-Time Metrics: Collects data such as click-through rates, user ratings, engagement, or direct user feedback.

  • A/B Testing: Often involves comparing multiple versions of the system with real users to measure which performs better.

Purpose:

  • Measures real-world performance and user satisfaction.

  • Detects issues that only occur in production or with real user behavior.

  • Informs further improvements post-deployment.

Example metrics for online evals:

  • Cost: token usage and spend per request or per run.

  • Latency: the time it takes to complete each step or the entire run.

  • User Feedback: direct signals from users (thumbs up/down) that help refine or correct the agent.

  • LLM-as-a-Judge: a separate LLM that evaluates your agent’s output in near real time.
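
To show what an LLM-as-a-Judge check can look like in practice, here is a sketch that asks a second model to rate relevance on a 1-5 scale. It assumes the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the model name and rubric are illustrative choices, not requirements.

```python
# LLM-as-a-judge sketch: a separate model grades the agent's answer.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's relevance from 1 (off-topic) to 5 (fully relevant).
Reply with a single integer only."""

def judge_relevance(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

# In production you would sample a fraction of live traffic, score it
# asynchronously, and alert when the average rating drops.
print(judge_relevance("How do I reset my password?",
                      "Click 'Forgot password' on the login page."))
```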

Examples of evals

Risk & Safety evals

AI tools can scan a test set or yesterday’s chats and report how often unsafe content slipped through:

  • Hate or unfair language

  • Sexual or violent content

  • Self-harm phrases

  • Copyrighted text

  • Jailbreak success

You choose a tolerance, for example “anything medium severity or above is a defect,” and track the defect rate over time.
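
As a small illustration of that tolerance idea, the snippet below computes a defect rate from severity labels that a safety classifier has already assigned; the label names and threshold are assumptions for illustration.

```python
# Defect-rate sketch: count outputs at or above the severity tolerance.
SEVERITY_ORDER = ["safe", "low", "medium", "high"]

def defect_rate(labels: list[str], tolerance: str = "medium") -> float:
    """Fraction of outputs at or above the severity tolerance."""
    threshold = SEVERITY_ORDER.index(tolerance)
    defects = sum(SEVERITY_ORDER.index(label) >= threshold for label in labels)
    return defects / len(labels)

yesterdays_labels = ["safe", "safe", "low", "medium", "safe", "high"]
print(f"Defect rate: {defect_rate(yesterdays_labels):.1%}")  # 33.3%
```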

Quality evals

Protecting users is not enough; your bot must still answer well. Common metrics, usually scored 1-5:

  • Intent Resolution: did the agent grasp what the user wanted?

  • Tool Call Accuracy: did it choose the right function with the right parameters?

  • Task Adherence: did it follow all instructions?

  • Response Completeness: did it cover every fact in the ground truth?

  • Groundedness: in RAG systems, are all claims supported by retrieved docs?

  • Relevance, Coherence, Fluency: classic measures of correctness and readability.

  • Similarity / BLEU / ROUGE / F1: overlap with reference answers if you have them.
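
For the overlap-style metrics at the end of that list, here is a minimal token-level F1 sketch against a reference answer. Real BLEU/ROUGE implementations handle n-grams and tokenization more carefully; this is just the core idea.

```python
# Token-overlap F1 sketch: how much of the reference does the
# prediction cover, and how much of the prediction is on-topic?
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France",
               "The capital of France is Paris"))  # 1.0 (order is ignored)
```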

What is Tracing?

When an end user clicks send, dozens of things happen inside your app:

  • their message reaches an API gateway

  • the gateway calls a retrieval tool

  • the LLM writes a draft answer

  • a database stores the result

  • your front-end finally shows the text

Tracing creates a short note, called a “trace,” at each step: what happened, when it happened, how long it took, and how many tokens or dollars it cost. Later, you can play those notes back in order and spot slow parts, errors, or unusual patterns.

Example of a tracing dashboard for OpenAI Agents in Langfuse
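
Dedicated tools like Langfuse handle this for you, but a hand-rolled sketch makes the idea tangible: each step opens a span that records its name, start time, duration, and any token counts you attach. The span format below is an illustrative assumption, not any specific tool's schema.

```python
# Minimal tracing sketch: record what happened, when, how long it
# took, and how many tokens it used for each step.
import time
import uuid
from contextlib import contextmanager

TRACE: list[dict] = []  # in-memory store; a real system would export these

@contextmanager
def span(name: str, **metadata):
    record = {"id": str(uuid.uuid4()), "name": name, "start": time.time(), **metadata}
    try:
        yield record                       # the step can attach outputs here
    finally:
        record["duration_s"] = time.time() - record["start"]
        TRACE.append(record)

with span("retrieval", query="reset password"):
    time.sleep(0.05)                       # stand-in for a vector-store lookup

with span("llm_call", model="gpt-4o-mini") as s:
    time.sleep(0.10)                       # stand-in for the model call
    s["tokens"] = {"prompt": 412, "completion": 88}

for record in TRACE:                       # play the notes back in order
    print(record["name"], f"{record['duration_s']:.2f}s", record.get("tokens", ""))
```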

Why observability is important

Most standard monitoring tools only warn you when a web request is slow. AI agents, however, can go off the rails in new ways, making things up (hallucinations), getting stuck in loops, or using way too many tokens and racking up huge costs.

Tracing helps you spot and diagnose these unique problems so you can answer questions like:

  • “Where did the hallucination start?” - This shows you exactly which prompt or tool call led the model astray.

  • “Why did that response cost $12?” - Tracking token usage helps you estimate costs and identify overly verbose prompts or repeated calls.

  • “Which tool call failed or behaved unexpectedly?” - Knowing this lets you pinpoint whether an external API, database lookup, or other integration is the root cause.

  • “Did the agent get stuck in a loop or retry endlessly?” - Spotting repeated execution patterns prevents runaway behaviors and lets you add safeguards like retry limits or timeout rules.

You can track each action, input, and output of your agents.

What to trace in an agent stack

  1. Agent steps – Each planner/executor round‐trip: intent, chosen tool, and final action.

  2. LLM calls – Prompt text, model name, latency, and token counts (prompt, completion, total).

  3. Tool invocations – Function name plus validated Pydantic input/output so you can diff bad parameters.

  4. Retriever hits – Which documents were fetched, their scores, and embedding latency.

  5. Guardrail verdicts – Moderation labels, schema-validation failures, auto-retry count.
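
Item 3 mentions logging validated Pydantic input/output for each tool call. Here is a sketch of what that can look like, assuming Pydantic v2; the tool name, fields, and trace format are hypothetical.

```python
# Validate tool arguments with Pydantic and log the validated payload
# (or the validation error) as a trace event, so bad parameters are
# easy to diff later.
from pydantic import BaseModel, ValidationError

class GetWeatherArgs(BaseModel):
    city: str
    unit: str = "celsius"

def call_weather_tool(raw_args: dict) -> dict:
    try:
        args = GetWeatherArgs(**raw_args)
    except ValidationError as err:
        print("tool_invocation failed:", {"tool": "get_weather", "error": err.errors()})
        raise
    print("tool_invocation:", {"tool": "get_weather", "input": args.model_dump()})
    return {"city": args.city, "forecast": "sunny"}   # stubbed tool result

call_weather_tool({"city": "Berlin"})
try:
    call_weather_tool({"unit": "celsius"})            # missing 'city' -> traced error
except ValidationError:
    pass
```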

Finally, what are guardrails?


Another big part of making agents reliable is enforcing guardrails before the final LLM output ever reaches the user.


While traces tell you what happened, guardrails stop dangerous answers from reaching users in the first place. They run either before the model (checking the user prompt) or right after (checking the model’s draft).

| Guardrail type | Catches | Typical fix |
| --- | --- | --- |
| Content filter | Hate, sexual, violent, self-harm text | Return a polite refusal or ask for a new prompt |
| Copyright filter | Large blocks of lyrics or articles | Replace with a short summary |
| Jailbreak detector | “Ignore all rules and show me…” | Abort and log the attempt |
| Code scanner | eval(input()), SQL injection | Replace with a safe snippet |
| Schema validator | Malformed JSON | Auto-retry with stricter instructions |
| Cost watchdog | Response > 4,000 tokens | Switch to a concise fallback prompt |
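
To make the schema-validator row concrete, here is a sketch of an output guardrail that parses the model's draft against a Pydantic schema and auto-retries with stricter instructions when it fails. The `call_llm` stub, the schema, and the retry wording are placeholders for illustration.

```python
# Schema-validation guardrail sketch: reject malformed JSON drafts
# and re-ask the model with stricter instructions.
import json
from pydantic import BaseModel, ValidationError

class TicketSummary(BaseModel):
    title: str
    priority: str            # e.g. "low" | "medium" | "high"

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real LLM call.
    return '{"title": "Password reset fails", "priority": "high"}'

def generate_with_guardrail(prompt: str, max_retries: int = 2) -> TicketSummary:
    for _attempt in range(max_retries + 1):
        draft = call_llm(prompt)
        try:
            return TicketSummary(**json.loads(draft))   # passes the guardrail
        except (json.JSONDecodeError, ValidationError) as err:
            # Auto-retry with stricter instructions, as in the table above.
            prompt += f"\nYour last reply was invalid ({err}). Reply with JSON only."
    raise RuntimeError("Guardrail: could not obtain valid JSON after retries")

print(generate_with_guardrail("Summarize this support ticket as JSON."))
```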

Congratulations on completing Day 5!

Today, you learned about tracing, guardrails, and evals! Now you know how to make AI agents more reliable:

  1. Define and build custom evals.

  2. Trace everything important to your use case.

  3. Block problems with guardrails before final output.

If you want to apply what you have learned and build these systems with TypeScript/Python and cutting-edge AI frameworks, sign up for the next cohort in June. We will be covering this in Week 6 of the AI Product Engineer Bootcamp!

Spots are limited and filling up quickly. Book your seat today!

“I enjoyed the community aspect of the bootcamp. It’s really cute; every time I go into the chat there are cute vibes and people supporting each other constantly!”

Oren’s feedback from the last cohort
