A former game developer who still can’t hear “AI” without thinking of A*.
After falling for Erlang and eventually Elixir, Hernán now works primarily as a consultant and backend engineer, but he never passes up a chance to tinker with the frontend, whether it’s building a game loop in Rust or shipping features for an iOS app.
His interest in AI may have started with trying to train NPCs for games, but lately he’s been putting his gaming GPU to work experimenting with LLMs, RAG systems and programming assistants.
It’s happening to more of us every day: out of nowhere, we’re told to integrate an LLM into the product. And so we do: we wire up a RAG pipeline, add agents to detect adversarial prompts, and tweak all our system prompts.
And it works! The answers are both fluent and polite, and it even says “I don’t know” when it should. Everything looks great… until a tester reports that the AI promised them a 90% discount or gave dangerously bad advice.
How do we stop that from reaching production, when unit tests can’t validate human language?
In this talk, we’ll walk through practical ways to evaluate LLM-driven apps. From basic BLEU and ROUGE metrics to aspect-based evaluation and retrieval scoring, you’ll learn what to measure, when to trust it, and how to catch confident nonsense before your users do.
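To give a flavour of the metric-based checks discussed, here is a minimal sketch that scores a model answer against a golden reference with BLEU and ROUGE-L, using the sacrebleu and rouge-score Python packages. The question, reference answer, thresholds, and the get_llm_answer() helper are illustrative placeholders, not material from the talk.

```python
# Minimal sketch: compare a generated answer to a reference answer
# with BLEU (sacrebleu) and ROUGE-L (rouge-score).
# pip install sacrebleu rouge-score

import sacrebleu
from rouge_score import rouge_scorer


def get_llm_answer(question: str) -> str:
    # Placeholder for a call into your RAG pipeline or LLM client.
    return "Refunds are available within 30 days of purchase."


def test_refund_answer_stays_close_to_reference():
    reference = "You can request a refund up to 30 days after purchase."
    answer = get_llm_answer("What is your refund policy?")

    # Corpus-level BLEU over a single pair; references are a list of streams.
    bleu = sacrebleu.corpus_bleu([answer], [[reference]]).score

    # ROUGE-L F-measure captures longest-common-subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, answer)["rougeL"].fmeasure

    # Thresholds are illustrative; tune them against a labeled eval set.
    assert bleu > 20.0 or rouge_l > 0.4, (
        f"Answer drifted from reference (BLEU={bleu:.1f}, ROUGE-L={rouge_l:.2f})"
    )
```

Overlap metrics like these are only a first line of defence; they flag answers that drift far from a known-good reference but say nothing about factuality or tone, which is where aspect-based evaluation and retrieval scoring come in.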
Key Takeaways:
Target Audience: