
Evals for Conversational Data Capture

Published: at 12:00 AM

Exploring how sea.dev evaluates conversational data capture using a dynamic benchmarking framework.

Evaluating LLM-based conversational data capture

Matt Arderne
Co-founder @ sea.dev

Welcome to the sea.dev product blog, where we explain features, share insights, and explore use cases.

For more details on enabling Lending and Private Equity teams to effortlessly extract data from businesses, people, and documents, please visit our website.


The Problem

Working with LLMs in production can sometimes feel like building on shifting sands.

Measurement comes naturally to us as a team, so we almost immediately set out to quantify the uncertainty behind this feeling.

In practice, this meant prioritizing our most critical processes and building a representative and isolated way to measure them so that we could control and improve them.


Measuring the Problem

When we analyzed the most critical aspects of our customers' workflows, we found a gap in evaluations around complex human-machine interactions for data capture.

The issue? Existing evaluation frameworks focus only on single, isolated tasks.

What they miss? The dynamics of conversation as a way to extract information.

A key question we faced:

Can our system adaptively resolve ambiguity and contradictory information?

In the end, we built a framework to evaluate these conversational data capture processes, with a specific focus on lending.

This turned out to be a major development.


Why This Matters

It makes sense: taming the stochasticity of LLMs in production is an incredibly powerful competitive advantage.

Every LLM pipeline change we release goes through our evaluation framework, where we test it against our needs.

For example, you may have heard that Claude 3.7 does very well with coding tasks, but…


The best way to go beyond “it seems better” is to have a set of evaluations that say “it is measurably better.”

And “better” is context-dependent.

Building and refining your own proprietary evals is the moat.
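To make this concrete, here is a minimal sketch of what a proprietary regression eval can look like: a golden set of cases drawn from real workflows, scored against each pipeline version before release. Every name here (`EvalCase`, `GOLDEN_SET`, `run_pipeline`) is hypothetical, and the pipeline call is a stub standing in for an actual model invocation.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One golden example: an input plus the expected extracted value."""
    prompt: str
    expected: str

# Hypothetical golden set; in practice these come from real customer workflows.
GOLDEN_SET = [
    EvalCase("What is the loan amount? 'We need about $250k'", "$250k"),
    EvalCase("What is the loan term? 'Say, five years'", "5 years"),
]

def run_pipeline(prompt: str) -> str:
    """Stub standing in for the LLM pipeline under test; swap in the real call."""
    canned = {GOLDEN_SET[0].prompt: "$250k", GOLDEN_SET[1].prompt: "5 years"}
    return canned.get(prompt, "")

def score(cases: list[EvalCase]) -> float:
    """Fraction of cases where the pipeline output matches the expectation."""
    hits = sum(run_pipeline(c.prompt) == c.expected for c in cases)
    return hits / len(cases)

print(f"accuracy: {score(GOLDEN_SET):.2f}")
```

Running the same golden set against two pipeline versions turns "it seems better" into a number you can compare.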


Formalizing the Solution

As part of this process, we wrote a paper introducing a dynamic benchmarking framework to assess LLM-based conversational agents through simulated interactions that exhibit a recurring user trait.

Example:

Chat example

By simulating specific user characteristics at scale—whether ambiguous, contradictory, or evasive—we can:

  • Measure which system instance best recognizes behavior types that require follow-up.
  • Determine which models can effectively adapt to real-world conversational challenges.
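A toy version of one such simulated interaction might look like the sketch below: a scripted "contradictory" user supplies conflicting figures, and the check passes only if the agent notices and asks a resolving follow-up. The agent here is a deterministic stub, not a model; in the real framework both sides would be LLM-driven, and all class and function names are illustrative.

```python
import re

class ContradictoryUser:
    """Simulated user who contradicts an earlier answer, forcing a follow-up."""
    def __init__(self) -> None:
        self.turn = 0

    def reply(self, question: str) -> str:
        self.turn += 1
        if self.turn == 1:
            return "Revenue last year was $2M."
        return "Actually, revenue was $3M."  # contradicts the first answer

def agent_turn(history: list[str]) -> str:
    """Stub agent under test: ask for revenue, and flag any contradiction."""
    amounts = set(re.findall(r"\$\d+M", " ".join(history)))
    if len(amounts) > 1:
        return "You mentioned both figures; which revenue number is correct?"
    return "What was your revenue last year?"

def detects_contradiction(max_turns: int = 3) -> bool:
    """Run a short simulated dialogue; pass if the agent asks to resolve it."""
    user, history = ContradictoryUser(), []
    for _ in range(max_turns):
        agent_msg = agent_turn(history)
        if "which revenue number is correct" in agent_msg:
            return True
        history.append(user.reply(agent_msg))
    return False

print(detects_contradiction())  # prints True
```

Scaling this pattern up, with simulated users for each trait and many dialogues per trait, is what lets us compare how different system instances handle ambiguity, contradiction, and evasion.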

Next Steps

If you’d like to test this out, or if you’re also working on conversational evals, comment below! 🚀
