A mid-market professional services firm moved an AI agent into their live billing support queue earlier this year. The agent had one job: triage tickets, draft first responses, and pull context from previous interactions so the team wasn’t starting cold every time.
After a successful eight-week pilot on a controlled set of tickets, they connected it to the live queue. What followed wasn’t a disaster – no client received something they shouldn’t have – but the failure was subtler than that, and harder to fix.
What the pilot hadn’t tested
Within a few weeks, three gaps had surfaced.
The first was an audit trail. The agent was updating records and moving tickets, but sometimes it wasn’t clear what it had done, or why. When a team lead tried to reconstruct what had happened to one billing query, two people spent the better part of an afternoon piecing it together from different messages and system timestamps.
The second was a review process. During the pilot, three people handled review informally and it worked. In production, eight people were touching the queue. Some read every draft carefully. Others waved responses through with minimal checks, because the agent seemed to be working and they had other things to do. One message went out that was factually accurate but struck the wrong tone with a client chasing an overdue invoice.
The third was measurement. When the ops director asked what the agent had actually delivered, nobody could point to a KPI that had moved. The agent was running, but nothing about it could survive a budget conversation.
The agent wasn’t switched off, but it drifted into limbo. The team went back to their old workflows, occasionally checking the agent’s output after the fact, if at all.
We see this pattern often at aibl. Pilots test the agent; they rarely test everything around it: the audit trail, the review process, the ability to show what it actually delivered.
What they rebuilt
They put clear controls around how the agent operated. The agent drafted and flagged, a person approved or didn’t, and nothing that touched a client’s account moved without that sign-off. That kept the scope to what they could verify and gave them a stable base to build from.
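As a minimal sketch of what a gate like that might look like, assuming a simple status flag and a named approver – the names here (`DraftAction`, `approve`, `execute`) are hypothetical, not the firm’s code:

```python
# Illustrative approval gate: the agent drafts, a person signs off,
# and nothing executes without that recorded sign-off.
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class DraftAction:
    ticket_id: str
    proposed_response: str
    status: Status = Status.PENDING
    approved_by: str | None = None


def approve(action: DraftAction, reviewer: str) -> DraftAction:
    """A named person signs off; the approval is recorded, not implied."""
    action.status = Status.APPROVED
    action.approved_by = reviewer
    return action


def execute(action: DraftAction, send) -> None:
    """Refuse to act on anything that hasn't been explicitly approved."""
    if action.status is not Status.APPROVED:
        raise PermissionError(f"ticket {action.ticket_id}: no sign-off, nothing moves")
    send(action.proposed_response)


# Usage: draft, approve, then send -- skipping the middle step raises an error.
draft = DraftAction("T-1042", "Hi -- the duplicate charge has been reversed.")
execute(approve(draft, reviewer="j.smith"), send=print)
```

The point isn’t the code; it’s that approval becomes an explicit, recorded step rather than a social convention.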
Within that structure, they appointed one person accountable for what the agent was allowed to do, how outputs were reviewed, and when the rules needed changing. When a team member wasn’t sure whether to approve a draft on a disputed invoice, there was now someone whose job it was to know.
For the audit trail, they built a log of every decision: what came in, what context was pulled, what the agent proposed, and whether it was approved or paused for review. A team lead checking what had happened on a Tuesday afternoon could do it in two minutes without asking anyone.
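A log entry covering those four things can stay very simple. The sketch below is one way to structure it – an assumption for illustration, with made-up field names, not the schema they built:

```python
# One JSON line per decision: cheap to write, fast to grep, timestamped.
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class DecisionRecord:
    ticket_id: str
    received: str                 # what came in
    context_pulled: list[str]     # which prior interactions were retrieved
    agent_proposal: str           # what the agent suggested doing
    outcome: str                  # "approved" or "paused_for_review"
    reviewer: str | None = None
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_record(record: DecisionRecord, path: str = "agent_audit.jsonl") -> None:
    """Append the decision to a JSON-lines file for later reconstruction."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```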
They also needed a metric that meant something, so they narrowed the scope to one sub-category (clients querying invoice discrepancies), measured the current average first-response time, and set a target. The goal was to prove a measurable improvement on one thing before anyone funded the next.
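A metric that narrow is simple to compute directly. The sketch below assumes tickets arrive as (opened, first-response) timestamp pairs, already filtered to the one sub-category; the baseline and target figures are invented for illustration, not the firm’s numbers:

```python
from datetime import datetime, timedelta


def avg_first_response_hours(tickets) -> float:
    """Mean hours from ticket opened to first response, for one sub-category."""
    deltas = [responded - opened for opened, responded in tickets]
    return sum(deltas, timedelta()).total_seconds() / 3600 / len(deltas)


# Hypothetical data and targets -- illustrative only.
invoice_discrepancy_tickets = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 13, 30)),
    (datetime(2024, 5, 2, 10, 15), datetime(2024, 5, 2, 11, 45)),
]
baseline_hours = 6.5   # measured before go-live (hypothetical)
target_hours = 2.0     # agreed target (hypothetical)
current = avg_first_response_hours(invoice_discrepancy_tickets)
print(f"current {current:.1f}h vs baseline {baseline_hours}h, target {target_hours}h")
```

One number, one sub-category, one before-and-after comparison: enough to survive a budget conversation.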
The result
Most of the rebuild went into questions the pilot never asked. Who owns this when something goes wrong? How does anyone know what the agent did yesterday? What does good review look like when eight people are doing it? What happens when the data isn’t clean?
Those aren’t agent questions but operational ones – most pilots end just before production starts exposing them.