Quality Was Always One Job
How building a quality chatbot taught me that discovery and testing were never separate disciplines
When I built the handbook chatbot, my hypothesis was that it would drive book sales. Along with the chatbot, I built tests to make sure it worked.
Technical accuracy wasn't enough. The hypothesis failed. Book sales stayed flat. But I learned something. Users weren't asking factual questions about the book. They were asking things like: "A change I pushed for has stalled because nobody sees why it matters anymore. Help me work out what to say next."
So the hypothesis shifted. People weren't looking for what's in the book — they were looking for context-specific advice, the kind that draws on thirty years of work in the quality space. I changed what the chatbot was for. Instead of answering questions about the handbook, it would answer the questions I see my clients asking.
But what does good look like for a use case that's non-deterministic and heavily context-dependent? I took the real user interactions and built a set of golden answers against them. The initial responses were too vague — they could have been anyone's. So I added deterministic checks for specific elements of each response. Did it cite a section of the handbook? Did it stay in the voice of the book, or had it slipped into therapy-speak? Did it refuse cleanly when the question fell outside the handbook's scope? I kept iterating until those passed.
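To make that concrete, here is a minimal sketch of what deterministic checks like these might look like in Python. The article contains no code, so every name in it (the EvalCase shape, the THERAPY_SPEAK word list, the citation pattern, the refusal phrase) is my own illustration under assumptions, not the chatbot's actual implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str       # a real user interaction
    golden_answer: str  # reference response written against it
    in_scope: bool      # should the handbook have an answer at all?

# Illustrative only: phrases that would signal the response has drifted
# out of the book's voice. A real list is tuned against actual failures.
THERAPY_SPEAK = {"your journey", "holding space", "honour your feelings"}

# Illustrative pattern for "cites a section of the handbook".
SECTION_CITATION = re.compile(r"\b(section|chapter)\s+\d+", re.IGNORECASE)

def cites_handbook(response: str) -> bool:
    """Deterministic check: does the response point back to the handbook?"""
    return bool(SECTION_CITATION.search(response))

def stays_in_voice(response: str) -> bool:
    """Deterministic check: has the response slipped into therapy-speak?"""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in THERAPY_SPEAK)

def refuses_cleanly(response: str) -> bool:
    """Deterministic check: out-of-scope questions get an explicit refusal."""
    return "outside the scope of the handbook" in response.lower()

def evaluate(case: EvalCase, response: str) -> dict[str, bool]:
    """Run only the checks that apply to this case."""
    if not case.in_scope:
        return {"refuses_cleanly": refuses_cleanly(response)}
    return {
        "cites_handbook": cites_handbook(response),
        "stays_in_voice": stays_in_voice(response),
    }
```

The point of checks like these is not that string matching captures quality; it's that each check encodes one element of the golden answer, so a failing check names exactly which part of the hypothesis the response missed.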
Holding the hypothesis and the evaluation as one piece of work is what produced the second version. If I'd handed the testing to someone else, they'd have built something that passed. They wouldn't have built something that worked.
Only then was it ready for deployment and real customer feedback. The feedback will refine the hypothesis, or kill it. New hypotheses, new evaluations. So the cycle continues.
The split we agreed to pay for
Questions sit at the heart of quality: who matters, what matters, what's at risk. In discovery, we ask them as hypotheses. In testing, we answer them through evidence. Two sides of the same inquiry.
Somewhere along the way, we split them. Product owners took the hypothesis side — the "right product" question. Quality professionals took the evidence side — the empirical answer. The split made sense for managing labour in complex systems, and companies slice work according to their org structure anyway. It makes no sense in terms of delivering customer value, but it's a cost we've implicitly agreed to pay.
David Klahr's SDDS model — Scientific Discovery as Dual Search — describes the cognitive structure underneath all of this. You're searching two spaces at once: a hypothesis space and an experiment space. You form a partial hypothesis, design an experiment, observe the outcome, and update both spaces. A disconfirmed hypothesis doesn't just rule something out — it reshapes what you choose to test next. The two searches interact. Discovery is the interaction.
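As a loose illustration of that interaction (my own sketch, not Klahr's model or anything from his papers), the dual search might look like this, where the caller supplies run, consistent, and promise:

```python
from typing import Any, Callable

def dual_search(
    hypotheses: list,
    experiments: list,
    run: Callable[[Any], Any],              # experiment -> observed outcome
    consistent: Callable[[Any, Any], bool], # (hypothesis, outcome) -> bool
    promise: Callable[[Any, list], float],  # how informative is this test now?
) -> list:
    """Sketch of a dual-space search: each observation prunes the
    hypothesis space AND re-ranks the experiment space."""
    while len(hypotheses) > 1 and experiments:
        # Search the experiment space: pick the most promising test
        # given what we currently believe.
        experiments.sort(key=lambda e: promise(e, hypotheses), reverse=True)
        outcome = run(experiments.pop(0))
        # Search the hypothesis space: keep only what the outcome allows.
        hypotheses = [h for h in hypotheses if consistent(h, outcome)]
    return hypotheses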
Teresa Torres makes the same argument from the product side. In Continuous Discovery Habits, discovery and delivery aren't sequential phases — they run in parallel, continuously, and by the same team. Talking to customers weekly, testing assumptions, mapping the opportunity space. Ongoing, interleaved work.
The reason it collapsed into the silo model is that this kind of search is cognitively expensive and organisationally inconvenient. Most orgs have a definition of done and a release date, so the search space closes. The hypothesis becomes an assumption. The test becomes a formality. Quality becomes whatever shipped, and monitoring and alerting catches some of what slipped through.
Isolation and decoupling are great testing strategies. Lousy for delivering customer value.
You can't run it in separate lanes anymore
The person forming the hypothesis needs to understand what the evaluation can tell them. The person reading the evidence needs to understand what the user was actually trying to achieve. Split that work and you'll build a system that answers questions correctly and solves nothing.
AI gives us tools to manage this complexity that we didn't have before. Evaluation at scale. Synthesising signal from probabilistic outputs. Closing the loop between user intent and system behaviour faster than any human process ever could.
The temptation is to hand those tools to the existing roles and call it progress. Product owners get an AI that helps with prioritisation. Quality professionals get an AI that generates test cases. The silo survives, better tooled. Quality remains nobody's whole job — just everybody's half-answer.
The opportunity is to treat discovery and testing as what they always were: the same search, conducted together, in pursuit of customer value. AI makes that tractable at scale for the first time.
David Klahr, whose work on dual-space search shaped how I think about exploratory testing, passed away on 26 April 2026. Training I've built draws on his Big Trak X2 discovery research.
Thanks, Maria Kedemo and Isabel Evans, for feedback on this article.
Got a decision to make, a conversation to prep, or a move to work out?
Drop in the real situation. KYM reads it through the Handbook's frameworks and gives you a move you can try.
Open Know Your Move →