Last month, I wrote about what 2025 quietly revealed about computer vision and promised to share the framework behind it. This is the first of nine principles. Eight more to come.
Let's start with accuracy.
Beyond the Benchmark: Why Accuracy in AI Is Harder Than It Looks
What does it mean for an AI to be accurate? The answer is far less obvious than most assume—and the stakes for getting it wrong are rising.
The Numbers Game
We are living through a period where it's easier than ever to produce an AI model that achieves impressive benchmark scores but becomes a paperweight the moment it's deployed. Greg Mulholland, CEO of Citrine Informatics, puts it bluntly: "I can make that number whatever I want it to be. Any machine learning or data scientist knows how to build a system that can check a box. But it's not about checking a box. It's about getting to a predictive enough real-world outcome that it becomes a tool that's useful."
The problem runs deep. By collapsing complex biological or physical behaviors into a single percentage, we obscure the specific ways a model fails. Academic benchmarks contain a limited number of data points; real-world systems can generate billions per day. That scale mismatch alone can render laboratory performance meaningless.
The Ground Truth Problem
But scale is only part of the problem. Even when benchmarks are large enough, accuracy is only as good as the labels it's measured against. In many high-stakes fields, establishing what's "right" is far from straightforward.
Consider epilepsy monitoring: Dean Freestone, CEO of Seer, describes how their AI flagged events that human reviewers had originally missed. Upon re-examination, experts discovered thousands of true positives they had overlooked due to labeling fatigue. The AI wasn't wrong—the human gold standard was.
In medical imaging, ask three clinicians to annotate the same scan and you'll often get three different answers. The same expert may even provide different answers when reviewing the same image months later. A model can only be as accurate as the inconsistent human it's trying to mimic.
The most dangerous form of label noise occurs when the obvious label is actually wrong. As Nico Karssemeijer, Chief Science Officer of ScreenPoint Medical, explains, in cancer screening, a radiologist might label a scan as normal, but that's an opinion, not a biological fact. Truly accurate training requires biopsy-proven results and long-term patient follow-up—not initial reads that may have missed slow-growing tumors.
The Rare Event Trap
Even with perfect labels, however, another trap awaits. A model that's 99% accurate can still be 100% useless. This paradox arises from class imbalance: if a condition occurs in only 1% of cases, a model that simply predicts "nothing is wrong" every time will achieve a 99% accuracy score while failing completely at its actual purpose.
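If the arithmetic feels abstract, a few lines of purely illustrative Python make it concrete (synthetic data, no real model): at 1% prevalence, a classifier that never flags anything scores 99% accuracy and catches nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic screening population: the condition occurs in ~1% of cases.
y_true = (rng.random(100_000) < 0.01).astype(int)

# A "model" that always predicts "nothing is wrong".
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # fraction of true cases actually caught

print(f"accuracy: {accuracy:.3f}")  # ~0.99 -- looks impressive
print(f"recall:   {recall:.3f}")    # 0.000 -- misses every real case
```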
Erez Naaman, CTO of Scopio Labs, illustrates this in blood cell morphology: the vast majority of cells are common and healthy. But the most critical diagnostic information lies in extremely rare cell types that indicate life-threatening diseases. If an AI isn't specifically tuned to find these rare outliers, its high overall accuracy merely masks its failure to perform the one task that matters to the patient.
This challenge extends beyond medicine. Nikola Sivacki, Co-founder of Greyparrot, describes the problem in recycling facilities: a soda can on a store shelf looks identical to every other can. But by the time it reaches the sorting belt, it's crushed, dirty, and unique. A model that excels at identifying clean, pristine objects but fails on the mangled items in the long tail delivers no real-world value.
The Cost of Being Wrong
The rare event problem points to a deeper truth: not all errors are created equal. In the controlled environment of a data science lab, an error is just a statistical residue. In the real world, the cost of a wrong answer varies wildly depending on context.
In cancer screening, a false positive causes anxiety; a false negative means missed treatment. In surgical margin assessment, as Ersin Bayram, Director of AI at Perimeter Medical, notes, the priority is ensuring no cancer is left behind—even at the cost of removing slightly more healthy tissue. In steel production, Berk Birand, CEO of Fero Labs, points out that a single bad AI recommendation can waste hundreds of thousands of dollars in energy and materials.
There's also a trust asymmetry at play. As Dean Freestone of Seer observes, when a human makes an error, it's often viewed as an accident. When an AI makes a mistake, it's viewed as systematic bias, leading to rapid erosion of confidence. AI must often surpass human skill just to compensate for the higher trust tax it pays when it fails.
Redefining Impactful Accuracy
Given these challenges—misleading benchmarks, unreliable labels, rare events, and asymmetric costs—how should we think about accuracy? The answer is that impactful accuracy isn't a single number. It's a shape that must be molded to fit each use case.
For screening applications, high recall (sensitivity) matters most: the goal is to miss nothing. For diagnostic confirmation, high precision matters most: a positive call should be one you can act on. The right balance depends entirely on the consequences of each type of error.
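As a rough sketch of what molding that shape can look like in practice, here's an illustrative snippet using synthetic data and scikit-learn's metrics: the same model's scores, cut at different decision thresholds, trade recall against precision. The data and the specific threshold values are placeholders, not a recommendation.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# y_true: ground-truth labels; y_prob: a model's predicted probabilities.
# Both are synthetic stand-ins here -- substitute your own validation data.
rng = np.random.default_rng(1)
y_true = (rng.random(10_000) < 0.05).astype(int)
y_prob = np.clip(0.05 + 0.4 * y_true + rng.normal(0, 0.15, 10_000), 0, 1)

# Lower threshold -> screening behavior; higher threshold -> confirmation behavior.
for threshold in (0.1, 0.3, 0.5):
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```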
Crucially, impactful accuracy often emerges from human-AI collaboration. Daniella Gilboa, CEO of AIVF, finds that at fertility clinics using AI for embryo assessment, the highest success rates come not from the machine or human working alone, but from complete alignment between the embryologist and the algorithm.
Perhaps most importantly, a model's ability to quantify its own uncertainty is as valuable as its ability to provide correct answers. Mathieu Bauchy, CTO of Concrete.ai, uses an ensembling approach where multiple models vote on a prediction: unanimous agreement signals low uncertainty; disagreement signals when a human should take over. This kind of calibrated confidence ensures that recommendations carry appropriate weight.
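I don't know the details of Concrete.ai's implementation, but the general pattern is simple to sketch: train a small committee of different models, let them vote, and treat any split vote as a cue to hand the case to a person. Everything below, data included, is a generic stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generic stand-in data and models.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small committee of structurally different models.
committee = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]
for model in committee:
    model.fit(X_train, y_train)

# Each member votes; unanimity is treated as low uncertainty,
# disagreement as a signal to hand the case to a human reviewer.
votes = np.stack([m.predict(X_test) for m in committee])  # (n_models, n_samples)
unanimous = (votes == votes[0]).all(axis=0)

print(f"unanimous (act on the prediction): {unanimous.mean():.1%}")
print(f"split vote (defer to a human):     {(~unanimous).mean():.1%}")
```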
The Path Forward
These principles point toward a fundamental shift in how we must approach AI development.
Building AI that actually works requires accepting that accuracy is not a trophy to be won, but a responsibility to be maintained. It demands rigorous investment in diverse data, continuous post-deployment monitoring, and the humility to know when the model should abstain.
Impactful accuracy means a system produces the right answers, for the right people, in the right conditions, often enough to be trusted—and transparently enough to be challenged.
Benchmarks are not impact. It's time we measured what matters.
- Heather