First in a series of nine principles

Last month, I wrote about what 2025 quietly revealed about computer vision and promised to share the framework behind it. This is the first of nine principles. Eight more to come.


Let's start with accuracy.


Beyond the Benchmark: Why Accuracy in AI Is Harder Than It Looks


What does it mean for an AI to be accurate? The answer is far less obvious than most assume—and the stakes for getting it wrong are rising.


The Numbers Game


We are living through a period where it's easier than ever to produce an AI model that achieves impressive benchmark scores but becomes a paperweight the moment it's deployed. Greg Mulholland, CEO of Citrine Informatics, puts it bluntly: "I can make that number whatever I want it to be. Any machine learning or data scientist knows how to build a system that can check a box. But it's not about checking a box. It's about getting to a predictive enough real-world outcome that it becomes a tool that's useful."


The problem runs deep. By collapsing complex biological or physical behaviors into a single percentage, we obscure the specific ways a model fails. Academic benchmarks include a limited number of data points, while real-world systems can produce billions of data points daily. The scale mismatch alone can render laboratory performance meaningless.


The Ground Truth Problem


But scale is only part of the problem. Even when benchmarks are large enough, accuracy is only as good as the labels it's measured against. In many high-stakes fields, establishing what's "right" is far from straightforward.


Consider epilepsy monitoring: Dean Freestone, CEO of Seer, describes how their AI flagged events that human reviewers had originally missed. Upon re-examination, experts discovered thousands of true positives they had overlooked due to labeling fatigue. The AI wasn't wrong—the human gold standard was.


In medical imaging, ask three clinicians to annotate the same scan and you'll often get three different answers. The same expert may even provide different answers when reviewing the same image months later. A model can only be as accurate as the inconsistent human it's trying to mimic.
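
One way teams make that disagreement visible before training is an inter-rater agreement statistic. Here is a minimal Python sketch using Cohen's kappa; the labels are invented for illustration, not data from any study.

```python
# A minimal sketch of quantifying annotator disagreement with Cohen's kappa.
# The labels below are invented for illustration, not from any real study.
from sklearn.metrics import cohen_kappa_score

reader_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # first clinician's reads (1 = abnormal)
reader_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]  # second clinician, same ten scans

kappa = cohen_kappa_score(reader_a, reader_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance-level
```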


The most dangerous form of label noise occurs when the obvious label is actually wrong. As Nico Karssemeijer, Chief Science Officer of ScreenPoint Medical, explains, in cancer screening, a radiologist might label a scan as normal, but that's an opinion, not a biological fact. Truly accurate training requires biopsy-proven results and long-term patient follow-up—not initial reads that may have missed slow-growing tumors.


The Rare Event Trap


Even with perfect labels, however, another trap awaits. A model that's 99% accurate can still be 100% useless. This paradox arises from class imbalance: if a condition occurs in only 1% of cases, a model that simply predicts "nothing is wrong" every time will achieve a 99% accuracy score while failing completely at its actual purpose.
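
To make the arithmetic concrete, here is a minimal Python sketch of that trap; the 1% prevalence and sample size are illustrative assumptions, not figures from any of the companies quoted here.

```python
# A minimal sketch of the class-imbalance trap: a model that always predicts
# "nothing is wrong" scores ~99% accuracy yet finds zero positive cases.
# The 1% prevalence and sample size are illustrative assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.01).astype(int)  # ~1% of cases are positive
y_pred = np.zeros_like(y_true)                     # always predict "normal"

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.99
print(f"recall:   {recall_score(y_true, y_pred):.3f}")    # 0.00 -- every case missed
```

The headline accuracy looks excellent, yet recall, the metric that reflects the model's actual purpose, is zero.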


Erez Naaman, CTO of Scopio Labs, illustrates this in blood cell morphology: the vast majority of cells are common and healthy. But the most critical diagnostic information lies in extremely rare cell types that indicate life-threatening diseases. If an AI isn't specifically tuned to find these rare outliers, its high overall accuracy merely masks its failure to perform the one task that matters to the patient.


This challenge extends beyond medicine. Nikola Sivacki, Co-founder of Greyparrot, describes the problem in recycling facilities: a soda can on a store shelf looks identical to every other can. But by the time it reaches the sorting belt, it's crushed, dirty, and unique. A model that excels at identifying clean, pristine objects but fails on the mangled items in the long tail delivers no real-world value.


The Cost of Being Wrong


The rare event problem points to a deeper truth: not all errors are created equal. In the controlled environment of a data science lab, an error is just a statistical residue. In the real world, the cost of a wrong answer varies wildly depending on context.


In cancer screening, a false positive causes anxiety; a false negative means missed treatment. In surgical margin assessment, as Ersin Bayram, Director of AI at Perimeter Medical, notes, the priority is ensuring no cancer is left behind—even at the cost of removing slightly more healthy tissue. In steel production, Berk Birand, CEO of Fero Labs, points out that a single bad AI recommendation can waste hundreds of thousands of dollars in energy and materials.
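
One pragmatic response is to score a model against an explicit cost matrix rather than raw accuracy. The sketch below is a minimal illustration; the dollar figures are hypothetical placeholders, not estimates from any of the companies mentioned.

```python
# A minimal sketch of cost-weighted evaluation: each error type carries its
# own price. The dollar figures are hypothetical placeholders, not estimates
# from any company mentioned above.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # toy ground truth (1 = disease present)
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]   # toy model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
COST_FALSE_POSITIVE = 500      # e.g. an unnecessary follow-up exam
COST_FALSE_NEGATIVE = 50_000   # e.g. a missed or delayed treatment

total_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE
print(f"FP={fp}  FN={fn}  expected cost = ${total_cost:,}")
```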


There's also a trust asymmetry at play. As Dean Freestone of Seer observes, when a human makes an error, it's often viewed as an accident. When an AI makes a mistake, it's viewed as systematic bias, leading to rapid erosion of confidence. AI must often surpass human skill just to compensate for the higher trust tax it pays when it fails.


Redefining Impactful Accuracy


Given these challenges—misleading benchmarks, unreliable labels, rare events, and asymmetric costs—how should we think about accuracy? The answer is that impactful accuracy isn't a single number. It's a shape that must be molded to fit each use case.


For screening applications, high recall (sensitivity) ensures nothing is missed. For diagnostic confirmation, high precision ensures every action is based on truth. The right balance depends entirely on the consequences of each type of error.
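
In practice, that balance is usually set by choosing a decision threshold on the model's score. A minimal sketch with synthetic scores shows how the same model can be tuned toward screening (high recall) or confirmation (high precision):

```python
# A minimal sketch of the precision/recall trade-off: the same scores, three
# different decision thresholds. Scores and labels are synthetic.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
y_true = (rng.random(5_000) < 0.05).astype(int)                      # 5% prevalence
scores = np.clip(0.6 * y_true + rng.normal(0.2, 0.15, y_true.size), 0, 1)

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# A low threshold catches nearly everything (screening); a high threshold acts
# only on near-certainties (confirmation).
```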


Crucially, impactful accuracy often emerges from human-AI collaboration. Daniella Gilboa, CEO of AIVF, finds that at fertility clinics using AI for embryo assessment, the highest success rates come not from the machine or human working alone, but from complete alignment between the embryologist and the algorithm.


Perhaps most importantly, a model's ability to quantify its own uncertainty is as valuable as its ability to provide correct answers. Mathieu Bauchy, CTO of Concrete.ai, uses an ensembling approach where multiple models vote on a prediction: unanimous agreement signals low uncertainty; disagreement signals when a human should take over. This kind of calibrated confidence ensures that recommendations carry appropriate weight.
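
Here is a minimal sketch of the voting idea, using generic scikit-learn classifiers rather than Concrete.ai's actual pipeline: several models predict the same cases, and anything short of unanimity is routed to a human.

```python
# A minimal sketch of ensemble-based uncertainty, not any vendor's actual
# pipeline: several classifiers vote, and disagreement flags cases to defer.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, y_train, X_new = X[:1_500], y[:1_500], X[1_500:]

models = [
    LogisticRegression(max_iter=1_000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    DecisionTreeClassifier(max_depth=5, random_state=0),
]
votes = np.stack([m.fit(X_train, y_train).predict(X_new) for m in models])

unanimous = (votes == votes[0]).all(axis=0)   # all models agree -> low uncertainty
print(f"{unanimous.mean():.0%} unanimous; route the remaining cases to a human")
```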


The Path Forward


These principles point toward a fundamental shift in how we must approach AI development.

Building AI that actually works requires accepting that accuracy is not a trophy to be won, but a responsibility to be maintained. It demands rigorous investment in diverse data, continuous post-deployment monitoring, and the humility to know when the model should abstain.


Impactful accuracy means a system produces the right answers, for the right people, in the right conditions, often enough to be trusted—and transparently enough to be challenged.


Benchmarks are not impact. It's time we measured what matters.


- Heather

Vision AI that bridges research and reality

— delivering where it matters


Research: Image Quality


Impact of Data Quality on Deep Learning Prediction of Spatial Transcriptomics from Histology Images


While deep learning continues to bridge the gap between histology and molecular biology, our models are only as good as the data that feeds them.

Spatial transcriptomics enables high-throughput mapping of gene expression across tissue sections, but the extreme costs associated with these technologies often limit their scale. To overcome this, researchers use deep learning to predict spatial gene expression directly from inexpensive H&E-stained images. While past efforts have focused heavily on perfecting model architectures, 𝘊𝘢𝘭𝘦𝘣 𝘏𝘢𝘭𝘭𝘪𝘯𝘢𝘯 𝘦𝘵 𝘢𝘭. recently investigated a more critical, often overlooked factor: the quality of the training data itself.

Key innovations and findings from the study include:
● 𝗧𝗲𝗰𝗵𝗻𝗼𝗹𝗼𝗴𝘆 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝗶𝗻𝗴: The research compared imaging-based (𝘟𝘦𝘯𝘪𝘶𝘮) and sequencing-based (𝘝𝘪𝘴𝘪𝘶𝘮) platforms, revealing that higher molecular data quality from 𝘟𝘦𝘯𝘪𝘶𝘮 leads to significantly better predictive performance.
● 𝗤𝘂𝗮𝗻𝘁𝗶𝗳𝗶𝗮𝗯𝗹𝗲 𝗚𝗮𝗶𝗻𝘀: Models trained on 𝘟𝘦𝘯𝘪𝘶𝘮 data demonstrated an approximate 38% increase in average prediction performance across genes compared to those trained on 𝘝𝘪𝘴𝘪𝘶𝘮.
● 𝗜𝗺𝗽𝗮𝗰𝘁 𝗼𝗳 𝗦𝗽𝗮𝗿𝘀𝗶𝘁𝘆 𝗮𝗻𝗱 𝗡𝗼𝗶𝘀𝗲: Through 𝘪𝘯 𝘴𝘪𝘭𝘪𝘤𝘰 ablation experiments, the authors showed that molecular data sparsity and noise—common in sequencing technologies—directly degrade predictive accuracy.
● 𝗜𝗺𝗮𝗴𝗲 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗠𝗮𝘁𝘁𝗲𝗿𝘀: Reduced image resolution not only lowers prediction performance but also obscures model interpretability, making it harder to identify cellular landmarks through tools like Grad-CAM.
● 𝗧𝗵𝗲 𝗜𝗺𝗽𝘂𝘁𝗮𝘁𝗶𝗼𝗻 𝗟𝗶𝗺𝗶𝘁: Attempts to "rescue" lower-quality data via gene expression imputation led to overfitting and a failure to generalize to new samples, underscoring that high-quality ground truth is essential.

Ultimately, the study underscores that improving data quality offers an orthogonal and necessary strategy to tuning model architecture for enhancing predictive modeling in spatial transcriptomics.

Insights: Foundation Models


How Foundation Models Might Reshape Pathology—But Only If They’re Tuned to Clinical Needs


A massive model trained on the wrong problem is just a very expensive mistake.

Foundation models are reshaping computer vision—but their value in pathology is still being proven. These models promise capabilities like few-shot learning, cross-domain generalization, and better reuse—but clinical value doesn’t follow automatically. Without tuning to real-world pathology tasks, even the largest models risk becoming academic showpieces rather than clinical solutions.

To move from impressive to impactful, foundation models must be:
- Pretrained on pathology-specific data
- Adapted to clinical tasks (e.g., whole-slide reasoning, biomarker detection)
- Validated across labs, scanners, and populations
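
As a concrete illustration of that last point, here is a minimal sketch of per-site validation: the same model is scored separately for each lab or scanner group instead of on one pooled test set. Site labels and data below are synthetic placeholders.

```python
# A minimal sketch of per-site validation: one model, separate scores for each
# lab/scanner group instead of a single pooled metric. Site labels and data
# are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
sites = np.array(["lab_A", "lab_B", "scanner_C"] * 300)
y_true = rng.integers(0, 2, size=sites.size)
y_score = np.clip(0.4 * y_true + 0.6 * rng.random(sites.size), 0, 1)

for site in np.unique(sites):
    mask = sites == site
    auc = roc_auc_score(y_true[mask], y_score[mask])
    print(f"{site}: AUC = {auc:.2f} on n = {mask.sum()} cases")
```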

Example: A large vision model trained on public datasets excelled at general classification—but when applied to rare sarcoma subtypes in a trial screening context, it misclassified candidates, risking their exclusion from potentially life-saving therapies. Only after tuning with expert-labeled local data did it start identifying rare morphologies accurately.

So what?
When clinical grounding is missing, foundation models don’t just underperform—they mislead teams into investing in tools that never translate. The cost isn’t just technical—it’s lost time, credibility, and patient opportunity. Ground them in real-world needs, or risk building tech that never leaves the lab.

Let’s design foundation models with the clinic in mind—not just the conference.

💬 What’s the most promising way foundation models could accelerate pathology?

Leave a comment

Research: Foundation Model for the Sun


Surya: Foundation Model for Heliophysics


After decades of observing the Sun, there is finally a model that can learn from all that data at once.

Space weather forecasting has relied on narrow, task-specific models trained from scratch for each application. Solar flare prediction uses one model. Magnetic field analysis uses another. Solar wind forecasting uses yet another model. Each requires extensive labeled data for rare events that can damage satellites, disrupt power grids, and endanger astronauts.

Sujit Roy et al. introduce Surya, a 366 million parameter foundation model trained on 10 years of continuous observations from NASA's Solar Dynamics Observatory. Unlike previous approaches working with downsampled data, Surya learns from full-resolution (4096×4096) multi-instrument observations: eight extreme ultraviolet channels plus five magnetic field and velocity products.

𝗞𝗲𝘆 𝗶𝗻𝗻𝗼𝘃𝗮𝘁𝗶𝗼𝗻𝘀:
- Uses time-advancement as the pretraining task: the model learns to forecast solar evolution 60 minutes ahead, then extends predictions through autoregressive rollout up to 12 hours (see the sketch after this list)
- Employs spectral gating combined with long-short range attention to capture both local plasma dynamics and global magnetic field structures
- Demonstrates zero-shot capabilities for forecasting solar dynamics without task-specific training
- Fine-tunes efficiently with parameter-efficient LoRA for downstream applications: flare forecasting, active region segmentation, solar wind prediction, and spectral modeling
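
For readers unfamiliar with autoregressive rollout, here is a minimal, purely illustrative Python sketch; the step_60min function is a toy stand-in, not Surya's model.

```python
# A minimal, purely illustrative sketch of autoregressive rollout, not Surya's
# actual code: a model trained to step the system 60 minutes forward is applied
# repeatedly, feeding each prediction back in as the next input.
import numpy as np

def step_60min(state: np.ndarray) -> np.ndarray:
    """Toy stand-in for the trained one-step forecaster."""
    return 0.99 * np.roll(state, shift=1)

state = np.random.default_rng(0).random(8)    # stand-in for the observation stack
for step in range(12):                        # 12 x 60 min = 12-hour horizon
    state = step_60min(state)                 # prediction becomes the next input
    print(f"t+{(step + 1) * 60:4d} min  mean activity = {state.mean():.3f}")
```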

The approach mirrors successful strategies in Earth observation AI: build once from massive multi-modal datasets, then adapt to specific applications. The model is open sourced on Hugging Face, enabling researchers worldwide to build specialized space weather tools without starting from scratch.

Worth noting: this doesn't solve space weather prediction, but it does establish a new baseline for what's achievable when foundation models meet domain-specific physics.


Blog

Model

Enjoy this newsletter? Here are more things you might find helpful:


Pixel Clarity Call - A free 30-minute conversation to cut through the noise and see where your vision AI project really stands. We’ll pinpoint vulnerabilities, clarify your biggest challenges, and decide if an assessment or diagnostic could save you time, money, and credibility.

Book now
