Where they help, where planning matters, and how to set realistic expectations.

Foundation Models: Promise, Pitfalls, and the Last-Mile Reality


Foundation models have become the default answer to almost every domain-specific AI problem. If you’re working in pathology, agriculture, earth observation, or any other data-rich scientific field, you’ve likely heard some version of the same pitch: pretrain once, adapt everywhere. The promise is compelling — faster development, fewer labels, and a shortcut around years of bespoke model building.


But after working with teams trying to deploy these models in real systems, a clearer picture emerges. Foundation models are powerful tools. They are also frequently misunderstood. And the gap between what they enable and what they replace is where many AI initiatives quietly stall.


The real question isn’t whether foundation models work. It’s where they work — and where their limitations surface.


The real promise


In visual domains, foundation models deliver genuine value in a few specific places.


They are exceptionally good at learning rich representations from large, messy, weakly labeled datasets. This matters in domains where annotation is expensive, slow, or dependent on scarce expert time. Pretrained models can dramatically reduce the friction of early experimentation, allowing teams to prototype downstream tasks faster than traditional approaches.


Foundation models also lower the barrier to entry. Smaller teams can now start from a strong baseline instead of training everything from scratch. In some cases, a well-chosen pretrained model can outperform years of hand-engineered pipelines.


These are real advances. Ignoring them would be shortsighted.


Where the cracks appear


The problems begin when foundation models are treated as solutions rather than components.


Most foundation models are trained on data that is convenient, not representative. In applied domains, that distinction matters more than architecture choice or parameter count. The moment a model encounters new scanners, sensors, geographies, protocols, or patient populations, performance can degrade in subtle but operationally meaningful ways.


Another common assumption is that scale substitutes for domain modeling. Larger models often mask data issues instead of resolving them. They can achieve impressive aggregate metrics while failing silently on long-tail cases — precisely the cases that matter most in safety-critical or high-stakes settings.


Then there is the cost of adaptation. While fine-tuning is often presented as relatively lightweight, in practice, it usually requires more planning than expected. Teams need to account for task-specific labeling, thoughtful validation design, ongoing monitoring, and integration into real workflows. What looks like a shortcut early on can introduce additional work later if these pieces aren’t anticipated upfront.


None of these are accidental failures. They are structural mismatches between how foundation models are marketed and how domain systems actually operate.


A reframing that helps


One mental shift makes these tradeoffs clearer: foundation models are not products. They are high-leverage components in a larger system that includes data, validation, and deployment decisions.


Impact depends far less on the model itself than on everything around it — task definition, evaluation strategy, data governance, and human-in-the-loop processes. A strong foundation model cannot compensate for unclear success criteria or poorly designed validation. In fact, it can delay those conversations by creating a false sense of progress.


Robustness is not an emergent property. It must be designed, tested, and maintained. Foundation models don’t remove that responsibility. They often make it more visible.


How leaders should evaluate foundation models


For decision-makers, the most important questions are not about model size or novelty. They are about behavior under constraint.


Where does this model reduce effort, and where does it introduce hidden costs? How does performance change under distribution shift? How much labeling is still required to reach acceptable reliability? What simpler baseline could achieve similar results with less complexity?
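To make the distribution-shift question concrete, here is a minimal sketch (Python, with hypothetical column names) of reporting the same metric per acquisition source rather than only in aggregate; the aggregate number is exactly where subgroup failures hide.

```python
# Minimal sketch: report a metric per acquisition source (site, scanner,
# sensor, ...) instead of only in aggregate. Column names are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

def stratified_auc(df: pd.DataFrame, group_col: str = "site") -> pd.Series:
    """AUC computed separately for each value of group_col."""
    return df.groupby(group_col).apply(
        lambda g: roc_auc_score(g["label"], g["score"])
    )

# Example usage with one row per case:
# df = pd.read_csv("predictions.csv")  # columns: site, label, score
# print("Aggregate AUC:", roc_auc_score(df["label"], df["score"]))
# print(stratified_auc(df))            # per-site AUC surfaces the shift
```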


These questions don’t diminish the value of foundation models. They place them in context.


The bottom line


The promise of foundation models is not that they eliminate domain work. It’s that they accelerate the parts that were previously slow, while making the remaining challenges harder to ignore.


In applied AI, especially in scientific and medical domains, the last mile has not disappeared. It has simply moved.


Teams that succeed will be the ones that treat foundation models as starting points, not finish lines — and invest accordingly.


For teams navigating foundation models in applied settings, a short planning conversation early can prevent surprises later. My Pixel Clarity Call is designed to help you assess readiness, risks, and realistic paths to impact. Details here.

- Heather

Vision AI that bridges research and reality — delivering where it matters


Research: Foundation Model Robustness


Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss


The same tissue, scanned by different machines, produces different model predictions. This is a deployment blocker.

Digital pathology foundation models promise to revolutionize cancer diagnosis by learning from millions of whole slide images. Models like UNI, Virchow, and H-optimus-0 have demonstrated impressive performance on benchmark tasks. But there's a gap between benchmark performance and clinical reliability: these models learn scanner artifacts alongside biological features, causing predictions to vary based on which scanner acquired the image rather than the actual tissue pathology.

This scanner bias undermines clinician trust and prevents deployment across institutions with different scanning equipment. It's a classic example of what happens when models optimize for correlation rather than causation—they latch onto technical artifacts that happen to be consistent within training datasets but break when encountering new scanning hardware.

Gianluca Carloni et al. benchmarked five state-of-the-art pathology foundation models on multi-scanner data and demonstrated that all models suffer from scanner bias, with tissue clustering by scanner rather than specimen in embedding space.

𝗞𝗲𝘆 𝗶𝗻𝗻𝗼𝘃𝗮𝘁𝗶𝗼𝗻𝘀:
- Multi-scanner benchmark dataset with paired whole slide images
- Novel evaluation metrics: Coefficient of Variation (CoV) across scanners and magnifications to quantify generalization stability (sketched after this list)
- ScanGen: A contrastive loss function applied during fine-tuning that pulls together representations of the same specimen from different scanners while pushing apart different specimens from the same scanner
- Evaluation on EGFR mutation prediction in lung cancer
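As a rough illustration of the CoV idea (my reading of the metric, not the authors' exact implementation): compute the same performance score under each condition, then divide the standard deviation by the mean; lower values indicate more stable cross-scanner generalization.

```python
# Sketch: coefficient of variation of a performance score across conditions
# (e.g., per-scanner AUCs). Lower CoV = more stable generalization.
# Illustrative reading of the metric, not the paper's exact code.
import numpy as np

def coefficient_of_variation(scores: list[float]) -> float:
    scores = np.asarray(scores, dtype=float)
    return float(scores.std() / scores.mean())

# Hypothetical per-scanner AUCs for one model on the same task:
print(f"CoV across scanners: {coefficient_of_variation([0.91, 0.84, 0.88, 0.79]):.3f}")
```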

𝗧𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁𝘀:
UMAP visualizations reveal that foundation model embeddings cluster by scanner manufacturer rather than by biological specimen. Even UNI, which appeared visually unaffected in 2D projections, showed substantial quantitative scanner sensitivity and benefited significantly from ScanGen.

ScanGen enhanced cross-scanner generalization while maintaining or improving EGFR mutation prediction performance. The approach works by creating a scanner-invariant embedding space before the Multiple Instance Learning aggregator—it's applied during task-specific fine-tuning rather than requiring foundation model retraining.
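Based on that description, here is a simplified sketch of how such a scanner-aware contrastive term could be wired up (illustrative only, not the authors' code): embeddings of the same specimen from different scanners are treated as positives, and embeddings of different specimens from the same scanner as negatives.

```python
# Simplified scanner-aware contrastive loss, following the description above.
import torch
import torch.nn.functional as F

def scanner_contrastive_loss(emb, specimen_ids, scanner_ids, temperature=0.1):
    """emb: (N, D) tile embeddings; specimen_ids, scanner_ids: (N,) int tensors."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.T / temperature                  # pairwise cosine similarity
    same_specimen = specimen_ids[:, None] == specimen_ids[None, :]
    same_scanner = scanner_ids[:, None] == scanner_ids[None, :]

    pos = same_specimen & ~same_scanner              # same specimen, different scanner
    neg = ~same_specimen & same_scanner              # different specimen, same scanner

    losses = []
    for i in range(len(emb)):
        if pos[i].any() and neg[i].any():
            logits = torch.cat([sim[i][pos[i]], sim[i][neg[i]]])
            n_pos = int(pos[i].sum())
            log_prob = F.log_softmax(logits, dim=0)
            losses.append(-log_prob[:n_pos].mean())  # positives should dominate
    return torch.stack(losses).mean() if losses else sim.sum() * 0.0
```

In practice a term like this would be added to the task loss on tile embeddings during fine-tuning, upstream of the MIL aggregator, which is where the summary above says ScanGen is applied.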

This addresses a fundamental tension in computational pathology: foundation models trained on diverse data should theoretically generalize better, but technical variation often dominates over biological signal. Without explicit mitigation during fine-tuning, these models remain fragile to deployment contexts that differ from training environments.

Insights: Clinical Use Cases


Narrow Use Cases Can Have Big Impact


Your first clinical win won’t come from a generalist—it’ll come from a tool with a purpose.

𝐒𝐩𝐞𝐜𝐢𝐚𝐥𝐢𝐳𝐞𝐝 𝐭𝐨𝐨𝐥𝐬 𝐝𝐞𝐥𝐢𝐯𝐞𝐫 𝐜𝐥𝐢𝐧𝐢𝐜𝐚𝐥 𝐯𝐚𝐥𝐮𝐞 𝐬𝐨𝐨𝐧𝐞𝐫—𝐚𝐧𝐝 𝐰𝐢𝐧 𝐭𝐫𝐮𝐬𝐭 𝐟𝐚𝐬𝐭𝐞𝐫.
Everyone wants to build the next all-in-one diagnostic model. But in real labs, the tools that succeed often do just one thing—and do it well.

Specialized models—like mitotic figure detection, tissue fold detection, or stain quality control—address real pain points.
They’re easier to validate, easier to integrate, and faster to deploy. And they’re less likely to raise red flags with regulators or ethics boards.

📍 Example: A stain quality model flagged suboptimal slides before they reached the pathologist—saving time and reducing repeat scans. That’s not just convenience—it’s efficiency gains for lab techs, fewer delays for clinicians, and cleaner data for downstream models.

𝐒𝐦𝐚𝐥𝐥 𝐰𝐢𝐧𝐬 𝐛𝐮𝐢𝐥𝐝 𝐛𝐢𝐠 𝐭𝐫𝐮𝐬𝐭—𝐚𝐧𝐝 𝐜𝐥𝐞𝐚𝐫 𝐫𝐞𝐭𝐮𝐫𝐧 𝐨𝐧 𝐢𝐧𝐯𝐞𝐬𝐭𝐦𝐞𝐧𝐭.
And they pave the way for broader AI adoption down the line.

𝐒𝐨 𝐰𝐡𝐚𝐭?
Narrow use cases reduce risk, prove value, and keep your AI from getting stuck in pilot purgatory.

📊 It doesn’t have to be groundbreaking to be useful.

💬 What overlooked use case could unlock early value in your clinical workflow?


Research: Benchmark for Mars


Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks


Mars science has a data problem—not too little data, but no standardized way to evaluate AI models on it.

While Earth observation has accelerated through benchmarks like GeoBench and a proliferation of foundation models, planetary science has been building task-specific models from scratch for each application. This fragmentation has prevented the field from adopting the foundation model paradigm that's transformed medical imaging, astronomy, and Earth monitoring.

Mirali Purohit et al. introduce Mars-Bench, the first comprehensive benchmark for evaluating foundation models on Mars science tasks. The benchmark unifies 20 datasets spanning classification, segmentation, and object detection across both orbital and surface imagery from Mars Reconnaissance Orbiter, Mars Odyssey, and three rover missions (Curiosity, Opportunity, Spirit).

𝗞𝗲𝘆 𝗶𝗻𝗻𝗼𝘃𝗮𝘁𝗶𝗼𝗻𝘀:
- Standardized format with consistent train/val/test splits, unified data loaders, and multiple annotation formats (a hypothetical sketch follows this list)
- Scientifically relevant tasks developed with planetary scientists: craters, volcanic cones, boulders, landslides, dust devils, frost, atmospheric dust
- Few-shot and partitioned variants to evaluate data-limited scenarios and cross-sensor generalization
- Baseline evaluations across ImageNet models, Earth observation foundation models, and vision-language models
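To make the standardized-split and few-shot idea concrete, here is a hypothetical sketch; `load_task` and the sample layout are placeholders I am assuming for illustration, not Mars-Bench's actual API.

```python
# Hypothetical illustration of deterministic few-shot subsetting from a fixed
# split, so every model sees identical data. Placeholder names, not the real API.
import random

def few_shot_subset(samples, k_per_class, seed=0):
    """Pick up to k labeled examples per class, reproducibly."""
    rng = random.Random(seed)
    by_class = {}
    for image_path, label in samples:
        by_class.setdefault(label, []).append((image_path, label))
    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        subset.extend(rng.sample(items, min(k_per_class, len(items))))
    return subset

# train = load_task("cone_detection", split="train")  # placeholder loader
# train_16shot = few_shot_subset(train, k_per_class=16)
```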

𝗧𝗵𝗲 𝘀𝘂𝗿𝗽𝗿𝗶𝘀𝗶𝗻𝗴 𝗿𝗲𝘀𝘂𝗹𝘁:
Earth observation foundation models (SatMAE, CROMA, Prithvi) underperformed compared to ImageNet-pretrained ViT despite being trained on satellite imagery. The domain gap between Earth and Mars is substantial—Mars lacks vegetation, water bodies, and human infrastructure while exhibiting unique geology, atmospheric conditions, and color palettes. Even the overhead perspective isn't enough to bridge this gap.

ImageNet's advantage likely stems from its larger scale and greater diversity. Vision-language models performed reasonably on common terrain classes but struggled with fine-grained geological features requiring domain expertise.

𝗣𝗲𝗿𝘀𝗼𝗻𝗮𝗹 𝗻𝗼𝘁𝗲: This takes me back to my own work on Mars computer vision at CMU and ASU (2005-2007), where I developed rock detection methods from rover imagery and both rock and crater detection from satellite imagery. Back then, we hand-engineered features and built custom pipelines for each task. Mars-Bench is exactly the kind of standardized infrastructure the field needs in the deep learning era.

The results make a clear case for Mars-specific foundation models rather than assuming Earth observation models will transfer. For challenging tasks like dust devil detection (subtle visual features, low object density, grayscale imagery), even state-of-the-art architectures struggle without domain-adapted pretraining.

Insights: ROI


What ROI Actually Looks Like in Clinical Pathology AI


📉 𝐀 𝐠𝐫𝐞𝐚𝐭 𝐑𝐎𝐂 𝐜𝐮𝐫𝐯𝐞 𝐰𝐨𝐧’𝐭 𝐬𝐚𝐯𝐞 𝐚 𝐡𝐨𝐬𝐩𝐢𝐭𝐚𝐥 𝐭𝐢𝐦𝐞, 𝐦𝐨𝐧𝐞𝐲, 𝐨𝐫 𝐥𝐢𝐯𝐞𝐬.

𝐇𝐨𝐬𝐩𝐢𝐭𝐚𝐥𝐬 𝐝𝐨𝐧’𝐭 𝐛𝐮𝐲 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲—𝐭𝐡𝐞𝐲 𝐛𝐮𝐲 𝐨𝐮𝐭𝐜𝐨𝐦𝐞𝐬.

𝐒𝐚𝐯𝐢𝐧𝐠 𝐓𝐢𝐦𝐞, 𝐂𝐚𝐭𝐜𝐡𝐢𝐧𝐠 𝐌𝐨𝐫𝐞, 𝐨𝐫 𝐑𝐞𝐝𝐮𝐜𝐢𝐧𝐠 𝐄𝐫𝐫𝐨𝐫?

AUC, F1, Dice score—they look great in a paper. But they rarely sway a CMO or lab director.

If you’re building AI for pathology, your job isn’t just to improve model metrics. It’s to deliver something a clinical team would fight to keep.

Real ROI comes from:
📉 Faster turnaround times
👩‍⚕️ Reduced pathologist workload
🚫 Fewer diagnostic errors
🧪 Better trial designs through smarter patient selection
💸 Savings on repeat testing, staffing, and operational inefficiencies

Example: Imagine a triage tool that prioritizes potentially abnormal cervical biopsies to the top of the review queue. It doesn’t change the total workload—but it ensures that high-risk cases are seen sooner. The result? Earlier intervention, reduced risk of delays, and stronger clinical confidence. That’s not just operational—it's a patient care upgrade.
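A minimal sketch of that triage idea, assuming the model emits a per-case abnormality score (field names and numbers below are illustrative): the queue is reordered, not filtered, so nothing is removed from review.

```python
# Sketch of score-based triage: reorder the review queue so likely-abnormal
# cases surface first. Total workload is unchanged; nothing is filtered out.
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    abnormality_score: float  # model output in [0, 1]; illustrative

def triage_queue(cases: list[Case]) -> list[Case]:
    """Highest-risk cases first; ties keep their original (FIFO) order."""
    return sorted(cases, key=lambda c: c.abnormality_score, reverse=True)

queue = [Case("A-101", 0.12), Case("A-102", 0.87), Case("A-103", 0.55)]
for case in triage_queue(queue):
    print(case.case_id, case.abnormality_score)
```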

𝐈𝐟 𝐲𝐨𝐮 𝐜𝐚𝐧’𝐭 𝐬𝐡𝐨𝐰 𝐦𝐞𝐚𝐬𝐮𝐫𝐚𝐛𝐥𝐞 𝐢𝐦𝐩𝐚𝐜𝐭, 𝐲𝐨𝐮𝐫 𝐀𝐈 𝐰𝐢𝐥𝐥 𝐬𝐭𝐚𝐥𝐥 𝐢𝐧 𝐩𝐫𝐨𝐜𝐮𝐫𝐞𝐦𝐞𝐧𝐭.

𝐒𝐨 𝐰𝐡𝐚𝐭? These aren’t just clinical wins—they’re procurement green lights. Budget-holders need clear, measurable value to justify adoption.

📊 The best AI in the lab is the one that actually gets used—because adoption is what turns technical performance into tangible ROI.

💬 What kind of value would make a pathology team embrace AI?


Enjoy this newsletter? Here are more things you might find helpful:


Pixel Clarity Call - A free 30-minute conversation to cut through the noise and see where your vision AI project really stands. We’ll pinpoint vulnerabilities, clarify your biggest challenges, and decide if an assessment or diagnostic could save you time, money, and credibility.

Book now
