If your AI doesn't change a decision, it hasn't delivered results.

Actionable AI: Closing the Gap Between Insight and Impact


An accurate prediction that sits in a report no one reads has zero impact. This is a central problem with how many AI systems are evaluated and deployed: we optimize for metrics, then wonder why outcomes don't follow.


Steve Brumby of Impact Observatory put it plainly: a machine learning scientist's work isn't done when they produce a fantastic algorithm. It's done only when the output ties into a customer's workflow to answer a direct question. Decision-makers don't want pixels or maps; they want specific numbers that answer specific questions. Indra den Bakker of Overstory echoed this: you can build a beautiful, highly accurate map of every tree species in the world, but if that data isn't actionable for customers trying to prevent a wildfire or a power outage, it's of no use.


Start from the Decision, Not the Model


Impactful AI is designed backward from the decision it intends to influence. Emi Gal of Ezra describes always starting from the ultimate goal — say, reducing report turnaround from 19 minutes to 15 — rather than starting with a new architecture and working forward. Erez Naaman of Scopio Labs frames it similarly: "Machine learning is a tool and not a goal. We always start with the patient in mind."


This backward design forces a critical distinction: model performance (AUROC, F1) versus operational impact (lives saved, costs avoided). Gavin McCormick of WattTime provides a sharp example. His team found that a lower-accuracy model was actually better at rank-ordering emissions timing — the true driver of environmental impact. They abandoned accuracy as a training objective entirely in favor of a metric that directly simulated emissions reductions. Amanda Marrs of AMP Robotics takes the same approach commercially: her team tracks precision and recall internally but translates those to dollars per ton and material purity for customers, because those are the numbers that actually move decisions.
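
To make that distinction concrete, here is a minimal, hypothetical sketch of how a decision-level metric can diverge from pointwise error. The data, the scale, and the metric itself are invented for illustration; this is not WattTime's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented ground truth: hourly marginal grid emissions for one week.
true_emissions = rng.uniform(0.2, 0.9, size=168)   # kg CO2 per kWh

def emissions_avoided(pred, true, k=24):
    """Decision-level metric: shift k hours of flexible load into the
    k hours the model *predicts* are cleanest, then score the actual
    emissions of those hours against a random-hour baseline."""
    chosen = np.argsort(pred)[:k]
    baseline = true.mean() * k            # expected emissions of k random hours
    return baseline - true[chosen].sum()  # kg CO2 avoided per kWh shifted

# Model A: small pointwise error, but its ranking of hours is pure noise.
pred_a = true_emissions.mean() + 0.05 * rng.standard_normal(168)
# Model B: biased and high-error, but it preserves the rank ordering.
pred_b = 2.0 * true_emissions + 0.1 * rng.standard_normal(168)

for name, pred in [("A", pred_a), ("B", pred_b)]:
    mae = np.abs(pred - true_emissions).mean()
    print(f"Model {name}: MAE={mae:.2f}, "
          f"emissions avoided={emissions_avoided(pred, true_emissions):.1f}")
```

Model B loses badly on MAE yet wins decisively on emissions avoided, which is precisely the gap between model performance and operational impact.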


Three Barriers Between Insight and Action


Even technically strong AI systems routinely fail to drive action. The culprits fall into three categories.


Workflow integration. AI that lives in a separate tab won't get used. Coleman Stavish of Proscia observed that despite excellent published research on AI in pathology, labs weren't using it — because the technology wasn't introduced into the workflow correctly. Pathology labs are tightly optimized environments; if a tool doesn't fit how things are currently done, it won't be adopted regardless of accuracy. David Golan of Viz.ai recognized that neurosurgeons making stroke decisions aren't sitting at workstations — they're at the grocery store or driving. The interface had to be a mobile app, not a radiology system add-on.


Timeliness. An insight delivered too late is irrelevant. Gershom Kutliroff of Taranis points out that if a crop disease alert takes days to generate, the farmer has already missed the treatment window. Shahab Bahrami of SenseNet measures wildfire detection time as a primary KPI; cutting detection from 45 minutes to under 3 minutes transforms the response from containment to suppression.


Cognitive load. Too many alerts cause fatigue, and fatigue causes abandonment. Harro Stokman of Kepler Vision argues that elderly care monitoring is only sustainable if the false alarm rate is extremely low — roughly one per room per three months. More than that and nurses stop responding. Dean Freestone of Seer addresses this on the data side: their epilepsy AI doesn't diagnose, it curates — filtering weeks of EEG data down to a highlight reel of relevant events so clinicians aren't drowning in raw signals.


The Spectrum: From Information to Automation


Not all actionable AI looks the same. The level of autonomy varies significantly by domain and stakes.


At one end, AI provides context — measurements, filtered data, visualizations — that empowers a human decision. At the next level, it generates specific recommendations: Nathan Fenner's team at Afresh doesn't just forecast grocery demand; they output a specific order quantity, optimizing for profit and waste simultaneously. Mathieu Bauchy of Concrete.ai flips the model entirely — instead of predicting how a concrete mix will perform, customers input their cost and carbon targets, and the model prescribes the optimal recipe.
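
To illustrate the shape of that prediction-to-decision step, here is a minimal newsvendor-style sketch: a classic textbook formulation, not Afresh's or Concrete.ai's actual systems, with all numbers invented:

```python
import numpy as np

def recommended_order(demand_samples, unit_profit, unit_waste_cost):
    """Newsvendor decision: order the demand quantile at the critical
    ratio cu / (cu + co), balancing lost profit from stockouts (cu)
    against the cost of spoiled inventory (co)."""
    critical_ratio = unit_profit / (unit_profit + unit_waste_cost)
    return np.quantile(demand_samples, critical_ratio)

# Illustrative forecast: posterior samples of tomorrow's demand (units).
samples = np.random.default_rng(1).normal(loc=120, scale=25, size=10_000)
print(round(recommended_order(samples, unit_profit=2.0, unit_waste_cost=0.5)))
```

The point is the output: a single number a buyer can act on rather than a forecast curve, with the profit-versus-waste trade-off baked into the critical ratio.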


Decision support keeps a human in the loop but deeply integrates AI into their existing process. Sean Cassidy of Lucem Health describes it as working in the background to surface patient risk without interrupting how clinicians already practice — no extra pop-ups, no extra clicks.


At the far end, full automation: John Bertrand of Digital Diagnostics built the first FDA-cleared fully autonomous diagnostic AI, outputting a diabetic retinopathy result with no physician in the loop, enabling point-of-care diagnosis without waiting for a specialist.


Measuring What Actually Matters


If accuracy alone isn't the yardstick, what do you measure? The answers from practitioners converge on outcomes.


Decision velocity: David Golan measures stroke treatment time savings (up to 90 minutes). Daniella Gilboa of AIVF tracks time to pregnancy, reducing average cycles from over 3 to 1.6. 


Resource efficiency: Todd Villines of Elucid measures the reduction in unnecessary invasive cardiac procedures — 50–70% fewer patients sent to the cath lab who don't need intervention.


Adoption rate: David Sontag of Layer Health uses nurse acceptance of AI predictions as a real-time model health monitor. A spike in rejections signals dataset shift before the metrics catch it.
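
As an illustration, here is a minimal sketch of such a monitor, assuming a known historical acceptance rate. This is not Layer Health's implementation; the names and thresholds are invented:

```python
from collections import deque

class AcceptanceMonitor:
    """Treats user accept/reject feedback as a live model-health signal:
    a rolling acceptance rate that drops well below its historical
    baseline flags a possible dataset shift for investigation."""

    def __init__(self, baseline_rate, window=200, tolerance=3.0):
        self.baseline = baseline_rate       # historical acceptance rate
        self.window = deque(maxlen=window)  # most recent accept/reject events
        self.tolerance = tolerance          # alert threshold, in std errors

    def record(self, accepted: bool) -> bool:
        """Log one accept/reject event; return True if an alert fires."""
        self.window.append(1.0 if accepted else 0.0)
        n = len(self.window)
        if n < self.window.maxlen:
            return False                    # wait for a full window
        rate = sum(self.window) / n
        # Standard error of a proportion under the baseline rate.
        se = (self.baseline * (1 - self.baseline) / n) ** 0.5
        return rate < self.baseline - self.tolerance * se

# Usage: feed each accept/reject event as it happens.
monitor = AcceptanceMonitor(baseline_rate=0.90)
# if monitor.record(accepted=False): trigger a model review
```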


And perhaps the most honest measure of all: abandonment. Dean Freestone watched his own algorithms get quietly set aside the moment he left the room, because users didn't trust them. Actionability isn't declared; it's revealed by whether people keep using the tool.


The Core Principle


A model without a downstream action is not a neutral artifact; it is an active failure, because it consumed resources and created the illusion of progress without delivering any.


The implication for practitioners is concrete: validate not just that your model performs well, but that someone can do something different — and better — because of its output. As Matt Pipke of physIQ puts it, validation must include understanding what action someone could take based on the information. Without defining that action upfront, you can't properly evaluate whether the system works at all.


Actionability proves the value. But you can't deliver impact with a tool no one is using.


- Heather

Vision AI that bridges research and reality — delivering where it matters


Research: Confounders


Confounding factors and biases abound when predicting molecular biomarkers from histological images


Deep learning models are increasingly being trained to predict molecular biomarkers from H&E slides, offering a faster and cheaper alternative to molecular tests. But are these models seeing the biomarker, or are they exploiting correlational shortcuts?

A new study by Muhammad Dawood et al. reveals a critical vulnerability in current workflows: models frequently fail to isolate the effect of a single biomarker, instead learning confounded signals driven by clinical variables like histological grade or Tumor Mutational Burden (TMB).

This finding resonates with my own PhD research. In a 2018 study, we encountered this exact problem with grade. Those were the very early days of predicting molecular biomarkers from H&E, so I'm excited to see the depth with which Dawood et al. have now explored this topic.

By analyzing roughly 8,000 patients across multiple cancer types and testing several model architectures, they show that this is a systemic failure across the field. Here is what they revealed about the state of whole-slide image (WSI)-based prediction:

• 𝙏𝙝𝙚 𝙁𝙖𝙞𝙡𝙪𝙧𝙚 𝙤𝙛 𝘼𝙜𝙜𝙧𝙚𝙜𝙖𝙩𝙚 𝘼𝙐𝙍𝙊𝘾: Models that appear highly accurate overall often fail completely when evaluated within specific patient subgroups. For example, the authors demonstrated that breast cancer ER predictors boasting overall AUROCs of 0.87 to 0.90 saw their performance plummet to 0.76 on medium-grade cases. Similarly, colorectal BRAF mutation predictors dropped from 0.77 to 0.65 in low-TMB cases.

• 𝙏𝙝𝙚 𝘿𝙖𝙣𝙜𝙚𝙧 𝙤𝙛 𝘾𝙤𝙙𝙚𝙥𝙚𝙣𝙙𝙚𝙣𝙩 𝘽𝙞𝙤𝙢𝙖𝙧𝙠𝙚𝙧𝙨: Biomarkers often co-occur. In colorectal cancer, MSI-High status frequently co-occurs with BRAF mutations. They showed that models cannot effectively disentangle these two signals. This lack of isolation is clinically dangerous because these two markers require entirely different therapeutic pathways.

• 𝙏𝙝𝙚 𝙂𝙧𝙖𝙙𝙚 𝘽𝙖𝙨𝙚𝙡𝙞𝙣𝙚 𝙍𝙚𝙖𝙡𝙞𝙩𝙮 𝘾𝙝𝙚𝙘𝙠: Perhaps the most striking finding is that for some biomarkers, such as TP53 mutations, a simple classifier based purely on a pathologist's assigned tumor grade performs almost identically to H&E-based deep learning.

To expose these hidden biases during evaluation, the team used stratification-based permutation testing. True clinical utility requires models to learn causal, rather than correlational, relationships.
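
To illustrate both ideas, here is a rough sketch of subgroup-stratified AUROC plus one plausible form of a within-stratum permutation test; the paper's exact procedure may differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_aurocs(y, scores, strata):
    """AUROC overall and within each confounder stratum (e.g., grade)."""
    out = {"overall": roc_auc_score(y, scores)}
    for g in np.unique(strata):
        m = strata == g
        if len(np.unique(y[m])) == 2:      # both classes must be present
            out[str(g)] = roc_auc_score(y[m], scores[m])
    return out

def within_stratum_permutation_p(y, scores, strata, n_perm=1000, seed=0):
    """Null hypothesis: the model carries no signal beyond the stratum
    variable. Shuffling labels *within* strata preserves the stratum-label
    association while destroying any within-stratum signal, so a small
    p-value means the model sees more than the confounder."""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(y, scores)
    null = np.empty(n_perm)
    for i in range(n_perm):
        y_perm = y.copy()
        for g in np.unique(strata):
            m = strata == g
            y_perm[m] = rng.permutation(y[m])
        null[i] = roc_auc_score(y_perm, scores)
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```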

𝙏𝙝𝙚 𝙩𝙖𝙠𝙚𝙖𝙬𝙖𝙮: We cannot rely on aggregate AUROC to validate AI biomarker predictors. To build clinical-grade AI, we must disentangle the underlying biology from morphological shortcuts.

Research: Multimodal


From Vision Encoders to Clinical Reasoning: The Multimodal Shift


For the past few years, medical AI research has focused on scaling vision encoders to extract better features from pixels. But a clinician does more than look; they synthesize. They combine histology with genomics, endoscopy with pathology, and images with dialogue.

Four new papers demonstrate a move from unimodal vision to multimodal clinical intelligence. These models are no longer just classifying images; they are integrating diverse data streams to mimic diagnostic reasoning.

Here is how the latest research tackles this.

Pathology: Two new foundation models tackle pathology with opposing strategies. Eugene Vorontsov et al. introduce PRISM2, betting on massive scale (2.3M slides) and language alignment. By training on 14 million question-answer pairs, they utilize clinical dialogue supervision to teach the model diagnostic reasoning, achieving clinical-grade cancer detection zero-shot. In contrast, Yingxue Xu et al. introduce mSTAR, challenging the "scale is all you need" dogma. Using a dense dataset of tri-modal pairs (WSI + Reports + RNA-Seq), they employ a self-taught pretraining paradigm. This allows mSTAR to outperform larger vision-only models on molecular tasks with significantly less data, proving that multimodal density can be more efficient than brute-force scaling.

Dermatology: While pathology models focus on text and genes, Siyuan Yan et al. present PanDerm, a model that unifies the fragmented visual world of dermatology. It is trained on 2 million images across 4 distinct modalities: clinical photography, dermoscopy, total-body photography, and histopathology. The key innovation is versatility; reader studies showed PanDerm improved clinicians' accuracy in diagnosing 128 different skin conditions and outperformed experts in early melanoma detection via longitudinal monitoring.

Gastroenterology: Moving beyond static images, Marietta Iacucci et al. introduce the Endo-Histo fusion model. Developed using data from a mirikizumab clinical trial for ulcerative colitis, this framework fuses features from endoscopic video and histologic slides. This multimodal fusion significantly outperformed single-modality assessment for histologic remission and treatment response, offering a new standard for precision medicine in clinical trials.

Papers: PRISM2, mSTAR, PanDerm, Endo-Histo

Research: Bias


Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss


Foundation models promise massive generalization capabilities. But in computational pathology, a simple change in scanning hardware can still derail their predictions.

Real-world clinical AI must be robust across different commercially available scanners. If an AI predicts a mutation differently simply because the glass slide was digitized on scanner A instead of scanner B, clinicians cannot trust the tool in clinical practice.

Gianluca Carloni et al. recently presented a study confirming that, despite their size and training, foundation models still suffer from scanner bias. To solve this, they introduced a novel mitigation strategy.

Here are the key takeaways:

• 𝙏𝙝𝙚 𝘽𝙚𝙣𝙘𝙝𝙢𝙖𝙧𝙠: The authors benchmarked recent pathology foundation models on a multi-scanner dataset, demonstrating that their deep learning representations are still sensitive to irrelevant hardware-induced details.

• 𝙏𝙝𝙚 𝙎𝙘𝙖𝙣𝙂𝙚𝙣 𝙇𝙤𝙨𝙨: To actively fix this variability, they developed ScanGen, a novel contrastive loss function. Rather than pre-training a new model from scratch, ScanGen is applied during the task-specific fine-tuning phase to mitigate scanner bias (a rough sketch of the general idea follows this list).

• 𝘾𝙡𝙞𝙣𝙞𝙘𝙖𝙡 𝙑𝙖𝙡𝙞𝙙𝙖𝙩𝙞𝙤𝙣: The team tested this approach on the Multiple Instance Learning task of predicting EGFR mutations in lung cancer from H&E slides. ScanGen successfully enhanced the model's ability to generalize across different scanners while maintaining or even improving the baseline mutation prediction performance.
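
Here is that sketch: a generic contrastive penalty that treats the same tissue digitized on different scanners as positive pairs, so the encoder is pushed to discard scanner-specific signal. It is loosely inspired by ScanGen, not the paper's actual loss; the batch layout and names are assumptions:

```python
import torch
import torch.nn.functional as F

def scanner_invariance_loss(embeddings, temperature=0.1):
    """Contrastive penalty on a batch of 2n embeddings, assumed to hold
    the same n cases digitized on two scanners, aligned so that row i
    (scanner A) pairs with row i + n (scanner B). Pulling these positives
    together discourages scanner-specific features."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                 # pairwise cosine similarities
    n = z.shape[0] // 2
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    sim.fill_diagonal_(float("-inf"))           # never match an item to itself
    return F.cross_entropy(sim, targets)

# During fine-tuning, add it to the task objective, e.g.:
# total_loss = task_loss + lam * scanner_invariance_loss(embeddings)
```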

𝙏𝙝𝙚 𝙏𝙖𝙠𝙚𝙖𝙬𝙖𝙮: Currently, we cannot rely on general pre-training alone to solve domain generalization. Task-specific fine-tuning with targeted robustness constraints like ScanGen is an essential step for building truly clinical-grade AI.

Research: VLMs for Pathology


The Semantic Turn: Teaching Pathology Models to Speak


Current foundation models are excellent feature extractors, but they remain black boxes that process pixels, not diagnoses. A new wave of research is moving beyond raw image encoding, explicitly aligning Whole Slide Images (WSIs) with semantic knowledge—concepts, prompts, and language—to improve interpretability and data efficiency.

Three new papers illustrate this shift from "Pixel-Only" to "Concept-Aware" pathology:

Interpretability via "Concept Priors": While most models output a score, Saarthak Kapse et al. (GECKO) want the model to explain why. They introduce Gigapixel Vision-Concept Contrastive Pretraining. Instead of relying on scarce paired medical reports, they use LLMs to generate a lexicon of biological concepts (e.g., keratinization, glandular patterns). By employing a dual-branch MIL architecture, they align image features with a derived concept prior. The result is a model that offers zero-shot classification and pathologist-friendly interpretability, explicitly highlighting the biological features driving the diagnosis.
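
To make the concept-prior idea tangible, here is a toy sketch of concept-space zero-shot classification. The lexicon, prototypes, and scores are invented, and GECKO's dual-branch MIL training is far richer than this:

```python
import numpy as np

# Toy concept lexicon (in GECKO, an LLM generates this).
lexicon = ["keratinization", "glandular patterns",
           "tumor necrosis", "lymphocytic infiltrate"]

# Expected concept profile per diagnosis (illustrative values only).
prototypes = {
    "squamous cell carcinoma": np.array([1.0, 0.0, 0.5, 0.2]),
    "adenocarcinoma":          np.array([0.0, 1.0, 0.3, 0.2]),
}

def zero_shot_predict(concept_scores):
    """Classify a slide from its per-concept evidence scores by cosine
    similarity to each class prototype; no task-specific training, and
    the concept scores double as the explanation."""
    s = concept_scores / np.linalg.norm(concept_scores)
    sims = {label: float(s @ p / np.linalg.norm(p))
            for label, p in prototypes.items()}
    return max(sims, key=sims.get), sims

# A slide scoring high on keratinization:
print(zero_shot_predict(np.array([0.9, 0.1, 0.4, 0.2])))
```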

Few-Shot Learning via Multi-Granular Prompts: Anh-Tien Nguyen et al. (MGPATH) address the scarcity of annotated data. Extending the massive Prov-GigaPath model, they introduce a Vision-Language framework driven by Multi-Granular Prompt Learning. Unlike standard prompting, which often misses context, their method facilitates interactions between learnable prompts and image data at both the fine-grained (patch) and coarse-grained (group) levels. This allows the model to capture complex patterns with minimal examples, significantly outperforming competitors in few-shot lung and kidney tasks.

Pan-Cancer Unification via Text-Encoded Labels: Vishwesh Ramanathan et al. (ModalTune) tackle the fragmentation of clinical tasks. Rather than treating diagnostic labels as arbitrary numbers (0 vs 1), they use LLMs to encode labels as semantic text vectors. They introduce modal adapters to integrate these meaningful labels without modifying the frozen foundation model weights. This allows the model to understand the semantic relationship between diseases (e.g., Lung Adenocarcinoma vs. Squamous Cell), enabling state-of-the-art multi-task and pan-cancer performance.
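
A minimal sketch of the text-encoded-label idea, using an off-the-shelf sentence encoder (this assumes the sentence-transformers package; ModalTune's adapters and training recipe are omitted):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Encode diagnoses as semantic vectors instead of integer targets, so
# related diseases sit close together in the target space.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
labels = ["lung adenocarcinoma",
          "lung squamous cell carcinoma",
          "clear cell renal cell carcinoma"]
label_vecs = encoder.encode(labels, normalize_embeddings=True)

def predict(slide_embedding):
    """A slide model trained to regress onto its label's text vector
    predicts at inference by nearest label embedding (slide_embedding
    is assumed unit-norm)."""
    sims = label_vecs @ slide_embedding   # cosine similarity (unit vectors)
    return labels[int(np.argmax(sims))]
```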

The takeaway: these approaches share a common philosophy. Stop treating pathology images as abstract matrices and start treating them as a visual language, anchored by human-interpretable concepts.

Papers: GECKO, MGPATH, ModalTune

Enjoy this newsletter? Here are more things you might find helpful:


Pixel Clarity Call - A free 30-minute conversation to cut through the noise and see where your vision AI project really stands. We’ll pinpoint vulnerabilities, clarify your biggest challenges, and decide if an assessment or diagnostic could save you time, money, and credibility.

Book now
