Hi,
Computer vision systems fail in ways we often don't detect until it's too late. In my years working with these technologies, I've seen firsthand how:
- Models perform brilliantly in the lab but fail catastrophically on real data
- Systems inadvertently discriminate against certain patient groups
- Minor data variations lead to unexplainable performance differences
- Teams struggle to diagnose why their models aren't generalizing
On May 15th at 11 am EDT, I'm hosting a 30-minute webinar unpacking these challenges. We'll examine real-world cases where bias and batch effects undermined model reliability and explore practical techniques for prevention and detection.
This isn't about selling solutions - it's about sharing critical knowledge that every team building computer vision systems needs to understand.
Register now if you're wrestling with these issues in your work.
Heather
Research: Spurious Correlations
Detecting Spurious Correlations With Sanity Tests for Artificial Intelligence Guided Radiology Systems
Ever wonder if an AI system that performs beautifully in testing might completely fail when deployed in the real world? This problem is particularly concerning in healthcare, where AI systems are increasingly helping radiologists with diagnosis and treatment planning.
Usman Mahmood et al. developed a practical solution by creating a series of targeted tests to identify when AI models learn to make predictions based on irrelevant patterns or artifacts in the data rather than clinically meaningful features.
Using pancreatic cancer detection as a case study, they demonstrated how a system that appeared effective during development actually failed on unseen data. Their sanity tests revealed the model was relying on spurious correlations rather than actual disease characteristics - a concerning finding that would have been missed in traditional validation approaches.
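To make the idea concrete, here is a minimal sketch of one generic check of this flavor - not the specific tests from the paper - in which the clinically relevant region is blanked out to see whether the prediction changes. The `model` callable, the array shapes, and the threshold are assumptions for illustration only.

```python
import numpy as np

def occlusion_sanity_check(model, image, roi_mask, threshold=0.05):
    """Compare predictions on the original image vs. one with the
    clinically relevant region blanked out. A near-identical score
    suggests the model is relying on features outside the ROI
    (a possible spurious correlation).

    model:     callable mapping an image array to a probability (assumed)
    image:     H x W (or H x W x C) array
    roi_mask:  boolean array, True inside the region of interest
    """
    occluded = image.copy()
    occluded[roi_mask] = image.mean()          # replace ROI with a neutral value
    original_score = float(model(image))
    occluded_score = float(model(occluded))
    suspicious = abs(original_score - occluded_score) < threshold
    return original_score, occluded_score, suspicious
```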
What makes this work valuable is its practicality - these tests can be implemented before expensive clinical validation studies, potentially saving significant resources while improving patient safety. The researchers' approach provides a systematic way to evaluate whether radiology AI systems are making predictions for the right reasons.
As we continue developing AI for healthcare applications, ensuring models focus on clinically relevant features rather than dataset artifacts will be essential for building systems that generalize properly across different patient populations and clinical settings.
Research: Applying Foundation Models
Earth Observation Foundation Models for region-specific flood segmentation
While satellite imagery has revolutionized our ability to monitor Earth's changing climate, one persistent challenge remains: clouds often obscure critical data when we need it most. This is particularly problematic during flood events, when timely information can save lives.
Helen Tamura-Wicks et al. demonstrated a promising approach to this problem. They built upon the existing Prithvi-EO foundation model (which uses optical imagery) by incorporating Synthetic Aperture Radar (SAR) imagery specifically for flood detection in the UK and Ireland.
Why is this significant? SAR can see through clouds, making it invaluable during extreme weather events when cloud cover is common. They found that incorporating SAR bands improved flood segmentation performance from 0.58 to 0.79, demonstrating that these foundation models can be relatively easily tuned to new locations and application-specific satellite bands.
What's particularly encouraging is that the research suggests Foundation Models can be adapted to different regions during fine-tuning, even with relatively little labeled data. This is crucial for environmental applications where labeled datasets are often difficult and expensive to produce.
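As a rough illustration of what "incorporating extra bands" can look like in practice - a generic sketch, not how Prithvi-EO is actually adapted - one common trick is to widen a pretrained backbone's input layer so it accepts the new SAR channels while reusing the pretrained weights for the original optical bands. The PyTorch Conv2d patch embedding below is an assumed stand-in for whatever input layer the backbone uses.

```python
import torch
import torch.nn as nn

def widen_input_channels(patch_embed_conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Return a new patch-embedding conv that accepts extra input bands
    (e.g., SAR) while reusing pretrained weights for the original bands.
    The new channels start at zero so fine-tuning begins from the
    pretrained behavior."""
    old = patch_embed_conv
    new = nn.Conv2d(old.in_channels + extra_channels, old.out_channels,
                    kernel_size=old.kernel_size, stride=old.stride,
                    padding=old.padding, bias=old.bias is not None)
    with torch.no_grad():
        new.weight[:, :old.in_channels] = old.weight      # copy pretrained bands
        nn.init.zeros_(new.weight[:, old.in_channels:])   # new SAR bands start at zero
        if old.bias is not None:
            new.bias.copy_(old.bias)
    return new
```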
Insights: Batch Effects
Detecting Batch Effects in Medical AI: Beyond Basic Metrics
Q: What metrics do you recommend for rigorously detecting batch effects (covariate shift)?
The silent killer of AI reliability in healthcare isn’t model architecture—it’s batch effects from multi-site data. Here’s how to spot these hidden covariate shifts using robust, nonparametric methods:
Method 1: Visual Inspection with UMAP/t-SNE
How it works:
- Reduce high-dimensional feature spaces to 2D using nonlinear techniques like UMAP (Uniform Manifold Approximation and Projection) or t-SNE (t-Distributed Stochastic Neighbor Embedding).
- Color-code embeddings by batch (e.g., hospital). Clustered batches = red flag for covariate shift.
Pros:
- Intuitive for interdisciplinary teams.
- Reveals nonlinear patterns missed by PCA.
Limitations:
- Subjective interpretation risk.
- Doesn't quantify shift severity.
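A minimal sketch of this check, assuming you already have a feature matrix (n_samples x n_features) and a per-sample batch label, and using the umap-learn and matplotlib packages:

```python
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

def plot_batch_embedding(features, batch_labels):
    """Project features to 2D with UMAP and color points by batch.
    Tight, batch-pure clusters are a red flag for covariate shift."""
    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(features)
    for batch in sorted(set(batch_labels)):
        idx = [i for i, b in enumerate(batch_labels) if b == batch]
        plt.scatter(embedding[idx, 0], embedding[idx, 1], s=5, label=str(batch))
    plt.legend(title="Batch (e.g., hospital)")
    plt.title("UMAP of model features, colored by batch")
    plt.show()
```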
Method 2: Feature-Based Classifier Test
How it works:
- Extract latent features from your model's penultimate layer.
- Train a simple linear classifier to predict batch (e.g., hospital) from these features.
- Measure performance via AUROC.
Pros:
- Quantifiable metric for shift severity.
Limitations:
- Requires model introspection access.
- Doesn't distinguish harmful vs. benign batch signals.
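A minimal sketch with scikit-learn, assuming `features` holds penultimate-layer activations and `batch_labels` is a binary 0/1 site indicator (two sites, for illustration only):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def batch_classifier_auroc(features, batch_labels, test_size=0.3, seed=0):
    """Train a linear probe to predict the batch from latent features.
    AUROC near 0.5 means the features carry little batch signal;
    values approaching 1.0 indicate an easily separable batch effect."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, batch_labels, test_size=test_size,
        random_state=seed, stratify=batch_labels)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]   # probability of batch 1
    return roc_auc_score(y_test, scores)
```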
Critical Insight: These methods reveal whether your model encodes batch features, but not whether it uses them to bias your results. To check for bias, calculate stratified metrics on your test set and compare performance across subgroups.
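A minimal sketch of that stratified check, assuming per-case ground-truth labels, predicted scores, and a group identifier such as hospital:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auroc(y_true, y_score, groups):
    """Compute AUROC separately for each subgroup in the test set.
    Large gaps between groups suggest the batch signal is actually
    biasing predictions, not just sitting unused in the features."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) < 2:
            results[g] = float("nan")   # AUROC undefined with a single class
            continue
        results[g] = roc_auc_score(y_true[mask], y_score[mask])
    return results
```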
Enjoy this newsletter? Here are more things you might find helpful:
Office Hours -- Are you a student with questions about machine learning for pathology or remote sensing? Do you need career advice? Once a month, I'm available to chat about your research, industry trends, career opportunities, or other topics. Register for the next session
Did someone forward this email to you, and you want to sign up for more? Subscribe to future emails
This email was sent to _t.e.s.t_@example.com. Want to change to a different address? Update subscription
Want to get off this list? Unsubscribe
My postal address: Pixel Scientia Labs, LLC, PO Box 98412, Raleigh, NC 27624, United States