Every few months a new study lands showing that an AI model has matched or beaten radiologists on some specific diagnostic task. Pneumonia on chest X-ray. Brain bleeds on CT head. Lung nodule detection. The headline writes itself: ‘AI is here, the doctors should be worried’.

Meanwhile, the proportion of imaging studies in the UK being read with meaningful AI assistance remains small. Most NHS scans, on any given day, are still being reported without it.

This is not because the models do not work. Many of them work very well, in the conditions their authors tested them under.

Building a model that performs well on a benchmark is roughly 10% of the work involved in deploying AI safely into clinical practice. The other 90% is what people talk less about. I have spent the last two years on the inside of that 90%. Here is a glimpse of what it actually contains.

Generalisation is harder than accuracy

The first thing that breaks when you move an AI model out of its training environment is its accuracy.

A model trained on 50,000 chest X-rays from one academic hospital usually performs worse on the 50,001st chest X-ray taken in a different hospital, on a different scanner, of a patient population it has not seen before. The underlying clinical task can be identical and the result still slips. The phenomenon has a name in the literature – distribution shift or drift – and it is one of the biggest reasons that demonstrably accurate models fail in practice.

The fix is not a smarter model. It is a deliberate, expensive process of re-training and testing the model against the cases it will actually see, in the conditions it will actually run in, and being honest when the answer is “not yet”. That process can itself run into complicated issues with medical device regulations.
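To make the idea of drift concrete, here is a minimal sketch of one common way to watch for it: comparing the distribution of a model's output scores at a new site against a reference set using the population stability index (PSI). This is an illustration only, not a description of any particular vendor's tooling. The function name, the ten equal-width bins and the assumption that scores lie between 0 and 1 are all my own choices for the example.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Return the PSI between two samples of model scores in [0, 1].

    Hypothetical helper for illustration. A value near 0 means the two
    score distributions look alike; larger values mean they differ more.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(scores):
        counts = [0] * n_bins
        for s in scores:
            # Clamp the index so a score of exactly 1.0 lands in the last bin.
            idx = min(max(int(s * n_bins), 0), n_bins - 1)
            counts[idx] += 1
        total = len(scores)
        # eps stands in for empty bins so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Example: scores from the training site versus a site where scores run higher.
training_site = [i / 100 for i in range(100)]
new_site = [min(s + 0.3, 0.999) for s in training_site]
psi = population_stability_index(training_site, new_site)
```

A PSI that climbs over time does not say the model is wrong, only that its inputs or outputs no longer look like the data it was validated on. That is the prompt to go back and re-test on local cases. Any alert threshold would need to be set and justified locally.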

The human in the loop is not enough

There is a comforting story that says AI is safe in healthcare because there is always a clinician in the loop. The radiologist is the backstop. Whatever the model gets wrong, the doctor will catch. The evidence does not really support that story.

When you put a confident AI suggestion in front of a clinician, they accept it more often than they should. Researchers call this automation bias, and it has been documented extensively in radiology, including in studies where the AI deliberately suggests incorrect findings and a meaningful proportion of readers go along with them anyway. The effect is bigger when the radiologist is tired, working at speed, or operating outside their core specialism – all factors that we mitigate in other ways at Hexarad.

Take that finding seriously and it changes what deployment means. You cannot just plug a model into a worklist and trust the clinician to mop up its errors. You have to train the radiologists in how to use it. You have to design the interface so the model’s accuracy and confidence are visible without being persuasive. You have to monitor, on an ongoing basis, whether your readers are starting to over-trust the system and intervene early when they do. None of that is contained in the model itself.
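As a rough sketch of what monitoring for over-trust could look like, the snippet below computes, for each reader, how often they followed the AI on cases where the AI later turned out to be wrong. Everything here is an assumption made for illustration: the `Read` record, its field names, the `min_cases` cut-off, and the premise that a reference standard (for example, later peer review) is available to judge whether the AI was correct.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """One reported study (hypothetical record layout)."""
    reader_id: str
    ai_was_correct: bool      # judged afterwards against a reference standard
    reader_followed_ai: bool  # final report agreed with the AI suggestion


def over_reliance_rates(reads, min_cases=20):
    """Per reader: share of AI-wrong cases where the reader still followed the AI.

    Readers with fewer than `min_cases` AI-wrong cases are left out,
    because a rate based on a handful of cases is mostly noise.
    """
    followed = defaultdict(int)
    total = defaultdict(int)
    for r in reads:
        if not r.ai_was_correct:
            total[r.reader_id] += 1
            followed[r.reader_id] += int(r.reader_followed_ai)
    return {rid: followed[rid] / n for rid, n in total.items() if n >= min_cases}
```

A rate that rises over successive months for the same reader is the signal worth acting on. A single snapshot says little, since some AI errors are subtle enough that any reader would miss them with or without the tool.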

The metrics that get published are rarely the metrics that matter.

When a paper reports that a model achieves 94% sensitivity on lung nodule detection, that number is meaningful in the narrow sense it was measured. It is not the number that tells you whether the model is safe to deploy at your institution.

What you actually want to know is harder. How does it perform on unusual pathologies? How does it perform across patient demographics – by age, ethnicity, body habitus? Is its performance drifting over time as scanners get upgraded and the patient mix shifts? When it is wrong, what kind of wrong, and on which patient groups?

Building the monitoring infrastructure and governance processes to answer those questions is, in my experience, comparable in scope to building the model itself. It does not come ready-made. It needs to be ‘built in’ from the ground up, not ‘bolted on’ at the end.
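To illustrate why a single headline sensitivity figure hides so much, here is a small sketch that breaks sensitivity down by patient subgroup and attaches a Wilson score interval to each estimate. The function names and the `(group, has_disease, model_flagged)` tuple layout are assumptions for the example, and the 95% level (`z=1.96`) is simply a conventional default.

```python
import math
from collections import defaultdict


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def sensitivity_by_group(cases):
    """Sensitivity per subgroup from (group, has_disease, model_flagged) tuples.

    Returns {group: (sensitivity, (low, high), number_of_positive_cases)}.
    Only disease-positive cases count towards sensitivity.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for group, has_disease, model_flagged in cases:
        if has_disease:
            positives[group] += 1
            true_pos[group] += int(model_flagged)
    return {
        g: (true_pos[g] / n, wilson_interval(true_pos[g], n), n)
        for g, n in positives.items()
    }
```

The interval matters as much as the point estimate. A subgroup with only a handful of positive cases will produce a very wide interval, which is an honest way of saying the model has not yet been tested enough on those patients to support a claim either way.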

Compliance is not a tax. It is the product.

The instinct of some people coming into healthcare AI from other industries is to treat regulatory compliance as a tax. Something you pay at the end of the build, ideally as cheaply as possible, in order to ship. That gets the relationship backwards.

The quality management system, the technical documents, the risk file, the post-market surveillance plan, the clinical evaluation, the change-control procedures – these are not paperwork. They are the structures that force you to know, in real time, whether the thing you have built is still safe. They are how you find out a model is degrading before a patient is harmed by it. They are how you decide whether a software update needs a fresh clinical evaluation. They are the difference between a system you can stand behind and a demo.

This can be uncomfortable for founders because it slows things down. But it is also where the experience of actually running a clinical service starts to count. After almost a decade of reporting patients’ scans at scale, you develop a fairly unromantic view of how radiology goes wrong in practice. Knowing those failure modes from the inside changes what you treat as a serious clinical risk when you start building AI on top of the same workflow. It is not a problem you can solve by reading the literature, and not one you can shortcut by adding a few clinical advisors to the team.

What it looks like when the work has been done

Two examples from inside Hexarad.

Voice-recognition errors are a longstanding feature of radiology reporting. Somewhere between 5% and 35% of reports come out of VR software with errors that go beyond simple typos – missing words, duplicated phrases, the wrong “left” or “right”. We built an LLM-based tool that detects those errors across our reporting workflow and feeds back to each individual reporter their own error rate, benchmarked against the group. Two weeks after the personalised feedback intervention, error rates across 115 reporters had fallen by a relative 12.3% – a statistically significant reduction. The interesting part is not the model. It is that the feedback loop – measuring, surfacing the human’s own error rate, measuring again – actually changes behaviour.

The second example is allocation. CT chest-abdomen-pelvis is a global headache in radiology because the request title alone does not tell you whether the patient needs a chest, gastrointestinal or gynaecological subspecialist to read the scan. The relevant information is buried in free-text referrals, often abbreviated. We built an LLM that parses those referrals and routes each study to the right subspecialist. Following deployment, the proportion of reports achieving full peer-review agreement rose by 3% and the addendum rate – corrections issued after sign-off – fell by a relative 37.8%. The improvement on both measures was roughly four times bigger than the background trend across all our other modalities over the same period.
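For readers less used to the distinction, a relative change is measured against the starting rate, not in percentage points. The sketch below shows the arithmetic. The 20% starting rate in the usage example is invented purely for illustration; the underlying rates behind the figures above are not given here.

```python
def relative_change(before, after):
    """Change from `before` to `after`, expressed as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined when the starting rate is 0")
    return (after - before) / before


def absolute_change(before, after):
    """Change from `before` to `after` in the same units as the inputs."""
    return after - before


# Hypothetical numbers only: a rate falling from 20% to 17.54% of reports
# is a 2.46 percentage-point drop, which is a 12.3% relative reduction.
hypothetical_before = 0.20
hypothetical_after = 0.1754
rel = relative_change(hypothetical_before, hypothetical_after)
abs_pts = absolute_change(hypothetical_before, hypothetical_after)
```

The same relative reduction can correspond to very different absolute changes depending on the baseline, which is why monitoring reports are easier to interpret when both are shown side by side.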

Neither of these is a model someone else could not have built. Both required treating the model as one component of a much larger workflow, and standing up the monitoring, governance and feedback infrastructure that makes the answers trustworthy at scale.

The right question

If you ask whether AI models can match human performance on isolated radiology tasks in a ‘lab’ environment, the answer is yes, and has been for years. That question is settled.

The question that is not settled, and that will determine whether AI actually changes patient outcomes at scale, is whether the people building it are willing to do the unglamorous 90%. Teleradiology services. Hospital trusts. Vendors. Regulators. There is no shortcut around any of it, and no amount of model improvement makes the rest go away.

The companies that move the needle in healthcare AI over the next decade will not be the ones with the highest benchmark scores. They will be the ones that have built the surrounding infrastructure – clinical, technical, regulatory – that lets a good model survive contact with an actual hospital.

That is a less exciting story than “AI beats radiologists”, but it is the one that matters.


AIJ Thought Leader 4 weeks ago

If AI is so good at radiology, why is so little of it being used?

By Dr Jaymin Patel, COO and co-founder of Hexarad
