I just read a research paper showing that Microsoft’s new AI system (MAI-DxO) outperformed human physicians on complex diagnostic cases. What’s crazy is that it did so while ordering fewer tests.
What’s even crazier is that when paired with OpenAI’s o3 model, MAI-DxO achieved 80% diagnostic accuracy. According to the research, that is four times the 20% average of the human physicians.
Could this change how we do healthcare? Would you trust something like this to diagnose you in a real hospital?
It makes a lot of sense: most LLMs are essentially pattern-recognition tools (generally mimicking the writing style of humans), so for diagnosis, where there are patterns, albeit very complex ones, they can perform better than trained physicians.
The larger issue is deployment. These systems would still need human oversight to confirm their results, and they could disrupt the generalist/specialist status quo of medical professionals. These tools would help generalists significantly, as this study used them as the benchmark:
To establish human performance, we recruited 21 physicians practising in the US or UK to act as diagnostic agents. Participants had a median of 12 years [IQR 6-24 years] of experience: 17 were primary care physicians and four were in-hospital generalists.
The study should ideally have included specialists too, but they are harder to recruit, which is understandable. A tool like this may not be as useful to specialists, but it could provide a better answer when a specialist is not available.
Realistically, further study is needed, as this paper brings up an intriguing question:
When evaluating frontier AI systems, should we compare them to individual physicians, or to entire hospital-like teams of generalists and specialists?
Personally, I think the benchmark should be hospital-like teams: in the majority of cases in Annexe C, at least one of the 18 clinicians who completed 10+ cases got the diagnosis perfect or mostly correct. Diagnosis works more as a team effort than as individual work. Benchmarking against teams would give a clearer understanding of AI tools in medicine, which would improve healthcare. Including specialists would also make for a better benchmark, since they are trained to detect the specific conditions they are looking at, though that narrow focus limits how widely each one can be applied.
I think AI has the potential to accelerate diagnostics and increase their accuracy. At the end of the day, there is only so much a handful of people can truly know when making a diagnosis. Plus, as Adam mentioned, this is exactly where AI shines: making predictions from patterns. On a broader scale, it can also learn which symptoms are correlated with which outcomes/diagnoses.
As Adam also mentioned, though, the comparison is limited by the small number of humans involved, which makes me believe that further exploration is necessary.
First of all, accuracy can be an ambiguous term. However, AI is advancing healthcare and biological science in many ways.
I recently attended a healthcare conference, and AI was a big player. There were tons of providers with booths up. Ambience was not one of them, but it hosted a seminar. Ambience is a leader in listening in on doctor office visits and completing the charts that insurance companies use. On the heels of Ambience’s success, many others have taken up the business; I counted close to five. I imagine the insurance companies will soon have AI to process those charts.
Hence, the diagnostic agents you bring up are no shock, especially considering that Microsoft had its own booth. It will be interesting to see whether Microsoft’s version of Ambience’s service keeps this diagnostic agent exclusive, or whether something else happens in the space.
BTW, I don’t recall the other AI vendors that were there, but I’m pretty sure other services were represented.
Can you describe further how accuracy, in this context, is ambiguous? It would seem that a diagnosis is either accurate or it isn’t, which really makes this something of a binary classifier. So scoring such a model would be pretty straightforward; or better yet, you could gauge patients’ recovery time and use that as some sort of reward mechanism. What am I missing here?
HaHaHa! Binary is pretty dry. That’s why accuracy is ambiguous. Here’s the rundown on accuracy: it doesn’t account for class proportions. Say accuracy is 95%. What proportion of that comes from correctly predicting positive results, and what proportion from correctly predicting negative ones?
It’s a big concern in healthcare. If you are predicting heart disease from demographic and check-up metrics, it is better to send half your patients for an explicit heart-disease exam, even if only 10% really have it, than to send only 20% of patients for exams and let more genuine heart-disease cases go undetected. You need metrics that target your use case.
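To put rough numbers on that, here’s a quick Python sketch with made-up figures (1,000 patients, 10% prevalence) showing how a model can post a high accuracy while still missing most of the patients who actually have heart disease:

```python
# Toy illustration with made-up numbers: 1,000 patients, 10% truly have heart disease.
# A timid model that flags only 20 patients can still report ~91% accuracy
# while missing most of the people it was supposed to catch.

total = 1000
positives = 100                 # patients who actually have the disease (10% prevalence)
negatives = total - positives

true_positives = 15             # sick patients the model correctly flags
false_positives = 5             # healthy patients flagged by mistake
false_negatives = positives - true_positives   # sick patients sent home
true_negatives = negatives - false_positives

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.2%}")   # ~91% -- looks great on paper
print(f"recall:   {recall:.2%}")     # 15% -- 85 of 100 sick patients go undetected
```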
This seems to be more a question of precision and recall, then, because of the existence of false positives and false negatives. There is a standard equation (conditional probability) for working out the probability of false positives and false negatives. Using it, you can derive the false-positive and false-negative rates from the expected accuracy and the actual results, and from there judge whether the results are still plausible.
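If I’m reading you right, that’s essentially Bayes’ theorem. A quick Python sketch with invented numbers (1% prevalence, 90% sensitivity, 95% specificity) shows how those rates turn into the chance that a flagged patient actually has the disease:

```python
# Bayes' theorem with invented numbers: P(disease | positive test).
prevalence = 0.01      # 1% of patients actually have the ailment
sensitivity = 0.90     # P(test positive | disease)    = 1 - false-negative rate
specificity = 0.95     # P(test negative | no disease) = 1 - false-positive rate

# Total probability of a positive test, then Bayes' rule.
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(positive test)           = {p_positive:.3f}")                 # ~0.059
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")   # ~0.154
```

Even with a fairly good test, most positives here are false positives simply because the ailment is rare, which is exactly why accuracy alone can mislead.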
The best way to deal with medical issues is to maximise the F1 score, which penalises both false positives and false negatives, rather than overall accuracy.
I disagree. I suspect that recall on the positive class is paramount. If only 1% are positive for an ailment, it is imperative to catch that 1%. Hence recall’s incorporation of false negatives comes into play: if the model predicts no ailment when there is in fact an ailment, recall goes down.
Well, I see what you mean, but I think that’s the extent of a binary classifier: true positives, true negatives, false positives, and false negatives. So with some models you may be able to tune accuracy in conjunction with another rate (in your example, the false-negative group is the most important to catch). Or is there something else I’m missing?
I do have to disagree on this. If recall is 1 (so every positive case is caught) while precision is not, false positives show up. In medicine that is dangerous, especially when invasive treatment is involved, because unnecessary treatment can itself cause harm.
Likewise, if precision is 1 while recall is not, false negatives exist and the need for treatment gets missed. That is why it’s important to go for the F1 score, which penalises both false positives and false negatives; focusing only on precision or only on recall drives up one or the other, which may be more harmful in the long run.
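For what it’s worth, here’s a minimal sketch (assuming scikit-learn is installed, with made-up labels for ten patients) of how leaning entirely on recall or entirely on precision both show up as a mediocre F1:

```python
# Two made-up prediction strategies over the same 10 patients (1 = has the ailment).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# "Flag everyone" maximises recall but floods the clinic with false positives.
flag_everyone = [1] * 10
# "Flag rarely" keeps precision perfect but misses two of the three sick patients.
flag_rarely = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, y_pred in [("flag everyone", flag_everyone), ("flag rarely", flag_rarely)]:
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"{name:14s} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# flag everyone  precision=0.30 recall=1.00 F1=0.46
# flag rarely    precision=1.00 recall=0.33 F1=0.50
```

Both extremes score poorly on F1, which is the point: the harmonic mean only rewards you when precision and recall are both reasonable.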
I also have to back Adam here. He’s made a good point that misdiagnosing an illness as something much more serious can have consequences far graver than not diagnosing it at all. It really depends on the illness and the data, but either way, if computers can do it better, we should support them in doing so.