Cabitza, F., Campagner, A. (2024). Towards Better Ways to Assess Predictive Computing in Medicine: On Reliability, Robustness, and Utility. In B. Carpentieri, P. Lecca (Eds.), Big Data Analysis and Artificial Intelligence for Medical Sciences (pp. 309-337). Wiley [10.1002/9781119846567.ch14].
Towards Better Ways to Assess Predictive Computing in Medicine: On Reliability, Robustness, and Utility
Cabitza F.; Campagner A.
2024
Abstract
Computational classification systems built using machine learning (ML) techniques are increasingly being evaluated and employed in medical settings for a number of purposes, including diagnosis, prognosis, and risk stratification. However, evaluation and validation practices that are commonly adopted when ML is applied in other disciplines are unlikely to be meaningfully applicable to medicine. In fact, otherwise technically sound systems have been found to perform poorly in real settings, a problem that has been termed the “last mile of implementation.” In this chapter, we will focus on three main factors underlying the so-called last mile: the impact of observer variability on ground truth reliability; the meaningfulness and appropriateness of commonly adopted performance measures; and the issue of replicability in ML studies. We will discuss the above-mentioned issues, and we will delineate possible solutions and concepts to address them.