Campagner, A., Ciucci, D., Svensson, C., Figge, M., Cabitza, F. (2021). Ground truthing from multi-rater labeling with three-way decision and possibility theory. INFORMATION SCIENCES, 545, 771-790 [10.1016/j.ins.2020.09.049].
Ground truthing from multi-rater labeling with three-way decision and possibility theory
Campagner A.; Ciucci D.; Cabitza F.
2021
Abstract
In recent years, Machine Learning (ML) has attracted wide interest as an aid to decision makers in complex domains, such as medicine. Although domain experts are typically aware of the intrinsic uncertainty of data annotation, the issue of Ground Truth (GT) quality has scarcely been addressed in the ML literature. GT quality is regularly assumed to be adequate, regardless of the number and skills of the raters involved in data annotation. These factors can, however, have a severe negative impact on the reliability of ML models. In this article we study the influence of GT quality, in terms of the number of raters, their expertise, and their agreement level, on the performance of ML models. We introduce the concept of reduction: a computational procedure by which to produce a single-target GT from a multi-rater setting. We propose three reductions, based on three-way decision, possibility theory, and probability theory. We characterize these reductions from the perspective of learning theory and propose two ML algorithms. We report the results of experiments, on both real-world medical and synthetic datasets, showing that GT quality strongly impacts the performance of ML models, and that the proposed algorithms handle this form of uncertainty better than state-of-the-art approaches.
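To make the notion of reduction concrete, the following is a minimal illustrative sketch, not the authors' actual procedures: a probability-style reduction that turns multi-rater votes into a soft label (relative class frequencies), and a three-way-style reduction that commits to a hard label only when rater agreement reaches a threshold and otherwise abstains. The function names, the `threshold` parameter, and the example votes are all assumptions introduced here for illustration.

```python
from collections import Counter

def probability_reduction(ratings):
    """Soft label: relative frequency of each class among the raters' votes."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: n / total for label, n in counts.items()}

def three_way_reduction(ratings, threshold=0.75):
    """Hard label if agreement reaches the threshold, else abstain (None)."""
    label, n = Counter(ratings).most_common(1)[0]
    return label if n / len(ratings) >= threshold else None

# Four hypothetical raters annotating one case:
votes = ["malignant", "malignant", "benign", "malignant"]
print(probability_reduction(votes))     # {'malignant': 0.75, 'benign': 0.25}
print(three_way_reduction(votes))       # malignant
print(three_way_reduction(["a", "b"]))  # None (insufficient agreement)
```

The abstaining behavior of the three-way sketch reflects the general idea of three-way decision (accept, reject, or defer); the soft-label variant preserves the raters' disagreement instead of discarding it, which is the kind of uncertainty information the proposed ML algorithms are designed to exploit.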