Patania, S., Masiero, E., Brini, L., Piskovskyi, V., Ognibene, D., Donabauer, G., et al. (2024). Large Language Models as an active Bayesian filter: information acquisition and integration. In Proceedings of the 28th Workshop on the Semantics and Pragmatics of Dialogue, September 11–12, 2024, University of Trento.
Large Language Models as an active Bayesian filter: information acquisition and integration
Patania, S.; Masiero, E.; Ognibene, D.; Donabauer, G.
2024
Abstract
This study investigates Large Language Models (LLMs) as dynamic Bayesian filters through question-asking experiments inspired by cognitive science. We analyse LLMs’ inference errors and the evolution of uncertainty across models using repeated sampling. Building on Bertolazzi et al. (2023), we trace LLM belief states during repeated queries, finding that entropy decreases with each interaction, signaling reduced uncertainty. However, issues like “resurrection” (reassigning probabilities to invalidated outcomes) and “Bayesian apocalypse” (probabilities approaching zero) reveal significant flaws. GPT-4o consistently outperforms GPT-3 in probabilistic reasoning. These results underscore the need for improved architectures for reliability in high-stakes contexts and suggest a link between token-level and task-level uncertainty dynamics that can be leveraged to enhance LLM performance.
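The ideal behaviour the abstract measures LLMs against can be made concrete with a minimal sketch (my own illustration, not the authors' experimental code): a discrete Bayesian filter over candidate answers, updated by question–answer evidence, with Shannon entropy tracking the residual uncertainty. Against this baseline, "resurrection" corresponds to a later belief state reassigning mass to a hypothesis whose posterior had already reached zero, and a "Bayesian apocalypse" to the unnormalised posterior collapsing toward zero everywhere.

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a discrete belief distribution."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def bayes_update(belief, likelihood):
    """Multiply the prior by per-hypothesis answer likelihoods, then renormalise."""
    post = {h: belief[h] * likelihood.get(h, 0.0) for h in belief}
    z = sum(post.values())
    # z -> 0 here is the "Bayesian apocalypse" case: all mass destroyed at once.
    return {h: v / z for h, v in post.items()}

# Uniform prior over four candidates in a 20-questions-style game (2 bits).
belief = {h: 0.25 for h in ("cat", "dog", "car", "boat")}

# Answer to "Is it an animal?" = yes: eliminates the inanimate hypotheses.
belief = bayes_update(belief, {"cat": 1.0, "dog": 1.0, "car": 0.0, "boat": 0.0})

# Entropy drops from 2 bits to 1 bit; a consistent filter must never again
# assign probability to "car" or "boat" -- doing so would be a "resurrection".
```

Under this reading, the abstract's finding that entropy decreases with each interaction is the expected filter behaviour, while the two named failure modes are departures from the update rule above.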