At the heart of the language sciences lies a fundamental question: How does the human cognitive system process the wide variety of human languages? Recent developments in Natural Language Processing, particularly in Multilingual Neural Language Models (MNLMs), offer a promising avenue to answer this question by providing a theory-agnostic way of representing linguistic content in different languages. This thesis leverages these advancements to explore how the human language processor responds to linguistic stimuli in different languages and the extent to which those responses are similarly modulated by linguistic features. The thesis first examines how MNLMs handle a specific syntactic phenomenon, subject-verb agreement, revealing a shared set of neural units responsible for agreement processing in five different languages. The analysis then shifts to the semantic domain, focusing on how MNLMs represent affective content. The findings indicate that emotional content is encoded in a way that is consistent across languages, demonstrating that language-general semantic information spontaneously emerges in specific network units. Once it is demonstrated that MNLMs encode syntactic and semantic content in a shared format across languages, the probabilistic estimates they generate are applied to the analysis of the behavioral correlates of naturalistic reading in a typologically diverse sample of eye-movement records. The results provide cross-linguistic evidence for the robustness of the link between predictability and cognitive effort, validating the general sensitivity of the human language processor to the information content carried by words. Lastly, a set of encoding models derived from MNLMs is employed to predict brain activity within the human language network, providing additional evidence of cross-lingual robustness in the link between language representations in artificial and biological neural networks.
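The link between predictability and cognitive effort mentioned above is conventionally operationalized through surprisal: the negative log-probability a language model assigns to a word given its preceding context. The following is a minimal, self-contained sketch of the measure; the toy probability table is purely illustrative and stands in for the next-word distributions of an actual MNLM scoring naturalistic reading corpora.

```python
import math

# Hypothetical conditional probabilities standing in for an MNLM's
# next-word distribution over a given context.
toy_model = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "pondered": 0.1},
}

def surprisal(context, word, model):
    """Surprisal in bits: -log2 p(word | context).

    Higher surprisal means the word carries more information;
    under surprisal theory, it predicts longer reading times.
    """
    p = model[context][word]
    return -math.log2(p)

# A predictable continuation carries little information...
print(round(surprisal(("the", "cat"), "sat", toy_model), 3))      # 0.737
# ...while an unexpected one carries more, and should slow reading.
print(round(surprisal(("the", "cat"), "pondered", toy_model), 3)) # 3.322
```

In the studies the abstract describes, these per-word estimates would be regressed against eye-movement measures (e.g., gaze durations) across typologically diverse languages.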
Further analyses showed that the encoding models can be successfully transferred zero-shot across languages, so that a model trained to predict brain activity in a set of languages can successfully account for brain responses in a held-out language, even across language families. These results demonstrate that the brain responses to language are guided (at least in part) by language-general principles.
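The encoding-model and zero-shot transfer logic can be sketched as follows. All data here are synthetic and all dimensions illustrative (they do not reflect the thesis's actual fMRI recordings or model features): MNLM features are mapped linearly to voxel responses via ridge regression, the map is fit on a set of "languages", and evaluated on a held-out one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_vox = 50, 20

# A shared linear map from (language-general) MNLM features to brain
# responses -- the hypothesis that zero-shot transfer puts to the test.
W_true = rng.normal(size=(n_feat, n_vox))

def simulate_language(n_words):
    """Synthetic stand-in for one language's stimuli: MNLM features X
    and noisy brain responses Y generated by the shared map."""
    X = rng.normal(size=(n_words, n_feat))
    Y = X @ W_true + 0.5 * rng.normal(size=(n_words, n_vox))
    return X, Y

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X'X + lam*I)^-1 X'Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

# Train the encoding model on two "languages"...
Xa, Ya = simulate_language(200)
Xb, Yb = simulate_language(200)
W = fit_ridge(np.vstack([Xa, Xb]), np.vstack([Ya, Yb]))

# ...and evaluate zero-shot on a held-out "language".
Xc, Yc = simulate_language(200)
pred = Xc @ W

# Voxel-wise correlation between predicted and observed responses;
# transfer succeeds when this stays well above zero on the held-out data.
r = np.mean([np.corrcoef(pred[:, v], Yc[:, v])[0, 1] for v in range(n_vox)])
print(f"mean voxel correlation on held-out language: {r:.2f}")
```

Because the synthetic responses share one underlying map, the held-out correlation is high; in the real analyses, an analogous result across language families is what supports the claim of language-general principles.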
De Varda, A. G. (2025). Multilingual Neural Language Models in Cognitive Science: How Cross-lingual Representation Spaces can Inform the Study of Language (Doctoral thesis).
Multilingual Neural Language Models in Cognitive Science: How Cross-lingual Representation Spaces can Inform the Study of Language
DE VARDA, ANDREA GREGOR
2025
Abstract
File: phd_unimib_815277.pdf (Doctoral thesis, Adobe PDF, 13.94 MB)
Embargo until 11/02/2028
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.