Latent Dirichlet Allocation (LDA) is a popular statistical tool for analyzing text documents when the goal is to detect latent topics. A well-known limitation of LDA is its inability to model positive correlations between topics. This limitation is attributable to the rigidity of the Dirichlet distribution, which is the standard prior for the topic distributions. The aim of this work is to carry out a preliminary study of the extended flexible Dirichlet (EFD) distribution as an alternative prior. The EFD is a generalization of the Dirichlet distribution, defined as a particular structured mixture that allows for positive correlations between its elements. It retains many good theoretical properties of the Dirichlet, such as identifiability, explicit expressions for joint moments, and closure under many relevant operations on the simplex. Furthermore, its additional parameters provide more flexibility while maintaining the interpretability of the model and conjugacy with respect to the multinomial model. The generalization of LDA based on the EFD distribution is illustrated through an application to real data using Markov chain Monte Carlo (MCMC) methods.
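The limitation on positive correlations can be made concrete. For a Dirichlet vector with parameters alpha and alpha_0 = sum of the alpha_k, the covariance between two distinct components i and j is -alpha_i * alpha_j / (alpha_0^2 * (alpha_0 + 1)), which is negative for every admissible parameter choice. The following is a minimal Python sketch, not the authors' code, that checks this standard result by simulation. The parameter values, sample size, seed, and function names are illustrative assumptions, and the sketch covers only the ordinary Dirichlet, not the EFD.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def empirical_covariance(samples, i, j):
    """Unbiased sample covariance between components i and j."""
    n = len(samples)
    mean_i = sum(s[i] for s in samples) / n
    mean_j = sum(s[j] for s in samples) / n
    return sum((s[i] - mean_i) * (s[j] - mean_j) for s in samples) / (n - 1)


def dirichlet_covariance(alpha, i, j):
    """Closed-form Cov(X_i, X_j) for i != j under Dirichlet(alpha)."""
    a0 = sum(alpha)
    return -alpha[i] * alpha[j] / (a0 ** 2 * (a0 + 1.0))


rng = random.Random(0)        # fixed seed so the run is reproducible
alpha = [2.0, 3.0, 5.0]       # illustrative concentration parameters (assumption)
samples = [sample_dirichlet(alpha, rng) for _ in range(20000)]

for i, j in [(0, 1), (0, 2), (1, 2)]:
    emp = empirical_covariance(samples, i, j)
    theo = dirichlet_covariance(alpha, i, j)
    print(f"components ({i}, {j}): empirical {emp:.5f}, closed form {theo:.5f}")
```

Every pair of components comes out negatively correlated, whatever values are placed in `alpha`. This is the constraint that a more flexible prior such as the EFD is intended to relax.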
Giampino, A., Ascari, R., Migliorati, S. (2022). LEFDA: An extension of the classical LDA. In 24th International Conference on Computational Statistics (COMPSTAT 2022) and CSDA & EcoSta Workshop on Statistical Data Science (SDS 2022).
LEFDA: An extension of the classical LDA
Giampino, Alice; Ascari, Roberto; Migliorati, Sonia
2022