Addressing topic modelling via reduced latent space clustering

Published in Statistical Methods & Applications, 2025

Recommended citation: Schiavon, L. (In press) Addressing topic modelling via reduced latent space clustering, in Statistical Methods & Applications.

in: Statistical Methods & Applications, accepted for publication.

Abstract: In the social sciences, topic modelling is gaining increased attention for its ability to automatically uncover the underlying themes within large corpora of textual data. This process typically involves two key phases: (i) identifying the words associated with language concepts, and (ii) clustering documents that share similar word distributions. In this study, motivated by the growing interest in automatic categorisation of policy documents and regulations, we leverage recent advancements in Bayesian factor models to develop a novel topic modelling approach. This enables us to represent the high-dimensional space defined by all possible observed words through a small set of latent variables, and simultaneously cluster the documents based on their distributions over these latent constructs. Here, groups and underlying constructs are interpreted as document topics and language concepts, respectively, and the number of dimensions does not need to be specified in advance. Additionally, we demonstrate the effectiveness of our approach using synthetic data, providing a comparison with existing methods in the literature. The illustration of our approach on a corpus of Italian public health plans unveils intriguing patterns concerning the semantic structures used in aging policies and document topic similarities.

Link to paper