Theme Extraction from Textual Data: A Comparative Study of Latent Dirichlet Allocation and Latent Semantic Analysis

1Ugorji C. Calistus, 2Chika R. Okonkwo, 3Chika I. Obi-Okonkwo and 4Obikwelu R. Okonkwo

1,2,4Department of Computer Science, Nnamdi Azikiwe University, Awka, Nigeria

3ICT Department, Federal Radio Corporation of Nigeria (FRCN), Enugu, Nigeria

Email: Ugochuks2@gmail.com, chikaokon@yahoo.com, cobiokonkwo@yahoo.com, ro.okonkwo@unizik.edu.ng

ABSTRACT

Extracting meaningful themes from large volumes of text is essential for information retrieval, sentiment trend analysis, and the discovery of underlying topics. This study compares two widely used theme extraction techniques, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), and examines how well each produces coherent and interpretable themes from diverse textual datasets. The datasets span news articles, scholarly papers, and social media posts, so that performance can be compared across textual genres and contexts. Both algorithms are applied to each dataset and evaluated using coherence, topic diversity, and interpretability. We also examine how parameter settings affect theme quality, since small adjustments can significantly change the outcome. The study identifies the relative strengths and weaknesses of LDA and LSA, the scenarios in which one technique outperforms the other, and the factors that contribute to these differences. Finally, we provide practical guidelines for researchers and practitioners choosing between LDA and LSA. These guidelines are based on the characteristics of the textual data and the specific objectives of the analysis.

KEYWORDS: Theme extraction, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Textual data, Comparative analysis, Information overload, Coherence, Topic diversity, Parameter settings, Interpretability.

Ugorji C. Calistus, Chika R. Okonkwo, Chika I. Obi-Okonkwo and Obikwelu R. Okonkwo (2024). Theme Extraction from Textual Data: A Comparative Study of Latent Dirichlet Allocation and Latent Semantic Analysis. RESEARCH INVENTION JOURNAL OF ENGINEERING AND PHYSICAL SCIENCES 3(2):27-31.