Topic Modelling - Part 2

What you are about to read is the second part in a series of articles dedicated to Topic Modelling. Please note that, like the previous one, this article contains snippets of text and code from my dissertation:

“Information extraction of short text snippets using Topic Modelling algorithms.”

The text that follows was written by me in 2021, and mainly centres on the theoretical side of Topic Modelling. Research papers, when cited, are properly referenced.

For my research dissertation, I chose to focus mainly on five algorithms. The following models were fitted on several datasets, their output recorded, and their performance compared:

  • Latent Dirichlet Allocation
  • Latent Semantic Analysis
  • Non-Negative Matrix Factorization
  • Gibbs Sampling Algorithm for Dirichlet Multinomial Mixture
  • Combined LDA and transformers with BERTopic

Latent Dirichlet Allocation

More commonly known as LDA, Latent Dirichlet Allocation has become one of the most frequently used algorithms for uncovering hidden topics within a text corpus. Whenever you look up Topic Modelling, you’re likely to stumble upon an article or a video that focuses on LDA. Its name can be unpacked as follows: Latent means that the topics are concealed (or hidden), while Dirichlet refers to the German mathematician after whom the probability distribution used by the model is named. As for Allocation, it stands for the distribution of topics within a corpus, and the distribution of words within each topic.

An LDA algorithm makes the Bayesian assumption that a corpus is made of one or more topics, and that these topics are in turn made of one or more words. It is a distribution of distributions [1]. Each topic is attributed a weight, which helps determine the importance of that topic within the corpus.

A number of topics, K, is estimated as best fitting the actual number of topics within the corpus, leading to a K-nomial (multinomial) distribution over topics. Within each topic, a multinomial distribution over words is estimated in the same way, which means the model actually uses two Dirichlet distributions [2].
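
As a sketch, the generative process that LDA assumes can be written as follows, using the conventional α and β symbols for the two Dirichlet priors (this is standard notation, not something introduced elsewhere in this article):

```latex
% Generative process assumed by LDA: alpha governs topics per document,
% beta governs words per topic.
\begin{aligned}
\varphi_k &\sim \operatorname{Dirichlet}(\beta)  && \text{word distribution of topic } k \\
\theta_d  &\sim \operatorname{Dirichlet}(\alpha) && \text{topic distribution of document } d \\
z_{d,n}   &\sim \operatorname{Multinomial}(\theta_d) && \text{topic of the } n\text{-th word of document } d \\
w_{d,n}   &\sim \operatorname{Multinomial}(\varphi_{z_{d,n}}) && \text{the observed word itself}
\end{aligned}
```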

In practice, several approaches can be taken to estimate the number of topics K:

  1. Simple trial and error, manually incrementing or decrementing the number of topics based on the weight of each discovered topic.
  2. A curve plot, also known as an elbow plot, of a quality score against K, to visually determine the point beyond which adding more topics stops improving the model (see the sketch after this list).
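
As a hedged sketch of the elbow approach: the snippet below trains one Gensim LDA model per candidate K and prints a coherence score for each, using u_mass coherence purely for illustration. The toy corpus, the K range, and the parameter values are assumptions, not the exact setup used in the dissertation.

```python
# Elbow-style selection of K: train one LDA model per candidate value and
# record a coherence score, then look for the point where the curve flattens.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

tokenised_docs = [
    ["cat", "sat", "mat"],
    ["dog", "cat", "pet"],
    ["stock", "market", "fell"],
    ["investor", "market", "volatility"],
]

dictionary = Dictionary(tokenised_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]

for k in range(2, 6):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
    print(f"K={k}: coherence={cm.get_coherence():.3f}")  # plot these values against K
```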

One of the most frequently used Python libraries for LDA modelling is Gensim, an open-source library that supports simple tasks such as text processing and word embeddings, as well as more advanced topic modelling algorithms. It supports word vector models like Word2Vec and FastText, and can be used to compute similarity metrics.

Though several parameters can be set, the default values seem to provide the best results across most use cases.
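
Below is a minimal Gensim sketch that relies on those defaults; the toy documents and num_topics=2 are illustrative assumptions only.

```python
# Fit an LDA model with Gensim using mostly default parameters.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenised_docs = [
    ["cat", "sat", "mat"],
    ["dog", "cat", "pet"],
    ["stock", "market", "fell"],
    ["investor", "market", "volatility"],
]

dictionary = Dictionary(tokenised_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]  # bag-of-words corpus

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)

# Top words and weights for each discovered topic
for topic_id, topic in lda.print_topics(num_words=4):
    print(topic_id, topic)

# Topic weights for the first document
print(lda.get_document_topics(corpus[0]))
```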

Latent Semantic Analysis

Latent Semantic Analysis, or LSA, provides a different approach from the above LDA algorithm [3]. The concept relies on putting together (aggregating)…

“…all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other” [4].

Its main purpose is dimensionality reduction, based on the fact that words will occur in similar documents if they have similar meaning.

Simply put, an LSA algorithm starts from a document-term matrix (m × n), where each row corresponds to a document, each column corresponds to a word, and each cell contains a TF-IDF score (see my article on TF-IDF). It then attempts to break this matrix down into two new matrices:

  1. A document-topic matrix
  2. A topic-term matrix

Singular Value Decomposition (SVD) is applied to this document-term matrix, and it is through this factorisation that the LSA model is able to learn and discover the underlying topics [5]. The decomposition is commonly written as M = UΣV*, where:

  • M is the m×n document-term matrix
  • U is an m×m matrix of left singular vectors
  • Σ is an m×n diagonal matrix with non-negative real numbers (the singular values) on its diagonal
  • V is an n×n matrix of right singular vectors
  • V* is the transpose of V

In practice, LSA keeps only the k largest singular values (a truncated SVD), which yields the low-rank document-topic and topic-term matrices mentioned above.
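
A minimal sketch of this pipeline, using scikit-learn’s TfidfVectorizer and TruncatedSVD (one common way to implement LSA; the toy documents and the number of topics are illustrative assumptions):

```python
# LSA sketch: TF-IDF document-term matrix followed by a truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)        # m x n document-term matrix of TF-IDF scores

svd = TruncatedSVD(n_components=2, random_state=42)
doc_topic = svd.fit_transform(X)          # m x k document-topic matrix
topic_term = svd.components_              # k x n topic-term matrix

terms = vectorizer.get_feature_names_out()
for k, row in enumerate(topic_term):
    top_terms = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"Topic {k}: {top_terms}")
```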

Both the LDA and LSA models take a bag-of-words representation of the corpus as input, or more simply put a Python dictionary-like object where words are keys and their number of occurrences the values.

Non-Negative Matrix Factorization

First introduced by Paatero and Tapper in 1994, and popularised in an article by Lee and Seung in 1999, Non-Negative Matrix Factorization (NNMF) is another matrix-based approach that can be used for topic modelling. NNMF is fast and memory-efficient [6].

NNMF will attempt to approximate a matrix X using a low-rank matrix approximation such that X ≈ WH [7].

To do so, the algorithm reduces the original dimension p to a smaller rank r, creating a low-rank approximation. Each column of W then constitutes a building block from which an approximation of the data can be reconstructed, while each column of H holds the coordinates of a data point in the basis W [7].

How accurate the WH approximation is gets measured through the reconstruction error, i.e. how far WH is from X (typically using the Frobenius norm), which the algorithm minimises iteratively, for instance with the multiplicative update rules of Lee and Seung [6]; SVD sometimes appears here only as a way to initialise W and H.
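
A hedged sketch of topic modelling with NNMF, using scikit-learn’s NMF class on a TF-IDF matrix (the corpus, the number of topics, and the nndsvd initialisation are illustrative assumptions):

```python
# NNMF sketch: factorise a TF-IDF matrix X into W (document-topic) and H (topic-term).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)      # non-negative m x n matrix of TF-IDF scores

nmf = NMF(n_components=2, init="nndsvd", random_state=42)
W = nmf.fit_transform(X)                # m x r document-topic matrix
H = nmf.components_                     # r x n topic-term matrix

terms = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top_terms = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"Topic {k}: {top_terms}")
```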

Gibbs Sampling Algorithm for Dirichlet Multinomial Mixture

Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM) offers a variant of Dirichlet Multinomial Mixture. This rather recent approach attempts to distribute topics within a corpus based on homogeneity and completeness [8].

Fundamentally, Gibbs Sampling is a Markov Chain Monte Carlo method that loops over a set of variables and, at each step, draws a new value for one variable from its distribution conditioned on the current values of the others, thereby approximating their joint distribution [9].

GSDMM differs from LDA as it assumes that a document contains one topic, while LDA assumes that a document can have multiple topics [10].

According to its authors, GSDMM is particularly well suited to topic modelling of short texts, and has several advantages over the other approaches described above. GSDMM indeed:

  • Doesn’t need human intervention to infer the number of clusters.
  • Offers a balance between completeness and consistency.
  • Can handle sparse and high-dimensional short texts well, and can provide the prominent tokens of each topic.
  • Offers fast performance and is therefore more suited for production.
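
As a heavily hedged sketch, here is how GSDMM can be run with one community implementation (the gsdmm package and its MovieGroupProcess class, installed from GitHub); the package name, class name, parameter values, and toy corpus are all assumptions, so check the implementation you actually install:

```python
# GSDMM sketch: cluster tokenised short texts, assuming the rwalk/gsdmm package.
from gsdmm import MovieGroupProcess

tokenised_docs = [
    ["cat", "sat", "mat"],
    ["dog", "cat", "pet"],
    ["stock", "market", "fell"],
    ["investor", "market", "volatility"],
]

vocab = {token for doc in tokenised_docs for token in doc}

# K is only an upper bound on the number of clusters: GSDMM leaves some empty.
mgp = MovieGroupProcess(K=8, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(tokenised_docs, len(vocab))   # one cluster label per document

for doc, label in zip(tokenised_docs, labels):
    print(label, doc)
```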

Combined LDA and transformers with BERTopic

The most recent of the algorithms described in this article is called BERTopic, and was developed in 2020.

BERTopic takes its inspiration from transformer-based models, which have become increasingly popular over the past few years. Transformers and their pre-trained models are known for producing more accurate word embeddings [11].

It uses the “sentence-transformers” package, built on top of BERT-like models, to extract embeddings that take the context of each word into account. The UMAP package is used to address dimensionality issues, and lets data scientists control parameters such as the number of local neighbours (how many nearby points are considered when preserving local structure). The model then uses a density-based algorithm named HDBSCAN to create the clusters [12].

What makes BERTopic valuable is its capacity to detect outliers, as well as the fact that it can output what makes the clusters different, through c-TF-IDF, a class-based variant of the TF-IDF algorithm. c-TF-IDF provides an importance score for each word within a topic, which serves as a proxy for information density [13].
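
A minimal BERTopic sketch on the 20 Newsgroups dataset (which scikit-learn downloads on first use); the default sentence-transformers embedding model is assumed, and min_topic_size is an illustrative choice:

```python
# Fit BERTopic on a sample of 20 Newsgroups posts and inspect the topics.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic(min_topic_size=15)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # one row per topic; topic -1 gathers the outliers
print(topic_model.get_topic(0))             # top words and their c-TF-IDF scores for topic 0
```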

Conclusion

As I was writing my research dissertation, I really had a lot of fun reading research papers and trying to first understand how these models worked, then explain their process in my own words, use them on real datasets, and compare their output.

Topic modelling is a fascinating field, with a lot of progress being made over the past few years. What is missing within that space at the moment is a strong, reliable, and universally adopted evaluation framework. In a future article we’ll take a look at Palmetto, a quality measuring tool for topic coherence. But that’s all for today!

References

Abuzayed, A. and Al-Khalifa, H. (2021). BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique. Procedia Computer Science, 189, pp.191–194. doi:10.1016/j.procs.2021.05.096

Baker, K. (2005). Singular Value Decomposition Tutorial. [online] Available at: http://site.iugaza.edu.ps/ahdrouss/files/2010/03/Singular_Value_Decomposition_Tutorial.pdf

Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, [online] 3, pp.993–1022. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.4050

Cheng, X., Yan, X., Lan, Y. and Guo, J. (2014). BTM: Topic Modeling over Short Texts. IEEE Transactions on Knowledge and Data Engineering, [online] 26(12), pp.2928–2941. doi:10.1109/TKDE.2014.2313872.

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794 [cs]. [online] Available at: https://arxiv.org/abs/2203.05794.

Hoffman, M., Bach, F. and Blei, D. (2010). Online Learning for Latent Dirichlet Allocation. [online] Neural Information Processing Systems. Available at: https://proceedings.neurips.cc/paper/2010/hash/71f6278d140af599e06ad9bf1ba03cb0-Abstract.html

Landauer, T.K., Foltz, P.W. and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2-3), pp.259–284. doi:10.1080/01638539809545028.

Lee, D. and Seung, H.S. (2000). Algorithms for Non-negative Matrix Factorization. [online] Neural Information Processing Systems. Available at: https://proceedings.neurips.cc/paper/2000/hash/f9d1152547c0bde01830b7e8bd60024c-Abstract.html

Lee, D.D. and Seung, H.S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, [online] 401(6755), pp.788–791. doi:10.1038/44565.

Smith, A.F.M. and Roberts, G.O. (1993). Bayesian Computation Via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods. Journal of the Royal Statistical Society: Series B (Methodological), 55(1), pp.3–23. doi:10.1111/j.2517-6161.1993.tb01466.x.

Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476), pp.1566–1581. doi:10.1198/016214506000000302.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M. and Brew, J. (2020). HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs]. [online] Available at: https://arxiv.org/abs/1910.03771.

Yin, J. and Wang, J. (2014). A dirichlet multinomial mixture model-based approach for short text clustering. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14. doi:10.1145/2623330.2623715.

Zhao, H., Phung, D., Huynh, V., Jin, Y., Du, L. and Buntine, W. (2021). Topic Modelling Meets Deep Neural Networks: A Survey. arXiv:2103.00498 [cs]. [online] Available at: https://arxiv.org/abs/2103.00498.