Topic Modeling: A Survey
With topic modeling techniques, we can discover latent topics from document collections. It is worth noting that, while visualizing topics remains an open problem, a topic is usually represented as a list of top-ranked words which together convey some sense of the semantic meaning behind the topic. In the topic modeling community, researchers have shown interest in a variety of problems, such as relaxing the bag-of-words assumption, sparsity, hierarchy, dynamic models, and the use of metadata. In what follows, we discuss each of these problems and introduce salient approaches that have been proposed to solve them.
1. Beyond bag of words
Most topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), rely on the bag-of-words assumption, i.e., that words occur independently in a document. This assumption, however, is not well suited to natural language. In general, phrases taken as a whole help discover more interpretable topics, since they are more informative than the sum of their individual components. For example, "white house" as a phrase carries a special meaning under the "politics" topic, rather than "a house which is white" under the "real estate" topic.
Many models [2, 3, 4, 5] have been proposed to relax the bag-of-words assumption. Among them, Griffiths et al. (2005) presented a model in which words are generated either conditioned on a topic or conditioned on the previous word in a bigram. Wallach (2006) proposed the bigram topic model (BTM), which combines the hierarchical Dirichlet language model [6] with LDA so as to capture both the phrase-level (short-distance) dependencies between adjacent words and the semantic (long-distance) dependencies between words related to the same topic. Wang et al. (2007) proposed the topical n-gram model (TNG), in which, after being assigned a topic, each word in a document can be generated either by a topic-specific unigram distribution (as in LDA) or by a topic-specific bigram distribution (i.e., a conditional distribution on the previous word). Lindsey et al. (2012) proposed PDLDA, which simultaneously segments a corpus into phrases of varying lengths and assigns topics to them, based on the assumption that the topic of a sequence of tokens changes periodically and that the tokens between changepoints comprise a phrase.
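As a concrete, if simplified, illustration of how phrase information can be exploited in practice, the following sketch (assuming the gensim library and a small, hypothetical tokenized corpus) first detects frequent bigrams with gensim's Phrases model and then runs standard LDA on the rephrased corpus. This is only a common preprocessing shortcut, not an implementation of BTM, TNG, or PDLDA.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, Phrases

    # Toy tokenized corpus (hypothetical); in practice this is the real collection.
    texts = [
        ["the", "white", "house", "issued", "a", "statement", "on", "policy"],
        ["the", "white", "house", "met", "with", "congress", "on", "policy"],
        ["a", "white", "house", "for", "sale", "in", "the", "suburbs"],
        ["spacious", "white", "house", "with", "a", "garden", "for", "sale"],
    ]

    # Learn frequent bigrams; merged tokens such as "white_house" become single vocabulary items.
    bigram = Phrases(texts, min_count=2, threshold=1.0)
    phrased_texts = [bigram[doc] for doc in texts]

    # Standard LDA on the phrase-augmented corpus.
    dictionary = Dictionary(phrased_texts)
    corpus = [dictionary.doc2bow(doc) for doc in phrased_texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)
    print(lda.show_topics(num_words=5))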
2. Sparsity
In general, an individual document usually focuses on a few salient topics rather than covering a wide variety of topics. At the same time, each topic is supposed to be clearly distinct from the others; that is, a topic should use a narrow range of terms rather than cover a wide portion of the vocabulary.
To introduce sparsity into topic models, many approaches have been proposed, which can generally be grouped into two camps: 1) non-probabilistic sparse coding or matrix factorization, and 2) probabilistic graphical topic models using specific priors or infinite stochastic processes.
Sparse topical coding (STC), proposed by Zhu and Xing (2011), is the representative approach in the first category. STC provides a feasible framework for imposing various sparsity constraints directly, at the expense of losing the probabilistic representation of topics. More work has been done in the second category. Wang and Blei (2009) and Williamson et al. (2010) introduced a spike-and-slab prior and the Indian buffet process to model sparsity in finite and infinite latent topic structures of text. Jenatton et al. (2010) proposed a tree-structured sparse regularization to learn dictionaries embedded in a hierarchy, which emphasizes sparsity in the number of dictionary components that are active for a given document. Eisenstein et al. (2011) proposed the sparse additive generative model (SAGE), in which each latent topic is endowed with a model of its deviation in log-frequency from a constant background distribution. Chen et al. (2012) proposed the contextual focused topic model (cFTM) using the hierarchical beta process. Lin et al. (2014) proposed a dual-sparse topic model (DsparseTM) that addresses sparsity in both the topic mixtures and the word usage.
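None of the models above reduce to a one-line recipe, but the intuition behind sparsity-inducing priors can be conveyed with plain LDA: small symmetric Dirichlet hyperparameters push each document toward a few topics and each topic toward a few words. The sketch below (gensim, toy corpus; not STC or a spike-and-slab construction) simply sets alpha and eta to small values.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy tokenized corpus (hypothetical).
    texts = [
        ["stock", "market", "trading", "price"],
        ["market", "price", "shares", "trading"],
        ["soccer", "match", "goal", "league"],
        ["league", "match", "season", "goal"],
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    # Small symmetric Dirichlet priors: alpha controls document-topic sparsity,
    # eta controls topic-word sparsity.
    sparse_lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        alpha=0.01,
        eta=0.01,
        passes=50,
        random_state=0,
    )
    # Each document should now load heavily on a single topic.
    print(sparse_lda.get_document_topics(corpus[0], minimum_probability=0.0))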
3. Hierarchy
LDA is a three-level topic model that can only discover a flat layer of topics; in other words, it cannot find hierarchical topics. People, however, are used to organizing topics (and many other things) hierarchically. Walking into a large bookstore, one probably first looks for the literature section, then the fiction section, and finally the science fiction books on the shelf. In this example, "book" is a general topic, "literature book" is a more specialized topic under it, "fiction book" is more specialized still, and "science fiction book" is clearly a subtopic of "fiction book". For these reasons, researchers in the topic modeling field have been interested in modeling hierarchical topics for many years.
The nested Chinese restaurant process (nCRP) (Blei et al., 2010) is a Bayesian nonparametric model for discovering infinitely deep, infinitely branching topic hierarchies. Along a path from the root to a leaf in the topic tree, the topics represented by the nodes become more and more specific. However, the nCRP is limited in that each document can only select topics from a single path in the tree. The nested hierarchical Dirichlet process (nHDP) (Paisley et al., 2012) generalizes the nCRP by relaxing this rigid, single-path formulation. Note that all of the methods mentioned above model a topic hierarchy as a tree, which is not the case in many situations. For example, "clustering" as a topic can be placed under the topic "machine learning" as well as under the topic "data mining", which makes the topic hierarchy a directed acyclic graph (DAG) instead of a tree. Li and McCallum (2006) proposed the pachinko allocation model (PAM), which can capture arbitrary and nested correlations among topics using a DAG structure. The hierarchical PAM (hPAM), an enhancement of PAM, was later proposed by Mimno et al. (2007) to model DAG-structured topic hierarchies.
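The nonparametric machinery behind the nCRP, nHDP, and hPAM does not fit in a short listing, but the basic idea of refining topics level by level can be sketched naively (gensim, toy corpus): run LDA once, group documents by their dominant topic, and run LDA again inside each group. This is a rough two-level approximation, not the nested models themselves.

    from collections import defaultdict
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy tokenized corpus (hypothetical).
    texts = [
        ["neural", "network", "training", "clustering", "data"],
        ["clustering", "algorithm", "data", "mining", "patterns"],
        ["stock", "market", "price", "trading", "shares"],
        ["market", "trading", "bonds", "price", "rates"],
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    # Level 1: coarse root topics.
    root = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)

    # Group documents by their dominant root topic.
    groups = defaultdict(list)
    for doc, bow in zip(texts, corpus):
        topic_probs = root.get_document_topics(bow, minimum_probability=0.0)
        dominant = max(topic_probs, key=lambda pair: pair[1])[0]
        groups[dominant].append(doc)

    # Level 2: finer subtopics within each group.
    subtopics = {}
    for topic_id, docs in groups.items():
        sub_dictionary = Dictionary(docs)
        sub_corpus = [sub_dictionary.doc2bow(d) for d in docs]
        subtopics[topic_id] = LdaModel(
            corpus=sub_corpus, id2word=sub_dictionary, num_topics=2, passes=50, random_state=0
        )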
4. Dynamic models
Topics usually change over time. Tracking the evolution of existing topics and detecting emerging ones has attracted much attention in both academia and industry. LDA itself is a static model, but many dynamic variants of topic models have been proposed over the past decade.
Blei and Lafferty (2006) used state space models on the natural parameters of the topic-specific word distributions to model the evolution of topics. Wang and McCallum (2006) assumed that the meaning (i.e., the probability distribution over words) of a particular topic is stable, but that the topics' occurrences and correlations change over time. Based on this assumption, each topic is associated with a distribution over timestamps, and each document is generated by considering both its word occurrences and its timestamp. Iwata et al. (2010) assumed that topics naturally evolve at multiple timescales, that is, some of them change dramatically in a short time while others change little over a long period. Accordingly, they proposed an online topic model that considers both long-timescale and short-timescale dependencies and can sequentially analyze the evolution of topics in document collections. Kim and Oh (2011) presented a probabilistic topic modeling framework that identifies long-term topics and temporary issues and detects focus shifts within each topic chain, where topic chains are constructed using a topic similarity metric. Saha and Sindhwani (2012) proposed a dynamic NMF approach that can discover emerging topics and track the evolution of existing topics in streaming text data.
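For readers who want to experiment, gensim ships a port of the dynamic topic model of Blei and Lafferty (2006) as LdaSeqModel; the sketch below (toy corpus; argument names follow recent gensim versions) fits it over two time slices and inspects how a topic is worded in each slice.

    from gensim.corpora import Dictionary
    from gensim.models import LdaSeqModel

    # Toy tokenized corpus, ordered by time (hypothetical).
    texts = [
        ["election", "campaign", "votes", "policy"],     # time slice 1
        ["campaign", "debate", "votes", "candidate"],    # time slice 1
        ["election", "results", "recount", "votes"],     # time slice 2
        ["results", "votes", "turnout", "winner"],       # time slice 2
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    # time_slice lists the number of documents in each slice, in corpus order.
    ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=[2, 2], num_topics=2)

    # The same topic, as worded at each point in time.
    print(ldaseq.print_topics(time=0))
    print(ldaseq.print_topics(time=1))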
5. Using metadata
Most topic models take individual documents as input and output latent topics. In practice, a document is usually linked to other documents and associated with metadata (e.g., authors, publication date, and venue). How to exploit this associated information to improve topic modeling is an interesting question that has drawn much research attention.
Rosen-Zvi et al. (2004) proposed the author-topic model, which uses a topic-based representation to model both the content of documents and the interests of authors by combining a topic model and an author model. Steyvers et al. (2004) proposed a probabilistic author-topic model in which each word of a document is generated by first choosing one of the document's authors at random, then choosing a topic for that author, and finally sampling the word from that topic. Liu et al. (2009) developed a Bayesian hierarchical model that performs topic modeling and author community discovery in one unified framework. Nallapati et al. (2008) presented a topic modeling framework that models text and citations jointly.
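gensim also provides an implementation of the author-topic model; a minimal sketch (toy corpus and hypothetical author names; argument names follow recent gensim versions) maps each author to the documents he or she wrote and then queries that author's topic interests.

    from gensim.corpora import Dictionary
    from gensim.models import AuthorTopicModel

    # Toy tokenized corpus and author metadata (hypothetical).
    texts = [
        ["topic", "model", "inference", "dirichlet"],
        ["neural", "network", "training", "layers"],
        ["topic", "model", "neural", "embedding"],
    ]
    author2doc = {
        "alice": [0, 2],  # author name -> indices of the documents she wrote
        "bob": [1, 2],
    }
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    atm = AuthorTopicModel(
        corpus=corpus,
        id2word=dictionary,
        author2doc=author2doc,
        num_topics=2,
        passes=50,
    )
    # Topic interests of a single author.
    print(atm.get_author_topics("alice"))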
6. Limitations of Topic Modeling
Classical topic models such as LDA have four major limitations: 1) they explore only bag-of-words information, which is just a part of the information available in heterogeneous document networks; 2) they regard the words in the vocabulary as unique tokens regardless of the semantic relations among them (for example, "analysis" is semantically much closer to "analytics" than to an unrelated word like "cleaning"); 3) they perform poorly on short documents or small collections; and 4) they usually do not scale to large datasets (e.g., millions of documents and unique words) on a single machine.
7. Relations to Other Techniques
7.1 Word Representation
The relationship between topic modeling and word representation is an interesting one. Generally speaking, topic modeling focuses on global (document-level) co-occurrence information, while word representation focuses on local contexts. In addition, topic modeling captures topical similarities among words, while word representation captures semantic similarities among words. Following this line of reasoning, several works have improved topic modeling by leveraging the power of word representations [29, 30, 31, 32].
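The contrast can be made concrete with a small experiment (gensim, toy corpus; parameter names follow gensim 4.x): Word2Vec neighbors reflect local-context similarity between words, while the top words of an LDA topic reflect corpus-level co-occurrence.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, Word2Vec

    # Toy tokenized corpus (hypothetical).
    texts = [
        ["data", "analysis", "reveals", "trends"],
        ["data", "analytics", "reveals", "patterns"],
        ["data", "cleaning", "removes", "noise"],
        ["statistical", "analysis", "of", "patterns"],
    ]

    # Local-context (semantic) similarity between words.
    w2v = Word2Vec(texts, vector_size=20, window=3, min_count=1, epochs=200, seed=0)
    print(w2v.wv.most_similar("analysis", topn=3))

    # Corpus-level (topical) association: words ranked within one topic.
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)
    print(lda.show_topic(0, topn=5))

On a corpus this small neither model produces meaningful neighbors, but the two printouts illustrate the different notions of similarity being estimated.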
7.2 Neural Document Modeling
Topic modeling represents a document as a topic-proportion vector on the topic simplex, where each dimension corresponds to a specific topic, while neural document modeling [33, 34] aims at representing a document as a dense vector whose dimensions are not topics but hopefully capture some semantic sense of the document. The former is a distributional representation, while the latter is a distributed representation.
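The two representations can be compared side by side in the sketch below (gensim, toy corpus): LDA yields a topic-proportion vector on the simplex, whereas Doc2Vec, used here merely as a convenient stand-in for a neural document model, yields a dense vector with no per-dimension interpretation.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy tokenized corpus (hypothetical).
    texts = [
        ["topic", "models", "discover", "latent", "themes"],
        ["neural", "networks", "learn", "dense", "vectors"],
        ["documents", "are", "represented", "as", "vectors"],
    ]

    # Distributional representation: topic proportions (each dimension is a topic).
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)
    print(lda.get_document_topics(corpus[0], minimum_probability=0.0))

    # Distributed representation: a dense vector whose dimensions are not topics.
    tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(texts)]
    d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=100)
    print(d2v.infer_vector(texts[0]))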
References
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[2] Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum. 2005. Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, pages 537–544. MIT Press.
[3] H. M. Wallach. Topic modeling: beyond bag-of-words. In ICML, 2006.
[4] X. Wang, A. McCallum, and X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In ICDM, 2007.
[5] R. V. Lindsey, W. P. Headden, III, and M. J. Stipicevic. A phrase-discovering topic model using hierarchical Pitman-Yor processes. In EMNLP-CoNLL, 2012.
[6] MacKay, D. J. C., & Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1, 289–307.
[7] Wang, Chong and Blei, David M. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, 2009.
[8] Williamson, S., Wang, C., Heller, K., and Blei, D. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.
[9] Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. Proximal methods for sparse hierarchical dictionary learning. In ICML, 2010.
[10] J. Eisenstein, Amr Ahmed, and Eric P. Xing. Sparse additive generative models of text. In ICML, 2011.
[11] Nallapati, R. M., Ahmed, A., Xing, E. P., and Cohen, W. W. Joint latent topic models for text and citations. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 542-550). ACM, 2008.
[12] Liu, Yan, Alexandru Niculescu-Mizil, and Wojciech Gryc. Topic-link LDA: joint models of topic and author community. In Proceedings of the 26th annual international conference on machine learning. ACM, 2009.
[13] J. Zhu and E. P. Xing. Sparse topical coding. In UAI, pages 831–838, 2011.
[14] X. Chen, M. Zhou, and L. Carin. The contextual focused topic model. In KDD, pages 96–104, 2012.
[15] Lin, T., Tian, W., and Mei, Q. The dual-sparse topic model: mining focused topics and focused terms in short text. In International Conference on World Wide Web, 2014.
[16] D. Blei, T. Griffiths, and M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, vol. 57, no. 2, pp. 7:1–30, 2010.
[17] J. Paisley, C. Wang, D. Blei, and M. Jordan. Nested hierarchical Dirichlet processes. arXiv preprint arXiv:1210.6738, 2012.
[18] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[19] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with pachinko allocation. In ICML, 2007.
[20] D. Kim and A. Oh. Topic chains for understanding a news corpus. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer Berlin Heidelberg, 2011.
[21] D. Blei and J. D. Lafferty. Dynamic topic models. In ICML, 2006.
[22] Wang, Xuerui, and Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.
[23] A. Saha and V. Sindhwani. Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization. In Proceedings of the fifth ACM international conference on Web search and data mining. ACM, 2012.
[24] T. Iwata, T. Yamada, Y. Sakurai, and N. Ueda. Online multiscale dynamic topic models. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
[25] Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. Learning syntactic patterns for automatic hypernym discovery. Advances in Neural Information Processing Systems 17 (2004).
[26] Wang, Chi, Marina Danilevsky, Nihit Desai, Yinan Zhang, Phuong Nguyen, Thrivikrama Taula, and Jiawei Han. A phrase mining framework for recursive construction of a topical hierarchy. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 437-445. ACM, 2013.
[27] Rosen-Zvi, Michal, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 487-494. AUAI Press, 2004.
[28] Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004, August). Probabilistic author-topic models for information discovery. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 306-315). ACM.
[29] Das, Rajarshi, Manzil Zaheer, and Chris Dyer. "Gaussian LDA for topic models with word embeddings." Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. 2015.
[30] Nguyen, Dat Quoc, et al. “Improving Topic Models with Latent Feature Word Representations.” Transactions of the Association for Computational Linguistics 3 (2015): 299-313.
[31] Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In AAAI, pages 2418–2424.
[32] Shaohua Li, Tat-Seng Chua, Jun Zhu and Chunyan Miao. “Generative Topic Embedding: a Continuous Representation of Documents”. In the Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) 2016, pp. 666-675.
[33] Salakhutdinov, Ruslan, and Geoffrey Hinton. "Semantic hashing." International Journal of Approximate Reasoning 50.7 (2009): 969-978.
[34] Hinton, Geoffrey, and Ruslan Salakhutdinov. “Discovering binary codes for documents by learning deep generative models.” Topics in Cognitive Science 3.1 (2011): 74-91.