The Effect of Text Preprocessing on Bayesian Stochastic Blockmodels for Topic Identification

Brandon Sepulvado

Contact: brandon.sepulvado@gmail.com

Automatically identifying the topics discussed in large bodies of text is an increasingly popular task in computational social science and data science. Across domains as varied as medicine, sociology, political science, and economics, as well as within nonprofits, technology firms, and governments, the proliferation of big data frequently makes it difficult or impossible for individuals to read all of their texts manually. Although topic modeling is probably the best-known approach to unsupervised topic identification, there are strong reasons to explore network analytic approaches: network analysts have studied semantic and socio-semantic networks for at least the past half-century, and Bayesian Stochastic Blockmodels (SBMs) have been shown to perform better than common topic modeling algorithms. Two advantages of Bayesian SBMs are that they do not require an a priori specification of the number of topics, which is frequently unknown, and that they can uncover hierarchical relationships between topics. However, one key uncertainty surrounding the use of SBMs for topic identification is text preprocessing. On the one hand, although published examples of SBMs use little to no preprocessing, it is a staple of natural language processing and, as such, must not be overlooked in any implementation of SBMs for topic identification. On the other hand, preprocessing fundamentally restructures network topology: it changes the set of nodes in the network as well as the density and distribution of edge weights. A central methodological question is thus whether and how commonly used preprocessing techniques alter the topics discovered. We test the robustness of Bayesian SBMs for topic identification by investigating the impact of stop word removal, case standardization, stemming, lemmatization, punctuation removal, number removal, and combinations of these approaches.
We examine quantitative measures of content overlap between the preprocessed and unprocessed texts and engage in qualitative interpretation to ascertain whether the recovered topics change. Two corpora are used for this task. The first consists of scholarly articles on synthetic biology, and the second consists of social media posts from thought leaders in education and education policy. These two sets of texts are useful because they enable a comparison of short and long texts as well as of technical and non-technical vocabularies. The impact of preprocessing on Bayesian SBMs can be drastic. Certain steps, such as case standardization and punctuation removal, have only minor effects, but lemmatization and certain combinations of preprocessing steps either dramatically flatten the recovered topic structure or result in a total inability to discern topics. As an external gauge of the effect of preprocessing techniques, we run the same types of comparisons using Correlated Topic Models (CTMs). The CTMs seem less sensitive to preprocessing; however, because the Bayesian SBMs capture more detailed, hierarchical topic structures, the CTMs have less information to lose. We conclude by noting the conditions under which Bayesian SBMs are useful for topic identification and by suggesting future directions for methodological development.
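To make the point about network topology concrete, the following is a minimal sketch (not the pipeline used in this study) of how a few of the preprocessing steps named above change the word nodes and weighted edges of a document-word network. The two example sentences, the tiny stop word list, the function names, and the use of Jaccard similarity as an overlap measure are all assumptions made for illustration. Stemming and lemmatization are omitted because they would require an external NLP library, and no SBM is fitted here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "are"}


def preprocess(text, lowercase=False, strip_punct=False,
               strip_numbers=False, remove_stop=False):
    """Split on whitespace after applying the selected preprocessing steps."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = re.sub(r"[^\w\s]", "", text)  # drop punctuation characters
    if strip_numbers:
        text = re.sub(r"\d+", "", text)      # drop digit runs
    tokens = text.split()
    if remove_stop:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


def bipartite_edges(docs, **steps):
    """Weighted document-word edges: (document index, word) -> count."""
    edges = Counter()
    for i, doc in enumerate(docs):
        for token in preprocess(doc, **steps):
            edges[(i, token)] += 1
    return edges


def word_nodes(edges):
    """The set of word nodes that appear in at least one edge."""
    return {word for _, word in edges}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up example documents, not drawn from either corpus.
docs = ["The cell is a chassis.", "Engineering the Cell in 2 steps."]

raw = bipartite_edges(docs)
clean = bipartite_edges(docs, lowercase=True, strip_punct=True,
                        strip_numbers=True, remove_stop=True)

print("word nodes:", len(word_nodes(raw)), "->", len(word_nodes(clean)))
print("edges:     ", len(raw), "->", len(clean))
print("node overlap (Jaccard):", jaccard(word_nodes(raw), word_nodes(clean)))
```

Even on two short sentences, the combined steps remove most word nodes and merge case variants such as "cell" and "Cell" into one node, which is the kind of structural change a blockmodel fitted to the network would then see.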
