Estimating the Size of the Twitter Universe

Traian Marius Truta, Alina Campan, Joseph Nolan, Emily Clemmons, Jenna Wallin

Contact: trutat1@nku.edu

Twitter provides two Application Programming Interfaces (APIs), the Streaming API and the Search API, which allow programmers to collect tweets. The Streaming API collects tweets in real time and offers two access levels: Free and Decahose. The Free version returns a sample of all public tweets. The API documentation does not explain how this sampling is performed or how large the sample is relative to the entire population of tweets. The Free service can also be used with filtering keywords; in other words, only tweets that contain specific search keywords are collected. Again, the Twitter documentation does not explain how this filtering is accomplished or whether, and when, all matching tweets are returned. Moreover, Twitter has not publicly announced the average number of tweets produced per day in recent years. Our primary contribution is a framework that uses the Free version of the Twitter Streaming API to estimate, within a certain confidence interval, the number of tweets created on Twitter during a collection time window. We tested our approach on data collected for two full days (April 12, 2019 and May 26, 2019, EST). For each of these days, we collected 10 sets of tweets: the unfiltered sample of all public tweets and nine sets of filtered tweets, one for each of the following commonly used keywords: AKU (“me” in Malay), BE, GOOD, HOPE, ICH (“I” in German), MONEY, QUE (“what” in Spanish), TIME, and WEEK. For each keyword, we computed the total number of tweets containing that keyword during our collection time windows by summing the number of collected tweets and the number of lost tweets, which Twitter reports as “limit entries” alongside the collected tweets. We were also able to reverse engineer Twitter's filtering process, and we applied this process to the collected unfiltered sample.
Using our version of the filtering, as reverse engineered from Twitter, we determined the number of tweets containing each of the nine keywords in the unfiltered sample collected at the same time. From this information, we estimated for each keyword (via a point estimate and a 95% confidence interval) the total number of tweets for the entire day, as well as for each individual hour. For the first day (April 12, 2019), the pooled mean of the estimated total number of tweets is 364,885,632; for the second day (May 26, 2019), it is 363,808,197. Over the course of the day, the estimated hourly volume of tweets ranges from 11,053,104 (between 1 AM and 2 AM) to 20,880,152 (between 10 AM and 11 AM) on April 12, 2019, and from 12,218,267 (between 2 AM and 3 AM) to 20,352,464 (between 11 AM and 12 PM) on May 26, 2019. The proposed approach can be used to compute such statistics on the volume of tweets, or more specifically the volume of keyword-matching tweets, for a collection window of any duration.
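
The abstract does not state the estimator itself, so the following is only a minimal sketch of one way a per-keyword estimate could be formed. It assumes that the share of keyword-matching tweets in the unfiltered sample (k out of n) approximates their share among all tweets (K out of N), which gives N ≈ K · n / k, and it uses a normal-approximation (Wald) interval for that share. The function name, parameter names, and input numbers are hypothetical and are not taken from the study.

```python
import math


def estimate_total_tweets(keyword_total, sample_keyword_count, sample_size, z=1.96):
    """Estimate the total number of tweets N from a single keyword.

    keyword_total        -- K: all tweets matching the keyword in the window
                            (collected tweets plus tweets reported as lost)
    sample_keyword_count -- k: tweets in the unfiltered sample matching the keyword
    sample_size          -- n: size of the unfiltered sample
    z                    -- normal quantile (1.96 for a 95% interval)

    Assumption (not from the source): k / n estimates K / N, so N ~ K * n / k.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not 0 < sample_keyword_count <= sample_size:
        raise ValueError("need 0 < sample_keyword_count <= sample_size")

    # Observed share of keyword tweets in the unfiltered sample.
    p = sample_keyword_count / sample_size

    # Wald standard error for a binomial proportion.
    se = math.sqrt(p * (1.0 - p) / sample_size)
    p_low, p_high = p - z * se, p + z * se
    if p_low <= 0:
        raise ValueError("sample too small for a normal-approximation interval")

    point = keyword_total * sample_size / sample_keyword_count
    # A larger share implies a smaller total, so the bounds swap.
    return point, keyword_total / p_high, keyword_total / p_low


# Hypothetical inputs, chosen only to illustrate the call.
point, low, high = estimate_total_tweets(
    keyword_total=1_000_000,
    sample_keyword_count=10_000,
    sample_size=3_600_000,
)
print(f"estimate: {point:,.0f}  (95% CI: {low:,.0f} to {high:,.0f})")
```

Under this assumption each keyword yields its own estimate of the same total, which is consistent with the abstract's reporting of a pooled mean across keywords; how the pooling was actually done is not specified here.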
