No comments - Addressing comment sections for network and content analyses of webpages
Paul Guille-Escuret, Florian Cafiero, Jeremy Ward

We advocate for the removal or extraction of “comment sections”, illustrating the biases they generate through a case study of anti-vaccine websites. We provide a semi-automated method for extracting these sections; the poster points to the R code on GitHub.
* How comments bias most analyses
Comments can express an opinion radically different from the one asserted in the page itself. Hyperlinks in comments can point to content that the website's owner does not endorse, which can distort any network analysis.
In our example, we study the community of French vaccine-critical websites. When keeping the comment sections and analyzing citation patterns, one could think that the various vaccine-critical movements form a tightly knit community. Yet once the comments are removed, a majority of the actors we classified as "reformists marginally interested in vaccines" become isolated from the main graph: comments pointing to radical actors were artificially linking them to the rest of the community.
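This effect can be sketched with a toy citation graph in base R. The node names and edges below are invented for illustration; only the structure matters: a single comment-derived hyperlink is what connects the "reformist" node to the rest of the graph.

```r
# Toy illustration (made-up nodes and edges) of how a comment-derived
# hyperlink can artificially connect an otherwise isolated actor.
edges <- data.frame(
  from       = c("reformist_blog", "radical_site_A", "radical_site_B"),
  to         = c("radical_site_A", "radical_site_B", "radical_site_A"),
  in_comment = c(TRUE, FALSE, FALSE),  # only the first link sits in a comment
  stringsAsFactors = FALSE
)

# Count how often each node appears in the edge list (a crude degree measure)
degree_of <- function(e) table(c(e$from, e$to))

degree_of(edges)                       # reformist_blog appears connected...
degree_of(edges[!edges$in_comment, ])  # ...but vanishes once comment edges go
```

With the comment edge included, `reformist_blog` looks like part of the community; after filtering on `in_comment`, it no longer appears in the graph at all.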
The vocabulary used in comments can also bias content analyses, for various reasons. Because of a malfunctioning spam filter, a radical anti-vaccine website can lexically pass for an online thrift shop. Protracted debates between users make it look as if a prominent vaccine-critical blog mostly discusses HPV and hepatitis vaccines, overshadowing the fact that its author's main concerns are the management of the H1N1 crisis and the general vaccine policy of Minister Roselyne Bachelot.
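A minimal term-frequency sketch makes the mechanism concrete. The two strings below are invented stand-ins for a post and its comment thread, not real data:

```r
# Toy sketch (invented text) of how comment vocabulary can skew term counts.
post     <- "h1n1 vaccine policy h1n1 bachelot"        # hypothetical post text
comments <- "hpv hpv hepatitis hpv debate hepatitis"   # hypothetical comments

# Simple whitespace tokenizer + frequency table, most frequent term first
tf <- function(txt) sort(table(strsplit(txt, " ")[[1]]), decreasing = TRUE)

tf(paste(post, comments))  # comment terms dominate the combined counts
tf(post)                   # the post's own focus reappears without them
</test>
```

On the combined text the top term comes from the comments; on the post alone, the author's actual topic ranks first.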
It is thus key to eliminate these comments, or to set them aside for separate analyses.
* Separating the comments from the page: a tedious task
Removing or extracting the comment sections from a set of websites is a tedious task, and it is therefore rarely performed. Pages can be encoded with many technologies: HTML 4.0/5.0, XHTML, Ajax, Ruby on Rails, etc. Some standards exist, for instance for blog platforms, but they are not widely adopted. And unexpected ways of creating a comment section (e.g. as a subpart of a forum) occur frequently.
* Aiming at exhaustiveness: a necessity
Focusing only on easily retrievable comment sections would introduce important biases. The way a comment section is encoded is itself a socially meaningful phenomenon, reflecting the user's literacy in web programming or their financial means. Excluding very poorly encoded pages, or virtuoso content written by expert programmers, could thus amount to excluding specific groups from any further analysis.
* Extracting comments
The method we propose is not fully automated: it requires directly identifying, in the code, the patterns that delimit comment sections and individual comments. Some patterns are relevant to many websites, while others need to be carefully designed for a single use. We therefore provide a list of the regular patterns we noticed.
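To give the flavor of such patterns, the regular expressions below target a few common comment-container conventions (WordPress-style `id="comments"`, Disqus's `disqus_thread`, and generic `class="comment…"` markers). They are illustrative examples, not the list provided with the poster, and real-world use would require site-specific additions:

```r
# Illustrative (hypothetical) regexes for locating comment sections in raw HTML.
# These reflect common markup conventions; they are not an exhaustive list.
comment_patterns <- c(
  wordpress = '<div[^>]*id="comments"',
  disqus    = '<div[^>]*id="disqus_thread"',
  generic   = '<(div|section|ol|ul)[^>]*class="[^"]*comment'
)

# TRUE if any known pattern matches the raw HTML of a page
has_comments <- function(html) {
  any(vapply(comment_patterns,
             function(p) grepl(p, html, ignore.case = TRUE),
             logical(1)))
}
```

A pattern library like this handles the "relevant for many websites" cases; single-use patterns would be appended per site.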
We then provide an R implementation that carries out the rest of the procedure: after automated quality checks, the links and content coming from comments are subtracted, and the comment-free pages can be analyzed. Comment sections and their metadata can also be extracted for separate analyses.
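The subtraction step can be sketched in a few lines of base R. This is a simplified stand-in for the GitHub implementation, assuming a hypothetical regex for the comment container; it ignores nested containers, which the full procedure would have to handle:

```r
# Minimal base-R sketch: subtract a comment section (matched by a hypothetical
# regex) from raw HTML, keeping its hyperlinks for separate analyses.
strip_comments <- function(html,
                           section_re = '<div id="comments">.*?</div>') {
  section <- regmatches(html, regexpr(section_re, html))
  links <- character(0)
  if (length(section)) {
    # Collect href targets found inside the comment section
    links <- regmatches(section, gregexpr('href="[^"]*"', section))[[1]]
    links <- sub('^href="', "", sub('"$', "", links))
    # Subtract the section so the page can be analyzed comment-free
    html <- sub(section_re, "", html)
  }
  list(clean_html = html, comment_links = links)
}
```

For example, on `'<p>post</p><div id="comments"><a href="http://x.org">x</a></div>'` this returns the comment-free page `'<p>post</p>'` alongside the comment-only link `"http://x.org"`.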