‘Building and Mining Corpora for Social Media Discourse Analysis’

Convenor: Grégoire Lacaze

Social media discourse analysis raises the topical question of how to build a corpus of digital posts, and in particular how to determine the limits of that corpus. In this round table, we will discuss the amount and types of data that need to be selected when building a corpus.

Social media platforms are essentially open environments in which new posts and comments can be added with no time limit. This strongly affects the singularity of any corpus compiled at a given moment.

The question of the reproducibility of this data, in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable), will also be addressed. Once the corpora have been compiled, they have to be stored in secure, permanent repositories, which leads us to highlight the importance of open data for long-term analyses.

Once the corpora are built, they can be analysed using data-mining techniques. Different approaches and methodologies will be presented, some of them based on deep learning techniques, including neural networks. Digital corpora need digital tools to be analysed, so algorithms and software such as the open-source tool Iramuteq will be demonstrated.

A recurrent question in corpus building is the dichotomy between qualitative and quantitative analysis.

Confirmed speakers: Bernie Hogan (Oxford Internet Institute), Gudrun Ledegen (Université Rennes 2)

Bernie Hogan (Senior Research Fellow, Oxford Internet Institute, University of Oxford): “Theorising and integrating platform signals into digital text corpora”

The interpretation of online text, like any text, ought to be informed by context. For many texts, that context is provided by related sources, historical documentation, and other features that help to frame and interpret the material. For social media data, that context can be understood on a very granular level through platform signals. These are the metatextual features that guide the user of a social media platform and are based on prior interactions. Examples include Facebook reactions, Twitter favourites, retweet counts, upvotes on most forums, video views, and Reddit gilding. By thinking of text in an unstructured way (the n tweets with a hashtag or all the comments in a forum), we run the risk of misinterpreting the text itself and the audience’s relation to the text. The audience’s ability to communicate with a producer of social media content is considerable and granular, yet its voice is not always heard through text. To illustrate this work, I present and reframe the findings from a prior paper with Jack LaViolette on Men’s Rights. In the paper, we highlight distinctions between a pro-feminist Men’s Lib forum and an anti-feminist Men’s Rights forum. Relevant here is the use of platform signals in this approach. We show that by preferentially sampling on upvoted content we were able to classify content between the forums more effectively, indicating that the audience helped to produce the ideological differences.
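As an illustration only, preferential sampling on upvoted content could be sketched as weighted sampling without replacement, where a post's upvote score serves as its sampling weight. The abstract does not say how the sampling was actually carried out, so the function name `sample_by_score`, the `score` and `id` fields, and the `+ 1` smoothing below are all assumptions made for this sketch, not the authors' method.

```python
import random


def sample_by_score(posts, k, seed=0):
    """Draw up to k distinct posts, favouring those with higher scores.

    Hypothetical helper: uses Efraimidis-Spirakis weighted sampling
    without replacement. Each post receives the key u ** (1 / weight),
    with u uniform in [0, 1), and the k largest keys are kept.
    """
    rng = random.Random(seed)
    keyed = []
    for post in posts:
        # Assumption: negative scores are clamped to 0, and 1 is added
        # so that posts without upvotes can still be drawn.
        weight = max(post["score"], 0) + 1
        keyed.append((rng.random() ** (1.0 / weight), post))
    # Sort on the key only, so the post dictionaries are never compared.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in keyed[:k]]


# Invented example data: three forum comments with platform scores.
posts = [
    {"id": "a", "score": 250},
    {"id": "b", "score": 3},
    {"id": "c", "score": 0},
]
sample = sample_by_score(posts, k=2, seed=42)
```

With weights of this kind, highly upvoted posts dominate the sample while low-scoring posts keep a small chance of inclusion, so the resulting corpus reflects what the audience endorsed rather than everything that was posted.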

Gudrun Ledegen (Université Rennes 2, Laboratoire PREFICS): “Suicide prevention chat, quantitative and qualitative description of a discourse genre for better listening”

Prevention chat systems offer a listening ear to those who wish to talk about their loneliness or suicidal thoughts. This is the case with the SOS Amitié chat room, set up in France in 2005 specifically to gather the words of young people, who are less likely to use the telephone to confide their distress.

Our team of discourse analysts, sociolinguists and computer scientists has been conducting research for four years on a large corpus covering 10 years of the association’s chat conversations. Our research aims to understand the role of different types of formulation in the helping relationship established through this service. How do they allow suffering to be expressed, and what forms of suffering do they help to relieve? We aim to identify the linguistic elements that build this particular helping relationship, in order to contribute to improving the chat-based care provided by SOS Amitié.

For the association’s volunteer listeners, the exercise can be difficult in the absence of intonation, which conveys a large part of the empathy in spoken conversations (as on the phone). However, they manage to get callers to express their concerns and discomfort by weaving together essentially two moments. The first, called phatic, establishes contact and brings together greetings and expressions of empathy: vous souhaitez en parler un peu ? ; je comprends votre souffrance (would you like to talk about it a little?; I understand your suffering). The second, more focused on thematic content, asks for information and explanations and offers reformulations: j’entends au travers de vos mots que vous pensez que ce n’est pas bien, c’est cela que vous aimeriez que je comprenne ? (I hear through your words that you think it is not good; is that what you would like me to understand?)

Like phone calls, chat is a synchronous, remote mode of communication, and it has a series of characteristics that make it attractive to the younger generation in this sensitive kind of interaction. Anonymity can be preserved: face-to-face contact is avoided, and the absence of the voice further limits identification and the revelation of emotions through intonation. Chat nevertheless makes it possible to “do face-to-face in writing” (Marcoccia 2004), and the conversations can stay close to the interlocutors’ ordinary chat habits, especially those of young people. Our explorations also reveal, however, that on a linguistic level this is a rather unusual chat: the exchange adopts a fairly high degree of formality, which allows both sides to maintain a social distance and helps to establish a situation conducive to confiding. When one addresses a stranger in order to confide in him or her, this formal distance underlines the uninvolved posture of the interlocutor.

Our analyses combine qualitative and quantitative approaches, using different textometry software (Lexico 3, TXM …) that allows a constant and essential return to the text through different explorations, while bringing regularities to light. In addition, the team of computer scientists carries out different deep learning explorations to identify sociological and textual metadata (reason(s) for coming to the chat), and to contrast the 10 years of data gathered in the corpus.