Case Study: Using Natural Language Processing to Analyze How Reddit Users Discuss Emotional Topics / by Daniel Berry

Overview

With millions of posts and comments a day, Reddit is where people around the world go for news, commentary and community. Using Natural Language Processing, I was able to create a model that correctly identified which of two subreddits a post was shared on 94.8% of the time. These results were then used to determine the differences and similarities between how people seek support and discuss emotional topics on these subreddits. 

The Context

The QAnon conspiracy theory has taken root in the U.S. in recent years. Although it started off as a fringe idea shared on 4chan message boards, the first Congressional nominee to openly support the theory was recently elected to the House of Representatives.

In July 2019, r/QAnonCasualties was created as an online community where individuals with relatives and friends who have expressed support for the theories can gather for support and resources.

Almost a decade earlier, in October 2009, the r/offmychest subreddit was created as a space where people could find mutual support and share experiences they aren't able to tell those in their lives, whether it's something happy or something that's been weighing on them.

The Challenge

Reddit users who post to the r/offmychest and r/QAnonCasualties subreddits do so to share their experiences and find support. Using Natural Language Processing, I sought to create a model that could predict which subreddit a post came from, and to determine the differences in how people talk in each subreddit.

The Process

While examining the data, I found a number of posts that were either null or contained only `[removed]` or `[deleted]`, all of which I dropped from the dataset. I also discovered that a significant number of posts from the r/QAnonCasualties subreddit contained variations of `QAnon` or `conspiracy`. Some of these terms also appeared in the r/offmychest subreddit, but not nearly to the same extent.

I created a function to remove these and similar terms from the text, and ran it on the combined dataframe containing the text from both subreddits, so as to prevent the target from leaking into my features (the X variable).
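The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the function used in the project: the names `is_usable` and `strip_leaky_terms`, the regular expression, and the sample posts are all assumptions, since the exact term list is not given here. It works on plain strings using only the standard library.

```python
import re

# Placeholder bodies Reddit leaves behind for removed or deleted posts.
PLACEHOLDERS = {"[removed]", "[deleted]"}

# Hypothetical pattern: the real term list is not given in this write-up.
# Matches words such as "QAnon", "Q-Anon", "conspiracy", "conspiracies".
LEAKY_TERMS = re.compile(r"\b(?:q-?anon\w*|conspirac\w*)\b", re.IGNORECASE)


def is_usable(text):
    """Return True for posts with real text (not null, empty, or a placeholder)."""
    if text is None:
        return False
    stripped = text.strip()
    return stripped != "" and stripped not in PLACEHOLDERS


def strip_leaky_terms(text):
    """Remove subreddit-revealing words, then collapse leftover whitespace."""
    cleaned = LEAKY_TERMS.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


# Made-up example posts, not data from the project.
posts = [
    "My mom fell into QAnon last year.",
    "[removed]",
    None,
    "He believes every conspiracy he reads.",
]

cleaned_posts = [strip_leaky_terms(p) for p in posts if is_usable(p)]
print(cleaned_posts)
```

With a pandas dataframe, the same two functions could be applied to the text column, for example by filtering rows with `is_usable` and then mapping `strip_leaky_terms` over the remaining text.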

Reviewing the most commonly used words in both subreddits, I found that people are more likely to use words related to the political right and their families in r/QAnonCasualties (e.g., 'trump', 'right', 'mom', 'family'), whereas people are more likely to use words related to those outside of their family in r/offmychest (e.g., 'people', 'friends'). Across both subreddits, people use words related to their inner thoughts and feelings (e.g., 'know', 'feel', 'believe', 'want', 'think'). Below are graphs showing the top words used in each subreddit.

Posts in r/QAnonCasualties often use words related to the political right and family.

Posts in r/offmychest use some similar words to those in r/QAnonCasualties; however, users tend to write more about those outside their family.
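
Word counts like the ones behind these charts can be produced in a few lines. The sketch below is only an assumption about the approach: it uses the standard library's `Counter` with a tiny hand-written stop-word list, whereas the project may well have used a vectorizer from a machine learning library. The function name `top_words` and the example posts are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "to", "of", "my", "i",
    "is", "in", "that", "it", "with", "me",
}


def top_words(posts, n=5):
    """Count lowercase word tokens across posts, skipping stop words."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z']+", post.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up example posts, not data from the project.
family_posts = [
    "My mom and my family argue about trump",
    "I think my mom is lost",
]
friend_posts = [
    "I feel like people and friends ignore me",
    "People think I want attention",
]

print(top_words(family_posts, n=3))
print(top_words(friend_posts, n=3))
```

The resulting (word, count) pairs can be passed straight to a bar chart to compare the two subreddits side by side.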

To find the best model, I tested Logistic Regression, Random Forest and K-Nearest Neighbors models to see which would most accurately predict the subreddit a post came from. Following this process, I found that the Logistic Regression model performed the best, with ~95% accuracy on both the train and the test data.

The Random Forest model performed better on the train data (~99% accurate) and worse on the test data (~93% accurate), indicating that it was overfit to the training data and not a good model to use for this project. The K-Nearest Neighbors model performed worse than the Logistic Regression and Random Forest models on both the train (~78% accurate) and the test (~75% accurate) data.
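
A comparison along these lines can be sketched with scikit-learn. Everything specific here is an assumption made for illustration: the toy corpus, the use of `CountVectorizer`, the hyperparameters, and the split size are not taken from the project, and the scores on such a tiny sample say nothing about the real results. The point is the pattern of scoring each model on both train and test data, where a large gap between the two suggests overfitting.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for the cleaned posts.
# Label 1 = r/QAnonCasualties, label 0 = r/offmychest.
texts = [
    "my mom believes everything trump says",
    "my dad and my family fight about politics",
    "my brother shares right wing videos with family",
    "mom stopped talking to the family over trump",
    "my sister thinks the election was stolen",
    "dad watches political videos all night",
    "my family dinner ended in a trump argument",
    "my mother trusts strangers online over family",
    "my friends forgot my birthday again",
    "people at work never listen to me",
    "i feel like my friends ignore me",
    "i finally told people how i feel",
    "my best friend moved away and i miss her",
    "people keep judging me at school",
    "i want my friends to know i am proud",
    "i think people at my job dislike me",
]
labels = [1] * 8 + [0] * 8

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, model in models.items():
    # Vectorize the text and fit the classifier in one pipeline so the
    # vocabulary is learned from the training data only.
    pipeline = make_pipeline(CountVectorizer(stop_words="english"), model)
    pipeline.fit(X_train, y_train)
    train_acc = pipeline.score(X_train, y_train)
    test_acc = pipeline.score(X_test, y_test)
    scores[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f}")
```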

Results

From this process, I was able to create a model that accurately predicts which subreddit a post comes from. I also found some differences in how people discuss emotional topics and seek support on these subreddits, namely that those on r/QAnonCasualties are more likely to mention members of their family than those on r/offmychest.

These results suggest that QAnon belief may be having a disproportionate impact on family dynamics. Researchers and organizations focused on combatting the real-world impacts of QAnon belief may find it effective to create resources about having difficult conversations with family members, and to target them to those who have expressed concern about QAnon on social channels.

Additional research into this area could gather data over a longer time horizon, and could also compare posts from more than two subreddits to further differentiate how individuals discuss this topic online.