Case Study: Coding for Social Impact by Daniel Berry

Overview

COVID-19 has disproportionately affected residents of long-term care facilities: these facilities house less than 1 percent of the U.S. population yet account for 43 percent of all COVID-19 deaths. As part of a recent project focused on social impact, two members of my cohort, Ethan Kuehl and Jun Choi, and I used data provided by the Centers for Medicare & Medicaid Services to create a visualization tool that shows the impact of COVID-19 on nursing homes, as well as several predictive models aimed at better anticipating potential spikes in nursing home COVID-19 cases.

The Context

Long-term care facilities, also known as nursing homes, have been some of the communities most devastated by the COVID-19 pandemic. By early May, the New York Times had identified more than 36,000 cases and more than 7,000 deaths in nursing homes, almost a month before the Centers for Medicare & Medicaid Services began tracking COVID-19 data for nursing homes. As a result, our modeling data and infographics do not capture the earliest days of COVID-19 in nursing homes. Despite this, we were able to analyze the subsequent responses by these facilities across the country, in the hope of preventing a return to the level of death seen during the early days of the pandemic.

The Challenge

In addition to visualizing this data, we also wanted to determine if we could use time-series models to predict nursing home COVID cases and deaths.

The Process

This project leveraged the COVID-19 Nursing Home Dataset provided by the Centers for Medicare & Medicaid Services (CMS), a collection of COVID-19 data from nursing homes across the country. The Centers for Disease Control and Prevention (CDC) mandates that all nursing homes report the following data, as required by the Secretary of Health and Human Services:

  • Suspected and confirmed COVID-19 infections among residents and staff, including residents previously treated for COVID-19

  • Total deaths and COVID-19 deaths among residents and staff

  • Personal protective equipment and hand hygiene supplies in the facility

  • Ventilator capacity and supplies in the facility

  • Resident beds and census

  • Access to COVID-19 testing while the resident is in the facility

  • Staffing shortages

To acquire the data, we downloaded the .csv file and created smaller files that split the data up by week, which allowed us to upload the data to GitHub. At the beginning of each notebook, we then concatenated all of the weekly files into one large DataFrame that recreated the original data we pulled from the CMS website. Due to the scale of the data, we implemented an extensive cleaning process that included removing nursing homes that hadn’t submitted data to CMS each week, as well as those whose submissions did not pass the quality assurance check.
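For illustration, here is a minimal sketch of that reassembly and quality filter, assuming hypothetical weekly file names such as data/week_01.csv (the quality-check column name shown is also illustrative, not necessarily the exact CMS field):

```python
import glob

import pandas as pd

# Hypothetical file layout: one CSV per reporting week, e.g. data/week_01.csv
weekly_files = sorted(glob.glob("data/week_*.csv"))

# Read each weekly slice and stack them back into the full CMS dataset
weekly_frames = [pd.read_csv(path) for path in weekly_files]
nursing_homes = pd.concat(weekly_frames, ignore_index=True)

# Keep only facilities whose submissions passed the CMS quality assurance
# check (illustrative column name and "Y"/"N" coding)
nursing_homes = nursing_homes[
    nursing_homes["Passed Quality Assurance Check"] == "Y"
]
```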

We then split the dataset into numerical and text-based data, and found that most of the text-based data consisted of binary "Yes"/"No" responses, which we quickly binarized into 1s and 0s. The remaining text data described each nursing home's name and address, so we did not process it for our ensuing models. The numeric data, on the other hand, broke down into dates, geolocations, and counts. We processed the dates for our time series modeling, the geolocations for our nursing home visualization dashboard, and the supply/resident/staff counts for our time series modeling. The primary cleaning step involved imputing null values through groupby operations, which allowed us to fill in location-based averages for the numeric data rather than simply imputing zeros.
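A sketch of those two cleaning steps, continuing from the concatenated DataFrame above and using illustrative column names in place of the actual CMS fields:

```python
# nursing_homes is the concatenated DataFrame from the previous sketch;
# the column names below are illustrative stand-ins for the CMS fields
yes_no_cols = ["Shortage of Nursing Staff", "Any Current Supply of N95 Masks"]
count_col = "Residents Weekly Confirmed COVID-19"

# Binarize the "Y"/"N" text responses into 1s and 0s
for col in yes_no_cols:
    nursing_homes[col] = nursing_homes[col].map({"Y": 1, "N": 0})

# Impute missing counts with the average for the facility's state
# rather than simply filling with zeros
nursing_homes[count_col] = nursing_homes[count_col].fillna(
    nursing_homes.groupby("Provider State")[count_col].transform("mean")
)
```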

There are several limitations to the dataset. For instance, nursing homes were not required to report COVID-19 data retroactive to the beginning of the pandemic, so facilities that did report retroactively show higher numbers of cases and deaths than facilities that did not. Furthermore, a facility’s access to COVID-19 testing may affect its reported data: facilities that did not have the ability to test all residents would not be able to report all residents with confirmed cases.

Results

From this project, we were able to create an app that visualizes the COVID-19 data provided by CMS. Additionally, through a series of modeling processes, we determined that ARIMA time series models were not useful for predicting COVID-19 cases and deaths in nursing homes, as there is not enough data for the models to train on.
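For reference, the kind of ARIMA fit we attempted looks roughly like the following sketch, using statsmodels on an illustrative weekly series (the values and model order are placeholders, not our actual data or tuned parameters):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Placeholder weekly series of confirmed resident cases; the real series
# was aggregated from the CMS data described above
weekly_cases = pd.Series(
    [120, 95, 88, 130, 210, 305, 280, 240, 190, 175, 160, 150],
    index=pd.date_range("2020-06-07", periods=12, freq="W"),
)

# Fit a simple ARIMA(1, 1, 1); with only a handful of weekly observations,
# there is little signal for the model to learn from
model = ARIMA(weekly_cases, order=(1, 1, 1)).fit()
print(model.forecast(steps=4))
```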

Case Study: Using Natural Language Processing to Analyze How Reddit Users Discuss Emotional Topics by Daniel Berry

Overview

With millions of posts and comments a day, Reddit is where people around the world go for news, commentary and community. Using Natural Language Processing, I was able to create a model that correctly identified which of two subreddits a post was shared on 94.8% of the time. These results were then used to determine the differences and similarities between how people seek support and discuss emotional topics on these subreddits. 

The Context

The QAnon conspiracy theory has taken root in the U.S. in recent years. Although it started off as a fringe idea shared on 4chan message boards, the first Congressional nominee to openly support the theory was recently elected to the House of Representatives.

In July 2019, r/QAnonCasualties was created as an online community where individuals with relatives and friends who have expressed support for the theories can gather for support and resources.

Almost a decade earlier, in October 2009, the r/offmychest subreddit was created as a space where people could find mutual support and share experiences they aren’t able to tell those in their lives, whether it's something happy or something that's been weighing on them.

The Challenge

Reddit users who post to the r/offmychest and r/QAnonCasualties subreddits do so in order to share their experiences and find support. Using Natural Language Processing, I sought to create a model that could predict which subreddit a post was made to, and to determine the differences between how people talk in each subreddit.

The Process

While examining the data, I found a number of posts that were either null or contained only `[removed]` or `[deleted]`, all of which I dropped from the dataset. I also discovered a significant number of posts from the r/QAnonCasualties subreddit that contained variations of `QAnon` or `conspiracy`. While some of these values also appeared in the r/offmychest subreddit, it was not nearly to the same extent.

I created a function to remove these and similar values from the text and ran it on the combined DataFrame containing the text from both subreddits, so as to prevent the target from leaking into my X variable.
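A simplified version of that cleaning step, assuming the scraped posts live in a DataFrame called posts with a selftext column (the column name and token pattern are illustrative):

```python
import re

# Illustrative pattern for tokens that would leak the target subreddit into X
LEAKY_PATTERN = re.compile(r"q[\s\-]?anon\w*|conspirac\w*", flags=re.IGNORECASE)

def strip_leaky_terms(text: str) -> str:
    """Remove QAnon/conspiracy variants so the model can't key on them."""
    return LEAKY_PATTERN.sub("", text)

# Drop null / removed / deleted posts, then clean the remaining text
posts = posts.dropna(subset=["selftext"])
posts = posts[~posts["selftext"].isin(["[removed]", "[deleted]"])]
posts["selftext"] = posts["selftext"].apply(strip_leaky_terms)
```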

Reviewing the most commonly used words in both subreddits, I found that people are more likely to use words related to the political right and their families in r/QAnonCasualties (e.g., 'trump', 'right', 'mom', 'family'), whereas people are more likely to use words related to those outside their family in r/offmychest (e.g., 'people', 'friends'). Across both subreddits, people use words related to their inner thoughts and feelings (e.g., 'know', 'feel', 'believe', 'want', 'think'). Below are graphs showing the top words used in each subreddit.

Posts in r/QAnonCasualties often use words related to the political right and family.

Posts in r/offmychest use some similar words to those in r/QAnonCasualties; however, users tend to write more about those outside their family.
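The top-word comparison can be reproduced with a simple bag-of-words count; here is a sketch assuming the cleaned posts are in posts['selftext'] with a subreddit label column (both names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_words(texts, n=15):
    """Return the n most frequent words, excluding English stop words."""
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1  # total count per vocabulary term
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, totals), key=lambda pair: -pair[1])[:n]

for name in ["QAnonCasualties", "offmychest"]:
    subset = posts.loc[posts["subreddit"] == name, "selftext"]
    print(name, top_words(subset))
```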

To determine the best model, I tested Logistic Regression, Random Forest, and K-Nearest Neighbors models to see which would most accurately predict the subreddit a post came from. Following this process, I found that the Logistic Regression model performed best, with ~95% accuracy on both the train and the test data.

The Random Forest model performed better on the train data (~99% accurate) and worse on the test data (~93% accurate), indicating that it was overfit to the training data and not a good model to use for this project. The K-Nearest Neighbors model performed worse than the Logistic Regression and Random Forest models on both the train (~78% accurate) and the test (~75% accurate) data.
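A condensed sketch of that comparison, using scikit-learn pipelines with a TF-IDF vectorizer (the original project may have used a plain count vectorizer, and the hyperparameters shown are illustrative rather than tuned):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumed column names: cleaned post text and the subreddit label
X = posts["selftext"]
y = posts["subreddit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

for name, estimator in models.items():
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), estimator)
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_train, y_train), pipe.score(X_test, y_test))
```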

Results

From this process, I was able to create a model that accurately predicts which subreddit a post comes from and found that there are some differences in how people discuss emotional topics and seek support on these subreddits, namely that those on r/QAnonCasualties are more likely to mention members of their family than those in the r/offmychest subreddit.

This suggests that QAnon belief is having a disproportionate impact on family dynamics. Researchers and organizations focused on combatting the real-world impacts of QAnon belief may find it effective to direct resources about having difficult conversations with family members toward those who have expressed concern about QAnon on social channels.

Additional research into this area could gather data over a longer time horizon, and could also compare posts from more than two subreddits to further differentiate how individuals discuss this topic online. 

More than halfway through! by Daniel Berry

Seven weeks into the Data Science Immersive course, I’m astounded by how much ground we’ve covered and how much I’ve learned in such a short amount of time. From the basics of Python all the way through to our most recent project in Natural Language Processing, there are millions of potential avenues to explore. 

Based on my previous experience, I was particularly excited to learn about using APIs, as well as web scraping. For instance, I can imagine how connecting to the APIs provided by many of my former team’s tools would streamline the reporting process. Combined with information we could gather by scraping online forums not covered by those tools, I believe this would be an overall value-add to my previous work. While I can see how this knowledge would be helpful for my prior role, I’m excited to take these new skills into different fields and explore how they can help solve new and different business challenges.

While we’ve covered many different supervised learning models for both regression and classification problems, such as Linear/Logistic Regression, K-Nearest Neighbors, Random Forest and SVM, I am particularly excited to learn more about unsupervised learning and discover how it compares to the previous modeling methods we’ve learned.  

While there’s still plenty more to learn, I’m excited by the possibilities these new skills can open up, and I look forward to applying them to my capstone project as we approach the end of the course.

How the pandas and docx libraries can streamline repetitive COVID-19 communications by Daniel Berry

During the first several months of the COVID-19 pandemic, my colleagues and I were responsible for creating communications packets for a client with close to 2,000 retail locations across the U.S. These packets helped inform regional vice presidents, district managers, store managers and store employees about confirmed cases of COVID-19, and included information regarding how to communicate the information to store employees, what the company was doing to protect employees and customers, and reminders about best practices for maintaining a sanitary and safe working environment. 

Many of these packets were repetitive in nature, and there was a clear, logical order we would use to fill them out. To do so, we would look at several different Excel files with information such as the names of regional, district and store leadership, the address of the store in question, and how many cases the store had had previously.

Given my recent experience with DataFrames, importing/exporting files using Python, and building functions, I believe that the creation of these materials could be automated, which would streamline the process and allow the team to get the communications out to the necessary individuals faster. The pandas and docx libraries would be the most important libraries for building this process.

Below, I’ve outlined how I think these libraries could be utilized to create functions that would have automated much of the communications packet creation for us, as well as considerations that would have to go into each step.

Pandas

As I mentioned, while creating these packets initially, my team and I would have to look at several different Excel files to get the information we needed. Using pandas, I would merge those files, mapping regional vice president, district manager and store manager names onto their respective regions/districts/stores. I would then save this DataFrame to something like a store-lookup.csv file that would be easily accessible and could be used in a function to find the information we needed for the memos.
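A sketch of that lookup-table step, with hypothetical file and column names standing in for the client’s actual exports:

```python
import pandas as pd

# Hypothetical source files exported from the client's systems
stores = pd.read_excel("store_list.xlsx")          # store number, address, district, region
leaders = pd.read_excel("leadership_roster.xlsx")  # regional VPs, district and store managers

# Map leadership names onto their stores and save a single lookup file
store_lookup = stores.merge(leaders, on=["region", "district"], how="left")
store_lookup.to_csv("store-lookup.csv", index=False)
```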

Each time we had to enter a new confirmed case into the system, we could use Python's input function to assign different values to a dictionary and concatenate them to a second DataFrame housing all of the confirmed case information, which could be named something along the lines of confirmed-case-log.csv. This would keep a running tally of the region, state, store number, date of notification and any other information we might need for future reference.
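And a sketch of logging each new confirmed case, again with hypothetical field names:

```python
import pandas as pd

# Collect the new case details interactively
new_case = {
    "store_number": input("Store number: "),
    "state": input("State: "),
    "notification_date": input("Date of notification (YYYY-MM-DD): "),
}

# Append the case to the running log, creating the file on first use
try:
    case_log = pd.read_csv("confirmed-case-log.csv")
except FileNotFoundError:
    case_log = pd.DataFrame(columns=list(new_case))

case_log = pd.concat([case_log, pd.DataFrame([new_case])], ignore_index=True)
case_log.to_csv("confirmed-case-log.csv", index=False)
```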

DocX

I recently found the docx library while considering ways to streamline the repetitive communications tasks we had to perform during the pandemic. The library works similarly to how we might print out information: once the text of the memos and other communications packets was settled upon with the client, we could use the input function to create the variables needed to look up information in the store-lookup.csv file, and then use f-strings to insert the names of the regional vice presidents, district and store managers, and store addresses into the documents as appropriate.

Additionally, for each of the communications talking points, we could have one set of talking points for a store’s first confirmed case and another for a store that already had a confirmed case (i.e., if there were already an entry in the confirmed-case-log.csv file). Using if/elif/else statements, the appropriate verbiage could be placed as needed in the file, which could then be exported as a whole into the directory of choice.
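Putting these pieces together, here is a sketch of how the docx library could assemble one packet; the column names, lookups and verbiage are placeholders, not the client’s actual materials:

```python
import pandas as pd
from docx import Document

store_lookup = pd.read_csv("store-lookup.csv")
case_log = pd.read_csv("confirmed-case-log.csv")

store_number = int(input("Store number: "))
store = store_lookup.loc[store_lookup["store_number"] == store_number].iloc[0]

# Check whether this store already appears in the confirmed-case log
repeat_case = store_number in case_log["store_number"].astype(int).values

doc = Document()
doc.add_heading(f"COVID-19 Case Notification: Store {store_number}", level=1)
doc.add_paragraph(
    f"To: {store['regional_vp']}, {store['district_manager']}, {store['store_manager']}"
)
doc.add_paragraph(f"Store address: {store['address']}")

# Placeholder talking points chosen by whether this is the store's first case
if repeat_case:
    doc.add_paragraph("Talking points for a store with a prior confirmed case.")
else:
    doc.add_paragraph("Talking points for a store's first confirmed case.")

doc.save(f"store_{store_number}_packet.docx")
```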

Ultimately, given the urgency with which these packets had to be communicated to store employees, and the repetitive nature of the packets, building a program to generate and export them quickly could significantly reduce the overall time spent on them, creating value for both the client and the account team.

A Journey towards Data Science by Daniel Berry

Having worked in crisis management at multiple agencies since graduating from college, I saw how consequential the recommendations my colleagues and I made could be on the outcome of any particular situation. However, it wasn’t until I started working at Weber Shandwick that I saw the impact of data-driven decision making come to life. The value and importance our clients placed on the insights we were able to distill from various data sources were the driving force behind my decision to leave crisis management and pivot toward data science, which I intend to use to help companies make key decisions that meaningfully impact their bottom line.

I had my first experience working with datasets during my time at Colgate, as I supported my thesis advisor while he examined media narratives around events taking place in the Middle East. In my first job out of college, I had the opportunity to work with even larger datasets, as I helped clients monitor for potential issues and themes that could come up in traditional or social media. In each of these cases, I built a strong foundation in using various tools to find and isolate specific themes and issues and in communicating our findings to key stakeholders.

Ultimately, it was the invitation to join the Corporate Issues team at Weber Shandwick that was instrumental in my journey towards data science. As part of a leading global communications firm, the Corporate Issues team worked on many high-profile issues and crises for some of the world’s largest and most recognizable companies.

During my time at Weber, several of my colleagues and I were able to create space on the team to focus on how we and the company could continue to develop and expand upon the team’s data and analytics capabilities. Additionally, I was able to hone my abilities in the realm of social and traditional media listening and analytics, helping to provide clients with real-time insights and recommendations based upon ongoing and emerging trends.

As the world changed with the spread of COVID-19, the team and I worked nearly around the clock for months on end to help clients around the world manage the impact on their business and communicate actions to key stakeholders. Similarly, we supported clients during the protests this past summer, as many companies reckoned with the actions and changes they would need to take to make their workplaces equitable and inclusive for their current and future employees. And when the future direction of the country seemed uncertain in the final weeks before the election, we advised clients on how to best remain true to their core values while still being transparent with their stakeholders. Throughout these crises, I was part of the process of exploring and developing the key findings we used to provide actionable insights to our clients.

In a year full of crises, I saw the value clients placed on the data and insights we provided, and how those insights, coupled with strategic recommendations, drove our clients to make operational changes. As companies continue to work with and manage large amounts of internal and external data, I want to use the skills I’ve learned in my prior roles, along with the tools I’m learning in General Assembly’s Data Science Immersive course, to help make decisions that have a positive effect on a company’s performance.