Overview
COVID-19 has disproportionately affected residents of long-term care facilities: these residents make up less than 1 percent of the U.S. population yet account for 43 percent of all COVID-19 deaths. As part of a recent project focused on social impact, two members of my cohort, Ethan Kuehl and Jun Choi, and I used data provided by the Centers for Medicare & Medicaid Services to build a visualization tool that shows the impact of COVID-19 on nursing homes, along with several predictive models intended to anticipate potential spikes in nursing home COVID-19 cases.
The Context
Long-term care facilities, also known as nursing homes, have been among the communities hardest hit by the COVID-19 pandemic. By early May, the New York Times had identified more than 36,000 cases and more than 7,000 deaths in nursing homes, almost a month before the Centers for Medicare & Medicaid Services began tracking COVID-19 data for nursing homes. As a result, our modeling data and infographics do not capture the earliest days of COVID-19 in nursing homes. Despite this, we were able to analyze how these facilities across the country subsequently responded, in the hope of helping to prevent a return to the death toll seen in the early days of the pandemic.
The Challenge
In addition to visualizing this data, we wanted to determine whether time-series models could predict nursing home COVID-19 cases and deaths.
The Process
This project used the COVID-19 Nursing Home Dataset provided by the Centers for Medicare & Medicaid Services (CMS), a collection of COVID-19 data from nursing homes across the country. The Centers for Disease Control and Prevention (CDC) mandates that 100% of nursing homes report the following data, as required by the Secretary of Health and Human Services:
Suspected and confirmed COVID-19 infections among residents and staff, including residents previously treated for COVID-19
Total deaths and COVID-19 deaths among residents and staff
Personal protective equipment and hand hygiene supplies in the facility
Ventilator capacity and supplies in the facility
Resident beds and census
Access to COVID-19 testing while the resident is in the facility
Staffing shortages
To acquire the data, we downloaded the .csv file from the CMS website and split it into smaller weekly files so that we could upload the data to GitHub. At the beginning of each notebook, we concatenated these files into one large DataFrame that recreated the original dataset. Due to the scale of the data, we implemented an extensive cleaning process that included removing nursing homes that hadn’t submitted data to CMS each week, as well as nursing homes whose submitted data did not pass the quality assurance check.
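The load-and-filter step described above can be sketched roughly as follows. This is a minimal illustration, not our exact notebook code: the column names (`provider_id`, `week_ending`, `submitted_data`, `passed_qa_check`, `resident_cases`), the `Y`/`N` flag values, and the sample rows are all assumptions made for the example, and two in-memory strings stand in for the weekly .csv files. It reads "removing nursing homes" as dropping a facility entirely if any of its weekly rows is missing a submission or fails the quality assurance check.

```python
import io

import pandas as pd

# Two tiny in-memory "weekly files" stand in for the split .csv files.
# Column names and values are made up for illustration only.
week_1 = io.StringIO(
    "provider_id,week_ending,submitted_data,passed_qa_check,resident_cases\n"
    "A1,2020-05-31,Y,Y,3\n"
    "B2,2020-05-31,N,N,\n"
)
week_2 = io.StringIO(
    "provider_id,week_ending,submitted_data,passed_qa_check,resident_cases\n"
    "A1,2020-06-07,Y,Y,5\n"
    "B2,2020-06-07,Y,N,40\n"
)


def load_weekly_files(files):
    """Read each weekly file and stack them into one DataFrame."""
    frames = [pd.read_csv(f, parse_dates=["week_ending"]) for f in files]
    return pd.concat(frames, ignore_index=True)


def keep_reliable_facilities(df):
    """Keep only facilities whose every weekly row was submitted and passed QA."""
    row_ok = (df["submitted_data"] == "Y") & (df["passed_qa_check"] == "Y")
    # Broadcast the per-facility verdict back onto every row of that facility.
    facility_ok = row_ok.groupby(df["provider_id"]).transform("all")
    return df[facility_ok].reset_index(drop=True)


raw = load_weekly_files([week_1, week_2])
clean = keep_reliable_facilities(raw)
print(clean)
```

In this toy data, facility `B2` is dropped because it skipped one week and failed the check in another, while both rows for `A1` are kept. A row-level filter (keeping only the individual weeks that pass) would be a reasonable alternative if dropping whole facilities discards too much data.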
We then split the dataset into numerical and text-based data, and found that most of the text-based data consisted of binary "Yes" or "No" responses, which we could quickly binarize into 1s and 0s. The remaining text data held the name and address of each nursing home, so we did not use it in our models. The numeric data, on the other hand, could be broken down into dates, geolocations, and counts. We used the dates and the supply, resident, and staff counts for our time-series modeling, and the geolocations for our nursing home visualization dashboard. The primary cleaning step was imputing null values: using groupby operations, we filled missing numeric values with averages based on location rather than simply imputing zeros.
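As a rough sketch of the two cleaning steps just described, the snippet below binarizes a "Yes"/"No" column and fills missing counts with the mean for the same location. The column names (`state`, `shortage_of_nursing_staff`, `resident_cases`), the choice of state as the grouping level, and the sample values are assumptions for illustration; the helper functions are hypothetical and not taken from our notebooks.

```python
import pandas as pd

# Toy data; columns and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "state": ["NY", "NY", "NY", "CA", "CA"],
        "shortage_of_nursing_staff": ["Yes", "No", "No", "Yes", "Yes"],
        "resident_cases": [4.0, None, 8.0, 1.0, None],
    }
)


def binarize_yes_no(frame, columns):
    """Map "Yes"/"No" text responses to 1/0 in the given columns."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].map({"Yes": 1, "No": 0})
    return out


def impute_group_mean(frame, group_col, value_cols):
    """Fill nulls in each value column with the mean of that row's group."""
    out = frame.copy()
    for col in value_cols:
        # transform("mean") returns one value per row, aligned to the original index.
        group_means = out.groupby(group_col)[col].transform("mean")
        out[col] = out[col].fillna(group_means)
    return out


binary = binarize_yes_no(df, ["shortage_of_nursing_staff"])
imputed = impute_group_mean(binary, "state", ["resident_cases"])
print(imputed)
```

Here the missing New York value becomes 6.0 (the mean of 4 and 8) and the missing California value becomes 1.0, where a zero fill would have understated both. If every value in a group is missing, the group mean is itself null and the gap remains, so a fallback such as a national average would still be needed.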
There are several limitations to the dataset. For instance, nursing homes were not required to report COVID-19 data retroactive to the beginning of the pandemic, so facilities that chose to report retrospectively can show higher numbers of cases and deaths than facilities that did not. Furthermore, a facility’s access to COVID-19 testing may affect its reported data. For example, facilities that could not test all residents would not be able to report every resident with a confirmed case.
Results
From this project, we created an app that helps visualize the COVID-19 data provided by CMS. Additionally, through a series of modeling experiments, we determined that ARIMA time-series models were not useful for predicting COVID-19 cases and deaths in nursing homes, because there was not enough data for the models to train on.