Some Multiple Regression Models for the Number of COVID-19 Cases and Deaths in the United States

The whole world has been affected by the COVID-19 pandemic. It has changed life drastically, affecting both social and business behavior and causing major economic distress throughout the world. The disease is often denominated a “novel coronavirus,” meaning that it is a new strain, that none of us carry antibodies to it and that there is much to be learned about its pathology. This obviously makes it hard to control. While several countries seem to have grasped ways to contain the virus, the United States (the “U.S.”) has seen steady growth in the number of cases and deaths. This paper uses multiple regression models to examine the differences among the several U.S. states in the numbers of cases and deaths and investigates several possible contributing factors to these totals.


Introduction
The first confirmed case of COVID-19 was reported in the U.S. on January 22, 2020. As of October 30, 2020, the U.S. had recorded over nine million confirmed cases and more than 230,000 deaths. The initial cases in the U.S. were travel-related, and for some time there appeared to be no indications of community spread. On February 26, 2020, the CDC reported its first known case of community spread when a man in California became infected with no travel history or known contact with an infected person. At that time, the U.S. had only 15 cases in total, 12 of them travel-related (Hauck, Gelles, Bravo & Thorson, 2020).
Initially, it was believed that the virus could be controlled through testing and contact tracing and that, like the flu, it would dissipate in the summer heat. Unfortunately, the virus is resilient and has defied efforts to contain it in the U.S. The U.S. surpassed 10,000 cases on March 19, 100,000 cases on March 26 and 1 million on April 28. To put the numbers in perspective, the New York Times provided the following comparison: By summer, the total number of infections in the U.S. was more than the combined populations of Nebraska, Vermont and Montana and the national death toll by summer exceeded the population of Syracuse, N.Y. (Almukhtar, Bloch, Aufrichtig, Calderone, Collins, Conlen et al., 2020).
The overall federal response consisted of a series of non-mandatory guidelines combined with inconsistent messaging from various federal agencies. Americans were first urged to stay at home under Presidential guidelines issued on March 16, 2020. Because these guidelines were not mandatory, the states were forced to respond independently, taking various actions and obtaining various results. Many (but not all) states issued stay-at-home orders; however, the level of restriction, the length of the orders and the enforcement of the orders were far from uniform across states. The overall result was that the economy was shut down in most states for the better part of April. Millions lost their jobs and unemployment reached depression-era levels. The virus has led to some drastic changes in American lifestyles. A visit to the theater (should one be open), dining out at a restaurant or even a simple night out with friends now appears to be fraught with danger. To name a few other changes, large gatherings are forbidden in most states and weddings have been postponed or scaled down. Many wonder if life will ever return to what it was pre-pandemic.
initial response (such as the extent, length, and enforcement of stay at home orders). All these factors can potentially influence the initial outbreak of the disease as well as its trajectory.
In addition to attempts to reduce the virus' spread, attempts were also made to expand testing. Initial problems with testing included (1) defective tests, (2) insufficient numbers of testing kits, and (3) delays in getting test results from medical lab facilities (Shear, Goodnough, Kaplan, Fink, Thomas & Weiland, 2020). Despite these issues, testing did expand, though at less than ideal rates.
The number of COVID-19 cases per state and the statewide trajectory of these cases are widely available and published almost daily. See, for example, ("The New York Times, The Coronavirus Outbreak", 2020), ("Centers for Disease Control and Prevention, Coronavirus (COVID-19)", 2020), and ("Johns Hopkins Coronavirus Resource Center", 2020). There have also been many epidemiological and clinical studies on COVID-19. For example, Yang and Wang et al. (2020) examined 150 patients in Wuhan, China early in the epidemic to investigate the clinical predictors of mortality due to COVID-19. Zhang et al. (2020) estimated the reproductive number and outbreak size of the disease on the Diamond Princess cruise ship. Williamson et al. (2020) looked at factors associated with COVID-19 deaths in England. However, as far as we know, to date no statistical analysis has been done to investigate the factors that affect the number of statewide COVID-19 cases and deaths in the U.S. We believe that the results would be of social interest and that a study like this could perhaps lead to better nationwide planning, with more resources directed towards states that are at higher risk. Towards this goal, we investigate multiple regression models using the factors that might influence our variables of interest. For this study, we have considered data at two critical dates in the path of each state's outbreak: (1) the date 5 weeks (35 days) after the 100th confirmed case, and (2) the date 13 weeks (91 days) after the 100th confirmed case. We chose the 5-week date because it falls roughly at the point when lockdown procedures had been instituted and their results were being seen. We chose the 13-week date because it falls roughly at the point after the lifting of lockdown procedures when its effects were being seen.
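The two reference dates can be located mechanically from a state's cumulative case series. The following is a minimal sketch, assuming that "the 100th confirmed case" means the first day on which the cumulative count reaches 100; the function name and the series are hypothetical and are not taken from the data used in this paper.

```python
from datetime import date, timedelta


def reference_dates(cumulative_cases, threshold=100, offsets=(35, 91)):
    """Find the first day with at least `threshold` cumulative cases.

    cumulative_cases: list of (date, cumulative confirmed cases), sorted by date.
    Returns that day and the dates `offsets` days after it.
    """
    for day, total in cumulative_cases:
        if total >= threshold:
            return day, [day + timedelta(days=k) for k in offsets]
    return None, []  # the threshold was never reached


# Hypothetical cumulative series for one state (illustrative numbers only).
start = date(2020, 3, 1)
series = [(start + timedelta(days=i), 12 * i * i) for i in range(10)]

day100, (day35, day91) = reference_dates(series)
print(day100, day35, day91)
```

Because each state is measured relative to its own 100th case, the Day 35 and Day 91 values fall on different calendar dates for different states.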
As of October 30, the U.S. had a confirmed case rate of about 2,720 per 100,000 population. A quick comparison with other highly populated countries such as China (approximately 6), Pakistan (approximately 153), Indonesia (approximately 149) and India (approximately 590) shows that their per capita case counts are much lower than that of the U.S. Europe, which was hit very early and very hard, was able to control the spread for a time but has experienced a resurgence of the virus, though levels are still somewhat lower than in the U.S.: Spain's case rate is approximately 2,482 per 100,000, while Sweden's is 1,207, the UK's is 1,434, France's is 2,038 and Italy's is 1,108 per 100,000.
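The per-100,000 rates quoted above are simple rescalings of raw counts. As a small illustration, the sketch below recomputes the U.S. figure from the "over nine million" cases mentioned in the Introduction; the population of about 331 million is an assumed round figure and is not taken from this paper.

```python
def rate_per_100k(count, population):
    """Convert a raw count into a rate per 100,000 residents."""
    return count / population * 100_000


# 9,000,000 cases is the rounded figure from the text; the population of
# roughly 331 million is an assumption made only for this illustration.
us_rate = rate_per_100k(9_000_000, 331_000_000)
print(round(us_rate))
```

The result lands close to the rate of about 2,720 given in the text; the same rescaling underlies the four response variables used later.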
The rest of the paper is organized as follows. Section 2 describes the data sources and the data and presents descriptive summaries of the data. In Section 3 we develop regression models to describe the number of cases and deaths on Day 35 as a function of factors such as population density, GDP and mobility index. In Section 4 we develop similar regression models for the number of cases and deaths on Day 91. Section 5 discusses model assumptions. Section 6 contains some concluding remarks. A table of the data is in the Appendix.

Data Description
The initial runs consist of 9 independent variables on which we regressed 4 response variables. The data are contained in Appendices 1 and 2. The predictor variables (the "Predictor Variables") are:


Proportion of the population that is African-American (af_am). It has been widely noted that minority communities have been very hard hit by the virus. We use the proportion of the population that is African-American as a proxy for minority composition.
Population density (popdens). Epidemiological models predict that the rate of interaction among the population is a major contributing factor to disease spread. We use population density as a proxy for the rate of interaction.
Per capita GDP (GDPpercap). We use this as a proxy for the overall wealth of the population.
Proportion of the population with a college degree (coll_deg). Because higher education leads to higher paying jobs and less poverty, this is also a proxy for the wealth of the population.
Proportion of the population that is over 65 (over65). It has been noted that older members of the population are more susceptible to the disease and are more likely to succumb to it.
Monthly flights into the state before travel bans (flights_to). We use this measure as a proxy for the likelihood that the disease will travel into the state by such flights.
Party in control of the state governor's office (party_ctrl, 0 = GOP, 1 = Dem). This variable tests whether results have differed depending on the party in control of the governor's office.
Proportion of the initial 35 days in which distancing restrictions were in place (prop_dist). This variable is a proxy for the extent of the stay-at-home restrictions, i.e., the extent of the statewide economic shutdown.
Change in mobility index from the day of the 100th case, calculated for 35 and 91 days (dmob). The mobility index is maintained by Descartes Labs ("Descartes Labs", 2020).
The four response variables considered in the four models are the following:
Confirmed cases per 100,000 population 5 weeks (i.e., on the 35th day) after the 100th case (cases_100k35).
Deaths per 100,000 population on the 35th day after the 100th case.
Confirmed cases per 100,000 population 13 weeks (i.e., on the 91st day) after the 100th case.
Deaths per 100,000 population on the 91st day after the 100th case.
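Written out, the full model implied by the nine Predictor Variables has the same form for each response variable. The notation below (coefficients \beta_j, error term \varepsilon_i, state index i) is an assumption introduced here for illustration; the paper itself does not display the equation.

```latex
\begin{equation*}
\begin{aligned}
y_i = {} & \beta_0 + \beta_1\,\mathrm{af\_am}_i + \beta_2\,\mathrm{popdens}_i
          + \beta_3\,\mathrm{GDPpercap}_i + \beta_4\,\mathrm{coll\_deg}_i
          + \beta_5\,\mathrm{over65}_i \\
        & + \beta_6\,\mathrm{flights\_to}_i + \beta_7\,\mathrm{party\_ctrl}_i
          + \beta_8\,\mathrm{prop\_dist}_i + \beta_9\,\mathrm{dmob}_i + \varepsilon_i
\end{aligned}
\end{equation*}
```

Here y_i stands for one of the four response variables for state i, and dmob is taken at Day 35 or Day 91 to match the response.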
We can see differing trajectories of the epidemic in the time series plots for various states. Figures 1 and 2 contain time series plots of confirmed cases (Figure 1) and deaths (Figure 2) for each of the six most populous states: California, Texas, New York, Florida, Pennsylvania and Illinois. It is clear that New York was hit very hard very early, had high numbers of cases and deaths, and then got things under control. Pennsylvania followed a similar timeline, but less severely. Illinois has had a double peak in the number of cases; the first peak occurred at the same time as in New York and Pennsylvania while the second occurred later on. Fatalities, however, were much less pronounced the second time around. Florida, Texas and California had later outbreaks which were more severe in terms of cases, but less so in terms of fatalities.

Method and Results
We fit regression models to (1) the number of cases per 100,000 population on the 35th day (5th week) after the 100th confirmed case in the state; (2) the number of deaths per 100,000 population on the 35th day after the 100th confirmed case in the state; (3) the number of cases per 100,000 population on the 91st day (13th week) after the 100th confirmed case in the state; and (4) the number of deaths per 100,000 population on the 91st day after the 100th confirmed case in the state. The results are in the following paragraphs.

Model 1: Dependent Variable: Confirmed Cases as of 35th Day (5th Week)
In the first regression model, we fit a regression model to the number of cases per 100,000 population for the several states 5 weeks (the 35th day) after the 100th confirmed case within each state. The response variable is the number of cases per 100,000 residents on that day. The initial predictor variables are the Predictor Variables enumerated in Section 2.
The initial model run shows that the model is significant with an R-squared of 0.743 and an adjusted R-squared of 0.685; however, the only variable that is significant at α = 0.05 (in the presence of all the other independent variables) is population density. The party in control of the governorship and the number of flights into the state are marginally significant (p-value < 0.06) in the presence of the other variables.
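The relationship between the two fit statistics reported above can be checked with the standard adjusted R-squared formula. The sketch below assumes n = 50 observations (one per state), which is not stated explicitly in the text, and p = 9 predictors for the initial model.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept,
    fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# n = 50 is an assumption (one observation per state); p = 9 is the number
# of Predictor Variables in the initial model.
initial_adj = adjusted_r2(0.743, n=50, p=9)
print(round(initial_adj, 3))
```

Under these assumptions the formula reproduces the reported adjusted R-squared of 0.685, which suggests that the initial model was indeed fit with all nine predictors on 50 observations.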
We used the backward selection method to choose the best subset of regressors resulting in retention of the following variables in our final model: • Population Density, • Per-capita GDP, • Proportion of the population with a college degree, • Flights into the state (pre-travel bans), • Party in control of governorship, and • Proportion of the initial 35-day period with distancing restrictions in place.
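For readers unfamiliar with backward selection, the following is a minimal sketch of one common variant: starting from the full model, it repeatedly removes the predictor whose removal lowers the AIC the most and stops when no removal helps. The stopping criterion used in this paper is not stated, so the use of AIC is an assumption, as are the helper names and the synthetic data (only the label popdens is borrowed from the text).

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus the columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)  # normal equations
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, X[i]))) ** 2 for i in range(n))


def aic(columns, y):
    """Gaussian AIC up to an additive constant."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)


def backward_select(data, y):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    kept = list(data)
    best = aic([data[name] for name in kept], y)
    while len(kept) > 1:
        trials = [(aic([data[v] for v in kept if v != drop], y), drop) for drop in kept]
        score, drop = min(trials)
        if score >= best:
            break
        best = score
        kept.remove(drop)
    return kept


# Synthetic data: the response depends on "popdens" only; the other two
# columns are pure noise. These are not the paper's data.
random.seed(0)
n = 50
data = {
    "popdens": [random.gauss(0, 1) for _ in range(n)],
    "noise1": [random.gauss(0, 1) for _ in range(n)],
    "noise2": [random.gauss(0, 1) for _ in range(n)],
}
y = [2 + 3 * x + random.gauss(0, 0.5) for x in data["popdens"]]
kept = backward_select(data, y)
print(kept)
```

A p-value-based rule (removing the least significant predictor until all remaining ones fall below a threshold) is another common choice and may retain a different subset.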
The final model is significant with an R-squared of 0.733 (adjusted R-squared of 0.696). Of the predictor variables, population density (popdens) is highly significant (p-value near 0), and party in control of the governorship (party_ctrl) is significant (p-value = 0.0381). All other variables are marginally significant (0.05 < p-value < 0.10); for example, Table 1 reports a coefficient of -109.1 for prop_dist (standard error 59.24, t = -1.842, p-value = 0.0724). The scatter plot matrix for the variables contained in the final regression model for confirmed cases per 100,000 on Day 35 is shown in Figure 3, while the summary of the model is in Table 1.
It was of special interest to examine the effect of the party in control on the dependent variable. To that end, we looked at scatter plots of the dependent variable against some of the independent variables, using different markers for the party in control (red is Republican and blue is Democratic). These plots are shown in Figures 4 and 5. Note that the numbers of cases (as well as deaths) seem to be higher in Democratic states than in Republican states. This may be because the states that were hit hardest at the beginning of the pandemic were in the primarily Democratic northeast U.S.

Model 2: Dependent Variable: Deaths as of 35th Day (5th Week)
In this model we fit a regression model to the number of deaths per 100,000 population for the states 5 weeks (the 35th day) after the 100th case. The dependent variable is the number of deaths per 100,000 residents on that day. As before, the initial predictor variables are the Predictor Variables enumerated in Section 2. The initial regression analysis shows that the model is significant with an R-squared of 0.609 and an adjusted R-squared of 0.521. The analysis shows that population density is extremely significant (in the presence of the other variables) and party in control is marginally significant.
Just as in the first model, we used the backward selection procedure to reduce the model, which results in the retention of the following variables: • Population Density, • Per-capita GDP, • Proportion of the population that is African-American, • Proportion of the population with a college degree, • Flights into the state (pre-travel bans), and • Party in control of governorship.
The final model has an R-squared of 0.601 and an adjusted R-squared of 0.545. Population density (popdens) is once again highly significant (p-value near 0), and party in control of the governorship (party_ctrl) is significant (p-value < 0.05). No other variables are significant. The scatter plot matrix for the variables in the final model for deaths as of the 35th day is shown in Figure 6, while the summary of the model is given in Table 2.

Model 3: Dependent Variable: Confirmed Cases as of 91st Day (13th Week)
In this model we considered as the dependent variable the number of confirmed cases per 100,000 population as of 13 weeks (91 days) after the 100th case. The predictor variables were the same as the ones used for Day 35, except that the change in mobility index is computed for the 91st day.
The initial model is significant with an R-squared of 0.745 and an adjusted R-squared of 0.694. The significant variables (in the presence of all the other independent variables) are population density, the proportion of the first 35 days that were under stay-at-home orders, and the proportion of the population that is African-American.
As in Models 1 and 2, we used the backward selection method, resulting in the retention of the following predictors: • Population Density, • Proportion of the initial 35-day period with distancing restrictions in place, • Proportion of the population that is African-American, • Flights into the state (pre-travel bans), • Per-capita GDP, and • Party in control of governorship.
The final model is significant with an R-squared of 0.743 and an adjusted R-squared of 0.707. Population density (popdens) is once again highly significant (p-value near 0). The proportion of the initial 35 days that were under distancing restrictions (prop_dist) and the proportion of the population that is African-American (af_am) are both significant (p-value < 0.05). Per-capita GDP (GDPpercap) and the number of pre-travel-ban flights into the jurisdiction (flights_to) are both marginally significant (p-value < 0.10). The scatter plot matrix and the output for the final regression model for confirmed cases on Day 91 are given in Figure 7 and Table 3.

Model 4: Dependent Variable: Deaths as of 91st Day (13th Week)
In this model we considered as the dependent variable the number of deaths per 100,000 population as of 13 weeks (91 days) after the 100th case. The predictor variables were the same as the ones used for Day 35, except that the change in mobility index is computed for the 91st day. The initial model, as with the other models, is significant with an R-squared of 0.777 and an adjusted R-squared of 0.727. Here the only significant variable (in the presence of all the other independent variables) is population density. As before, we used backward selection to choose the best regressors, resulting in the retention of the following independent variables: • Population Density, • Per-capita GDP, • Proportion of the population that is African-American, • Flights into the state (pre-travel bans), and • Party in control of governorship.
The final model is significant with an R-squared of 0.700 and an adjusted R-squared of 0.666. Population density (popdens) is extremely significant (p-value near 0). Per-capita GDP (GDPpercap) and the proportion of the population that is African-American (af_am) are both significant (p-value < 0.05). The number of pre-travel-ban flights into the jurisdiction (flights_to) is marginally significant (p-value < 0.10). The scatter plot matrix and the output for the final model for deaths per 100,000 on Day 91 are given in Figure 8 and Table 4.

Model Assumptions
In checking model assumptions, we found that multicollinearity was not a problem in any of the models. The normality assumption was violated in Models 1 and 2 while Models 3 and 4 did not present any serious violation. The homogeneity of variance assumption was violated in Model 2, but none of the other models posed serious violations. We did not attempt any transformations of the variables because the main purpose of the models was to describe the relationship between the independent variables and the response variables, and transformations would have made interpretation of the results much less straightforward.
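The text does not say how multicollinearity was assessed. One widely used diagnostic is the variance inflation factor (VIF), obtained by regressing each predictor on the others and computing 1 / (1 - R-squared). The sketch below is an illustration under that assumption, using synthetic columns with hypothetical names; it covers only the multicollinearity check, not the normality or homogeneity-of-variance checks.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """R-squared of an OLS fit of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def vif(data):
    """VIF of each predictor: 1 / (1 - R^2) from regressing it on all the others."""
    out = {}
    for name, col in data.items():
        others = [c for other, c in data.items() if other != name]
        out[name] = 1.0 / (1.0 - r_squared(others, col))
    return out


# Synthetic predictors (not the paper's data): a and b are independent,
# while c is nearly a copy of a.
random.seed(1)
n = 50
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
c = [ai + 0.1 * random.gauss(0, 1) for ai in a]

independent = vif({"a": a, "b": b})
collinear = vif({"a": a, "b": b, "c": c})
print(independent)  # values close to 1
print(collinear)    # a and c have large values
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, although the threshold is a convention rather than a formal test.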

Discussion and Conclusion
In this paper, we used multiple linear regression models to identify the significant factors affecting the number of confirmed COVID-19 cases and the number of deaths per 100,000 in the states of the U.S. We identified population density as the most influential factor in all the models. Interestingly, the models suggest that party control is a significant factor for the number of deaths and the number of cases at the fifth week, but not at the thirteenth week. The percentage of African-American citizens is another interesting factor: it was not significant at the fifth week but became significant at the thirteenth week. Per-capita GDP was significant or marginally significant in all the models except the one for the number of deaths at the fifth week. We believe that this paper is an important first step towards a deeper understanding of the factors which influence the number of COVID-19 cases and deaths. Understanding these factors will help us better understand how to control the virus. However, the trajectory of this disease continues to evolve. In the United States, the first states to feel the wrath of this virus were in the Northeast back in March. Later, in the summer, we saw the Sun Belt states get hit, and as of the time of this analysis it appears to be the Midwest that is feeling the worst effects of the virus. Another factor to consider is that early in the onset of the virus, the death rate from the disease was extremely high, but it now seems to have decreased. It is clearly of interest, then, to see why the waves of infection hit different parts of the country at different times and why the death rate seems to be more stable now. Towards that end, in future research we would like to add a predictor variable that distinguishes between states that were hit by the virus early and those that were hit later, and see whether that changes the effects of the independent variables.
Another matter of interest is to investigate what differences, if any, exist in testing protocols and data collection methods across states and how they affect the models. One important factor in data collection methods is the difference between officially reported deaths due to COVID-19 and excess deaths (the difference between total statewide deaths and the average number of deaths in the past several years). If there are significant differences in this count between states, then that warrants a deeper investigation. Finally, this paper looked at two discrete points in time to model the number of cases and deaths for the different U.S. states. A time series analysis of our two dependent variables would add to the understanding of this new disease.
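The excess-death comparison described above amounts to a short calculation. The sketch below follows the definition given in the text (total deaths minus the average over several prior years); all figures are hypothetical and are used only to show the arithmetic.

```python
def excess_deaths(observed_total, prior_year_totals):
    """Observed deaths minus the average number of deaths over the prior years."""
    baseline = sum(prior_year_totals) / len(prior_year_totals)
    return observed_total - baseline


# Hypothetical state-level figures (illustrative only, not real data).
observed = 60_000                  # total deaths in the period of interest
prior = [52_000, 53_000, 54_000]   # totals for the same period in prior years
reported_covid = 5_500             # officially reported COVID-19 deaths

excess = excess_deaths(observed, prior)
gap = excess - reported_covid      # excess deaths not attributed to COVID-19
print(excess, gap)
```

A gap that differs markedly from state to state would point to differences in reporting rather than in the disease itself, which is the kind of discrepancy the proposed investigation would look for.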