Mathematical Model and Data Analysis to Determine the Number of Confirmed Infections Due to Covid-19 in Spain

Covid-19 first emerged in China, although cases of infection are now being identified retrospectively in Europe from January and February of this year, hidden within a strong influenza outbreak and not recognized at the time. What is certain is that in about a hundred days the virus has spread around the world, threatening humanity. There is a pressing need for a response that matches the speed at which the virus is spreading. In this work, several mathematical models are studied to determine the rate of growth of Covid-19 infections from data on the evolution of the pandemic in Spain. Of the models proposed and analyzed, the most suitable is a fourth-degree polynomial regression with an adjusted R-square statistic of 99.72%, which indicates a very close fit to the number of confirmed infections in Spain. Knowing these figures is vital for taking the most urgent health and social measures in an effective and orderly manner, which in turn can help avoid a large number of possible infections.


Introduction
In 1927, biochemist William Ogilvy Kermack and epidemiologist Anderson Gray McKendrick (Kermack & McKendrich, 1932; Kermack & McKendrich, 1923), both Scottish, published a paper that is still used to model epidemics of infectious diseases. The problem they studied was, and still is, one of the leading causes of death worldwide.
Consider that the 1918 influenza pandemic, also known as the Spanish flu, killed between 50 and 100 million people, while the death toll from World War I over the previous four years was less than 20 million. Kermack and McKendrick developed the so-called SIR model, in which the population is divided into "S" for susceptible, "I" for infected and "R" for recovered. The "S" compartment contains everyone who is not vaccinated (in the case of Covid-19, the entire population) and who may therefore become ill. The "I" compartment contains the infected, whose curve should be kept below the health-care capacity of the country because they are the ones who may require hospital care. The "R" compartment contains those who can neither infect nor be infected, and the dead are always counted here (Dahari et al., 2005; Dee & Shuler, 1997; Diekmann & Heesterbeek, 2000; Ellner et al., 1998). The sum of S, I and R is the total population. However, these models also have their limitations. The simplest SIR models make basic assumptions, for example that everyone has the same chance of catching the virus from an infected person because the population is perfectly mixed, and that people with the disease are equally infectious until they die or recover. More advanced models subdivide people into smaller groups (by age, sex, health status, employment, number of contacts, etc.) to establish who meets whom, when and where (Brand, 1957; Brauer et al., 2008; Cadarso & González, 2007; Canini & Perelson, 2014; Cañas, García & Andérica, 2003; Checkoway, Pearce, Crawford, 1989; Chen & Bokka, 2005; Clapham et al., 2016; Haerdle, 1993; Lofgren, 1993; Schafter & Kot, 1985).
[...] of those infected worldwide and 14.17% of those infected in Europe. The communities most affected in Spain by number of infected persons have been Madrid, Cataluña, Castilla-La Mancha, Castilla y León and the País Vasco.
It may sound strange, but counting the dead as "recovered" is part of the mathematical model that underlies most simulators used to show how the disease caused by the new coronavirus spreads around the world. For example, it is the model on which the interactive pandemic map of Johns Hopkins University in the United States is based, an institution that has become one of the leading statistical references in this health crisis (CNE, 2005; Sanglier, Robas & Jiménez, 2020a).
In this paper we analyze and present different mathematical models fitted to the total number of confirmed infections in Spain, using data from the Center for Coordination of Health Alerts and Emergencies of the Spanish Ministry of Health (MS, 2020).
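The compartment structure described above can be illustrated with a short simulation. The following is a minimal sketch of the SIR equations integrated with a forward-Euler step; the function name, the transmission rate `beta`, the removal rate `gamma` and the population size are hypothetical values chosen for illustration and are not estimated from the Spanish data.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Integrate the SIR equations with a simple forward-Euler scheme.

    beta (transmission rate) and gamma (removal rate) are per-day rates.
    They are illustrative assumptions, not values fitted to real data.
    Returns a list of (S, I, R) tuples, one per day, starting at day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0  # "recovered" also absorbs deaths, as in the classic model
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical scenario: one million people, ten initial cases.
trajectory = simulate_sir(population=1_000_000, initial_infected=10,
                          beta=0.4, gamma=0.1, days=120)

# Day on which the infected compartment reaches its maximum.
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
```

Because every person leaving one compartment enters another, S + I + R stays equal to the total population at every step, which is the conservation property stated in the text.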

Methods
Regression models are compared using the Statgraphics Centurion program. The objectives are: to calculate the covariance and Pearson's linear correlation coefficient between the two variables; to perform a linear regression analysis on the data; to determine which simple model, with or without a transformation of variables, best fits the data; to determine the statistically significant polynomial that best fits the data; and finally, to analyze the normality of the residuals of the best-fitting model (Armitage & Bery, 1994; Drake, 1998; Martínez-González, 2004; Montesinos & Hernández, 2007; Velasco, 2007; Canini & Perelson, 2014; Clapham et al., 2016; Sanglier, Robas & Jiménez, 2020b).
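The first objective, the covariance and Pearson's linear correlation coefficient, can be sketched as follows. This is an illustrative implementation using the sample (n - 1) definitions; the function names and the example series are assumptions made here, and the study itself obtained these statistics from Statgraphics Centurion.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divisor n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient, always between -1 and +1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Made-up series (not the data of Table 1): days and a growing case count.
example_days = [0, 1, 2, 3, 4, 5]
example_cases = [2, 3, 7, 12, 20, 31]
example_r = pearson_r(example_days, example_cases)
```

A coefficient close to +1 indicates a strong increasing linear relationship, but it does not by itself show that a straight line is the best model, which is why the curvilinear and polynomial fits are examined afterwards.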

Results
The numerical data have been extracted from the Health Alert and Emergency Coordination Centre of the Spanish Ministry of Health. A table of data for the study is attached (Table 1). Initially, a multivariate analysis is carried out of the evolution of the number of people infected between the dates shown in the table, from February 12 to April 25, 2020, the period with the greatest growth in the number of people infected by Covid-19. The statistical summary obtained is set out below. Table 2 shows the statistical summary for each of the selected variables. It includes measures of central tendency, variability and shape. Of particular interest are the standardized skewness and standardized kurtosis, which can be used to determine whether the sample comes from a normal distribution. Values of these statistics outside the range of -2 to +2 indicate significant deviations from normality, which would tend to invalidate many of the statistical procedures usually applied to these data. In this case, the time variable shows a standardized kurtosis outside the expected range.
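The -2 to +2 rule for standardized skewness and kurtosis can be sketched as follows. This is an approximation that assumes the usual large-sample standard errors, sqrt(6/n) for skewness and sqrt(24/n) for excess kurtosis, together with moment-based estimators; the exact small-sample formulas applied by Statgraphics may differ, so the values need not match Table 2.

```python
import math


def standardized_shape(values):
    """Return (standardized skewness, standardized kurtosis).

    Assumption: moment-based skewness and excess kurtosis divided by the
    large-sample standard errors sqrt(6/n) and sqrt(24/n).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness / math.sqrt(6.0 / n), excess_kurtosis / math.sqrt(24.0 / n)


def within_normal_range(values, limit=2.0):
    """True when both standardized statistics lie inside [-limit, +limit]."""
    z_skew, z_kurt = standardized_shape(values)
    return abs(z_skew) <= limit and abs(z_kurt) <= limit
```

An evenly spaced variable such as time in days is symmetric, so its skewness is zero, while its flat distribution gives a negative excess kurtosis.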
The box-and-whisker plot shows the relationship between the two variables analyzed.
Table 3 shows the correlations between each pair of variables. These correlation coefficients range from -1 to +1 and measure the strength of the linear relationship between the variables. The number of data pairs used to calculate each coefficient is shown in parentheses. The third number in each block of the table is a P-value that tests the statistical significance of the estimated correlations. P-values below 0.05 indicate correlations significantly different from zero at the 95.0% confidence level. The pair of variables analyzed here has a P-value below 0.05 and therefore shows a statistically significant correlation.
mas.ccsenet.org Modern Applied Science Vol. 14, No. 7; 2020
To determine the type of relationship between the two variables, we start by analyzing the simple linear regression model (Figure 2).

Figure 2. Simple linear regression graph
The fitted regression line does not follow the data closely. Nevertheless, the analysis of variance table shows that the model is statistically significant, since its P-value is less than 0.05, and the intercept and slope of the line (Y = a + bX) are also statistically significant, with P-values below 0.05, as shown in Table 4 below.
The ANOVA table presents the variability between and within groups. The between-group sum of squares measures the variability between the means of the factor groups. The within-group sum of squares measures the variability within each factor group. The total sum of squares measures the variability of all data with respect to the mean. In the regression setting, the between-group sum of squares corresponds to the variability explained by the model and the within-group sum of squares to the residual variability. The F-ratio is the between-group mean square divided by the within-group mean square. The P-value indicates the level of significance (it is the area to the right of F). Small values (less than 0.05) indicate that the sample/variable measurements are significantly different.
Table 5 shows the results of fitting several curvilinear models to the data. These are the models that can have an R-square higher than that of the linear model studied (R-square = 87.281%). Of the fitted models, the X-square model gives the highest R-square, 95.9984%. This is 8.71737 percentage points higher than the linear model.
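The comparison between the straight-line model and the X-square model can be reproduced on a small scale. The sketch below assumes that the X-square model has the form Y = a + bX^2, so that it can be fitted by ordinary least squares on the transformed predictor; the data are made up for illustration and are not those of Table 1, and the function name is hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns the coefficients, R-squared and the ANOVA F-ratio
    (model mean square over residual mean square, 1 and n - 2 d.f.).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_model = ss_total - ss_resid
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": ss_model / ss_total,
        "f_ratio": ss_model / (ss_resid / (n - 2)),
    }


# Illustrative data: a quadratic trend with a small alternating disturbance.
days = list(range(1, 11))
infected = [3 + 2 * d * d + (1 if d % 2 else -1) for d in days]

linear_fit = fit_simple_regression(days, infected)
# X-square model: regress Y on X squared instead of X.
x_square_fit = fit_simple_regression([d * d for d in days], infected)
```

On data with quadratic growth the transformed model reaches a higher R-square than the straight line, which mirrors the ranking reported in Table 5.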
Next, the X-square model, which has the best R-square of all the models compared, is fitted, and the graph below is obtained. The new model fits the data much better than the linear model initially proposed. The mean absolute error (MAE) of 13415.9 is the average of the absolute values of the residuals. The Durbin-Watson (DW) statistic examines the residuals to determine whether there is any significant correlation based on the order in which they appear in the data file. Since the P-value is less than 0.05, there is an indication of possible serial correlation at the 95.0% confidence level. Plotting the residuals against the row number can reveal whether any pattern is present.
A fifth-degree polynomial regression model is tried next. The data obtained from the analysis and the fitted model are as follows (Figure 4). In the table of T statistics, some coefficients, such as that of Time (Days)^5, have a P-value greater than 0.05, so they are not statistically significant and should be removed from the model. If this is done, we are left with the following data (Table 8).
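The two residual diagnostics mentioned above can be written out explicitly. The following sketch uses the standard definitions of the mean absolute error and of the Durbin-Watson statistic; it does not compute the associated P-value, which Statgraphics reports and which requires additional distribution theory.

```python
def mean_absolute_error(residuals):
    """Average of the absolute values of the residuals."""
    return sum(abs(e) for e in residuals) / len(residuals)


def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals.

    Values near 2 suggest no first-order serial correlation, values
    towards 0 positive correlation and values towards 4 negative.
    """
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator
```

Residuals that keep the same sign over long runs, as happens when a curve is fitted with too simple a model, drive the statistic towards 0.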
The model sought is the one given by equation 3, because it has the best R-square of all. The R-square statistic indicates that the fitted model explains 99.7743% of the variability in Number of Infected. The adjusted R-square statistic, which is more appropriate for comparing models with different numbers of independent variables, is 99.7212%. The standard error of the estimate shows that the standard deviation of the residuals is 4193.35. This value can be used to construct limits for new observations by selecting the Reports option from the text menu. The mean absolute error (MAE) of 3170.41 is the average of the absolute values of the residuals. The Durbin-Watson (DW) statistic examines the residuals to determine whether there is any significant correlation based on the order in which they appear in the data file. Since the P-value is less than 0.05, there is an indication of possible serial correlation at the 95% confidence level. Plotting the residuals against the row number can reveal whether any pattern is present.
To determine whether the order of the polynomial is appropriate, consider the P-value of the highest-order term, which is 2.0384E-8. Since this P-value is less than 0.05, the highest-order term is statistically significant at the 95% confidence level. For this reason, no lower-order model is considered.
We now analyze the errors of the model selected as the best fit, checking whether the residuals follow a normal distribution by means of normality and goodness-of-fit tests.
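The polynomial fit and its R-square statistics can be sketched without external libraries. The example below solves the least-squares normal equations by Gaussian elimination and computes the adjusted R-square as 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the degree. The data are synthetic, with time rescaled to [-1, 1] to keep the normal equations well conditioned; this is an illustration under those assumptions, not the procedure implemented in Statgraphics.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns (coefficients, lowest power first; R-squared; adjusted R-squared).
    """
    n = len(xs)
    size = degree + 1
    power_sums = [sum(x ** p for x in xs) for p in range(2 * degree + 1)]
    normal = [[power_sums[i + j] for j in range(size)] for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    coefficients = solve_linear_system(normal, rhs)
    fitted = [sum(c * x ** p for p, c in enumerate(coefficients)) for x in xs]
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    r_squared = 1.0 - ss_resid / ss_total
    adjusted = 1.0 - (1.0 - r_squared) * (n - 1) / (n - degree - 1)
    return coefficients, r_squared, adjusted


# Synthetic example: an exact quartic sampled at 11 rescaled time points.
true_coefficients = [1.0, 2.0, -1.0, 0.5, 0.1]
scaled_time = [-1 + 0.2 * i for i in range(11)]
response = [sum(c * t ** p for p, c in enumerate(true_coefficients))
            for t in scaled_time]
quartic_coefficients, quartic_r2, quartic_adj_r2 = fit_polynomial(
    scaled_time, response, degree=4)
```

The adjusted statistic penalizes each extra term, so it is the appropriate figure for comparing the fourth-degree and fifth-degree fits; deciding whether an individual coefficient is significant additionally requires its T statistic and P-value, which are not computed here.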
The following data are obtained from the normality tests (Table 9). The results indicate that the residuals can be adequately modelled with a normal distribution. The chi-square test divides the range of the residuals into 13 equally likely classes and compares the number of observations in each class with the expected number of observations. The Shapiro-Wilk test is based on comparing the quantiles of the fitted normal distribution with the quantiles of the data. The standardized skewness test looks for lack of symmetry in the data. The standardized kurtosis test checks whether the shape of the distribution is flatter or more peaked than the normal distribution.
Because the smallest P-value of the tests performed is greater than or equal to 0.05, the hypothesis that the residuals come from a normal distribution cannot be rejected with 95% confidence.
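The chi-square test with equally likely classes can be sketched as follows. The example fits a normal distribution to the residuals, places the class boundaries at its quantiles k/13 and compares observed with expected counts. It returns only the test statistic; the function name is an assumption made here, and the P-value reported in Table 9 would additionally require the chi-square distribution.

```python
from statistics import NormalDist, mean, stdev


def chi_square_equiprobable(residuals, classes=13):
    """Chi-square goodness-of-fit statistic against a fitted normal,
    using classes that are equally likely under that normal.

    Returns (statistic, observed counts per class).
    """
    fitted = NormalDist(mean(residuals), stdev(residuals))
    # Interior class boundaries at the k/classes quantiles of the fit.
    edges = [fitted.inv_cdf(k / classes) for k in range(1, classes)]
    observed = [0] * classes
    for value in residuals:
        index = sum(1 for edge in edges if value > edge)
        observed[index] += 1
    expected = len(residuals) / classes
    statistic = sum((count - expected) ** 2 / expected for count in observed)
    return statistic, observed


# Synthetic residuals placed at evenly spaced normal quantiles.
synthetic_residuals = [NormalDist().inv_cdf((k + 0.5) / 130) for k in range(130)]
chi2_statistic, class_counts = chi_square_equiprobable(synthetic_residuals)
```

With two parameters estimated from the data, the statistic is usually referred to a chi-square distribution with classes - 3 degrees of freedom; a small statistic, as for these synthetic residuals, gives no evidence against normality.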
If we now look at the goodness-of-fit tests, we get the following data (Table 10). These are the results of the various tests carried out to determine whether the residuals can be adequately modelled with a normal distribution. The chi-square test divides the range of the residuals into non-overlapping intervals and compares the number of observations in each class with the expected number based on the fitted distribution. The Kolmogorov-Smirnov test calculates the maximum distance between the cumulative distribution of the residuals and the CDF of the fitted normal distribution. In this case, the maximum distance is 0.11034. The other statistics compare the empirical distribution function with the fitted CDF in different ways. In some tests the P-value could not be determined accurately.
Because the smallest P-value of the tests performed is greater than or equal to 0.05, the hypothesis that the residuals come from a normal distribution cannot be rejected with 95% confidence.
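The Kolmogorov-Smirnov distance quoted above is the largest gap between the empirical distribution function of the residuals and the CDF of the fitted normal. A minimal sketch of that calculation follows; the function name is hypothetical and no P-value is computed, since with a mean and standard deviation estimated from the same data the usual Kolmogorov-Smirnov tables are only approximate.

```python
from statistics import NormalDist, mean, stdev


def ks_distance_to_normal(residuals):
    """Maximum distance between the empirical CDF of the residuals and
    the CDF of a normal distribution fitted to them."""
    fitted = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    n = len(ordered)
    # Gap just after each jump of the empirical CDF ...
    d_plus = max((i + 1) / n - fitted.cdf(x) for i, x in enumerate(ordered))
    # ... and just before each jump.
    d_minus = max(fitted.cdf(x) - i / n for i, x in enumerate(ordered))
    return max(d_plus, d_minus)


# Synthetic residuals placed at evenly spaced normal quantiles.
near_normal = [NormalDist().inv_cdf((k + 0.5) / 130) for k in range(130)]
near_normal_distance = ks_distance_to_normal(near_normal)
```

The distance always lies between 0 and 1, and values close to 0 indicate that the fitted normal CDF tracks the residuals closely.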

Discussion
Although the percentage of people infected by Covid-19 in the world is approximately 0.0447%, taking the world population as 7.88E9 people, the coronavirus pandemic has alarmed humanity as a whole. It is the speed of its spread around the world that has put us on alert (Munayco, 2009; Al-Rousan, 2020; WHO, 1994). To carry out this study, the Statgraphics Centurion program was used to find the regression model that best relates two variables, in this case the number of people infected by Covid-19 in Spain and time. The results of simple linear regression were compared with those of linear models with transformed variables and of polynomial regression models. The normality of the residuals of the selected model was also analyzed.
We started with a multivariate analysis of the two variables, which showed that one of them had a standardized kurtosis outside the expected range. Nevertheless, the two variables were shown to be well correlated.
The first model proposed was a simple linear regression model with an R-square of 87.281%. The model is statistically significant, as its P-values are less than 0.05. Equation 1 was obtained for this model. It was then compared with 27 alternative models, of which the best was the X-square model, with an R-square of 96%, an improvement of about 8.72 percentage points over the initial model. Equation 2 was obtained for this model.
A polynomial regression model of order five was then tested. Analysis of the statistical data showed that the fifth-degree coefficient had a P-value above 0.05, so the order was lowered to four and a fourth-degree polynomial model was fitted. This model gave the best fit, with an R-square of 99.77% (99.72% adjusted). It was selected as the best model, and equation 3 was obtained as the final result.
Finally, the residuals of the model were studied to see whether they could be modelled with a normal distribution. The results obtained confirmed this assumption.
In this research article, different regression models have been examined in order to find the one that best fits the variables studied, namely the number of infected people in Spain during a period of maximum infection as a function of time.
After discussing and analyzing the models obtained, it has been found that the model that best fits the number of confirmed infected persons in Spain is a fourth-order polynomial regression model. The study of models for the number of deaths and recoveries, two very important parameters in studies of the evolution of the virus in pandemics of this type, is left for future work.
The development of this type of mathematical model is necessary to help governments adopt prevention measures rapidly, both clinical and social, and thereby reduce the number of infected people. This would in turn relieve pressure on hospitals and on health systems in general.