Validation of a Mathematical Model Applied to Four Autonomous Communities in Spain to Determine the Number of People Infected by Covid-19

In this study, the application of a mathematical method already developed in other articles for the study of the speed of propagation of the number of people infected by Covid-19 has been validated. The proposed mathematical model has been contrasted with models based on quantitative prognostic methods in order to seek its validation. At the same time, the forecast errors of the proposed methods and the mathematical model have been calculated and compared. The results obtained have been applied to the four autonomous communities in Spain with the highest number of coronavirus infections during the study period, obtaining very satisfactory results and achieving very good approximations to the real data on the number of infected people in all the autonomous communities studied.


Introduction
By way of brief scientific background, coronaviruses are a large family of viruses that can cause diseases ranging widely in severity in people. The first known serious illness caused by a coronavirus emerged with the 2003 Severe Acute Respiratory Syndrome (SARS) epidemic in China. A second serious disease outbreak, the Middle East Respiratory Syndrome (MERS), was detected in Saudi Arabia in 2012.
In late December 2019, Chinese authorities alerted the World Health Organization (WHO) to an outbreak of a new strain of coronavirus, subsequently named SARS-CoV-2, that was causing a serious, rapidly spreading disease. Until 20th February 2020, almost 167,500 cases of COVID-19 have been documented worldwide. However, many mild cases may not have been diagnosed, either because the symptoms were very mild or because the tests and equipment needed to diagnose them were not available, and it is estimated that the actual number of people infected may be as much as ten times higher. So far, the virus has killed more than 6,600 people worldwide, a number that serves as a global warning.
In Spain, the coronavirus disease pandemic has spread throughout almost all of the territory, and the country currently has the second highest number of confirmed cases and the second highest number of deaths, behind Italy. The first positive diagnosis was confirmed on 31 January 2020 on the island of La Gomera, while the first death occurred on 13 February in the city of Valencia, a fact that became known twenty days later. As of 4 April 2020, there are 124,736 confirmed cases in Spain, of which 34,219 have been discharged and 11,814 have died according to the Spanish authorities, the majority of the deceased being persons over 65 years of age (CCAyES, 2020). In the region of Madrid, the rate is 542 positive cases per 100,000 inhabitants, while in the community of La Rioja this rate rises to 765 confirmed cases per 100,000 inhabitants. The main outbreak of the epidemic during the first weeks was in the town of Torrejón de Ardoz, Madrid. The Canary Island of La Graciosa is, at the moment, the only area in the whole country without any confirmed case (CCAyES, 2020).
During the month of March, the virus spread very quickly, forcing the governments of the different Autonomous Communities (AC) to take urgent measures, mainly health-related, in the most affected areas. Finally, on March 14, the Spanish government decreed a state of alarm throughout the national territory, which remains in force to date.
The study carried out in this work uses the statistical data provided by the CCAyES (Centre for the Coordination of Health Alerts and Emergencies of the Ministry of Health) for the period from 25 February to 29 March 2020. It covers the Spanish Autonomous Communities that had the greatest number of infected people: Madrid, Cataluña, the País Vasco and La Rioja.

Experimental Procedures
As indicated above, the mathematical model applied in this article was developed previously by some of the authors (Sanglier, Robas & Jimenez, 2020).
In that article, it was commented that viral infections of the respiratory tract are common acute diseases among the human population, and that the transmission of the virus, whether by direct or indirect routes, occurs in the most dispersed areas of the world, and that a more in-depth analysis would lead to a consideration of how the transmission of these viruses can have a broad impact on public health.
The development of the mathematical model took into account various meteorological factors: ambient temperature (θ), air currents (Ca) related to ventilation processes and air flows, air humidity or absolute humidity (H) and rainfall (Pr). It is possible that these meteorological factors play a more important role in some regions than others (CDS, 2009; Checkoway, Pearce & Crawford-Brown, 1989; Chen, Xu & Salyards, 2012; Khiabanian, Farrel, St George & Rabadan, 2009).
The mathematical model was also developed taking into account non-environmental effects: family and social structures (Efs), seasonal changes in behavior (Ce) and pre-existing immunity (Ip), which have been considered to also play an important role in the transmissibility of respiratory viruses and infection rates (Lofgren et al., 2007; Lowen, Mubareka, Steel & Palese, 2007; McKinney, Gong & Lewis, 2006; Pica & Bouvier, 2011).
Below are the environmental and non-environmental data belonging to the four autonomous communities that have been introduced into the mathematical model proposed for obtaining part of the final data (Table 1). Starting from the variables previously indicated, and using the mathematical tool of Classical Dimensional Analysis (CDA), we proceeded to determine the mathematical model that was already deduced, and that will be used in this article to determine the number of infected people in the four Spanish communities most affected by the epidemic (Velasco, 2007).
The population and surface area data for the different autonomous communities have been obtained from the National Institute of Statistics (INE) for the year 2019 (INE, 2020). With these parameters it has been possible to determine the population density (inhabitants/km²) in each autonomous community.
Meteorological data related to ambient temperature, absolute humidity and wind speed have been obtained from the State Meteorological Agency (AEMET) (AEMET, 2020). With regard to the data on the parameter average annual wind speed, it should be noted that these have been collected at 30 m from the ground as this is considered the height with the least influence in order to subsequently determine the air mass flow for a given surface. The time-related Ce parameter should be entered in days since this is the time period in which the study has been carried out.

Mathematical Model
In this study, the application of a mathematical model already developed in another article for the study of the speed of propagation of the number of people infected by Covid-19 has been validated. The equation that was obtained is shown below (Sanglier, Robas & Jimenez, 2020).
(1)
The first term of the formula has been determined by Classical Dimensional Analysis (Brand, 1957; Alhama & Madrid, 2011; Kunes, 2012), which was applied to the different parameters that compose it. The second term is an adjustment term for the first part, where F is determined as a function of the actual death curve data. The compensation variable F is obtained by iteration, so that the curve of the proposed model fits the real data model as closely as possible. The results obtained are compared by pairs of points (real model and proposed model), and the value of F is chosen according to the best approximation.
The values of parameters a and b of the proposed model correspond to the values obtained from making a linear adjustment to the real data curve of the deceased in Spain due to Covid-19.
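The linear adjustment mentioned above can be illustrated with an ordinary least-squares fit. The following is a minimal sketch only: it assumes that a is the intercept and b the slope of a line y = a + b·x, which the text does not state, and the series used is synthetic rather than taken from the study. The function name `linear_fit` is hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b).

    Assumption: a is the intercept and b the slope. The source only says
    that a and b come from a linear adjustment to the real data curve.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic, illustrative series (NOT data from the study).
days = [1, 2, 3, 4, 5]
counts = [12.0, 19.0, 31.0, 38.0, 50.0]
a, b = linear_fit(days, counts)
print(f"a = {a:.2f}, b = {b:.2f}")
```

In practice the same fit would be run on the official daily series for the period studied.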
The other parameters that appear in the formula are related to meteorological factors and non-environmental effects. Meteorological factors have been associated with virus infection rates, such as ambient temperature (θ), air currents (Ca) referring to ventilation processes and air flows, air or absolute humidity (H) and precipitation (Pr). All these data have been obtained from the INE-Instituto Nacional de Estadística (INE, 2020). It is possible that meteorological factors play a more important role in some regions than in others. Non-environmental effects such as family and social structures (Efs), seasonal changes in behaviour (Ce) and pre-existing immunity (Ip) could also play an important role in the transmissibility of respiratory viruses and infection rates (Pica & Bouvier, 2012).
Using the above equation, the successive data on the number of infected persons in each Spanish Autonomous Community analyzed for the different days of the study period have been determined.
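The iterative adjustment of the compensation factor F described in this section can be sketched as a simple scan over candidate values. This is an illustration under stated assumptions, not the authors' procedure: the closeness criterion (sum of absolute point-by-point differences), the candidate grid, the function names and the placeholder curve `toy_model` are all hypothetical, and `toy_model` is not equation (1).

```python
def calibrate_factor(model, days, observed, candidates):
    """Return (best_f, best_cost) over the candidate values of F.

    Assumed criterion: the sum of absolute differences between the model
    curve and the observed points, compared pair by pair.
    """
    best_f, best_cost = None, float("inf")
    for f in candidates:
        cost = sum(abs(model(d, f) - y) for d, y in zip(days, observed))
        if cost < best_cost:
            best_f, best_cost = f, cost
    return best_f, best_cost


def toy_model(day, f):
    # Placeholder curve for illustration only; NOT equation (1) of the paper.
    return f * day ** 2


# Synthetic observations (NOT data from the study).
days = [1, 2, 3, 4]
observed = [2.1, 7.9, 18.2, 31.8]
candidates = [k / 10 for k in range(10, 31)]  # 1.0, 1.1, ..., 3.0

best_f, best_cost = calibrate_factor(toy_model, days, observed, candidates)
print(best_f, round(best_cost, 3))
```

A finer grid, or a different closeness criterion such as squared differences, could be substituted without changing the structure of the scan.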

Methodology
The following diagram shows the methodology followed in this work in order to make the necessary adjustments to accurately determine the number of infected (Figure 1).
Figure 1. An organization chart of the process carried out to accurately determine the number of actual coronavirus infections in the four communities studied.
The study started from the environmental and non-environmental data, which were introduced into the first term of the model formula to determine the first two terms of the model. The second term of the formula has been used to obtain the rest of the terms of the mathematical model. The factor 'F' that appears in the equation of the mathematical model is a compensation factor, as indicated above. To validate the proposed mathematical model with respect to the real one, it has been compared with mathematical models based on quantitative prognostic methods, namely the weighted average and simple exponential smoothing for values of 0.3, 0.5 and 0.7 of the smoothing constant alpha (α).

Forecasting Methods and Errors
These methods aim to eliminate random fluctuations in the time series by providing less distorted data of the actual behaviour of the time series.
A time series, also called a historical series, is a set of numerical data obtained at regular, specific intervals over time. This type of method is appropriate for a stable time series, that is, one that does not exhibit significant trend, cyclical or seasonal effects (Brauer, 2008; Munayco, 2009; Martin del Rey, 2009; Bowerman et al., 2007; Canovas, 2003; Diebold, 2001; Diekmann & Heesterbeek, 2000; Dowell, 2001; Elandt-Johnson, 1975).

Prognosis of the Number of Infected with the Weighted Average Method
This method discards the oldest data and considers the most recent, as with simple moving averages. The difference is that the historical data are weighted, that is, they are assigned different weights. In the model, the weights have been 40%, 30% and 30% for the most recent, intermediate and furthest periods respectively, using the n=3 most recent periods.
For the calculation of the forecast using this methodology, the following formula (IE, 2016) has been used:
$$F_t = \sum_{i=1}^{n} w_i \, D_{t-i}$$
where:
F_t: forecast for the next period (day) t.
D_t: observed value of the number of infected people in period (day) t.
w_i: weight or weighting for the observed value of the number of infected in period t-i.
n: number of periods included in the average (n=3 in this work).
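The weighted average forecast can be sketched in a few lines. The example below assumes the usual weighted moving average, in which the forecast is the sum of w_i times the observation i periods back, with the 40%, 30% and 30% weights given in the text. The function name `weighted_moving_average` and the input series are hypothetical; the series is synthetic.

```python
def weighted_moving_average(history, weights=(0.4, 0.3, 0.3)):
    """Forecast the next period from the last len(weights) observations.

    weights[0] applies to the most recent observation, weights[1] to the
    one before it, and so on. The weights are expected to sum to 1.
    """
    n = len(weights)
    if len(history) < n:
        raise ValueError("not enough observations for the chosen weights")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Take the last n observations and put the most recent one first.
    recent_first = history[-n:][::-1]
    return sum(w * d for w, d in zip(weights, recent_first))


# Synthetic daily counts (NOT data from the study).
infected = [100, 130, 170, 220]
forecast = weighted_moving_average(infected)
print(forecast)  # 0.4*220 + 0.3*170 + 0.3*130
```

Only the three most recent values enter the forecast, so the first observation (100) has no influence here.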

Prognosis of the Number of Infected with the Simple Exponential Smoothing Method
To generate a forecast by this method, three inputs are needed: the most recent forecast, the number of infected people for that day, and the smoothing constant alpha (α). Each time the forecast is calculated, the previous observation is removed and replaced by the most recent number of infected, which is what makes this method of interest (IE, 2016).
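The prognosis has been calculated using the following formula:
$$F_t = F_{t-1} + \alpha \, (A_{t-1} - F_{t-1})$$
where:
F_t: forecast for period (day) t.
F_{t-1}: forecast for the previous period.
A_{t-1}: observed number of infected in the previous period.
α: smoothing constant.
A_{t-1} - F_{t-1}: error of the previous forecast.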
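As an illustration of simple exponential smoothing, the sketch below applies the standard recurrence F_t = F_{t-1} + α(A_{t-1} - F_{t-1}) for the three smoothing constants used in this work. Seeding the first forecast with the first observation is an assumption, since the text does not say how the series was initialised. The function name and the input series are hypothetical; the series is synthetic.

```python
def simple_exponential_smoothing(actuals, alpha, initial_forecast=None):
    """Return forecasts F_0..F_n for observations A_0..A_{n-1}.

    The last element is the forecast for the period after the data ends.
    Assumption: F_0 defaults to the first observation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if not actuals:
        raise ValueError("at least one observation is required")
    first = actuals[0] if initial_forecast is None else initial_forecast
    forecasts = [float(first)]
    for actual in actuals:
        previous = forecasts[-1]
        # New forecast = previous forecast + alpha * previous forecast error.
        forecasts.append(previous + alpha * (actual - previous))
    return forecasts


# Synthetic daily counts (NOT data from the study).
infected = [100, 130, 170, 220]
results = {a: simple_exponential_smoothing(infected, a) for a in (0.3, 0.5, 0.7)}
for a, series in results.items():
    print(a, [round(f, 2) for f in series])
```

A larger α gives more weight to the latest observation, so on a rising series the α=0.7 forecasts track the data more closely than those for α=0.3.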

Error Calculation Method
The following table presents the forecast errors determined in this work, as well as the formulas used in their calculation (Table 2).
Table 2. Error measurement factors and applied formulas
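The four error measures used in this work (CFE, MAD, MSE and MAPE) can be computed as in the sketch below. It uses the common textbook definitions, with the error taken as actual minus forecast; these are assumed to correspond to the formulas of Table 2, which are not reproduced in this text. The function name and the input values are hypothetical; the values are synthetic.

```python
def forecast_errors(actuals, forecasts):
    """Return CFE, MAD, MSE and MAPE for paired actual/forecast values.

    Assumed definitions (error e_t = actual_t - forecast_t):
      CFE  = sum of e_t
      MAD  = mean of |e_t|
      MSE  = mean of e_t squared
      MAPE = 100 * mean of |e_t| / actual_t
    """
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actuals):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [a - f for a, f in zip(actuals, forecasts)]
    n = len(errors)
    return {
        "CFE": sum(errors),
        "MAD": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / n,
    }


# Synthetic values (NOT data from the study).
actual = [100, 200, 400]
predicted = [90, 210, 360]
metrics = forecast_errors(actual, predicted)
for name, value in metrics.items():
    print(name, round(value, 2))
```

Because the MSE squares each error, the single error of 40 dominates it, whereas the MAD weights all three errors equally.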

Results
In this section, the results obtained from the proposed mathematical model are presented in data tables and compared with the real data provided for the four Autonomous Communities studied (Tables 4, 5, 6 and 7).
Table 3. Comparison of the real data and those obtained by the mathematical model for the communities studied in Spain
As can be seen from the table above, the actual data in some cases are very similar to the data obtained by the proposed mathematical model applying the equation described in section 2.2.
Below, the data collected in the previous table are shown in graphic form, which illustrates what was mentioned above (Figure 2).
Vol. 14, No. 6; 2020
Next, the data obtained (actual deaths) will be compared with those obtained with the proposed mathematical model and with the quantitative prognosis models indicated above, giving the data reflected in the following tables (Tables 4, 5, 6 and 7) for the different Spanish communities under study. The first column shows the initial and final dates of the study, the second the number of days elapsed until the end, the third the data obtained from actual deaths (INE, 2020), and the fourth the data calculated with the proposed model. The fifth, seventh and ninth columns correspond to the data calculated with the simple exponential prediction model for the smoothing factors of 0.3, 0.5 and 0.7 respectively, and the sixth, eighth and tenth columns show the prediction errors corresponding to these three cases of the exponential model. Finally, the last two columns correspond to the data obtained for the averaging model and its prediction error (IE, 2016).
Table 7. Comparison of actual data and forecast models with prediction error for the Community of La Rioja
All these data will be graphed in the discussion part of this paper in order to compare the different models against the real model (built with real data on the number of deaths). A study of forecast errors will also be carried out, taking into account four measurement factors.
The calculation of the forecast error supports decision making when choosing between the different forecast methods used in this work. In addition, it makes it possible to detect when the forecast of the speed of propagation of the virus, as in this case, does not fit the real data obtained from reliable sources such as the INE. This allows decisions to be adapted towards a better solution. Forecasting errors occur due to systematic and random errors. The former are caused by a constant error: misinterpretation of the problem, not using the right variables or using inappropriate relationships between them. This error is minimized through the experience of the person performing the calculation or of the specialist scientist (Enders, 2008; Farnum, 1989; Guisande, Vaamonde & Barreiro, 2011; Gujarati & Porter, 2010; Hanke & Wichen, 2006; Kirklang & Viguerie, 1997; LaPointe, 2004; Levine, Sthepan, Krehbiel & Berenson, 1997; Montesino & Hernández, 2007).
Below are the graphs obtained for the numerical data reflected in Tables 4, 5, 6 and 7 for the four Spanish communities analyzed. The four graphs show the curves corresponding to the real model, the proposed mathematical model and the four forecast models studied (the average model and the three simple exponential smoothing models for the different values of the smoothing coefficient). In general, it can be seen from the four graphs that the forecast model based on exponential smoothing is the one that comes closest to the real data curve (strong orange curve). The model based on the average (blue curve) always moves away from the real data curve. The proposed mathematical model (grey curve) likewise departs from it, from below, except in the final days of the study, when it tends to follow the number of real infected people well.
With regard to the prognosis errors obtained for the Spanish Autonomous Communities analysed, the following data have been obtained (Table 8).
The cumulative forecast error (CFE) is the basic measure and the one that gives rise to the others. It is the cumulative sum of all previously calculated forecast errors, and it allows the bias of the forecast to be evaluated. If, over the analysis period, the actual value of the rate of spread of the infection is greater than the forecast value, the CFE will be higher, which is indicative of a systematic error in determining or calculating the rate of spread of the infection. Therefore, it is of interest to obtain values below the forecast value. In this study, the comparison has been made against the values obtained by means of the mathematical model. This occurs for all the communities analysed, except for some cases in the community of La Rioja.
The mean absolute deviation (MAD) measures the dispersion of the forecast error, expressing the error in units. According to the data in the tables, the least dispersion occurs for the forecast models based on simple exponential smoothing with a smoothing factor of 0.7 in the four Autonomous Communities.
The mean square error (MSE) is also a measure of the dispersion of the forecast error but, unlike the previous measure, it magnifies the error by squaring it, penalizing the periods in which the difference was higher. The use of this parameter is highly recommended for study periods with very small deviations (MAD). The results agree with those of the previous parameter: the least dispersion is produced by the same forecast model in the four communities analyzed.
Finally, the mean absolute percentage error (MAPE), or percentage of mean absolute error (PEMA), gives the deviation in percentage terms and not in units as in the previous parameters. As can be seen from the tables above, for the Community of Madrid, the model that best fits is the simple exponential smoothing model with a coefficient of α=0.7. However, the MAPE is better for the proposed mathematical model.
For the Communities of Cataluña, País Vasco and La Rioja, the outcome is the same as for the Community of Madrid: the model that best fits the real data of the number of infected is the simple exponential smoothing model with α=0.7, as it gives lower errors for CFE, MAD and MSE. On the other hand, as far as MAPE is concerned, in Cataluña very similar results appear between the proposed model and the exponential models for the three alpha values studied. For the community of La Rioja, very similar MAPE results appear for the exponential models, but they are further away from the value obtained for the real model. For the País Vasco, the best MAPE corresponds once again to the proposed model.
Finally, the mean absolute percentage error (MAPE), also known as the mean absolute percentage deviation (MAPD), measures the accuracy of a method for constructing fitted time series values in statistics, in our case with respect to the real data of infected persons (real model). On this measure, the proposed model is better in two of the four cases studied (Madrid and the País Vasco), and in a third case (Cataluña) it can be concluded that the adjustment of the proposed mathematical model is satisfactory with respect to the real data model.

Discussion
Mathematical modelling of communicable infectious diseases is attracting increasing interest, and major innovations can be expected in the coming years, especially if their use is extended and applied to "neglected" communicable diseases or other health problems. It is a challenge for countries to invest money and apply it to research in these fields of epidemiology, virology and mathematics.
The equation of the proposed mathematical model presented in this paper for validation, obtained in part through classical dimensional analysis in a previous work, is based on environmental and non-environmental parameters that have been studied as very influential in the spread of communicable diseases of this type (Murray & Morse, 2011; Sadique et al., 2007; Shaman & Kohn, 2009; Velasco, 2007).
The cumulative sum of forecast errors (CFE) indicates that no values below the prognosis have been obtained, although this is with respect to the values obtained through the mathematical model and not the actual data model. This occurs for all the communities analysed, except for some cases in the community of La Rioja.
As for the deviations of the forecast error, the smallest values are produced by the forecast model that uses simple exponential smoothing with a smoothing factor α=0.7.
Among the prognostic methods studied in this work, simple exponential smoothing with a smoothing factor α=0.7 is the method with the closest fit to the real model, which is fed with real data on people infected by coronavirus in the Spanish autonomous communities proposed for analysis. However, this method is based only on numerical data from the real model and does not take into account other types of factors or situations that might require a more sensitive model. The proposed mathematical model shows the best results in terms of mean absolute percentage error, a parameter that measures the accuracy of its adjustment to another model, in this case the real one. In contrast, the simple exponential smoothing forecast model, for the values of the smoothing constant analyzed, shows a better fit to the real one, as confirmed by the values obtained for the forecast error, the mean absolute deviation and the mean square error in all the Autonomous Communities studied.
Therefore, it is possible to validate the equation of the mathematical model presented, on the basis of the data obtained for the communities analyzed and for the period of time studied.