Analysis of Data on Socio-Demographic and Clinical Factors of the COVID-19 Coronavirus Epidemic in Spain in Recovered and Deceased Cases

This study examines socio-demographic and clinical factors to determine which are most significant and have the greatest influence on the speed of spread of the virus, taking into account the cases of people who have died and recovered in Spain. The objectives have been to analyze the influence of socio-demographic and clinical factors on the speed of propagation of Covid-19, to determine the most relevant factors and to propose studies determining the prevalence of the disease. The Chi-square test, supported by the statistical program Statgraphics Centurion XVI, has been used to determine whether or not the different variables studied are related to the speed of propagation of the virus. For the clinical variables, a cluster study has been carried out to examine their dependence. Very relevant conclusions have been obtained about the age factor in the different bands analyzed, as well as about the small influence of people's economic position on the speed of propagation of the virus. High population density and the areas studied are not always indicative of further spread of the disease. A linear function has been determined that links the clinical parameters studied and could be used in subsequent prevalence and seroprevalence studies. The fundamental variables in the study of the coronavirus have been identified according to socio-demographic and clinical factors. We also point to environmental factors that remain to be studied.


Introduction
Although the epidemic has spread throughout the world, Spain has been one of the countries hardest hit by the pandemic. The first patient with Covid-19 registered in Spain was reported on January 31 on the island of La Gomera. Nine days later another case was detected on the island of La Palma. It was not until February 24 that the virus jumped to the mainland, with the first cases being detected in the communities of Madrid, Catalonia and Valencia (Sanglier, Robas & Jimenez, 2020).
The real-time tracking of the coronavirus by Johns Hopkins University continues to record rising numbers. Today, the number of infections in the world amounts to more than 3.3 million, with more than 239,000 deaths. Spain now ranks fourth in the world, after France, the United Kingdom and Italy.
To date, Spain presents the following data: infections confirmed by PCR (Polymerase Chain Reaction) tests (216,582), deceased (25,100) and recovered (117,248). Their evolution from 31 January to 2 May is shown in the following graph. We look for the possible reasons for the rapid spread of the coronavirus in the different communities of Spain. Researchers are analysing socio-demographic factors (Chen et al., 2012; Clapham et al., 2016), environmental factors (Shaman & Kohl, 2009; van Regenmortel, 2000; Pica & Bouvier) and clinical factors (Yu et al., 2004; Zhu et al., 2019) among the main causes of the spread of the virus.
This research studies the effect of several variables contained in the factors mentioned above and their involvement in the spread of Covid-19 in Spain. It focuses on variables such as sex, age, area (community), social level (income), etc. and their effect on the number of deaths and recoveries in Spain. The Chi-square methodology will be used to determine whether the variables analysed have an influence on the spread of the virus. Based on the variables analysed and the results obtained on their influence on the spread of the virus, a prediction model could be determined that will help in the study of this type of pandemic (Canini & Perelson, 2014; Checkoway, Pearce & Crawford-Brown, 1989).

Materials and Method
The study used data provided by the Ministry of Health's Alert and Emergency Coordination Centre for cases of infection by Covid-19 disease from 23 March to 25 April, as these are the dates when the number of infected, dead and recovered cases first began to rise and then decline in a controlled manner.
The Chi-square test has been used to analyse the data collected and to assess the impact of the different variables to be taken into account in the study of the speed of the epidemic's spread.
The Chi-square or Pearson's test is a statistical test that allows us to recognize the association between two categorical variables, whether they are dichotomous or polytomous. It is used to check the relationship between two variables in a contingency table that presents the (multivariate) frequency distribution of the variables. This test, also known as the independence test, is used to test hypotheses about categorical variables and to check whether or not these variables are independent in the population (Dahari et al., 2005; De Wit, van Doremalen, Falzarano & Munster, 2016). The Chi-square test will be applied to categorical variables such as sex, age band and region, cross-classified against the number of deaths or the number of people recovered according to the statistical data collected.
The calculated Chi-square value is obtained from the formula:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ is the observed frequency and $f_e$ the expected frequency. To determine the critical Chi-square, the level of significance and the degrees of freedom are taken into account.
Vol. 14, No. 8
The comparison between the calculated Chi-square and the critical one determines whether the null hypothesis or the alternative hypothesis concerning the independence or not of the variables under study holds. The association is considered significant if its probability (P) does not exceed 1 in every 1,000 cases (0.001); otherwise it is considered not significant and the association between the variables is rejected. We can also work at a 95% confidence level: if the P-value is less than 0.05, the variables compared are not independent of each other; otherwise they are independent.
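As a minimal sketch of the calculation just described, the following Python function computes the Chi-square statistic and the degrees of freedom for a contingency table, deriving each expected frequency from the row and column totals. The function name and the small table of counts are hypothetical illustrations, not data from this study; the critical value 3.841 is the standard tabulated value for 1 degree of freedom at the 0.05 level.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected frequency if rows and columns were independent.
            f_e = row_totals[i] * col_totals[j] / grand_total
            statistic += (f_o - f_e) ** 2 / f_e

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical 2x2 table (rows: two groups, columns: two outcomes).
example = [[10, 20], [20, 10]]
statistic, dof = chi_square_statistic(example)

# Tabulated critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_1_DOF_005 = 3.841
print(statistic, dof, statistic > CRITICAL_1_DOF_005)
```

When the calculated statistic exceeds the critical value, the hypothesis of independence is rejected at the chosen level; this mirrors the comparison between the calculated and critical Chi-square described in the text.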
Cross-tabulation has been used to understand the relationship between the independent variables and the cases of deceased and recovered persons. This method is applied between a dependent variable and an independent variable to understand how the dependent variable changes in relation to the independent variables (Dee & Shuler, 1997; Diekman & Heesterbeek, 2000; Montesinos-López & Hernández Suárez, 2007).
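The cross-tabulation step can be illustrated with a short Python sketch that counts how often each pair of categories occurs and expresses every cell as a percentage of the grand total. The record list, the function names and the choice of table percentages are assumptions made for illustration; the study itself used Statgraphics.

```python
from collections import Counter


def cross_tabulate(records):
    """Build a contingency table from (row_category, column_category) pairs."""
    counts = Counter(records)
    rows = sorted({row for row, _ in records})
    cols = sorted({col for _, col in records})
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, table


def table_percentages(table):
    """Express every cell as a percentage of the grand total."""
    total = sum(sum(row) for row in table)
    return [[100.0 * cell / total for cell in row] for row in table]


# Hypothetical individual records: (sex, outcome).
records = [
    ("female", "recovered"),
    ("female", "deceased"),
    ("male", "recovered"),
    ("male", "recovered"),
]
rows, cols, table = cross_tabulate(records)
print(rows, cols, table)
print(table_percentages(table))
```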

Cases of Recovered and Deceased People
As of April 25, the number of confirmed cases of people infected with Covid-19 in Spain was 194,333, an increase of 1.5%. The total number of cases recovered was 78,430 and the number of deaths was 14,629.
For this study, carried out between 23rd February and 24th April for the reasons already mentioned, it has been determined that the number of people infected was 176,311, of which 99,005 were women and 77,306 men.
To study the effect of the independent variable sex on the number of persons cured or recovered, it was found that 30,910 of the recovered cases were women and 40,555 were men.
Figure 2. Distribution of deceased, undeceased, recovered and unrecovered cases based on sex.

Sex Influence
The Chi-square test will be used to determine the impact of the sex variable on the number of people recovered and on the number of deaths (Lee & Storch, 2014; Center for Disease Control and Prevention, 2009; Cember, 1985). The frequencies are collected in Table 1. In the test of independence, a statistic of 8377.392 was obtained, with 3 degrees of freedom and a P-value of 0.0000. Since this value is less than 0.05, the hypothesis that the variables in the rows and columns are independent can be rejected at a 95% confidence level. Therefore, the variable observed in each row is related to its column. This means that the sex variable has a statistically significant association with the number of cases of recovered and deceased persons. The number of deceased men is slightly higher than that of women, as is the number of recovered men; this is also observed in figure 2.
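To make the test more concrete, the sketch below applies the shortcut formula for a 2x2 table to the infected and recovered counts by sex quoted in the preceding paragraphs. The "not recovered" cells are obtained here by subtraction, which is an assumption of this illustration, and the table is not the 2x4 table behind the reported statistic, so the result is not expected to reproduce 8377.392. The function name is hypothetical, and 3.841 is the standard tabulated critical value for 1 degree of freedom at the 0.05 level.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Counts quoted in the text.
women_infected, men_infected = 99_005, 77_306
women_recovered, men_recovered = 30_910, 40_555

# Assumption: "not recovered" = infected minus recovered, for each sex.
statistic = chi_square_2x2(
    women_recovered, women_infected - women_recovered,
    men_recovered, men_infected - men_recovered,
)

CRITICAL_1_DOF_005 = 3.841  # tabulated value, 1 degree of freedom, alpha = 0.05
print(round(statistic, 1), statistic > CRITICAL_1_DOF_005)
```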
For the analysis of the age variable with respect to the cases of infected persons, recovered persons and deaths, 9 age ranges have been taken into account for men and women (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80 onwards). The data are collected in the graph in figure 3.
Figure 3. Distribution of infected, deceased and recovered cases based on age

Age Influence
As in the case of sex, the Chi-square test will be used to determine the impact of the age variable (in its different year bands) on the number of people recovered and on the number of deaths (Abrego & Del Río, 1998; Khiabanian, Farrel, St George, & Rabadan, 2009; Lee & Storch, 2014; Konca et al., 2017).
The following table of frequencies shows the data entered for the respective age ranges for men and women, as well as all the weights obtained, which give an idea of the influence of each variable with respect to the number of infected, recovered and dead people (Table 2). In the independence test, a statistic of 23917.882 was obtained, with 34 degrees of freedom and a P-value of 0.0000. Since this value is less than 0.05, the hypothesis that the variables in the rows and columns are independent can be rejected at a 95% confidence level. Therefore, the variable observed in each row is related to its column. This means that the age variable has a statistically significant association with the number of cases of infected persons, recovered persons and deaths.
The number of cases of infected women shows two significant increases: in those between 40 and 49 years old, with 5.69%, and in women over 80 years old, with 9.71% of cases. In men, the infection is more uniform between 50 and 80 years of age or more, with data ranging from 5.32% to 5.95%.
The number of recovered cases shows an increasing trend in women, from 1.09% in the 40-49 age group to 3.51% in the 80+ group. For men, the trend is upwards from 1.54% to 3.80% across the 40-79 age groups, with a drop to 3.48% in the over-80 age group, as shown in Figure 3. As for the number of deaths, the trend is upwards for both women and men, with a slightly higher number of deaths among men, 1.64% compared to 1.55%.
It can be seen that the age groups with the greatest number of infected people are 40-49, 50-59 and 80+ for women and 50 to 80+ for men. These are therefore the most important groups for studies on the influence of the virus and for determining critical factors in its spread.

Zones and Population Density Influence
Another factor to take into account is the analysis by communities and population density (Sadique et al., 2007; Lowen, Mubareka, Steel & Pelese, 2007) as possible factors influencing the speed of propagation of the virus, taking into account the cases of recovered and dead people. For this purpose, a table will be constructed collecting the data indicated for the different Autonomous Communities. Figure 4 shows that the communities most affected by the number of coronavirus infections in Spain have been Madrid and Catalonia, followed by Castilla y León, the Basque Country, La Rioja and Andalusia. This also corresponds to a higher number of recovered cases and deaths.
If we now make a comparative study of the population density (number of inhabitants / surface area) against the number of infected persons, we obtain the graph in figure 5.

Figure 5. Distribution of infected versus population density in Autonomous Communities
As can be seen from figure 5, a higher population density, as found mainly in Ceuta, Melilla, the Balearic Islands and the Canary Islands, does not seem to correspond to a higher number of infected people. This could be due to the fact that these four territories are physically outside the peninsula. If we analyse other communities with a high population density such as Madrid and Catalonia, in this case, if we ...
From the above data, it can be deduced that the share of people infected ranges from 0.03% to 16.07%, of people recovered from 0.03% to 10.94% and of people who died from 0.00% to 2.19%. The population density is between 0.01% and 2.29%.
The table of frequencies confirms the conclusions obtained above. The number of infected persons is highest for the autonomous communities of Madrid and Catalonia, with 16.07% and 13.44%; the same holds for the number of recovered persons, with 10.94% and 5.37% respectively, and consequently for the number of deaths, with 2.19% and 1.42%.
The communities of Melilla and Ceuta have the highest population densities with 2.29% and 1.42% of the total.
The graph in figure 6 compares the number of confirmed Covid-19 infections with the ratio of the number of infected divided by the population density, in order to see the influence of density on the number of infected. A logarithmic scale has been used for the ordinate axis.
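The quantities behind figures 5 and 6 can be sketched as follows: population density as inhabitants divided by surface area, each density as a percentage of the sum of densities, and the ratio of infected to density, with its base-10 logarithm as it would appear on a logarithmic axis. The two communities and all their values are hypothetical placeholders, and applying the logarithm to the ratio is an assumption about how the figure was drawn.

```python
import math


def density_ratio_table(communities):
    """communities maps name -> (inhabitants, area_km2, infected)."""
    densities = {
        name: inhabitants / area
        for name, (inhabitants, area, _) in communities.items()
    }
    total_density = sum(densities.values())

    result = {}
    for name, (_, _, infected) in communities.items():
        density = densities[name]
        ratio = infected / density  # number of infected divided by density
        result[name] = {
            "density": density,
            "density_share_pct": 100.0 * density / total_density,
            "infected_per_density": ratio,
            "log10_ratio": math.log10(ratio),  # requires infected > 0
        }
    return result


# Hypothetical values, for illustration only.
communities = {
    "Community A": (1_000_000, 10_000, 5_000),
    "Community B": (100_000, 20, 200),
}
for name, values in density_ratio_table(communities).items():
    print(name, values)
```

In this invented example the denser community has the smaller ratio, which is the kind of pattern the figure is meant to expose.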

Figure 6. Number of infected versus number of infected divided by population density for the different Autonomous Communities
It is now clear that in communities with a lower population density, the number of infected people is smaller (Ceuta and Melilla). For other communities with high population densities, such as the Balearic Islands, the Canary Islands, Madrid, Catalonia and the Basque Country, the picture is mixed, and the same is true for communities with low population densities (Aragon, Extremadura).
In the independence test, a statistic of 238616.112 was obtained with 54 degrees of freedom and a P-value of 0.0000. As in the cases studied previously, the value is less than 0.05; therefore, the hypothesis that the variables in the rows and columns are independent can be rejected at a 95% confidence level. The variable observed in each row is thus related to its column. This means that the zone (community) variable has a statistically significant association with the number of cases of infected people, recovered people and dead people, and with the established population density data.
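The degrees of freedom reported in the three independence tests so far are consistent with the usual rule (rows - 1) x (columns - 1). The short check below assumes the following table shapes, which are a reading of the text rather than something stated explicitly: 2 sexes by 4 case categories, 18 sex-and-age groups by 3 case categories, and 19 territories (17 communities plus Ceuta and Melilla) by 4 columns.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom of a Chi-square independence test."""
    return (n_rows - 1) * (n_cols - 1)


# Assumed table shapes (rows, columns); these are an interpretation of the text.
assumed_shapes = {
    "sex": (2, 4),     # 2 sexes x (deceased, undeceased, recovered, unrecovered)
    "age": (18, 3),    # 9 age bands x 2 sexes, by (infected, recovered, deceased)
    "zones": (19, 4),  # 19 territories x (infected, recovered, deceased, density)
}

# Degrees of freedom quoted in the text.
reported = {"sex": 3, "age": 34, "zones": 54}

for name, (n_rows, n_cols) in assumed_shapes.items():
    print(name, degrees_of_freedom(n_rows, n_cols), reported[name])
```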

Income Influence
Another socio-demographic factor to be analysed is the income of the population in Spain. How can people's economic level affect the spread of the virus? For this parameter three economic ranges have been established: a low income ranging from 0 to 15,000 euros, with which 38.5% of the population identifies; an average income from 15,001 to 40,000 euros, with 52.3% of the population; and a high income from 40,001 euros onwards, with 9.2% of the population. The table above shows the results obtained to determine whether or not the income of the Spanish population can influence the number of people infected by coronavirus among men and women. After performing the Chi-square analysis, the value of the statistic with 4 degrees of freedom is 0.000, with a P-value of 1.0000. Since this value is greater than 0.05 (95% confidence level), it can be concluded that the income data do not influence the speed of propagation of the virus.
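A statistic of exactly zero with a P-value of 1 is what the Chi-square formula returns when every column of the table has the same row proportions. The sketch below illustrates this by applying the three income shares quoted above to three hypothetical column totals; whether the original table was constructed in this way is an assumption of the illustration, not something the text states.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = sum(
        (cell - row_totals[i] * col_totals[j] / grand_total) ** 2
        / (row_totals[i] * col_totals[j] / grand_total)
        for i, row in enumerate(observed)
        for j, cell in enumerate(row)
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Income shares quoted in the text (low, average, high).
income_shares = [0.385, 0.523, 0.092]

# Hypothetical column totals, for illustration only.
column_totals = [1000, 600, 100]

# Every column receives the same income split, so rows and columns are independent.
table = [[share * total for total in column_totals] for share in income_shares]
statistic, dof = chi_square_statistic(table)
print(statistic, dof)
```

With a 3x3 table the degrees of freedom come out as 4, in line with the value reported, and the statistic is zero up to floating-point rounding.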

Clinical Factors Influence
Apart from the socio-demographic data analysed above, there are clinical data (Lourraine, 1998; Dowell, 2001; Elandt-Johnson, 1975) that could be of great importance when investigating the possible spread of the pandemic and that could affect the evolution of the disease, such as hypertension, high cholesterol levels, allergies, mental health, tobacco and overweight.
The following table shows the data collected by the National Health System (SNS) in relation to the above parameters. For the study of these independent variables, the respective data will be calculated using the previous weights, with respect to the infected, recovered and deceased persons in the time period of the study.
A multivariate statistical analysis by principal component analysis will be carried out with the aim of reducing the dimensionality of the study from the seven quantitative independent variables to a few new variables, and thus allow a better interpretation of the data (Murray & Morse, 2011; Lofgrnn, Fefferman, Naumov, Gorski & Naumova, 2007).
The weights corresponding to these variables will be calculated in order to express, by means of a single independent variable or principal component, the influence on the number of infected people and to reduce as much as possible the number of independent variables of the study. For this purpose, the analysis program StatGraphics Centurion XVI v.2.0 will be used.
The data obtained are summarized in the following table, which shows the percentages of variance and the cumulative percentages, in addition to the eigenvalues of each of the components. The technique consists of a factorial decomposition: the variance-covariance matrix of the variables is taken and its eigenvalues are calculated, together with the associated eigenvectors, which are the weights or loadings of the new variables, called principal components.
The values in equation 2 have been standardized by subtracting their mean and dividing by their standard deviations.
Next, a cluster analysis (Chan et al., 2020) will be performed to see if groupings can be made according to common characteristics of the factors analyzed. Similarities between the groups are sought. There are two choices to make: the metric to be used and the clustering method. For the first, the squared Euclidean metric has been chosen, and for the second, the nearest neighbour clustering method. The clustering has been carried out on observations with the standardized variables.
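The principal component step described above can be sketched in plain Python: standardize each variable, form the variance-covariance matrix, and extract the leading eigenvalue and eigenvector. Power iteration is used here as a simple stand-in for the full eigendecomposition performed by Statgraphics, so only the first component is obtained. The three variables and their values are hypothetical, and the function names are illustrative.

```python
import math


def standardize(columns):
    """Z-score each variable: subtract its mean, divide by its sample standard deviation."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        result.append([(x - mean) / sd for x in col])
    return result


def covariance_matrix(columns):
    """Sample variance-covariance matrix (one list of observations per variable)."""
    n = len(columns[0])
    p = len(columns)
    means = [sum(col) / n for col in columns]
    return [
        [
            sum((columns[i][k] - means[i]) * (columns[j][k] - means[j]) for k in range(n))
            / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def first_principal_component(matrix, iterations=1000):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix by power iteration."""
    p = len(matrix)
    vector = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    eigenvalue = sum(
        vector[i] * sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, vector


# Hypothetical percentages for three clinical variables over five observations.
variables = [
    [20.1, 18.4, 22.3, 19.0, 21.5],
    [17.9, 16.2, 19.8, 16.0, 18.7],
    [9.5, 10.1, 8.7, 11.0, 9.9],
]
z = standardize(variables)
eigenvalue, weights = first_principal_component(covariance_matrix(z))

# For standardized variables the total variance equals the number of variables.
print(weights, 100.0 * eigenvalue / len(variables))
```

The eigenvector entries play the role of the weights of the first principal component, and the last printed value is the percentage of variance it explains.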
A cluster with 36 variables has been created. Clusters are groups of variables with similar characteristics. To form these clusters, the procedure starts with each variable in a separate group, then combines the two variables that are closest to each other to form a new group. After recalculating the distance between groups, the two closest groups are combined. This process is repeated until only one group remains. A graph called a dendrogram has been obtained to see the associations between variables. They have been grouped into three groups, as shown in Figure 7 below. The three groupings made in the dendrogram are shown in the scatter plot. The first grouping appears in blue, the second in orange and the third in purple. The centroids of the groupings have also been represented in brown.
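The agglomerative procedure described above can be sketched as follows, using the squared Euclidean metric and the nearest neighbour (single linkage) rule, and stopping when three groups remain instead of continuing down to one. The points are hypothetical and the function names are illustrative; this is not the Statgraphics routine itself.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_linkage(points, n_groups):
    """Merge the two closest groups repeatedly until n_groups remain.

    Returns a list of groups, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Nearest neighbour rule: distance between the closest pair of members.
                distance = min(
                    squared_euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical standardized observations in two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
print(single_linkage(points, 3))
```

Recording the distance at which each merge happens would give the heights needed to draw a dendrogram.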
The graph in figure 9 shows a linear functional dependence among all the variables, described by equation (1) obtained above for the principal component that links the variables analysed.

Discussion
The selection of the variables for this study has been very difficult. For the analysis of all the observable factors, both socio-demographic and clinical, the Chi-square method has been used. In the case of the clinical factors, a cluster or conglomerate type analysis has been carried out to see relationships between the variables. Some variables have been left pending so as not to make the study unmanageable, and those considered most important have been chosen.
For the age variable, many ranges (9) have been used because the authors considered this necessary given the high number of deaths among older people (70 years onwards). The results of the analysis have confirmed that men have a wider range of dangerous ages, from 50 to over 80 years of age. This is not the case for women, for whom the most dangerous age range is clearly over 80 years, followed to a lesser extent by the range between 40 and 59 years of age. This may be due to the fact that men, in general, present a more serious clinical picture than women.
A linear function of dependence between all the clinical variables has been found that could be included in a predictive model.
All the parameters studied in this work could be used in future work on prevalence and seroprevalence or in the last phase of the epidemic.

Conclusion
This study highlights the influence or not of the variables studied on the speed of spread of Covid-19 coronavirus in Spain. The variables studied have been divided into two groups, socio-demographic variables (sex, age, cohabitation area, population density and social level), and clinical variables (hypertension, cholesterol, allergies, mental health, tobacco and overweight). The environmental variables have been studied in another work resulting in very interesting conclusions in some of them such as temperature and air currents.
The application of the Chi-square technique indicates that, of the socio-demographic variables studied, all are important in the propagation of the epidemic with the exception of people's income or economic level.
It is observed that the age groups with the greatest number of infected people are 40-49, 50-59 and 80+ for women and 50 to 80+ for men. These are therefore the most important groups for studies on the influence of the virus and for determining critical factors in its propagation, and they are the ages to take care of or to take into account in the event of a pandemic.
The different areas of coexistence or autonomous communities with high population density tend to show a higher speed of infection of the virus, although this is not always true. It is important to pay attention to transport as a means of connection between and within the different communities. Transport is a very important element to study and monitor.
As for the clinical factors, they all show a linear functional dependence, so that they can be related through equation (1) for further studies.
All these results refer to deceased or recovered persons, depending on the parameter, within the time period of the study, and relate to the speed of infection. These are very important data to take into account in studies of the evolution of pandemics worldwide.
Mitigation of the disease requires strict hygiene measures and containment processes, with de-escalation taking health parameters into particular account. The study of this large number of variables will undoubtedly help us to understand the disease better and to be better prepared in the future. Scientists in these cases must join efforts in one direction, and that direction can only be one: to save humanity.

Availability of Data
Our data are available upon request from the corresponding author.

Conflict of Interest
None of the authors have conflicts of interest associated with this study to report.

Source of Funding
The authors received no specific funding for this work.