Using a K-Means Clustering Algorithm to Examine Patterns of Vehicle Crashes in Before-After Analysis



Introduction and Previous Studies
All policies that affect travel patterns also affect the numbers killed and injured in transport accidents; conversely, changing travel patterns may in itself be a way of reducing these numbers. Investigation of this interaction between travel patterns (De Luca & Dell'Acqua, 2012a) and the number of deaths and injuries in transport accidents can benefit greatly from various kinds of data that are already commonly collected in travel surveys (Dell'Acqua, 2012).
A recent study (Ghosh & Savolainen, 2012) examined the factors that affect the time required by the Michigan Department of Transportation Freeway Courtesy Patrol to clear incidents occurring on the southeastern Michigan freeway network. These models were developed using traffic flow data, roadway geometry information, and an extensive incident inventory database. Lord et al. (2010) provided a detailed review of the key issues associated with crash-frequency data, as well as the strengths and weaknesses of the various methodological approaches that researchers have used to address these problems. Depaire et al. (2008), also in the field of road safety, used cluster analysis to identify homogeneous classes of accidents, which allowed for a very effective analysis. Similarly, KwokSuen et al. (2002) used cluster analysis to group homogeneous data in an experimental analysis, with the aim of developing an algorithm to estimate the number of road accidents and to assess the risk of accidents. Finally, Dell'Acqua et al. (2011a) applied cluster analysis to develop a Decision Support System (DSS) for identifying Accident Modification Factors (AMF).

Methodology and Objectives
The methodology used in this work is based on two techniques: a) cluster analysis (hard c-means method, Hcm); b) the "Before-After" approach.
Using cluster analysis (by means of the binary partition algorithm "hard c-means"), all accidents with a high degree of resemblance or strong similarities were aggregated (clustered). The "cluster representative" accident for each cluster was then determined as the average of all the different characteristics (geometric, environmental, traffic-related). Next, for each cluster, a hazard index was created to express the danger rate of the cluster in terms of accidents. Using this information (the average characteristics of the "cluster representative" as independent variables, and the hazard index as the dependent variable), an accident prediction model was constructed by means of multiple regression analysis.
This model was used to support the choice of infrastructure projects and to simulate situations in which the Before-After technique could be applied. In particular, the model was used to simulate "after" situations, with and without any intervention. A comparison of these two simulated situations made it possible to assess the effectiveness of the interventions in terms of road safety. Detailed descriptions of the two techniques mentioned in this section follow.
The study aims to develop a support procedure to estimate the efficacy of infrastructural interventions to improve road safety. The procedure was developed using the two techniques described above in an experiment on a stretch of highway of approximately 110 km.

Cluster Analysis with Hard c-Means Algorithm (Hcm)
The principles of this technique are as follows (Tryon, 1939). The aim of cluster analysis is to identify a specific partition U, into c groups (2 ≤ c ≤ n), of the data space X constituted by n elements. The method rests on the following hypothesis: the elements of X that belong to the same group share a mathematical affinity, and this affinity is greater than their affinity with the elements of other groups. Each element in the sample can be represented as a point identified by m coordinates, each coordinate being an attribute of the element. One of the simplest measures of affinity is the distance between two points in the data space (De Luca et al., 2012b).
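The partitioning idea described above can be illustrated with a minimal sketch. It assumes Euclidean distance as the affinity measure and the usual alternating assign/update loop of hard c-means (k-means); the function names, the `init` parameter, and the sample points are hypothetical and are not taken from the study.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points with m coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_c_means(points, c, init=None, max_iter=100, seed=0):
    """Partition `points` into c groups; each point belongs to exactly one group.

    `init` optionally gives the starting centers (assumption: if omitted,
    c distinct points are drawn at random).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in (init if init is not None else rng.sample(points, c))]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the group with the nearest center.
        new_labels = [
            min(range(c), key=lambda k: euclidean(p, centers[k])) for p in points
        ]
        if new_labels == labels:
            break  # partition is stable
        labels = new_labels
        # Update step: each center becomes the mean of its members
        # (the "cluster representative").
        for k in range(c):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical two-attribute records (illustration only, not study data).
demo_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
demo_labels, demo_centers = hard_c_means(
    demo_points, c=2, init=[demo_points[0], demo_points[3]]
)
```

In practice the attributes would normally be rescaled before clustering, since a distance-based affinity is sensitive to the units of each coordinate.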

Before-After Analysis
In this study, the naïve "four steps/before-after" study (Hauer, 1997) was used. To overcome the limitations of the "naïve" technique, the data were not only subjected to cluster analysis but were also assigned an accident rate index, detailed below, which takes the traffic variable into consideration. In this way it was possible to compare, in the two configurations ("after" without intervention and "after" with intervention), the traffic conditions, the meteorological conditions, the number of vehicles involved, and other quantities, the details of which are given in later sections.

Naïve "Four Steps/Before-After" Study
Let us suppose that an intervention was carried out on entities 1, 2, …, j, …, n. Let us also suppose that in the period preceding the intervention ("before") K(1), K(2), …, K(j), …, K(n) accidents occurred, and that in the "after" period the number of accidents was L(1), L(2), …, L(j), …, L(n). Since the durations of the "before" and "after" periods (De Luca et al., 2011) may differ from one entity to another (where an entity may be an intersection, a curve exiting a tunnel with a wet surface, a road/rail level crossing, etc.), we define the "duration ratio" for entity j as:
$$r_d(j) = \frac{\text{duration of the "after" period for entity } j}{\text{duration of the "before" period for entity } j}$$
Let us also define:
$\pi$ (written "p" in the tables), the expected number of accidents in the "after" period with no intervention;
$\lambda$ (written "l" in the tables), the expected number of accidents in the "after" period following a specific intervention.
Specifically, $\pi$ is estimated from the "before" counts K(j) scaled by the duration ratios, and $\lambda$ from the "after" counts L(j). To establish the effectiveness of the intervention, the indices (6) and (7) are used. The first,
$$\theta = \frac{\lambda}{\pi} \qquad (6)$$
is defined as the ratio between the degree of safety with the modernization works and what it would have been without them. The second,
$$\delta = \pi - \lambda \qquad (7)$$
is defined as the reduction in the number of accidents expected (or expected accident rate) "after" the intervention.
The variances of $\theta$ and $\delta$ are also defined. In particular, the first index, $\theta$, takes on the following important meaning: if $\theta > 1$, the intervention has a negative influence on safety; if $\theta < 1$, the intervention has no negative influence on safety.
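As an illustration, the four steps of the naïve before-after study can be sketched as below. This follows the general formulation usually attributed to Hauer (1997), with Poisson-distributed counts and a bias-corrected estimator of the index of effectiveness; whether the study used exactly these expressions is an assumption, and the function name and sample counts are hypothetical.

```python
def naive_before_after(K, L, r_d):
    """Naive four-step before-after estimate (sketch, assumed Hauer-style formulas).

    K   -- accident counts per entity in the "before" period
    L   -- accident counts per entity in the "after" period
    r_d -- duration ratio per entity (after duration / before duration)
    """
    # Step 1: estimate lambda (after, with intervention) and pi (after, without).
    lam = float(sum(L))
    pi = sum(r * k for r, k in zip(r_d, K))
    # Step 2: variances of the two estimates, assuming Poisson counts.
    var_lam = float(sum(L))
    var_pi = sum(r * r * k for r, k in zip(r_d, K))
    # Step 3: reduction in expected accidents and index of effectiveness.
    delta = pi - lam
    theta = (lam / pi) / (1.0 + var_pi / pi ** 2)  # bias-corrected ratio
    # Step 4: variances of delta and theta.
    var_delta = var_pi + var_lam
    var_theta = (
        theta ** 2
        * (var_lam / lam ** 2 + var_pi / pi ** 2)
        / (1.0 + var_pi / pi ** 2) ** 2
    )
    return {
        "lambda": lam,
        "pi": pi,
        "delta": delta,
        "theta": theta,
        "var_delta": var_delta,
        "var_theta": var_theta,
    }


# Hypothetical counts for two entities with equal before/after durations.
demo = naive_before_after(K=[10, 20], L=[8, 12], r_d=[1.0, 1.0])
```

With these made-up counts the index of effectiveness falls below 1, which under the interpretation given above would indicate that the intervention did not worsen safety.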

Data Collection
The stretch analyzed (see Figure 1) belongs to the A3 Salerno-Reggio Calabria freeway, situated in the south of Italy, between km 170.000 and km 259.000 (Dell'Acqua et al., 2011c).

Description of the Stretch "Before" the Intervention
The stretch analyzed (from 30 September 1998 to 30 September 1999) has a slope of between 0% and 4%. The road surface is a flexible pavement of dense asphalt. Geometric data were obtained from maps at scales of 1:2000 and 1:10000. The accident data (about 520 accidents over the whole segment) were made available by the police authorities; traffic data were taken from the archives of the local government administration. The collected data were organized as indicated in Table 1. In the "after" period (from 30 September 2002 to 31 August 2003) there were 85 accidents with 38 injuries and no deaths. On the same stretch in the "before" period (from 30 September 1998 to 31 August 1999) there were 104 accidents with 58 injuries and 1 fatality.

Cluster Analysis (Hcm Algorithm)
The hard c-means algorithm was applied to the variables shown in Table 2, for accidents recorded between 30 September 1998 and 30 September 1999 at distances from km 170.000 to km 281.000. Table 3 shows the results obtained. The same can be said of the other clusters shown in Table 3.
In the hazard index Id of Equation (10):
Nv is the number of vehicles involved in the accident under consideration.
L is the length of the hazardous zone (cluster). It was calculated by assuming an influence area of 500 m for each accident, or group of accidents, occurring at the same distance.
K1 is equal to 0.75 for a dry road surface and 0.25 for a wet road surface (Dell'Acqua et al., 2011b).
K2 is equal to 0.67 for daylight and 0.33 for night-time conditions.
ADT is the average daily traffic at each cluster. In particular, the traffic associated with each cluster is the average traffic over the distances where one or more accidents were recorded in the "after" period (from 30 September 2002 to 31 August 2003). The traffic for this period was used because the model (11) shown below was used to simulate the accidents for that period (30 September 2002 to 31 August 2003).
This model is only applicable on highways for the values given in the third column of Table 2.
The model for Idc (11) was constructed from the data in Table 3, with Id as the dependent variable and variables 1, 2, 3, 4 and 5 as independent variables (predictors), by means of multiple regression analysis (Mauro & Branco, 2013). The ordinary least squares method was applied to estimate the coefficients of the explanatory variables.
[Equation (11), regression model for Idc; recoverable numerical values: 1415, 95011, 106, 875, 364, 399; R² = 0.94]
The significance of the variables (significance level of 5%) was examined using Student's t-test. In Table 4, the last two columns show the values of the t-statistics and the significance of the model (11).
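The estimation procedure named here (ordinary least squares followed by a t-test on each coefficient) can be sketched in plain Python as follows. The data are hypothetical and the function names are illustrative; the sketch shows the general technique, not the study's actual fit or coefficients.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ols_with_t(X, y):
    """Fit y = b0 + b1*x1 + ... by OLS; return coefficients, R^2 and t-statistics."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    n, p = len(rows), len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(XtX, Xty)  # normal equations
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    ss_res = sum(e * e for e in resid)
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    s2 = ss_res / (n - p)  # residual variance
    t_stats = []
    for i in range(p):
        unit = [1.0 if j == i else 0.0 for j in range(p)]
        inv_ii = solve(XtX, unit)[i]  # i-th diagonal entry of (X'X)^-1
        t_stats.append(beta[i] / math.sqrt(s2 * inv_ii))
    return beta, r2, t_stats


# Hypothetical data generated from y = 2 + 3*x1 - x2 plus small noise.
X_demo = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (2, 2), (3, 1)]
y_demo = [2.01, 4.99, 1.01, 3.99, 7.01, 2.99, 6.01, 9.99]
beta_demo, r2_demo, t_demo = ols_with_t(X_demo, y_demo)
```

Each t-statistic would then be compared with the critical value of Student's t distribution for n - p degrees of freedom at the chosen significance level.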

Before-After Analysis
The "Before-After" analysis described in section 2.2 was applied to the first two clusters (cluster "a" and cluster "b") shown in Table 3.
The first cluster (cluster "a") contains 14 accidents (their average characteristics are shown in the first row of Table 3) in the following kilometric ranges:
4 accidents at km 228.000 ± 0.5 km
5 accidents at km 244.500 ± 0.5 km
5 accidents at km 253.000 ± 0.5 km
The second cluster (cluster "b") contains 28 accidents (their average characteristics are shown in the second row of Table 3) in the following kilometric ranges:
5 accidents at km 226.000 ± 0.5 km
6 accidents at km 221.200 ± 0.5 km
8 accidents at km 249.500 ± 0.5 km
Bearing in mind the average characteristics of each cluster (i.e., the geometric and environmental situations in which the accidents occurred), the following infrastructure projects were planned and implemented.
Cluster "a" interventions at km 244.500 ± 0.5 km:
1 Adjustment of acceleration and deceleration lanes at the freeway exit.
2 Replacement of dense asphalt with porous asphalt.
Cluster "b" interventions at km 249.000 ± 0.5 km:
1 Increase in curve radius from 390 m to 800 m.
2 Replacement of dense asphalt with porous asphalt.
Table 5 shows the works proposed to improve road safety in clusters "a" and "b". The first column shows the area of intervention. The second column shows the Idc value for the "after" period without works (i.e., the parameter "p"). The third column describes the scheduled works. The fourth column shows the numerical variation of the variables of model (11) once the works in the third column had been planned (Loprencipe & Cantisani, 2013). Finally, the last column shows the Idc value for the "after" period with the works carried out (i.e., the parameter "l").
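The simulation logic behind this comparison can be sketched as follows: the same prediction model is evaluated twice, once with the variables left as they are ("p") and once with the variables modified by the planned works ("l"). The linear form of the model, the coefficient values, and the definition of the percentage benefit as 100·(p − l)/p are all assumptions made for illustration; none of the numbers come from the study.

```python
def predict_idc(coeffs, x):
    """Evaluate a linear model: intercept plus the weighted sum of the predictors."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], x))


def intervention_effect(coeffs, x_without, x_with):
    """Compare the simulated "after" situations without and with the works."""
    p = predict_idc(coeffs, x_without)  # "after", no intervention
    l = predict_idc(coeffs, x_with)     # "after", with intervention
    theta = l / p                       # ratio of the two simulated values
    benefit_pct = 100.0 * (p - l) / p   # assumed definition of the benefit
    return p, l, theta, benefit_pct


# Hypothetical coefficients and predictor values (illustration only).
# The first predictor is changed by the works (2.0 -> 1.2); the second is not.
demo_coeffs = [1.0, 2.0, 0.5]
p_demo, l_demo, theta_demo, benefit_demo = intervention_effect(
    demo_coeffs, x_without=[2.0, 4.0], x_with=[1.2, 4.0]
)
```

Keeping every other predictor fixed between the two evaluations is what isolates the effect of the works from changes in traffic or environmental conditions.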
Table 5. Results obtained by the model (11)

* The variation, as indicated in Table 2, should be from 2 to 1. However, given that porous asphalt brings an improvement but not a complete change from wet to dry conditions, it was associated with a coefficient of 1.20.

Table 6, using the symbols introduced in Section 2.2, shows the results for the estimation of the effectiveness of the modernization works. The first two columns show the values of "l" and "p", while the fifth and sixth show the values of θ and δ, whose expressions are derived from (6) and (7). The sixth column shows the benefit (as a percentage) from the comparison between "l" and "p".
As can be seen, the works were effective, resulting in θ < 1.
Lastly, Table 7 illustrates the "Before-After" comparison at the sites of intervention for the accidents observed.
The table shows a comparison (based on the observed data) between the number of accidents, the number of injured, the number of vehicles involved and the traffic before and after the works. The observed hazard index (Id) from Equation (10) is shown. The last two columns show the values of "l" and "p" calculated from the observed values. As can be seen from a comparison of these two values ("l", "p") in the two situations, in Table 6 (for the simulated values) and Table 7 (for the observed values), the estimate carried out using the experimental

Table 1. Collecting and organizing data

Table 2. Variables introduced in the cluster analysis

The clusters identified using the Hcm technique can be considered as "hazardous zones". If we analyze the first line of Table 3 (obtained by calculating the mean value within the cluster), we can observe that it has the following average characteristics:

Table 3. Results of cluster analysis

Table 4. Model obtained