Predicting car accidents in my neighborhood: A data-driven approach.

I was born and raised in Vrilissia, a calm, mostly residential district in the northern suburbs of Athens. However, this calm is often interrupted by car crashes in the streets. And when I say often, I mean that as a kid I remember periods when car accidents happened on a daily basis. So I decided to collect data about car accidents in Vrilissia, analyze and interpret it, and, if possible, predict the severity of an accident that may happen in the future.

Data collection and preprocessing

I collected data about car crashes in Vrilissia from the local news site. So a BIG thank you to them for making my life easier while I was trying to find out when and what kind of car accidents have happened from 2015 onwards.

I read all the articles tagged with “car accident” one by one and recorded the following features for each accident:

  1. Date
  2. Location (as a street name or an intersection of streets)
  3. Cause (if mentioned in the article; unfortunately, for many accidents this was not disclosed)
  4. Severity
  5. A link to the original article.

In total, I found information about 119 unique car accidents between 2015 and the time of writing. These accidents happened in 71 different locations, so as a preprocessing step I assigned an integer (1–71) to each location to make the data easier to analyze.
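The location-to-integer step can be sketched in a few lines of Python. This is only an illustration: the street names are made-up placeholders and `encode_locations` is a hypothetical helper, not the code actually used for the analysis.

```python
# Hypothetical sample of recorded locations (placeholder names, not real data).
locations = [
    "Street A & Street B",
    "Street C",
    "Street A & Street B",
    "Street D",
    "Street C",
]


def encode_locations(names):
    """Map each distinct location to an integer ID, starting at 1,
    in order of first appearance."""
    ids = {}
    for name in names:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids


location_ids = encode_locations(locations)
encoded = [location_ids[name] for name in locations]

print(location_ids)
print(encoded)
```

Repeated locations get the same ID, so the number of distinct IDs equals the number of distinct locations (71 in the real dataset).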

From this recording, 11 main discrete causes / kinds of accidents emerged:

  1. STOP sign violation
  2. Not disclosed in article (aka N/A)
  3. Careless Parking
  4. Red traffic light violation
  5. Careless driving
  6. Driving under the influence of alcohol (aka Drunk driver)
  7. Driver felt “dizzy”
  8. Bicycle dragging
  9. Slide due to ice
  10. Vehicle overturning
  11. Pedestrian dragging

Also, I categorized Severity with the following 4 discrete values:

  1. Only car damages
  2. Car damages + Minor injuries
  3. Car damages + Major injuries
  4. Death

A closer look at the data

After I collected and preprocessed the data, I was really curious about what it had to tell. What are the most common causes of accidents? Which locations have the highest frequency of car accidents? Is there a correlation between the location of an accident and its severity?

So a dozen lines of Python later, I came up with some plots and insights. Let’s go through them in order of importance.

Recorded accidents by cause

The first observation I made is that the most common cause of accidents is the violation of STOP signs. This is a bit sad (😢) because I believe these are the accidents that could be prevented if drivers were just a little more cautious. Unfortunately, for a large number of data points the cause of the accident was not disclosed, which means that the true percentage of accidents caused by STOP sign violations may actually be higher.

Recorded accidents by severity

The next observation is that, fortunately, the vast majority of accidents (70.59%) are not very severe, as only car damages were recorded. Also, in the whole dataset, only 1 fatal accident was recorded.
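Percentages like the one above come from a simple frequency count. Here is a minimal sketch using only the standard library; the `severities` list is a toy sample I made up for illustration, and `share_by_category` is a hypothetical helper name.

```python
from collections import Counter


def share_by_category(values):
    """Return the percentage of records that fall in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: round(100 * n / total, 2) for category, n in counts.items()}


# Toy sample (NOT the real dataset): 10 accidents with made-up severities.
severities = (
    ["Only car damages"] * 7
    + ["Car damages + Minor injuries"] * 2
    + ["Car damages + Major injuries"]
)

shares = share_by_category(severities)
for category, pct in shares.items():
    print(f"{category}: {pct}%")
```

The same helper works for the cause column, which is how a "most common cause" ranking can be produced.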

Then I tried to figure out whether time was a critical factor for car accidents in Vrilissia, so I plotted accidents against day of the week, month and year. However, I did not find any useful insight except:

  1. The high variance in the number of accidents between different months
  2. The higher number of car accidents during 2020 compared to previous years.

The second observation may be a bit misleading, because I suspect that during 2020 the site I gathered the data from was simply covering car accidents in a more organized and thorough way.

Recorded accidents by Weekday
Recorded accidents by month
Recorded accidents by Year
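The counts behind the weekday, month and year plots can be derived directly from the Date feature. A minimal sketch, assuming the dates are available as `datetime.date` objects; the four dates below are invented for illustration.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Invented example dates (NOT the real accident dates).
accident_dates = [
    date(2019, 3, 4),
    date(2020, 3, 9),
    date(2020, 7, 18),
    date(2020, 7, 19),
]

# date.weekday() returns 0 for Monday through 6 for Sunday.
by_weekday = Counter(WEEKDAYS[d.weekday()] for d in accident_dates)
by_month = Counter(d.month for d in accident_dates)
by_year = Counter(d.year for d in accident_dates)

print(by_weekday)
print(by_month)
print(by_year)
```

Each `Counter` can then be passed to any plotting library as bar heights.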

Next, I wanted to find out whether some locations are more “dangerous” than others. So I plotted the number of accidents for each location where an accident happened.

Recorded accidents by location

In this plot we observe that most locations account for just one accident, so I made a more helpful plot showing only the locations where more than 1 accident happened.

Recorded accidents by location ( > 1 accidents)

We observe that out of a total of 71 distinct locations, only 25 account for more than 1 car accident. The most dangerous location accounts for 8 accidents.
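Filtering down to the repeat locations is a one-liner once the accidents are counted per location ID. A small sketch with made-up IDs:

```python
from collections import Counter

# Made-up location IDs, one entry per recorded accident (NOT the real data).
accident_locations = [3, 1, 3, 2, 3, 1, 4]

counts = Counter(accident_locations)

# Keep only locations with more than 1 accident, most frequent first.
repeat_locations = {loc: n for loc, n in counts.most_common() if n > 1}
most_dangerous = max(counts, key=counts.get)

print(repeat_locations)
print(most_dangerous)
```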

Finally, I was curious whether and how the different features are associated with each other. So I used Cramér’s V, which is a measure of association between categorical variables (I had a lot of them in the dataset). Think of it as correlation for categorical values.

The association matrix produced by applying Cramér’s V to the dataset “tells” us that there is a very strong association (> 0.87) between the location (aka Point) of an accident and its cause, and also a strong association (> 0.71) between the location of an accident and its severity. Note that Cramér’s V ranges from 0 to 1, so it measures the strength of an association, not its direction.
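For readers who want to see what is behind the metric, here is a minimal sketch of Cramér’s V in plain Python, computed as the square root of chi-squared divided by n times (min(rows, columns) - 1). This is the uncorrected textbook formula; the library behind the matrix above may apply a bias correction, so its numbers can differ. The function name `cramers_v` and the toy inputs are my own.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Uncorrected Cramér's V between two equal-length sequences of categories."""
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant feature carries no association
    return math.sqrt(chi2 / (n * k))


# Toy inputs: y is fully determined by x in the first case, unrelated in the second.
perfect = cramers_v(["a", "a", "b", "b"], ["p", "p", "q", "q"])
independent = cramers_v(["a", "a", "b", "b"], ["p", "q", "p", "q"])
print(perfect, independent)
```

One caveat worth keeping in mind: with many categories and few rows, the uncorrected statistic tends to be inflated, so values involving a 71-category location feature on 119 records should be read cautiously.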

OK, and now what? Can we predict them?

As the amount of data is very limited, it is very difficult to predict when or where the next car accident will happen. However, we can try machine learning methods that predict one property of an accident (let’s say the Severity) based on the other properties we have gathered. In other words, we can classify an accident in terms of severity or cause if we know the rest of the information about it.

The first method I tried for classifying accidents by Severity is a Random Forest. It is an ensemble machine learning method that trains many randomized decision trees and decides the outcome by majority vote among the trees in the ensemble. I chose this method because it generally copes well with datasets that have many categorical fields with high cardinality (such as ours).

For the prediction / machine learning part of the project I used the scikit-learn library.

As a preprocessing step before training, I label-encoded the Cause and Severity categorical features. Then I tried to predict the Severity of an accident based on the rest of the features, using a 70/30 train-test split: 70% of the data points were used to train the model and the remaining 30% to test its accuracy.

After 10 successive runs, I got an average accuracy of 59%, which is not so bad given the limited amount of data and the cardinality of the predicted feature (4).
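The workflow described above (label encoding, a 70/30 split, 10 runs, averaged accuracy) can be sketched with scikit-learn as follows. Everything about the data here is an assumption: the records are randomly generated placeholders, only two input features are used, and the hyperparameters are library defaults, so the printed numbers say nothing about the real dataset. I also added a majority-class baseline (`DummyClassifier`) as a sanity check, which was not part of the original experiment.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic placeholder records (NOT the real accident data).
rng = np.random.default_rng(0)
n_accidents = 119
location = rng.integers(1, 72, size=n_accidents)  # location IDs 1..71
cause = rng.choice(["STOP sign violation", "N/A", "Careless driving"], size=n_accidents)
severity = rng.choice(
    ["Only car damages", "Minor injuries", "Major injuries"],
    size=n_accidents,
    p=[0.7, 0.2, 0.1],
)

# Label-encode the categorical columns into integer codes.
X = np.column_stack([location, LabelEncoder().fit_transform(cause)])
y = LabelEncoder().fit_transform(severity)

forest_scores, baseline_scores = [], []
for seed in range(10):  # 10 runs, each with a different random 70/30 split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_train, y_train)
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

    # For classifiers, .score() returns accuracy on the given test set.
    forest_scores.append(forest.score(X_test, y_test))
    baseline_scores.append(baseline.score(X_test, y_test))

print(f"Random Forest mean accuracy: {np.mean(forest_scores):.2f}")
print(f"Majority-class baseline:     {np.mean(baseline_scores):.2f}")
```

Comparing against the baseline is useful with imbalanced classes, since a model that always answers with the most frequent severity can already look fairly accurate.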

I also tried to predict the same variable with a 2-layer neural network (multi-layer perceptron) using ReLU as the activation function.

After 10 successive runs, I got an average accuracy of 49%. Neural networks tend to perform better with larger amounts of data, so it was expected that this one would perform worse than simpler algorithms such as Random Forests.
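A comparable multi-layer perceptron experiment might look like the sketch below. Several things are my assumptions rather than the original setup: the data is again randomly generated, "2-layer" is interpreted as two hidden layers of 16 units each, and the inputs are one-hot encoded (a common choice for neural networks) instead of label-encoded.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic placeholder records (NOT the real accident data).
rng = np.random.default_rng(0)
n_accidents = 119
location = rng.integers(1, 72, size=n_accidents)  # location IDs 1..71
cause = rng.choice(["STOP sign violation", "N/A", "Careless driving"], size=n_accidents)
severity = rng.choice(
    ["Only car damages", "Minor injuries", "Major injuries"],
    size=n_accidents,
    p=[0.7, 0.2, 0.1],
)

# Treat both inputs as categories; the pipeline one-hot encodes them.
X = np.column_stack([location.astype(str), cause])
y = severity

mlp_scores = []
for seed in range(10):  # 10 runs, each with a different random 70/30 split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = make_pipeline(
        # Ignore locations that appear only in the test split.
        OneHotEncoder(handle_unknown="ignore"),
        MLPClassifier(
            hidden_layer_sizes=(16, 16),  # assumed architecture
            activation="relu",
            max_iter=2000,
            random_state=seed,
        ),
    )
    model.fit(X_train, y_train)
    mlp_scores.append(model.score(X_test, y_test))

print(f"MLP mean accuracy: {np.mean(mlp_scores):.2f}")
```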


The data is out there and it knows the truth, even about car accidents. All we have to do is go and get it. Then it will talk, no matter what. The best-case scenario is that it helps us predict the future value of a variable with significant accuracy. If for some reason it cannot do that, it will still provide useful insights most of the time.

Hope you enjoyed!

