Mapping Queensland's Fatal Crash Locations

Mon 02 December 2019

Filed under Cartopy

Visualization of Queensland Road Crash Sites¶

Introduction¶

Background¶

Recently I attended (mostly as an observer) the 2019 GovHack event, which describes itself as the "THE LARGEST OPEN DATA HACKATHON IN THE SOUTHERN HEMISPHERE" (which is a pretty bold claim). I couldn't commit to the time needed to participate fully (which was why I was mostly observing), but found the process very interesting. The event was very well run at the Peregian Digital Hub, just north of where I live.

Govhack provides access to a large array of Government [datasets])(https://hackerspace.govhack.org/data_sets), including road crash locations for Queensland.

Goal¶

The road crash locations databse is one that I have been looking for, as I have a dream of a mobile phone app that would warnimg drivers about dangerous sections of the road, so that they can pay extra care to driving. This notebook is about the first stage of achieving that dream, in visualizing crash locations in various ways. I've decided to show the initial data exploration and display, as well as the more polished final results.

Implementation¶

This sections contains the housekeeping commands related to implementation details.

Notebook magic commands¶

Show all graphics inline in notebook.

In [1]:

%matplotlib inline

Watermark is used to show environmental details for reproducability

In [2]:

%load_ext watermark

I use black as my python code formatter, tailored for Pelican webpage production.

In [3]:

%load_ext lab_black

Notebook imports¶

Note that some imports are not actually used by the visualization code, but are there to support display of reproducability information.

In [4]:

# all imports should go here


import sys
import os
import subprocess
import datetime
import platform
import datetime

import seaborn as sns
import pandas as pd
import geopandas as gpd
from cartopy import crs as ccrs
import cartopy.feature as cfeature

from cartopy.io.img_tiles import GoogleTiles

import cartopy.io.img_tiles as cimgt
import six
from PIL import Image

import matplotlib.pyplot as plt

Data Load¶

Initial Exploration¶

So the first thing to do is to load and explore the data (held in a downloaded csv file. We get a somewhat opaque warning message: it turns out that the auto-discovery of data column types can get confused by things like PostCodes that (in Australia in any case) mostly look like integers.

In [5]:

crash_path = 'D:\\QLDCrashLocations\\data\\locations.csv'
crashes = pd.read_csv(crash_path)

D:\Anaconda3\envs\ac5-py37\lib\site-packages\IPython\core\interactiveshell.py:3020: DtypeWarning: Columns (15) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

As usual, we do a .head() to see what we have.

In [6]:

crashes.head()

Out[6]:

	Crash_Ref_Number	Crash_Severity	Crash_Year	Crash_Month	Crash_Day_Of_Week	Crash_Hour	Crash_Nature	Crash_Type	Crash_Longitude_GDA94	Crash_Latitude_GDA94	...	Count_Casualty_MedicallyTreated	Count_Casualty_Total	Count_Unit_Car
0	1	Property damage only	2001	January	Monday	9	Angle	Multi-Vehicle	152.960485	-26.685052	...	0	0	2
1	2	Property damage only	2001	January	Tuesday	23	Hit object	Single Vehicle	153.038263	-27.550776	...	0	0	1
2	3	Medical treatment	2001	January	Tuesday	16	Angle	Multi-Vehicle	153.116342	-27.545153	...	1	1	2
3	4	Property damage only	2001	January	Tuesday	12	Rear-end	Multi-Vehicle	153.021629	-27.384448	...	0	0	2
4	5	Property damage only	2001	January	Tuesday	10	Rear-end	Multi-Vehicle	152.937497	-27.083184	...	0	0	3

5 rows × 52 columns

This all looks promising: lets see what types of crashes we have recorded.

In [7]:

crashes["Crash_Severity"].unique()

Out[7]:

array(['Property damage only', 'Medical treatment', 'Hospitalisation',
       'Minor injury', 'Fatal'], dtype=object)

Lets see what the pesky column 15 is all about.

In [8]:

crashes.columns[15]

Out[8]:

'Loc_Post_Code'

A little Googling on Stack Overflow reveals the source of the problem. We declare PostCode to be strings, and reload data (not really necessary, but just checking this fixes the warning message)

In [9]:

crash_path = 'D:\\QLDCrashLocations\\data\\locations.csv'
crashes = pd.read_csv(
    crash_path, dtype={'Loc_Post_Code': str}
)

Initial Geodata Exploration¶

At this stage I decide that fatal crashes were the only ones I was interested in, so I create a Pandas view to the crashes dataframe. This is a touch unfeeling of me, in that I am using data representing human tragedy and suffering for a blog post to amuse my readers, but my long term goal is an app that might reduce this tragedy.

In [10]:

fatal = crashes[crashes["Crash_Severity"] == 'Fatal']

Let's check what this this gives us.

In [11]:

fatal.head()

Out[11]:

	Crash_Ref_Number	Crash_Severity	Crash_Year	Crash_Month	Crash_Day_Of_Week	Crash_Hour	Crash_Nature	Crash_Type	Crash_Longitude_GDA94	Crash_Latitude_GDA94	...	Count_Casualty_MinorInjury	Count_Casualty_Total	Count_Unit_Car	Count_Unit_Motorcycle_Moped	Count_Unit_Truck	Count_Unit_Pedestrian
18	19	Fatal	2001	January	Thursday	9	Hit pedestrian	Hit pedestrian	151.934354	-27.566384	...	0	1	0	0	1	1
148	149	Fatal	2001	January	Friday	14	Hit object	Single Vehicle	148.702519	-21.623856	...	0	1	0	1	0	0
154	155	Fatal	2001	January	Monday	15	Hit object	Single Vehicle	153.018326	-27.275777	...	0	1	1	0	0	0
177	178	Fatal	2001	January	Wednesday	2	Hit object	Single Vehicle	153.388039	-27.937847	...	0	2	1	0	0	0
285	286	Fatal	2001	February	Friday	17	Head-on	Multi-Vehicle	152.073306	-26.881843	...	1	3	2	0	0	0

5 rows × 52 columns

Let's see how many fatal crashes we have. Turns out that we have quite a few.

In [12]:

len(fatal)

Out[12]:

Now, let's explore the geographic data, and check that it looks reasonable.

In [13]:

fatal["Crash_Longitude_GDA94"].min()

Out[13]:

138.11879852299998

In [14]:

fatal["Crash_Longitude_GDA94"].max()

Out[14]:

153.54953899999998

The range of Longitude values looks reasonable for Queensland. If we look at the dataframe for all crashes, we see that there are some crashes with a suspicious (wrong) value. Clearly, some crashes have a missing Longtitude value, that gets coded as 0.0 in the csv file.

In [15]:

crashes["Crash_Longitude_GDA94"].min()

Out[15]:

0.0

Let's see how many crashes have no Lat/Lon values.

In [16]:

len(crashes[crashes["Crash_Longitude_GDA94"] < 0.1])

Out[16]:

Compared to the total numbers of crashes, the fraction with Lat/Lon missing values is quite small.

In [17]:

len(crashes[crashes["Crash_Longitude_GDA94"] > 0.1])

Out[17]:

It turns out that all the fatal crashes have a recorded Lat/Lon pair.

In [18]:

len(fatal[fatal["Crash_Longitude_GDA94"] > 0.1])

Out[18]:

At this stage, I decided to use GeoPandas, to support my preliminary visualizations. We show all the columns again, to refresh my memory of what data we have. I could have chosen to extract just the Lat/Lon pairs, but decided to keep all the data columns in my GeoPandas data structures: the numbers of rows is not huge, and I may choose to use this data somewhere in my visualizations.

In [19]:

fatal.columns

Out[19]:

Index(['Crash_Ref_Number', 'Crash_Severity', 'Crash_Year', 'Crash_Month',
       'Crash_Day_Of_Week', 'Crash_Hour', 'Crash_Nature', 'Crash_Type',
       'Crash_Longitude_GDA94', 'Crash_Latitude_GDA94', 'Crash_Street',
       'Crash_Street_Intersecting', 'State_Road_Name', 'Loc_Suburb',
       'Loc_Local_Government_Area', 'Loc_Post_Code', 'Loc_Police_Division',
       'Loc_Police_District', 'Loc_Police_Region',
       'Loc_Queensland_Transport_Region', 'Loc_Main_Roads_Region',
       'Loc_ABS_Statistical_Area_2', 'Loc_ABS_Statistical_Area_3',
       'Loc_ABS_Statistical_Area_4', 'Loc_ABS_Remoteness',
       'Loc_State_Electorate', 'Loc_Federal_Electorate',
       'Crash_Controlling_Authority', 'Crash_Roadway_Feature',
       'Crash_Traffic_Control', 'Crash_Speed_Limit',
       'Crash_Road_Surface_Condition', 'Crash_Atmospheric_Condition',
       'Crash_Lighting_Condition', 'Crash_Road_Horiz_Align',
       'Crash_Road_Vert_Align', 'Crash_DCA_Code', 'Crash_DCA_Description',
       'Crash_DCA_Group_Description', 'DCA_Key_Approach_Dir',
       'Count_Casualty_Fatality', 'Count_Casualty_Hospitalised',
       'Count_Casualty_MedicallyTreated', 'Count_Casualty_MinorInjury',
       'Count_Casualty_Total', 'Count_Unit_Car', 'Count_Unit_Motorcycle_Moped',
       'Count_Unit_Truck', 'Count_Unit_Bus', 'Count_Unit_Bicycle',
       'Count_Unit_Pedestrian', 'Count_Unit_Other'],
      dtype='object')

Creating a GeoPandas DataFrame¶

In [20]:

fatal_gdf = gpd.GeoDataFrame(
    fatal,
    geometry=gpd.points_from_xy(
        fatal.Crash_Longitude_GDA94,
        fatal.Crash_Latitude_GDA94,
    ),
)

The Lat/Lon pairs in the geometry column look reasonable.

In [21]:

fatal_gdf.geometry.head()

Out[21]:

18     POINT (151.934353976 -27.566383584)
148    POINT (148.702518587 -21.623855674)
154    POINT (153.018325532 -27.275777136)
177    POINT (153.388038935 -27.937846918)
285    POINT (152.073305579 -26.881843464)
Name: geometry, dtype: object

We now do the almost the most minimal visualization. It turns out that the Natural Earth world dataset doesn't think that Australia is a continent, so I had to use Oceania, which includes French Polynesia on the far side of the Date Line.

In [22]:

world = gpd.read_file(
    gpd.datasets.get_path('naturalearth_lowres')
)

# We restrict to Oceania
ax = world[world.continent == 'Oceania'].plot(
    color='white', edgecolor='black'
)

# We can now plot our ``GeoDataFrame``.
fatal_gdf.plot(ax=ax, color='red')

plt.show()

No description has been provided for this image

A listing of what Natural Earth thinks are the continents(?!).

In [23]:

world.continent.unique()

Out[23]:

array(['Oceania', 'Africa', 'North America', 'Asia', 'South America',
       'Europe', 'Seven seas (open ocean)', 'Antarctica'], dtype=object)

The absolutely minimal GeoPandas visualization. You can make out the shape of Queensland, and even what I suspect is Birdsville in the South West corner.

In [24]:

fatal_gdf.plot()

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x26f00e346d8>

Mapping fatal crashes¶

Supporting functions and definitions¶

In this section, I produce some more detailed maps. In some, I use a basemap showing roads and towns, and I have chosen Google and Stamen as the source for these backgrounds.

Define an imagery class for Stamen, should we decide to use it.

In [25]:

class StamenTerrain(GoogleTiles):
    def _image_url(self, tile):
        x, y, z = tile
        url = 'http://tile.stamen.com/terrain/{}/{}/{}.png'.format(
            z, x, y
        )
        return url

    # end _image_url


# end StamenTerrain

We have to chose a Coordinate Reference System (CRS) for our maps. To make things easy, I have chosen PlateCarree: Queensland looks OK in this projection (unlike, say continential USA). We re-create the GeoPandas dataframe.

In [26]:

# define the CRS we will use for all maps

my_crs = ccrs.PlateCarree()

fatal_gdf = gpd.GeoDataFrame(
    fatal,
    geometry=gpd.points_from_xy(
        fatal.Crash_Longitude_GDA94,
        fatal.Crash_Latitude_GDA94,
    ),
)

Plotting Queensland Fatal Crashes¶

In this figure, I use the higher resolution coastline and state borders (1:10M) to provide geographic context. I manually set the extent of the map.

Each fatal crash is shown as a dot with a very low alpha (i.e. very translucent), so that areas with high fatal crash density are easier to perceive. Mount Isa is unexpectedly very prominent.

In [33]:

fig = plt.figure(figsize=(20, 20))

ax = fig.add_subplot(1, 1, 1, projection=my_crs)

ax.set_extent((137.4, 154.2, -30, -9.8), crs=my_crs)
ax.coastlines(resolution='10m')

# Create a feature for States/Admin 1 regions at 1:10m from Natural Earth
states_provinces = cfeature.NaturalEarthFeature(
    category='cultural',
    name='admin_1_states_provinces_lines',
    scale='10m',
    facecolor='none',
)

# plot state border
SOURCE = 'Natural Earth'
LICENSE = 'public domain'
ax.add_feature(states_provinces, edgecolor='gray')

# plot fatal crashes
fatal_gdf.plot(ax=ax, color='red', alpha=0.1)

# set title
ax.set_title('Highways of Blood', fontsize=20)

plt.show()

Showing Local Crashes¶

If we want to show the local area (for, say a Sunday drive), then we need more geographic context, showing roads and towns. The example below shows fatal crashes in the context of Google tiles. I have chosen to use the CRS of the image tiles for this map, as re-projecting images can result in a fuzzy looking basemap.

As before, the map extent is set manually to my local patch, and the zoom level is set manually to a value that looks good, and has the right amount of detail. On the recommendation of Stack Overflow, I have the interpolation='spline36' parameter on the add_image call, as it might improve image quality if I choose a different CRS.

In [28]:

fig = plt.figure(figsize=(20, 20))

imagery = cimgt.GoogleTiles()

ax = plt.axes(projection=imagery.crs)
ax.set_extent(
    (153, 153.2, -26.6, -26.4), crs=ccrs.PlateCarree()
)

# Add the imagery to the map.
zoom = 12

ax.add_image(imagery, zoom, interpolation='spline36')


# Here's what the plot looks like in GeoPandas
crs_proj4 = imagery.crs.proj4_init
fatal_gdf.crs = {'init': 'epsg:4326'}
fatal_gdf2 = fatal_gdf.to_crs(crs_proj4)
fatal_gdf2.plot(ax=ax, color='red', alpha=0.8, zorder=4)

plt.title(
    "Local Fatal Crashes: Sunshine Coast",
    {"fontsize": 30},
    pad=40,
)
plt.show()

The diagram above represents a mock-up of what might be my final app UI. If you use Google Maps to plan a trip, then the red dots representing fatal crashes might allow the user to avoid 'Black Spots', and show where extra care might be needed.

Conclusions / Summary¶

This is a work in progress: the nxet steps will be to try Folium, and get a fata crash webserver going.

Reproducibility¶

This section of the notebook provides information to support reproducability.

Notebook version status¶

In [29]:

theNotebook = (
    '2019-10-21-dc-QLDCrashLocationsNotebook.ipynb'
)

In [30]:

# show info to support reproducibility


def python_env_name():
    envs = subprocess.check_output(
        'conda env list'
    ).splitlines()
    # get unicode version of binary subprocess output
    envu = [x.decode('ascii') for x in envs]
    active_env = list(
        filter(lambda s: '*' in str(s), envu)
    )[0]
    env_name = str(active_env).split()[0]
    return env_name


# end python_env_name

print('python version : ' + sys.version)
print('python environment :', python_env_name())
print('pandas version : ' + pd.__version__)

print('current wkg dir: ' + os.getcwd())
print('Notebook name: ' + theNotebook)
print(
    'Notebook run at: '
    + str(datetime.datetime.now())
    + ' local time'
)
print(
    'Notebook run at: '
    + str(datetime.datetime.utcnow())
    + ' UTC'
)
print('Notebook run on: ' + platform.platform())

python version : 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
python environment : ac5-py37
pandas version : 0.23.4
current wkg dir: C:\Users\donrc\Documents\JupyterNotebooks\QLDCrashLocationsNotebookProject\develop
Notebook name: 2019-10-21-dc-QLDCrashLocationsNotebook.ipynb
Notebook run at: 2019-12-02 15:17:56.451019 local time
Notebook run at: 2019-12-02 05:17:56.451019 UTC
Notebook run on: Windows-10-10.0.18362-SP0

In [31]:

%watermark

2019-12-02T15:17:56+10:00

CPython 3.7.1
IPython 7.2.0

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit

In [32]:

%watermark -h -iv

platform   1.0.8
seaborn    0.9.0
matplotlib 3.0.2
pandas     0.23.4
geopandas  0.5.1
cartopy    0.17.0
PIL.Image  5.3.0
six        1.12.0
host name: DESKTOP-SODFUN6

In [ ]: