Thu 08 February 2024

Filed under Visualization

Tags geopandas legends

Geopandas categories in legends

Introduction

I recently ran across a little wrinkle in plotting maps with Geopandas. In this particular case I was plotting a subset of a large DataFrame, and the Legend produced was misleading. This brief blog post shows a workaround.


Implementation

Define environment

Import of packages used

In [1]:
import geopandas as gpd
import numpy as np
import pandas as pd

Notebook utilities used

In [2]:
%load_ext watermark
In [3]:
%load_ext lab_black

Get example dataset

To illustrate the glitch, we will use one of the sample datasets bundled with Geopandas. First, we plot it and print its contents.

In [4]:
gpd.datasets.available
Out[4]:
['naturalearth_cities', 'naturalearth_lowres', 'nybb']
In [5]:
df = gpd.read_file(gpd.datasets.get_path("nybb"))
df.plot()
Out[5]:
<AxesSubplot:>
In [6]:
df.head()
Out[6]:
   BoroCode       BoroName     Shape_Leng    Shape_Area                                           geometry
0         5  Staten Island  330470.010332  1.623820e+09  MULTIPOLYGON (((970217.022 145643.332, 970227....
1         4         Queens  896344.047763  3.045213e+09  MULTIPOLYGON (((1029606.077 156073.814, 102957...
2         3       Brooklyn  741080.523166  1.937479e+09  MULTIPOLYGON (((1021176.479 151374.797, 102100...
3         1      Manhattan  359299.096471  6.364715e+08  MULTIPOLYGON (((981219.056 188655.316, 980940....
4         2          Bronx  464392.991824  1.186925e+09  MULTIPOLYGON (((1012821.806 229228.265, 101278...

Set Borough Name column as a categorical column

We would typically do this on a large DataFrame in order to improve efficiency: a categorical column stores each distinct name once, plus a compact integer code per row.

In [7]:
df['BoroName'] = df['BoroName'].astype('category')
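
As a rough illustration of the efficiency point, the sketch below compares the memory used by the same values stored as plain Python strings and as a categorical. The 100,000-row Series is hypothetical data made up for this example (the real `nybb` dataset has only five rows), and the exact byte counts will vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows drawn from the five borough names
rng = np.random.default_rng(0)
names = ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
as_object = pd.Series(rng.choice(names, size=100_000), dtype=object)

# Same values, stored as a categorical column
as_category = as_object.astype('category')

# deep=True counts the Python string objects as well as the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

With only a handful of distinct values repeated many times, the categorical version should come out considerably smaller.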

List out the categories.

In [8]:
df['BoroName'].cat.categories
Out[8]:
Index(['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'], dtype='object')

Now request a plot in which the Boroughs are distinguished by color.

In [9]:
df.plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left',},
)
Out[9]:
<AxesSubplot:>

Now we will plot just a subset of the data. We subset by Borough Code. First, we list all codes.

In [10]:
df['BoroCode'].tolist()
Out[10]:
[5, 4, 3, 1, 2]

Now we keep only the Boroughs whose code is 3, 4, or 5, and mask out the rest.

In [11]:
selected_codes = [3, 4, 5]
df2 = df[df['BoroCode'].isin(selected_codes)].copy()
df2.plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left',},
)
Out[11]:
<AxesSubplot:>

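Now you can see the problem: although we only plotted three Boroughs, the Legend shows five colors!

The reason is that the categories were left unchanged by our subsetting operation, and the Legend appears to be built from the full category list rather than from the values actually present.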

In [12]:
df2['BoroName'].cat.categories
Out[12]:
Index(['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'], dtype='object')
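
The same behaviour can be reproduced with pandas alone, which suggests the leftover categories come from pandas rather than from anything Geopandas does. This is a minimal sketch; the Series `boro` is a stand-in for the `BoroName` column, not the real dataset.

```python
import pandas as pd

# Stand-in for the BoroName column of the nybb dataset
boro = pd.Series(
    ['Staten Island', 'Queens', 'Brooklyn', 'Manhattan', 'Bronx'],
    dtype='category',
)

# Keep three of the five boroughs
subset = boro[boro.isin(['Staten Island', 'Queens', 'Brooklyn'])]

# The category list still has all five names ...
all_categories = list(subset.cat.categories)
# ... although only three of them occur in the data
used_values = sorted(subset.unique().tolist())

print(all_categories)
print(used_values)
```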

Solution

We define a helper function that strips out unused categories.

In [13]:
def strip_unused_categories(
    df: pd.DataFrame, column_name: str
):
    '''
    strip_unused_categories: remove unused categories from a categorical column definition

    Parameters:
    df: pandas DataFrame, with one column being categorical

    column_name: str - name of categorical column to be pruned

    Returns:
    None

    Side Effects:
    Categories of the referenced column might be changed

    Example:
    If we have:
        0     Blue
        1    Green
        2     Blue
        3    Green
        4    Brown
        Name: color, dtype: category
        Categories (4, object): [Blue, Brown, Green, Red]

    we want:
        0     Blue
        1    Green
        2     Blue
        3    Green
        4    Brown
        Name: color, dtype: category
        Categories (3, object): [Blue, Brown, Green]

    Usage Example:
        df = pd.DataFrame(
            {
                'color': np.random.choice(
                    ['Blue', 'Green', 'Brown', 'Red'], 50
                )
            }
        )

        df.color = df.color.astype('category')
        print(df.color.cat.categories)

        # copy, so the in-place change below acts on an independent DataFrame
        df = df.query('color != "Brown"').copy()
        strip_unused_categories(df, 'color')
        print(df.color.cat.categories)

        ->
        Index(['Blue', 'Brown', 'Green', 'Red'], dtype='object')
        Index(['Blue', 'Green', 'Red'], dtype='object')

    '''

    # get all the current category names as a list
    current = list(df[column_name].cat.categories)

    # get the unique categories actually used
    used = df[column_name].unique()

    for cat_name in current:
        if cat_name not in used:
            df[column_name] = df[
                column_name
            ].cat.remove_categories(cat_name)
        # end if
    # end for


# end strip_unused_categories

Using the helper function, we prune the category list.

In [14]:
strip_unused_categories(df2, 'BoroName')

Now the plot has a legend that makes sense!

In [15]:
df2.plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left',},
)
Out[15]:
<AxesSubplot:>

Just checking that our prune operation worked as expected.

In [16]:
df2['BoroName'].cat.categories
Out[16]:
Index(['Brooklyn', 'Queens', 'Staten Island'], dtype='object')
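
For comparison, pandas also has a built-in accessor method, `Series.cat.remove_unused_categories()`, that does the same pruning in one call. The sketch below is an alternative to the helper above, not the approach used in this post, and the small `colors` DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up example data with four categories
colors = pd.DataFrame(
    {'color': ['Blue', 'Green', 'Blue', 'Green', 'Brown', 'Red']}
)
colors['color'] = colors['color'].astype('category')

# Subset, then copy so the assignment below acts on an independent DataFrame
subset = colors[colors['color'] != 'Brown'].copy()

# Built-in pandas method: drop categories that no longer occur in the column
subset['color'] = subset['color'].cat.remove_unused_categories()

pruned = list(subset['color'].cat.categories)
print(pruned)
```

Unlike the helper, this returns a new Series rather than changing the DataFrame in place, so the result has to be assigned back to the column.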

Just as a side note, the misleading Legend also appears if other means of subsetting are used.

In [17]:
df.query('BoroCode>2').plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left',},
)
Out[17]:
<AxesSubplot:>
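
When subsetting with a chained call such as `query`, one option is a small function that returns a pruned copy, so it can be dropped into the chain with `pipe`. This is a sketch only: `prune_categories` is a hypothetical name, and the plain pandas DataFrame `boroughs` stands in for the GeoDataFrame used above.

```python
import pandas as pd


def prune_categories(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with unused categories dropped from every categorical column."""
    out = frame.copy()
    for name in out.columns:
        if isinstance(out[name].dtype, pd.CategoricalDtype):
            out[name] = out[name].cat.remove_unused_categories()
    return out


# Stand-in for the borough table (no geometry column)
boroughs = pd.DataFrame(
    {
        'BoroCode': [5, 4, 3, 1, 2],
        'BoroName': pd.Categorical(
            ['Staten Island', 'Queens', 'Brooklyn', 'Manhattan', 'Bronx']
        ),
    }
)

# Subset and prune in one chain; the original DataFrame is left untouched
pruned = boroughs.query('BoroCode > 2').pipe(prune_categories)
remaining = list(pruned['BoroName'].cat.categories)
print(remaining)
```

Assuming the copy keeps the GeoDataFrame type, the same chain should be usable before `.plot(...)` on the real data, although that has not been run here.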

Reproducibility

In [18]:
%watermark -iv
pandas   : 1.5.2
numpy    : 1.23.5
geopandas: 0.9.0

In [19]:
%watermark
Last updated: 2024-02-04T19:12:33.113777+10:00

Python implementation: CPython
Python version       : 3.9.15
IPython version      : 8.8.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores   : 8
Architecture: 64bit

In [20]:
%watermark -co
conda environment: mapping2

net-analysis.com Data Analysis Blog © Don Cameron