Geopandas categories in legends¶
Introduction¶
I recently ran across a little wrinkle in plotting maps with Geopandas. I was plotting a subset of a large DataFrame, and the legend that was produced was misleading: it listed categories that did not appear on the map. This brief blog post shows a workaround.
Implementation¶
Define environment¶
Import of packages used¶
import geopandas as gpd
import numpy as np
import pandas as pd
Notebook utilities used¶
%load_ext watermark
%load_ext lab_black
Get example dataset¶
We will use one of the sample datasets bundled with Geopandas, the New York City borough boundaries (nybb), to illustrate the glitch. First, we plot it and print its contents.
gpd.datasets.available
df = gpd.read_file(gpd.datasets.get_path("nybb"))
df.plot()
df.head()
Set Borough Name column as a categorical column¶
We would typically do this on a large DataFrame to save memory and to speed up operations on the column.
df['BoroName'] = df['BoroName'].astype('category')
List out the categories.
df['BoroName'].cat.categories
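To make the efficiency argument concrete, here is a minimal sketch that compares the memory footprint of a text column before and after conversion. The 100,000-row Series and the variable names are made up for illustration; they are a stand-in for a large table, not part of the nybb dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a large table: 100,000 rows drawn from five names
names = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
rng = np.random.default_rng(0)
as_text = pd.Series(rng.choice(names, size=100_000), name="BoroName")
as_category = as_text.astype("category")

# deep=True includes the memory held by the strings themselves
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"plain strings: {text_bytes:,} bytes")
print(f"categorical:   {category_bytes:,} bytes")
print(list(as_category.cat.categories))
```

The categorical version stores each distinct name once and keeps a small integer code per row, which is where the saving comes from.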
Now request a plot, where the Boroughs are distinguished by color.
df.plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left'},
)
Now we will plot just a subset of the data. We subset by Borough Code. First, we list all codes.
[code for code in df['BoroCode']]
Now we keep only those Boroughs whose code is 3, 4, or 5, and plot the result.
selected_codes = [3, 4, 5]
df2 = df[df['BoroCode'].isin(selected_codes)].copy()
df2.plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left'},
)
Now you can see the problem: although we only plotted three Boroughs, the legend shows five colors!
The reason is that the categories of the column are unchanged by our subsetting operation.
df2['BoroName'].cat.categories
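The same behaviour can be reproduced with plain pandas, which suggests the extra legend entries come from the categorical dtype rather than from the plotting call itself. The sketch below uses a small hypothetical table (`toy`) in place of the borough GeoDataFrame; no geometry is needed to show the effect.

```python
import pandas as pd

# Toy stand-in for the borough table (illustrative, not the nybb file)
toy = pd.DataFrame(
    {
        "BoroCode": [1, 2, 3, 4, 5],
        "BoroName": [
            "Manhattan",
            "Bronx",
            "Brooklyn",
            "Queens",
            "Staten Island",
        ],
    }
)
toy["BoroName"] = toy["BoroName"].astype("category")

subset = toy[toy["BoroCode"].isin([3, 4, 5])].copy()

# Three names survive as values ...
print(subset["BoroName"].nunique())
# ... but the dtype still lists all five categories
print(len(subset["BoroName"].cat.categories))
# value_counts() reports the unused categories with a count of zero
print(subset["BoroName"].value_counts())
```

Filtering rows removes values, not category definitions, so anything that iterates over the categories still sees all five.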
The helper function below removes every category that no longer appears in the column.
def strip_unused_categories(
    df: pd.DataFrame, column_name: str
) -> None:
    '''
    strip_unused_categories: remove unused categories from a categorical column definition

    Parameters:
    df: pandas DataFrame, with one column being categorical
    column_name: str - name of categorical column to be pruned

    Returns:
    None

    Side Effects:
    Categories of the referenced column might be changed

    Example:
    If we have:
    0     Blue
    1    Green
    2     Blue
    3    Green
    4    Brown
    Name: color, dtype: category
    Categories (4, object): [Blue, Brown, Green, Red]

    we want:
    0     Blue
    1    Green
    2     Blue
    3    Green
    4    Brown
    Name: color, dtype: category
    Categories (3, object): [Blue, Brown, Green]

    Usage Example:
    df = pd.DataFrame(
        {
            'color': np.random.choice(
                ['Blue', 'Green', 'Brown', 'Red'], 50
            )
        }
    )
    df.color = df.color.astype('category')
    print(df.color.cat.categories)
    df = df.query('color != "Brown"').copy()
    strip_unused_categories(df, 'color')
    print(df.color.cat.categories)
    ->
    Index(['Blue', 'Brown', 'Green', 'Red'], dtype='object')
    Index(['Blue', 'Green', 'Red'], dtype='object')
    '''
    # get all the current category names as a list
    current = list(df[column_name].cat.categories)
    # get the unique categories actually used
    used = df[column_name].unique()
    for cat_name in current:
        if cat_name not in used:
            # drop this category from the column's definition
            df[column_name] = df[
                column_name
            ].cat.remove_categories(cat_name)
        # end if
    # end for
# end strip_unused_categories
Using the helper function, we prune the category list of the subset.
strip_unused_categories(df2, 'BoroName')
Now the plot has a legend that makes sense!
df2.plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left'},
)
Just checking our prune operation worked as expected.
df2['BoroName'].cat.categories
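As an alternative to the hand-written helper, pandas offers `Series.cat.remove_unused_categories()`, which should give the same result in a single call. The sketch below reuses the color example from the helper's docstring; the variable names `colors` and `pruned` are illustrative only.

```python
import pandas as pd

# Five values, but four declared categories: "Red" is never used
colors = pd.Series(["Blue", "Green", "Blue", "Green", "Brown"], name="color")
colors = colors.astype(
    pd.CategoricalDtype(["Blue", "Brown", "Green", "Red"])
)

# Returns a new Series; the original is left untouched
pruned = colors.cat.remove_unused_categories()

print(list(colors.cat.categories))
print(list(pruned.cat.categories))
```

For the subset above, the equivalent would presumably be `df2['BoroName'] = df2['BoroName'].cat.remove_unused_categories()`.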
As a side note, the misleading legend also appears if other means of subsetting are used, such as query.
df.query('BoroCode>2').plot(
    column='BoroName',
    legend=True,
    legend_kwds={'loc': 'upper left'},
)
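When the subset is built in a method chain like this, the pruning can be folded into the chain with `assign`, so that no intermediate DataFrame has to be modified in place. This is a sketch on a hypothetical plain-pandas table (`toy`); I have not verified it against every Geopandas version, but with a GeoDataFrame the same chain would be expected to end in the `.plot(...)` call.

```python
import pandas as pd

# Toy stand-in for the borough table (illustrative, not the nybb file)
toy = pd.DataFrame(
    {
        "BoroCode": [1, 2, 3, 4, 5],
        "BoroName": [
            "Manhattan",
            "Bronx",
            "Brooklyn",
            "Queens",
            "Staten Island",
        ],
    }
)
toy["BoroName"] = toy["BoroName"].astype("category")

# Subset and prune in one expression; toy itself keeps all five categories
subset = toy.query("BoroCode > 2").assign(
    BoroName=lambda d: d["BoroName"].cat.remove_unused_categories()
)

print(list(subset["BoroName"].cat.categories))
```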
Reproducibility¶
%watermark -iv
%watermark
%watermark -co