Plotnine - statistical and geometric processing
geom and stat concepts¶
The official descriptions of the geom and stat concepts are:
Statistical transformations (stats) do aggregations and other computations on data before it is drawn out. stat_* determine the type of computation done on the data. Different types of computations yield varied results, so a stat must be paired with a geom that can represent all or some of the computations.
and
Geometric objects (geoms) are responsible for the visual representation of data points. geom_* classes determine the kind of geometric objects and every plot must have at least one geom added to it.
import warnings
import numpy as np
import pandas as pd
import plotnine as p9
from scipy import stats
watermark provides reproducibility information
%load_ext watermark
# warnings.filterwarnings("ignore", category=UserWarning, module="plotnine.*")
geom and stat concepts¶
To repeat, the official descriptions of the geom and stat concepts are:
Statistical transformations (stats) do aggregations and other computations on data before it is drawn out. stat_* determine the type of computation done on the data. Different types of computations yield varied results, so a stat must be paired with a geom that can represent all or some of the computations.
and
Geometric objects (geoms) are responsible for the visual representation of data points. geom_* classes determine the kind of geometric objects and every plot must have at least one geom added to it.
There is a mutual relationship between the two. A geom has an associated statistical process (a stat), and a stat has an associated method of visualization (a geom). If you call a geom, it will call a stat before creating a graphic visualization; if you call a stat, it will call a geom to render a visualization of the data processing results.
We can make this explicit with an example.
First we generate 1,000 normal random numbers
size = 1000
x = np.random.normal(0, 10, size)
geom referencing a stat¶
Now we create a histogram with a geom (geom_histogram) whose default statistical process bins the data and counts the entries in each bin. stat_bin is already the default statistical process for a histogram, but we make this explicit with the stat= parameter. plotnine takes this parameter, glues "stat_" on the front, and then calls that function. The mapping= parameter tells plotnine to get the x values from the x array and map them to the x-axis of the plot.
The steps are:
- create an empty plot
- add a histogram, setting the line color to gray to make the bars stand out
- set theme to black & white
plot = (
p9.ggplot()
+ p9.geom_histogram(mapping=p9.aes(x=x), color="gray", stat="bin")
+ p9.theme_bw()
)
plot
C:\Users\donrc\anaconda3\envs\fun_minim\Lib\site-packages\plotnine\stats\stat_bin.py:112: PlotnineWarning: 'stat_bin()' using 'bins = 24'. Pick better value with 'binwidth'.
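The warning above asks for an explicit binwidth. One common heuristic for picking it, shown here as a sketch and not as plotnine's own rule, is the Freedman-Diaconis rule: twice the interquartile range divided by the cube root of the sample size. The seeded generator and the variable names are assumptions made for the illustration.

```python
import numpy as np

# Assumption: a seeded sample standing in for the x array used above
rng = np.random.default_rng(0)
x = rng.normal(0, 10, 1000)

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(x, [75, 25])
binwidth = 2 * (q75 - q25) / len(x) ** (1 / 3)

print(f"suggested binwidth: {binwidth:.2f}")
```

The resulting value could then be passed as binwidth= to geom_histogram or stat_bin.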
stat referencing a geom¶
We can get exactly the same plot by invoking the statistical process, and specifying we want a histogram visualization
The steps are:
- create an empty plot
- call for a binning statistical process, which (by default) will add a histogram. The line color parameter is passed onto the geom
- set theme to black & white
plot = (
p9.ggplot()
+ p9.stat_bin(mapping=p9.aes(x=x), color="gray", geom="histogram")
+ p9.theme_bw()
)
plot
We can reference a geom that isn't the default geom. In the case below, we ask to visualize the results of the statistical operation with points.
The steps are:
- create an empty plot
- call for a binning statistical process, and specify visualization by points. In this case, the color parameter is passed to the geom to set the point color
- set theme to black & white
plot = (
p9.ggplot()
+ p9.stat_bin(mapping=p9.aes(x=x), color="gray", geom="point")
+ p9.theme_bw()
)
plot
In the case below, we ask to visualize the results of the statistical operation as a line
The steps are:
- create an empty plot
- call for a binning statistical process, and specify visualization by a line
- set theme to black & white
plot = (
p9.ggplot()
+ p9.stat_bin(
mapping=p9.aes(
x=x,
),
color="gray",
geom="line",
)
+ p9.theme_bw()
)
plot
This only works if the target geom gets all the data it needs. In the example below, we feed the count of points in each bin to the label aesthetic required by geom_label. The binning process never uses the label parameter, but it is passed on to geom_label, which does use it.
plot = (
p9.ggplot()
+ p9.stat_bin(
mapping=p9.aes(x=x, label=p9.after_stat("count")),
color="gray",
geom="label",
format_string="{:.0f}",
)
+ p9.theme_bw()
)
plot
It can work the other way (invoke a stat from a geom, and use the results of the statistical processing in the geom). Each stat may create pseudo-variables that can be accessed in the parameters of the geom call.
For example, stat_bin creates:
Options for computed aesthetics
- "count" # number of points in bin
- "density" # density of points in bin, scaled to integrate to 1
- "ncount" # count, scaled to maximum of 1
- "ndensity" # density, scaled to maximum of 1
- "ngroup" # number of points in group
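To make these definitions concrete, the same quantities can be approximated with NumPy alone. This is a sketch under assumptions: 24 equal-width bins over the data range and a seeded sample. plotnine chooses its own bin edges, so the exact numbers in the plots will differ.

```python
import numpy as np

# Assumption: a seeded sample standing in for the x array used above
rng = np.random.default_rng(0)
x = rng.normal(0, 10, 1000)

# Assumption: 24 equal-width bins spanning the data range
count, edges = np.histogram(x, bins=24)
width = np.diff(edges)

density = count / (count.sum() * width)  # integrates to 1 over the bins
ncount = count / count.max()             # count, scaled to a maximum of 1
ndensity = density / density.max()       # density, scaled to a maximum of 1

print(count[:5], ncount[:5].round(3))
```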
So here we can use the after_stat function to access these variables by name, and specify that we want a plot layer consisting of labels: the x array is mapped to the x-axis position, the after-statistical-processing count of values in each bin is mapped to the y-axis position, and the count of values in each bin is mapped to the label to show.
The steps are:
- create an empty plot
- create labels with a call for a binning statistical process, setting the label content after the statistical process from the pseudo-variable "count"
- set theme to black & white
plot = (
p9.ggplot()
+ p9.geom_label(
mapping=p9.aes(x=x, y=p9.after_stat("count"), label=p9.after_stat("count")),
stat="bin",
format_string="{:.0f}",
)
+ p9.theme_bw()
)
plot
As another example, we show labels holding the normalized count (maximum=1), another pseudo-variable computed by stat_bin.
The steps are:
- create an empty plot
- create labels with a call for a binning statistical process, setting labels content after the statistical process to normalized counts
- set theme to black & white
plot = (
p9.ggplot()
+ p9.geom_label(
mapping=p9.aes(x=x, y=p9.after_stat("count"), label=p9.after_stat("ncount")),
stat="bin",
format_string="{:.2f}",
size=8,
va="bottom",
)
+ p9.theme_bw()
)
plot
More on computed pseudo-variables¶
Plotnine calls these pseudo-variables "computed aesthetics".
We set up a pandas dataframe, with a single column "var1"
df = pd.DataFrame({"var1": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6]})
Create a vertical bar chart, with the count of how many times each value occurs mapped to the y-axis (i.e. the top of the bar). Note that we are using stat_count (the default for geom_bar), which counts the observations at each distinct x value. This is different from stat_bin, which counts the values falling into each bin.
The pseudo-variables for stat_count are:
Options for computed aesthetics
- "count" # Number of observations at a position
- "prop" # Ratio of points in the panel at a position
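For the "var1" values used in this section, both quantities can be worked out by hand. The sketch below uses only the standard library and assumes a single group, so that prop is simply the count divided by the total number of observations.

```python
from collections import Counter

# The values of the "var1" column
values = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6]

count = Counter(values)       # observations at each position
total = sum(count.values())   # 17 observations in all

# Assumption: one group, so prop = count / total
prop = {value: n / total for value, n in sorted(count.items())}

for value, p in prop.items():
    print(value, count[value], round(p, 3))
```

With a single group, the expression count/np.sum(count) gives the same numbers as prop.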
In the example below, we use the pseudo-variable "count" to set the height (y value) of each bar.
The steps are:
- create an empty plot, specifying the source dataframe
- create a bar graph (which calls a counting statistical process) and use the results from the counting
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_bar(
mapping=p9.aes(
x="var1",
y=p9.after_stat("count"),
),
)
+ p9.theme_bw()
)
plot
In the example below, we show how we can use expressions operating on the pseudo-variables in the mapping to plot parameters. Of course, we could have used the pseudo-variable "prop", but that is not the point of the example. We could have squared count, or taken the square root, etc.
The steps are:
- create an empty plot, specifying the source dataframe
- create a bar graph (which calls a counting statistical process) and use the results from the counting in a formula that plotnine will evaluate
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_bar(
mapping=p9.aes(
x="var1",
y=p9.after_stat("count/np.sum(count)"),
)
)
+ p9.theme_bw()
)
plot
Some stats trash original aes mappings¶
In the example below, we map var1 to the x-axis, the y-axis, and the color of each point. The stat that geom_point calls by default is stat_identity, the "do-nothing" stat. Thus the multiple mappings work.
The steps are:
- create an empty plot, specifying the source dataframe
- create points colored according to x value
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(
x="var1",
y="var1",
color="var1",
),
size=20,
)
+ p9.theme_bw()
)
plot
In the example below, geom_bar calls stat_count, and as a result the original aes mappings are lost, so assigning the variable "var1" to fill (to set the color) fails. Even worse, it fails silently!
The steps are:
- create an empty plot, specifying the source dataframe
- create a bar graph (which calls a counting statistical process) and use the results from the counting in a formula that plotnine will evaluate. We try to set the fill according to the original variable names from the dataframe (Fail!)
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_bar(
mapping=p9.aes(
x="var1",
y=p9.after_stat("count"),
fill="var1",
),
)
+ p9.theme_bw()
)
plot
However, in the new aes environment after the stat_count operations, we have the aesthetic variables "x" and "y" defined. Another way to think about it is that we have forgotten about the dataframe that was our original source of data, and are now operating in the after-statistical-processing regime, where we can use the pseudo-variables defined by the process. So we can use "x" (holding the values that came from the "var1" column in the source dataframe) to map to the fill value.
This gives us what we want: a gradient of colors for the bars
The steps are:
- create an empty plot, specifying the source dataframe
- create a bar graph (which calls a counting statistical process) and use the results from the counting in a formula that plotnine will evaluate. We set the fill according to one of the after-process pseudo-variable names (Works!)
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_bar(
mapping=p9.aes(
x="var1",
y=p9.after_stat("count"),
fill=p9.after_stat("x"),
),
)
+ p9.theme_bw()
)
plot
We can use a different pseudo-variable for the fill: say, the count of each value: this color-codes the bars by height
The steps are:
- create an empty plot, specifying the source dataframe
- create a bar graph (which calls a counting statistical process) and use the results from the counting in a formula that plotnine will evaluate. We set the fill according to one of the after-process pseudo-variable names (Works!)
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_bar(
mapping=p9.aes(
x="var1",
y=p9.after_stat("count"),
fill=p9.after_stat("count"),
),
)
+ p9.theme_bw()
)
plot
Multiple stat calls¶
In the example below, we construct a plot with two layers: one holding vertical bars, one holding text annotations. We use geom_bar and geom_text. In the call to geom_text we specify that we want to call stat_count for the statistical processing (overriding the default). This means we can use the pseudo-variable "prop" (normalized counts) to set the label shown at the top of the bar (the y aes parameter is set to the count, the same as for the bar).
The steps are:
- create an empty plot, specifying the source dataframe
- create a bar graph (which calls a counting statistical process) and use the results from the counting in a formula that plotnine will evaluate. We set the fill according to one of the after-process variable names
- create a layer with text positioned at top of bars (same y value), with text label content derived from an after-process variable ("prop" = proportion), of the counting statistical process
- set theme to black & white
Note we have a number of different namespaces here: the stat in the geom_text call is named "count", and this stat creates a pseudo-variable "count": two completely different concepts, same name!
plot = (
p9.ggplot(data=df)
+ p9.geom_bar(
mapping=p9.aes(
x="var1",
y=p9.after_stat("count"),
fill=p9.after_stat("prop"),
),
)
# need outer stat= so p9.aes(... after_stat..) works
+ p9.geom_text(
mapping=p9.aes(x="var1", y=p9.after_stat("count"), label=p9.after_stat("prop")),
stat="count",
format_string="{:4.2}\n",
)
+ p9.theme_bw()
)
plot
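The format_string values used in these examples are ordinary Python format specifications. Assuming plotnine applies them to each label value with str.format, their effect can be checked directly. The sample values below are made up for the illustration.

```python
# Labels as they would be rendered for a few representative values
count_label = "{:.0f}".format(41.0)      # fixed-point, no decimals
ncount_label = "{:.2f}".format(0.5)      # fixed-point, two decimals
prop_label = "{:4.2}\n".format(5 / 17)   # two significant digits, then a newline

print(repr(count_label), repr(ncount_label), repr(prop_label))
```

The trailing newline in the last format adds an empty line below the number, which presumably helps to lift the text clear of the top of the bar.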
Generate some random 2D points. These points will be shown behind the output of the stat... calls.
x = np.random.normal(0, 1, size)
y = np.random.normal(0, 2, size)
df = pd.DataFrame({"x": x, "y": y})
Find the convex hull of the points (by default, stat_hull uses geom_path to visualize the result).
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points
- call for the convex hull process (that will draw a path)
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_point(mapping=p9.aes(x="x", y="y"))
+ p9.stat_hull(
mapping=p9.aes(x="x", y="y"),
)
+ p9.theme_bw()
)
plot
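To show what a convex hull computation involves, here is a small pure-Python sketch of Andrew's monotone chain algorithm. It is an illustration only and is not claimed to be how stat_hull is implemented; the function names and the six sample points are assumptions.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # drop points that would make a clockwise (or straight) turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


# four corners of a square plus two interior points
points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
hull = convex_hull(points)
print(hull)
```

Only the four corners survive; the interior points are discarded, which is why the hull path in the plot touches only the outermost points.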
stat_ellipse¶
Draw ellipses that enclose 50% (red) and 95% (blue) of the points. The blue ellipse uses the default level of stat_ellipse, which is 0.95.
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points
- call for the ellipse process (that will draw a path in blue)
- call for the ellipse process (that will draw a path in red)
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x", y="y"),
alpha=0.2,
)
+ p9.stat_ellipse(
mapping=p9.aes(x="x", y="y"),
color="blue",
)
+ p9.stat_ellipse(
mapping=p9.aes(x="x", y="y"),
color="red",
level=0.5,
)
+ p9.theme_bw()
)
plot
stat_function¶
Computes and displays as a line any function of a variable in the data environment. Here we show a cubic, y=x^3.
def cube(x):
    return x * x * x
# end cube
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points
- call for the function calculation process, that will draw a line in blue
- set the y axis limits
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x", y="y"),
alpha=0.2,
)
+ p9.stat_function(
mapping=p9.aes(
x="x",
),
fun=cube,
color="blue",
)
+ p9.ylim((-5, 5))
+ p9.theme_bw()
)
plot
C:\Users\donrc\anaconda3\envs\fun_minim\Lib\site-packages\plotnine\layer.py:372: PlotnineWarning: geom_point : Removed 14 rows containing missing values. C:\Users\donrc\anaconda3\envs\fun_minim\Lib\site-packages\plotnine\geoms\geom_path.py:100: PlotnineWarning: geom_path: Removed 48 rows containing missing values.
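The missing-value warnings most likely arise because ylim discards anything outside the range (-5, 5). The sketch below evaluates the cubic on an evenly spaced grid and counts the values that fall outside those limits. The grid range (-3, 3) and the 101 evaluation points are assumptions chosen for the illustration, not values read from plotnine.

```python
import numpy as np


def cube(x):
    return x * x * x


# Assumption: 101 evenly spaced evaluation points between -3 and 3
xs = np.linspace(-3, 3, 101)
ys = cube(xs)

# Values beyond the y limits would be dropped from the drawn line
outside = np.abs(ys) > 5
print(f"{int(outside.sum())} of {len(xs)} points fall outside (-5, 5)")
```

Narrowing the x range, or widening the y limits, would reduce the number of dropped points.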
stat_density_2d¶
Draws 2D density contours
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points
- call for the 2D density process, that will draw contours in red
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x", y="y"),
alpha=0.2,
)
+ p9.stat_density_2d(
mapping=p9.aes(x="x", y="y"),
color="red",
)
+ p9.theme_bw()
)
plot
Draw more contours
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points
- call for the 2D density process, that will draw multiple contours in red
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x", y="y"),
alpha=0.2,
)
+ p9.geom_density_2d(
mapping=p9.aes(x="x", y="y"),
color="red",
levels=8,
)
+ p9.theme_bw()
)
plot
stat_qq¶
A Q–Q plot (quantile–quantile plot):
is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.
It is typically used to test for normality of a random variable. In the graph below, the Q–Q plot compares a sample of data on the vertical axis to a statistical population on the horizontal axis. The points follow a strongly linear pattern, suggesting that the data are distributed as a standard normal (X ~ N(0,1)). The zero offset between the line and the points suggests that the mean of the data is 0, which we already knew, as this is how we created the data.
The steps are:
- create an empty plot, specifying the source dataframe
- generate a QQ plot using random data from the dataframe
- call the qq_line process, that will plot a line for the normal distribution on the plot
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_qq(
mapping=p9.aes(sample="x"),
color="red",
)
+ p9.stat_qq_line(mapping=p9.aes(sample="x"))
+ p9.theme_bw()
)
plot
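The coordinates of a Q–Q plot can be sketched without plotnine: sort the sample, pick a probability for each rank, and look up the matching quantile of the reference distribution. The plotting positions (i - 0.5)/n used here are one common convention and an assumption; plotnine may use a different one.

```python
from statistics import NormalDist

import numpy as np

# Assumption: a seeded standard-normal sample standing in for the "x" column
rng = np.random.default_rng(0)
sample = np.sort(rng.normal(0, 1, 1000))
n = len(sample)

# Assumption: plotting positions (i - 0.5) / n for ranks i = 1..n
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

# For normal data the points lie close to a straight line:
# the slope estimates the standard deviation, the intercept the mean
slope, intercept = np.polyfit(theoretical, sample, 1)
r = np.corrcoef(theoretical, sample)[0, 1]
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.4f}")
```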
stat_quantile, geom_quantile¶
Quantile regression:
is a type of regression analysis used in statistics and econometrics. Whereas the method of least squares estimates the conditional mean of the response variable across values of the predictor variables, quantile regression estimates the conditional median (or other quantiles) of the response variable
We generate correlated but noisy data, and construct a plot showing:
- the data points
- the estimated 25%, 50%, and 75% quantiles
The steps are:
- generate a dataframe with noisy data
- create an empty plot, specifying the source dataframe
- create a layer holding points
- generate the quantile regression lines
- set theme to black & white
x = np.random.normal(0, 1, size)
y = x + np.random.normal(0, 2, size)
df = pd.DataFrame({"x": x, "y": y})
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x", y="y"),
alpha=0.2,
)
+ p9.geom_quantile(
mapping=p9.aes(x="x", y="y"),
color="red",
)
+ p9.theme_bw()
)
plot
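Quantile regression fits each line by minimizing the "pinball" (check) loss instead of the squared error. The sketch below shows the idea in its simplest form, a constant model with no predictor, where the minimizer should land on the ordinary sample quantile. The grid search and the seeded data are assumptions made for the illustration; real fits use a proper solver.

```python
import numpy as np


def pinball_loss(residuals, q):
    """Mean check loss: under-predictions weighted by q, over-predictions by 1 - q."""
    return np.mean(np.maximum(q * residuals, (q - 1) * residuals))


rng = np.random.default_rng(0)
y = rng.normal(0, 2, 1000)

# Brute-force search over candidate constants (step 0.01)
candidates = np.linspace(-6, 6, 1201)
best = {}
for q in (0.25, 0.5, 0.75):
    losses = [pinball_loss(y - c, q) for c in candidates]
    best[q] = candidates[int(np.argmin(losses))]
    print(q, round(best[q], 2), round(float(np.quantile(y, q)), 2))
```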
In this example, we use stat_quantile, and use a pseudo-variable to color-code the quantile lines. We use the pseudo-function "factor" to indicate we have only discrete values, and not a continuous gradient
The steps are:
- generate a dataframe with noisy data
- create an empty plot, specifying the source dataframe
- create a layer holding points
- call for a statistical process that will draw quantile regression lines, color coding for quantile value
- set theme to black & white
x = np.random.normal(0, 1, size)
y = x + np.random.normal(0, 2, size)
df = pd.DataFrame({"x1": x, "y1": y})
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x1", y="y1"),
alpha=0.2,
)
+ p9.stat_quantile(
mapping=p9.aes(x="x1", y="y1", color=p9.after_stat("factor(quantile)")),
quantiles=(0.1, 0.5, 0.9),
size=2,
)
+ p9.theme_bw()
)
plot
C:\Users\donrc\anaconda3\envs\fun_minim\Lib\site-packages\statsmodels\regression\quantile_regression.py:191: IterationLimitWarning: Maximum number of iterations (1000) reached.
Density visualization¶
This section will provide examples of visualizing density in one and two dimensions
geom_rug¶
Rug plots are drawn in one dimension by generating tick marks for each data point, on the x and / or y axis.
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points, increasing the transparency (alpha) so as to not draw the eye
- generate the rug plot on x and y axis, showing density of points when projected on each axis
- set theme to black & white
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x1", y="y1"),
alpha=0.2,
)
+ p9.geom_rug(
mapping=p9.aes(
x="x1",
y="y1",
),
alpha=0.1,
)
+ p9.theme_bw()
)
plot
Y axis density graphs¶
There are two ways to visualize the density of points in the y-axis. These are:
- stat_ydensity
- stat_density
stat_ydensity¶
stat_ydensity seems to have been created largely as a helper function for violin plots, but it can be used as a general purpose function.
The statistical process takes a set of values via the y parameter (in our case actually y-axis coordinates) and generates a pseudo-variable "violinwidth" that gives the estimated density at each input value. By default, this stat draws a violin plot, but we can override this to just draw a path. Note that the y parameter is defined for the stat and geom calls, and holds the same values (y-axis coordinates values)
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points, made transparent
- call the y axis density statistical process, specifying the visualization is via a path (continuous line), with path x values being density values (translated to the left), path y values being data y values
- draw a vertical line as a y=0 axis for the density curve
- set theme to black & white
xmin = abs(min(df["x1"]))
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x1", y="y1"),
alpha=0.2,
)
+ p9.stat_ydensity(
mapping=p9.aes(x=p9.after_stat("violinwidth-xmin"), y="y1"), geom="path"
)
+ p9.geom_vline(mapping=p9.aes(xintercept=-xmin))
+ p9.theme_bw()
)
plot
stat_density¶
This example is a little convoluted, but illustrates the use of the stage function.
The stage function allows us to handle the case where the stat_ ... and the subsequent geom_... have the same parameter defined (in this case x=...), but require a completely different set of values.
To illustrate: stat_density needs a set of values from which to estimate density, assigned via the x parameter. We want density along the y-axis, so we start by assigning x the values of "y1" (a column in our source dataframe). Then, after the statistical processing, when we start drawing the path, we want x to be the estimated density (from the after-process pseudo-variable "scaled").
Thus we have (-x_min just translates the graph to the left):
x=p9.stage(start="y1", after_stat="scaled-x_min"),
For the parameter y (which stat_density doesn't use, but geom_path requires), we start by assigning y to 0. Then after the processing, we need the y-axis values, but the statistical process trashed our start environment (pseudo-variables mapping to dataframe column names). Thankfully, the statistical process stores the input set of values (used to compute density) under the pseudo-variable "x". So we can tell geom_path to use this "x" pseudo-variable to get the y-axis values to draw a line.
Thus we have:
y=p9.stage(start=0, after_stat="x"),
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points, made transparent
- call the general density statistical process, specifying the visualization is via a path (continuous line), with path x values being density values (translated to the left), path y values being data y values
- draw a vertical line as a y=0 axis for the density curve
- set theme to black & white
x_min = abs(min(x))
df = pd.DataFrame({"x1": x, "y1": y})
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x1", y="y1"),
alpha=0.2,
)
+ p9.stat_density(
mapping=p9.aes(
x=p9.stage(start="y1", after_stat="scaled-x_min"),
y=p9.stage(start=0, after_stat="x"),
),
geom="path",
)
+ p9.geom_vline(mapping=p9.aes(xintercept=-x_min))
+ p9.theme_bw()
)
plot
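The "scaled" pseudo-variable is the estimated density divided by its maximum, so that the curve peaks at 1. The NumPy sketch below builds a Gaussian kernel density estimate by hand to show the relationship. The bandwidth rule (a normal-reference rule of thumb), the grid, and the seeded data are assumptions, so the numbers will not match plotnine's output exactly.

```python
import numpy as np

# Assumption: a seeded sample standing in for the "y1" column
rng = np.random.default_rng(0)
values = rng.normal(0, 2, 500)
n = len(values)

# Assumption: normal-reference rule-of-thumb bandwidth
bw = 1.06 * values.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid that extends a little past the data
grid = np.linspace(values.min() - 3 * bw, values.max() + 3 * bw, 256)
z = (grid[:, None] - values[None, :]) / bw
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))

# "scaled": density rescaled to a maximum of 1
scaled = density / density.max()
step = grid[1] - grid[0]
print(f"area under density: {float((density * step).sum()):.3f}")
```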
X Axis density curve¶
After all the complexity of drawing a density curve on the y-axis, the x-axis is less complex. We can use the geom_density call, the only complication being to scale the density, and translate away from the cluster of points
The steps are:
- create an empty plot, specifying the source dataframe
- create a layer holding points
- generate a density curve, x values being x data values, y values being scaled density values (translated down)
- draw a horizontal line as a y=0 axis for the density curve
- set theme to black & white
y_min = abs(min(y))
scaling = 2
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x1", y="y1"),
alpha=0.2,
)
+ p9.geom_density(mapping=p9.aes(x="x1", y=p9.after_stat("scaled*scaling-y_min")))
+ p9.geom_hline(mapping=p9.aes(yintercept=-y_min))
+ p9.theme_bw()
)
plot
Multi-examples¶
Here we demonstrate on the same plot:
- rug plots on the x axis
- rug plots on the y axis
- y-axis density plot
- x-axis density plot
- 2D density contours
- 2D density ellipses
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x1", y="y1"),
alpha=0.2,
)
+ p9.geom_rug(
mapping=p9.aes(
x="x1",
# y="y1",
),
alpha=0.1,
color="blue",
)
+ p9.geom_rug(
mapping=p9.aes(
# x="x1",
y="y1",
),
alpha=0.1,
color="red",
)
+ p9.stat_density(
mapping=p9.aes(
x=p9.stage(start="y1", after_stat="scaled*scaling-x_min*2"),
y=p9.stage(start=0, after_stat="x"),
),
geom="path",
color="red",
)
+ p9.geom_vline(mapping=p9.aes(xintercept=-x_min * 2))
+ p9.geom_density(
mapping=p9.aes(x="x1", y=p9.after_stat("scaled*scaling-y_min")),
color="blue",
)
+ p9.geom_hline(mapping=p9.aes(yintercept=-y_min))
+ p9.stat_density_2d(
mapping=p9.aes(x="x1", y="y1"),
color="orange",
)
+ p9.stat_ellipse(
mapping=p9.aes(x="x1", y="y1"),
color="green",
)
+ p9.theme_bw()
)
plot
2D binning¶
In this example we generate random uniform samples (0-1) for the x- and y-axes, and put these into a dataframe. We use geom_bin_2d to produce a heatmap-style visualization of the density of points.
size = 100
x = np.random.rand(1000)
y = np.random.rand(1000)
df = pd.DataFrame({"x": x, "y": y})
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x", y="y"),
alpha=0.1,
)
+ p9.geom_bin_2d(
mapping=p9.aes(
x="x",
y="y",
),
binwidth=0.1,
alpha=0.4,
)
+ p9.theme_bw()
+ p9.theme(
figure_size=(8, 8),
)
)
plot
Now suppose we want to show the count in each 2D bin, as a text annotation
The stat_bin_2d creates the following pseudo-variables:
Options for computed aesthetics
- "xmin" # x lower bound for the bin
- "xmax" # x upper bound for the bin
- "ymin" # y lower bound for the bin
- "ymax" # y upper bound for the bin
- "count" # number of points in bin
- "density" # density of points in bin, scaled to integrate to 1
We perform the initial steps as before:
- create an empty plot, specifying our data source
- call geom_bin_2d to show a color-coded visualization of the density in each bin (100 in all - 10*10)
Then we call stat_bin_2d, specifying that we want geom_text called after the 2D binning process. Again, stat_bin_2d and geom_text both have the variables "x" and "y", with different meanings in each context.
In the stat_bin_2d context, "x" and "y" are the point coordinates (we count the number of points in each 2D bin).
In the geom_text context, "x" and "y" are the coordinates of where a given text label is to appear.
So:
x=p9.stage(
start="x",
after_stat="(xmin+xmax)/2",
),
says: for the stat processing, set the parameter "x" to the dataframe column "x": in the geom_text context, set the x parameter to the x-axis midpoint of each bin (as computed from pseudo-variables created by the stat process)
Similarly:
y=p9.stage(
start="y",
after_stat="(ymin+ymax)/2",
),
says the same as above, but for the y-axis.
The label mapping parameter in:
label=p9.after_stat("count"),
is not used by the stat process, but is passed on to the geom call. It says: use the post-stat process pseudo-variable "count" to be mapped to the label in the geom_text call.
A set of geom_text settings declared in the stat_bin_2d call gets passed on to geom_text:
size=8,
format_string="{:.0f}",
color="black",
path_effects=[pe.withStroke(linewidth=3, foreground="lightgray")],
The patheffects module lets us draw an outline behind the text, which makes it easier to read against dark backgrounds.
import matplotlib.patheffects as pe
plot = (
p9.ggplot(data=df)
+ p9.geom_bin_2d(
mapping=p9.aes(
x="x",
y="y",
),
binwidth=0.1,
alpha=0.4,
)
+ p9.stat_bin_2d(
mapping=p9.aes(
x=p9.stage(
start="x",
after_stat="(xmin+xmax)/2",
),
y=p9.stage(
start="y",
after_stat="(ymin+ymax)/2",
),
label=p9.after_stat("count"),
),
geom="text",
binwidth=0.1,
size=8,
format_string="{:.0f}",
color="black",
path_effects=[pe.withStroke(linewidth=3, foreground="lightgray")],
)
+ p9.theme_bw()
+ p9.theme(
figure_size=(8, 8),
)
)
plot
Use of helper functions¶
I was curious as to the distribution of the bin counts in the graphic above. They should have a mean of 10 (10*10 = 100 bins, 1,000 points), and look like a Binomial distribution with n = 1000 trials and p = 0.01 (the terms of the expansion of (0.01 + 0.99)^1000). The stat_bin_2d process creates a pseudo-variable "count", an array holding the number of points inside each bin. To plot the distribution, I need to get the unique values present in the array, and then count how often each value occurs. For example, if we have counts of (2,2,4,5,5,5), I want to generate the unique values array (2, 4, 5) and the value_counts (2, 1, 3): the value 2 appears twice, the value 4 appears only once, and the value 5 appears three times.
I create some helper functions. It appears that plotnine expects to deal with Pandas Series objects
def unique(l):
    """
    unique: get the unique values from a Pandas Series, returned as a pandas Series
    Get the input Pandas Series values, turn them into a list, then into a set, then into a list, then into a Series
    list -> set -> list removes duplicates
    """
    return pd.Series(list(set(list(l.values))))
# end unique
def my_count_items(l):
    """
    my_count_items: get count of how often each unique value appears in a Pandas Series, returned as a pandas Series
    """
    return pd.Series([list(l.values).count(i) for i in unique(l)])
# end my_count_items
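As a cross-check outside plotnine, the same tally can be sketched with NumPy: np.histogram2d produces the per-bin counts, and np.unique with return_counts=True gives the unique values together with their frequencies. The seeded data and the explicit bin edges are assumptions made for the illustration.

```python
import numpy as np

# The worked example from the text: counts of (2, 2, 4, 5, 5, 5)
demo_values, demo_freq = np.unique([2, 2, 4, 5, 5, 5], return_counts=True)
print(demo_values.tolist(), demo_freq.tolist())

# Assumption: seeded uniform points and a 10 x 10 grid on the unit square
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)
edges = np.linspace(0, 1, 11)
counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])

# How many bins hold 5 points, 6 points, and so on
values, frequency = np.unique(counts.astype(int), return_counts=True)

# Binomial(n=1000, p=0.01) expectations for a single bin
n, p = 1000, 0.01
expected_mean = n * p
expected_var = n * p * (1 - p)
print(counts.mean(), expected_mean, round(float(counts.var()), 2), expected_var)
```

The mean bin count is exactly 10 by construction (1,000 points in 100 bins), while the observed variance should be near the Binomial value of 9.9.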
The steps are:
- create an empty plot, specifying the source of data (df dataframe)
- invoke stat_bin_2d, specifying the use of geom_col for visualization. Both stat_bin_2d and geom_col use the x and y mapping variables. We use p9.stage( ... ) to specify that the starting statistical process uses the "x" and "y" columns from the dataframe. After the statistical processing, we want geom_col to use the results of my functions operating on the "count" pseudo-variable (x now holds the unique values, y the frequency of occurrence of each unique value)
- set the theme to black and white
- set the figure size
- set the explanatory labels on the plot
plot = (
p9.ggplot(data=df)
+ p9.stat_bin_2d(
mapping=p9.aes(
x=p9.stage(
start="x",
after_stat="unique(count)",
),
y=p9.stage(
start="y",
after_stat="my_count_items(count)",
),
fill=p9.after_stat("my_count_items(count)"),
),
geom="col",
binwidth=0.1,
size=8,
)
+ p9.theme_bw()
+ p9.theme(
figure_size=(8, 8),
)
+ p9.labs(
y="count of squares containing points counts on x axis",
x="point count in square",
fill="contained\npoint\ncount",
)
)
plot
C:\Users\donrc\anaconda3\envs\fun_minim\Lib\site-packages\plotnine\layer.py:356: PlotnineWarning: position_stack : Removed 80 rows containing missing values.
Regression lines¶
Finally, the workhorse of plotting - putting regression lines through data points
We create a toy dataset, and fit a linear model
x = np.random.normal(0, 1, size)
y = x + np.random.normal(0, 2, size)
df = pd.DataFrame({"x": x, "y": y})
plot = (
p9.ggplot(data=df)
+ p9.geom_point(
mapping=p9.aes(x="x", y="y"),
alpha=0.2,
)
+ p9.geom_smooth(mapping=p9.aes(x="x", y="y"), method="glm", color="green")
+ p9.theme_bw()
)
plot
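For a Gaussian response with an identity link, a GLM fit reduces to ordinary least squares, so the green line can be approximated with a plain NumPy fit. This sketch assumes that this is the family behind method="glm" here, and uses seeded data in place of the dataframe above.

```python
import numpy as np

# Assumption: seeded data built the same way as the toy dataset
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
y = x + rng.normal(0, 2, 1000)

# Ordinary least squares fit of a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residual_sd = float(np.std(y - fitted, ddof=2))

print(f"slope={slope:.2f} intercept={intercept:.2f} residual sd={residual_sd:.2f}")
```

The slope should come out close to 1 and the residual standard deviation close to 2, matching how the data were generated.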
%watermark
Last updated: 2026-02-26T15:32:41.270837+10:00
Python implementation: CPython
Python version : 3.11.14
IPython version : 9.10.0
Compiler : MSC v.1929 64 bit (AMD64)
OS : Windows
Release : 10
Machine : AMD64
Processor : Intel64 Family 6 Model 170 Stepping 4, GenuineIntel
CPU cores : 22
Architecture: 64bit
%watermark -h -iv -co
conda environment: fun_minim
Hostname: INSPIRON16
matplotlib: 3.10.8
numpy : 2.4.1
pandas : 2.3.3
plotnine : 0.15.0
scipy : 1.16.3
import contextlib
import ipynbname
with contextlib.suppress(FileNotFoundError):
    print(f"Notebook file name: {ipynbname.name()}")
# end with
Notebook file name: p9_stats