Stanford STATS191 in Python, Lecture 15 : Poisson Regression
Stanford Stats 191
Introduction
This is a re-creation of the Stanford Stats 191 course, using Python ecosystem tools instead of R. This lecture covers Poisson regression (see https://web.stanford.edu/class/stats191/notebooks/Poisson.html )
In this notebook, we look at modelling count data. The model is that of a Poisson process, in which events occur in a fixed interval of time or space at a constant mean rate, independently of the time since the last event. The probability of seeing $x$ events in a unit interval, given mean rate $\lambda$, is

$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
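As a quick sanity check on the PMF, we can evaluate it both directly and via scipy; the rate λ = 2.5 and count x = 3 below are arbitrary example values, not taken from any dataset in this notebook.

```python
import math

from scipy import stats

lam = 2.5  # arbitrary example rate (events per unit interval)
x = 3      # number of events

# Direct evaluation of the Poisson PMF formula
manual = lam**x * math.exp(-lam) / math.factorial(x)

# scipy's built-in Poisson PMF should agree
library = stats.poisson.pmf(x, lam)
```

Both evaluations give about 0.214 here.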
Initial Notebook Setup
watermark documents the Python and package environment; black is my chosen Python formatter
%load_ext watermark
%reload_ext lab_black
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sn
import math
import warnings
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import logit
from statsmodels.formula.api import probit
from statsmodels.formula.api import glm
import statsmodels.api as sm
import os
Simple Contingency Table
We wish to test if Males and Females answer a question differently
First, create a pandas dataframe from the survey data
data_dict = {'Y': [435, 375], 'N': [147, 134]}
df = pd.DataFrame(data_dict)
df.head()
Give names to the row and column indexes
df.index = pd.Index(['M', 'F'], name='Gender')
df.columns = pd.Index(['Y', 'N'], name='Response')
df
Get the data as a numpy array, and then process it with the statsmodels Table object
tab0 = np.asarray([df['Y'], df['N']])
tab0
tab1 = sm.stats.Table(tab0.T)
print(tab1)
tab1.table_orig
tab1.fittedvalues
Show the contributions to the Chi-squared statistic
tab1.chi2_contribs
Assess the independence between rows and columns (treating them both as nominal and as ordinal variables)
print(tab1.test_nominal_association())
print(tab1.test_ordinal_association())
Get the Chi-squared value; we see that it is quite likely that we would get the observed value by chance
tab1_chi2 = sum(sum(tab1.chi2_contribs))
1 - stats.chi2.cdf(tab1_chi2, 1)
tab1_chi2
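The same statistic can be assembled by hand from the observed and expected counts; note that scipy's survival function `chi2.sf(x, df)` is equivalent to `1 - chi2.cdf(x, df)` but more accurate in the far tail.

```python
import numpy as np
from scipy import stats

obs = np.array([[435, 147], [375, 134]])  # rows M/F, columns Y/N
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
exp_counts = row * col / obs.sum()

# Pearson Chi-squared statistic: sum of (O - E)^2 / E over all cells
chi2_stat = ((obs - exp_counts) ** 2 / exp_counts).sum()

# Survival function == 1 - cdf, with 1 degree of freedom for a 2x2 table
p_value = stats.chi2.sf(chi2_stat, df=1)
```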
I found the Chi-squared curve for degree-of-freedom = 1 to be quite counter-intuitive. It was only when I plotted the PDF and CDF that I realized a lot of the probability mass is concentrated near zero
x = list(np.linspace(0.01, 2, 1000))
y = [stats.chi2.pdf(x1, 1) for x1 in x]
y2 = [stats.chi2.cdf(x1, 1) for x1 in x]
_ = plt.yscale('log')
_ = plt.plot(x, y, 'r-', label='Chi2 PDF')
_ = plt.plot(x, y2, 'g:', label='Chi2 CDF')
_ = plt.axvline(tab1_chi2, color='green', label='Observed')
_ = plt.legend(loc='best')
statsmodels also has a specific object for 2-by-2 tables, as below
tab2 = sm.stats.Table2x2(tab0.T)
tab2.summary()
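The headline numbers in that summary can be reproduced by hand: below is the sample odds ratio with an approximate 95% confidence interval built on the log scale from the usual standard-error formula (my own sketch of the standard calculation, not lifted from statsmodels internals).

```python
import math

import numpy as np

obs = np.array([[435, 147], [375, 134]])  # rows M/F, columns Y/N

# Sample odds ratio: cross-product ratio of the 2x2 cells
odds_ratio = (obs[0, 0] * obs[1, 1]) / (obs[0, 1] * obs[1, 0])

# Approximate 95% CI via the log-odds-ratio standard error
se = math.sqrt((1.0 / obs).sum())
ci_lo = math.exp(math.log(odds_ratio) - 1.96 * se)
ci_hi = math.exp(math.log(odds_ratio) + 1.96 * se)
```

The interval comfortably straddles 1, consistent with the earlier non-significant Chi-squared result.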
scipy
scipy also has Chi-squared analysis methods that are a little more advanced than those in statsmodels, in that they offer a variety of correction methods when estimating the Chi-squared statistic
With Yates correction
chi2, p_value, dof, expected = stats.chi2_contingency(
tab0.T
)
print(f' chi2 = {chi2}, P value = {p_value}')
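Yates' correction shrinks each |O − E| by 0.5 before squaring; rebuilding it by hand matches scipy's default behaviour for 2-by-2 tables:

```python
import numpy as np
from scipy import stats

obs = np.array([[435, 147], [375, 134]])
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
exp_counts = row * col / obs.sum()

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring
chi2_yates = ((np.abs(obs - exp_counts) - 0.5) ** 2 / exp_counts).sum()

# correction=True is the default for 2x2 tables
chi2_sp, p, dof, _ = stats.chi2_contingency(obs)
assert abs(chi2_yates - chi2_sp) < 1e-8
```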
Without Yates correction
chi2, p_value, dof, expected = stats.chi2_contingency(
tab0.T, correction=False
)
print(f' chi2 = {chi2}, P value = {p_value}')
scipy also provides Fisher's exact test for 2-by-2 tables, which reaches the same conclusion
odds_ratio, p_value = stats.fisher_exact(tab0.T)
print(p_value)
Poisson Regression
We can also treat this as a case of regression, where we assume the rate of Y or N answers might depend upon gender
We build a dataframe with the data, one row per observation
data_dict2 = {
'Y': [435, 147, 375, 134],
'S': ['M', 'M', 'F', 'F'],
'B': ['Y', 'N', 'Y', 'N'],
}
df2 = pd.DataFrame(data_dict2)
df2
Now fit a Poisson model. We get the same Chi-squared value as before
res = glm(
'Y ~ S + B', data=df2, family=sm.families.Poisson()
).fit()
res.summary()
res.fittedvalues
Lumber Dataset
Finally, we look at a dataset, relating to customers travelling to a lumber store from different regions
Column | Definition
---|---
Customers | number of customers visiting the store from the region
Housing | number of housing units in the region
Income | average household income
Age | average housing unit age in the region
Competitor | distance to nearest competitor
Store | distance to the store, in miles
Read and Explore Dataset
lumber = pd.read_fwf(
'../data/lumber.txt', widths=[2, 6, 8, 4, 6, 6]
)
lumber.columns = [
'Customers',
'Housing',
'Income',
'Age',
'Competitor',
'Store',
]
lumber.head()
Perform a Poisson Regression, using the Generalized Linear Model formula interface
res2 = glm(
'Customers ~ Housing + Income + Age + Competitor + Store',
data=lumber,
family=sm.families.Poisson(),
).fit()
res2.summary()
Show that the AIC value matches the STATS191 notes
res2.aic
Plot the Residuals against Fitted Values (seems OK)
_ = plt.plot(
res2.fittedvalues,
lumber['Customers'] - res2.fittedvalues,
'+',
)
We can also use the object interface, but then have to use regularized fitting
y = np.array(lumber['Customers'])
x = np.array(
lumber[
['Housing', 'Income', 'Age', 'Competitor', 'Store']
]
)
X = sm.add_constant(x)
res3 = sm.Poisson(y, X).fit_regularized()
res3.summary()
res3.summary2()
The bare-bones call to the Poisson object fails; I don't really know why this interface fails
res4 = sm.Poisson(y, X).fit()
Conclusions
This concludes my sweep through the STATS191 lecture notes, and I must say that I have learned a lot. Obviously, I would have to learn much more to be even close to being a statistician, and I suspect that analysing hundreds of datasets would be needed to gain a deep understanding of the strengths and weaknesses of the various techniques
Reproducibility
%watermark -h -iv
%watermark
sm.show_versions()