Stanford STATS191 in Python, Lecture 1 : Introduction
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course (https://web.stanford.edu/class/stats191/), using Python ecosystem tools instead of R. This is the lecture "Course introduction and review" (see https://web.stanford.edu/class/stats191/notebooks/Review.html).
I found the STATS191 course pitched at just the right level for me (I can dimly remember my university stats courses). The only drawback was that all the examples were in R, not Python, so I decided to redo the course material using Python ecosystem tools.
Why Python?¶
I have tried to get into R a number of times, and bounced hard each time. Part of it was that I was learning R because "I should learn R", not because I had a specific project in mind. On the other hand, I learnt Python by implementing a specific application that took me a few months to complete.
Further, I found the syntax of R sufficiently different to be jarring (e.g. left and right assignment statements). Most modern computer languages have a syntax largely based on C (the Proto-Indo-European of programming), but not R, and I think that might make it harder for experienced programmers to pick up. Further, the tidyverse religious wars (and the introduction of things like the pipe operator) indicate to me that I probably can't pick up R by idle tinkering.
I think there is a deeper reason for this: R is for statisticians, by statisticians, while Python is for software engineers, by software engineers. Of course Python has surprises/gotchas (I'm looking at you, mutable function parameter defaults), but I found R to have many more.
There is an argument that eventually general-purpose computers/languages will win out over domain-specific computers/languages. Certainly I have used Python for a variety of tasks (generating Word documents holding thumbnails of images in a directory, playing legal chess, visualizing geospatial data, processing web-cam images, etc.) that I wouldn't have the first idea how to do in R. Part of the strength of Python is that it comes "batteries included"; there is an amazing amount of open-source software that supports almost any task. Part of this current project (redoing STATS191) is to see if there is any limit to Python's statistics support (spoiler alert: I haven't found any yet, but seeing as I am still on the beginner slopes, I wouldn't expect there to be any).
Initial Notebook Setup¶
watermark
documents the Python and package environment, black
is my preferred formatter, and we want graphs, etc. visible in the Notebook.
%load_ext watermark
%load_ext lab_black
%matplotlib inline
All imports go here:
pandas handles high-level datatable structures,
numpy handles numeric data and algorithms, and
seaborn handles visualization.
The statistics support in Python comes from two packages: the stats
sub-module of scipy
, and statsmodels
; we mostly use the latter.
import pandas as pd
import numpy as np
import seaborn as sn
import math
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import (
wls_prediction_std,
)
from statsmodels.stats.stattools import jarque_bera
Data Load¶
Load the initial data set (Mother-Daughter height data), and explore it in different ways.
DATA_PREFIX = 'd:/StanfordStats191/alr3data/'
data = pd.read_csv(DATA_PREFIX + 'heights.txt', sep=' ')
Pandas visualization¶
We print and graph the data in different ways.
data.head()
data.describe()
data['Mheight'].hist()
data['Dheight'].hist()
data.boxplot(column=['Mheight', 'Dheight'])
__ = pd.plotting.scatter_matrix(
data[['Mheight', 'Dheight']]
)
Matplotlib graphics¶
The pandas graphics can be improved upon, but with more code: first using matplotlib, then using seaborn.
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(data['Mheight'], data['Dheight'], 'o', label='data')
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
Seaborn graphics¶
We generate some seaborn graphics with minimal code.
fig, ax = plt.subplots(figsize=(10, 10))
sn.regplot(
'Mheight',
'Dheight',
data=data,
ax=ax,
label='Seaborn Regression\n95% Conf Intvl',
)
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
Note: Seaborn jointplot
graphics cannot be easily incorporated into Matplotlib figures.
# fig, ax = plt.subplots(figsize=(10, 10))
g = sn.jointplot(
'Mheight',
'Dheight',
kind='reg',
data=data,
# ax=ax,
label='Seaborn Regression\n95% Conf Intvl',
)
g = sn.jointplot(
'Mheight',
'Dheight',
kind='hex',
data=data,
# ax=ax,
label='Seaborn Regression\n95% Conf Intvl',
)
Statistics Models¶
So now, having done our data exploration and initial visualization, we might be of the view that there is a linear relationship between a mother's height and her daughter's height.
We test this hypothesis by using Ordinary Least Squares to fit a line to the data. statsmodels
has two modes: one where you build your input data matrices explicitly (and you use the UPPER CASE variants of the class names), and one where you can use text formulas à la R (and you use lower case variants of the class names).
data = pd.read_csv(DATA_PREFIX + 'heights.txt', sep=' ')
daughter = data['Dheight']
mother = data['Mheight']
mother = sm.add_constant(mother)
res = sm.OLS(daughter, mother).fit()
We can see from the RegressionResults summary method below that the linear model doesn't explain much of the spread in the daughters' heights (R^2 = 0.241), but the likelihood of seeing these results if daughters' heights had no relation to mothers' heights is vanishingly small. Thus we reject the hypothesis that daughters' heights have no relation to mothers' heights.
res.summary()
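For a single predictor, the OLS coefficients also have a closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept follows from the means. A minimal sketch on a small made-up dataset (the numbers below are hypothetical, not the heights data):

```python
import numpy as np

# Hypothetical small dataset (not the heights data)
x = np.array([58.0, 60.0, 62.0, 64.0, 66.0, 68.0])
y = np.array([60.0, 61.0, 63.0, 62.0, 66.0, 67.0])

# Closed-form simple OLS: slope = cov(x, y) / var(x), intercept from the means
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
```

np.polyfit(x, y, 1) produces the same line; the statsmodels OLS fit reduces to the same algebra in the one-predictor case.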
We display the line of best (OLS) fit. Note we set alpha
to less than the default, so that the density of data points is more easily assessed.
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(
data['Mheight'],
data['Dheight'],
'o',
label='data',
alpha=0.5,
)
ax.plot(
data['Mheight'],
res.fittedvalues,
'r-',
label='Fitted Values',
)
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
If we want to add some context to the graph, we can add Confidence Intervals. First, we can be very confident about the line of best fit (due to the large number of data points averaging out the noise). However, any prediction we make using this line will have the inherent error of our model, so the confidence interval for predictions is much wider.
Note that the wls_prediction_std
method we call comes from a statsmodels
sandbox:
This sandbox contains code that is for various reasons not ready to be included in statsmodels proper. It contains modules from the old stats.models code that have not been tested, verified and updated to the new statsmodels structure: cox survival model, mixed effects model with repeated measures, generalized additive model and the formula framework. The sandbox also contains code that is currently being worked on until it fits the pattern of statsmodels or is sufficiently tested.
We can only hope that this call won't disappear in the future!
x_ci = np.linspace(
data['Mheight'].min(), data['Mheight'].max(), 20
)
x_ci_enh = sm.add_constant(x_ci)
gp = res.get_prediction(x_ci_enh)
ci = gp.conf_int(alpha=0.05)
upper = [z[1] for z in ci]
lower = [z[0] for z in ci]
prstd, iv_l, iv_u = wls_prediction_std(res)
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(
data['Mheight'],
data['Dheight'],
'o',
label='data',
alpha=0.5,
)
ax.plot(
data['Mheight'],
res.fittedvalues,
'g-',
label='Fitted Values',
)
ax.plot(
x_ci, upper, 'r-', label='95% Conf Int (u)', alpha=0.5
)
ax.plot(
x_ci, lower, 'r-', label='95% Conf Int (l)', alpha=0.5
)
ax.plot(
data['Mheight'], iv_l, 'y-', label='95% Pred Int (l)'
)
ax.plot(
data['Mheight'], iv_u, 'y-', label='95% Pred Int (u)'
)
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
Use Symbolic Model Specification¶
We can also use R-style symbolic model specification (that assumes a constant term in the linear model, by default). We get the same result, comparing the two summary() method calls.
res2 = ols('Dheight ~ Mheight', data=data).fit()
res2.summary()
res.summary()
Just to make sure we understand the outputs, we compute the R^2 values manually.
mean_y = data['Dheight'].mean()
sstot = sum([(y - mean_y) ** 2 for y in data['Dheight']])
print(f'Total sum of squares around the mean {sstot}')
ssreg = sum([(y1 - mean_y) ** 2 for y1 in res.fittedvalues])
ssres = sum(
[
(y1 - y2) ** 2
for y1, y2 in zip(data['Dheight'], res.fittedvalues)
]
)
print(
f'Total sum of square spread around mean {sstot},\n'
+ f'sum of square errors, predictions around mean {ssreg}, \n'
+ f'sum of square errors, predictions against actual {ssres}'
)
print(ssreg + ssres)
r2 = 1 - ssres / sstot
r2
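The three sums of squares are tied together by the identity sstot = ssreg + ssres (for a model with an intercept), which is what makes the two expressions for R^2 (ssreg/sstot and 1 - ssres/sstot) agree. A quick check on made-up numbers, fitted with numpy rather than statsmodels:

```python
import numpy as np

# Hypothetical data; fit a line with numpy's least squares
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x

sstot = ((y - y.mean()) ** 2).sum()       # spread of y around its mean
ssreg = ((fitted - y.mean()) ** 2).sum()  # spread explained by the fit
ssres = ((y - fitted) ** 2).sum()         # leftover (residual) spread
r2 = 1 - ssres / sstot
```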
print(
f'res.ssr = {res.ssr}; res.mse_model = {res.mse_model}'
)
print(
f'res.mse_total = {res.mse_total}, se_total = {res.mse_total*(1375-1)}'
)
print(res.centered_tss)
print(
f'ssres {ssres} = res.mse_resid {res.mse_resid} * 1373 = {res.mse_resid*1373}'
)
Multivariate Regression¶
Numerical descriptive statistics¶
We now look at a dataset that links income to RightToWorkLaws (or more accurately, RightToFireWithoutPeskyUnions) in a number of USA cities. In STATS191, it is used as an example to illustrate numerical descriptive statistics. The pandas describe()
method gives us these.
DATA_PREFIX = 'd:/StanfordStats191/'
rtw = pd.read_csv(DATA_PREFIX + 'P005.txt', sep='\t')
rtw.head()
The variables are:
Income: income for a four-person family
COL: cost of living for a four-person family
PD: Population density
URate: rate of unionization in 1978
Pop: Population
Taxes: Property taxes in 1972
RTWL: right-to-work indicator
Note the method used to change the color of the boxplot boxes. We display a few more pandas graphs for this dataset.
fig, ax = plt.subplots(figsize=(10, 10))
props = dict(boxes="Pink")
rtw.boxplot(
column=['COL'],
by='RTWL',
ax=ax,
color=props,
patch_artist=True,
)
ax.set_ylabel("Cost Of Living")
__ = pd.plotting.scatter_matrix(
rtw[
[
'City',
'COL',
'PD',
'URate',
'Pop',
'Taxes',
'Income',
'RTWL',
]
]
)
rtw.describe()
The median()
method gives us the median of each numeric column. We really should have declared RTWL a categorical column, but this is what you get by default.
rtw.median()
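A sketch of the categorical-column point above, using a small hypothetical frame (not the RTW data): once an indicator is declared categorical, pandas excludes it from numeric summaries.

```python
import pandas as pd

# Hypothetical mini-frame with a 0/1 indicator column
df = pd.DataFrame({'Income': [100, 120, 90], 'RTWL': [0, 1, 0]})
df['RTWL'] = df['RTWL'].astype('category')

# Categorical columns drop out of numeric aggregations
medians = df.median(numeric_only=True)
```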
To enhance the default pandas scatterplots, we specify we want RTWL status to be highlighted. It is annoying that we don't get a legend by default.
fig, ax = plt.subplots(figsize=(10, 10))
__ = pd.plotting.scatter_matrix(
rtw[['COL', 'PD', 'URate', 'Pop', 'Taxes', 'Income']],
c=rtw['RTWL'],
cmap='rainbow',
ax=ax,
alpha=0.95,
)
At least Seaborn knows about legends!
_ = sn.pairplot(
rtw,
vars=['COL', 'PD', 'URate', 'Pop', 'Taxes', 'Income'],
kind='scatter',
hue='RTWL',
diag_kind='hist',
diag_kws={'alpha': 0.5},
)
One of the cities is clearly an outlier: it is New York.
rtw.iloc[26]
Numerical descriptive statistics analysis¶
We read data on a treatment / placebo experiment, and try to determine whether the observed data supports the hypothesis that the treatment is better than the placebo.
Read data¶
DATA_PREFIX = 'd:/StanfordStats191/'
cal = pd.read_csv(DATA_PREFIX + 'calcium.txt', sep='\t')
Initial Analysis¶
cal.head()
cal.describe()
Start to look at the difference between Treatment and Placebo. There seems to be a difference, but could it arise by chance?
cal[cal['Treatment'] == 'Calcium']['Decrease'].mean()
cal[cal['Treatment'] == 'Placebo']['Decrease'].mean()
Visualize difference¶
First by overlaid histograms, and then by boxplot.
fig, ax = plt.subplots(figsize=(10, 10))
cal[cal['Treatment'] == 'Calcium']['Decrease'].hist(
ax=ax, label='Calcium'
)
cal[cal['Treatment'] == 'Placebo']['Decrease'].hist(
ax=ax, label='Placebo', alpha=0.5
)
ax.legend(loc='best')
ax.set_ylabel('Count')
ax.set_xlabel('Decrease in blood pressure')
fig, ax = plt.subplots(figsize=(6, 6))
cal.boxplot(column=['Decrease'], by='Treatment', ax=ax)
Compute means and medians for the test population as a whole, and then for each treatment group.
mean_decrease = cal['Decrease'].mean()
mean_treated = cal[cal['Treatment'] == 'Calcium'][
'Decrease'
].mean()
mean_placebo = cal[cal['Treatment'] == 'Placebo'][
'Decrease'
].mean()
print(mean_decrease, mean_placebo, mean_treated)
median_decrease = cal['Decrease'].median()
median_decrease
median_treated = cal[cal['Treatment'] == 'Calcium'][
'Decrease'
].median()
median_treated
Show the spread of the Treatment response by quantiles.
sorted(cal[cal['Treatment'] == 'Calcium']['Decrease'])
cal[cal['Treatment'] == 'Calcium']['Decrease'].quantile(
q=[0.25, 0.75]
)
Compute the standard deviation of the Treatment response two ways (first manually, second by pandas).
treated = cal[cal['Treatment'] == 'Calcium']['Decrease']
samp_var = sum(
    [(x - mean_treated) ** 2 for x in treated]
) / (len(treated) - 1)
math.sqrt(samp_var)
cal[cal['Treatment'] == 'Calcium']['Decrease'].std()
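The ddof=1 divisor is the only wrinkle here: dividing the sum of squares by N-1 rather than N gives the unbiased sample variance. A self-contained check on made-up numbers:

```python
import math
import numpy as np

# Hypothetical sample
xs = [7.0, 3.0, 5.0, 9.0, 6.0]
mean = sum(xs) / len(xs)

# Unbiased sample variance divides by (N - 1), not N
samp_var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
samp_std = math.sqrt(samp_var)
```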
In order to help visualize the difference, we take 200 samples of size 5 from the overall, treated, and placebo populations, and chart the results as histograms. Once again, treatment and placebo seem different.
sampling = [
sum(
np.random.choice(
cal['Decrease'], size=5, replace=True
)
)
/ 5
for i in range(200)
]
sampling_treated = [
sum(
np.random.choice(
cal[cal['Treatment'] == 'Calcium']['Decrease'],
size=5,
replace=True,
)
)
/ 5
for i in range(200)
]
sampling_placebo = [
sum(
np.random.choice(
cal[cal['Treatment'] == 'Placebo']['Decrease'],
size=5,
replace=True,
)
)
/ 5
for i in range(200)
]
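The reason averages of size-5 samples are worth plotting: by the Central Limit Theorem, the sample means cluster around the population mean with a spread of roughly sigma/sqrt(n). A sketch with a synthetic population (seeded for reproducibility; the location and scale are made up, not the calcium data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 'population' standing in for the Decrease column
population = rng.normal(loc=5.0, scale=8.0, size=10_000)

# Means of many size-5 resamples
means = np.array(
    [
        rng.choice(population, size=5, replace=True).mean()
        for _ in range(2000)
    ]
)

se_empirical = means.std()
se_theory = population.std() / np.sqrt(5)
```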
fig, ax = plt.subplots(figsize=(6, 6))
ax.hist(sampling, label='Combined', color='blue', alpha=0.5)
ax.hist(
sampling_placebo,
alpha=0.5,
label='Placebo Only',
color='green',
)
ax.hist(
sampling_treated,
alpha=0.5,
label='Calcium Only',
color='red',
)
ax.axvline(mean_decrease, c='blue', label='Total')
ax.axvline(mean_treated, c='red', label='Treated')
ax.axvline(mean_placebo, c='green', label='Placebo')
ax.legend(loc='best')
We now work through a one-sample T test on a small sample of 20 values, testing the hypothesis that the sample comes from a population with mean 0.
sample = [5, -1, 0, 2, 6, -1, 5, 6, 3, 3, -3, -1, -2, 3, 0, 4, 1, 4, 6, 0]
print(np.mean(sample), np.std(sample, ddof=1))
Manually compute the value of the T statistic. Note we use the sample Sum of Squares / (N-1) (see numpy.std
) to give an unbiased estimator of the variance of the infinite population.
t_value = (np.mean(sample) - 0) / (
np.std(sample, ddof=1) / math.sqrt(20)
)
t_value
len(sample)
Now we use the scipy.stats
package to perform a T Test on the data set as a whole. We see that, given the Null Hypothesis that the mean is 0, the results we actually see are very unlikely (about 5 in 1,000).
stats.ttest_1samp(sample, 0)
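The scipy result can be reproduced by hand: the statistic is the standardized distance of the sample mean from 0, and the two-sided p-value is twice the upper-tail probability of the T distribution with N-1 degrees of freedom. A check using the same 20-value sample:

```python
import math
import numpy as np
from scipy import stats

sample = [5, -1, 0, 2, 6, -1, 5, 6, 3, 3,
          -3, -1, -2, 3, 0, 4, 1, 4, 6, 0]
n = len(sample)

# T statistic: distance of the sample mean from 0, in standard-error units
t_manual = (np.mean(sample) - 0) / (np.std(sample, ddof=1) / math.sqrt(n))
# Two-sided p-value from the T survival function
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)
```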
Visualizing the Difference from H0¶
fig, ax = plt.subplots(figsize=(6, 6))
ax.hist(sample, label='Data')
ax.axvline(0, c='blue', label='H0: Mean=0')
ax.axvline(np.mean(sample), c='red', label='Actual Mean')
ax.legend(loc='best')
Get the 2% rejection range.
df = 19
crit1, crit2 = stats.t.ppf([1 - 0.01, 0.01], df)
print(crit1, crit2)
x_rej1 = np.linspace(stats.t.ppf(0.001, df), crit2, 20)
x_rej2 = np.linspace(crit1, stats.t.ppf(0.999, df), 20)
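The T distribution is symmetric about zero, so the two cutoffs above are mirror images: ppf(1 - a) == -ppf(a). A quick check:

```python
from scipy import stats

df = 19
upper = stats.t.ppf(0.99, df)  # upper 1% cutoff
lower = stats.t.ppf(0.01, df)  # lower 1% cutoff
```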
We plot the T distribution appropriate for our sample size, the acceptance / rejection regions, and our observed value.
fig, ax = plt.subplots(figsize=(6, 6))
x = np.linspace(
stats.t.ppf(0.001, df), stats.t.ppf(0.999, df), 50
)
ax.plot(x, stats.t.pdf(x, df), 'r-', label=f'T (df={df})')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(t_value, c='black', label='T value')
ax.axvline(crit1, c='yellow', label='99% range')
ax.axvline(crit2, c='yellow', label='99% range')
ax.fill_between(
x_rej1,
0,
stats.t.pdf(x_rej1, df),
facecolor='red',
label='Rejection Region (99%)',
alpha=0.5,
)
ax.fill_between(
x_rej2,
0,
stats.t.pdf(x_rej2, df),
facecolor='red',
alpha=0.5,
)
ax.grid()
ax.legend(loc='right')
stats.t.ppf([0.975, 0.025], df)
Analysis of Calcium Data¶
Back to our treatment / placebo data. After calculating the T statistic manually to make sure we understand the mechanics, we use the stats.ttest_ind()
method to compare two samples. In this case, we see that under the Null Hypothesis (treatment = placebo), we would see our T value about 1 in 10 times. Thus we can't reject H0 at the 5% level. We plot the results to show the value we see falls in the Accept H0 region.
treatment = cal[cal['Treatment'] == 'Calcium']['Decrease']
placebo = cal[cal['Treatment'] == 'Placebo']['Decrease']
l_treatment = len(treatment)
l_placebo = len(placebo)
print(
f'Number of treatments: {l_treatment}, number of placebos: {l_placebo}'
)
t_mean = treatment.mean()
p_mean = placebo.mean()
t_std = np.std(treatment, ddof=1)
p_std = np.std(placebo, ddof=1)
ss2 = (l_treatment - 1) * t_std ** 2 + (
l_placebo - 1
) * p_std ** 2
mean_ss2 = ss2 / (l_treatment + l_placebo - 2)
t_value = (t_mean - p_mean) / (
math.sqrt(mean_ss2)
* math.sqrt(1 / l_treatment + 1 / l_placebo)
)
print(t_mean, p_mean, t_value)
stats.ttest_ind(treatment, placebo, equal_var=True)
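The pooled two-sample calculation above can be checked against scipy on any pair of samples; a sketch with two small hypothetical groups (not the calcium data):

```python
import math
import numpy as np
from scipy import stats

# Hypothetical groups
a = np.array([7.0, 5.0, 6.0, 9.0, 4.0])
b = np.array([3.0, 4.0, 2.0, 5.0])
na, nb = len(a), len(b)

# Pooled variance: weighted sums of squares over combined degrees of freedom
ss2 = (na - 1) * a.std(ddof=1) ** 2 + (nb - 1) * b.std(ddof=1) ** 2
pooled_sd = math.sqrt(ss2 / (na + nb - 2))
t_manual = (a.mean() - b.mean()) / (
    pooled_sd * math.sqrt(1 / na + 1 / nb)
)
```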
Get the 5% accept / reject range.
df = 19
crit1, crit2 = stats.t.ppf([1 - 0.025, 0.025], df)
print(crit1, crit2)
Graph the results.
fig, ax = plt.subplots(figsize=(6, 6))
df = 19
x = np.linspace(
stats.t.ppf(0.001, df), stats.t.ppf(0.999, df), 50
)
ax.plot(x, stats.t.pdf(x, df), 'r-', label=f'T (df={df})')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(t_value, c='black', label='T value')
ax.axvline(crit1, c='green', label='95% upper cutoff value')
ax.legend(loc='upper left')
stats.ttest_ind(treatment, placebo, equal_var=False)
Regression Model¶
We can try to fit (by OLS) a linear model to our data: again we find that the data we see would be seen about 1 time in 10, assuming H0. I presume that a researcher who believed in the treatment would try again with a larger sample size.
res = ols(formula='Decrease ~ Treatment', data=cal).fit()
print(res.summary())
Display the residual errors for our model. This illustrates how much noise there is in the data.
x_index = np.linspace(
1, len(res.fittedvalues) + 1, len(res.fittedvalues)
)
fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(x_index, cal['Decrease'], 'ro', label='Data')
ax.plot(
x_index, res.fittedvalues, 'g+', label='Fitted values'
)
# we plot many residual lines, but only label one for the legend
# do all residual error lines
for x, y1, y2 in zip(
x_index, cal['Decrease'], res.fittedvalues
):
ax.plot([x, x], [y1, y2], 'y-')
# end for
# label the first error line, then bail
for x, y1, y2 in zip(
x_index, cal['Decrease'], res.fittedvalues
):
ax.plot(
[x, x], [y1, y2], 'y-', label='Prediction Error'
)
break
# end for
ax.grid()
ax.set_xlabel('Sample Number')
ax.set_ylabel('BP Decrease')
ax.legend(loc='best')
Reproducibility¶
%watermark -h -iv
%watermark