Stanford STATS191 in Python, Lecture 1 : Introduction
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course (https://web.stanford.edu/class/stats191/), using Python ecosystem tools instead of R. This is the lecture "Course introduction and review" (see https://web.stanford.edu/class/stats191/notebooks/Review.html).
I found the STATS191 course pitched at just the right level for me (I can dimly remember my university stats courses). The only drawback was that all the examples were in R, not Python, so I decided to redo the course material using Python ecosystem tools.
Why Python?¶
I have tried to get into R a number of times, and bounced hard each time. Part of it was that I was learning R because "I should learn R", not because I had a specific project in mind. On the other hand, I learnt Python by implementing a specific application that took me a few months to complete.
Further, I found the syntax of R sufficiently different to be jarring (e.g. left and right assignment statements). Most modern computer languages have a syntax largely based on C (the Proto-Indo-European of programming), but not R, and I think that might make it harder for experienced programmers to pick up. Further, the tidyverse religious wars (and the introduction of things like the pipe operator) indicate to me that I probably can't pick up R by idle tinkering.
I think there is a deeper reason for this: R is for statisticians, by statisticians, while Python is for software engineers, by software engineers. Of course Python has surprises/gotchas (I'm looking at you, mutable function parameter defaults), but I found R to have many more.
There is an argument that eventually general-purpose computers/languages will win out over domain-specific computers/languages. Certainly I have used Python for a variety of tasks (generating Word documents holding thumbnails of images in a directory, playing legal chess, visualizing geospatial data, processing web-cam images, etc.) that I wouldn't have the first idea how to do in R. Part of the strength of Python is that it comes "batteries included"; there is an amazing amount of open-source software that supports almost any task. Part of this current project (redoing STATS191) is to see if there is any limit to Python's statistics support (spoiler alert: I haven't found any yet, but seeing as I am still on the beginner slopes, I wouldn't expect there to be any).
Initial Notebook Setup¶
watermark
documents the Python and package environment, black
is my preferred formatter, and we want graphs, etc. visible in the Notebook.
%load_ext watermark
%load_ext lab_black
%matplotlib inline
All imports go here:
pandas handles high-level datatable structures,
numpy handles numeric data and algorithms, and
seaborn handles visualization.
The statistics support in Python comes from two packages: the stats
sub-module of scipy
, and statsmodels
; we mostly use the latter.
import pandas as pd
import numpy as np
import seaborn as sn
import math
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import (
wls_prediction_std,
)
from statsmodels.stats.stattools import jarque_bera
Data Load¶
Load the initial data set (Mother-Daughter height data), and explore it in different ways.
DATA_PREFIX = 'd:/StanfordStats191/alr3data/'
data = pd.read_csv(DATA_PREFIX + 'heights.txt', sep=' ')
Pandas visualization¶
We print and graph the data in different ways.
data.head()
data.describe()
data['Mheight'].hist()
data['Dheight'].hist()
data.boxplot(column=['Mheight', 'Dheight'])
__ = pd.plotting.scatter_matrix(
data[['Mheight', 'Dheight']]
)
Matplotlib graphics¶
The pandas graphics can be improved upon, but with more code: first using matplotlib, then using seaborn.
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(data['Mheight'], data['Dheight'], 'o', label='data')
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
Seaborn graphics¶
We generate some seaborn graphics with minimal code.
fig, ax = plt.subplots(figsize=(10, 10))
sn.regplot(
'Mheight',
'Dheight',
data=data,
ax=ax,
label='Seaborn Regression\n95% Conf Intvl',
)
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
Note: Seaborn jointplot
graphics cannot be easily incorporated into Matplotlib figures.
# fig, ax = plt.subplots(figsize=(10, 10))
g = sn.jointplot(
'Mheight',
'Dheight',
kind='reg',
data=data,
# ax=ax,
label='Seaborn Regression\n95% Conf Intvl',
)
g = sn.jointplot(
'Mheight',
'Dheight',
kind='hex',
data=data,
# ax=ax,
label='Seaborn Regression\n95% Conf Intvl',
)
Statistics Models¶
So now, having done our data exploration and initial visualization, we might be of the view that there is a linear relationship between a mother's height and her daughter's height.
We test this hypothesis by using Ordinary Least Squares to fit a line to the data. statsmodels
has two modes: one where you build your input data matrices explicitly (and you use the UPPER CASE variants of the class names), and one where you can use text formulas à la R (and you use lower case variants of the class names).
data = pd.read_csv(DATA_PREFIX + 'heights.txt', sep=' ')
daughter = data['Dheight']
mother = data['Mheight']
mother = sm.add_constant(mother)
res = sm.OLS(daughter, mother).fit()
We can see from the RegressionResults summary method below that the linear model doesn't explain much of the spread in the daughters' heights (R^2 = 0.241), but the likelihood of seeing these results if daughters' heights had no relation to mothers' heights is vanishingly small. Thus we reject the hypothesis that daughters' heights have no relation to mothers' heights.
res.summary()
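For a single predictor, the OLS coefficients also have a closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept follows from the means. A minimal sketch on a small made-up dataset (the numbers below are hypothetical, not the heights data):

```python
import numpy as np

# Hypothetical small dataset (not the heights data)
x = np.array([58.0, 60.0, 62.0, 64.0, 66.0, 68.0])
y = np.array([60.0, 61.0, 63.0, 62.0, 66.0, 67.0])

# Closed-form simple OLS: slope = cov(x, y) / var(x), intercept from the means
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
```

np.polyfit(x, y, 1) produces the same line; the statsmodels OLS fit reduces to the same algebra in the one-predictor case.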
We display the line of best (OLS) fit. Note we set alpha
to less than the default, so that the density of data points is more easily assessed.
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(
data['Mheight'],
data['Dheight'],
'o',
label='data',
alpha=0.5,
)
ax.plot(
data['Mheight'],
res.fittedvalues,
'r-',
label='Fitted Values',
)
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
If we want to add some context to the graph, we can add Confidence Intervals. First, we can be very confident about the line of best fit (due to the large number of data points averaging out the noise). However, any prediction we make using this line will have the inherent error of our model, so the confidence interval for predictions is much wider.
Note that the wls_prediction_std
method we call comes from a statsmodels
sandbox:
This sandbox contains code that is for various reasons not ready to be included in statsmodels proper. It contains modules from the old stats.models code that have not been tested, verified and updated to the new statsmodels structure: cox survival model, mixed effects model with repeated measures, generalized additive model and the formula framework. The sandbox also contains code that is currently being worked on until it fits the pattern of statsmodels or is sufficiently tested.
We can only hope that this call won't disappear in the future!
x_ci = np.linspace(
data['Mheight'].min(), data['Mheight'].max(), 20
)
x_ci_enh = sm.add_constant(x_ci)
gp = res.get_prediction(x_ci_enh)
ci = gp.conf_int(alpha=0.05)
upper = [z[1] for z in ci]
lower = [z[0] for z in ci]
prstd, iv_l, iv_u = wls_prediction_std(res)
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(
data['Mheight'],
data['Dheight'],
'o',
label='data',
alpha=0.5,
)
ax.plot(
data['Mheight'],
res.fittedvalues,
'g-',
label='Fitted Values',
)
ax.plot(
x_ci, upper, 'r-', label='95% Conf Int (u)', alpha=0.5
)
ax.plot(
x_ci, lower, 'r-', label='95% Conf Int (l)', alpha=0.5
)
ax.plot(
data['Mheight'], iv_l, 'y-', label='95% Pred Int (l)'
)
ax.plot(
data['Mheight'], iv_u, 'y-', label='95% Pred Int (u)'
)
ax.grid()
ax.set_xlabel("Mother's Height")
ax.set_ylabel("Daughter's Height")
ax.legend(loc='best')
Use Symbolic Model Specification¶
We can also use R-style symbolic model specification (that assumes a constant term in the linear model, by default). We get the same result, comparing the two summary() method calls.
res2 = ols('Dheight ~ Mheight', data=data).fit()
res2.summary()
res.summary()
Just to make sure we understand the outputs, we compute the R^2 values manually.
mean_y = data['Dheight'].mean()
sstot = sum([(y - mean_y) ** 2 for y in data['Dheight']])
print(f'Total sum of squares around the mean {sstot}')
ssreg = sum([(y1 - mean_y) ** 2 for y1 in res.fittedvalues])
ssres = sum(
[
(y1 - y2) ** 2
for y1, y2 in zip(data['Dheight'], res.fittedvalues)
]
)
print(
f'Total sum of square spread around mean {sstot},\n'
+ f'sum of square errors, predictions around mean {ssreg}, \n'
+ f'sum of square errors, predictions against actual {ssres}'
)
print(ssreg + ssres)
r2 = 1 - ssres / sstot
r2
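The three sums of squares are tied together by the identity sstot = ssreg + ssres (for a model with an intercept), which is what makes the two expressions for R^2 (ssreg/sstot and 1 - ssres/sstot) agree. A quick check on made-up numbers, fitted with numpy rather than statsmodels:

```python
import numpy as np

# Hypothetical data; fit a line with numpy's least squares
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x

sstot = ((y - y.mean()) ** 2).sum()       # spread of y around its mean
ssreg = ((fitted - y.mean()) ** 2).sum()  # spread explained by the fit
ssres = ((y - fitted) ** 2).sum()         # leftover (residual) spread
r2 = 1 - ssres / sstot
```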
print(
f'res.ssr = {res.ssr}; res.mse_model = {res.mse_model}'
)
print(
f'res.mse_total = {res.mse_total}, se_total = {res.mse_total*(1375-1)}'
)
print(res.centered_tss)
print(
f'ssres {ssres} = res.mse_resid {res.mse_resid} * 1373 = {res.mse_resid*1373}'
)
Multivariate Regression¶
Numerical descriptive statistics¶
We now look at a dataset that links income to RightToWorkLaws (or more accurately, RightToFireWithoutPeskyUnions) in a number of USA cities. In STATS191, it is used as an example to illustrate numerical descriptive statistics. The pandas describe()
method gives us these.
DATA_PREFIX = 'd:/StanfordStats191/'
rtw = pd.read_csv(DATA_PREFIX + 'P005.txt', sep='\t')
rtw.head()
The variables are:
Income: income for a four-person family
COL: cost of living for a four-person family
PD: Population density
URate: rate of unionization in 1978
Pop: Population
Taxes: Property taxes in 1972
RTWL: right-to-work indicator
Note the method used to change the color of the boxplot boxes. We display a few more pandas graphs for this dataset.
fig, ax = plt.subplots(figsize=(10, 10))
props = dict(boxes="Pink")
rtw.boxplot(
column=['COL'],
by='RTWL',
ax=ax,
color=props,
patch_artist=True,
)
ax.set_ylabel("Cost Of Living")
__ = pd.plotting.scatter_matrix(
rtw[
[
'City',
'COL',
'PD',
'URate',
'Pop',
'Taxes',
'Income',
'RTWL',
]
]
)
rtw.describe()
The median()
method gives us the median of each numeric column. We really should have declared RTWL a categorical column, but this is what you get by default.
rtw.median()
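A sketch of the categorical-column point above, using a small hypothetical frame (not the RTW data): once an indicator is declared categorical, pandas excludes it from numeric summaries.

```python
import pandas as pd

# Hypothetical mini-frame with a 0/1 indicator column
df = pd.DataFrame({'Income': [100, 120, 90], 'RTWL': [0, 1, 0]})
df['RTWL'] = df['RTWL'].astype('category')

# Categorical columns drop out of numeric aggregations
medians = df.median(numeric_only=True)
```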
To enhance the default pandas scatterplots, we specify we want RTWL status to be highlighted. It is annoying that we don't get a legend by default.
fig, ax = plt.subplots(figsize=(10, 10))
__ = pd.plotting.scatter_matrix(
rtw[['COL', 'PD', 'URate', 'Pop', 'Taxes', 'Income']],
c=rtw['RTWL'],
cmap='rainbow',
ax=ax,
alpha=0.95,
)
At least Seaborn knows about legends!
_ = sn.pairplot(
rtw,
vars=['COL', 'PD', 'URate', 'Pop', 'Taxes', 'Income'],
kind='scatter',
hue='RTWL',
diag_kind='hist',
diag_kws={'alpha': 0.5},
)
One of the cities is clearly an outlier: it is New York.
rtw.iloc[26]
Numerical descriptive statistics analysis¶
We read data on a treatment / placebo experiment, and try to determine whether the observed data supports the hypothesis that the treatment is better than the placebo.
Read data¶
DATA_PREFIX = 'd:/StanfordStats191/'
cal = pd.read_csv(DATA_PREFIX + 'calcium.txt', sep='\t')
Initial Analysis¶
cal.head()
cal.describe()
Start to look at the difference between Treatment and Placebo. There seems to be a difference, but could it arise by chance?
cal[cal['Treatment'] == 'Calcium']['Decrease'].mean()
cal[cal['Treatment'] == 'Placebo']['Decrease'].mean()
Visualize difference¶
First by overlaid histograms, and then by boxplot.
fig, ax = plt.subplots(figsize=(10, 10))
cal[cal['Treatment'] == 'Calcium']['Decrease'].hist(
ax=ax, label='Calcium'
)
cal[cal['Treatment'] == 'Placebo']['Decrease'].hist(
ax=ax, label='Placebo', alpha=0.5
)
ax.legend(loc='best')
ax.set_ylabel('Count')
ax.set_xlabel('Decrease in blood pressure')
fig, ax = plt.subplots(figsize=(6, 6))
cal.boxplot(column=['Decrease'], by='Treatment', ax=ax)
Compute means and medians for the test population as a whole, and then for each treatment group.
mean_decrease = cal['Decrease'].mean()
mean_treated = cal[cal['Treatment'] == 'Calcium'][
'Decrease'
].mean()
mean_placebo = cal[cal['Treatment'] == 'Placebo'][
'Decrease'
].mean()
print(mean_decrease, mean_placebo, mean_treated)
median_decrease = cal['Decrease'].median()
median_decrease
median_treated = cal[cal['Treatment'] == 'Calcium'][
'Decrease'
].median()
median_treated
Show the spread of the Treatment response by quantiles.
sorted(cal[cal['Treatment'] == 'Calcium']['Decrease'])
cal[cal['Treatment'] == 'Calcium']['Decrease'].quantile(
q=[0.25, 0.75]
)
Compute the standard deviation of the Treatment response two ways (first manually, second by pandas).
treated = cal[cal['Treatment'] == 'Calcium']['Decrease']
samp_var = sum(
    [(x - mean_treated) ** 2 for x in treated]
) / (len(treated) - 1)
math.sqrt(samp_var)
cal[cal['Treatment'] == 'Calcium']['Decrease'].std()
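The ddof=1 divisor is the only wrinkle here: dividing the sum of squares by N-1 rather than N gives the unbiased sample variance. A self-contained check on made-up numbers:

```python
import math
import numpy as np

# Hypothetical sample
xs = [7.0, 3.0, 5.0, 9.0, 6.0]
mean = sum(xs) / len(xs)

# Unbiased sample variance divides by (N - 1), not N
samp_var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
samp_std = math.sqrt(samp_var)
```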
In order to help visualize the difference, we take 200 samples of size 5 from the overall, treated, and placebo populations, and chart the results as histograms. Once again, treatment and placebo seem different.
sampling = [
sum(
np.random.choice(
cal['Decrease'], size=5, replace=True
)
)
/ 5
for i in range(200)
]
sampling_treated = [
sum(
np.random.choice(
cal[cal['Treatment'] == 'Calcium']['Decrease'],
size=5,
replace=True,
)
)
/ 5
for i in range(200)
]
sampling_placebo = [
sum(
np.random.choice(
cal[cal['Treatment'] == 'Placebo']['Decrease'],
size=5,
replace=True,
)
)
/ 5
for i in range(200)
]
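The reason averages of size-5 samples are worth plotting: by the Central Limit Theorem, the sample means cluster around the population mean with a spread of roughly sigma/sqrt(n). A sketch with a synthetic population (seeded for reproducibility; the location and scale are made up, not the calcium data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 'population' standing in for the Decrease column
population = rng.normal(loc=5.0, scale=8.0, size=10_000)

# Means of many size-5 resamples
means = np.array(
    [
        rng.choice(population, size=5, replace=True).mean()
        for _ in range(2000)
    ]
)

se_empirical = means.std()
se_theory = population.std() / np.sqrt(5)
```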
fig, ax = plt.subplots(figsize=(6, 6))
ax.hist(sampling, label='Combined', color='blue', alpha=0.5)
ax.hist(
sampling_placebo,
alpha=0.5,
label='Placebo Only',
color='green',
)
ax.hist(
sampling_treated,
alpha=0.5,
label='Calcium Only',
color='red',
)
ax.axvline(mean_decrease, c='blue', label='Total')
ax.axvline(mean_treated, c='red', label='Treated')
ax.axvline(mean_placebo, c='green', label='Placebo')
ax.legend(loc='best')
We now work through a one-sample T test on a small sample of 20 values, testing the hypothesis that the sample comes from a population with mean 0.
sample = [5, -1, 0, 2, 6, -1, 5, 6, 3, 3, -3, -1, -2, 3, 0, 4, 1, 4, 6, 0]
print(np.mean(sample), np.std(sample, ddof=1))
Manually compute the value of the T statistic. Note we use the sample Sum of Squares / (N-1) (see numpy.std
) to give an unbiased estimator of the variance of the infinite population.
t_value = (np.mean(sample) - 0) / (
np.std(sample, ddof=1) / math.sqrt(20)
)
t_value
len(sample)
Now we use the scipy.stats
package to perform a T Test on the data set as a whole. We see that, given the Null Hypothesis that the mean is 0, the results we actually see are very unlikely (about 5 in 1,000).
stats.ttest_1samp(sample, 0)
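The scipy result can be reproduced by hand: the statistic is the standardized distance of the sample mean from 0, and the two-sided p-value is twice the upper-tail probability of the T distribution with N-1 degrees of freedom. A check using the same 20-value sample:

```python
import math
import numpy as np
from scipy import stats

sample = [5, -1, 0, 2, 6, -1, 5, 6, 3, 3,
          -3, -1, -2, 3, 0, 4, 1, 4, 6, 0]
n = len(sample)

# T statistic: distance of the sample mean from 0, in standard-error units
t_manual = (np.mean(sample) - 0) / (np.std(sample, ddof=1) / math.sqrt(n))
# Two-sided p-value from the T survival function
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)
```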
Visualizing the Difference from H0¶
fig, ax = plt.subplots(figsize=(6, 6))
ax.hist(sample, label='Data')
ax.axvline(0, c='blue', label='H0: Mean=0')
ax.axvline(np.mean(sample), c='red', label='Actual Mean')
ax.legend(loc='best')
Get the 2% rejection range.
df = 19
crit1, crit2 = stats.t.ppf([1 - 0.01, 0.01], df)
print(crit1, crit2)
x_rej1 = np.linspace(stats.t.ppf(0.001, df), crit2, 20)
x_rej2 = np.linspace(crit1, stats.t.ppf(0.999, df), 20)
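The T distribution is symmetric about zero, so the two cutoffs above are mirror images: ppf(1 - a) == -ppf(a). A quick check:

```python
from scipy import stats

df = 19
upper = stats.t.ppf(0.99, df)  # upper 1% cutoff
lower = stats.t.ppf(0.01, df)  # lower 1% cutoff
```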
We plot the T distribution appropriate for our sample size, the acceptance / rejection regions, and our observed value.
fig, ax = plt.subplots(figsize=(6, 6))
x = np.linspace(
stats.t.ppf(0.001, df), stats.t.ppf(0.999, df), 50
)
ax.plot(x, stats.t.pdf(x, df), 'r-', label=f'T (df={df})')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(t_value, c='black', label='T value')
ax.axvline(crit1, c='yellow', label='99% range')
ax.axvline(crit2, c='yellow', label='99% range')
ax.fill_between(
x_rej1,
0,
stats.t.pdf(x_rej1, df),
facecolor='red',
label='Rejection Region (99%)',
alpha=0.5,
)
ax.fill_between(
x_rej2,
0,
stats.t.pdf(x_rej2, df),
facecolor='red',
alpha=0.5,
)
ax.grid()
ax.legend(loc='right')
stats.t.ppf([0.975, 0.025], df)
Analysis of Calcium Data¶
Back to our treatment / placebo data. After calculating the T statistic manually to make sure we understand the mechanics, we use the stats.ttest_ind()
method to compare two samples. In this case, we see that under the Null Hypothesis (treatment = placebo), we would see our T value about 1 in 10 times. Thus we can't reject H0 at the 5% level. We plot the results to show the value we see falls in the Accept H0 region.
treatment = cal[cal['Treatment'] == 'Calcium']['Decrease']
placebo = cal[cal['Treatment'] == 'Placebo']['Decrease']
l_treatment = len(treatment)
l_placebo = len(placebo)
print(
f'Number of treatments: {l_treatment}, number of placebos: {l_placebo}'
)
t_mean = treatment.mean()
p_mean = placebo.mean()
t_std = np.std(treatment, ddof=1)
p_std = np.std(placebo, ddof=1)
ss2 = (l_treatment - 1) * t_std ** 2 + (
l_placebo - 1
) * p_std ** 2
mean_ss2 = ss2 / (l_treatment + l_placebo - 2)
t_value = (t_mean - p_mean) / (
math.sqrt(mean_ss2)
* math.sqrt(1 / l_treatment + 1 / l_placebo)
)
print(t_mean, p_mean, t_value)
stats.ttest_ind(treatment, placebo, equal_var=True)
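The pooled two-sample calculation above can be checked against scipy on any pair of samples; a sketch with two small hypothetical groups (not the calcium data):

```python
import math
import numpy as np
from scipy import stats

# Hypothetical groups
a = np.array([7.0, 5.0, 6.0, 9.0, 4.0])
b = np.array([3.0, 4.0, 2.0, 5.0])
na, nb = len(a), len(b)

# Pooled variance: weighted sums of squares over combined degrees of freedom
ss2 = (na - 1) * a.std(ddof=1) ** 2 + (nb - 1) * b.std(ddof=1) ** 2
pooled_sd = math.sqrt(ss2 / (na + nb - 2))
t_manual = (a.mean() - b.mean()) / (
    pooled_sd * math.sqrt(1 / na + 1 / nb)
)
```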
Get the 5% accept / reject range.
df = 19
crit1, crit2 = stats.t.ppf([1 - 0.025, 0.025], df)
print(crit1, crit2)
Graph the results.
fig, ax = plt.subplots(figsize=(6, 6))
df = 19
x = np.linspace(
stats.t.ppf(0.001, df), stats.t.ppf(0.999, df), 50
)
ax.plot(x, stats.t.pdf(x, df), 'r-', label=f'T (df={df})')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(t_value, c='black', label='T value')
ax.axvline(crit1, c='green', label='95% upper cutoff value')
ax.legend(loc='upper left')
stats.ttest_ind(treatment, placebo, equal_var=False)
Regression Model¶
We can try to fit (by OLS) a linear model to our data: again we find that the data we see would be seen about 1 time in 10, assuming H0. I presume that a researcher who believed in the treatment would try again with a larger sample size.
res = ols(formula='Decrease ~ Treatment', data=cal).fit()
print(res.summary())
Display the residual errors for our model. This illustrates how much noise there is in the data.
x_index = np.linspace(
1, len(res.fittedvalues) + 1, len(res.fittedvalues)
)
fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(x_index, cal['Decrease'], 'ro', label='Data')
ax.plot(
x_index, res.fittedvalues, 'g+', label='Fitted values'
)
# we plot many residual lines, but only label one for the legend
# do all residual error lines
for x, y1, y2 in zip(
x_index, cal['Decrease'], res.fittedvalues
):
ax.plot([x, x], [y1, y2], 'y-')
# end for
# label the first error line, then bail
for x, y1, y2 in zip(
x_index, cal['Decrease'], res.fittedvalues
):
ax.plot(
[x, x], [y1, y2], 'y-', label='Prediction Error'
)
break
# end for
ax.grid()
ax.set_xlabel('Sample Number')
ax.set_ylabel('BP Decrease')
ax.legend(loc='best')
Reproducibility¶
%watermark -h -iv
%watermark