Stanford STATS191 in Python, Lecture 11 : Bootstrapping regression
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course (see https://web.stanford.edu/class/stats191/), using Python eco-system tools, instead of R. This is lecture "Bootstrapping regression:"
It appears that statsmodels
has no built-in support for Bootstrap-style operations. A number of Python packages claim to support the Bootstrap, but I decided to use the pandas
DataFrame sample
method, and to perform the resampling and parameter re-estimation explicitly.
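The resampling pattern used throughout this notebook can be sketched as follows (a toy dataset, illustrative only): with frac=1, replace=True, pandas' DataFrame.sample draws a bootstrap resample of the same size as the original.

```python
import numpy as np
import pandas as pd

# toy dataset; values are illustrative only
rng = np.random.default_rng(0)
data = pd.DataFrame({'x': rng.normal(size=10)})
data['y'] = 1.5 + 2.0 * data['x']

# one bootstrap resample: same number of rows, drawn with replacement,
# so individual rows may repeat while others are absent
resample = data.sample(frac=1, replace=True)
print(resample.index.tolist())
```

Because rows are drawn with replacement, the resample's index typically contains duplicates; that is exactly what the bootstrap relies on.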
Initial Notebook Setup¶
watermark
documents the current Python and package environment, black
is my preferred Python formatter
%load_ext watermark
%load_ext lab_black
%matplotlib inline
Imports¶
import pandas as pd
import numpy as np
import seaborn as sns
import math
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
OLS of Gaussian Noise as Error Term¶
First, we perform an OLS best fit of a linear model to a dataset with Gaussian noise
# size of dataset
n_data = 50
# linear parameters
beta_0 = 1.5
beta_1 = 2.0
sigma = 2
x = np.random.normal(0, 1, n_data)
x = np.array(sorted(x))
# linear model with Gaussian noise
y = beta_0 + x * beta_1
#
y = y + np.random.normal(0, 1, n_data) * sigma
Quick plot of dataset
_ = plt.plot(x, y, 'o')
Perform OLS fit to linear model
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)
res = ols('y ~ x', data=data).fit()
res.params
res.summary()
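For later comparison with the bootstrap intervals, the classical OLS confidence intervals can be read directly off the fitted results via conf_int(). A minimal, self-contained sketch that regenerates a dataset like the one above (the seed is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# regenerate a Gaussian-noise dataset like the one above; seed is arbitrary
rng = np.random.default_rng(1)
n_data = 50
x = np.sort(rng.normal(0, 1, n_data))
y = 1.5 + 2.0 * x + rng.normal(0, 1, n_data) * 2

res = ols('y ~ x', data=pd.DataFrame({'x': x, 'y': y})).fit()
ci = res.conf_int()  # rows Intercept / x; columns 0 (lower), 1 (upper); 95% by default
print(ci)
```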
Non-Gaussian Errors¶
Now, consider the case where the error term is uniformly distributed (i.e. non-Gaussian). We use the same model parameters
n_data = 50
beta_0 = 1.5
beta_1 = 2
sigma = 2
x = np.random.normal(0, 1, n_data)
x = np.array(sorted(x))
y = beta_0 + x * beta_1
#
y = (
y
+ np.random.uniform(low=-1.0, high=1.0, size=n_data)
* sigma
)
_ = plt.plot(x, y, 'o')
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)
res = ols('y ~ x', data=data).fit()
res.params
res.summary()
The linear model parameters from the OLS best fit
res.params['Intercept'], res.params['x']
Now, sample the dataset with replacement and re-estimate the parameters, repeating for ntrials
iterations
param1 = []
param2 = []
ntrials = 200
for trial in range(ntrials):
data_sample = data.sample(frac=1, replace=True)
res = ols('y ~ x', data=data_sample).fit()
param1.append(res.params['Intercept'])
param2.append(res.params['x'])
# end for
Display the spread of intercept parameter estimates (actual value 1.5)
_ = plt.hist(param1, edgecolor='white')
Display the spread of slope estimates (actual value 2)
_ = plt.hist(param2, edgecolor='white')
Find the 95% confidence interval of the slope, based upon the repeated estimation; for comparison, the one-shot OLS estimate gave 1.494 to 2.216
np.percentile(param2, 2.5), np.percentile(param2, 97.5)
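The percentile interval above can be checked against the classical OLS interval in one self-contained run. A sketch under the same uniform-error model (the seeds and trial count are arbitrary choices):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# regenerate a uniform-error dataset like the one above; seed is arbitrary
rng = np.random.default_rng(2)
n_data = 50
x = np.sort(rng.normal(0, 1, n_data))
y = 1.5 + 2.0 * x + rng.uniform(-1.0, 1.0, n_data) * 2
data = pd.DataFrame({'x': x, 'y': y})
res = ols('y ~ x', data=data).fit()

# bootstrap the slope estimate
slopes = [
    ols('y ~ x', data=data.sample(frac=1, replace=True, random_state=i))
    .fit()
    .params['x']
    for i in range(200)
]

boot_lo, boot_hi = np.percentile(slopes, [2.5, 97.5])
ols_lo, ols_hi = res.conf_int().loc['x']
print(f'bootstrap: ({boot_lo:.3f}, {boot_hi:.3f})')
print(f'OLS:       ({ols_lo:.3f}, {ols_hi:.3f})')
```

With well-behaved errors the two intervals should be broadly similar; the bootstrap's advantage shows up when the error distribution departs further from the OLS assumptions.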
Skewed Errors¶
We now build a skewed error source: exponentially distributed noise, shifted to have zero mean
expn = np.random.exponential(1, 1000) - 1
_ = plt.hist(expn, edgecolor='w')
Compute the observations using this skewed error source, with the noise term scaled by x
x = np.random.exponential(1, 1000)
y = 3 + 2.5 * x + x * expn
Quick plot of dataset
_ = plt.plot(x, y, 'o')
Perform OLS linear model fit
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)
res3 = ols('y ~ x', data=data).fit()
res3.params
Repeatedly sample the original dataset with replacement, repeat the OLS best fit, and accumulate the parameter estimates
param1 = []
param2 = []
ntrials = 500
for trial in range(ntrials):
data_sample = data.sample(frac=1, replace=True)
res = ols('y ~ x', data=data_sample).fit()
param1.append(res.params['Intercept'])
param2.append(res.params['x'])
# end for
Quick plot of the distribution of the estimate of the intercept (actual value 3)
_ = plt.hist(param1, edgecolor='w')
Quick plot of the distribution of the estimate of the slope (actual value 2.5)
_ = plt.hist(param2, edgecolor='w')
res3.params
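The bootstrap replicates also give a standard error for the slope: the sample standard deviation of the replicated estimates. A self-contained sketch with the skewed, x-scaled error model (the sample size, seed, and trial count are arbitrary choices):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# regenerate the skewed, x-scaled error model; sizes and seed are arbitrary
rng = np.random.default_rng(3)
n_data = 200
x = rng.exponential(1, n_data)
y = 3 + 2.5 * x + x * (rng.exponential(1, n_data) - 1)
data = pd.DataFrame({'x': x, 'y': y})
res = ols('y ~ x', data=data).fit()

slopes = [
    ols('y ~ x', data=data.sample(frac=1, replace=True, random_state=i))
    .fit()
    .params['x']
    for i in range(200)
]

boot_se = np.std(slopes, ddof=1)  # bootstrap standard error of the slope
ols_se = res.bse['x']             # classical OLS standard error
print(f'bootstrap SE: {boot_se:.4f}, OLS SE: {ols_se:.4f}')
```

Because the noise here scales with x (heteroscedastic) and is skewed, the bootstrap and classical standard errors need not agree; that disagreement is the point of bootstrapping in this setting.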
Formal Visualization¶
We use Seaborn to prepare more detailed graphics
First the distribution of the estimate of the Intercept
fig, ax = plt.subplots(figsize=(10, 10))
sns.histplot(
    param1, kde=True, stat='density', edgecolor='w', ax=ax
)
sns.rugplot(param1, ax=ax)
ax.set_xlabel('Value')
ax.set_ylabel('Relative Intercept Probability')
ax.set_title(
'Distribution of Intercept Value (by Bootstrapping)'
)
ax.axvline(
res3.params['Intercept'],
color='g',
label='OLS estimate of Intercept',
)
ax.legend(loc='best')
And now the distribution of the estimate of the slope
fig, ax = plt.subplots(figsize=(10, 10))
sns.histplot(param2, kde=True, stat='density', edgecolor='w', ax=ax)
sns.rugplot(param2, ax=ax)
ax.set_xlabel('Value')
ax.set_ylabel('Relative Slope Probability')
ax.set_title(
'Distribution of Slope Value (by Bootstrapping)'
)
ax.axvline(
res3.params['x'],
color='g',
label='OLS estimate of slope',
)
ax.legend(loc='best')
Finally, a quick plot of the OLS fit
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, res3.predict(), 'r-')
Environment¶
%watermark -h -iv
%watermark
sm.show_versions()