Stanford STATS191 in Python, Lecture 11: Bootstrapping regression

Thu 23 April 2020

Filed under scipy

Tags statistics scipy statsmodels

Stanford Stats 191

Introduction

This is a re-creation of the Stanford Stats 191 course (see https://web.stanford.edu/class/stats191/), using Python ecosystem tools instead of R. This post covers the lecture "Bootstrapping regression".

It appears that statsmodels has no built-in support for bootstrap-style operations. A number of Python packages claim to support the bootstrap, but I decided to use pandas and its DataFrame sample method, and to perform the resampling and parameter re-estimation explicitly.
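The core of that approach is only a few lines. A minimal sketch, using an illustrative toy dataset (the full worked examples follow below):

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# toy dataset: y = 1 + 2 * x + Gaussian noise
x = np.random.normal(0, 1, 100)
toy = pd.DataFrame({'x': x, 'y': 1 + 2 * x + np.random.normal(0, 1, 100)})

slopes = []
for _ in range(100):
    # resample n rows with replacement (a "pairs" bootstrap)
    boot = toy.sample(frac=1, replace=True)
    # refit on the resampled data and keep the slope estimate
    slopes.append(ols('y ~ x', data=boot).fit().params['x'])

# 95% percentile interval for the slope
print(np.percentile(slopes, [2.5, 97.5]))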

Initial Notebook Setup

watermark documents the current Python and package environment; black is my preferred Python formatter

In [60]:
%load_ext watermark
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
In [61]:
%load_ext lab_black
The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black
In [62]:
%matplotlib inline

Imports

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns

import math

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm

OLS with Gaussian Noise as the Error Term

First, we perform an OLS best fit of a linear model to a dataset with Gaussian noise.

In [64]:
# size of dataset
n_data = 50

# linear parameters
beta_0 = 1.5
beta_1 = 2.0
sigma = 2

x = np.random.normal(0, 1, n_data)
x = np.array(sorted(x))

# linear model with Gaussian noise
y = beta_0 + x * beta_1
#
y = y + np.random.normal(0, 1, n_data) * sigma

Quick plot of dataset

In [65]:
_ = plt.plot(x, y, 'o')

Perform OLS fit to linear model

In [66]:
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)

res = ols('y ~ x', data=data).fit()

res.params
Out[66]:
Intercept    1.473338
x            2.009848
dtype: float64
In [67]:
res.summary()
Out[67]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.454
Model:                            OLS   Adj. R-squared:                  0.443
Method:                 Least Squares   F-statistic:                     39.92
Date:                Sun, 19 Apr 2020   Prob (F-statistic):           8.18e-08
Time:                        19:37:26   Log-Likelihood:                -101.86
No. Observations:                  50   AIC:                             207.7
Df Residuals:                      48   BIC:                             211.5
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.4733      0.268      5.499      0.000       0.935       2.012
x              2.0098      0.318      6.318      0.000       1.370       2.649
==============================================================================
Omnibus:                        0.019   Durbin-Watson:                   2.373
Prob(Omnibus):                  0.990   Jarque-Bera (JB):                0.174
Skew:                           0.030   Prob(JB):                        0.917
Kurtosis:                       2.718   Cond. No.                         1.19
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Non-Gaussian Errors

Now, consider the case where the error term is uniformly distributed (i.e. non-Gaussian). We use the same model parameters as before.

In [115]:
n_data = 50

beta_0 = 1.5
beta_1 = 2
sigma = 2

x = np.random.normal(0, 1, n_data)
x = np.array(sorted(x))
y = beta_0 + x * beta_1
#
y = (
    y
    + np.random.uniform(low=-1.0, high=1.0, size=n_data)
    * sigma
)
In [116]:
_ = plt.plot(x, y, 'o')
In [117]:
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)

res = ols('y ~ x', data=data).fit()

res.params
Out[117]:
Intercept    1.388669
x            1.918366
dtype: float64
In [118]:
res.summary()
Out[118]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.675
Method:                 Least Squares   F-statistic:                     102.8
Date:                Sun, 19 Apr 2020   Prob (F-statistic):           1.62e-13
Time:                        20:07:11   Log-Likelihood:                -81.728
No. Observations:                  50   AIC:                             167.5
Df Residuals:                      48   BIC:                             171.3
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3887      0.183      7.597      0.000       1.021       1.756
x              1.9184      0.189     10.137      0.000       1.538       2.299
==============================================================================
Omnibus:                       26.087   Durbin-Watson:                   2.360
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                4.907
Skew:                           0.331   Prob(JB):                       0.0860
Kurtosis:                       1.615   Cond. No.                         1.23
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The linear model parameters from the OLS best fit

In [119]:
res.params['Intercept'], res.params['x']
Out[119]:
(1.388669147713418, 1.918365581875839)

Now, sample the dataset with replacement and re-estimate the parameters, repeating this ntrials times.

In [120]:
param1 = []
param2 = []

ntrials = 200
for trial in range(ntrials):
    data_sample = data.sample(frac=1, replace=True)

    res = ols('y ~ x', data=data_sample).fit()

    param1.append(res.params['Intercept'])
    param2.append(res.params['x'])
# end for

Display the spread of intercept parameter estimates (actual value 1.5)

In [121]:
_ = plt.hist(param1, edgecolor='white')

Display the spread of slope estimates (actual value 2)

In [122]:
_ = plt.hist(param2, edgecolor='white')
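The spread of the bootstrap estimates also yields a bootstrap standard error for the slope, which can be set against the std err column of the OLS summary above. A sketch, using the param2 list accumulated in the loop:

# sample standard deviation of the bootstrap slope estimates
np.std(param2, ddof=1)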

Find the 95% confidence interval of the slope from the bootstrap estimates, and compare it with the one-shot OLS confidence interval of 1.494 to 2.216.

In [123]:
np.percentile(param2, 2.5), np.percentile(param2, 97.5)
Out[123]:
(1.5116567769330833, 2.2939972651879335)
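For reference, the one-shot OLS interval can be read straight from a fit on the full dataset via conf_int(). Note that after the loop, res holds the fit to the last bootstrap sample, so we refit first (res_full is just an illustrative name):

# refit on the full dataset, then read off the 95% OLS interval for the slope
res_full = ols('y ~ x', data=data).fit()
res_full.conf_int().loc['x']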

Heavily skewed errors

We consider an exponential distribution of errors, shifted down by its mean of 1 so that the adjusted errors have mean 0.

In [95]:
expn = np.random.exponential(1, 1000) - 1

_ = plt.hist(expn, edgecolor='w')
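A quick sanity check that the shifted errors do have a mean close to 0:

# should be close to 0 (sampling noise aside)
expn.mean()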

Compute the observations using this skewed error source. Note that the error term x * expn scales with x, so the noise is heteroskedastic as well as skewed.

In [96]:
x = np.random.exponential(1, 1000)

y = 3 + 2.5 * x + x * expn

Quick plot of dataset

In [97]:
_ = plt.plot(x, y, 'o')

Perform OLS linear model fit

In [98]:
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)

res3 = ols('y ~ x', data=data).fit()

res3.params
Out[98]:
Intercept    2.943530
x            2.512294
dtype: float64

Repeatedly sample the original dataset with replacement, repeat the OLS best fit, and accumulate the parameter estimates.

In [99]:
param1 = []
param2 = []

ntrials = 500
for trial in range(ntrials):
    data_sample = data.sample(frac=1, replace=True)

    res = ols('y ~ x', data=data_sample).fit()

    param1.append(res.params['Intercept'])
    param2.append(res.params['x'])
# end for

Quick plot of the distribution of the estimate of the intercept (actual value 3)

In [100]:
_ = plt.hist(param1, edgecolor='w')

Quick plot of the distribution of the estimate of the slope (actual value 2.5)

In [101]:
_ = plt.hist(param2, edgecolor='w')
In [102]:
res3.params
Out[102]:
Intercept    2.943530
x            2.512294
dtype: float64
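As before, percentile confidence intervals for both parameters can be taken straight from the accumulated bootstrap estimates. A sketch:

# 95% percentile intervals for intercept and slope
np.percentile(param1, [2.5, 97.5]), np.percentile(param2, [2.5, 97.5])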

Formal Visualization

We use Seaborn to prepare more detailed graphics.

First, the distribution of the estimate of the intercept

In [103]:
fig, ax = plt.subplots(figsize=(10, 10))
sns.distplot(
    param1, rug=True, hist_kws={'edgecolor': 'w'}, ax=ax
)
ax.set_xlabel('Value')
ax.set_ylabel('Relative Intercept Probability')
ax.set_title(
    'Distribution of Intercept Value (by Bootstrapping)'
)
ax.axvline(
    res3.params['Intercept'],
    color='g',
    label='OLS estimate of Intercept',
)
ax.legend(loc='best')
Out[103]:
<matplotlib.legend.Legend at 0x2e9fa7e2a20>

And now the distribution of the estimate of the slope

In [104]:
fig, ax = plt.subplots(figsize=(10, 10))

sns.distplot(param2, rug=True, hist_kws={'edgecolor': 'w'})

ax.set_xlabel('Value')
ax.set_ylabel('Relative Slope Probability')
ax.set_title(
    'Distribution of Slope Value (by Bootstrapping)'
)
ax.axvline(
    res3.params['x'],
    color='g',
    label='OLS estimate of slope',
)
ax.legend(loc='best')
Out[104]:
<matplotlib.legend.Legend at 0x2e9fab438d0>

Finally, a quick plot of the OLS fit

In [41]:
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, res3.predict(), 'r-')

Environment

In [125]:
%watermark -h -iv
%watermark
numpy       1.15.4
seaborn     0.9.0
statsmodels 0.9.0
matplotlib  3.0.2
pandas      1.0.0
scipy       1.1.0
host name: DESKTOP-SODFUN6
2020-04-19T20:10:22+10:00

CPython 3.7.1
IPython 7.2.0

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit
In [126]:
sm.show_versions()
INSTALLED VERSIONS
------------------
Python: 3.7.1.final.0

Statsmodels
===========

Installed: 0.9.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\statsmodels)

Required Dependencies
=====================

cython: 0.29.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\Cython)
numpy: 1.15.4 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\numpy)
scipy: 1.1.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy)
pandas: 1.0.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\pandas)
    dateutil: 2.7.5 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\dateutil)
patsy: 0.5.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\patsy)

Optional Dependencies
=====================

matplotlib: 3.0.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\matplotlib)
    backend: module://ipykernel.pylab.backend_inline 
cvxopt: Not installed
joblib: 0.13.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\joblib)

Developer Tools
================

IPython: 7.2.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\IPython)
    jinja2: 2.10.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\jinja2)
sphinx: 1.8.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\sphinx)
    pygments: 2.3.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\pygments)
nose: 1.3.7 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\nose)
pytest: 5.0.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages)
virtualenv: Not installed


