Stanford STATS191 in Python, Lecture 3 : Simple Linear Regression

Thu 19 March 2020

Filed under scipy

Tags statistics scipy statsmodels

Stanford Stats 191

Introduction

This is a re-creation of the Stanford Stats 191 course, using Python ecosystem tools instead of R. This post covers the lecture "Simple linear regression" (https://web.stanford.edu/class/stats191/notebooks/Simple_linear_regression.html). See the previous blog posts for my motivation.

Initial Notebook Setup

watermark helps document the Python and package environment; black is my preferred Python formatter.

In [3]:
%load_ext watermark
In [4]:
%load_ext lab_black
In [5]:
%matplotlib inline

All imports go here. numpy handles low-level numerical computing, pandas handles high-level data-table structures, and seaborn handles easy visualization.

scipy.stats and statsmodels handle most of the statistics.

In [6]:
import pandas as pd
import numpy as np
import seaborn as sn

import math

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm

from statsmodels.sandbox.regression.predstd import (
    wls_prediction_std,
)

from statsmodels.stats.stattools import jarque_bera

import os
D:\Anaconda3\envs\ac5-py37\lib\site-packages\statsmodels\compat\pandas.py:49: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  data_klasses = (pandas.Series, pandas.DataFrame, pandas.Panel)

Create Data

We create a dataset with Gaussian noise, and a very weak relation between x and y.

In [7]:
n_data = 50

beta_0 = 1.5
beta_1 = 0.1
sigma = 2

x = np.random.normal(0, 1, n_data)
x = np.array(sorted(x))
y = beta_0 + x * beta_1
#
y = y + np.random.normal(0, 1, n_data) * sigma
In [8]:
_ = plt.plot(x, y, 'o')

Life becomes a lot easier if we create a pandas DataFrame holding this data. That done, we perform an Ordinary Least Squares (OLS) fit of a line to the data, using the R-style symbolic model specification (saying that y depends linearly on x).

In [9]:
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)

res = ols('y ~ x', data=data).fit()

res.params
Out[9]:
Intercept    1.698655
x           -0.032369
dtype: float64
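Later cells pick these parameters out by position (res.params[0], res.params[1]). Since res.params is a pandas Series indexed by term name, they can also be read by label, which is a little clearer; a small illustrative snippet (not part of the original lecture):

intercept_hat = res.params['Intercept']  # fitted intercept
slope_hat = res.params['x']              # fitted slope for the x term
print(f'intercept = {intercept_hat:.4f}, slope = {slope_hat:.4f}')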
In [10]:
y_hat = res.predict()

We use minimal matplotlib calls to plot our data and the best-fit line.

In [11]:
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, y_hat, 'r-')

Error / Loss Functions

One question of interest is: "If we choose another metric to minimize (say, absolute error), are the results much different?"

In [12]:
def loss_sq(y1, y2):

    return (y1 - y2) ** 2


# end loss


def loss_abs(y1, y2):
    return np.abs(y1 - y2)


# end loss_abs

We build a grid of intercept and slope values, compute the metric, and plot the result

In [13]:
b1_grid = np.linspace(beta_1 - 2, beta_1 + 2, 50)
b0_grid = np.linspace(beta_0 - 2, beta_0 + 2, 50)
bxx, byy = np.meshgrid(b0_grid, b1_grid)
zz = np.zeros((50, 50))


for i in range(len(bxx)):  # run over grid rows
    for j in range(len(byy)):  # run over grid columns; each (i, j) is one (intercept, slope) pair
        sum = 0
        for k in range(len(x)):
            y_predict = bxx[i, j] + byy[i, j] * x[k]
            loss = loss_sq(y_predict, y[k])
            sum = sum + loss
        # end for
        zz[i, j] = sum
    # end for
# end for

h = plt.contourf(bxx, byy, zz)
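The triple loop above is easy to follow but slow. The same zz array can be computed in one step with numpy broadcasting; a minimal sketch, assuming the bxx, byy, x and y arrays defined above:

# predictions for every (intercept, slope) grid point and every x value,
# shape (50, 50, n_data), built by broadcasting
y_pred = bxx[:, :, None] + byy[:, :, None] * x[None, None, :]
zz_vec = ((y_pred - y[None, None, :]) ** 2).sum(axis=-1)
# zz_vec should agree with the zz filled in by the loops above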

This time we adjust alpha, and add contour lines to show how close the OLS fit is to the true value

In [14]:
fig, ax = plt.subplots(figsize=(10, 10))

cnt = ax.contour(bxx, byy, zz, levels=20)
cnt2 = ax.contourf(bxx, byy, zz, levels=20, alpha=0.3)
fig.colorbar(cnt2, ax=ax)
ax.grid()

ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.plot(
    [res.params[0]],
    [res.params[1]],
    'o',
    label='OLS Estimate',
)
ax.plot([beta_0], [beta_1], 'r+', label='True')

ax.set_xlabel('Intercept')
ax.set_ylabel('Slope')

ax.legend(loc='best')
Out[14]:
<matplotlib.legend.Legend at 0x1f11ace9630>

We now do the same for the sum of absolute errors. It turns out the difference is minimal.

In [15]:
b1_grid = np.linspace(beta_1 - 2, beta_1 + 2, 50)
b0_grid = np.linspace(beta_0 - 2, beta_0 + 2, 50)
bxx, byy = np.meshgrid(b0_grid, b1_grid)
zz = np.zeros((50, 50))


for i in range(len(bxx)):  # run over grid rows
    for j in range(len(byy)):  # run over grid columns; each (i, j) is one (intercept, slope) pair
        sum = 0
        for k in range(len(x)):
            y_predict = bxx[i, j] + byy[i, j] * x[k]
            loss = loss_abs(y_predict, y[k])
            sum = sum + loss
        # end for
        zz[i, j] = sum
    # end for
# end for

h = plt.contourf(bxx, byy, zz)
In [16]:
fig, ax = plt.subplots(figsize=(10, 10))

cnt = ax.contour(bxx, byy, zz, levels=20)
cnt2 = ax.contourf(bxx, byy, zz, levels=20, alpha=0.3)
fig.colorbar(cnt2, ax=ax)
ax.grid()

ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.plot(
    [res.params[0]],
    [res.params[1]],
    'o',
    label='OLS Estimate',
)
ax.plot([beta_0], [beta_1], 'r+', label='True')

ax.set_xlabel('Intercept')
ax.set_ylabel('Slope')

ax.legend(loc='best')
Out[16]:
<matplotlib.legend.Legend at 0x1f11a437208>
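Rather than reading the minimum off a grid, we can fit the least-absolute-deviation line directly with a general-purpose optimizer and compare it with the OLS estimate. A minimal sketch (not part of the original lecture) using scipy.optimize.minimize, assuming the x and y arrays and the res fit from above:

from scipy.optimize import minimize


def total_abs_loss(params):
    # sum of absolute residuals for a candidate (intercept, slope)
    b0, b1 = params
    return np.abs(y - (b0 + b1 * x)).sum()


lad_fit = minimize(total_abs_loss, x0=[0.0, 0.0], method='Nelder-Mead')
print('LAD estimate :', lad_fit.x)
print('OLS estimate :', res.params.values)

(statsmodels also offers quantile regression, whose q=0.5 fit is the least-absolute-deviation line.)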

We now repeat the exercise for OLS, with different true slope and intercept, and plot the 3D total error surface

In [17]:
n_data = 50

beta_0 = 10
beta_1 = 5
sigma = 2

x = np.random.normal(0, 1, n_data)
x = np.array(sorted(x))
y = beta_0 + x * beta_1
#
y = y + np.random.normal(0, 1, n_data) * sigma
In [18]:
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)

res = ols('y ~ x', data=data).fit()

res.params
Out[18]:
Intercept    9.669345
x            5.381967
dtype: float64
In [19]:
y_hat = res.predict()

A default filled contour plot; much of the detail is lost.

In [20]:
b1_grid = np.linspace(beta_1 - 2, beta_1 + 2, 50)
b0_grid = np.linspace(beta_0 - 2, beta_0 + 2, 50)
bxx, byy = np.meshgrid(b0_grid, b1_grid)
zz = np.zeros((50, 50))


for i in range(len(bxx)):  # run over grid rows
    for j in range(len(byy)):  # run over grid columns; each (i, j) is one (intercept, slope) pair
        sum = 0
        for k in range(len(x)):
            y_predict = bxx[i, j] + byy[i, j] * x[k]
            loss = loss_sq(y_predict, y[k])
            sum = sum + loss
        # end for
        zz[i, j] = sum
    # end for
# end for

h = plt.contourf(bxx, byy, zz)

As before, we adjust the alpha of the fill and add contour lines.

In [21]:
fig, ax = plt.subplots(figsize=(10, 10))

cnt = ax.contour(bxx, byy, zz, levels=20)
cnt2 = ax.contourf(bxx, byy, zz, levels=20, alpha=0.3)
fig.colorbar(cnt2, ax=ax)
ax.grid()

# ax.spines['bottom'].set_position('zero')
# ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.plot(
    [res.params[0]],
    [res.params[1]],
    'o',
    label='OLS Estimate',
)
ax.plot([beta_0], [beta_1], 'r+', label='True')

ax.set_xlabel('Intercept')
ax.set_ylabel('Slope')

ax.legend(loc='best')
Out[21]:
<matplotlib.legend.Legend at 0x1f11b18cbe0>

3D Surface Plots

For clarity we invert the error surface (plotting the negated sum of squared errors) and display it as a 3D surface.

In [22]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

# Plot the surface.
surf = ax.plot_surface(
    bxx,
    byy,
    -zz,
    linewidth=1,
    antialiased=False,
    cmap='rainbow',
)

fig.colorbar(surf, shrink=0.5, aspect=5)
Out[22]:
<matplotlib.colorbar.Colorbar at 0x1f11b1e4fd0>

The figure above looks nice, but we can add more detail: a planar contour plot, a line down from the surface maximum (i.e. the minimum error), a title, a legend, and axes labels.

In [23]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

# Plot the surface.
surf = ax.plot_surface(
    bxx,
    byy,
    -zz,
    linewidth=1,
    antialiased=False,
    cmap='rainbow',
    alpha=0.3,
)

z_offset = -600

ax.set_xlim(beta_0 - 4, beta_0 + 4)
ax.set_ylim(beta_1 - 4, beta_1 + 4)
ax.set_zlim(z_offset, 0)

ax.contourf(
    bxx,
    byy,
    -zz,
    zdir='z',
    offset=z_offset,
    cmap='rainbow',
    alpha=0.5,
    levels=20,
)
ax.contour(
    bxx,
    byy,
    -zz,
    zdir='z',
    offset=z_offset,
    cmap='rainbow',
    alpha=0.9,
    levels=20,
)

ax.contour(
    bxx,
    byy,
    -zz,
    zdir='z',
    cmap='gray',
    alpha=0.8,
    levels=20,
)

ax.set_xlabel('Intercept')
ax.set_ylabel('Slope')
ax.set_zlabel('\n\n-(Sum of Error^2)')

ax.plot(
    [beta_0], [beta_1], [z_offset + 1], 'go', label='True'
)
ax.plot(
    [res.params[0], res.params[0]],
    [res.params[1], res.params[1]],
    [z_offset + 1, 0],
    'r-',
    label='OLS Estimate',
)

ax.legend(loc='best')
fig.suptitle(
    'Inverted Loss Surface\n' + 'OLS Linear Regression ',
    fontsize=16,
)


fig.colorbar(surf, shrink=0.25, aspect=10)
Out[23]:
<matplotlib.colorbar.Colorbar at 0x1f11ade6240>

Wages Data

We now look at a different data set: one linking wages with education level

Load and explore data

In [24]:
wages = pd.read_csv('../data/wage.csv')
In [25]:
wages.head()
Out[25]:
   education   logwage
0  16.750000  2.845000
1  15.000000  2.446667
2  10.000000  1.560000
3  12.666667  2.099167
4  15.000000  2.490000
In [26]:
wages['logwage'].mean()
Out[26]:
2.240279398911975

Perform OLS

Fit a line to the data, and display

In [27]:
res = ols('logwage ~ education', data=wages).fit()
In [28]:
res.params
Out[28]:
Intercept    1.239194
education    0.078600
dtype: float64
In [29]:
_ = plt.plot(
    wages['education'], wages['logwage'], 'o', alpha=0.5
)

Because so many data points share the same x value, we "jitter" the x values by adding random noise. This gives a better sense of where most of the data points lie. We also set alpha to a smaller value, so that plotted points obscure the underlying points less.

In [30]:
_ = plt.plot(
    wages['education']
    + np.random.normal(0, 0.1, len(wages['education'])),
    wages['logwage'],
    'o',
    alpha=0.2,
)
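seaborn (imported at the top as sn, though not otherwise used in this post) can draw a jittered scatter plus a regression line in a single call; a minimal sketch, assuming the wages dataframe loaded above:

import seaborn as sn  # already imported above; repeated here for completeness

_ = sn.regplot(
    x='education',
    y='logwage',
    data=wages,
    x_jitter=0.1,
    scatter_kws={'alpha': 0.2},
    line_kws={'color': 'red'},
)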

Reduce alpha still further, and plot the line of best fit.

In [31]:
_ = plt.plot(
    wages['education']
    + np.random.normal(0, 0.1, len(wages['education'])),
    wages['logwage'],
    'o',
    alpha=0.1,
)
_ = plt.plot(wages['education'], res.predict(), 'r+')
In [32]:
res.params
Out[32]:
Intercept    1.239194
education    0.078600
dtype: float64

In order to make sure we understand the OLS results, we compute the best-fit intercept and slope manually
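The cells below implement the standard closed-form least-squares estimates:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$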

In [33]:
x_mean = wages['education'].mean()
y_mean = wages['logwage'].mean()

sxx = (wages['education'] - x_mean) @ (
    wages['education'] - x_mean
).T

cov_x_y = (wages['education'] - x_mean) @ (
    wages['logwage'] - y_mean
).T
In [34]:
beta_1_hat = cov_x_y / sxx

beta_0_hat = y_mean - x_mean * beta_1_hat

print(f' beta 0 = {beta_0_hat}, beta 1 = {beta_1_hat}')
 beta 0 = 1.239194326803034, beta 1 = 0.0785995102448242
In [35]:
sigma2_hat = (
    1
    / (len(wages['education']) - 2)
    * (wages['logwage'] - res.predict())
    @ (wages['logwage'] - res.predict()).T
)
In [36]:
sigma_hat = np.sqrt(sigma2_hat)
sigma_hat
Out[36]:
0.4037827822998898

Display the OLS results

In [37]:
res.summary()
Out[37]:
                      OLS Regression Results

Dep. Variable:     logwage            R-squared:            0.135
Model:             OLS                Adj. R-squared:       0.135
Method:            Least Squares      F-statistic:          340.0
Date:              Wed, 18 Mar 2020   Prob (F-statistic):   1.15e-70
Time:              21:04:47           Log-Likelihood:       -1114.3
No. Observations:  2178               AIC:                  2233.
Df Residuals:      2176               BIC:                  2244.
Df Model:          1
Covariance Type:   nonrobust

               coef    std err        t      P>|t|    [0.025    0.975]
Intercept     1.2392      0.055   22.541      0.000     1.131     1.347
education     0.0786      0.004   18.440      0.000     0.070     0.087

Omnibus:          46.662   Durbin-Watson:       1.769
Prob(Omnibus):     0.000   Jarque-Bera (JB):   59.725
Skew:             -0.269   Prob(JB):         1.07e-13
Kurtosis:          3.606   Cond. No.             82.4


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
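The summary reports a Jarque-Bera test of residual normality; the jarque_bera function imported at the top (and not otherwise used) computes the same statistic directly from the residuals. A minimal check, assuming the fitted res object:

jb_stat, jb_pvalue, jb_skew, jb_kurtosis = jarque_bera(res.resid)
print(f'JB = {jb_stat:.3f}, p-value = {jb_pvalue:.3g}')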
In [38]:
res.cov_params()
Out[38]:
            Intercept  education
Intercept    0.003022  -0.000231
education   -0.000231   0.000018

We see that the manually calculated values match those displayed by the summary() method. The manually calculated R^2 value is also identical, as are the parameter standard errors (computed below). The relevant sums of squares are available as attributes of the results object:

res.ess, res.ssr, res.centered_tss, (res.ess + res.ssr)
In [39]:
res.ess / res.centered_tss
Out[39]:
0.13514534147541912
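This matches the R-squared value in the summary table. For an OLS fit with an intercept, the explained sum of squares (res.ess) and the residual sum of squares (res.ssr) add up to the total centered sum of squares (res.centered_tss), and

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}}$$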

The equations for the standard errors of the slope (beta) and intercept (alpha) are shown below.

$$s_{\widehat{\beta}} = \sigma \sqrt{\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}$$
$$s_{\widehat{\alpha}} = s_{\widehat{\beta}} \sqrt{\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}}$$
In [40]:
se_beta1 = sigma_hat * np.sqrt(1 / (sxx))
sum_x2 = wages['education'] @ wages['education'].T
se_beta0 = se_beta1 * np.sqrt(
    sum_x2 / len(wages['education'])
)

print(f'Standard Error beta 1 = {se_beta1}')
print(f'Standard Error beta 0 = {se_beta0}')
Standard Error beta 1 = 0.004262470820775879
Standard Error beta 0 = 0.05497420521139725
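These hand-computed values can be compared with the standard errors statsmodels itself reports (the 'std err' column in the summary above); a minimal check, assuming the fitted res object:

print(res.bse)  # standard errors of the Intercept and education coefficients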

Reproducibility

In [41]:
%watermark -h -iv
%watermark
statsmodels 0.9.0
numpy       1.15.4
scipy       1.1.0
matplotlib  3.0.2
pandas      1.0.0
seaborn     0.9.0
host name: DESKTOP-SODFUN6
2020-03-18T21:04:47+10:00

CPython 3.7.1
IPython 7.2.0

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit
