Linear Regression by Numpy¶
Introduction¶
This snippet arose because I was working my way through the statsmodels
documentation, as part of converting a web lecture series I am following from R
to the Python ecosystem.
Anyway, I was working through using Ordinary Least Squares error minimization to perform linear regression, and realized I didn't fully understand what was going on behind the scenes. So then I thought: I will do it all again using Numpy and explicit matrix multiplication. The main reference I used is below:
https://en.wikipedia.org/wiki/Explained_sum_of_squares
Caveats: my Python formatter lays out matrix expressions in curious ways, so bear with me here. Also, the embedded mathematical expressions may not render correctly in all browsers.
Notebook Initialization¶
watermark
helps document the environment; lab_black
is my preferred Python formatter; and we need to tell matplotlib
to draw plots inline.
%load_ext watermark
%load_ext lab_black
%matplotlib inline
All imports go here. I should probably update all my imports to the latest versions to fix the warning diagnostic.
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import (
wls_prediction_std,
)
# build the design matrix: column 0 is all ones (intercept), column 1 is x
n_data = 200
x1 = np.linspace(0, 100, n_data)
x0 = np.ones(n_data)
X = np.array([[x_0, x_1] for x_0, x_1 in zip(x0, x1)])
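As an aside, the same design matrix can be built more directly with `np.column_stack`, which is equivalent to the list comprehension used here (a small sketch):

```python
import numpy as np

n_data = 200
x1 = np.linspace(0, 100, n_data)
x0 = np.ones(n_data)

# stack the two 1-D arrays side by side as columns:
# column 0 is all ones (intercept), column 1 is the x values
X_alt = np.column_stack([x0, x1])
```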
Choose a linear set of parameters, generate the true data, and plot them.
beta = np.array([5, -2]) # slope -2, intercept 5
y_true = X @ beta.T
plt.plot(x1, y_true, 'r-')
Now add noise, and plot again.
# add noise
sigma = 20
y_actual = y_true + np.random.normal(0, 1, n_data) * sigma
print(y_actual[0:4])
plt.plot(x1, y_actual, 'o')
plt.plot(x1, y_true, 'g-')
Perform Linear Regression by OLS¶
The matrix equation for the estimated linear parameters is as below:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
The expressions in Wikipedia assume vectors are column vectors by default, but numpy
treats vectors as row vectors by default. Thus, the `T`
operator appears in the code in places where it doesn't appear in the Wikipedia formulas, in order to turn row vectors into column vectors.
beta_estimated = np.linalg.inv(X.T @ X) @ X.T @ y_actual.T
beta_estimated
Calculate the residual sum of squares (RSS).
RSS = (
y_actual @ y_actual.T
- y_actual
@ X
@ np.linalg.inv(X.T @ X)
@ X.T
@ y_actual.T
)
RSS
Calculate the Total Sum of Squares (TSS): the spread of the actual (noisy) values around their mean.
y_mean = np.ones(n_data) * np.mean(y_actual)
TSS = (y_actual - y_mean).T @ (y_actual - y_mean)
TSS
y_pred = X @ beta_estimated.T
y_pred[0:4]
Calculate the Explained Sum of Squares (ESS): the spread of the predictions around their mean.
ESS = (y_pred - y_mean) @ (y_pred - y_mean).T
ESS
Prove that TSS = ESS + RSS
TSS, ESS + RSS
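Because of floating-point rounding, the two sides will not match to the last bit, so a tolerance-based check is safer. A self-contained sketch, regenerating the data with a fixed seed (the seed and the `default_rng` generator are my choices, not from the original notebook):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the check is reproducible
n_data = 200
x1 = np.linspace(0, 100, n_data)
X = np.column_stack([np.ones(n_data), x1])
beta = np.array([5, -2])
y_actual = X @ beta + rng.normal(0, 1, n_data) * 20

# OLS estimate via the normal equations
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y_actual
y_pred = X @ beta_hat
y_mean = np.full(n_data, y_actual.mean())

TSS = (y_actual - y_mean) @ (y_actual - y_mean)
ESS = (y_pred - y_mean) @ (y_pred - y_mean)
RSS = (y_actual - y_pred) @ (y_actual - y_pred)

# the decomposition TSS = ESS + RSS holds up to floating-point error
assert np.isclose(TSS, ESS + RSS)
```

The decomposition relies on the model containing an intercept column; without one, the cross term between residuals and fitted values does not vanish.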
Get the fraction of variation in the actual y values that is explained by the linear model (the $R^2$ value).
1 - RSS / TSS
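For simple one-predictor regression, this $R^2$ equals the squared Pearson correlation between x and y, which gives an independent cross-check. A sketch with synthetic data (seed and generator are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0, 100, n)
y = 5 - 2 * x + rng.normal(0, 20, n)

# R^2 from the OLS residuals
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
resid = y - X @ beta_hat
RSS = resid @ resid
TSS = (y - y.mean()) @ (y - y.mean())
r_squared = 1 - RSS / TSS

# squared Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(r_squared, r ** 2)
```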
Plot the result
plt.plot(x1, y_true, 'o', markersize=2)
plt.plot(x1, y_actual, 'r+')
plt.plot(x1, y_pred, 'g-')
Calculate the standard error of the regression. We divide by $(n-2)$, because the expectation of the residual sum of squares is $(n-2)\sigma^2$.
# standard error of regression
sr2 = (
(1 / (n_data - 2))
* (y_pred - y_actual)
@ (y_pred - y_actual).T
)
sr = np.sqrt(sr2)
sr
In order to get the standard errors for our linear parameters, we use the matrix formula below (with $\sigma^2$ estimated by the squared standard error of the regression):

$$\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$$
# get variance / covariance matrix
var_beta = sr2 * np.linalg.inv(X.T @ X)
var_beta
print(
f'Std Error for b0 {np.sqrt(var_beta[0, 0])}, \nStd Error for b1 {np.sqrt(var_beta[1, 1])}'
)
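These standard errors can also be cross-checked against `scipy.stats.linregress`, which reports `stderr` for the slope and (in SciPy 1.6+) `intercept_stderr` for the intercept. A self-contained sketch with synthetic data (seed and generator are my choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0, 100, n)
y = 5 - 2 * x + rng.normal(0, 20, n)

# standard errors from the matrix formula
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
resid = y - X @ beta_hat
sr2 = resid @ resid / (n - 2)
var_beta = sr2 * np.linalg.inv(X.T @ X)
se_b0, se_b1 = np.sqrt(np.diag(var_beta))

# standard errors as reported by scipy
res = stats.linregress(x, y)
assert np.isclose(se_b1, res.stderr)
assert np.isclose(se_b0, res.intercept_stderr)
```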
# extract the x column from the design matrix for the statsmodels comparison
x3 = X[:, 1]
data = pd.DataFrame({'x': x3, 'y': y_actual})
res = ols('y ~ x', data).fit()
res.summary()
We see that we get the same answers. I strongly suspect that statsmodels
has a fine-tuned algorithm that is more efficient than my numpy
version. After all, I use explicit matrix inversion, which is a minor numerical-methods sin.
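That explicit inverse can be avoided with `np.linalg.lstsq`, which solves the least-squares problem directly via a more stable factorization. A sketch showing the two approaches agree on this well-conditioned problem (seed and generator are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0, 100, n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([5, -2]) + rng.normal(0, 20, n)

# normal equations with an explicit inverse (the approach used above)
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# least-squares solve without forming an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_inv, beta_lstsq)
```

Avoiding the explicit inverse matters most when the columns of X are nearly collinear, where forming $X^T X$ squares the condition number of the problem.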
Environment Definition¶
%watermark -h -iv
%watermark