Stanford STATS191 in Python, Lecture 12 : Selection
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course (see https://web.stanford.edu/class/stats191/), using Python ecosystem tools instead of R. This is the lecture on "Selection".
This lecture was the one with the most divergence between Python statsmodels and R. The standard R tools for refining a regression model by removing irrelevant parameters appear not to be matched by functions available in statsmodels.
However, Machine Learning has model refinement as one of its core processes, and this post will cover some of the Machine Learning tools available in Python.
One of the datasets used by the STATS191 course dealt with USA Presidential Election votes, and a number of proposed explanatory parameters (Incumbency, Economic Growth, etc.). I started out cranking the numbers on this dataset, but then started to have my doubts. The dataset covers the whole of the 20th century; are we to believe that the fundamental mechanisms that operated to decide political choice in 1912 are the same as in 2000, given the huge demographic and cultural changes in the USA? Maybe the answer is "Yes", but that is a large claim that I would want justified before blindly accepting the results of a statistical model.
Initial Notebook Setup¶
watermark documents the current Python and package environment; black is my preferred Python formatter.
%load_ext watermark
%load_ext lab_black
%matplotlib inline
Imports¶
import pandas as pd
import numpy as np
import seaborn as sns
import math
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from sklearn import linear_model
Model Selection¶
Election Dataset¶
As an example of the issues around selecting the best linear model, we use a dataset with the attributes below.
Variable | Description |
---|---|
V | votes for a presidential candidate |
I | are they incumbent? |
D | Democrat or Republican incumbent? |
W | wartime election? |
G | GDP growth rate in election year |
P | (absolute) GDP deflator growth rate |
N | number of quarters in which GDP growth rate >3.2% |
Read and Explore Dataset¶
data = pd.read_csv('../data/election.txt', sep='\t')
data.head()
We look at the unique values for each column
for n in data.columns:
    uniq_val = data[n].unique()
    print(f'Column {n}: {uniq_val}')
# end for
Visualization¶
We can visualize the relationships between columns using Seaborn. We exclude the first column, as it just holds the year of the election.
sns.pairplot(data=data[data.columns[1:]])
Cleaning Dataset¶
We find trailing whitespace in some column names, so we strip it
_ = [print(f'>{n}<') for n in data.columns]
new_names = [n.strip() for n in data.columns]
_ = [print(f'>{n}<') for n in new_names]
data.columns = new_names
Model Building¶
As a start in model building, we incorporate almost all the possible explanatory parameters, and add in the interaction between Incumbency and Economic Growth (rather than Growth on its own).
res0 = ols('V ~ I + D + W + G:I + P + N', data=data).fit()
res0.summary()
A number of the parameters have coefficients with p values suggesting they might be candidates for exclusion from the model
Stepwise Model Building¶
We next fit each of our proposed parameters on its own, and look at how much variance it explains
for variable in ['I', 'D', 'W', 'G:I', 'P', 'N']:
    model = 'V ~ ' + variable
    res1 = ols(model, data=data).fit()
    print(
        f' variable {variable:5} alone has R^2 = {res1.rsquared:6.4f}'
    )
# end for
We now repeat a process of finding the unused parameter (one not already in our evolving model) that most improves our model score (here R^2), and then adding it to our model.
# start with empty model
base_model = 'V ~ '
variables_start = ['I', 'D', 'W', 'G:I', 'P', 'N']
#
# variables is the list of as yet unused parameters
#
variables = ['I', 'D', 'W', 'G:I', 'P', 'N']
# as we improve(?) the model, we track the improvement in R^2
model_r2 = []
# run over all unused parameters, find one that when added to the current model, maximizes R^2
for count in range(1, len(variables) + 1):
    max_r2 = -1  # sentinel: smaller than any possible R^2
    for variable in variables:
        model = base_model + ' + ' + variable
        res1 = ols(model, data=data).fit()
        # is this a better model than any seen so far?
        if res1.rsquared > max_r2:
            max_var = variable
            max_r2 = res1.rsquared
        # end if
    # end for
    # update model with the parameter that maximizes R^2
    base_model = base_model + ' + ' + max_var
    print(
        f' model {base_model:30} has R^2 = {max_r2:6.4f}'
    )
    # update list of unused variables, by removing the one just added
    variables.remove(max_var)
    model_r2.append(max_r2)
# end for
Do a quick and easy plot showing the improvement in R^2
_ = plt.plot(
    list(range(1, len(variables_start) + 1)), model_r2, 'o'
)
Adjusted R^2¶
Do this all over again, but using adjusted R^2 as the score. The adjusted R^2 increases only if a new term improves the model more than would be expected by chance, and decreases when a predictor improves the model by less than expected by chance.
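The adjustment comes directly from a formula, so it can be checked by hand. A minimal sketch (the numbers below are made up, not taken from the election dataset):

```python
def r2_adjusted(r2, n, p):
    # R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    # n = number of observations, p = number of predictors (excluding intercept)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# a small R^2 gain bought with three extra predictors can lower adjusted R^2
print(r2_adjusted(0.80, n=22, p=2))
print(r2_adjusted(0.81, n=22, p=5))
```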
base_model = 'V ~ '
variables_start = ['I', 'D', 'W', 'G:I', 'P', 'N']
variables = ['I', 'D', 'W', 'G:I', 'P', 'N']
model_r2 = []
for count in range(1, len(variables) + 1):
    max_r2 = float(
        '-inf'
    )  # must be smaller than the first adjusted R^2 value
    for variable in variables:
        model = base_model + ' + ' + variable
        res1 = ols(model, data=data).fit()
        if res1.rsquared_adj > max_r2:
            max_var = variable
            max_r2 = res1.rsquared_adj
        # end if
    # end for
    base_model = base_model + ' + ' + max_var
    print(
        f' model {base_model:30} has R^2 (adjusted) = {max_r2:6.4f}'
    )
    variables.remove(max_var)
    model_r2.append(max_r2)
# end for
The adjusted R^2 plot shows that the last two parameters added to the model have made it worse
_ = plt.plot(
    list(range(1, len(variables_start) + 1)), model_r2, 'o'
)
AIC as a Model Score¶
"The Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models."
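In formula terms, AIC = 2k − 2 ln(L), where k is the number of estimated parameters and ln(L) is the maximized log-likelihood; lower AIC is better. A minimal sketch with made-up numbers:

```python
def aic(log_likelihood, k):
    # AIC = 2 * k - 2 * ln(L); smaller values indicate a better model
    return 2 * k - 2 * log_likelihood


# two hypothetical fits: the second has a slightly higher likelihood,
# but pays for two extra parameters, so its AIC is worse (larger)
print(aic(-45.0, k=3))  # 96.0
print(aic(-44.5, k=5))  # 99.0
```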
We repeat the step-by-step addition of parameters to the model, this time using AIC as our model score (for AIC, lower is better)
base_model = 'V ~ '
variables_start = ['I', 'D', 'W', 'G:I', 'P', 'N']
variables = ['I', 'D', 'W', 'G:I', 'P', 'N']
model_aic = []
for count in range(1, len(variables) + 1):
    min_aic = float(
        'inf'
    )  # must be larger than any AIC value (lower AIC is better)
    for variable in variables:
        model = base_model + ' + ' + variable
        res1 = ols(model, data=data).fit()
        if res1.aic < min_aic:
            min_var = variable
            min_aic = res1.aic
            r2_at_min_aic = res1.rsquared
        # end if
    # end for
    base_model = base_model + ' + ' + min_var
    print(
        f' model {base_model:30} has AIC = {min_aic:6.4f}, R^2 = {r2_at_min_aic:6.4f}'
    )
    variables.remove(min_var)
    model_aic.append(min_aic)
# end for
The quick and easy plot of AIC scores suggests that the last parameter should definitely be dropped from our model, and that really only the first two parameters should be in the model
_ = plt.plot(
    list(range(1, len(variables_start) + 1)), model_aic, 'o'
)
BIC for Model Score¶
"The Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function, and it is closely related to Akaike information criterion (AIC). ... The penalty term is larger in BIC than in AIC:"
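BIC replaces AIC's 2k penalty with k ln(n), so once there are more than about 8 observations (ln(n) > 2) it penalizes each extra parameter more heavily than AIC does. A minimal sketch with made-up numbers:

```python
import math


def bic(log_likelihood, k, n):
    # BIC = k * ln(n) - 2 * ln(L); smaller values indicate a better model
    return k * math.log(n) - 2 * log_likelihood


# the same hypothetical fit scored both ways: with n = 22 observations,
# ln(22) > 2, so BIC's penalty on the 5 parameters exceeds AIC's
print(bic(-44.5, k=5, n=22))  # about 104.5
print(2 * 5 - 2 * (-44.5))    # AIC = 99.0, for comparison
```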
base_model = 'V ~ '
variables_start = ['I', 'D', 'W', 'G:I', 'P', 'N']
variables = ['I', 'D', 'W', 'G:I', 'P', 'N']
model_bic = []
for count in range(1, len(variables) + 1):
    min_bic = float(
        'inf'
    )  # must be larger than any BIC value (lower BIC is better)
    for variable in variables:
        model = base_model + ' + ' + variable
        res1 = ols(model, data=data).fit()
        if res1.bic < min_bic:
            min_var = variable
            min_bic = res1.bic
            r2_at_min_bic = res1.rsquared
        # end if
    # end for
    base_model = base_model + ' + ' + min_var
    print(
        f' model {base_model:30} has BIC = {min_bic:6.4f}, R^2 = {r2_at_min_bic:6.4f}'
    )
    variables.remove(min_var)
    model_bic.append(min_bic)
# end for
_ = plt.plot(
    list(range(1, len(variables_start) + 1)), model_bic, 'o'
)
Least Absolute Shrinkage and Selection Operator (LASSO)¶
We now move to the functions provided by Python's Machine Learning packages, which offer a variety of penalties for an excessive number of parameters in a model (under a variety of opaque abbreviations: well, opaque to me; they are probably very familiar to every-day users).
The key concept is an alpha value that sets the strength of the penalty applied to (irrelevant) parameters. In the code below, we step through a number of alpha values (alpha = 0 is equivalent to Ordinary Least Squares regression), and for each alpha value record the coefficient of each parameter in our model. The default score returned is our old friend R^2
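Under the hood, scikit-learn's Lasso minimizes ||y − Xw||² / (2 · n_samples) + alpha · ||w||₁, and the L1 term drives small coefficients exactly to zero as alpha grows. A minimal sketch on synthetic (made-up) data, not the election dataset:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# y depends strongly on column 0, weakly on column 1, not at all on column 2
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=100)

for alpha in [0.001, 0.1, 1.0]:
    reg = linear_model.Lasso(alpha=alpha)
    reg.fit(X, y)
    # as alpha increases, the weak and irrelevant coefficients hit zero first
    print(alpha, np.round(reg.coef_, 3))
# end for
```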
data['GI'] = data['G'] * data['I']
# set up a dict that will hold, for each parameter, the list of
# coefficient values (one per alpha value); initially each list is empty
var_coefs = {
    n: []
    for n in ['Intercept', 'I', 'D', 'W', 'P', 'N', 'GI']
}
# scores holds the R^2 value for each model
scores = []
# run over all alpha values
for alpha in [0.00001, 0.0005, 0.001, 0.002, 0.005]:
    # perform regularized regression for the current value of alpha
    # (note: the normalize= parameter was removed in scikit-learn 1.2;
    #  newer versions need a StandardScaler pipeline instead)
    reg = linear_model.Lasso(alpha=alpha, normalize=True)
    reg.fit(
        np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
        np.asarray(data['V']),
    )
    # update the coefficient list for each parameter, for this alpha value
    for n, v in zip(
        data[['I', 'D', 'W', 'P', 'N', 'GI']].columns,
        reg.coef_,
    ):
        var_coefs[n].append(v)
    # end for
    var_coefs['Intercept'].append(reg.intercept_)
    # show the R^2 score for this alpha
    print(f'Score for alpha = {alpha}')
    print(
        reg.score(
            np.asarray(
                data[['I', 'D', 'W', 'P', 'N', 'GI']]
            ),
            np.asarray(data['V']),
        )
    )
    # append this score to the list of scores
    scores.append(
        reg.score(
            np.asarray(
                data[['I', 'D', 'W', 'P', 'N', 'GI']]
            ),
            np.asarray(data['V']),
        )
    )
# end for
Do a quick plot. It appears that alpha = 0.002 removes most of the irrelevant parameters (by setting their coefficients to 0)
alphas = [0.00001, 0.0005, 0.001, 0.002, 0.005]
for n in ['I', 'D', 'W', 'P', 'N', 'GI']:
    _ = plt.plot(alphas, var_coefs[n], '-o', label=n)
# end for
for x in res0.params[1:]:
    _ = plt.plot(0, x, 'r+')
# end for
# re-plot the last point with a label, so the legend gets a single entry
_ = plt.plot(0, x, 'r+', label='OLS (alpha=0)')
_ = plt.axhline(0, color='k', alpha=0.1)
_ = plt.legend(loc='best')
Do a more detailed plot, showing the model score alongside the parameter coefficient values
fig, ax = plt.subplots(figsize=(10, 10))
for n in ['I', 'D', 'W', 'P', 'N', 'GI']:
    _ = ax.plot(alphas, var_coefs[n], '-o', label=n)
# end for
# show the original OLS values
for x in res0.params[1:]:
    _ = ax.plot(0, x, 'k*')
# end for
_ = ax.plot(0, x, 'k*', label='OLS (alpha=0)')
_ = ax.axhline(0, color='k', alpha=0.4)
_ = ax.legend(loc='upper center')
ax.set_ylabel('Parameter Coefficient')
ax.set_xlabel('Alpha')
ax.set_title('LASSO Regression')
ax2 = ax.twinx()
ax2.plot(alphas, scores, 'ro', label='Model Score')
ax2.set_ylabel('Model Score (R^2)')
ax2.legend(loc='upper right')
Ridge Regression¶
We repeat the exercise with Ridge regression, which penalizes the sum of squared coefficients (an L2 penalty): it shrinks coefficients towards zero but, unlike LASSO, does not usually set them exactly to zero.
data['GI'] = data['G'] * data['I']
var_coefs = {
    n: []
    for n in ['Intercept', 'I', 'D', 'W', 'P', 'N', 'GI']
}
scores = []
alphas = [0.00001, 0.005, 0.01, 0.1, 0.5, 1, 2, 10]
for alpha in alphas:
    reg = linear_model.Ridge(alpha=alpha, normalize=True)
    reg.fit(
        np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
        np.asarray(data['V']),
    )
    for n, v in zip(
        data[['I', 'D', 'W', 'P', 'N', 'GI']].columns,
        reg.coef_,
    ):
        var_coefs[n].append(v)
    # end for
    var_coefs['Intercept'].append(reg.intercept_)
    print(f'Score for alpha = {alpha}')
    print(
        reg.score(
            np.asarray(
                data[['I', 'D', 'W', 'P', 'N', 'GI']]
            ),
            np.asarray(data['V']),
        )
    )
    scores.append(
        reg.score(
            np.asarray(
                data[['I', 'D', 'W', 'P', 'N', 'GI']]
            ),
            np.asarray(data['V']),
        )
    )
# end for
The plot indicates that the N parameter probably should be dropped from the model.
fig, ax = plt.subplots(figsize=(10, 10))
for n in ['I', 'D', 'W', 'P', 'N', 'GI']:
    _ = ax.plot(alphas, var_coefs[n], '-o', label=n)
# end for
for x in res0.params[1:]:
    _ = ax.plot(0, x, 'k*')
# end for
_ = ax.plot(0, x, 'k*', label='OLS (alpha=0)')
_ = ax.axhline(0, color='k', alpha=0.4)
_ = ax.legend(loc='upper center')
ax.set_ylabel('Parameter Coefficient')
ax.set_xlabel('Alpha')
ax.set_title('RIDGE Regression')
ax2 = ax.twinx()
ax2.plot(alphas, scores, 'ro', label='Model Score')
ax2.set_ylabel('Model Score (R^2)')
ax2.legend(loc='upper right')
LASSO with Cross Validation¶
One problem with penalizing unhelpful parameters (with strength set by alpha) is: what value of alpha should I use? The Cross Validation process builds a model from a subset of the dataset, and scores that model by its prediction error on the remaining subset. We vary alpha until we get a minimum error on the validation dataset
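The process LassoCV automates can be sketched by hand with KFold. This is only an illustration on synthetic (made-up) data, with a hand-picked alpha grid; LassoCV chooses its own grid of alpha values:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

best_alpha, best_mse = None, float('inf')
for alpha in [0.001, 0.01, 0.1, 1.0]:
    fold_mse = []
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train, test in kf.split(X):
        reg = linear_model.Lasso(alpha=alpha)
        # fit on the training folds only
        reg.fit(X[train], y[train])
        # score by prediction error on the held-out fold
        fold_mse.append(np.mean((reg.predict(X[test]) - y[test]) ** 2))
    # end for
    if np.mean(fold_mse) < best_mse:
        best_alpha, best_mse = alpha, np.mean(fold_mse)
# end for
print(best_alpha, best_mse)
```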
reg = linear_model.LassoCV(normalize=True, cv=5)
reg.fit(
    np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
    np.asarray(data['V']),
)
score = reg.score(
    np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
    np.asarray(data['V']),
)
print(score)
all_coef = [reg.intercept_] + list(reg.coef_)
all_names = ['Intercept'] + ['I', 'D', 'W', 'P', 'N', 'GI']
_ = [
    print(f"{n:15} = {v: 6.4f}")
    for n, v in zip(all_names, all_coef)
]
reg.alpha_
reg.alpha_
From the average of the error score across repeated subsetting of the original dataset into model-building and validation subsets, we get a minimum at about alpha = 0.002
_ = plt.plot(reg.alphas_, reg.mse_path_, ':', alpha=1)
_ = plt.plot(
    reg.alphas_,
    reg.mse_path_.mean(axis=-1),
    'k',
    label='Average across the folds',
    linewidth=2,
)
_ = plt.axvline(
    reg.alpha_,
    linestyle='--',
    color='r',
    label='alpha: CV estimate',
)
_ = plt.legend(loc='best')
LASSO and Information Criteria¶
We can score the model in different ways; below we use the AIC and BIC criteria, which give roughly the same value of alpha. I am a little suspicious of the results, as AIC- and BIC-scoring show exactly the same minimum.
The result is that we retain the Intercept, D, and GI (the G and I interaction) parameters in our model, and discard the rest.
reg_bic = linear_model.LassoLarsIC(
    normalize=True, criterion='bic'
)
reg_bic.fit(
    np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
    np.asarray(data['V']),
)
score = reg_bic.score(
    np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
    np.asarray(data['V']),
)
print(f'BIC as criterion, score = {score}')
reg_aic = linear_model.LassoLarsIC(
    normalize=True, criterion='aic'
)
reg_aic.fit(
    np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
    np.asarray(data['V']),
)
score = reg_aic.score(
    np.asarray(data[['I', 'D', 'W', 'P', 'N', 'GI']]),
    np.asarray(data['V']),
)
print(f'AIC as criterion, score = {score}')
_ = plt.plot(reg_bic.alphas_, reg_bic.criterion_, 'b-')
_ = plt.axvline(reg_bic.alpha_, color='b')
_ = plt.plot(reg_aic.alphas_, reg_aic.criterion_, 'r-')
_ = plt.axvline(reg_aic.alpha_, color='r')
all_coef = [reg_aic.intercept_] + list(reg_aic.coef_)
all_names = ['Intercept'] + ['I', 'D', 'W', 'P', 'N', 'GI']
_ = [
    print(f"{n:15} = {v: 6.4f}")
    for n, v in zip(all_names, all_coef)
]
all_coef = [reg_bic.intercept_] + list(reg_bic.coef_)
all_names = ['Intercept'] + ['I', 'D', 'W', 'P', 'N', 'GI']
_ = [
    print(f"{n:15} = {v: 6.4f}")
    for n, v in zip(all_names, all_coef)
]
Environment¶
%watermark -h -iv
%watermark
sm.show_versions()