Stanford STATS191 in Python, Lecture 7 : Interactions and qualitative variables (Part 1)
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course (see https://web.stanford.edu/class/stats191/), using Python ecosystem tools instead of R. This is the lecture "Interactions and qualitative variables".
Initial Notebook Setup¶
watermark documents the current Python and package environment, and black is my preferred Python formatter
%load_ext watermark
%load_ext lab_black
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from patsy import dmatrices
import warnings
Contrived Dataset¶
Before I started STATS191, I decided to create a dataset with strong interaction effects, to see how OLS and statsmodels behaved.
Set up our data grid (in x and y)
n_data = 50
x_grid = np.linspace(0, 10, n_data)
y_grid = np.linspace(0, 10, n_data)
z = np.zeros((n_data, n_data))
beta = [2, 5, 8, 4]
Compute our response to the inputs x and y
for i in range(len(x_grid)):  # run over all x
    for j in range(len(y_grid)):  # run over all y
        x = x_grid[i]
        y = y_grid[j]
        z[i, j] = (
            beta[0]
            + x * beta[1]
            + y * beta[2]
            + x * y * beta[3]
        )
    # end for
# end for
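As an aside, the nested loops above can be replaced by NumPy broadcasting; a minimal self-contained sketch (re-declaring the grid and beta used above):

```python
import numpy as np

n_data = 50
x_grid = np.linspace(0, 10, n_data)
y_grid = np.linspace(0, 10, n_data)
beta = [2, 5, 8, 4]

# 2D coordinate arrays with X[i, j] = x_grid[i], Y[i, j] = y_grid[j]
X, Y = np.meshgrid(x_grid, y_grid, indexing='ij')

# evaluate the whole response surface in one vectorized expression
z_vec = beta[0] + X * beta[1] + Y * beta[2] + X * Y * beta[3]
```

The result matches the loop element for element, and avoids the Python-level double loop.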
Plot the surface we computed
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
# plot_surface needs 2D coordinate arrays; indexing='ij'
# matches z[i, j] = f(x_grid[i], y_grid[j])
x_mesh, y_mesh = np.meshgrid(x_grid, y_grid, indexing='ij')
# Plot the surface.
surf = ax.plot_surface(
    x_mesh,
    y_mesh,
    z,
    linewidth=5,
    antialiased=False,
    cmap='rainbow',
    alpha=0.5,
)
ax.set_xlabel('X')
ax.set_ylabel('Y')
fig.colorbar(surf, shrink=0.5, aspect=10)
I find it helps to visualize a 3D surface if contour lines are added
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
x_mesh, y_mesh = np.meshgrid(x_grid, y_grid, indexing='ij')
# Plot the contour lines in 3D.
surf = ax.contour(
    x_mesh,
    y_mesh,
    z,
    antialiased=False,
    cmap='rainbow',
    levels=20,
)
# Project the contour lines onto the z=0 plane
z_offset = 0
ax.contour(
    x_mesh,
    y_mesh,
    z,
    zdir='z',
    offset=z_offset,
    cmap='rainbow',
    alpha=0.9,
    levels=20,
)
# Add faded filled contours on the same plane
ax.contourf(
    x_mesh,
    y_mesh,
    z,
    zdir='z',
    offset=z_offset,
    cmap='rainbow',
    alpha=0.2,
    levels=20,
)
ax.set_xlabel('X')
ax.set_ylabel('Y')
fig.colorbar(surf, shrink=0.5, aspect=10)
A wireframe visualization is less useful
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
x_mesh, y_mesh = np.meshgrid(x_grid, y_grid, indexing='ij')
# Plot the wireframe (the lines are a single color, so no cmap is needed).
surf = ax.plot_wireframe(
    x_mesh,
    y_mesh,
    z,
    linewidth=1,
    antialiased=False,
    rcount=20,
    ccount=20,
)
ax.set_xlabel('X')
ax.set_ylabel('Y')
Add Noise, and Fit Model¶
n_points = n_data * n_data
z_points = np.zeros(n_points)
x_points = np.zeros(n_points)
y_points = np.zeros(n_points)
sigma = 2
error = np.random.normal(0, 1, n_points) * sigma
for i in range(len(x_grid)):  # run over all x
    for j in range(len(y_grid)):  # run over all y
        x = x_grid[i]
        y = y_grid[j]
        i_z = i * n_data + j
        x_points[i_z] = x
        y_points[i_z] = y
        z_points[i_z] = (
            beta[0]
            + x * beta[1]
            + y * beta[2]
            + x * y * beta[3]
        ) + error[i_z]
    # end for
# end for
df_dict = {'x': x_points, 'y': y_points, 'z': z_points}
data = pd.DataFrame(df_dict)
res = ols('z ~ x + y + x*y', data=data).fit()
res.summary()
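Note that in a patsy formula, `x*y` expands to `x + y + x:y`, so `'z ~ x + y + x*y'` and the shorter `'z ~ x*y'` yield the same design matrix. A quick check of the expansion (the toy values below are my own):

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# toy frame just to inspect the design matrix (values are arbitrary)
toy = pd.DataFrame(
    {'x': [0.0, 1.0, 2.0], 'y': [1.0, 3.0, 5.0], 'z': [1.0, 2.0, 4.0]}
)

_, rhs = dmatrices('z ~ x * y', toy)
print(rhs.design_info.column_names)  # ['Intercept', 'x', 'y', 'x:y']
```

The interaction column `x:y` is the elementwise product of the `x` and `y` columns.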
The OLS fit did a good job of returning the coefficients we used
beta
It also did a good job of estimating the noise standard deviation
sigma_hat = np.sqrt(res.mse_resid)
sigma_hat
Let's try again, but with much larger noise (sigma = 100 vs 2); we also reduce the interaction coefficient from 4 to 1
beta = [2, 5, 8, 1]
n_points = n_data * n_data
z_points = np.zeros(n_points)
x_points = np.zeros(n_points)
y_points = np.zeros(n_points)
sigma = 100
error = np.random.normal(0, 1, n_points) * sigma
for i in range(len(x_grid)):  # run over all x
    for j in range(len(y_grid)):  # run over all y
        x = x_grid[i]
        y = y_grid[j]
        i_z = i * n_data + j
        x_points[i_z] = x
        y_points[i_z] = y
        z_points[i_z] = (
            beta[0]
            + x * beta[1]
            + y * beta[2]
            + x * y * beta[3]
        ) + error[i_z]
    # end for
# end for
df_dict = {'x': x_points, 'y': y_points, 'z': z_points}
data = pd.DataFrame(df_dict)
res = ols('z ~ x + y + x*y', data=data).fit()
res.summary()
The OLS fit still does a good job of estimating the coefficients (excluding the intercept), and the noise standard deviation
beta
sigma_hat = np.sqrt(res.mse_resid)
sigma_hat
STATS191 Data Set¶
The STATS191 lecture starts with looking at a dataset of IT staff salaries, along with associated years of experience, and management and education levels.
We wish to explain how salaries (S) are affected by the three exogenous variables:
- X, experience (years)
- E, education (1=Bachelor's, 2=Master's, 3=Ph.D.)
- M, management (1=management, 0=not management)
Read and Explore Dataset¶
data = pd.read_csv('../data/salary.txt', sep='\t')
data.head()
data['E'].unique()
data['X'].unique()
Visualization¶
The quick and easy matplotlib scatter plot is not very useful. We use Seaborn to get a more meaningful graphic, changing the default symbols and colors to get exactly the same graph as R produced in the STATS191 lecture
plt.plot(data['X'], data['S'], 'o')
fig, ax = plt.subplots(figsize=(10, 10))
sn.scatterplot(
    x='X',
    y='S',
    data=data,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
Modelling Data¶
It appears visually that each set of points (i.e. for a given E, M combination) lies on a straight line, with a common slope.
To test this, we first create variables that mimic the effect of E being a category. Although E appears to be a numeric value, it is really categorical, and so we define a variable E2 that is 1 if the employee has a Master's, and zero otherwise (and similarly E3 for Ph.D.s). The M variable is already in this form, as there are only two levels of Management status.
data2 = data.copy()
data2['E2'] = [1 if e == 2 else 0 for e in data['E']]
data2['E3'] = [1 if e == 3 else 0 for e in data['E']]
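The list comprehensions above can also be written with `pandas.get_dummies`; a sketch on a toy education column (the `E_2` / `E_3` column names come from the `prefix` argument, not the dataset):

```python
import pandas as pd

# toy education column with the three levels used in the dataset
toy = pd.DataFrame({'E': [1, 2, 3, 2, 1]})

# one 0/1 indicator column per level; drop_first=True drops the E=1 column,
# making Bachelor's the reference category, as in the manual E2/E3 coding
dummies = pd.get_dummies(toy['E'], prefix='E', drop_first=True).astype(int)
toy = toy.join(dummies)
```

This scales better than hand-written comprehensions when a category has many levels.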
We specify a model where Salary (S) depends linearly on Experience (X), Education level (E2, E3), and Management status (M). As E2, E3, and M are either 0 or 1, they just move the line of common slope (Salary vs Experience) up or down.
res = ols('S ~ X + M + E2 + E3', data2).fit()
res.summary()
Using Pandas and OLS Together¶
Another way of doing this is to use pandas support for categorical variables. If we declare to pandas that Education (E) is a categorical variable, then the statsmodels OLS fitting will automatically create the extra variables for us, as below (note that all values in the OLS results are the same)
Define E to be a pandas category
data['E'] = data['E'].astype('category')
data.head()
data.dtypes
res = ols('S ~ X + M + E', data).fit()
res.summary()
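patsy can also mark a numeric column as categorical directly in the formula via `C()`, without changing the DataFrame dtype. A self-contained sketch on synthetic salary-like data (the coefficients and column values are made up for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(42)
n = 90

# synthetic stand-in for the salary data (all numbers invented)
sim = pd.DataFrame({
    'X': rng.uniform(1, 20, n),
    'E': rng.integers(1, 4, n),   # education level 1, 2 or 3
    'M': rng.integers(0, 2, n),   # management flag
})
# noiseless response, so the fit recovers the coefficients exactly
sim['S'] = (
    8000
    + 500 * sim['X']
    + 3000 * (sim['E'] == 2)
    + 2800 * (sim['E'] == 3)
    + 6000 * sim['M']
)

# C(E) asks patsy to dummy-code E, with the lowest level as the reference
res_c = ols('S ~ X + M + C(E)', data=sim).fit()
print(res_c.params.index.tolist())
```

The fitted parameters include `C(E)[T.2]` and `C(E)[T.3]`, exactly the roles played by the manual E2 and E3 columns.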
Assessing the Model¶
Below, we generate a quick and easy plot of the actual values of Salary, and the predicted values. It appears we got the slope right, but the predicted offset for the line within each E, M group is not so good.
plt.plot(data['X'], data['S'], 'o')
plt.plot(data['X'], res.predict(), 'r+')
Detailed Visualization of Residuals¶
We generate some more detailed graphics showing predicted values against actuals. To do this, we have to extend our DataFrame to include the predicted values.
We first show the predicted points (P=1) (small markers = predicted, large markers = actuals).
Then we visualize the fit of the estimated lines to the actuals.
data3 = data.copy()
data3['S'] = res.predict()
data3['P'] = [1] * len(data3)
data2 = data.copy()
data2['P'] = [0] * len(data2)
data4 = pd.concat([data2, data3])
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    x='X',
    y='S',
    data=data4,
    hue='E',
    alpha=1,
    style='M',
    size='P',
    palette=['r', 'g', 'b'],
    ax=ax,
    sizes=[200, 100],
    markers=['d', '^'],
)
We visualize estimated lines against actual Salary values
data7 = data.copy()
# compute group for data points (3 education levels * 2 management levels)
data7['G'] = [
    2 * e - 1 + m for e, m in zip(data['E'], data['M'])
]
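The expression `2*E - 1 + M` simply assigns a distinct label 1..6 to each of the six (E, M) cells:

```python
# G = 2*E - 1 + M assigns labels 1..6 to the six (E, M) cells
groups = {(e, m): 2 * e - 1 + m for e in (1, 2, 3) for m in (0, 1)}
print(groups)
# {(1, 0): 1, (1, 1): 2, (2, 0): 3, (2, 1): 4, (3, 0): 5, (3, 1): 6}
```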
Compute the predicted lines for each (E, M) group
data['P'] = res.predict()
cols = ['r', 'g', 'b']
x = []
y = []
c = []
for i in range(1, 7):
    x.append(list(data[data7['G'] == i]['X']))
    y.append(list(data[data7['G'] == i]['P']))
    icol = (i + 1) // 2
    c.append(cols[icol - 1])
# end for
fig, ax = plt.subplots(figsize=(10, 10))
sn.scatterplot(
    x='X',
    y='S',
    data=data,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
for xx, yy, col in zip(x, y, c):
    ax.plot(xx, yy, col + '--')
# end for
We do a quick and easy plot of the residuals against Experience (X)
plt.plot(data['X'], res.resid, 'o')
Conclusions¶
It is clear that our model could be improved (even though R^2 is quite close to 1.0)
Include Interactions Between Input Variables¶
We now move on to models that include interactions between input variables
Education and Experience Interaction¶
First test if Education Level (E) changes how Experience (X) is valued (i.e. the slope of Salary against Experience changes depending upon the Education level)
res2 = ols('S ~ E + M + X + X:E', data).fit()
res2.summary()
We can test if the new model explains significantly more of the variance: it appears to explain more, but not at a significant level (we might see such an improvement 15 times out of 100 by random chance).
Note that we have to wrap our anova_lm call to suppress some spurious warnings.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    anv = sm.stats.anova_lm(res, res2)
# end with
anv
We visualize how the residuals behave with this model
plt.plot(data['X'], data['S'], 'o')
plt.plot(data['X'], res2.predict(), 'r+')
As before, we define a pandas column that is the group (E, M combination) the observation belongs to. We then use this in Seaborn, to show the distribution of residuals within each group
data4 = data.copy()
data4['R'] = res2.resid
data4['G'] = [
    2 * e - 1 + m for e, m in zip(data['E'], data['M'])
]
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    x='G',
    y='R',
    data=data4,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
ax.axhline(0.0, color='k')
Define a column G giving a group number for each (E, M) combination
data7 = data.copy()
data7['G'] = [
    2 * e - 1 + m for e, m in zip(data['E'], data['M'])
]
Define arrays holding the predicted salary for each observation within each group; c holds the color of the line to draw
data['P'] = res2.predict()
cols = ['r', 'g', 'b']
x = []
y = []
c = []
for i in range(1, 7):
    x.append(list(data[data7['G'] == i]['X']))
    y.append(list(data[data7['G'] == i]['P']))
    icol = (i + 1) // 2
    c.append(cols[icol - 1])
# end for
Use Seaborn to plot the lines of predicted Salary vs Experience, for each group
fig, ax = plt.subplots(figsize=(10, 10))
sn.scatterplot(
    x='X',
    y='S',
    data=data,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
for xx, yy, col in zip(x, y, c):
    ax.plot(xx, yy, col + '--')
# end for
We can see that our model has produced lines of different slope (Salary vs Experience) as expected, but there still seems to be non-random, systematic error in our predictions.
We can visualize this via an Interaction Plot, showing the effect both Education and Management Status have on the average of the residuals (which should be zero)
fig, ax = plt.subplots(figsize=(6, 6))
sm.graphics.interaction_plot(
data['E'],
data['M'],
res2.resid,
ax=ax,
xlabel='Education',
ylabel='Residual',
)
plt.show()
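interaction_plot simply plots the mean of the response for each (trace, x-axis) cell and joins the points; the same cell means can be computed with a pandas groupby (the toy residual values below are my own):

```python
import pandas as pd

# toy residuals for each (E, M) cell (values invented)
toy = pd.DataFrame({
    'E': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    'M': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'resid': [1.0, 3.0, -1.0, 1.0, 2.0, 4.0,
              0.0, 2.0, -2.0, 0.0, 5.0, 7.0],
})

# one mean per (M, E) cell -- these are the points interaction_plot connects
cell_means = toy.groupby(['M', 'E'])['resid'].mean().unstack()
print(cell_means)
```

If the two M traces are not parallel across E, that is visual evidence of an interaction.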
Education and Management Interaction¶
In our next model, we test if there is an interaction between Education Level (E) and Management Status (M). If we look at the original data plot at the top, we notice that for management staff (unlike operational staff), a Ph.D. is not valued more highly than a Master's, so we should expect some interaction here.
We build a linear model as before, but now also add Education and Management Status interaction terms. The interaction coefficients (beta_5 and beta_6) will be zero under the Null Hypothesis (H0) that there is no Education * Management Status interaction.
res3 = ols('S ~ X + E + M + E:M', data=data).fit()
res3.summary()
We find that we must reject the Hypothesis that all coefficients are zero.
We now check to see if significantly more variance is explained; we find that it is
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    anv = sm.stats.anova_lm(res, res3)
# end with
anv
As before, we now plot our residuals, by group, and also by Experience (X)
data5 = data.copy()
data5['R'] = res3.resid
data5['G'] = [
    2 * e - 1 + m for e, m in zip(data['E'], data['M'])
]
Residuals plotted for each group
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    x='G',
    y='R',
    data=data5,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
ax.axhline(0.0, color='k')
Residuals plotted by Experience (X)
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    x='X',
    y='R',
    data=data5,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
ax.axhline(0.0, color='k')
We clearly have a single outlier. We identify this observation
res3_outlier = res3.outlier_test()
_ = plt.plot(
    range(len(res3.resid)), res3_outlier['bonf(p)'], 'o'
)
data[res3_outlier['bonf(p)'] < 0.8][['S', 'X', 'E', 'M']]
Check it is the bad value (it is)
print(f'Residual of Outlier = {res3.resid[32]}')
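The bonf(p) values used to flag the outlier can be reproduced by hand: they are two-sided p-values of the externally studentized residuals, multiplied by the sample size (Bonferroni correction) and capped at 1. A sketch on synthetic data with one planted outlier (the model and values are my own illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 40

# synthetic regression with one gross outlier planted at row 0
sim = pd.DataFrame({'x': rng.normal(size=n)})
sim['y'] = 2 + 3 * sim['x'] + rng.normal(size=n)
sim.loc[0, 'y'] += 15

res_o = ols('y ~ x', data=sim).fit()
ot = res_o.outlier_test()

# bonf(p): two-sided p-value of the externally studentized residual,
# multiplied by n (Bonferroni correction) and capped at 1
t_ext = res_o.get_influence().resid_studentized_external
p_raw = 2 * stats.t.sf(np.abs(t_ext), res_o.df_resid - 1)
p_bonf = np.minimum(p_raw * n, 1.0)
```

A small bonf(p) for an observation means it is unlikely to arise by chance even after accounting for having looked at all n residuals.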
Drop the bad row, and re-do our two models (original pure linear, and linear plus E x M interactions)
data_clean = data.drop([32]).copy()
res_clean = ols('S ~ E + X + M', data=data_clean).fit()
res_clean3 = ols(
    'S ~ E + X + M + E:M', data=data_clean
).fit()
We test if the improved explanation of variance in the second model could be due to chance (it is not, to a very high degree of certainty)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    anv = sm.stats.anova_lm(res_clean, res_clean3)
# end with
anv
res_clean3.summary()
We plot the predicted values against the actuals; we now get a much better fit.
plt.plot(data_clean['X'], data_clean['S'], 'o')
plt.plot(data_clean['X'], res_clean3.predict(), 'r+')
We plot the linear components of our model
data4 = data_clean.copy()
data4['G'] = [
    2 * e - 1 + m
    for e, m in zip(data_clean['E'], data_clean['M'])
]
data_clean['P'] = res_clean3.predict()
cols = ['r', 'g', 'b']
x = []
y = []
c = []
for i in range(1, 7):
    x.append(list(data_clean[data4['G'] == i]['X']))
    y.append(list(data_clean[data4['G'] == i]['P']))
    icol = (i + 1) // 2
    c.append(cols[icol - 1])
# end for
fig, ax = plt.subplots(figsize=(10, 10))
sn.scatterplot(
    x='X',
    y='S',
    data=data_clean,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
for xx, yy, col in zip(x, y, c):
    ax.plot(xx, yy, col + '--')
# end for
The plot of standardized residuals shows that they look as random as we would expect from our model
data_clean['Standard Residual'] = res_clean3.outlier_test()[
    'student_resid'
]
fig, ax = plt.subplots(figsize=(10, 10))
sn.scatterplot(
    x='X',
    y='Standard Residual',
    data=data_clean,
    hue='E',
    alpha=1,
    style='M',
    palette=['r', 'g', 'b'],
    ax=ax,
    s=200,
    markers=['d', '^'],
)
If we display the interaction plot, using Salary adjusted for Experience (i.e. leaving just the effect of Management Status and Education), we can see that for managers a Ph.D. is valued less than a Master's (unlike for operational staff).
fig, ax = plt.subplots(figsize=(10, 10))
U = data_clean['S'] - data_clean['X'] * res_clean3.params['X']
sm.graphics.interaction_plot(
data_clean['E'],
data_clean['M'],
U,
ax=ax,
xlabel='Education',
ylabel='Salary adjusted for Experience',
)
plt.show()
Residual Comparison¶
Finally, we can compare the residuals from both models in the one graph. The second model is clearly superior.
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(
data_clean['X'],
res_clean.resid,
'o',
label='Pure Linear Model',
)
ax.plot(
data_clean['X'],
res_clean3.resid,
'o',
label='Linear + Education*Management Role Interactions',
)
ax.grid()
ax.legend(loc='best')
ax.set_xlabel('Experience')
ax.set_ylabel('Residual (Actual-Predicted)')
Environment¶
%watermark -h -iv
%watermark