Stanford STATS191 in Python, Lecture 7 : Interactions and qualitative variables (Part 2)

Fri 27 March 2020

Filed under scipy

Tags statistics scipy statsmodels

Stanford Stats 191

Introduction

This is a re-creation of the Stanford Stats 191 course (see https://web.stanford.edu/class/stats191/), using Python eco-system tools, instead of R. This is lecture "Interactions and qualitative variables", Part 2

Initial Notebook Setup

watermark documents the current Python and package environment, black is my preferred Python formatter

In [54]:
%load_ext watermark
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
In [55]:
%load_ext lab_black
The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black
In [56]:
%matplotlib inline
In [57]:
import pandas as pd
import numpy as np
import seaborn as sn

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm


import warnings

Load and Explore Example Dataset

Variable Description
TEST Job aptitude test score
MINORITY 1 if applicant could be considered minority, 0 otherwise
PERF Job performance evaluation
In [58]:
data = pd.read_csv('../data/jobtest.txt', sep='\t')

# sort the Test Score data to make graphic easier

data = data.sort_values(by='TEST').copy()
In [59]:
data.head()
Out[59]:
TEST MINORITY JPERF
0 0.28 1 1.83
18 0.42 0 3.85
12 0.45 0 1.39
17 0.72 0 1.90
1 0.97 1 4.59
In [60]:
data['MINORITY'].unique()
Out[60]:
array([1, 0], dtype=int64)

We change the color and symbol used to distinguish observations with different Minority status (we only need two different colors and two different marker symbols)

In [61]:
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    'TEST',
    'JPERF',
    data=data,
    hue='MINORITY',
    ax=ax,
    alpha=0.99,
    s=100,
    markers=['o', 'v'],
    palette=['purple', 'green'],
    style='MINORITY',
)
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x2206da4c7f0>

Develop Models

Our general model has the equation as below

$$JPERF_i = \beta_0 + \beta_1 TEST_i + \beta_2 MINORITY_i + \beta_3 MINORITY_i * TEST_i + \varepsilon_i.$$

Model 1, TEST Only

First, we create a model where Job Performance depends only on the Test Scores (equivalent to beta-2, beta-3 = 0 in the equation above)

$$JPERF_i = \beta_0 + \beta_1 TEST_i + \varepsilon_i.$$
In [62]:
res1 = ols('JPERF ~ TEST', data=data).fit()
res1.summary()
Out[62]:
OLS Regression Results
Dep. Variable: JPERF R-squared: 0.517
Model: OLS Adj. R-squared: 0.490
Method: Least Squares F-statistic: 19.25
Date: Thu, 26 Mar 2020 Prob (F-statistic): 0.000356
Time: 16:22:17 Log-Likelihood: -36.614
No. Observations: 20 AIC: 77.23
Df Residuals: 18 BIC: 79.22
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.0350 0.868 1.192 0.249 -0.789 2.859
TEST 2.3605 0.538 4.387 0.000 1.230 3.491
Omnibus: 0.324 Durbin-Watson: 1.852
Prob(Omnibus): 0.850 Jarque-Bera (JB): 0.483
Skew: -0.186 Prob(JB): 0.785
Kurtosis: 2.336 Cond. No. 5.26


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Display the line of best fit (not looking as if this model is wonderful!)

In [63]:
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    'TEST',
    'JPERF',
    data=data,
    hue='MINORITY',
    ax=ax,
    alpha=0.99,
    s=200,
    markers=['o', 'v'],
    palette=['purple', 'green'],
    style='MINORITY',
)

y1 = res1.predict()
ax.plot(data['TEST'], y1, 'r-', label='Predicted')
ax.grid()
ax.legend(loc='best')
Out[63]:
<matplotlib.legend.Legend at 0x2206da864e0>

Display the residuals. At least they appear to have constant variance

In [64]:
plt.plot(data['TEST'], res1.resid, 'o')
plt.axhline(0, color='r')
Out[64]:
<matplotlib.lines.Line2D at 0x2206eb3ad30>

Model 2, TEST and Minority Status

We now build a model where Job Performamnce depends upon Test Scores and Minority Status. This maodel says we have the same slope of the JPERF - TEST line for each Minority Status (but each Minority Status line has a different intercept) This is equivalent to setting beta-3 = 0 in our general equation above

$$JPERF_i = \beta_0 + \beta_1 TEST_i + \beta_2 MINORITY_i + \varepsilon_i.$$
In [65]:
res2 = ols('JPERF ~ TEST + MINORITY', data=data).fit()
res2.summary()
Out[65]:
OLS Regression Results
Dep. Variable: JPERF R-squared: 0.572
Model: OLS Adj. R-squared: 0.522
Method: Least Squares F-statistic: 11.38
Date: Thu, 26 Mar 2020 Prob (F-statistic): 0.000731
Time: 16:22:17 Log-Likelihood: -35.390
No. Observations: 20 AIC: 76.78
Df Residuals: 17 BIC: 79.77
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.6120 0.887 0.690 0.500 -1.260 2.483
TEST 2.2988 0.522 4.400 0.000 1.197 3.401
MINORITY 1.0276 0.691 1.487 0.155 -0.430 2.485
Omnibus: 0.251 Durbin-Watson: 1.834
Prob(Omnibus): 0.882 Jarque-Bera (JB): 0.437
Skew: -0.059 Prob(JB): 0.804
Kurtosis: 2.286 Cond. No. 5.72


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Plotting the two lines our model gives us look OK, but the residuals plot does not look to be significantly better

In [66]:
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    'TEST',
    'JPERF',
    data=data,
    hue='MINORITY',
    ax=ax,
    alpha=0.99,
    s=200,
    markers=['o', 'v'],
    palette=['purple', 'green'],
    style='MINORITY',
)

data['PREDICT'] = res2.predict()
mask0 = data['MINORITY'] == 0
mask1 = data['MINORITY'] == 1

y1 = data[mask0]['PREDICT']
y2 = data[mask1]['PREDICT']
ax.plot(
    data[mask0]['TEST'],
    y1,
    color='purple',
    label='Predicted (M=0)',
)
ax.plot(
    data[mask1]['TEST'], y2, 'g-', label='Predicted (M=1)'
)
ax.grid()
ax.legend(loc='best')
Out[66]:
<matplotlib.legend.Legend at 0x2206d73fe10>
In [67]:
plt.plot(data['TEST'], res2.resid, 'o')
plt.axhline(0, color='r')
Out[67]:
<matplotlib.lines.Line2D at 0x2206ed9f7f0>

Model 3, TEST with MINORITY Interaction

In our third model, we allow each Minority Status group to have diffent slope but same intercept in the TEST - JPERF line, equivalent to beta-2 = 0

$$JPERF_i = \beta_0 + \beta_1 TEST_i + \beta_3 MINORITY_i * TEST_i + \varepsilon_i.$$
In [68]:
res3 = ols('JPERF ~ TEST + TEST:MINORITY', data=data).fit()
res3.summary()
Out[68]:
OLS Regression Results
Dep. Variable: JPERF R-squared: 0.632
Model: OLS Adj. R-squared: 0.589
Method: Least Squares F-statistic: 14.59
Date: Thu, 26 Mar 2020 Prob (F-statistic): 0.000204
Time: 16:22:18 Log-Likelihood: -33.891
No. Observations: 20 AIC: 73.78
Df Residuals: 17 BIC: 76.77
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.1211 0.780 1.437 0.169 -0.525 2.768
TEST 1.8276 0.536 3.412 0.003 0.698 2.958
TEST:MINORITY 0.9161 0.397 2.306 0.034 0.078 1.754
Omnibus: 0.388 Durbin-Watson: 1.781
Prob(Omnibus): 0.823 Jarque-Bera (JB): 0.514
Skew: 0.050 Prob(JB): 0.773
Kurtosis: 2.221 Cond. No. 5.96


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Subjectively, this is a better model (especially for Minority Status = 0 observations). However, the beauty of statistics is that we don't have to rely on subjective assessment!

The code below add the TEST = 0 point to each line, to show they have the same intercept

In [69]:
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    'TEST',
    'JPERF',
    data=data,
    hue='MINORITY',
    ax=ax,
    alpha=0.99,
    s=200,
    markers=['o', 'v'],
    palette=['purple', 'green'],
    style='MINORITY',
)

data['PREDICT'] = res3.predict()
mask0 = data['MINORITY'] == 0
mask1 = data['MINORITY'] == 1

y1 = data[mask0]['PREDICT']
y2 = data[mask1]['PREDICT']
ax.plot(
    [0] + list(data[mask0]['TEST']),
    [res3.params['Intercept']] + list(y1),
    'b-',
    label='Predicted (M=0)',
)
ax.plot(
    [0] + list(data[mask1]['TEST']),
    [res3.params['Intercept']] + list(y2),
    'y-',
    label='Predicted (M=1)',
)
ax.grid()
ax.legend(loc='best')
Out[69]:
<matplotlib.legend.Legend at 0x2206ec85c18>

We plot the residuals

In [70]:
plt.plot(data['TEST'], res3.resid, 'o')
plt.axhline(0, color='r')
Out[70]:
<matplotlib.lines.Line2D at 0x2206ed0c2b0>

Model 4, All Possible Contributions to Linear Model

Our last model has no constraints on the beta values

$$JPERF_i = \beta_0 + \beta_1 TEST_i + \beta_2 MINORITY_i + \beta_3 MINORITY_i * TEST_i + \varepsilon_i.$$
In [71]:
#
# * has special meaning in ```ols``` formulas

res4 = ols('JPERF ~ TEST * MINORITY', data=data).fit()
res4.summary()
Out[71]:
OLS Regression Results
Dep. Variable: JPERF R-squared: 0.664
Model: OLS Adj. R-squared: 0.601
Method: Least Squares F-statistic: 10.55
Date: Thu, 26 Mar 2020 Prob (F-statistic): 0.000451
Time: 16:22:18 Log-Likelihood: -32.971
No. Observations: 20 AIC: 73.94
Df Residuals: 16 BIC: 77.92
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.0103 1.050 1.914 0.074 -0.216 4.236
TEST 1.3134 0.670 1.959 0.068 -0.108 2.735
MINORITY -1.9132 1.540 -1.242 0.232 -5.179 1.352
TEST:MINORITY 1.9975 0.954 2.093 0.053 -0.026 4.021
Omnibus: 3.377 Durbin-Watson: 1.695
Prob(Omnibus): 0.185 Jarque-Bera (JB): 1.330
Skew: 0.120 Prob(JB): 0.514
Kurtosis: 1.760 Cond. No. 13.8


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [72]:
fig, ax = plt.subplots(figsize=(8, 8))
sn.scatterplot(
    'TEST',
    'JPERF',
    data=data,
    hue='MINORITY',
    ax=ax,
    alpha=0.99,
    s=200,
    markers=['o', 'v'],
    palette=['purple', 'green'],
    style='MINORITY',
)

data['PREDICT'] = res4.predict()
mask0 = data['MINORITY'] == 0
mask1 = data['MINORITY'] == 1

y1 = data[mask0]['PREDICT']
y2 = data[mask1]['PREDICT']
ax.plot(
    list(data[mask0]['TEST']),
    list(y1),
    'b-',
    label='Predicted (M=0)',
)
ax.plot(
    list(data[mask1]['TEST']),
    list(y2),
    'y-',
    label='Predicted (M=1)',
)
ax.grid()
ax.legend(loc='best')
Out[72]:
<matplotlib.legend.Legend at 0x2206edd1668>
In [73]:
plt.plot(data['TEST'], res4.resid, 'o')
Out[73]:
[<matplotlib.lines.Line2D at 0x2206ee379e8>]

It is interesting to see the Mean Square Error in the Residuals for each model

In [74]:
res = [res1, res2, res3, res4]
In [75]:
fig, ax = plt.subplots(figsize=(8, 8))
sn.barplot(
    ['T', 'T+M', 'T+T:M', 'T*M'],
    [a.mse_resid for a in res],
    ax=ax,
)
ax.set_xlabel('Model Formula')
ax.set_ylabel('Mean Square Residuals')
Out[75]:
Text(0, 0.5, 'Mean Square Residuals')

Analysis of Variance

We now compare each model against another, to see if there is a significantly better explanation for the variance

TEST vs TEST * MINORITY

Not significant at the 5% level

In [76]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    anv = sm.stats.anova_lm(res1, res4)
# end with
anv
Out[76]:
df_resid ssr df_diff ss_diff F Pr(>F)
0 18.0 45.568297 0.0 NaN NaN NaN
1 16.0 31.655473 2.0 13.912824 3.516061 0.054236

TEST vs TEST + MINORITY

Not significant

In [77]:
sm.stats.anova_lm(res1, res2)
D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy\stats\_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
Out[77]:
df_resid ssr df_diff ss_diff F Pr(>F)
0 18.0 45.568297 0.0 NaN NaN NaN
1 17.0 40.321546 1.0 5.246751 2.212087 0.155246

TEST + TEST:MINORITY vs TEST * MINORITY

Not significant

In [78]:
sm.stats.anova_lm(res3, res4)
Out[78]:
df_resid ssr df_diff ss_diff F Pr(>F)
0 17.0 34.707653 0.0 NaN NaN NaN
1 16.0 31.655473 1.0 3.05218 1.542699 0.232115

TEST vs TEST + TEST:MINORITY

Significant at the 5% level

In [79]:
sm.stats.anova_lm(res1, res3)
Out[79]:
df_resid ssr df_diff ss_diff F Pr(>F)
0 18.0 45.568297 0.0 NaN NaN NaN
1 17.0 34.707653 1.0 10.860644 5.319603 0.033949

TEST + MINORITY vs TEST:MINORITY

Not significant at the 5% level

In [80]:
sm.stats.anova_lm(res2, res4)
Out[80]:
df_resid ssr df_diff ss_diff F Pr(>F)
0 17.0 40.321546 0.0 NaN NaN NaN
1 16.0 31.655473 1.0 8.666073 4.380196 0.05265

Finally we plot the residuals of all the models together

In [81]:
# ['T', 'T+M', 'T+T:M', 'T*M']

_ = plt.plot(range(len(data)), res1.resid, 'ro', label='T')
_ = plt.plot(
    range(len(data)), res2.resid, 'g+', label='T+M'
)
_ = plt.plot(
    range(len(data)), res3.resid, 'kd', label='T+T:M'
)
_ = plt.plot(
    range(len(data)), res4.resid, 'b*', label='T*M'
)
_ = plt.legend(loc='best')

Environment

In [82]:
%watermark -h -iv
%watermark
pandas      1.0.0
numpy       1.15.4
matplotlib  3.0.2
scipy       1.1.0
statsmodels 0.9.0
seaborn     0.9.0
host name: DESKTOP-SODFUN6
2020-03-26T16:22:20+10:00

CPython 3.7.1
IPython 7.2.0

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit
In [ ]:
 

Comments


net-analysis.com Data Analysis Blog © Don Cameron Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More