scipy Statistics Worked Examples

Sun 15 March 2020

Filed under scipy

Tags statistics scipy

SciPy and Statistics Examples

As a familiarisation exercise, this is a walkthrough of the material from https://scipy-lectures.org/packages/statistics/index.html

There is some added value in that I have explored different ways of visualising the data being analysed.

Initial Notebook Setup

watermark self-documents the environment.

In [1]:
%load_ext watermark

black is my preferred Python formatter

In [2]:
%load_ext lab_black
In [3]:
%matplotlib inline

All imports will go here

In [4]:
import pandas as pd
import numpy as np
import seaborn as sn

import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm

from statsmodels.sandbox.regression.predstd import (
    wls_prediction_std,
)
D:\Anaconda3\envs\ac5-py37\lib\site-packages\statsmodels\compat\pandas.py:49: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  data_klasses = (pandas.Series, pandas.DataFrame, pandas.Panel)

Data Load

The initial dataset contains three different IQ measures, and data on subject weight, height, and brain size (as per MRI)

In [5]:
DATA_PREFIX = '../data/'
data = pd.read_csv(
    DATA_PREFIX + 'brain_size.csv', na_values=".", sep=';'
)

Data Exploration

Display the first few rows

In [6]:
data.head()
Out[6]:
Unnamed: 0 Gender FSIQ VIQ PIQ Weight Height MRI_Count
0 1 Female 133 132 124 118.0 64.5 816932
1 2 Male 140 150 124 NaN 72.5 1001121
2 3 Male 139 123 150 143.0 73.3 1038437
3 4 Male 133 129 128 172.0 68.8 965353
4 5 Female 137 132 134 147.0 65.0 951545

Display the data types in our dataset

In [7]:
data.dtypes
Out[7]:
Unnamed: 0      int64
Gender         object
FSIQ            int64
VIQ             int64
PIQ             int64
Weight        float64
Height        float64
MRI_Count       int64
dtype: object

Use describe to give a snapshot of each numeric column

In [8]:
data.describe()
Out[8]:
Unnamed: 0 FSIQ VIQ PIQ Weight Height MRI_Count
count 40.000000 40.000000 40.000000 40.00000 38.000000 39.000000 4.000000e+01
mean 20.500000 113.450000 112.350000 111.02500 151.052632 68.525641 9.087550e+05
std 11.690452 24.082071 23.616107 22.47105 23.478509 3.994649 7.228205e+04
min 1.000000 77.000000 71.000000 72.00000 106.000000 62.000000 7.906190e+05
25% 10.750000 89.750000 90.000000 88.25000 135.250000 66.000000 8.559185e+05
50% 20.500000 116.500000 113.000000 115.00000 146.500000 68.000000 9.053990e+05
75% 30.250000 135.500000 129.750000 128.00000 172.000000 70.500000 9.500780e+05
max 40.000000 144.000000 150.000000 150.00000 192.000000 77.000000 1.079549e+06

Let us explore the difference gender makes. We use the pandas groupby function, and then show a snapshot of each group.

In [9]:
gender = data.groupby('Gender')

We see that we have a new type of object

In [10]:
gender
Out[10]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002BFD7C16F60>
In [11]:
gender.describe()
Out[11]:
Unnamed: 0 FSIQ ... Height MRI_Count
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Gender
Female 20.0 19.65 11.356774 1.0 10.25 18.0 29.25 38.0 20.0 111.9 ... 66.875 70.5 20.0 862654.6 55893.55578 790619.0 828062.00 855365.0 882668.5 991305.0
Male 20.0 21.35 12.248630 2.0 11.50 21.5 32.25 40.0 20.0 115.0 ... 73.750 77.0 20.0 954855.4 55911.35471 879987.0 919529.25 947241.5 973496.0 1079549.0

2 rows × 56 columns

In [12]:
gender.mean()
Out[12]:
Unnamed: 0 FSIQ VIQ PIQ Weight Height MRI_Count
Gender
Female 19.65 111.9 109.45 110.45 137.200000 65.765000 862654.6
Male 21.35 115.0 115.25 111.60 166.444444 71.431579 954855.4
In [13]:
gender.size()
Out[13]:
Gender
Female    20
Male      20
dtype: int64

Display the first four rows of each Gender

In [14]:
gender.head(4)
Out[14]:
Unnamed: 0 Gender FSIQ VIQ PIQ Weight Height MRI_Count
0 1 Female 133 132 124 118.0 64.5 816932
1 2 Male 140 150 124 NaN 72.5 1001121
2 3 Male 139 123 150 143.0 73.3 1038437
3 4 Male 133 129 128 172.0 68.8 965353
4 5 Female 137 132 134 147.0 65.0 951545
5 6 Female 99 90 110 146.0 69.0 928799
6 7 Female 138 136 131 138.0 64.5 991305
8 9 Male 89 93 84 134.0 66.3 904858

Data Visualization

We use the pandas plotting functions to quickly show the weights of each subject, color-coded by gender.

In [15]:
gender['Weight'].plot(style=['o', '+'])
Out[15]:
Gender
Female    AxesSubplot(0.125,0.125;0.775x0.755)
Male      AxesSubplot(0.125,0.125;0.775x0.755)
Name: Weight, dtype: object

Show the histogram of weights, grouped by gender

In [16]:
gender['Weight'].hist()
Out[16]:
Gender
Female    AxesSubplot(0.125,0.125;0.775x0.755)
Male      AxesSubplot(0.125,0.125;0.775x0.755)
Name: Weight, dtype: object

Show a boxplot with IQ scores, grouped by gender

In [17]:
gender.boxplot(column=['FSIQ', 'PIQ', 'VIQ'])
Out[17]:
Female         AxesSubplot(0.1,0.15;0.363636x0.75)
Male      AxesSubplot(0.536364,0.15;0.363636x0.75)
dtype: object

Show a scatter plot of the non-IQ numeric data (one plot for each pair of variables). As we might expect, aspects of body size appear to be positively correlated.

In [18]:
__ = pd.plotting.scatter_matrix(
    data[['Weight', 'Height', 'MRI_Count']]
)

Show a scatter plot for IQ measures (pairwise). Different measures of IQ also appear to be positively correlated

In [19]:
__ = pd.plotting.scatter_matrix(
    data[['FSIQ', 'PIQ', 'VIQ']]
)

We first clean the data by dropping all rows with an NA value, then build a numeric code for each subject (1 for Male, 2 for Female), and then re-do the scatter plot as before, but coloring each point with a gender-specific color. We set alpha to be almost one to get brighter colors.

In [20]:
clean_data = data.dropna()
In [21]:
sex_code = {'Male': 1, 'Female': 2}
sex_color = [sex_code[x] for x in clean_data['Gender']]
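An equivalent (and arguably more idiomatic) way to build the colour codes would be to map the Gender column directly; a sketch, not used below:

# Sketch: map each Gender value to its numeric code in one step
sex_color = clean_data['Gender'].map(sex_code).tolist()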
In [22]:
__ = pd.plotting.scatter_matrix(
    clean_data[['Weight', 'Height', 'MRI_Count']],
    c=sex_color,
    cmap='rainbow',
    alpha=0.95,
)

The same gender-colored treatment for the IQ measures, but with the default alpha

In [23]:
__ = pd.plotting.scatter_matrix(
    clean_data[['FSIQ', 'PIQ', 'VIQ']],
    c=sex_color,
    cmap='rainbow',
)

Almost too small at the default size, but showing the relationships between all numeric variables. FSIQ seems to have a gap in its values; if this were a real data analysis job, we would want to investigate that.

In [24]:
__ = pd.plotting.scatter_matrix(
    clean_data[
        [
            'FSIQ',
            'PIQ',
            'VIQ',
            'Height',
            'Weight',
            'MRI_Count',
        ]
    ],
    c=sex_color,
    cmap='rainbow',
)

Statistical Tests

Test for non-zero means

As a finger exercise, test if any of the variables have a mean of zero (we expect none). As expected, the probability of seeing the data we see, with a true mean of zero, is vanishingly small: we reject the mean=0 hypothesis.

We try VIQ first

In [25]:
stats.ttest_1samp(data['VIQ'], 0)
Out[25]:
Ttest_1sampResult(statistic=30.088099970849328, pvalue=1.3289196468728067e-28)

Check all the variables

In [26]:
for name in [
    'Weight',
    'Height',
    'VIQ',
    'PIQ',
    'FSIQ',
    'MRI_Count',
]:
    print(
        f'T Test values for {name:20}: {stats.ttest_1samp(clean_data[name], 0)}'
    )
# end for
T Test values for Weight              : Ttest_1sampResult(statistic=39.65971373853829, pvalue=6.39263888216351e-32)
T Test values for Height              : Ttest_1sampResult(statistic=105.60788971919253, pvalue=1.6747259486645957e-47)
T Test values for VIQ                 : Ttest_1sampResult(statistic=30.132405743476752, pvalue=1.2294004595027533e-27)
T Test values for PIQ                 : Ttest_1sampResult(statistic=30.372726699843938, pvalue=9.265683776894988e-28)
T Test values for FSIQ                : Ttest_1sampResult(statistic=29.3921457404181, pvalue=2.976917593395147e-27)
T Test values for MRI_Count           : Ttest_1sampResult(statistic=77.03243851203136, pvalue=1.8665995958235843e-42)
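Note that the loop above uses clean_data to side-step the missing Weight and Height values. Alternatively (a sketch, not run here), scipy's one-sample t-test can be told to drop NaN values itself:

# Sketch: let scipy ignore the missing values instead of pre-cleaning the frame
stats.ttest_1samp(data['Weight'], 0, nan_policy='omit')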

Test for gender difference

We now test for gender-related differences in the numeric values. As expected, the body-related variables (height, weight, MRI count) show differences too large to be explained by chance. The IQ measures, not so. Assuming unequal variance in the populations is a more stringent test (in the sense that the T values are smaller, hence more likely to occur by chance, and hence we are less likely to reject the Null Hypothesis that the means are equal).

In [27]:
for name in [
    'Weight',
    'Height',
    'VIQ',
    'PIQ',
    'FSIQ',
    'MRI_Count',
]:
    f_values = clean_data[clean_data['Gender'] == 'Female'][
        name
    ]
    m_values = clean_data[clean_data['Gender'] == 'Male'][
        name
    ]
    print(
        f'2 Sample T Test (Welch’s t-test) values for {name:10}: {stats.ttest_ind(f_values, m_values, equal_var=False)}'
    )
# end for
2 Sample T Test (Welch’s t-test) values for Weight    : Ttest_indResult(statistic=-4.827378552057042, pvalue=2.964900309600948e-05)
2 Sample T Test (Welch’s t-test) values for Height    : Ttest_indResult(statistic=-5.937236386105681, pvalue=1.7712232206445267e-06)
2 Sample T Test (Welch’s t-test) values for VIQ       : Ttest_indResult(statistic=-0.7501099096274505, pvalue=0.45832254829707375)
2 Sample T Test (Welch’s t-test) values for PIQ       : Ttest_indResult(statistic=-0.25209118945110864, pvalue=0.8024574345106845)
2 Sample T Test (Welch’s t-test) values for FSIQ      : Ttest_indResult(statistic=-0.4451800313888085, pvalue=0.6589088711679706)
2 Sample T Test (Welch’s t-test) values for MRI_Count : Ttest_indResult(statistic=-5.120860587619881, pvalue=1.0696251055530656e-05)

We perform the same test, this time assuming equal population variance. The conclusions are the same, even though the T values differ slightly.

In [28]:
for name in [
    'Weight',
    'Height',
    'VIQ',
    'PIQ',
    'FSIQ',
    'MRI_Count',
]:
    f_values = clean_data[clean_data['Gender'] == 'Female'][
        name
    ]
    m_values = clean_data[clean_data['Gender'] == 'Male'][
        name
    ]
    print(
        f'2 Sample T Test (equal variance assumed) values for {name:10}: {stats.ttest_ind(f_values, m_values, equal_var=True)}'
    )
# end for
2 Sample T Test (equal variance assumed) values for Weight    : Ttest_indResult(statistic=-4.870950921940696, pvalue=2.227293018362118e-05)
2 Sample T Test (equal variance assumed) values for Height    : Ttest_indResult(statistic=-6.056794879983802, pvalue=5.83598319801381e-07)
2 Sample T Test (equal variance assumed) values for VIQ       : Ttest_indResult(statistic=-0.7551589849287512, pvalue=0.4550660394858371)
2 Sample T Test (equal variance assumed) values for PIQ       : Ttest_indResult(statistic=-0.253253883465599, pvalue=0.8015127184458677)
2 Sample T Test (equal variance assumed) values for FSIQ      : Ttest_indResult(statistic=-0.44600097415905493, pvalue=0.6582675757550638)
2 Sample T Test (equal variance assumed) values for MRI_Count : Ttest_indResult(statistic=-5.121466354847687, pvalue=1.0360213915002095e-05)
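Whether the equal-variance assumption is reasonable can itself be tested. A sketch (not part of the original analysis) using Levene's test from scipy.stats on the Weight variable; a small p-value would argue for sticking with the Welch form:

# Sketch: test the Null Hypothesis that male and female Weights have equal variance
f_w = clean_data[clean_data['Gender'] == 'Female']['Weight']
m_w = clean_data[clean_data['Gender'] == 'Male']['Weight']
stats.levene(f_w, m_w)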

Test for difference in measures

We now test for differences between the measures of IQ. The greatest difference is between PIQ and FSIQ, but not enough to reject the Null Hypothesis at the 5% level. We also use the Wilcoxon test to test for differences in the underlying distributions from which we assume each IQ measure is drawn. Again, we cannot reject the Null Hypothesis that the distributions are equal.

In [29]:
stats.ttest_rel(data['FSIQ'], data['PIQ'])
Out[29]:
Ttest_relResult(statistic=1.7842019405859857, pvalue=0.08217263818364236)
In [30]:
stats.ttest_1samp(data['FSIQ'] - data['PIQ'], 0)
Out[30]:
Ttest_1sampResult(statistic=1.7842019405859857, pvalue=0.08217263818364236)
In [31]:
stats.ttest_rel(data['FSIQ'], data['VIQ'])
Out[31]:
Ttest_relResult(statistic=0.8913911665594247, pvalue=0.37818604189634275)
In [32]:
stats.ttest_rel(data['PIQ'], data['VIQ'])
Out[32]:
Ttest_relResult(statistic=-0.5445841763599564, pvalue=0.5891401178507841)
In [33]:
stats.wilcoxon(data['FSIQ'], data['PIQ'])
Out[33]:
WilcoxonResult(statistic=274.5, pvalue=0.10659492713506856)
In [34]:
stats.wilcoxon(data['FSIQ'], data['VIQ'])
Out[34]:
WilcoxonResult(statistic=335.5, pvalue=0.6112323364449876)
In [35]:
stats.wilcoxon(data['VIQ'], data['PIQ'])
Out[35]:
WilcoxonResult(statistic=353.0, pvalue=0.4433735618354926)
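Because we have run three pairwise comparisons, a cautious analyst might adjust the p-values for multiple testing. A sketch using statsmodels (multipletests is not used elsewhere in this notebook):

# Sketch: Holm adjustment of the three Wilcoxon p-values
from statsmodels.stats.multitest import multipletests

pvals = [
    stats.wilcoxon(data['FSIQ'], data['PIQ']).pvalue,
    stats.wilcoxon(data['FSIQ'], data['VIQ']).pvalue,
    stats.wilcoxon(data['VIQ'], data['PIQ']).pvalue,
]
reject, pvals_adjusted, _, _ = multipletests(pvals, alpha=0.05, method='holm')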

Test for a difference in means

To show an example where large sample sizes allow more precise testing, we draw 1,000 samples from two distributions that visually look to have the same mean. However, the T test allows us to reject the Null Hypothesis at the 5% level (for the run shown here, the difference in observed means would only happen about 0.05% of the time if the true means were equal).

In [36]:
rvs1 = stats.norm.rvs(loc=5, scale=10, size=1000)
rvs5 = stats.norm.rvs(loc=8, scale=20, size=1000)
stats.ttest_ind(rvs1, rvs5, equal_var=False)
Out[36]:
Ttest_indResult(statistic=-3.483444735595557, pvalue=0.0005095289454042837)

Display the two histograms of sample values

In [37]:
_ = plt.hist(rvs1)
_ = plt.hist(rvs5, alpha=0.3)
In [38]:
rvs1.mean(), rvs5.mean()
Out[38]:
(4.986770485224489, 7.457020805490511)
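As a related aside, statsmodels can estimate the sample size needed to detect an effect of a given size. A sketch (the standardised effect size here is roughly (8 - 5) / sqrt((10² + 20²) / 2) ≈ 0.19; TTestIndPower is not used elsewhere in this notebook):

# Sketch: observations needed per group to detect d ≈ 0.19 with 80% power at the 5% level
from statsmodels.stats.power import TTestIndPower

TTestIndPower().solve_power(effect_size=0.19, alpha=0.05, power=0.8)
# a little over 400 observations per group; our 1,000 comfortably exceeds this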

Linear Regression

Using R style symbolic equations, we create a linear dataset, and visualize it (this time using matplotlib). In this case we are using the package statsmodels, and not scipy.stats.

In [39]:
x = np.linspace(-5, 5, 20)
np.random.seed(1)
y = -5 + 3 * x + 4 * np.random.normal(size=x.shape)
y_no_error = -5 + 3 * x
In [40]:
plt.plot(x, y, 'o')
Out[40]:
[<matplotlib.lines.Line2D at 0x2bfd89b56d8>]

We now create a pandas dataframe from the x,y data, and declare a model that says y is a linear function of x. Note that we do not have to declare a constant in our model, for this variant of Ordinary Least Squares fitting.

In [41]:
data = pd.DataFrame({'x': x, 'y': y})
In [42]:
model = ols('y ~ x', data).fit()
In [43]:
model.summary()
Out[43]:
OLS Regression Results
Dep. Variable: y R-squared: 0.804
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 74.03
Date: Sun, 15 Mar 2020 Prob (F-statistic): 8.56e-08
Time: 15:04:03 Log-Likelihood: -57.988
No. Observations: 20 AIC: 120.0
Df Residuals: 18 BIC: 122.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -5.5335 1.036 -5.342 0.000 -7.710 -3.357
x 2.9369 0.341 8.604 0.000 2.220 3.654
Omnibus: 0.100 Durbin-Watson: 2.956
Prob(Omnibus): 0.951 Jarque-Bera (JB): 0.322
Skew: -0.058 Prob(JB): 0.851
Kurtosis: 2.390 Cond. No. 3.03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Just for fun, I plotted the four lines you get when taking the extremes of the 95% confidence intervals for intercept and slope.

In [44]:
y1 = 2.220 * x - 7.71
y2 = 3.654 * x - 3.357
y3 = 2.220 * x - 3.357
y4 = 3.654 * x - 7.71
plt.plot(x, y1, '-')
plt.plot(x, y2, '-')
plt.plot(x, y3, '-')
plt.plot(x, y4, '-')
Out[44]:
[<matplotlib.lines.Line2D at 0x2bfd8a544a8>]
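Rather than hard-coding the interval endpoints from the summary table, the same four lines could be derived programmatically from the fitted results; a sketch using the conf_int method (shown in more detail later):

# Sketch: pull the 95% confidence limits for intercept and slope from the results
ci = model.conf_int()  # rows: Intercept, x; columns: lower, upper
b_lo, b_hi = ci.loc['Intercept']
m_lo, m_hi = ci.loc['x']
for slope, intercept in [(m_lo, b_lo), (m_hi, b_hi), (m_lo, b_hi), (m_hi, b_lo)]:
    plt.plot(x, slope * x + intercept, '-')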

summary2 is an experimental results summary function. Nothing looks markedly different.

In [45]:
model.summary2()
Out[45]:
Model: OLS Adj. R-squared: 0.794
Dependent Variable: y AIC: 119.9767
Date: 2020-03-15 15:04 BIC: 121.9682
No. Observations: 20 Log-Likelihood: -57.988
Df Model: 1 F-statistic: 74.03
Df Residuals: 18 Prob (F-statistic): 8.56e-08
R-squared: 0.804 Scale: 21.463
Coef. Std.Err. t P>|t| [0.025 0.975]
Intercept -5.5335 1.0359 -5.3416 0.0000 -7.7099 -3.3571
x 2.9369 0.3413 8.6040 0.0000 2.2198 3.6540
Omnibus: 0.100 Durbin-Watson: 2.956
Prob(Omnibus): 0.951 Jarque-Bera (JB): 0.322
Skew: -0.058 Prob(JB): 0.851
Kurtosis: 2.390 Condition No.: 3

Show a plot of actual versus predicted values.

In [46]:
model.params
Out[46]:
Intercept   -5.533459
x            2.936888
dtype: float64
In [47]:
model.predict({'x': [0]})
Out[47]:
0   -5.533459
dtype: float64
In [48]:
yhat = model.predict({'x': x})
In [49]:
plt.plot(x, y, 'o')
plt.plot(x, yhat, '+-')
Out[49]:
[<matplotlib.lines.Line2D at 0x2bfd9c5a588>]

Use explicit OLS

As an alternative to R style model specification, the OLS class can be called directly, but in this case if we want a non-zero intercept, we have to add a constant 1 to our input data arrays.

In [50]:
x_const_added = sm.add_constant(x, prepend=False)
In [51]:
model2 = sm.OLS(y, x_const_added)
In [52]:
res = model2.fit()

We get exactly the same results, and visualization.

In [53]:
res.summary()
Out[53]:
OLS Regression Results
Dep. Variable: y R-squared: 0.804
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 74.03
Date: Sun, 15 Mar 2020 Prob (F-statistic): 8.56e-08
Time: 15:04:04 Log-Likelihood: -57.988
No. Observations: 20 AIC: 120.0
Df Residuals: 18 BIC: 122.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 2.9369 0.341 8.604 0.000 2.220 3.654
const -5.5335 1.036 -5.342 0.000 -7.710 -3.357
Omnibus: 0.100 Durbin-Watson: 2.956
Prob(Omnibus): 0.951 Jarque-Bera (JB): 0.322
Skew: -0.058 Prob(JB): 0.851
Kurtosis: 2.390 Cond. No. 3.03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [54]:
yhat2 = res.predict()
In [55]:
yhat2
Out[55]:
array([-20.21790012, -18.67216943, -17.12643874, -15.58070805,
       -14.03497735, -12.48924666, -10.94351597,  -9.39778528,
        -7.85205458,  -6.30632389,  -4.7605932 ,  -3.21486251,
        -1.66913181,  -0.12340112,   1.42232957,   2.96806026,
         4.51379095,   6.05952165,   7.60525234,   9.15098303])
In [56]:
plt.plot(x, y, 'o')
plt.plot(x, yhat2, '+-')
Out[56]:
[<matplotlib.lines.Line2D at 0x2bfd88ff240>]

I thought I would lift my game, and prepare a more professional graphic, at the expense of more code.

I create a matplotlib figure and axes object. In the axes object I plot the raw data, the predicted data, and the underlying linear model. I create a grid, and move the X and Y axes to run through (0,0) (and not along the edges of the graph). Finally, I plot the 95% prediction limits for values predicted with the equation we have derived from the model.

In [57]:
prstd, iv_l, iv_u = wls_prediction_std(res)

fig, ax = plt.subplots(figsize=(10, 10))

ax.plot(x, y, 'o', label='data')
ax.plot(x, yhat2, 'r-', label='OLS')
ax.plot(x, y_no_error, 'g-', label='True')
# ax.axvline(0, c='black')
# ax.axhline(0, c='black')
ax.grid()

ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.plot(x, iv_l, 'r--', label='95% Pred Int (l)')
ax.plot(x, iv_u, 'r--', label='95% Pred Int (u)')

ax.legend(loc='best')
Out[57]:
<matplotlib.legend.Legend at 0x2bfd899f208>

Here we show some of the attributes of the RegressionResults object.

In [58]:
res.predict([0, 1])
Out[58]:
array([-5.53345855])
In [59]:
res.rsquared
Out[59]:
0.8044102939875739
In [60]:
res.rsquared_adj
Out[60]:
0.7935441992091058
In [61]:
res.conf_int()
Out[61]:
array([[ 2.21976259,  3.65401404],
       [-7.70985258, -3.35706451]])
In [62]:
gp = res.get_prediction([[0, 1], [10, 1]])
In [63]:
gp.conf_int()
Out[63]:
array([[-7.70985258, -3.35706451],
       [16.34118537, 31.32966385]])

We now create a graphic showing the model confidence limits. We first create x2, a linear array spanning (and extending beyond) our data range. We add a constant 1 to this array, and use this to get the predictions of the output for this input array. The PredictionResults object returned by the call to get_prediction can then be used (by calling the method conf_int) to get the model confidence intervals.

In [64]:
x2 = np.linspace(-10, 10, 21)
In [65]:
x2
Out[65]:
array([-10.,  -9.,  -8.,  -7.,  -6.,  -5.,  -4.,  -3.,  -2.,  -1.,   0.,
         1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])
In [66]:
x2_enh = sm.add_constant(x2, prepend=False)
In [67]:
x2_enh[0:5]
Out[67]:
array([[-10.,   1.],
       [ -9.,   1.],
       [ -8.,   1.],
       [ -7.,   1.],
       [ -6.,   1.]])
In [68]:
gp = res.get_prediction(x2_enh)
In [69]:
ci = gp.conf_int(alpha=0.01)
In [70]:
ci[0:5]
Out[70]:
array([[-45.17008183, -24.63460157],
       [-41.29737811, -22.63352866],
       [-37.43533928, -20.62179086],
       [-33.58791501, -18.5954385 ],
       [-29.76115093, -16.54842594]])
In [71]:
upper = [z[1] for z in ci]
lower = [z[0] for z in ci]
In [72]:
prstd, iv_l, iv_u = wls_prediction_std(res)

In the graph below, note that we are more certain about the line we draw (especially near the centre) than we are about any prediction based on that line. Even if the line were exact (and the more data points we have, the better it is), the unavoidable error term in our model would still smear the observed values around the linear equation.

In [73]:
fig, ax = plt.subplots(figsize=(10, 10))

ax.plot(x, y, 'o', label='data')
ax.plot(x, yhat2, 'r-', label='OLS')
ax.plot(x, y_no_error, 'g-', label='True')

ax.plot(x2, upper, 'r--', label='99% Conf Int (u)')
ax.plot(x2, lower, 'r--', label='99% Conf Int (l)')
ax.plot(x, iv_l, 'y-', label='95% Pred Int (l)')
ax.plot(x, iv_u, 'y-', label='95% Pred Int (u)')

ax.grid()

ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)


ax.legend(loc='best')
Out[73]:
<matplotlib.legend.Legend at 0x2bfda215940>
In [74]:
type(gp)
Out[74]:
statsmodels.regression._prediction.PredictionResults
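As an aside, the PredictionResults object can also report both interval types in a single table via its summary_frame method; a sketch (not used for the plots above, column names as in recent statsmodels versions):

# Sketch: fitted mean, its confidence interval, and the (wider) prediction interval
frame = res.get_prediction(x2_enh).summary_frame(alpha=0.05)
frame[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']].head()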

Nonlinear Regression

We now move to an example of non-linear regression, where we concoct some data using a combination of linear, quadratic, sine, and constant functions.

In [75]:
nsample = 100
sig = 0.5
x = np.linspace(0, 20, nsample)
x2 = np.sin(x)
x3 = (x - 5) ** 2
x4 = np.ones(nsample)

Display the components that make up our data

In [76]:
fig, ax = plt.subplots(figsize=(10, 10))

ax.plot(x, x * 0.5)
ax.plot(x, x2 * 0.5)
ax.plot(x, x3 * (-0.02))
ax.plot(x, x4 * 5.0)

ax.grid()

ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

Create the combination of functions, and add Gaussian noise.

In [77]:
x5 = np.column_stack((x, x2, x3, x4))
beta = [0.5, 0.5, -0.02, 5.0]
y_true = np.dot(x5, beta)
y = y_true + sig * np.random.normal(size=nsample)

Quick and dirty plot.

In [78]:
plt.plot(x, y, 'o')
Out[78]:
[<matplotlib.lines.Line2D at 0x2bfda40a8d0>]

Now fit the functions to the observed results, using Ordinary Least Squares.

In [79]:
res = sm.OLS(y, x5).fit()
In [80]:
res.summary()
Out[80]:
OLS Regression Results
Dep. Variable: y R-squared: 0.953
Model: OLS Adj. R-squared: 0.951
Method: Least Squares F-statistic: 646.4
Date: Sun, 15 Mar 2020 Prob (F-statistic): 1.67e-63
Time: 15:04:06 Log-Likelihood: -52.024
No. Observations: 100 AIC: 112.0
Df Residuals: 96 BIC: 122.5
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 0.5189 0.016 33.013 0.000 0.488 0.550
x2 0.5298 0.061 8.723 0.000 0.409 0.650
x3 -0.0221 0.001 -15.983 0.000 -0.025 -0.019
const 4.9675 0.102 48.746 0.000 4.765 5.170
Omnibus: 0.130 Durbin-Watson: 1.857
Prob(Omnibus): 0.937 Jarque-Bera (JB): 0.241
Skew: 0.079 Prob(JB): 0.886
Kurtosis: 2.818 Cond. No. 223.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Show the results: the fit is very good!

In [81]:
fig, ax = plt.subplots(figsize=(10, 10))

yhat = res.predict()
ax.plot(x, y, 'o')
ax.plot(x, yhat, 'r-')
ax.plot(x, y_true, 'g-')
ax.grid()

As before, create a graphic showing the model and prediction confidence intervals.

In [82]:
gp2 = res.get_prediction(x5)

ci2 = gp2.conf_int(alpha=0.01)
upper = [z[1] for z in ci2]
lower = [z[0] for z in ci2]
In [83]:
prstd, iv_l, iv_u = wls_prediction_std(res)
In [84]:
fig, ax = plt.subplots(figsize=(10, 10))

yhat = res.predict()
ax.plot(x, y, 'o', label='Data')
ax.plot(x, yhat, 'r-', label='OLS estimate')
ax.plot(x, y_true, 'g-', label='Y True')
ax.plot(x, upper, 'r--', label='Confidence Int (upper)')
ax.plot(x, lower, 'r--', label='Confidence Int (lower)')
ax.plot(x, iv_u, 'y-', label='Pred. Int (upper)')
ax.plot(x, iv_l, 'y-', label='Pred. Int (lower)')

ax.legend(loc='best')

ax.grid()

# ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

Multivariate Regression

Iris Data Set

We look at the famous Iris flower dataset, and investigate the idea that there is a linear relationship between sepal width and petal length across all three species.

In [85]:
iris = pd.read_csv(DATA_PREFIX + 'iris.csv')
iris.head(4)
Out[85]:
sepal_length sepal_width petal_length petal_width name
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa

Fit the OLS model, and show the results summary. We see that the model parameters are unlikely to have been observed by chance in a world with no true relationship.

In [86]:
model = ols('sepal_width ~ name + petal_length', iris).fit()
In [87]:
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            sepal_width   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.468
Method:                 Least Squares   F-statistic:                     44.63
Date:                Sun, 15 Mar 2020   Prob (F-statistic):           1.58e-20
Time:                        15:04:06   Log-Likelihood:                -38.185
No. Observations:                 150   AIC:                             84.37
Df Residuals:                     146   BIC:                             96.41
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              2.9813      0.099     29.989      0.000       2.785       3.178
name[T.versicolor]    -1.4821      0.181     -8.190      0.000      -1.840      -1.124
name[T.virginica]     -1.6635      0.256     -6.502      0.000      -2.169      -1.158
petal_length           0.2983      0.061      4.920      0.000       0.178       0.418
==============================================================================
Omnibus:                        2.868   Durbin-Watson:                   1.753
Prob(Omnibus):                  0.238   Jarque-Bera (JB):                2.885
Skew:                          -0.082   Prob(JB):                        0.236
Kurtosis:                       3.659   Cond. No.                         54.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
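The name[T.versicolor] and name[T.virginica] terms come from the treatment (dummy) coding that the formula interface applies to the categorical name column, with setosa as the reference level. A sketch of the equivalent explicit coding (get_dummies is not used elsewhere in this notebook):

# Sketch: the dummy columns implied by 'name' in the formula, setosa as the reference
dummies = pd.get_dummies(iris['name'], drop_first=True)
dummies.head()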

Create a color code in (R,G,B) format for each species (I am cheating as I know that there are only three species)

In [88]:
categories = pd.Categorical(iris['name'])

lut = {0: (1, 0, 0), 1: (0, 1, 0), 2: (0, 0, 1)}

species_color = [lut[code] for code in categories.codes]

Create a graphic with each species' data points color-coded. We see the best OLS fit of a line that explains sepal width for each species, with a common slope and a different intercept per species. Versicolor and Virginica look OK, but the Setosa data seems to have no such linear relation. The R² is quite low, showing that there is a lot of variation in sepal width not explained by our model (or the data is very noisy).

In [89]:
fig, ax = plt.subplots(figsize=(10, 10))

yhat = model.predict()
categories = pd.Categorical(iris['name'])
for name, color in zip(
    ['setosa', 'versicolor', 'virginica'], ['g', 'b', 'c']
):
    ax.plot(
        iris[iris['name'] == name]['petal_length'],
        iris[iris['name'] == name]['sepal_width'],
        'o',
        label='Data: ' + name,
        c=color,
    )
# end for
ax.plot(
    iris['petal_length'], yhat, 'ro', label='OLS estimate'
)
ax.legend(loc='best')
ax.grid()
ax.set_xlabel('Petal Length')
ax.set_ylabel('Sepal Width')
Out[89]:
Text(0, 0.5, 'Sepal Width')

We can explore this by using the pandas plotting facilities to get a quick and dirty graphic showing the relationship between numeric variables.

In [90]:
fig, ax = plt.subplots(figsize=(10, 10))

categories = pd.Categorical(iris['name'])
__ = pd.plotting.scatter_matrix(
    iris,
    c=categories.codes,
    ax=ax,
    marker='o',
    # cmap='jet',
    alpha=0.95,
)
D:\Anaconda3\envs\ac5-py37\lib\site-packages\ipykernel_launcher.py:10: UserWarning: To output multiple subplots, the figure containing the passed axes is being cleared
  # Remove the CWD from sys.path while we load stuff.

We can use the F test to check hypotheses about the model parameters; the contrast below tests whether the versicolor and virginica effects are equal.

In [91]:
model.f_test([0, 1, -1, 0])
Out[91]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[3.24533535]]), p=0.07369058781700738, df_denom=146, df_num=1>
In [92]:
dict(zip(categories.codes, iris['name']))
Out[92]:
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
In [93]:
A = np.identity(len(model.params))
A = A[1:, :]

# This tests that each coefficient is jointly statistically
# significantly different from zero.

print(model.f_test(A))
<F test: F=array([[44.62737137]]), p=1.5774279638421893e-20, df_denom=146, df_num=3>
In [94]:
A
Out[94]:
array([[0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
In [95]:
model.fvalue, model.f_pvalue
Out[95]:
(44.627371372682035, 1.577427963841873e-20)
In [96]:
categories.describe()
Out[96]:
counts freqs
categories
setosa 50 0.333333
versicolor 50 0.333333
virginica 50 0.333333

We can use Seaborn to get a better graphic (with a legend this time! What were you thinking, pandas, in not having a legend?)

In [97]:
_ = sn.pairplot(
    iris,
    vars=[
        'sepal_length',
        'sepal_width',
        'petal_length',
        'petal_width',
    ],
    kind='scatter',
    hue='name',
    diag_kind='hist',
)

When we repeat our analysis on just the Setosa data points, we get a linear effect that would be seen roughly 1 in every 5 experiments just by random chance, assuming no true linear effect. We would be justified in retaining (not rejecting) the Null Hypothesis that, in the Setosa species, sepal width is independent of petal length.

In [98]:
setosa = iris[iris['name'] == 'setosa']
In [99]:
model = ols('sepal_width ~ petal_length', setosa).fit()
In [100]:
model.summary()
Out[100]:
OLS Regression Results
Dep. Variable: sepal_width R-squared: 0.031
Model: OLS Adj. R-squared: 0.011
Method: Least Squares F-statistic: 1.547
Date: Sun, 15 Mar 2020 Prob (F-statistic): 0.220
Time: 15:04:11 Log-Likelihood: -21.404
No. Observations: 50 AIC: 46.81
Df Residuals: 48 BIC: 50.63
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.8499 0.460 6.197 0.000 1.925 3.775
petal_length 0.3880 0.312 1.244 0.220 -0.239 1.015
Omnibus: 2.143 Durbin-Watson: 1.500
Prob(Omnibus): 0.343 Jarque-Bera (JB): 1.249
Skew: 0.242 Prob(JB): 0.536
Kurtosis: 3.604 Cond. No. 18.4


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [101]:
fig, ax = plt.subplots(figsize=(10, 10))

ax.plot(setosa['petal_length'], setosa['sepal_width'], 'o')
yhat = model.predict(setosa['petal_length'])
ax.plot(setosa['petal_length'], yhat, 'r-')
Out[101]:
[<matplotlib.lines.Line2D at 0x2bfd9d5b748>]
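A quick complementary check is the Pearson correlation coefficient and its p-value, which for a single predictor tells the same story as the regression F test; a sketch:

# Sketch: correlation between petal length and sepal width within setosa
r, p = stats.pearsonr(setosa['petal_length'], setosa['sepal_width'])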

IQ Data Set

Here, we go back to our IQ dataset, and look to see which variables are correlated with VIQ. Unsurprisingly, we are warned that there is a degree of correlation between our independent variables: this means that the inverse of the (X.T @ X) matrix (where X is our matrix of independent variable observations) might have significant errors, due to round-off. Most of the coefficients would not be judged significant at the 5% level (only MRI_Count is).

In [102]:
clean_data.head(4)
Out[102]:
Unnamed: 0 Gender FSIQ VIQ PIQ Weight Height MRI_Count
0 1 Female 133 132 124 118.0 64.5 816932
2 3 Male 139 123 150 143.0 73.3 1038437
3 4 Male 133 129 128 172.0 68.8 965353
4 5 Female 137 132 134 147.0 65.0 951545
In [103]:
model = ols(
    'VIQ ~ Height + Weight + MRI_Count + C(Gender) -1',
    clean_data,
).fit()
In [104]:
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    VIQ   R-squared:                       0.249
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     2.733
Date:                Sun, 15 Mar 2020   Prob (F-statistic):             0.0455
Time:                        15:04:11   Log-Likelihood:                -167.03
No. Observations:                  38   AIC:                             344.1
Df Residuals:                      33   BIC:                             352.2
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
C(Gender)[Female]   169.7719     90.054      1.885      0.068     -13.443     352.987
C(Gender)[Male]     179.9298     97.671      1.842      0.074     -18.784     378.643
Height               -2.7337      1.410     -1.938      0.061      -5.603       0.136
Weight               -0.1427      0.215     -0.665      0.511      -0.579       0.294
MRI_Count             0.0002   6.48e-05      2.489      0.018    2.94e-05       0.000
==============================================================================
Omnibus:                        5.002   Durbin-Watson:                   2.272
Prob(Omnibus):                  0.082   Jarque-Bera (JB):                1.892
Skew:                          -0.066   Prob(JB):                        0.388
Kurtosis:                       1.915   Cond. No.                     3.54e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.54e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
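The large condition number flagged in the warning can be checked directly from the design matrix; a sketch using numpy (exog is the design matrix that statsmodels builds from the formula), which should be of the same order as the Cond. No. in the summary:

# Sketch: condition number of the design matrix underlying the fit
np.linalg.cond(model.model.exog)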
In [105]:
_ = sn.pairplot(
    clean_data,
    vars=['VIQ', 'Height', 'Weight', 'MRI_Count'],
    kind='scatter',
    hue='Gender',
    diag_kind='hist',
)

Using the F test, we find no evidence of a true gender difference in VIQ: we cannot reject the hypothesis that the Female and Male coefficients are equal.

In [106]:
model.f_test([1, -1, 0, 0, 0])
Out[106]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[0.86984607]]), p=0.35777194549809077, df_denom=33, df_num=1>
In [107]:
model.f_test([-1, +1, 0, 0, 0])
Out[107]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[0.86984607]]), p=0.35777194549809077, df_denom=33, df_num=1>

Just for fun, it appears that Weight is strongly correlated with Height, and less strongly with Gender.

In [108]:
model = ols(
    'Weight ~ Height + C(Gender) + 1 ', clean_data
).fit()
In [109]:
model.summary()
Out[109]:
OLS Regression Results
Dep. Variable: Weight R-squared: 0.525
Model: OLS Adj. R-squared: 0.498
Method: Least Squares F-statistic: 19.37
Date: Sun, 15 Mar 2020 Prob (F-statistic): 2.17e-06
Time: 15:04:14 Log-Likelihood: -159.19
No. Observations: 38 AIC: 324.4
Df Residuals: 35 BIC: 329.3
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -59.3879 64.085 -0.927 0.360 -189.487 70.711
C(Gender)[T.Male] 12.4831 7.678 1.626 0.113 -3.104 28.070
Height 2.9892 0.973 3.073 0.004 1.014 4.964
Omnibus: 0.513 Durbin-Watson: 2.028
Prob(Omnibus): 0.774 Jarque-Bera (JB): 0.450
Skew: 0.246 Prob(JB): 0.798
Kurtosis: 2.796 Cond. No. 1.63e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.63e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [110]:
model = ols('Weight ~ Height  + 1 ', clean_data).fit()
In [111]:
model.summary()
Out[111]:
OLS Regression Results
Dep. Variable: Weight R-squared: 0.489
Model: OLS Adj. R-squared: 0.475
Method: Least Squares F-statistic: 34.51
Date: Sun, 15 Mar 2020 Prob (F-statistic): 1.02e-06
Time: 15:04:14 Log-Likelihood: -160.57
No. Observations: 38 AIC: 325.1
Df Residuals: 36 BIC: 328.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -130.3535 47.980 -2.717 0.010 -227.661 -33.046
Height 4.1129 0.700 5.875 0.000 2.693 5.533
Omnibus: 2.899 Durbin-Watson: 2.011
Prob(Omnibus): 0.235 Jarque-Bera (JB): 2.041
Skew: 0.561 Prob(JB): 0.360
Kurtosis: 3.176 Cond. No. 1.19e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.19e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Test for IQ Gender Differences

If we look at a model where we try to explain the variation in VIQ by Gender alone, we get a result (F test) that suggests that what we see would happen almost half the time by chance, assuming no true gender differences.

In [112]:
model = ols('VIQ ~ C(Gender) - 1 ', clean_data).fit()
In [113]:
model.summary()
Out[113]:
OLS Regression Results
Dep. Variable: VIQ R-squared: 0.016
Model: OLS Adj. R-squared: -0.012
Method: Least Squares F-statistic: 0.5703
Date: Sun, 15 Mar 2020 Prob (F-statistic): 0.455
Time: 15:04:14 Log-Likelihood: -172.16
No. Observations: 38 AIC: 348.3
Df Residuals: 36 BIC: 351.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
C(Gender)[Female] 109.4500 5.160 21.213 0.000 98.986 119.914
C(Gender)[Male] 115.1111 5.439 21.166 0.000 104.081 126.141
Omnibus: 21.889 Durbin-Watson: 1.891
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3.421
Skew: 0.002 Prob(JB): 0.181
Kurtosis: 1.530 Cond. No. 1.05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [114]:
model.f_test([1, -1])
Out[114]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[0.57026509]]), p=0.45506603948583946, df_denom=36, df_num=1>

Economic Data

As an example of regression of multiple variables, we load some economic data.

In [115]:
header = [
    'EDUCATION',
    'SOUTH',
    'SEX',
    'EXPERIENCE',
    'UNION',
    'WAGE',
    'AGE',
    'RACE',
    'OCCUPATION',
    'SECTOR',
    'MARR',
]
In [116]:
DATA_PREFIX = '../data/'
longly = pd.read_csv(
    DATA_PREFIX + 'CPS_85_Wages.txt',
    sep='\t',
    skiprows=26,
    skipfooter=6,
    header=None,
    names=header,
)
D:\Anaconda3\envs\ac5-py37\lib\site-packages\ipykernel_launcher.py:8: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
  
In [117]:
longly.head()
Out[117]:
EDUCATION SOUTH SEX EXPERIENCE UNION WAGE AGE RACE OCCUPATION SECTOR MARR
0 8 0 1 21 0 5.10 35 2 6 1 1
1 9 0 1 42 0 4.95 57 3 6 1 1
2 12 0 0 1 0 6.67 19 3 6 1 0
3 12 0 0 4 0 4.00 22 3 6 0 0
4 12 0 0 17 0 7.50 35 3 6 0 1
In [118]:
longly.tail()
Out[118]:
EDUCATION SOUTH SEX EXPERIENCE UNION WAGE AGE RACE OCCUPATION SECTOR MARR
529 18 0 0 5 0 11.36 29 3 5 0 0
530 12 0 1 33 0 6.10 51 1 5 0 1
531 17 0 1 25 1 23.25 48 1 5 0 1
532 12 1 0 13 1 19.88 31 3 5 0 1
533 16 0 0 33 0 15.38 55 3 5 1 1

Visualizing Linear Regression Models with Seaborn

Seaborn customization includes:

  • setting the diagonal histograms to have alpha=0.5, so that the overlapping histograms are not obscured

  • setting the off diagonal regression marker size to a smaller-than-default value, for clarity

In [119]:
sn.pairplot(
    longly,
    vars=['WAGE', 'AGE', 'EDUCATION'],
    kind='reg',
    hue='SEX',
    diag_kind='hist',
    plot_kws={'scatter_kws': {'alpha': 0.5, 's': 5}},
    diag_kws={'alpha': 0.5},
)
D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[119]:
<seaborn.axisgrid.PairGrid at 0x2bfd9d83898>
In [120]:
sn.lmplot(y='WAGE', x='EDUCATION', data=longly, hue='SEX')
Out[120]:
<seaborn.axisgrid.FacetGrid at 0x2bfdb15f940>

Comparing Ordinary Linear Regression and Robust Linear Regression

In [121]:
wage = longly['WAGE']
education = longly['EDUCATION']
education = sm.add_constant(education)
res1 = sm.RLM(wage, education).fit()
print(res1.summary())
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                   WAGE   No. Observations:                  534
Model:                            RLM   Df Residuals:                      532
Method:                          IRLS   Df Model:                            1
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Sun, 15 Mar 2020                                         
Time:                        15:04:17                                         
No. Iterations:                    18                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2165      0.890      0.243      0.808      -1.527       1.960
EDUCATION      0.6377      0.067      9.518      0.000       0.506       0.769
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore .
In [122]:
res2 = sm.OLS(wage, education).fit()
print(res2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   WAGE   R-squared:                       0.146
Model:                            OLS   Adj. R-squared:                  0.144
Method:                 Least Squares   F-statistic:                     90.85
Date:                Sun, 15 Mar 2020   Prob (F-statistic):           5.47e-20
Time:                        15:04:17   Log-Likelihood:                -1589.2
No. Observations:                 534   AIC:                             3182.
Df Residuals:                     532   BIC:                             3191.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.7460      1.045     -0.714      0.476      -2.800       1.308
EDUCATION      0.7505      0.079      9.532      0.000       0.596       0.905
==============================================================================
Omnibus:                      195.064   Durbin-Watson:                   1.833
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              924.535
Skew:                           1.566   Prob(JB):                    1.74e-201
Kurtosis:                       8.634   Cond. No.                         67.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [123]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
x = education['EDUCATION']
ax.plot(x, wage, 'o', label='Data')
ax.plot(x, res1.fittedvalues, 'r-', label='RLM fit')
ax.plot(x, res2.fittedvalues, 'g-', label='OLS fit')

ax.grid()
ax.set_xlabel('Education')
ax.set_ylabel('Wage ($/hour)')


# ax.spines['bottom'].set_position('zero')
# ax.spines['left'].set_position('zero')
# ax.spines['top'].set_visible(False)
# ax.spines['right'].set_visible(False)


ax.legend(loc='best')
Out[123]:
<matplotlib.legend.Legend at 0x2bfdbff05c0>
In [124]:
education.head()
Out[124]:
const EDUCATION
0 1.0 8
1 1.0 9
2 1.0 12
3 1.0 12
4 1.0 12

Testing for an interaction between EDUCATION and SEX, I think the conclusion is that WAGE depends strongly upon EDUCATION, and less strongly on SEX (females earn less); there is a hint of an EDUCATION * SEX interaction, but it is not statistically significant.

In [125]:
res = ols(
    formula='WAGE ~ EDUCATION + SEX + EDUCATION * SEX',
    data=longly,
).fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   WAGE   R-squared:                       0.190
Model:                            OLS   Adj. R-squared:                  0.186
Method:                 Least Squares   F-statistic:                     41.50
Date:                Sun, 15 Mar 2020   Prob (F-statistic):           4.24e-24
Time:                        15:04:18   Log-Likelihood:                -1575.0
No. Observations:                 534   AIC:                             3158.
Df Residuals:                     530   BIC:                             3175.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         1.1046      1.314      0.841      0.401      -1.476       3.685
EDUCATION         0.6831      0.099      6.918      0.000       0.489       0.877
SEX              -4.3704      2.085     -2.096      0.037      -8.466      -0.274
EDUCATION:SEX     0.1725      0.157      1.098      0.273      -0.136       0.481
==============================================================================
Omnibus:                      208.151   Durbin-Watson:                   1.863
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1278.081
Skew:                           1.587   Prob(JB):                    2.94e-278
Kurtosis:                       9.883   Cond. No.                         170.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
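A more direct way to judge the interaction term is to compare nested models with and without it, via an F test. A sketch using anova_lm (not part of the original notebook):

# Sketch: does adding the EDUCATION:SEX interaction significantly improve the fit?
res_reduced = ols('WAGE ~ EDUCATION + SEX', data=longly).fit()
res_full = ols('WAGE ~ EDUCATION * SEX', data=longly).fit()
sm.stats.anova_lm(res_reduced, res_full)  # p-value for the extra interaction term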

Reproducibility

In [126]:
%watermark -h -iv
seaborn     0.9.0
matplotlib  3.0.2
statsmodels 0.9.0
numpy       1.15.4
pandas      1.0.0
scipy       1.1.0
host name: DESKTOP-SODFUN6
In [127]:
%watermark
2020-03-15T15:04:18+10:00

CPython 3.7.1
IPython 7.2.0

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit
