Stanford STATS191 in Python, Lecture 14: Logistic Regression

Fri 22 May 2020

Filed under scipy

Tags statistics scipy statsmodels

Stanford Stats 191

Introduction

This is a re-creation of the Stanford Stats 191 course, using Python eco-system tools, instead of R. This is the lecture on Logistic Regression (see https://web.stanford.edu/class/stats191/notebooks/Logistic.html )

Initial Notebook Setup

watermark documents the Python and package environment; black is my chosen Python formatter

In [180]:
%load_ext watermark
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
In [181]:
%reload_ext lab_black
In [182]:
%matplotlib inline
In [183]:
import pandas as pd
import numpy as np
import seaborn as sn

import math

import warnings

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from scipy import stats
from statsmodels.formula.api import logit
from statsmodels.formula.api import probit
from statsmodels.formula.api import glm


import statsmodels.api as sm

import os

Initial Exploration

In order to make sure that I understood the statsmodels API, I applied it to the worked example given in Wikipedia (https://en.wikipedia.org/wiki/Logistic_regression)

We have a dataset that relates hours studied to the PASS / FAIL result in an exam.

Note that the result (dependent variable) is a binary choice, and hence linear regression would not be an appropriate method

In [184]:
Hours = [
    0.50,
    0.75,
    1.00,
    1.25,
    1.50,
    1.75,
    1.75,
    2.00,
    2.25,
    2.50,
    2.75,
    3.00,
    3.25,
    3.50,
    4.0,
    4.25,
    4.50,
    4.75,
    5.00,
    5.50,
]
Pass = [
    0,
    0,
    0,
    0,
    0,
    0,
    1,
    0,
    1,
    0,
    1,
    0,
    1,
    0,
    1,
    1,
    1,
    1,
    1,
    1,
]

Create a pandas dataframe

In [185]:
exam_dict = {'Hours': Hours, 'Pass': Pass}

exam = pd.DataFrame(exam_dict)

exam.head()
Out[185]:
Hours Pass
0 0.50 0
1 0.75 0
2 1.00 0
3 1.25 0
4 1.50 0
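As noted above, a straight line is a poor model for a binary response. As a quick check (a sketch of my own, not part of the original lecture), we can fit Ordinary Least Squares to the Pass variable and see that the fitted values stray outside the [0, 1] probability range at the extremes.

# Sketch: OLS applied to a binary response gives fitted values below 0 and above 1
ols_res = sm.OLS(exam['Pass'], sm.add_constant(exam['Hours'])).fit()

_ = plt.plot(exam['Hours'], ols_res.fittedvalues, 'o', label='OLS fitted')
_ = plt.plot(exam['Hours'], exam['Pass'], 'r+', label='Data')
_ = plt.axhline(0, color='black')
_ = plt.axhline(1, color='black')
_ = plt.legend(loc='center left')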

Use the statsmodels formula API to perform Logistic Regression.

In [186]:
res = logit('Pass ~ Hours', data=exam).fit()

res.summary()
Optimization terminated successfully.
         Current function value: 0.401494
         Iterations 7
Out[186]:
Logit Regression Results
Dep. Variable: Pass No. Observations: 20
Model: Logit Df Residuals: 18
Method: MLE Df Model: 1
Date: Thu, 21 May 2020 Pseudo R-squ.: 0.4208
Time: 18:49:15 Log-Likelihood: -8.0299
converged: True LL-Null: -13.863
LLR p-value: 0.0006365
coef std err z P>|z| [0.025 0.975]
Intercept -4.0777 1.761 -2.316 0.021 -7.529 -0.626
Hours 1.5046 0.629 2.393 0.017 0.272 2.737

Do a quick plot to show the raw data and the results

In [187]:
_ = plt.plot(
    exam['Hours'], res.predict(), 'o', label='Predicted'
)
_ = plt.plot(
    exam['Hours'], exam['Pass'], 'r+', label='Data'
)
_ = plt.legend(loc='center left')

Compare the probability of passing for two students (one who studied for 2 hours, the other for 4 hours)

In [188]:
xx = {'Hours': [2, 4]}
xx_df = pd.DataFrame(xx)
res.predict(xx_df)
Out[188]:
0    0.255703
1    0.874448
dtype: float64
In [189]:
hours = 2
prob = 1.0 / (
    1.0 + np.exp(-(res.params[0] + res.params[1] * hours))
)
prob
Out[189]:
0.25570318264090985
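The coefficient on Hours is the change in the log-odds of passing for each extra hour of study, so exponentiating it gives an odds ratio. A small sketch (my addition, not in the lecture notes):

# Each extra hour of study multiplies the odds of passing by exp(beta_Hours)
np.exp(res.params['Hours'])

With the fitted coefficient of about 1.50, each additional hour of study multiplies the odds of passing by roughly 4.5.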

The second variant of the summary function, summary2, presents much the same information, adding AIC and BIC

In [190]:
res.summary2()
Out[190]:
Model: Logit Pseudo R-squared: 0.421
Dependent Variable: Pass AIC: 20.0598
Date: 2020-05-21 18:49 BIC: 22.0512
No. Observations: 20 Log-Likelihood: -8.0299
Df Model: 1 LL-Null: -13.863
Df Residuals: 18 LLR p-value: 0.00063648
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept -4.0777 1.7610 -2.3156 0.0206 -7.5292 -0.6262
Hours 1.5046 0.6287 2.3932 0.0167 0.2724 2.7369

Probit

In order to analyse binary data, and interpret the predictions as the probability / odds / likelihood of success or failure, we want a function that ranges from 0 to 1, has a steep transition, and has tractable mathematical properties. The usual choice is the logistic function, as shown below.

$$f(x) = \frac{L}{1 + e^{-k(x-x_0)}}$$

Some areas of analysis instead use the probit model, in which the probability is modelled by the Cumulative Distribution Function (CDF) of the standard normal distribution.

To quote https://tutorials.methodsconsultants.com/posts/what-is-the-difference-between-logit-and-probit-models/

Logistic regression is more popular in health sciences like epidemiology, partly because coefficients can be interpreted in terms of odds ratios. Probit models can be generalized to account for non-constant error variances in more advanced econometric settings (known as heteroskedastic probit models) and hence are used in some contexts by economists and political scientists.
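To see how similar the two link functions are, here is a small sketch (my addition) comparing the logistic CDF with the standard normal CDF. Rescaling the normal CDF to the logistic distribution's standard deviation (π/√3 ≈ 1.8) makes the curves nearly coincide, which is roughly why logit coefficients tend to be larger than probit coefficients by a similar factor.

# Sketch: compare the logistic and probit (standard normal CDF) link functions
from scipy.special import expit
from scipy.stats import norm

z = np.linspace(-6, 6, 200)
_ = plt.plot(z, expit(z), label='Logistic CDF')
_ = plt.plot(z, norm.cdf(z), label='Standard normal CDF')
_ = plt.plot(
    z,
    norm.cdf(z * np.sqrt(3) / np.pi),
    ':',
    label='Normal CDF, rescaled to logistic s.d.',
)
_ = plt.legend(loc='lower right')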

statsmodels supports Probit models

In [191]:
res2 = probit('Pass ~ Hours', data=exam).fit()
res2.summary()
Optimization terminated successfully.
         Current function value: 0.394887
         Iterations 6
Out[191]:
Probit Regression Results
Dep. Variable: Pass No. Observations: 20
Model: Probit Df Residuals: 18
Method: MLE Df Model: 1
Date: Thu, 21 May 2020 Pseudo R-squ.: 0.4303
Time: 18:49:16 Log-Likelihood: -7.8977
converged: True LL-Null: -13.863
LLR p-value: 0.0005522
coef std err z P>|z| [0.025 0.975]
Intercept -2.4728 0.984 -2.514 0.012 -4.401 -0.545
Hours 0.9127 0.351 2.601 0.009 0.225 1.600

Display the predictions versus actual data

In [192]:
_ = plt.plot(exam['Hours'], res2.predict(), 'o')
_ = plt.plot(exam['Hours'], exam['Pass'], 'r+')

The difference between the Logistic and Probit models for the exam dataset is minimal

In [193]:
_ = plt.plot(
    exam['Hours'], res.predict() - res2.predict(), 'r+'
)
_ = plt.axhline(0, color='black')

Confidence Intervals

At https://stackoverflow.com/questions/47414842/confidence-interval-of-probability-prediction-from-logistic-regression-statsmode, there is a very good discussion of Confidence Intervals in the Logit Model. I was not able to reproduce the results of the R functions used in the lecture notes

Approach 1

This is a very misguided approach: we simply take the confidence limits of the parameters of our linear model, and feed them through the logistic function. The conf_int method returns the CI limits for these parameters

In [194]:
res.summary()
Out[194]:
Logit Regression Results
Dep. Variable: Pass No. Observations: 20
Model: Logit Df Residuals: 18
Method: MLE Df Model: 1
Date: Thu, 21 May 2020 Pseudo R-squ.: 0.4208
Time: 18:49:16 Log-Likelihood: -8.0299
converged: True LL-Null: -13.863
LLR p-value: 0.0006365
coef std err z P>|z| [0.025 0.975]
Intercept -4.0777 1.761 -2.316 0.021 -7.529 -0.626
Hours 1.5046 0.629 2.393 0.017 0.272 2.737
In [195]:
h = np.linspace(0, 6, 20)

ci = res.conf_int(0.05)
print(ci)
lower = ci[0]
upper = ci[1]
prob1 = 1 / (1 + np.exp(-(ci[0][0] + ci[0][1] * h)))
prob2 = 1 / (1 + np.exp(-(ci[1][0] + ci[1][1] * h)))

_ = plt.plot(h, prob1, 'r-')
_ = plt.plot(h, prob2, 'g-')
_ = plt.plot(exam['Hours'], res.predict(), 'o')
                  0         1
Intercept -7.529199 -0.626228
Hours      0.272375  2.736916

As you can see, the results are close to meaningless: pairing the lower limits of both parameters (and likewise the upper limits) ignores the correlation between the estimates, so the curves are not a valid confidence band for the predicted probabilities


Delta Approach

The delta method assumes the predicted probabilities are approximately normally distributed, and propagates the coefficient covariance matrix through the logistic function to get a standard error for each prediction
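A sketch of the reasoning behind the code below (my gloss, not from the lecture notes): with $\hat{p}_i$ the predicted probability for observation $x_i$,

$$\hat{p}_i = \frac{1}{1 + e^{-x_i^\top\hat\beta}}, \qquad \frac{\partial \hat{p}_i}{\partial \beta} = \hat{p}_i\,(1 - \hat{p}_i)\,x_i, \qquad \operatorname{se}(\hat{p}_i) \approx \sqrt{\hat{p}_i^2\,(1 - \hat{p}_i)^2\; x_i^\top \Sigma\, x_i}$$

where $\Sigma$ is the estimated covariance matrix of the coefficients (cov_params() in the code), and the 95% interval is $\hat{p}_i \pm 1.96\,\operatorname{se}(\hat{p}_i)$, clipped to $[0, 1]$.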

Run the regression again, using the object API

In [196]:
x = np.array(exam['Hours'])
X = sm.add_constant(x)
y = exam['Pass']

model = sm.Logit(y, X).fit()
proba = model.predict(X)

_ = plt.plot(x, proba, 'o', label='Prediction')
_ = plt.plot(x, y, 'r+', label='Actual')
_ = plt.legend(loc='center left')
Optimization terminated successfully.
         Current function value: 0.401494
         Iterations 7

Code taken from the reference above

In [197]:
# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (
    proba * (1 - proba) * X.T
).T  # matrix of gradients for each observation
std_errors = np.array(
    [np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient]
)
c = 1.96  # multiplier for confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))
In [198]:
_ = plt.plot(x, proba, 'o', label='Prediction')
_ = plt.plot(x, upper, 'g-', label='Upper 95% CI')
_ = plt.plot(x, lower, 'r-', label='Lower 95% CI')
_ = plt.legend(loc='upper left')

This time the results accord more with my intuition


Bootstrap Approach

In this approach we sample the original dataset with replacement, and get an empirical assessment of the spread in predictions. For each sample, we fit a model, and predict the response.

Some gotchas:

  • resampling a small logistic regression dataset can lead to PerfectSeparationError errors, where there is no overlap between the Pass and Fail subsets (see the sketch after this list)

  • we can get linear algebra warnings (I think a result of our small dataset)

  • statsmodels issues warnings about the models we subsequently throw away, so we have to turn these warnings off

We throw any bad sample datasets away, and move on. At the end we use numpy to get the 95% range of prediction values
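To illustrate the first gotcha (a hedged sketch of my own, not from the original notebook): a resample in which every failing student studied fewer hours than every passing student is perfectly separated, and the logistic MLE does not exist.

# Sketch: a perfectly separated sample. Depending on the statsmodels version,
# fitting this either raises PerfectSeparationError or warns and returns
# degenerate estimates.
sep = pd.DataFrame(
    {'Hours': [1.0, 1.5, 2.0, 4.0, 4.5, 5.0], 'Pass': [0, 0, 0, 1, 1, 1]}
)
try:
    sm.Logit(sep['Pass'], sm.add_constant(sep['Hours'])).fit(disp=0)
except Exception as err:
    print(type(err).__name__)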

In [199]:
# Turn off statsmodels warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")

    x = np.array(exam['Hours'])
    X = sm.add_constant(x)
    y = exam['Pass']

    # preds holds the predictions for each sub-sample
    preds = []

    # count holds successful fit run count
    count = 0
    for trial in range(1000):

        # get sub-sample
        exam_sample = exam.sample(frac=1, replace=True)

        # fit and predict
        try:
            x_sample = np.array(exam_sample['Hours'])
            X_sample = sm.add_constant(x_sample)
            y_sample = exam_sample['Pass']
            # disp=0 suppresses the spurious convergence message
            res_sample = sm.Logit(y_sample, X_sample).fit(
                disp=0
            )
            preds.append(res_sample.predict(X))
            count = count + 1
        except Exception:
            pass
        # end try
    # end for

    print(f'{count} samples were processed')

    # get array of predictions into numpy format
    predictions_array = np.array(preds)

    _ = plt.plot(
        exam['Hours'], exam['Pass'], 'r+', label='Actuals'
    )
    _ = plt.plot(
        exam['Hours'],
        res.predict(),
        'o',
        label='Predictions',
    )
    _ = plt.plot(
        exam['Hours'],
        np.percentile(predictions_array, 97.5, axis=0),
        'g-',
        label='95% CI limit',
    )
    _ = plt.plot(
        exam['Hours'],
        np.percentile(predictions_array, 2.5, axis=0),
        'r-',
        label='95% CI limit',
    )
    _ = plt.legend(loc='lower right')

# end with
976 samples were processed
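As a follow-up (my addition), we can also read off bootstrap intervals for the two students considered earlier; Hours = 2.0 and Hours = 4.0 sit at positions 7 and 14 of the exam data.

# Sketch: bootstrap 95% intervals for the predicted pass probability
# at Hours = 2.0 (row 7) and Hours = 4.0 (row 14)
for idx, hrs in [(7, 2.0), (14, 4.0)]:
    lo, hi = np.percentile(predictions_array[:, idx], [2.5, 97.5])
    print(f'Hours = {hrs}: 95% CI ({lo:.3f}, {hi:.3f})')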

An Introduction to Statistical Learning

The book "An Introduction to Statistical Learning" apparently assumes the log-odds are normal. Again, using code from the article referenced above (https://stackoverflow.com/questions/47414842/confidence-interval-of-probability-prediction-from-logistic-regression-statsmode), we get the mean, and single observation CIs

In [200]:
fit_mean = res.model.exog.dot(res.params)
fit_mean_se = (
    (
        res.model.exog
        * res.model.exog.dot(res.cov_params())
    ).sum(axis=1)
) ** 0.5
fit_obs_se = (
    (
        (res.model.endog - fit_mean).std(
            ddof=res.params.shape[0]
        )
    )
    ** 2
    + fit_mean_se ** 2
) ** 0.5

Visualizing the results

In [201]:
_ = plt.plot(x, 1 / (1 + np.exp(-fit_mean)), 'o')
_ = plt.plot(
    x,
    1 / (1 + np.exp(-fit_mean - 1.96 * fit_mean_se)),
    'g-',
    label='Mean CI 95%',
)
_ = plt.plot(
    x,
    1 / (1 + np.exp(-fit_mean + 1.96 * fit_mean_se)),
    'r-',
    label='Mean CI 95%',
)
_ = plt.plot(
    x,
    1 / (1 + np.exp(-fit_mean + 1.96 * fit_obs_se)),
    'r:',
    label='Observation CI 95%',
)
_ = plt.plot(
    x,
    1 / (1 + np.exp(-fit_mean - 1.96 * fit_obs_se)),
    'g:',
    label='Observation CI 95%',
)
_ = plt.legend(loc='lower right')

Flu Dataset

The STATS191 lecture references a dataset relating whether a person got a flu shot to their age and general health awareness

Read and Visualize Dataset

In [202]:
flu = pd.read_fwf('../data/flu.txt', widths=[7, 7, 7])
flu.columns = ['Shot', 'Age', 'Aware']
flu.sort_values(by='Shot', inplace=True)
In [203]:
flu.head()
Out[203]:
Shot Age Aware
0 0.0 38.0 40.0
22 0.0 53.0 32.0
23 0.0 42.0 42.0
48 0.0 45.0 35.0
25 0.0 42.0 32.0
In [204]:
sn.scatterplot('Age', 'Aware', data=flu, hue='Shot')
Out[204]:
<matplotlib.axes._subplots.AxesSubplot at 0x23f85aef2b0>
In [205]:
_ = plt.plot(flu['Age'], flu['Shot'], 'o')
In [206]:
_ = plt.plot(flu['Aware'], flu['Shot'], 'o')

Logistic Regression

In [207]:
res3 = logit('Shot ~ Age + Aware', data=flu).fit()
Optimization terminated successfully.
         Current function value: 0.324163
         Iterations 8
In [208]:
res3.summary()
Out[208]:
Logit Regression Results
Dep. Variable: Shot No. Observations: 50
Model: Logit Df Residuals: 47
Method: MLE Df Model: 2
Date: Thu, 21 May 2020 Pseudo R-squ.: 0.5235
Time: 18:49:21 Log-Likelihood: -16.208
converged: True LL-Null: -34.015
LLR p-value: 1.848e-08
coef std err z P>|z| [0.025 0.975]
Intercept -21.5846 6.418 -3.363 0.001 -34.164 -9.005
Age 0.2218 0.074 2.983 0.003 0.076 0.368
Aware 0.2035 0.063 3.244 0.001 0.081 0.326

Prediction for a 35 year old with health awareness 50 compared to a 45 year old with the same health awareness

In [209]:
examples = {'Age': [35, 45], 'Aware': [50, 50]}
res3.predict(examples)
Out[209]:
0    0.025406
1    0.193210
dtype: float64

We can calculate the probability of a flu shot explicitly from the model parameters, and the logistic function

In [210]:
age = 35
aware = 50

prob = 1.0 / (
    1.0
    + np.exp(
        -(
            res3.params[0]
            + res3.params[1] * age
            + res3.params[2] * aware
        )
    )
)
prob
Out[210]:
0.025405604440954365
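As an extra visualization (my addition, not from the lecture), we can plot the combinations of Age and Aware at which the fitted model gives a 50% probability of a flu shot (i.e. where the linear predictor is zero) over the earlier scatter plot.

# Sketch: the fitted 50% probability contour in the Age / Aware plane
ages = np.linspace(flu['Age'].min(), flu['Age'].max(), 50)
boundary = (
    -(res3.params['Intercept'] + res3.params['Age'] * ages)
    / res3.params['Aware']
)

_ = sn.scatterplot(x='Age', y='Aware', hue='Shot', data=flu)
_ = plt.plot(ages, boundary, 'k--', label='50% probability')
_ = plt.legend(loc='lower left')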

Generalized Linear Models

It is also possible to perform a Logistic Regression via the statsmodels Generalized Linear Model (GLM) API. Binomial here refers to the fact that we have two possible outcomes.
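Note that sm.families.Binomial() defaults to the logit link, so the GLM fit below is the same model as the earlier Logit fit. A quick check (my addition):

# The default link for the Binomial family is the logit
print(sm.families.Binomial().link)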

In [211]:
res4 = glm(
    'Shot ~ Age + Aware',
    data=flu,
    family=sm.families.Binomial(),
).fit()
res4.summary()
Out[211]:
Generalized Linear Model Regression Results
Dep. Variable: Shot No. Observations: 50
Model: GLM Df Residuals: 47
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -16.208
Date: Thu, 21 May 2020 Deviance: 32.416
Time: 18:49:21 Pearson chi2: 34.6
No. Iterations: 6 Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -21.5846 6.418 -3.363 0.001 -34.164 -9.005
Age 0.2218 0.074 2.983 0.003 0.076 0.368
Aware 0.2035 0.063 3.244 0.001 0.081 0.326

Predictions from the GLM fit are the same as those from the previous regression

In [212]:
examples = {'Age': [35, 45], 'Aware': [50, 50]}
res4.predict(examples)
Out[212]:
0    0.025406
1    0.193210
dtype: float64

Probit Regression

We can also perform Probit Regression in two ways, via the explicit probit function, or via a GLM call

In [213]:
res5 = probit('Shot ~ Age + Aware', data=flu).fit()
res5.summary()
Optimization terminated successfully.
         Current function value: 0.320758
         Iterations 7
Out[213]:
Probit Regression Results
Dep. Variable: Shot No. Observations: 50
Model: Probit Df Residuals: 47
Method: MLE Df Model: 2
Date: Thu, 21 May 2020 Pseudo R-squ.: 0.5285
Time: 18:49:21 Log-Likelihood: -16.038
converged: True LL-Null: -34.015
LLR p-value: 1.559e-08
coef std err z P>|z| [0.025 0.975]
Intercept -12.3504 3.284 -3.761 0.000 -18.787 -5.914
Age 0.1279 0.040 3.224 0.001 0.050 0.206
Aware 0.1164 0.033 3.568 0.000 0.052 0.180

To access Probit Regression via GLM, we have to specify the probit link function.

To get a list of the link functions supported by the Binomial family, see below (we need the second item of the list)

In [214]:
sm.families.Binomial.links
Out[214]:
[statsmodels.genmod.families.links.logit,
 statsmodels.genmod.families.links.probit,
 statsmodels.genmod.families.links.cauchy,
 statsmodels.genmod.families.links.log,
 statsmodels.genmod.families.links.cloglog,
 statsmodels.genmod.families.links.identity]
In [215]:
res6 = glm(
    'Shot ~ Age + Aware',
    data=flu,
    family=sm.families.Binomial(
        sm.families.Binomial.links[1]
    ),
).fit()
res6.summary()
Out[215]:
Generalized Linear Model Regression Results
Dep. Variable: Shot No. Observations: 50
Model: GLM Df Residuals: 47
Model Family: Binomial Df Model: 2
Link Function: probit Scale: 1.0000
Method: IRLS Log-Likelihood: -16.038
Date: Thu, 21 May 2020 Deviance: 32.076
Time: 18:49:21 Pearson chi2: 33.4
No. Iterations: 8 Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -12.3504 3.228 -3.826 0.000 -18.677 -6.023
Age 0.1279 0.039 3.289 0.001 0.052 0.204
Aware 0.1164 0.032 3.596 0.000 0.053 0.180

The two APIs give the same results

In [216]:
res5.predict(examples)
Out[216]:
0    0.019966
1    0.218923
dtype: float64
In [217]:
res6.predict(examples)
Out[217]:
0    0.019966
1    0.218923
dtype: float64

Reproducibility

In [218]:
%watermark -h -iv
%watermark
seaborn     0.9.0
matplotlib  3.0.2
numpy       1.15.4
statsmodels 0.9.0
pandas      1.0.0
scipy       1.1.0
host name: DESKTOP-SODFUN6
2020-05-21T18:49:21+10:00

CPython 3.7.1
IPython 7.2.0

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit
In [219]:
sm.show_versions()
INSTALLED VERSIONS
------------------
Python: 3.7.1.final.0

Statsmodels
===========

Installed: 0.9.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\statsmodels)

Required Dependencies
=====================

cython: 0.29.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\Cython)
numpy: 1.15.4 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\numpy)
scipy: 1.1.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy)
pandas: 1.0.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\pandas)
    dateutil: 2.7.5 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\dateutil)
patsy: 0.5.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\patsy)

Optional Dependencies
=====================

matplotlib: 3.0.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\matplotlib)
    backend: module://ipykernel.pylab.backend_inline 
cvxopt: Not installed
joblib: 0.13.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\joblib)

Developer Tools
================

IPython: 7.2.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\IPython)
    jinja2: 2.10.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\jinja2)
sphinx: 1.8.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\sphinx)
    pygments: 2.3.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\pygments)
nose: 1.3.7 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\nose)
pytest: 5.0.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages)
virtualenv: Not installed


