Stanford STATS191 in Python, Lecture 14 : Logistic Regression
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course, using Python ecosystem tools instead of R. This notebook covers the "Logistic" lecture (see https://web.stanford.edu/class/stats191/notebooks/Logistic.html)
Initial Notebook Setup¶
watermark documents the Python and package environment; black is my chosen Python formatter
%load_ext watermark
%reload_ext lab_black
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sn
import math
import warnings
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import logit
from statsmodels.formula.api import probit
from statsmodels.formula.api import glm
import statsmodels.api as sm
import os
Initial Exploration¶
In order to make sure that I understood the statsmodels API, I applied it to the example given in Wikipedia (https://en.wikipedia.org/wiki/Logistic_regression).
We have a dataset that relates hours studied to the PASS / FAIL result in an exam.
Note that the result (dependent variable) is a binary choice, and hence linear regression would not be an appropriate method.
Hours = [
0.50,
0.75,
1.00,
1.25,
1.50,
1.75,
1.75,
2.00,
2.25,
2.50,
2.75,
3.00,
3.25,
3.50,
4.00,
4.25,
4.50,
4.75,
5.00,
5.50,
]
Pass = [
0,
0,
0,
0,
0,
0,
1,
0,
1,
0,
1,
0,
1,
0,
1,
1,
1,
1,
1,
1,
]
Create a pandas dataframe:
exam_dict = {'Hours': Hours, 'Pass': Pass}
exam = pd.DataFrame(exam_dict)
exam.head()
Use the statsmodels formula API to perform Logistic Regression:
res = logit('Pass ~ Hours', data=exam).fit()
res.summary()
Do a quick plot to show the raw data and the results
_ = plt.plot(
exam['Hours'], res.predict(), 'o', label='Predicted'
)
_ = plt.plot(
exam['Hours'], exam['Pass'], 'r+', label='Data'
)
_ = plt.legend(loc='center left')
Compare the probability of passing for two students, one who studied for 2 hours and the other for 4 hours
xx = {'Hours': [2, 4]}
xx_df = pd.DataFrame(xx)
res.predict(xx_df)
We can also compute the probability by hand, applying the logistic function to the fitted linear predictor:
hours = 2
# reproduce res.predict by hand from the fitted parameters
prob = 1.0 / (
1.0 + np.exp(-(res.params.iloc[0] + res.params.iloc[1] * hours))
)
prob
The second variant of the summary function, summary2, presents essentially the same results in a different layout
res.summary2()
Probit¶
In order to analyse binary data, and to interpret the predictions as a probability / odds / likelihood of success or failure, we want a function that ranges from 0 to 1, has a steep transition, and has tractable mathematical properties. The usual choice is the logistic function.
Some areas of analysis use the Probit function instead: the Cumulative Distribution Function (CDF) of the standard normal distribution.
Logistic regression is more popular in health sciences like epidemiology, partly because its coefficients can be interpreted in terms of odds ratios. Probit models can be generalized to account for non-constant error variances in more advanced econometric settings (heteroskedastic probit models), and hence are used in some contexts by economists and political scientists.
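For reference, a quick self-contained sketch (not part of the lecture code) of the two S-curves: the logistic function and the standard normal CDF used by the probit model. On the same axis, the logistic curve has slightly heavier tails.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

z = np.linspace(-6, 6, 200)
logistic = 1.0 / (1.0 + np.exp(-z))  # the logistic (sigmoid) function
probit = norm.cdf(z)  # the standard normal CDF

_ = plt.plot(z, logistic, label='Logistic')
_ = plt.plot(z, probit, label='Probit (normal CDF)')
_ = plt.legend(loc='upper left')
```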
statsmodels supports Probit models:
res2 = probit('Pass ~ Hours', data=exam).fit()
res2.summary()
Display the predictions versus actual data
_ = plt.plot(exam['Hours'], res2.predict(), 'o')
_ = plt.plot(exam['Hours'], exam['Pass'], 'r+')
The difference between the Logistic and Probit models for the exam dataset is minimal
_ = plt.plot(
exam['Hours'], res.predict() - res2.predict(), 'r+'
)
_ = plt.axhline(0, color='black')
Confidence Intervals¶
There is a very good discussion of confidence intervals in the logit model at https://stackoverflow.com/questions/47414842/confidence-interval-of-probability-prediction-from-logistic-regression-statsmode. I was not able to reproduce the results of the R functions used in the lecture notes.
Approach 1¶
This is a very misguided approach: it assumes we can push the confidence limits of the linear-model parameters directly through the logistic function. The conf_int method returns the CI limits for these parameters.
res.summary()
h = np.linspace(0, 6, 20)
ci = res.conf_int(0.05)
print(ci)
lower = ci[0]  # lower limits for Intercept and Hours
upper = ci[1]  # upper limits for Intercept and Hours
prob1 = 1 / (1 + np.exp(-(lower['Intercept'] + lower['Hours'] * h)))
prob2 = 1 / (1 + np.exp(-(upper['Intercept'] + upper['Hours'] * h)))
_ = plt.plot(h, prob1, 'r-')
_ = plt.plot(h, prob2, 'g-')
_ = plt.plot(exam['Hours'], res.predict(), 'o')
As you can see, the results are close to meaningless: the CI limits on the individual parameters do not combine into a valid confidence band for the fitted curve
Delta Approach¶
The delta method treats the predicted probabilities as approximately normal, propagating the parameter covariance matrix through the gradient of the logistic function
Run the regression again, using the object API
x = np.array(exam['Hours'])
X = sm.add_constant(x)
y = exam['Pass']
model = sm.Logit(y, X).fit()
proba = model.predict(X)
_ = plt.plot(x, proba, 'o', label='Prediction')
_ = plt.plot(x, y, 'r+', label='Actual')
_ = plt.legend(loc='center left')
Code taken from the reference above
# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (
proba * (1 - proba) * X.T
).T # matrix of gradients for each observation
std_errors = np.array(
[np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient]
)
c = 1.96 # multiplier for confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))
_ = plt.plot(x, proba, 'o', label='Prediction')
_ = plt.plot(x, upper, 'g-', label='Upper 95% CI')
_ = plt.plot(x, lower, 'r-', label='Lower 95% CI')
_ = plt.legend(loc='upper left')
This time the results accord more with my intuition
Bootstrap Approach¶
In this approach we sample the original dataset with replacement, and get an empirical assessment of the spread in predictions. For each sample, we fit a model, and predict the response.
Some gotchas:
- sampling a small logistic-regression dataset can lead to PerfectSeparationError errors: the resampled Pass / Fail subsets may not overlap
- we can get linear algebra warnings (I think a result of our small dataset)
- statsmodels issues warnings about models we subsequently throw away, so we have to turn these warnings off

We throw any bad sample datasets away and move on. At the end we use numpy to get the 95% range of prediction values.
# Turn off statsmodels warnings
with warnings.catch_warnings():
warnings.filterwarnings("ignore")
x = np.array(exam['Hours'])
X = sm.add_constant(x)
y = exam['Pass']
# preds holds the predictions for each sub-sample
preds = []
# count holds successful fit run count
count = 0
for trial in range(1000):
# get sub-sample
exam_sample = exam.sample(frac=1, replace=True)
# fit and predict
try:
x_sample = np.array(exam_sample['Hours'])
X_sample = sm.add_constant(x_sample)
y_sample = exam_sample['Pass']
# disp=0 suppresses the spurious convergence message
res_sample = sm.Logit(y_sample, X_sample).fit(
disp=0
)
preds.append(res_sample.predict(X))
count = count + 1
except Exception:
pass
# end try
# end for
print(f'{count} samples were processed')
# get array of predictions into numpy format
predictions_array = np.array(preds)
_ = plt.plot(
exam['Hours'], exam['Pass'], 'r+', label='Actuals'
)
_ = plt.plot(
exam['Hours'],
res.predict(),
'o',
label='Predictions',
)
_ = plt.plot(
exam['Hours'],
np.percentile(predictions_array, 97.5, axis=0),
'g-',
label='95% CI limit',
)
_ = plt.plot(
exam['Hours'],
np.percentile(predictions_array, 2.5, axis=0),
'r-',
label='95% CI limit',
)
_ = plt.legend(loc='lower right')
# end with
An Introduction to Statistical Learning¶
The book "An Introduction to Statistical Learning" apparently assumes the log-odds are normally distributed. Again using code from the Stack Overflow article referenced above (https://stackoverflow.com/questions/47414842/confidence-interval-of-probability-prediction-from-logistic-regression-statsmode), we get confidence intervals for the mean response and for a single observation.
fit_mean = res.model.exog.dot(res.params)
fit_mean_se = (
(
res.model.exog
* res.model.exog.dot(res.cov_params())
).sum(axis=1)
) ** 0.5
fit_obs_se = (
(
(res.model.endog - fit_mean).std(
ddof=res.params.shape[0]
)
)
** 2
+ fit_mean_se ** 2
) ** 0.5
Visualizing the results
_ = plt.plot(x, 1 / (1 + np.exp(-fit_mean)), 'o')
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean - 1.96 * fit_mean_se)),
'g-',
label='Mean CI 95%',
)
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean + 1.96 * fit_mean_se)),
'r-',
label='Mean CI 95%',
)
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean + 1.96 * fit_obs_se)),
'r:',
label='Observation CI 95%',
)
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean - 1.96 * fit_obs_se)),
'g:',
label='Observation CI 95%',
)
_ = plt.legend(loc='lower right')
Flu Shot Data¶
We now move to the flu shot dataset from the lecture: for each subject we have their age, a health awareness score, and whether or not they received a flu shot.
flu = pd.read_fwf('../data/flu.txt', widths=[7, 7, 7])
flu.columns = ['Shot', 'Age', 'Aware']
flu.sort_values(by='Shot', inplace=True)
flu.head()
sn.scatterplot(x='Age', y='Aware', data=flu, hue='Shot')
_ = plt.plot(flu['Age'], flu['Shot'], 'o')
_ = plt.plot(flu['Aware'], flu['Shot'], 'o')
Logistic Regression¶
res3 = logit('Shot ~ Age + Aware', data=flu).fit()
res3.summary()
Prediction for a 35 year old with health awareness 50, compared to a 45 year old with the same health awareness:
examples = {'Age': [35, 45], 'Aware': [50, 50]}
res3.predict(examples)
We can calculate the probability of a flu shot explicitly from the model parameters, and the logistic function
age = 35
aware = 50
# apply the logistic function to the fitted linear predictor
prob = 1.0 / (
1.0
+ np.exp(
-(
res3.params.iloc[0]
+ res3.params.iloc[1] * age
+ res3.params.iloc[2] * aware
)
)
)
prob
Generalized Linear Models¶
It is also possible to perform a Logistic Regression via the statsmodels Generalized Linear Model (GLM) API. Binomial here refers to the fact that we have two possible outcomes.
res4 = glm(
'Shot ~ Age + Aware',
data=flu,
family=sm.families.Binomial(),
).fit()
res4.summary()
Predictions are the same as in the previous regression:
examples = {'Age': [35, 45], 'Aware': [50, 50]}
res4.predict(examples)
Probit Regression¶
We can also perform Probit Regression in two ways: via the explicit probit function, or via a GLM call.
res5 = probit('Shot ~ Age + Aware', data=flu).fit()
res5.summary()
To access Probit Regression via GLM, we have to specify the probit link function.
To get a list of supported link functions in the Binomial family, see below (we need the second item of the list):
sm.families.Binomial.links
res6 = glm(
'Shot ~ Age + Aware',
data=flu,
family=sm.families.Binomial(
sm.families.Binomial.links[1]  # links[1] is the probit link
),
).fit()
res6.summary()
The two APIs give the same results
res5.predict(examples)
res6.predict(examples)
Reproducibility¶
%watermark -h -iv
%watermark
sm.show_versions()