SciPy and Statistics Examples¶
As a familiarisation exercise, this is a walkthrough of the material from https://scipy-lectures.org/packages/statistics/index.html
There is some added value in that I have explored different ways of visualising the data being analysed.
Initial Notebook Setup¶
watermark
self-documents the environment.
%load_ext watermark
black
is my preferred Python formatter
%load_ext lab_black
%matplotlib inline
All imports will go here
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import (
wls_prediction_std,
)
Data Load¶
The initial dataset contains three different IQ measures, and data on subject weight, height, and brain size (as per MRI)
DATA_PREFIX = '../data/'
data = pd.read_csv(
DATA_PREFIX + 'brain_size.csv', na_values=".", sep=';'
)
Data Exploration¶
Display the first few rows
data.head()
Display the data types in our dataset
data.dtypes
Use describe
to give a snapshot of each column that is numeric
data.describe()
Let us explore the difference gender makes. We use the pandas
groupby
function, and then show the snapshot of each group
gender = data.groupby('Gender')
We see that we have a new type of object
gender
gender.describe()
gender.mean()
gender.size()
Display the first four rows of each Gender
gender.head(4)
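As an aside (not part of the original lecture material), a groupby object also accepts an explicit list of aggregations; a minimal sketch for the VIQ column:
# mean, standard deviation and group size of VIQ for each gender
gender['VIQ'].agg(['mean', 'std', 'count'])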
Data Visualization¶
We use the pandas
plotting functions to quickly show the weights of each subject, color-coded by gender.
gender['Weight'].plot(style=['o', '+'])
Show the histogram of weights, grouped by gender
gender['Weight'].hist()
Show a boxplot with IQ scores, grouped by gender
gender.boxplot(column=['FSIQ', 'PIQ', 'VIQ'])
Show a scatter plot of the non-IQ numeric data (one plot for each pair of variables). As we might expect, aspects of body size appear to be positively correlated.
__ = pd.plotting.scatter_matrix(
data[['Weight', 'Height', 'MRI_Count']]
)
Show a scatter plot for IQ measures (pairwise). Different measures of IQ also appear to be positively correlated
__ = pd.plotting.scatter_matrix(
data[['FSIQ', 'PIQ', 'VIQ']]
)
We first clean the data by dropping all rows with a NA value, then create a new numeric column (1 for Male, 2 for Female), and then re-do the scatter plot as before, but coloring each point with a gender-specific color. We set alpha
to be almost one to get brighter colors.
clean_data = data.dropna()
sex_code = {'Male': 1, 'Female': 2}
sex_color = [sex_code[x] for x in clean_data['Gender']]
__ = pd.plotting.scatter_matrix(
clean_data[['Weight', 'Height', 'MRI_Count']],
c=sex_color,
cmap='rainbow',
alpha=0.95,
)
The same treatment for the IQ measures, this time with the default alpha
__ = pd.plotting.scatter_matrix(
clean_data[['FSIQ', 'PIQ', 'VIQ']],
c=sex_color,
cmap='rainbow',
)
Almost too small at the default size, but showing the relationships between all numeric variables. FSIQ seems to have a gap in its values; if this were a real data analysis job, we would want to investigate that.
__ = pd.plotting.scatter_matrix(
clean_data[
[
'FSIQ',
'PIQ',
'VIQ',
'Height',
'Weight',
'MRI_Count',
]
],
c=sex_color,
cmap='rainbow',
)
Statistical Tests¶
Test of means > 0¶
As a finger exercise, test if any of the variables have a mean of zero (we expect none). As expected, the probability of seeing the data we see, with a true mean of zero, is vanishingly small: we reject the mean=0 hypothesis.
We try VIQ
first
stats.ttest_1samp(data['VIQ'], 0)
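The returned object behaves like a (statistic, pvalue) tuple, so the two values can be unpacked directly if we want to report them separately; a small sketch:
# unpack the test statistic and p-value from the result object
t_stat, p_value = stats.ttest_1samp(data['VIQ'], 0)
print(f'statistic = {t_stat:.2f}, p-value = {p_value:.2e}')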
Check all the variables
for name in [
'Weight',
'Height',
'VIQ',
'PIQ',
'FSIQ',
'MRI_Count',
]:
print(
f'T Test values for {name:20}: {stats.ttest_1samp(clean_data[name], 0)}'
)
# end for
Test for gender difference¶
We now test for gender-related differences in the numeric values. As expected, body-related variables (height, weight, MRI count) show differences too large to be explained by chance. The IQ measures, not so. Assuming unequal variance in the populations is a more stringent test (in the sense that the T values are smaller, and hence more likely to occur by chance, and hence we are less likely to reject the Null Hypothesis that the means are equal).
for name in [
'Weight',
'Height',
'VIQ',
'PIQ',
'FSIQ',
'MRI_Count',
]:
f_values = clean_data[clean_data['Gender'] == 'Female'][
name
]
m_values = clean_data[clean_data['Gender'] == 'Male'][
name
]
print(
f'2 Sample T Test (Welch’s t-test) values for {name:10}: {stats.ttest_ind(f_values, m_values, equal_var=False)}'
)
# end for
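As a quick aside (not in the lecture material), we can eyeball how different the per-gender spreads actually are, which is what motivates the choice between Welch's test and the equal-variance test:
# per-gender standard deviations of the numeric columns
clean_data.groupby('Gender')[
    ['Height', 'Weight', 'MRI_Count', 'VIQ', 'PIQ', 'FSIQ']
].std()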
We perform the same test, this time assuming equal population variance. The conclusions are the same, even though the T values differ slightly.
for name in [
'Weight',
'Height',
'VIQ',
'PIQ',
'FSIQ',
'MRI_Count',
]:
f_values = clean_data[clean_data['Gender'] == 'Female'][
name
]
m_values = clean_data[clean_data['Gender'] == 'Male'][
name
]
print(
f'2 Sample T Test (equal variance assumed) values for {name:10}: {stats.ttest_ind(f_values, m_values, equal_var=True)}'
)
# end for
Test for difference in measures¶
We now test for differences in the measures of IQ. The greatest difference is between PIQ and FSIQ, but not enough to reject the Null Hypothesis at the 5% level. We also use the Wilcoxon test to test for differences in the underlying distribution, from which we assume each IQ measure is drawn. Again, we cannot reject the Null Hypothesis that all distributions are equal.
stats.ttest_rel(data['FSIQ'], data['PIQ'])
stats.ttest_1samp(data['FSIQ'] - data['PIQ'], 0)
stats.ttest_rel(data['FSIQ'], data['VIQ'])
stats.ttest_rel(data['PIQ'], data['VIQ'])
stats.wilcoxon(data['FSIQ'], data['PIQ'])
stats.wilcoxon(data['FSIQ'], data['VIQ'])
stats.wilcoxon(data['VIQ'], data['PIQ'])
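The three pairwise Wilcoxon tests above can also be run in a compact loop; a sketch using itertools.combinations:
from itertools import combinations

# run the Wilcoxon signed-rank test for every pair of IQ measures
for a, b in combinations(['FSIQ', 'PIQ', 'VIQ'], 2):
    print(f'{a} vs {b}: {stats.wilcoxon(data[a], data[b])}')
# end for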
Test for means different¶
To show an example where large sample sizes allow more precise testing, we draw 1,000 samples from two distributions that visually look to have the same mean. However, the T test allows us to reject the Null Hypothesis at the 5% level (the difference in observed means for this experiment would only happen about 0.1% of the time if the true means were equal).
rvs1 = stats.norm.rvs(loc=5, scale=10, size=1000)
rvs5 = stats.norm.rvs(loc=8, scale=20, size=1000)
stats.ttest_ind(rvs1, rvs5, equal_var=False)
Display the two histograms of sample values
_ = plt.hist(rvs1)
_ = plt.hist(rvs5, alpha=0.3)
rvs1.mean(), rvs5.mean()
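As an aside, with far fewer samples the same underlying difference in means is usually not detectable; a sketch (the exact p-value will vary from run to run):
# with only 50 samples per group, the 3-point difference in means is
# usually swamped by sampling noise, and the test often fails to
# reject the Null Hypothesis
small1 = stats.norm.rvs(loc=5, scale=10, size=50)
small2 = stats.norm.rvs(loc=8, scale=20, size=50)
stats.ttest_ind(small1, small2, equal_var=False)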
Linear Regression¶
Using R style symbolic equations, we create a linear dataset, and visualize it (this time using matplotlib
). In this case we are using the package statsmodels
, and not scipy.stats
.
x = np.linspace(-5, 5, 20)
np.random.seed(1)
y = -5 + 3 * x + 4 * np.random.normal(size=x.shape)
y_no_error = -5 + 3 * x
plt.plot(x, y, 'o')
We now create a pandas dataframe from the x,y data, and declare a model that says y
is a linear function of x
. Note that we do not have to declare a constant in our model, for this variant of Ordinary Least Squares fitting.
data = pd.DataFrame({'x': x, 'y': y})
model = ols('y ~ x', data).fit()
model.summary()
Just for fun, I plotted the four lines you get when taking the 5% extremes for intercept and slope.
y1 = 2.220 * x - 7.71
y2 = 3.654 * x - 3.357
y3 = 2.220 * x - 3.357
y4 = 3.654 * x - 7.71
plt.plot(x, y1, '-')
plt.plot(x, y2, '-')
plt.plot(x, y3, '-')
plt.plot(x, y4, '-')
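Rather than hard-coding the endpoints read off the summary table, the same numbers can be pulled out programmatically; a sketch using the conf_int method (which defaults to a 95% interval):
# rows are the Intercept and x coefficients; the two columns are the
# lower and upper bounds of the 95% confidence interval
ci = model.conf_int()
ci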
summary2
is an experimental results summary function. Nothing looks markedly different.
model.summary2()
Show a plot of actual versus predicted values.
model.params
model.predict({'x': [0]})
yhat = model.predict({'x': x})
plt.plot(x, y, 'o')
plt.plot(x, yhat, '+-')
Use explicit OLS¶
As an alternative to R style model specification, the OLS
class can be called directly, but in this case if we want a non-zero intercept, we have to add a constant 1 to our input data arrays.
x_const_added = sm.add_constant(x, prepend=False)
model2 = sm.OLS(y, x_const_added)
res = model2.fit()
We get exactly the same results, and visualization.
res.summary()
yhat2 = res.predict()
yhat2
plt.plot(x, y, 'o')
plt.plot(x, yhat2, '+-')
I thought I would lift my game, and prepare a more professional graphic, at the expense of more code.
I create a matplotlib
figure and axes object. In the axes object I plot the raw data, the predicted data, and the underlying linear model. I create a grid, and move the X and Y axes to run through (0,0) (and not along the edges of the graph). I finally plot the 95% prediction intervals for values predicted with the equation we have derived from the model.
prstd, iv_l, iv_u = wls_prediction_std(res)
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(x, y, 'o', label='data')
ax.plot(x, yhat2, 'r-', label='OLS')
ax.plot(x, y_no_error, 'g-', label='True')
# ax.axvline(0, c='black')
# ax.axhline(0, c='black')
ax.grid()
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(x, iv_l, 'r--', label='95% Pred Int (l)')
ax.plot(x, iv_u, 'r--', label='95% Pred Int (u)')
ax.legend(loc='best')
Here we show some of the attributes of the RegressionResults object.
res.predict([0, 1])
res.rsquared
res.rsquared_adj
res.conf_int()
gp = res.get_prediction([[0, 1], [10, 1]])
gp.conf_int()
We now create a graphic showing the model confidence limits. We first create x2
, a linearly spaced array spanning (and extending beyond) our original data range. We add a constant 1 to this array, and use this to get the predictions of the output for this input array. The PredictionResults object returned by the call to get_prediction
can be used (by calling the method conf_int
) to obtain the model confidence intervals.
x2 = np.linspace(-10, 10, 21)
x2
x2_enh = sm.add_constant(x2, prepend=False)
x2_enh[0:5]
gp = res.get_prediction(x2_enh)
ci = gp.conf_int(alpha=0.01)
ci[0:5]
upper = [z[1] for z in ci]
lower = [z[0] for z in ci]
prstd, iv_l, iv_u = wls_prediction_std(res)
In the graph below, note that we are more certain about the line we draw (especially near the centre) than we are about any prediction based on that line. Even if the line were exact (and the more data points we have, the better it is), the unavoidable error term in our model would still smear the observed values around the linear equation.
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(x, y, 'o', label='data')
ax.plot(x, yhat2, 'r-', label='OLS')
ax.plot(x, y_no_error, 'g-', label='True')
ax.plot(x2, upper, 'r--', label='99% Conf Int (u)')
ax.plot(x2, lower, 'r--', label='99% Conf Int (l)')
ax.plot(x, iv_l, 'y-', label='95% Pred Int (l)')
ax.plot(x, iv_u, 'y-', label='95% Pred Int (u)')
ax.grid()
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(loc='best')
type(gp)
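To make the "more certain about the centre" point explicit, a small sketch plotting the width of the 99% model confidence band against x2: it is narrowest near the centre of the data and widens towards the extremes.
# width of the 99% confidence band at each x2 value
band_width = np.array(upper) - np.array(lower)
plt.plot(x2, band_width, 'o-')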
Nonlinear Regression¶
We now move to an example of non-linear regression, where we concoct some data using a combination of linear, quadratic and sine functions.
nsample = 100
sig = 0.5
x = np.linspace(0, 20, nsample)
x2 = np.sin(x)
x3 = (x - 5) ** 2
x4 = np.ones(nsample)
Display the components that make up our data
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(x, x * 0.5)
ax.plot(x, x2 * 0.5)
ax.plot(x, x3 * (-0.02))
ax.plot(x, x4 * 5.0)
ax.grid()
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Create the combination of functions, and add Gaussian noise.
x5 = np.column_stack((x, x2, x3, x4))
beta = [0.5, 0.5, -0.02, 5.0]
y_true = np.dot(x5, beta)
y = y_true + sig * np.random.normal(size=nsample)
Quick and dirty plot.
plt.plot(x, y, 'o')
Now fit the functions to the observed results, using Ordinary Least Squares.
res = sm.OLS(y, x5).fit()
res.summary()
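The fitted coefficients can be compared directly with the true beta values used to generate the data; a quick sketch:
# estimated coefficient vs true value for each of x, sin(x),
# (x - 5)**2 and the constant term
list(zip(res.params, beta))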
Show the results: the fit is very good!
fig, ax = plt.subplots(figsize=(10, 10))
yhat = res.predict()
ax.plot(x, y, 'o')
ax.plot(x, yhat, 'r-')
ax.plot(x, y_true, 'g-')
ax.grid()
As before, create a graphic showing the model and prediction confidence intervals.
gp2 = res.get_prediction(x5)
ci2 = gp2.conf_int(alpha=0.01)
upper = [z[1] for z in ci2]
lower = [z[0] for z in ci2]
prstd, iv_l, iv_u = wls_prediction_std(res)
fig, ax = plt.subplots(figsize=(10, 10))
yhat = res.predict()
ax.plot(x, y, 'o', label='Data')
ax.plot(x, yhat, 'r-', label='OLS estimate')
ax.plot(x, y_true, 'g-', label='Y True')
ax.plot(x, upper, 'r--', label='Confidence Int (upper)')
ax.plot(x, lower, 'r--', label='Confidence Int (lower)')
ax.plot(x, iv_u, 'y-', label='Pred. Int (upper)')
ax.plot(x, iv_l, 'y-', label='Pred. Int (lower)')
ax.legend(loc='best')
ax.grid()
# ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Multivariate Regression¶
Iris Data Set¶
We look at the famous Iris flower dataset, and investigate the idea that there is a linear relationship between sepal width and petal length across all three species.
iris = pd.read_csv(DATA_PREFIX + 'iris.csv')
iris.head(4)
Fit the OLS model, and show the results summary. We see that the model parameters are unlikely to have been observed by chance in a world with no true relationship.
model = ols('sepal_width ~ name + petal_length', iris).fit()
print(model.summary())
Create a color code in (R,G,B) format for each species (I am cheating as I know that there are only three species)
categories = pd.Categorical(iris['name'])
lut = {0: (1, 0, 0), 1: (0, 1, 0), 2: (0, 0, 1)}
species_color = [lut[code] for code in categories.codes]
Create a graphic, with each species' data points color coded. We see the best OLS model for a line that explains sepal width for each species, with a common slope and a different intercept per species. Versicolor and Virginica look OK, but the Setosa data seems to have no such linear relation. The R2 value is quite low, showing that there is a lot of variation in sepal width not explained by our model (or the data is very noisy).
fig, ax = plt.subplots(figsize=(10, 10))
yhat = model.predict()
categories = pd.Categorical(iris['name'])
for name, color in zip(
['setosa', 'versicolor', 'virginica'], ['g', 'b', 'c']
):
ax.plot(
iris[iris['name'] == name]['petal_length'],
iris[iris['name'] == name]['sepal_width'],
'o',
label='Data: ' + name,
c=color,
)
# end for
ax.plot(
iris['petal_length'], yhat, 'ro', label='OLS estimate'
)
ax.legend(loc='best')
ax.grid()
ax.set_xlabel('Petal Length')
ax.set_ylabel('Sepal Width')
We can explore this by using the pandas
plotting facilities to get a quick and dirty graphic showing the relationship between numeric variables.
fig, ax = plt.subplots(figsize=(10, 10))
categories = pd.Categorical(iris['name'])
__ = pd.plotting.scatter_matrix(
iris,
c=categories.codes,
ax=ax,
marker='o',
# cmap='jet',
alpha=0.95,
)
We can use the F test to check that our parameters of the model are unlikely to have happened by chance.
model.f_test([0, 1, -1, 0])
dict(zip(categories.codes, iris['name']))
A = np.identity(len(model.params))
A = A[1:, :]
# This tests that each coefficient is jointly statistically
# significantly different from zero.
print(model.f_test(A))
A
model.fvalue, model.f_pvalue
categories.describe()
We can use Seaborn to get a better graphic (with a legend this time! What were you thinking pandas
, in not having a legend?)
_ = sn.pairplot(
iris,
vars=[
'sepal_length',
'sepal_width',
'petal_length',
'petal_width',
],
kind='scatter',
hue='name',
diag_kind='hist',
)
When we repeat our analysis on just the Setosa data points, we get a linear effect that would be seen roughly 1 in every 5 experiments just by random chance, assuming no true linear effect. We would be justified in retaining the Null Hypothesis that, in the Setosa species, sepal width is independent of petal length.
setosa = iris[iris['name'] == 'setosa']
model = ols('sepal_width ~ petal_length', setosa).fit()
model.summary()
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(setosa['petal_length'], setosa['sepal_width'], 'o')
yhat = model.predict(setosa['petal_length'])
ax.plot(setosa['petal_length'], yhat, 'r-')
IQ Data Set¶
Here, we go back to our IQ dataset, and look to see which variables are correlated with VIQ. Unsurprisingly, we are warned that there is a degree of correlation between our independent variables: this means that the inverse of the (X.T @ X) matrix (where X is our matrix of independent variable observations) might have significant errors, due to round-off. Most of the apparent correlations would be rejected at the 5% level.
clean_data.head(4)
model = ols(
'VIQ ~ Height + Weight + MRI_Count + C(Gender) -1',
clean_data,
).fit()
print(model.summary())
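To see the collinearity the summary warns about, a quick sketch (an aside, not part of the lecture material) of the pairwise correlations between the body-size regressors:
# Height, Weight and MRI_Count are correlated with each other, which
# is what makes the design matrix ill-conditioned
clean_data[['Height', 'Weight', 'MRI_Count']].corr()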
_ = sn.pairplot(
clean_data,
vars=['VIQ', 'Height', 'Weight', 'MRI_Count'],
kind='scatter',
hue='Gender',
diag_kind='hist',
)
Using the F test, we find no support for a true gender difference in VIQ (we cannot reject the Null Hypothesis that the two gender means are equal)
model.f_test([1, -1, 0, 0, 0])
model.f_test([-1, +1, 0, 0, 0])
Just for fun, it appears that Weight
is strongly correlated with Height
, and less strongly with Gender.
model = ols(
'Weight ~ Height + C(Gender) + 1 ', clean_data
).fit()
model.summary()
model = ols('Weight ~ Height + 1 ', clean_data).fit()
model.summary()
Test for IQ Gender Differences¶
If we look at a model where we try to explain all the variation in VIQ
, we get a result (F Test) that suggests that what we see would happen about 50% of the time, assuming no gender differences.
model = ols('VIQ ~ C(Gender) - 1 ', clean_data).fit()
model.summary()
model.f_test([1, -1])
We now switch to a new dataset for the remaining regression examples: the CPS 1985 wages data (CPS_85_Wages.txt), for which we have to supply the column names ourselves.
header = [
'EDUCATION',
'SOUTH',
'SEX',
'EXPERIENCE',
'UNION',
'WAGE',
'AGE',
'RACE',
'OCCUPATION',
'SECTOR',
'MARR',
]
DATA_PREFIX = '../data/'
longly = pd.read_csv(
DATA_PREFIX + 'CPS_85_Wages.txt',
sep='\t',
skiprows=26,
skipfooter=6,
header=None,
names=header,
)
longly.head()
longly.tail()
Visualizing Regression Linear Models by Seaborn¶
Seaborn customization includes:
setting the diagonal histograms to have alpha=0.5, so that the overlapping histograms are not obscured
setting the off diagonal regression marker size to a smaller-than-default value, for clarity
sn.pairplot(
longly,
vars=['WAGE', 'AGE', 'EDUCATION'],
kind='reg',
hue='SEX',
diag_kind='hist',
plot_kws={'scatter_kws': {'alpha': 0.5, 's': 5}},
diag_kws={'alpha': 0.5},
)
sn.lmplot(y='WAGE', x='EDUCATION', data=longly, hue='SEX')
Comparing Ordinary Linear Regression, and Robust Linear Regression¶
wage = longly['WAGE']
education = longly['EDUCATION']
education = sm.add_constant(education)
res1 = sm.RLM(wage, education).fit()
print(res1.summary())
res2 = sm.OLS(wage, education).fit()
print(res2.summary())
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
x = education['EDUCATION']
ax.plot(x, wage, 'o', label='Data')
ax.plot(x, res1.fittedvalues, 'r-', label='RLM fit')
ax.plot(x, res2.fittedvalues, 'g-', label='OLS fit')
ax.grid()
ax.set_xlabel('Education')
ax.set_ylabel('Wage ($/hour)')
# ax.spines['bottom'].set_position('zero')
# ax.spines['left'].set_position('zero')
# ax.spines['top'].set_visible(False)
# ax.spines['right'].set_visible(False)
ax.legend(loc='best')
education.head()
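One way to see what the robust fit is doing is to inspect the weights it assigned to each observation; a sketch (res1 is the RLM fit from above): points treated as outliers get weights well below 1.
# the smallest weights mark the observations that the robust fit
# effectively down-weighted as outliers
pd.Series(res1.weights).sort_values().head(10)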
Testing for an interaction between EDUCATION
and SEX
, I think the conclusion is that WAGE
strongly depends upon EDUCATION
, less strongly on SEX
(females earn less); there is an EDUCATION * SEX interaction, but it is not strongly significant.
res = ols(
formula='WAGE ~ EDUCATION + SEX + EDUCATION * SEX',
data=longly,
).fit()
print(res.summary())
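We can also ask whether the interaction term earns its keep by comparing this model against the simpler no-interaction model with an F test on the nested pair; a sketch using anova_lm (res_simple is a name introduced here):
from statsmodels.stats.anova import anova_lm

# nested-model comparison: does adding EDUCATION:SEX significantly
# improve on EDUCATION + SEX alone?
res_simple = ols(
    formula='WAGE ~ EDUCATION + SEX', data=longly
).fit()
anova_lm(res_simple, res)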
Reproducibility¶
%watermark -h -iv
%watermark