Prediction (out of sample)
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
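The noise draws above are not seeded, so the numbers shown below will differ from run to run. A minimal sketch of a reproducible variant, using NumPy's Generator API with an arbitrary seed:

rng = np.random.default_rng(12345)  # arbitrary seed, chosen only for reproducibility
y = y_true + sig * rng.normal(size=nsample)  # same model, reproducible noise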
Estimation
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.983
Model: OLS Adj. R-squared: 0.982
Method: Least Squares F-statistic: 908.8
Date: Sat, 19 Aug 2023 Prob (F-statistic): 6.23e-41
Time: 20:59:16 Log-Likelihood: 1.5535
No. Observations: 50 AIC: 4.893
Df Residuals: 46 BIC: 12.54
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1162 0.083 61.378 0.000 4.948 5.284
x1 0.4776 0.013 37.151 0.000 0.452 0.503
x2 0.4516 0.051 8.937 0.000 0.350 0.553
x3 -0.0181 0.001 -16.064 0.000 -0.020 -0.016
==============================================================================
Omnibus: 2.215 Durbin-Watson: 2.464
Prob(Omnibus): 0.330 Jarque-Bera (JB): 2.067
Skew: -0.478 Prob(JB): 0.356
Kurtosis: 2.722 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
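Beyond the printed summary, the fitted results object exposes the same quantities programmatically; a short sketch using standard results attributes:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # 95% confidence intervals for the coefficients
print(olsres.rsquared)    # R-squared of the fit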
In-sample prediction
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.66285433 5.1080339 5.51771993 5.86729919 6.14104125 6.33468292
6.4561287 6.52415205 6.56531102 6.60958481 6.68544826 6.81519354
7.01126759 7.27422713 7.5926472 7.94499824 8.30318332 8.63715443
8.91984924 9.13163739 9.26354568 9.31873237 9.31196825 9.26720989
9.2136636 9.18098693 9.19441601 9.27061937 9.41495929 9.6206108
9.86968305 10.13615868 10.39016841 10.60289848 10.7513247 10.82199518
10.81324053 10.73544808 10.60935527 10.46264259 10.3253857 10.22511413
10.18228822 10.20693757 10.2970132 10.43872416 10.60880342 10.77833047
10.91748298 11.00043647]
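predict returns point predictions only. If standard errors and intervals are also needed, the results object's get_prediction method provides them; a minimal sketch:

pred = olsres.get_prediction(X)
print(pred.summary_frame(alpha=0.05).head())  # mean, mean_se, confidence and prediction intervals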
Create a new sample of explanatory variables Xnew, predict and plot
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.00060123 10.88147682 10.66077431 10.37885496 10.08884841 9.84364467
9.68294478 9.62354068 9.65520388 9.74318979]
Plot comparison
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fd6a6794f50>
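A band around the out-of-sample predictions can be added to the same axes; a sketch using get_prediction and Matplotlib's fill_between (the column names are those produced by summary_frame):

pred_new = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
ax.fill_between(x1n, pred_new["obs_ci_lower"], pred_new["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")
fig  # re-display the updated figure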

Predicting with Formulas
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 inside the formula.
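To inspect the columns that this formula actually generates, you can build the design matrix directly with patsy, the formula engine statsmodels uses; a small sketch:

from patsy import dmatrix
dm = dmatrix("x1 + np.sin(x1) + I((x1 - 5) ** 2)", data)
print(dm.design_info.column_names)  # Intercept, x1, np.sin(x1), I((x1 - 5) ** 2)
print(np.asarray(dm)[:3])           # first three rows of the design matrix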
[9]:
res.params
[9]:
Intercept 5.116151
x1 0.477582
np.sin(x1) 0.451626
I((x1 - 5) ** 2) -0.018132
dtype: float64
Now we only have to pass the single variable, and the transformed right-hand-side variables are constructed automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.000601
1 10.881477
2 10.660774
3 10.378855
4 10.088848
5 9.843645
6 9.682945
7 9.623541
8 9.655204
9 9.743190
dtype: float64
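The formula interface also works with get_prediction, so interval estimates for the new sample likewise require only the single variable; a brief sketch:

pred_formula = res.get_prediction(exog=dict(x1=x1n))
print(pred_formula.summary_frame(alpha=0.05))  # predictions with confidence and prediction intervals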