Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     908.8
Date:                Sat, 19 Aug 2023   Prob (F-statistic):           6.23e-41
Time:                        20:59:16   Log-Likelihood:                 1.5535
No. Observations:                  50   AIC:                             4.893
Df Residuals:                      46   BIC:                             12.54
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1162      0.083     61.378      0.000       4.948       5.284
x1             0.4776      0.013     37.151      0.000       0.452       0.503
x2             0.4516      0.051      8.937      0.000       0.350       0.553
x3            -0.0181      0.001    -16.064      0.000      -0.020      -0.016
==============================================================================
Omnibus:                        2.215   Durbin-Watson:                   2.464
Prob(Omnibus):                  0.330   Jarque-Bera (JB):                2.067
Skew:                          -0.478   Prob(JB):                        0.356
Kurtosis:                       2.722   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.66285433  5.1080339   5.51771993  5.86729919  6.14104125  6.33468292
  6.4561287   6.52415205  6.56531102  6.60958481  6.68544826  6.81519354
  7.01126759  7.27422713  7.5926472   7.94499824  8.30318332  8.63715443
  8.91984924  9.13163739  9.26354568  9.31873237  9.31196825  9.26720989
  9.2136636   9.18098693  9.19441601  9.27061937  9.41495929  9.6206108
  9.86968305 10.13615868 10.39016841 10.60289848 10.7513247  10.82199518
 10.81324053 10.73544808 10.60935527 10.46264259 10.3253857  10.22511413
 10.18228822 10.20693757 10.2970132  10.43872416 10.60880342 10.77833047
 10.91748298 11.00043647]

Create a new sample of explanatory variables Xnew, predict, and plot.

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.00060123 10.88147682 10.66077431 10.37885496 10.08884841  9.84364467
  9.68294478  9.62354068  9.65520388  9.74318979]
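
If we also want uncertainty around the out-of-sample predictions, the results object provides get_prediction, which returns point predictions together with confidence and prediction intervals. A minimal sketch (the 0.05 alpha level is an illustrative choice):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, standard error, and interval columns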

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fd6a6794f50>
[Figure: data points, true curve, and OLS prediction over the in-sample and out-of-sample range of x1]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we do not want any formula expansion magic from using **2.
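
To see that I(...) simply evaluates ordinary Python arithmetic, we can build the design column directly. A minimal sketch, assuming patsy (the formula engine statsmodels uses for these formulas):

from patsy import dmatrix

dm = dmatrix("I((x1 - 5) ** 2) - 1", data)  # "- 1" drops the automatic intercept column
print(np.allclose(np.asarray(dm).ravel(), (x1 - 5) ** 2))  # the column is plain arithmetic on x1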

[9]:
res.params
[9]:
Intercept           5.116151
x1                  0.477582
np.sin(x1)          0.451626
I((x1 - 5) ** 2)   -0.018132
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.000601
1    10.881477
2    10.660774
3    10.378855
4    10.088848
5     9.843645
6     9.682945
7     9.623541
8     9.655204
9     9.743190
dtype: float64
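
As a sanity check, the formula-based predictions should agree with the earlier array-based ones, since both fits describe the same model:

print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # expected to print True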