Prediction (out of sample)

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
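
Note that the noise is drawn without a fixed seed, so the estimates and predictions below will vary slightly from run to run. Seeding NumPy before the draw (the value 12345 here is arbitrary) makes the example reproducible:

np.random.seed(12345)  # fix the RNG before drawing the noise term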

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     1144.
Date:                Wed, 20 May 2015   Prob (F-statistic):           3.40e-43
Time:                        21:54:25   Log-Likelihood:                 6.3967
No. Observations:                  50   AIC:                            -4.793
Df Residuals:                      46   BIC:                             2.855
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          4.9942      0.076     66.009      0.000         4.842     5.146
x1             0.4894      0.012     41.941      0.000         0.466     0.513
x2             0.5587      0.046     12.179      0.000         0.466     0.651
x3            -0.0189      0.001    -18.405      0.000        -0.021    -0.017
==============================================================================
Omnibus:                        2.461   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.292   Jarque-Bera (JB):                2.149
Skew:                          -0.397   Prob(JB):                        0.342
Kurtosis:                       2.368   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
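
Beyond the printed summary, the fitted results object exposes these quantities programmatically. A minimal sketch using standard RegressionResults attributes:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.rsquared)    # R-squared
print(olsres.conf_int())  # confidence intervals (95% by default)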

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.5228   5.0181   5.4707   5.8502   6.137    6.3259   6.4267   6.4627
   6.4665   6.4755   6.5246   6.6417   6.842    7.1262   7.4802   7.8773
   8.2829   8.6597   8.9743   9.2023   9.3327   9.3694   9.3307   9.2462
   9.1519   9.0847   9.0757   9.1456   9.3009   9.5333   9.8206  10.1305
  10.4263  10.6722  10.8397  10.9122  10.8877  10.779   10.6117  10.4201
  10.2413  10.1096  10.05    10.0752  10.1824  10.3547  10.5633  10.7725
  10.9452  11.0496]
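
For in-sample values, predict(X) is equivalent to the fittedvalues attribute that the results object computes at fit time; a quick check:

print(np.allclose(ypred, olsres.fittedvalues))  # should print True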

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 11.0536  10.9118  10.6462  10.3066   9.9588   9.6683   9.484    9.4268
   9.4841   9.6128]
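
Point predictions alone say nothing about uncertainty. One way to get prediction intervals here is the sandbox helper wls_prediction_std, which returns the prediction standard error together with the lower and upper interval bounds (alpha=0.05, i.e. 95%, by default); a minimal sketch:

from statsmodels.sandbox.regression.predstd import wls_prediction_std

# prediction standard error and 95% interval bounds at the new points
prstd, iv_l, iv_u = wls_prediction_std(olsres, exog=Xnew)
print(iv_l)
print(iv_u)

These bounds can be overlaid as extra lines on the comparison plot below.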

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I to indicate use of the identity transform, i.e., we don't want any formula expansion magic from using **2.
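
To see what I() buys us, we can build the design matrix directly with patsy, the formula engine behind statsmodels.formula.api; a minimal sketch (patsy is available wherever the formula API works):

from patsy import dmatrix

# Inside I(...), the expression is evaluated as plain Python/NumPy
# arithmetic, yielding a single column equal to (x1 - 5)**2. Without
# I(), patsy would parse ** and - as formula operators instead.
dm = dmatrix("x1 + np.sin(x1) + I((x1 - 5)**2)", data)
print(dm.design_info.column_names)

The column names should match the parameter index shown in the next cell.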

In [8]:
res.params
Out[8]:
Intercept           4.994189
x1                  0.489395
np.sin(x1)          0.558671
I((x1 - 5) ** 2)   -0.018856
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are computed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 11.0536,  10.9118,  10.6462,  10.3066,   9.9588,   9.6683,
         9.484 ,   9.4268,   9.4841,   9.6128])
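
As a sanity check, the formula-based predictions should agree with the matrix-based ones computed earlier:

print(np.allclose(ynewpred, res.predict(exog=dict(x1=x1n))))  # should print True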
