Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
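
Note that y includes a random draw, so the exact numbers in the outputs below will differ from run to run. A minimal way to make them reproducible is to seed NumPy's generator before sampling (the seed value here is arbitrary):

np.random.seed(12345)  # run before drawing y; any fixed seed works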

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     591.7
Date:                Sat, 30 Sep 2017   Prob (F-statistic):           9.79e-37
Time:                        07:23:25   Log-Likelihood:                -10.321
No. Observations:                  50   AIC:                             28.64
Df Residuals:                      46   BIC:                             36.29
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9791      0.106     47.106      0.000       4.766       5.192
x1             0.4941      0.016     30.310      0.000       0.461       0.527
x2             0.5002      0.064      7.806      0.000       0.371       0.629
x3            -0.0192      0.001    -13.417      0.000      -0.022      -0.016
==============================================================================
Omnibus:                        4.380   Durbin-Watson:                   2.526
Prob(Omnibus):                  0.112   Jarque-Bera (JB):                2.130
Skew:                           0.197   Prob(JB):                        0.345
Kurtosis:                       2.069   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
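
The warning reminds us that the reported standard errors assume the error covariance is correctly specified. If heteroskedasticity is a concern, a robust covariance can be requested at fit time; a minimal sketch (HC3 is one of several cov_type options fit accepts):

# Point estimates are unchanged; only std err / t / p are recomputed
# from the heteroskedasticity-robust covariance.
olsres_robust = sm.OLS(y, X).fit(cov_type="HC3")
print(olsres_robust.bse)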

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.49898581   4.97438734   5.41076889   5.78086856   6.06726317
   6.26523074   6.38352639   6.44294383   6.47289904   6.50659733
   6.57557774   6.70453133   6.90724448   7.18433382   7.52314493
   7.89983142   8.28327284   8.64018765   8.94060131   9.16277102
   9.29675797   9.34606026   9.32703822   9.26622638   9.195974     9.1491301
   9.15364653   9.22798551   9.37808637   9.59638972   9.86307942
  10.14933837  10.42208301  10.6493989   10.80578418  10.8763394
  10.85921551  10.76591763  10.61941481  10.45036624  10.29208408
  10.17506033  10.12195711  10.14388332  10.23856946  10.39074056
  10.57462575  10.75819239  10.9084089   10.9966718 ]
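
As a sanity check, predicting on the estimation design matrix is the same as reading the fitted values off the results object:

# In-sample predictions coincide with the stored fitted values.
print(np.allclose(ypred, olsres.fittedvalues))  # True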

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.99282248  10.85728691  10.60968206  10.29471254   9.97122532
   9.69780197   9.51841583   9.45166581   9.48622274   9.58360312]
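
Point predictions say nothing about their uncertainty. The results object also exposes get_prediction, whose summary frame includes confidence and prediction intervals; a short sketch:

pred = olsres.get_prediction(Xnew)
# mean_ci_* bounds the estimated mean; obs_ci_* bounds a new observation.
print(pred.summary_frame(alpha=0.05))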

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic from using **2.
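
To see what that expansion magic would be: inside a formula, patsy interprets ** as a term-set power that expands into main effects and interactions rather than arithmetic. A minimal sketch with hypothetical variables a and b:

from patsy import dmatrix

demo = {"a": np.array([1.0, 2.0, 3.0]), "b": np.array([2.0, 5.0, 7.0])}
# Formula power: (a + b)**2 expands to Intercept, a, b, a:b.
print(dmatrix("(a + b)**2", demo).design_info.column_names)
# Identity transform: the sum is squared numerically, one column.
print(dmatrix("I((a + b)**2)", demo).design_info.column_names)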

In [8]:
res.params
Out[8]:
Intercept           4.979073
x1                  0.494085
np.sin(x1)          0.500226
I((x1 - 5) ** 2)   -0.019203
dtype: float64
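
Up to naming, these are the same estimates as the array-based fit above, since the formula builds the same design matrix:

print(np.allclose(res.params.values, olsres.params))  # True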

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are built for us automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.992822
1    10.857287
2    10.609682
3    10.294713
4     9.971225
5     9.697802
6     9.518416
7     9.451666
8     9.486223
9     9.583603
dtype: float64
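
The new data can equally be passed as a pandas DataFrame; either way, the stored formula applies np.sin and the I() term to the new x1 column for us. A small sketch:

import pandas as pd

# Equivalent to the dict above: the design matrix for the new rows is
# rebuilt from the formula before predicting.
print(res.predict(pd.DataFrame({"x1": x1n})))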