Prediction (out of sample)ΒΆ

Link to Notebook GitHub

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1059.
Date:                Wed, 27 Apr 2016   Prob (F-statistic):           1.94e-42
Time:                        01:52:12   Log-Likelihood:                 4.1650
No. Observations:                  50   AIC:                           -0.3299
Df Residuals:                      46   BIC:                             7.318
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          4.9818      0.079     62.970      0.000         4.823     5.141
x1             0.4867      0.012     39.889      0.000         0.462     0.511
x2             0.4739      0.048      9.880      0.000         0.377     0.570
x3            -0.0183      0.001    -17.044      0.000        -0.020    -0.016
==============================================================================
Omnibus:                        0.796   Durbin-Watson:                   2.483
Prob(Omnibus):                  0.672   Jarque-Bera (JB):                0.293
Skew:                          -0.157   Prob(JB):                        0.864
Kurtosis:                       3.205   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.5253   4.9835   5.4047   5.7632   6.0423   6.2376   6.3574   6.4214
   6.4575   6.4971   6.5701   6.6999   6.8996   7.1697   7.4982   7.8626
   8.2334   8.579    8.8711   9.089    9.2234   9.2775   9.2669   9.2167
   9.1575   9.1204   9.132    9.2097   9.3592   9.5732   9.8331  10.1115
  10.377   10.5995  10.7548  10.8287  10.8196  10.7383  10.6065  10.4534
  10.3105  10.2066  10.1632  10.1909  10.2875  10.4386  10.6196  10.8004
  10.9495  11.0399]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[ 11.0445  10.9244  10.6981  10.4079  10.1096   9.8587   9.6968   9.6417
   9.6824   9.7827]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
Error in callback <function post_execute at 0x7f89c987e578> (for post_execute):

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/matplotlib/pyplot.pyc in post_execute()
    145             def post_execute():
    146                 if matplotlib.is_interactive():
--> 147                     draw_all()
    148 
    149             # IPython >= 2

/usr/lib/python2.7/dist-packages/matplotlib/_pylab_helpers.pyc in draw_all(cls, force)
    148         for f_mgr in cls.get_all_fig_managers():
    149             if force or f_mgr.canvas.figure.stale:
--> 150                 f_mgr.canvas.draw_idle()
    151 
    152 atexit.register(Gcf.destroy_all)

/usr/lib/python2.7/dist-packages/matplotlib/backend_bases.pyc in draw_idle(self, *args, **kwargs)
   2024         if not self._is_idle_drawing:
   2025             with self._idle_draw_cntx():
-> 2026                 self.draw(*args, **kwargs)
   2027 
   2028     def draw_cursor(self, event):

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in draw(self)
    472 
    473         try:
--> 474             self.figure.draw(self.renderer)
    475         finally:
    476             RendererAgg.lock.release()

/usr/lib/python2.7/dist-packages/matplotlib/artist.pyc in draw_wrapper(artist, renderer, *args, **kwargs)
     59     def draw_wrapper(artist, renderer, *args, **kwargs):
     60         before(artist, renderer)
---> 61         draw(artist, renderer, *args, **kwargs)
     62         after(artist, renderer)
     63 

/usr/lib/python2.7/dist-packages/matplotlib/figure.pyc in draw(self, renderer)
   1157         dsu.sort(key=itemgetter(0))
   1158         for zorder, a, func, args in dsu:
-> 1159             func(*args)
   1160 
   1161         renderer.close_group('figure')

/usr/lib/python2.7/dist-packages/matplotlib/artist.pyc in draw_wrapper(artist, renderer, *args, **kwargs)
     59     def draw_wrapper(artist, renderer, *args, **kwargs):
     60         before(artist, renderer)
---> 61         draw(artist, renderer, *args, **kwargs)
     62         after(artist, renderer)
     63 

/usr/lib/python2.7/dist-packages/matplotlib/axes/_base.pyc in draw(self, renderer, inframe)
   2322 
   2323         for zorder, a in dsu:
-> 2324             a.draw(renderer)
   2325 
   2326         renderer.close_group('axes')

/usr/lib/python2.7/dist-packages/matplotlib/artist.pyc in draw_wrapper(artist, renderer, *args, **kwargs)
     59     def draw_wrapper(artist, renderer, *args, **kwargs):
     60         before(artist, renderer)
---> 61         draw(artist, renderer, *args, **kwargs)
     62         after(artist, renderer)
     63 

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in draw(self, renderer, *args, **kwargs)
   1106         ticks_to_draw = self._update_ticks(renderer)
   1107         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
-> 1108                                                                 renderer)
   1109 
   1110         for tick in ticks_to_draw:

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in _get_tick_bboxes(self, ticks, renderer)
   1056         for tick in ticks:
   1057             if tick.label1On and tick.label1.get_visible():
-> 1058                 extent = tick.label1.get_window_extent(renderer)
   1059                 ticklabelBoxes.append(extent)
   1060             if tick.label2On and tick.label2.get_visible():

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
    959             raise RuntimeError('Cannot get window extent w/o renderer')
    960 
--> 961         bbox, info, descent = self._get_layout(self._renderer)
    962         x, y = self.get_unitless_position()
    963         x, y = self.get_transform().transform_point((x, y))

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in _get_layout(self, renderer)
    350         tmp, lp_h, lp_bl = renderer.get_text_width_height_descent('lp',
    351                                                          self._fontproperties,
--> 352                                                          ismath=False)
    353         offsety = (lp_h - lp_bl) * self._linespacing
    354 

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in get_text_width_height_descent(self, s, prop, ismath)
    227             fontsize = prop.get_size_in_points()
    228             w, h, d = texmanager.get_text_width_height_descent(s, fontsize,
--> 229                                                                renderer=self)
    230             return w, h, d
    231 

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in get_text_width_height_descent(self, tex, fontsize, renderer)
    673         else:
    674             # use dviread. It sometimes returns a wrong descent.
--> 675             dvifile = self.make_dvi(tex, fontsize)
    676             dvi = dviread.Dvi(dvifile, 72 * dpi_fraction)
    677             try:

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in make_dvi(self, tex, fontsize)
    420                      'string:\n%s\nHere is the full report generated by '
    421                      'LaTeX: \n\n' % repr(tex.encode('unicode_escape')) +
--> 422                      report))
    423             else:
    424                 mpl.verbose.report(report, 'debug')

RuntimeError: LaTeX was not able to process the following string:
'lp'
Here is the full report generated by LaTeX:

<matplotlib.figure.Figure at 0x7f89b6a0ef10>

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we don't want any expansion magic from using **2

In [8]:
res.params
Out[8]:
Intercept           4.981756
x1                  0.486687
np.sin(x1)          0.473874
I((x1 - 5) ** 2)   -0.018259
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 11.0445,  10.9244,  10.6981,  10.4079,  10.1096,   9.8587,
         9.6968,   9.6417,   9.6824,   9.7827])