Regression diagnosticsΒΆ

Link to Notebook GitHub

This example file shows how to use a few of the statsmodels regression diagnostic tests in a real-life context. You can learn about more tests and find out more information abou the tests here on the Regression Diagnostics page.

Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of outputs is always included in the docstring and in the online statsmodels documentation. For presentation purposes, we use the zip(name,test) construct to pretty-print short descriptions in the examples below.

Estimate a regression model

In [1]:
from __future__ import print_function
from statsmodels.compat import lzip
import statsmodels
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt

# Load data
url = 'http://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv'
dat = pd.read_csv(url)

# Fit regression model (using the natural log of one of the regressaors)
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()

# Inspect the results
print(results.summary())
---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-46-5ab57e184000> in <module>()
     10 # Load data
     11 url = 'http://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv'
---> 12 dat = pd.read_csv(url)
     13 
     14 # Fit regression model (using the natural log of one of the regressaors)

/usr/lib/python2.7/dist-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    539                     skip_blank_lines=skip_blank_lines)
    540 
--> 541         return _read(filepath_or_buffer, kwds)
    542 
    543     parser_f.__name__ = name

/usr/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    291     filepath_or_buffer, _, compression = get_filepath_or_buffer(
    292         filepath_or_buffer, encoding,
--> 293         compression=kwds.get('compression', None))
    294     kwds['compression'] = (inferred_compression if compression == 'infer'
    295                            else compression)

/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in get_filepath_or_buffer(filepath_or_buffer, encoding, compression)
    306 
    307     if _is_url(filepath_or_buffer):
--> 308         req = _urlopen(str(filepath_or_buffer))
    309         if compression == 'infer':
    310             content_encoding = req.headers.get('Content-Encoding', None)

/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    152     else:
    153         opener = _opener
--> 154     return opener.open(url, data, timeout)
    155 
    156 def install_opener(opener):

/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    427             req = meth(req)
    428 
--> 429         response = self._open(req, data)
    430 
    431         # post-process response

/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
    445         protocol = req.get_type()
    446         result = self._call_chain(self.handle_open, protocol, protocol +
--> 447                                   '_open', req)
    448         if result:
    449             return result

/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    405             func = getattr(handler, meth_name)
    406 
--> 407             result = func(*args)
    408             if result is not None:
    409                 return result

/usr/lib/python2.7/urllib2.pyc in http_open(self, req)
   1226 
   1227     def http_open(self, req):
-> 1228         return self.do_open(httplib.HTTPConnection, req)
   1229 
   1230     http_request = AbstractHTTPHandler.do_request_

/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
   1196         except socket.error, err: # XXX what error?
   1197             h.close()
-> 1198             raise URLError(err)
   1199         else:
   1200             try:

URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

Normality of the residuals

Jarque-Bera test:

In [2]:
name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
test = sms.jarque_bera(results.resid)
lzip(name, test)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-47-911bd7c54c5b> in <module>()
      1 name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
----> 2 test = sms.jarque_bera(results.resid)
      3 lzip(name, test)

NameError: name 'results' is not defined

Omni test:

In [3]:
name = ['Chi^2', 'Two-tail probability']
test = sms.omni_normtest(results.resid)
lzip(name, test)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-48-5d66593e2f9b> in <module>()
      1 name = ['Chi^2', 'Two-tail probability']
----> 2 test = sms.omni_normtest(results.resid)
      3 lzip(name, test)

NameError: name 'results' is not defined

Influence tests

Once created, an object of class OLSInfluence holds attributes and methods that allow users to assess the influence of each observation. For example, we can compute and extract the first few rows of DFbetas by:

In [4]:
from statsmodels.stats.outliers_influence import OLSInfluence
test_class = OLSInfluence(results)
test_class.dfbetas[:5,:]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-49-b1e3d0f42941> in <module>()
      1 from statsmodels.stats.outliers_influence import OLSInfluence
----> 2 test_class = OLSInfluence(results)
      3 test_class.dfbetas[:5,:]

NameError: name 'results' is not defined

Explore other options by typing dir(influence_test)

Useful information on leverage can also be plotted:

In [5]:
from statsmodels.graphics.regressionplots import plot_leverage_resid2
fig, ax = plt.subplots(figsize=(8,6))
fig = plot_leverage_resid2(results, ax = ax)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-50-7590c2fd5fb2> in <module>()
      1 from statsmodels.graphics.regressionplots import plot_leverage_resid2
      2 fig, ax = plt.subplots(figsize=(8,6))
----> 3 fig = plot_leverage_resid2(results, ax = ax)

NameError: name 'results' is not defined
Error in callback <function post_execute at 0x7f89c987e578> (for post_execute):

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/matplotlib/pyplot.pyc in post_execute()
    145             def post_execute():
    146                 if matplotlib.is_interactive():
--> 147                     draw_all()
    148 
    149             # IPython >= 2

/usr/lib/python2.7/dist-packages/matplotlib/_pylab_helpers.pyc in draw_all(cls, force)
    148         for f_mgr in cls.get_all_fig_managers():
    149             if force or f_mgr.canvas.figure.stale:
--> 150                 f_mgr.canvas.draw_idle()
    151 
    152 atexit.register(Gcf.destroy_all)

/usr/lib/python2.7/dist-packages/matplotlib/backend_bases.pyc in draw_idle(self, *args, **kwargs)
   2024         if not self._is_idle_drawing:
   2025             with self._idle_draw_cntx():
-> 2026                 self.draw(*args, **kwargs)
   2027 
   2028     def draw_cursor(self, event):

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in draw(self)
    472 
    473         try:
--> 474             self.figure.draw(self.renderer)
    475         finally:
    476             RendererAgg.lock.release()

/usr/lib/python2.7/dist-packages/matplotlib/artist.pyc in draw_wrapper(artist, renderer, *args, **kwargs)
     59     def draw_wrapper(artist, renderer, *args, **kwargs):
     60         before(artist, renderer)
---> 61         draw(artist, renderer, *args, **kwargs)
     62         after(artist, renderer)
     63 

/usr/lib/python2.7/dist-packages/matplotlib/figure.pyc in draw(self, renderer)
   1157         dsu.sort(key=itemgetter(0))
   1158         for zorder, a, func, args in dsu:
-> 1159             func(*args)
   1160 
   1161         renderer.close_group('figure')

/usr/lib/python2.7/dist-packages/matplotlib/artist.pyc in draw_wrapper(artist, renderer, *args, **kwargs)
     59     def draw_wrapper(artist, renderer, *args, **kwargs):
     60         before(artist, renderer)
---> 61         draw(artist, renderer, *args, **kwargs)
     62         after(artist, renderer)
     63 

/usr/lib/python2.7/dist-packages/matplotlib/axes/_base.pyc in draw(self, renderer, inframe)
   2322 
   2323         for zorder, a in dsu:
-> 2324             a.draw(renderer)
   2325 
   2326         renderer.close_group('axes')

/usr/lib/python2.7/dist-packages/matplotlib/artist.pyc in draw_wrapper(artist, renderer, *args, **kwargs)
     59     def draw_wrapper(artist, renderer, *args, **kwargs):
     60         before(artist, renderer)
---> 61         draw(artist, renderer, *args, **kwargs)
     62         after(artist, renderer)
     63 

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in draw(self, renderer, *args, **kwargs)
   1106         ticks_to_draw = self._update_ticks(renderer)
   1107         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
-> 1108                                                                 renderer)
   1109 
   1110         for tick in ticks_to_draw:

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in _get_tick_bboxes(self, ticks, renderer)
   1056         for tick in ticks:
   1057             if tick.label1On and tick.label1.get_visible():
-> 1058                 extent = tick.label1.get_window_extent(renderer)
   1059                 ticklabelBoxes.append(extent)
   1060             if tick.label2On and tick.label2.get_visible():

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
    959             raise RuntimeError('Cannot get window extent w/o renderer')
    960 
--> 961         bbox, info, descent = self._get_layout(self._renderer)
    962         x, y = self.get_unitless_position()
    963         x, y = self.get_transform().transform_point((x, y))

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in _get_layout(self, renderer)
    350         tmp, lp_h, lp_bl = renderer.get_text_width_height_descent('lp',
    351                                                          self._fontproperties,
--> 352                                                          ismath=False)
    353         offsety = (lp_h - lp_bl) * self._linespacing
    354 

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in get_text_width_height_descent(self, s, prop, ismath)
    227             fontsize = prop.get_size_in_points()
    228             w, h, d = texmanager.get_text_width_height_descent(s, fontsize,
--> 229                                                                renderer=self)
    230             return w, h, d
    231 

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in get_text_width_height_descent(self, tex, fontsize, renderer)
    673         else:
    674             # use dviread. It sometimes returns a wrong descent.
--> 675             dvifile = self.make_dvi(tex, fontsize)
    676             dvi = dviread.Dvi(dvifile, 72 * dpi_fraction)
    677             try:

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in make_dvi(self, tex, fontsize)
    420                      'string:\n%s\nHere is the full report generated by '
    421                      'LaTeX: \n\n' % repr(tex.encode('unicode_escape')) +
--> 422                      report))
    423             else:
    424                 mpl.verbose.report(report, 'debug')

RuntimeError: LaTeX was not able to process the following string:
'lp'
Here is the full report generated by LaTeX:

<matplotlib.figure.Figure at 0x7f89bd6db610>

Other plotting options can be found on the Graphics page.

Multicollinearity

Condition number:

In [6]:
np.linalg.cond(results.model.exog)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-51-c5af1dc95c76> in <module>()
----> 1 np.linalg.cond(results.model.exog)

NameError: name 'results' is not defined

Heteroskedasticity tests

Breush-Pagan test:

In [7]:
name = ['Lagrange multiplier statistic', 'p-value',
        'f-value', 'f p-value']
test = sms.het_breushpagan(results.resid, results.model.exog)
lzip(name, test)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-52-a4e5ba55306f> in <module>()
      1 name = ['Lagrange multiplier statistic', 'p-value',
      2         'f-value', 'f p-value']
----> 3 test = sms.het_breushpagan(results.resid, results.model.exog)
      4 lzip(name, test)

NameError: name 'results' is not defined

Goldfeld-Quandt test

In [8]:
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(results.resid, results.model.exog)
lzip(name, test)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-53-08d227eb273b> in <module>()
      1 name = ['F statistic', 'p-value']
----> 2 test = sms.het_goldfeldquandt(results.resid, results.model.exog)
      3 lzip(name, test)

NameError: name 'results' is not defined

Linearity

Harvey-Collier multiplier test for Null hypothesis that the linear specification is correct:

In [9]:
name = ['t value', 'p value']
test = sms.linear_harvey_collier(results)
lzip(name, test)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-54-239f988c7dd2> in <module>()
      1 name = ['t value', 'p value']
----> 2 test = sms.linear_harvey_collier(results)
      3 lzip(name, test)

NameError: name 'results' is not defined