Logo

Regression PlotsΒΆ

Link to Notebook GitHub

In [1]:
from __future__ import print_function
from statsmodels.compat import lzip
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

Duncan's Prestige Dataset

Load the Data

We can use a utility function to load any R dataset available from the great Rdatasets package.

In [2]:
prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True).data
In [3]:
prestige.head()
Out[3]:
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
In [4]:
prestige_model = ols("prestige ~ income + education", data=prestige).fit()
In [5]:
print(prestige_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.828
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                     101.2
Date:                Wed, 20 May 2015   Prob (F-statistic):           8.65e-17
Time:                        21:54:39   Log-Likelihood:                -178.98
No. Observations:                  45   AIC:                             364.0
Df Residuals:                      42   BIC:                             369.4
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -6.0647      4.272     -1.420      0.163       -14.686     2.556
income         0.5987      0.120      5.003      0.000         0.357     0.840
education      0.5458      0.098      5.555      0.000         0.348     0.744
==============================================================================
Omnibus:                        1.279   Durbin-Watson:                   1.458
Prob(Omnibus):                  0.528   Jarque-Bera (JB):                0.520
Skew:                           0.155   Prob(JB):                        0.771
Kurtosis:                       3.426   Cond. No.                         163.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Influence plots

Influence plots show the (externally) studentized residuals vs. the leverage of each observation as measured by the hat matrix.

Externally studentized residuals are residuals that are scaled by their standard deviation where

$$var(\hat{\epsilon}_i)=\hat{\sigma}^2_i(1-h_{ii})$$

with

$$\hat{\sigma}^2_i=\frac{1}{n - p - 1 \;\;}\sum_{j}^{n}\;\;\;\forall \;\;\; j \neq i$$

$n$ is the number of observations and $p$ is the number of regressors. $h_{ii}$ is the $i$-th diagonal element of the hat matrix

$$H=X(X^{\;\prime}X)^{-1}X^{\;\prime}$$

The influence of each point can be visualized by the criterion keyword argument. Options are Cook's distance and DFFITS, two measures of influence.

In [6]:
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(prestige_model, ax=ax, criterion="cooks")
<matplotlib.figure.Figure at 0x7fd0d8011c50>

As you can see there are a few worrisome observations. Both contractor and reporter have low leverage but a large residual.
RR.engineer has small residual and large leverage. Conductor and minister have both high leverage and large residuals, and,
therefore, large influence.

Partial Regression Plots

Since we are doing multivariate regressions, we cannot just look at individual bivariate plots to discern relationships.
Instead, we want to look at the relationship of the dependent variable and independent variables conditional on the other
independent variables. We can do this through using partial regression plots, otherwise known as added variable plots.

In a partial regression plot, to discern the relationship between the response variable and the $k$-th variabe, we compute
the residuals by regressing the response variable versus the independent variables excluding $X_k$. We can denote this by
$X_{\sim k}$. We then compute the residuals by regressing $X_k$ on $X_{\sim k}$. The partial regression plot is the plot
of the former versus the latter residuals.

The notable points of this plot are that the fitted line has slope $\beta_k$ and intercept zero. The residuals of this plot
are the same as those of the least squares fit of the original model with full $X$. You can discern the effects of the
individual data values on the estimation of a coefficient easily. If obs_labels is True, then these points are annotated
with their observation label. You can also see the violation of underlying assumptions such as homooskedasticity and
linearity.

In [7]:
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.plot_partregress("prestige", "income", ["income", "education"], data=prestige, ax=ax)
<matplotlib.figure.Figure at 0x7fd0d2c6e7d0>
In [8]:
fix, ax = plt.subplots(figsize=(12,14))
fig = sm.graphics.plot_partregress("prestige", "income", ["education"], data=prestige, ax=ax)
<matplotlib.figure.Figure at 0x7fd0d2ba35d0>

As you can see the partial regression plot confirms the influence of conductor, minister, and RR.engineer on the partial relationship between income and prestige. The cases greatly decrease the effect of income on prestige. Dropping these cases confirms this.

In [9]:
subset = ~prestige.index.isin(["conductor", "RR.engineer", "minister"])
prestige_model2 = ols("prestige ~ income + education", data=prestige, subset=subset).fit()
print(prestige_model2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.876
Model:                            OLS   Adj. R-squared:                  0.870
Method:                 Least Squares   F-statistic:                     138.1
Date:                Wed, 20 May 2015   Prob (F-statistic):           2.02e-18
Time:                        21:54:40   Log-Likelihood:                -160.59
No. Observations:                  42   AIC:                             327.2
Df Residuals:                      39   BIC:                             332.4
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -6.3174      3.680     -1.717      0.094       -13.760     1.125
income         0.9307      0.154      6.053      0.000         0.620     1.242
education      0.2846      0.121      2.345      0.024         0.039     0.530
==============================================================================
Omnibus:                        3.811   Durbin-Watson:                   1.468
Prob(Omnibus):                  0.149   Jarque-Bera (JB):                2.802
Skew:                          -0.614   Prob(JB):                        0.246
Kurtosis:                       3.303   Cond. No.                         158.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For a quick check of all the regressors, you can use plot_partregress_grid. These plots will not label the
points, but you can use them to identify problems and then use plot_partregress to get more information.

In [10]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_partregress_grid(prestige_model, fig=fig)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-309-8a4e78936888> in <module>()
      1 fig = plt.figure(figsize=(12,8))
----> 2 fig = sm.graphics.plot_partregress_grid(prestige_model, fig=fig)

/tmp/buildd/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/graphics/regressionplots.pyc in plot_partregress_grid(results, exog_idx, grid, fig)
    461     fig.suptitle("Partial Regression Plot", fontsize="large")
    462 
--> 463     fig.tight_layout()
    464 
    465     fig.subplots_adjust(top=.95)

/usr/lib/python2.7/dist-packages/matplotlib/figure.pyc in tight_layout(self, renderer, pad, h_pad, w_pad, rect)
   1652                                          renderer,
   1653                                          pad=pad, h_pad=h_pad, w_pad=w_pad,
-> 1654                                          rect=rect)
   1655 
   1656         self.subplots_adjust(**kwargs)

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in get_tight_layout_figure(fig, axes_list, subplotspec_list, renderer, pad, h_pad, w_pad, rect)
    350                                      subplot_list=subplot_list,
    351                                      ax_bbox_list=ax_bbox_list,
--> 352                                      pad=pad, h_pad=h_pad, w_pad=w_pad)
    353 
    354     if rect is not None:

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in auto_adjust_subplotpars(fig, renderer, nrows_ncols, num1num2_list, subplot_list, ax_bbox_list, pad, h_pad, w_pad, rect)
    127         #ax_bbox = union([ax.get_position(original=True) for ax in subplots])
    128 
--> 129         tight_bbox_raw = union([ax.get_tightbbox(renderer) for ax in subplots])
    130         tight_bbox = TransformedBbox(tight_bbox_raw,
    131                                      fig.transFigure.inverted())

/usr/lib/python2.7/dist-packages/matplotlib/axes/_base.pyc in get_tightbbox(self, renderer, call_axes_locator)
   3256             bb.append(self._right_title.get_window_extent(renderer))
   3257 
-> 3258         bb_xaxis = self.xaxis.get_tightbbox(renderer)
   3259         if bb_xaxis:
   3260             bb.append(bb_xaxis)

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in get_tightbbox(self, renderer)
   1080         ticks_to_draw = self._update_ticks(renderer)
   1081         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
-> 1082                                                                 renderer)
   1083 
   1084         self._update_label_position(ticklabelBoxes, ticklabelBoxes2)

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in _get_tick_bboxes(self, ticks, renderer)
   1063         for tick in ticks:
   1064             if tick.label1On and tick.label1.get_visible():
-> 1065                 extent = tick.label1.get_window_extent(renderer)
   1066                 ticklabelBoxes.append(extent)
   1067             if tick.label2On and tick.label2.get_visible():

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
    739             raise RuntimeError('Cannot get window extent w/o renderer')
    740 
--> 741         bbox, info, descent = self._get_layout(self._renderer)
    742         x, y = self.get_position()
    743         x, y = self.get_transform().transform_point((x, y))

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in _get_layout(self, renderer)
    309         tmp, lp_h, lp_bl = renderer.get_text_width_height_descent('lp',
    310                                                          self._fontproperties,
--> 311                                                          ismath=False)
    312         offsety = (lp_h - lp_bl) * self._linespacing
    313 

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in get_text_width_height_descent(self, s, prop, ismath)
    221             fontsize = prop.get_size_in_points()
    222             w, h, d = texmanager.get_text_width_height_descent(s, fontsize,
--> 223                                                                renderer=self)
    224             return w, h, d
    225 

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in get_text_width_height_descent(self, tex, fontsize, renderer)
    668         else:
    669             # use dviread. It sometimes returns a wrong descent.
--> 670             dvifile = self.make_dvi(tex, fontsize)
    671             dvi = dviread.Dvi(dvifile, 72 * dpi_fraction)
    672             try:

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in make_dvi(self, tex, fontsize)
    415                      'string:\n%s\nHere is the full report generated by '
    416                      'LaTeX: \n\n' % repr(tex.encode('unicode_escape')) +
--> 417                      report))
    418             else:
    419                 mpl.verbose.report(report, 'debug')

RuntimeError: LaTeX was not able to process the following string:
'lp'
Here is the full report generated by LaTeX:

<matplotlib.figure.Figure at 0x7fd0d2ac7bd0>

Component-Component plus Residual (CCPR) Plots

The CCPR plot provides a way to judge the effect of one regressor on the
response variable by taking into account the effects of the other
independent variables. The partial residuals plot is defined as
$\text{Residuals} + B_iX_i \text{ }\text{ }$ versus $X_i$. The component adds $B_iX_i$ versus
$X_i$ to show where the fitted line would lie. Care should be taken if $X_i$
is highly correlated with any of the other independent variables. If this
is the case, the variance evident in the plot will be an underestimate of
the true variance.

In [11]:
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_ccpr(prestige_model, "education", ax=ax)
<matplotlib.figure.Figure at 0x7fd0d2d58910>

As you can see the relationship between the variation in prestige explained by education conditional on income seems to be linear, though you can see there are some observations that are exerting considerable influence on the relationship. We can quickly look at more than one variable by using plot_ccpr_grid.

In [12]:
fig = plt.figure(figsize=(12, 8))
fig = sm.graphics.plot_ccpr_grid(prestige_model, fig=fig)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-311-4e390a91d878> in <module>()
      1 fig = plt.figure(figsize=(12, 8))
----> 2 fig = sm.graphics.plot_ccpr_grid(prestige_model, fig=fig)

/tmp/buildd/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/graphics/regressionplots.pyc in plot_ccpr_grid(results, exog_idx, grid, fig)
    599     fig.suptitle("Component-Component Plus Residual Plot", fontsize="large")
    600 
--> 601     fig.tight_layout()
    602 
    603     fig.subplots_adjust(top=.95)

/usr/lib/python2.7/dist-packages/matplotlib/figure.pyc in tight_layout(self, renderer, pad, h_pad, w_pad, rect)
   1652                                          renderer,
   1653                                          pad=pad, h_pad=h_pad, w_pad=w_pad,
-> 1654                                          rect=rect)
   1655 
   1656         self.subplots_adjust(**kwargs)

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in get_tight_layout_figure(fig, axes_list, subplotspec_list, renderer, pad, h_pad, w_pad, rect)
    350                                      subplot_list=subplot_list,
    351                                      ax_bbox_list=ax_bbox_list,
--> 352                                      pad=pad, h_pad=h_pad, w_pad=w_pad)
    353 
    354     if rect is not None:

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in auto_adjust_subplotpars(fig, renderer, nrows_ncols, num1num2_list, subplot_list, ax_bbox_list, pad, h_pad, w_pad, rect)
    127         #ax_bbox = union([ax.get_position(original=True) for ax in subplots])
    128 
--> 129         tight_bbox_raw = union([ax.get_tightbbox(renderer) for ax in subplots])
    130         tight_bbox = TransformedBbox(tight_bbox_raw,
    131                                      fig.transFigure.inverted())

/usr/lib/python2.7/dist-packages/matplotlib/axes/_base.pyc in get_tightbbox(self, renderer, call_axes_locator)
   3256             bb.append(self._right_title.get_window_extent(renderer))
   3257 
-> 3258         bb_xaxis = self.xaxis.get_tightbbox(renderer)
   3259         if bb_xaxis:
   3260             bb.append(bb_xaxis)

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in get_tightbbox(self, renderer)
   1080         ticks_to_draw = self._update_ticks(renderer)
   1081         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
-> 1082                                                                 renderer)
   1083 
   1084         self._update_label_position(ticklabelBoxes, ticklabelBoxes2)

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in _get_tick_bboxes(self, ticks, renderer)
   1063         for tick in ticks:
   1064             if tick.label1On and tick.label1.get_visible():
-> 1065                 extent = tick.label1.get_window_extent(renderer)
   1066                 ticklabelBoxes.append(extent)
   1067             if tick.label2On and tick.label2.get_visible():

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
    739             raise RuntimeError('Cannot get window extent w/o renderer')
    740 
--> 741         bbox, info, descent = self._get_layout(self._renderer)
    742         x, y = self.get_position()
    743         x, y = self.get_transform().transform_point((x, y))

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in _get_layout(self, renderer)
    309         tmp, lp_h, lp_bl = renderer.get_text_width_height_descent('lp',
    310                                                          self._fontproperties,
--> 311                                                          ismath=False)
    312         offsety = (lp_h - lp_bl) * self._linespacing
    313 

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in get_text_width_height_descent(self, s, prop, ismath)
    221             fontsize = prop.get_size_in_points()
    222             w, h, d = texmanager.get_text_width_height_descent(s, fontsize,
--> 223                                                                renderer=self)
    224             return w, h, d
    225 

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in get_text_width_height_descent(self, tex, fontsize, renderer)
    668         else:
    669             # use dviread. It sometimes returns a wrong descent.
--> 670             dvifile = self.make_dvi(tex, fontsize)
    671             dvi = dviread.Dvi(dvifile, 72 * dpi_fraction)
    672             try:

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in make_dvi(self, tex, fontsize)
    415                      'string:\n%s\nHere is the full report generated by '
    416                      'LaTeX: \n\n' % repr(tex.encode('unicode_escape')) +
--> 417                      report))
    418             else:
    419                 mpl.verbose.report(report, 'debug')

RuntimeError: LaTeX was not able to process the following string:
'lp'
Here is the full report generated by LaTeX:

<matplotlib.figure.Figure at 0x7fd0d2ff1490>

Regression Plots

The plot_regress_exog function is a convenience function that gives a 2x2 plot containing the dependent variable and fitted values with confidence intervals vs. the independent variable chosen, the residuals of the model vs. the chosen independent variable, a partial regression plot, and a CCPR plot. This function can be used for quickly checking modeling assumptions with respect to a single regressor.

In [13]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(prestige_model, "education", fig=fig)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-312-8700df56dd99> in <module>()
      1 fig = plt.figure(figsize=(12,8))
----> 2 fig = sm.graphics.plot_regress_exog(prestige_model, "education", fig=fig)

/tmp/buildd/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/graphics/regressionplots.pyc in plot_regress_exog(results, exog_idx, fig)
    222     fig.suptitle('Regression Plots for %s' % exog_name, fontsize="large")
    223 
--> 224     fig.tight_layout()
    225 
    226     fig.subplots_adjust(top=.90)

/usr/lib/python2.7/dist-packages/matplotlib/figure.pyc in tight_layout(self, renderer, pad, h_pad, w_pad, rect)
   1652                                          renderer,
   1653                                          pad=pad, h_pad=h_pad, w_pad=w_pad,
-> 1654                                          rect=rect)
   1655 
   1656         self.subplots_adjust(**kwargs)

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in get_tight_layout_figure(fig, axes_list, subplotspec_list, renderer, pad, h_pad, w_pad, rect)
    350                                      subplot_list=subplot_list,
    351                                      ax_bbox_list=ax_bbox_list,
--> 352                                      pad=pad, h_pad=h_pad, w_pad=w_pad)
    353 
    354     if rect is not None:

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in auto_adjust_subplotpars(fig, renderer, nrows_ncols, num1num2_list, subplot_list, ax_bbox_list, pad, h_pad, w_pad, rect)
    127         #ax_bbox = union([ax.get_position(original=True) for ax in subplots])
    128 
--> 129         tight_bbox_raw = union([ax.get_tightbbox(renderer) for ax in subplots])
    130         tight_bbox = TransformedBbox(tight_bbox_raw,
    131                                      fig.transFigure.inverted())

/usr/lib/python2.7/dist-packages/matplotlib/axes/_base.pyc in get_tightbbox(self, renderer, call_axes_locator)
   3250 
   3251         if self.title.get_visible():
-> 3252             bb.append(self.title.get_window_extent(renderer))
   3253         if self._left_title.get_visible():
   3254             bb.append(self._left_title.get_window_extent(renderer))

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
    739             raise RuntimeError('Cannot get window extent w/o renderer')
    740 
--> 741         bbox, info, descent = self._get_layout(self._renderer)
    742         x, y = self.get_position()
    743         x, y = self.get_transform().transform_point((x, y))

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in _get_layout(self, renderer)
    309         tmp, lp_h, lp_bl = renderer.get_text_width_height_descent('lp',
    310                                                          self._fontproperties,
--> 311                                                          ismath=False)
    312         offsety = (lp_h - lp_bl) * self._linespacing
    313 

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in get_text_width_height_descent(self, s, prop, ismath)
    221             fontsize = prop.get_size_in_points()
    222             w, h, d = texmanager.get_text_width_height_descent(s, fontsize,
--> 223                                                                renderer=self)
    224             return w, h, d
    225 

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in get_text_width_height_descent(self, tex, fontsize, renderer)
    668         else:
    669             # use dviread. It sometimes returns a wrong descent.
--> 670             dvifile = self.make_dvi(tex, fontsize)
    671             dvi = dviread.Dvi(dvifile, 72 * dpi_fraction)
    672             try:

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in make_dvi(self, tex, fontsize)
    415                      'string:\n%s\nHere is the full report generated by '
    416                      'LaTeX: \n\n' % repr(tex.encode('unicode_escape')) +
--> 417                      report))
    418             else:
    419                 mpl.verbose.report(report, 'debug')

RuntimeError: LaTeX was not able to process the following string:
'lp'
Here is the full report generated by LaTeX:

<matplotlib.figure.Figure at 0x7fd0d8043090>

Fit Plot

The plot_fit function plots the fitted values versus a chosen independent variable. It includes prediction confidence intervals and optionally plots the true dependent variable.

In [14]:
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(prestige_model, "education", ax=ax)
<matplotlib.figure.Figure at 0x7fd0d284d710>

Statewide Crime 2009 Dataset

Compare the following to http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter4/statareg_self_assessment_answers4.htm

Though the data here is not the same as in that example. You could run that example by uncommenting the necessary cells below.

In [15]:
#dta = pd.read_csv("http://www.stat.ufl.edu/~aa/social/csv_files/statewide-crime-2.csv")
#dta = dta.set_index("State", inplace=True).dropna()
#dta.rename(columns={"VR" : "crime",
#                    "MR" : "murder",
#                    "M"  : "pctmetro",
#                    "W"  : "pctwhite",
#                    "H"  : "pcths",
#                    "P"  : "poverty",
#                    "S"  : "single"
#                    }, inplace=True)
#
#crime_model = ols("murder ~ pctmetro + poverty + pcths + single", data=dta).fit()
In [16]:
dta = sm.datasets.statecrime.load_pandas().data
In [17]:
crime_model = ols("murder ~ urban + poverty + hs_grad + single", data=dta).fit()
print(crime_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     50.08
Date:                Wed, 20 May 2015   Prob (F-statistic):           3.42e-16
Time:                        21:54:43   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -44.1024     12.086     -3.649      0.001       -68.430   -19.774
urban          0.0109      0.015      0.707      0.483        -0.020     0.042
poverty        0.4121      0.140      2.939      0.005         0.130     0.694
hs_grad        0.3059      0.117      2.611      0.012         0.070     0.542
single         0.6374      0.070      9.065      0.000         0.496     0.779
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Partial Regression Plots

In [18]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_partregress_grid(crime_model, fig=fig)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-317-3f3d9fd5d7d3> in <module>()
      1 fig = plt.figure(figsize=(12,8))
----> 2 fig = sm.graphics.plot_partregress_grid(crime_model, fig=fig)

/tmp/buildd/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/graphics/regressionplots.pyc in plot_partregress_grid(results, exog_idx, grid, fig)
    461     fig.suptitle("Partial Regression Plot", fontsize="large")
    462 
--> 463     fig.tight_layout()
    464 
    465     fig.subplots_adjust(top=.95)

/usr/lib/python2.7/dist-packages/matplotlib/figure.pyc in tight_layout(self, renderer, pad, h_pad, w_pad, rect)
   1652                                          renderer,
   1653                                          pad=pad, h_pad=h_pad, w_pad=w_pad,
-> 1654                                          rect=rect)
   1655 
   1656         self.subplots_adjust(**kwargs)

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in get_tight_layout_figure(fig, axes_list, subplotspec_list, renderer, pad, h_pad, w_pad, rect)
    350                                      subplot_list=subplot_list,
    351                                      ax_bbox_list=ax_bbox_list,
--> 352                                      pad=pad, h_pad=h_pad, w_pad=w_pad)
    353 
    354     if rect is not None:

/usr/lib/python2.7/dist-packages/matplotlib/tight_layout.pyc in auto_adjust_subplotpars(fig, renderer, nrows_ncols, num1num2_list, subplot_list, ax_bbox_list, pad, h_pad, w_pad, rect)
    127         #ax_bbox = union([ax.get_position(original=True) for ax in subplots])
    128 
--> 129         tight_bbox_raw = union([ax.get_tightbbox(renderer) for ax in subplots])
    130         tight_bbox = TransformedBbox(tight_bbox_raw,
    131                                      fig.transFigure.inverted())

/usr/lib/python2.7/dist-packages/matplotlib/axes/_base.pyc in get_tightbbox(self, renderer, call_axes_locator)
   3256             bb.append(self._right_title.get_window_extent(renderer))
   3257 
-> 3258         bb_xaxis = self.xaxis.get_tightbbox(renderer)
   3259         if bb_xaxis:
   3260             bb.append(bb_xaxis)

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in get_tightbbox(self, renderer)
   1080         ticks_to_draw = self._update_ticks(renderer)
   1081         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
-> 1082                                                                 renderer)
   1083 
   1084         self._update_label_position(ticklabelBoxes, ticklabelBoxes2)

/usr/lib/python2.7/dist-packages/matplotlib/axis.pyc in _get_tick_bboxes(self, ticks, renderer)
   1063         for tick in ticks:
   1064             if tick.label1On and tick.label1.get_visible():
-> 1065                 extent = tick.label1.get_window_extent(renderer)
   1066                 ticklabelBoxes.append(extent)
   1067             if tick.label2On and tick.label2.get_visible():

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
    739             raise RuntimeError('Cannot get window extent w/o renderer')
    740 
--> 741         bbox, info, descent = self._get_layout(self._renderer)
    742         x, y = self.get_position()
    743         x, y = self.get_transform().transform_point((x, y))

/usr/lib/python2.7/dist-packages/matplotlib/text.pyc in _get_layout(self, renderer)
    309         tmp, lp_h, lp_bl = renderer.get_text_width_height_descent('lp',
    310                                                          self._fontproperties,
--> 311                                                          ismath=False)
    312         offsety = (lp_h - lp_bl) * self._linespacing
    313 

/usr/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.pyc in get_text_width_height_descent(self, s, prop, ismath)
    221             fontsize = prop.get_size_in_points()
    222             w, h, d = texmanager.get_text_width_height_descent(s, fontsize,
--> 223                                                                renderer=self)
    224             return w, h, d
    225 

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in get_text_width_height_descent(self, tex, fontsize, renderer)
    668         else:
    669             # use dviread. It sometimes returns a wrong descent.
--> 670             dvifile = self.make_dvi(tex, fontsize)
    671             dvi = dviread.Dvi(dvifile, 72 * dpi_fraction)
    672             try:

/usr/lib/python2.7/dist-packages/matplotlib/texmanager.pyc in make_dvi(self, tex, fontsize)
    415                      'string:\n%s\nHere is the full report generated by '
    416                      'LaTeX: \n\n' % repr(tex.encode('unicode_escape')) +
--> 417                      report))
    418             else:
    419                 mpl.verbose.report(report, 'debug')

RuntimeError: LaTeX was not able to process the following string:
'lp'
Here is the full report generated by LaTeX:

<matplotlib.figure.Figure at 0x7fd0d2773910>
In [19]:
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.plot_partregress("murder", "hs_grad", ["urban", "poverty", "single"],  ax=ax, data=dta)
<matplotlib.figure.Figure at 0x7fd0d2858d10>

Leverage-Resid2 Plot

Closely related to the influence_plot is the leverage-resid2 plot.

In [20]:
fig, ax = plt.subplots(figsize=(8,6))
fig = sm.graphics.plot_leverage_resid2(crime_model, ax=ax)
<matplotlib.figure.Figure at 0x7fd0d9551a90>

Influence Plot

In [21]:
fig, ax = plt.subplots(figsize=(8,6))
fig = sm.graphics.influence_plot(crime_model, ax=ax)
<matplotlib.figure.Figure at 0x7fd0d2c16e10>

Using robust regression to correct for outliers.

Part of the problem here in recreating the Stata results is that M-estimators are not robust to leverage points. MM-estimators should do better with this examples.

In [22]:
from statsmodels.formula.api import rlm
In [23]:
rob_crime_model = rlm("murder ~ urban + poverty + hs_grad + single", data=dta,
                      M=sm.robust.norms.TukeyBiweight(3)).fit(conv="weights")
print(rob_crime_model.summary())
                    Robust linear Model Regression Results
==============================================================================
Dep. Variable:                 murder   No. Observations:                   51
Model:                            RLM   Df Residuals:                       46
Method:                          IRLS   Df Model:                            4
Norm:                   TukeyBiweight
Scale Est.:                       mad
Cov Type:                          H1
Date:                Wed, 20 May 2015
Time:                        21:54:46
No. Iterations:                    50
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -4.2986      9.494     -0.453      0.651       -22.907    14.310
urban          0.0029      0.012      0.241      0.809        -0.021     0.027
poverty        0.2753      0.110      2.499      0.012         0.059     0.491
hs_grad       -0.0302      0.092     -0.328      0.743        -0.211     0.150
single         0.2902      0.055      5.253      0.000         0.182     0.398
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore .

In [24]:
#rob_crime_model = rlm("murder ~ pctmetro + poverty + pcths + single", data=dta, M=sm.robust.norms.TukeyBiweight()).fit(conv="weights")
#print(rob_crime_model.summary())

There isn't yet an influence diagnostics method as part of RLM, but we can recreate them. (This depends on the status of issue #888)

In [25]:
weights = rob_crime_model.weights
idx = weights > 0
X = rob_crime_model.model.exog[idx.values]
ww = weights[idx] / weights[idx].mean()
hat_matrix_diag = ww*(X*np.linalg.pinv(X).T).sum(1)
resid = rob_crime_model.resid
resid2 = resid**2
resid2 /= resid2.sum()
nobs = int(idx.sum())
hm = hat_matrix_diag.mean()
rm = resid2.mean()
In [26]:
from statsmodels.graphics import utils
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(resid2[idx], hat_matrix_diag, 'o')
ax = utils.annotate_axes(range(nobs), labels=rob_crime_model.model.data.row_labels[idx],
                    points=lzip(resid2[idx], hat_matrix_diag), offset_points=[(-5,5)]*nobs,
                    size="large", ax=ax)
ax.set_xlabel("resid2")
ax.set_ylabel("leverage")
ylim = ax.get_ylim()
ax.vlines(rm, *ylim)
xlim = ax.get_xlim()
ax.hlines(hm, *xlim)
ax.margins(0,0)
<matplotlib.figure.Figure at 0x7fd0d2bfd390>

This Page