I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and I remember it being a bit complicated. Luckily it’s become really easy now – and I’ll show you just how easy.

Before running this you will need to install the pandas, statsmodels and patsy packages. If you’re using conda you should be able to do this by running the following from the terminal:

conda install statsmodels patsy

(and then say yes when it asks you to confirm it)


import pandas as pd
from statsmodels.formula.api import ols

Before we can do any regression, we need some data – so let’s read some data on cars:


You may have noticed from the code above that you can just give a URL to the read_csv function and it will download it and open it – handy!
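As a rough sketch of the loading step, the snippet below reads a few rows in the same column layout from an in-memory string. The inline CSV and the name `df_demo` are stand-ins invented for illustration; with the real file you would pass its path or URL to `read_csv` in the same way.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: three rows in the same layout.
csv_text = """Model,MPG,Cylinders,Engine Disp,Horsepower,Weight,Accelerate,Year,Origin
amc ambassador dpl,15.0,8,390.0,190,3850,8.5,70,American
amc gremlin,21.0,6,199.0,90,2648,15.0,70,American
amc hornet,18.0,6,199.0,97,2774,15.5,70,American
"""

# read_csv accepts a local path, a URL string, or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.shape)
print(df_demo.dtypes)
```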


Anyway, here is the data:


df.head()
                     Model   MPG  Cylinders  Engine Disp  Horsepower  Weight  Accelerate  Year    Origin
0       amc ambassador dpl  15.0          8        390.0         190    3850         8.5    70  American
1              amc gremlin  21.0          6        199.0          90    2648        15.0    70  American
2               amc hornet  18.0          6        199.0          97    2774        15.5    70  American
3            amc rebel sst  16.0          8        304.0         150    3433        12.0    70  American
4  buick estate wagon (sw)  14.0          8        455.0         225    3086        10.0    70  American

Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the corr() method:

                   MPG  Cylinders  Engine Disp  Horsepower     Weight  Accelerate       Year
MPG           1.000000  -0.777618    -0.805127   -0.778427  -0.832244    0.423329   0.580541
Cylinders    -0.777618   1.000000     0.950823    0.842983   0.897527   -0.504683  -0.345647
Engine Disp  -0.805127   0.950823     1.000000    0.897257   0.932994   -0.543800  -0.369855
Horsepower   -0.778427   0.842983     0.897257    1.000000   0.864538   -0.689196  -0.416361
Weight       -0.832244   0.897527     0.932994    0.864538   1.000000   -0.416839  -0.309120
Accelerate    0.423329  -0.504683    -0.543800   -0.689196  -0.416839    1.000000   0.290316
Year          0.580541  -0.345647    -0.369855   -0.416361  -0.309120    0.290316   1.000000
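A table like the one above can be produced along the following lines. This is a small sketch on a hand-typed subset of the rows shown earlier (the name `cars` is just for illustration), and it selects the numeric columns first, on the assumption that your pandas version may refuse to correlate text columns such as Model.

```python
import pandas as pd

# A few rows copied from the df.head() output, limited to four columns.
cars = pd.DataFrame({
    "Model": ["amc ambassador dpl", "amc gremlin", "amc hornet",
              "amc rebel sst", "buick estate wagon (sw)"],
    "MPG": [15.0, 21.0, 18.0, 16.0, 14.0],
    "Horsepower": [190, 90, 97, 150, 225],
    "Weight": [3850, 2648, 2774, 3433, 3086],
})

# Drop non-numeric columns, then compute pairwise Pearson correlations.
corr_demo = cars.select_dtypes(include="number").corr()
print(corr_demo.round(3))
```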

Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:

model = ols("MPG ~ Year", data=df)
results = model.fit()

The ‘formula’ that we used above is the same as R uses: on the left is the dependent variable, on the right is the independent variable. The ols function is nice and easy: we just give it the formula and the DataFrame to take the data from (in this case, it’s called df). We then call fit() to actually do the regression.

We can easily get a summary of the results here – including all sorts of crazy statistical measures!


                            OLS Regression Results
==============================================================================
Dep. Variable:                      MPG   R-squared:                     0.337
Model:                              OLS   Adj. R-squared:                0.335
Method:                   Least Squares   F-statistic:                   198.3
Date:                  Sat, 20 Aug 2016   Prob (F-statistic):         1.08e-36
Time:                          10:42:17   Log-Likelihood:              -1280.6
No. Observations:                   392   AIC:                           2565.
Df Residuals:                       390   BIC:                           2573.
Df Model:                             1
Covariance Type:              nonrobust
==============================================================================
                        coef   std err         t     P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept           -70.0117     6.645   -10.536     0.000   -83.076   -56.947
Year                  1.2300     0.087    14.080     0.000     1.058     1.402
==============================================================================
Omnibus:                         21.407   Durbin-Watson:                 1.121
Prob(Omnibus):                    0.000   Jarque-Bera (JB):             15.843
Skew:                             0.387   Prob(JB):                   0.000363
Kurtosis:                         2.391   Cond. No.                   1.57e+03
==============================================================================
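If you want to try the fit-then-summarise workflow without the cars file, here is a minimal sketch on made-up data. The DataFrame `toy`, the random seed and the "true" slope of 1.2 are all assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
year = rng.integers(70, 83, size=200)
# Made-up relationship: MPG rises by 1.2 per year, plus Gaussian noise.
mpg = -70.0 + 1.2 * year + rng.normal(0, 3, size=200)
toy = pd.DataFrame({"MPG": mpg, "Year": year})

# Left of ~ is the dependent variable, right of ~ the predictor.
# An intercept term is added automatically.
toy_results = ols("MPG ~ Year", data=toy).fit()
print(toy_results.params)
print(toy_results.summary())
```

The fitted Year coefficient should land close to the slope used to generate the data, which is a quick sanity check that the formula was interpreted as intended.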

We can do a more complex model easily too. First let’s list the columns of the data to remind us what variables we have:

df.columns
Index(['Model', 'MPG', 'Cylinders', 'Engine Disp', 'Horsepower', 'Weight',
       'Accelerate', 'Year', 'Origin'],
      dtype='object')

We can now add in more variables (here Weight and Horsepower alongside Year) – doing multiple regression:


                            OLS Regression Results
==============================================================================
Dep. Variable:                      MPG   R-squared:                     0.808
Model:                              OLS   Adj. R-squared:                0.807
Method:                   Least Squares   F-statistic:                   545.4
Date:                  Sat, 20 Aug 2016   Prob (F-statistic):        9.37e-139
Time:                          10:42:17   Log-Likelihood:              -1037.4
No. Observations:                   392   AIC:                           2083.
Df Residuals:                       388   BIC:                           2099.
Df Model:                             3
Covariance Type:              nonrobust
==============================================================================
                        coef   std err         t     P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept           -13.7194     4.182    -3.281     0.001   -21.941    -5.498
Year                  0.7487     0.052    14.365     0.000     0.646     0.851
Weight               -0.0064     0.000   -15.768     0.000    -0.007    -0.006
Horsepower           -0.0050     0.009    -0.530     0.597    -0.024     0.014
==============================================================================
Omnibus:                         41.952   Durbin-Watson:                 1.423
Prob(Omnibus):                    0.000   Jarque-Bera (JB):             69.490
Skew:                             0.671   Prob(JB):                   8.14e-16
Kurtosis:                         4.566   Cond. No.                   7.48e+04
==============================================================================

We can see that bringing in some extra variables has increased the $R^2$ value from ~0.3 to ~0.8 – although the p-value for Horsepower is very high (0.597). If we remove Horsepower from the regression, the results barely change:

model = ols("MPG ~ Year + Weight", data=df)
results = model.fit()
results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      MPG   R-squared:                     0.808
Model:                              OLS   Adj. R-squared:                0.807
Method:                   Least Squares   F-statistic:                   819.5
Date:                  Sat, 20 Aug 2016   Prob (F-statistic):        3.33e-140
Time:                          10:42:17   Log-Likelihood:              -1037.6
No. Observations:                   392   AIC:                           2081.
Df Residuals:                       389   BIC:                           2093.
Df Model:                             2
Covariance Type:              nonrobust
==============================================================================
                        coef   std err         t     P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept           -14.3473     4.007    -3.581     0.000   -22.224    -6.470
Year                  0.7573     0.049    15.308     0.000     0.660     0.855
Weight               -0.0066     0.000   -30.911     0.000    -0.007    -0.006
==============================================================================
Omnibus:                         42.504   Durbin-Watson:                 1.425
Prob(Omnibus):                    0.000   Jarque-Bera (JB):             71.997
Skew:                             0.670   Prob(JB):                   2.32e-16
Kurtosis:                         4.616   Cond. No.                   7.17e+04
==============================================================================
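One way to back up the "barely changes" impression is to compare the two nested fits directly. The sketch below does this on made-up data in which Horsepower is correlated with Weight but has no effect of its own; the data, seed and variable names are assumptions for illustration, and `compare_f_test` is just one of several ways to make this comparison.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 300
weight = rng.normal(3000, 800, size=n)
# Horsepower tracks weight but does not enter the MPG equation below.
horsepower = 0.04 * weight + rng.normal(0, 15, size=n)
year = rng.integers(70, 83, size=n)
mpg = -14.0 + 0.75 * year - 0.0065 * weight + rng.normal(0, 3, size=n)
toy = pd.DataFrame({"MPG": mpg, "Year": year,
                    "Weight": weight, "Horsepower": horsepower})

full = ols("MPG ~ Year + Weight + Horsepower", data=toy).fit()
reduced = ols("MPG ~ Year + Weight", data=toy).fit()

# F-test of the full model against the restricted (nested) one.
f_value, p_value, df_diff = full.compare_f_test(reduced)
print(f"R-squared: full={full.rsquared:.3f}, reduced={reduced.rsquared:.3f}")
print(f"F-test for the extra term: F={f_value:.3f}, p={p_value:.3f}, df={df_diff}")
```

A large p-value here says the extra term does not improve the fit by more than chance would explain, which is the same story the individual coefficient's p-value tells when only one term is dropped.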

We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called Origin, which we add alongside Year. Patsy automatically treats strings as categorical variables, so we don’t have to do anything special – but if needed we could wrap the variable name in C() to force it to be treated as categorical.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      MPG   R-squared:                     0.579
Model:                              OLS   Adj. R-squared:                0.576
Method:                   Least Squares   F-statistic:                   178.0
Date:                  Sat, 20 Aug 2016   Prob (F-statistic):         1.42e-72
Time:                          10:42:17   Log-Likelihood:              -1191.5
No. Observations:                   392   AIC:                           2391.
Df Residuals:                       388   BIC:                           2407.
Df Model:                             3
Covariance Type:              nonrobust
==============================================================================
                        coef   std err         t     P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept           -61.2643     5.393   -11.360     0.000   -71.868   -50.661
Origin[T.European]    7.4784     0.697    10.734     0.000     6.109     8.848
Origin[T.Japanese]    8.4262     0.671    12.564     0.000     7.108     9.745
Year                  1.0755     0.071    15.102     0.000     0.935     1.216
==============================================================================
Omnibus:                         10.231   Durbin-Watson:                 1.656
Prob(Omnibus):                    0.006   Jarque-Bera (JB):             10.589
Skew:                             0.402   Prob(JB):                    0.00502
Kurtosis:                         2.980   Cond. No.                   1.60e+03
==============================================================================

You can see here that Patsy has automatically created extra variables for Origin: in this case, European and Japanese, with the ‘default’ being American. You can configure how this is done very easily – see .
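As one example of that configuration, the sketch below changes which category acts as the baseline by using Patsy's `Treatment(reference=...)` inside `C()`. The nine-row DataFrame is invented for illustration (the MPG values are not from the cars data), so only the coefficient names are of interest here.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Made-up data: three cars per origin, purely to show the coding.
toy = pd.DataFrame({
    "MPG": [15.0, 18.0, 16.0, 26.0, 25.0, 24.0, 27.0, 31.0, 35.0],
    "Year": [70, 71, 72, 70, 71, 72, 70, 71, 72],
    "Origin": ["American"] * 3 + ["European"] * 3 + ["Japanese"] * 3,
})

# Default treatment coding: the first level alphabetically (American)
# is the baseline, so European and Japanese get their own terms.
default_fit = ols("MPG ~ Year + Origin", data=toy).fit()
print(default_fit.params.index.tolist())

# Explicitly pick Japanese as the baseline instead.
japan_fit = ols(
    "MPG ~ Year + C(Origin, Treatment(reference='Japanese'))", data=toy
).fit()
print(japan_fit.params.index.tolist())
```

Changing the reference level only changes how the coefficients are expressed: the two fits describe the same model and give the same R-squared.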

Just for reference, you can easily get any of the statistical outputs as attributes on the results object. For example, results.rsquared gives the R-squared value as a single number, and results.params gives the fitted coefficients:


results.params
Intercept            -61.264305
Origin[T.European]     7.478449
Origin[T.Japanese]     8.426227
Year                   1.075484
dtype: float64
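A few more of these attributes are sketched below on a small made-up fit (the data and the name `fit` are assumptions for illustration). Each one corresponds to a column or cell of the summary table.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
toy = pd.DataFrame({"Year": rng.integers(70, 83, size=100)})
toy["MPG"] = -60.0 + 1.1 * toy["Year"] + rng.normal(0, 3, size=100)
fit = ols("MPG ~ Year", data=toy).fit()

print(fit.rsquared)      # R-squared as a single number
print(fit.rsquared_adj)  # adjusted R-squared
print(fit.params)        # coefficients, indexed by term name
print(fit.pvalues)       # p-value for each coefficient
print(fit.conf_int())    # DataFrame: lower (0) and upper (1) 95% bounds
```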

You can also really easily use the model to predict based on values you’ve got:


results.predict({'Year': 90, 'Origin': 'European'})
array([ 43.00766095])
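To predict for several cars at once, you can pass a DataFrame with the same column names that appear in the formula. The following is a sketch on invented data (the MPG values and the names `toy` and `new_cars` are assumptions); note that any category in the new data needs to be one the model saw when it was fitted.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Made-up training data: three cars per origin.
toy = pd.DataFrame({
    "MPG": [15.0, 18.0, 16.0, 26.0, 25.0, 24.0, 27.0, 31.0, 35.0],
    "Year": [70, 71, 72, 70, 71, 72, 70, 71, 72],
    "Origin": ["American"] * 3 + ["European"] * 3 + ["Japanese"] * 3,
})
fit = ols("MPG ~ Year + Origin", data=toy).fit()

# One row per car to predict; columns match the names used in the formula.
new_cars = pd.DataFrame({
    "Year": [73, 73],
    "Origin": ["American", "Japanese"],
})
predictions = fit.predict(new_cars)
print(predictions)
```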




