Difference in linear regression with Statsmodels between the Patsy (formula) version and the dummy-list version

If you want to convert all *.ipynb files in the current directory to Python scripts, you can run the following command:

jupyter nbconvert --to script *.ipynb
asked by NuValue on 19 February 2019 at 12:39

1 answer

The results do not match because the Statsmodels formula interface (Patsy) does not use every dummy column: to avoid perfect multicollinearity with the intercept, it leaves out one level of each categorical predictor.
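To illustrate why keeping every dummy level alongside an intercept is a problem, here is a minimal sketch on hypothetical toy data (the series `levels` and its values are assumptions, not taken from the automobile dataset). It compares the column count of the design matrix with its rank.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable with three levels
levels = pd.Series(["a", "b", "c", "a", "b", "c"], name="g")

# One dummy column per level (pd.get_dummies default behaviour)
full = pd.get_dummies(levels).to_numpy(dtype=float)

# Design matrix with an intercept plus all three dummy columns
X_full = np.column_stack([np.ones(len(levels)), full])

# Same matrix without the dummy column of the first level ("a")
X_reduced = X_full[:, [0, 2, 3]]

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

With all levels present, the dummy columns sum to the intercept column, so the matrix loses one rank per categorical variable and the individual coefficients are no longer uniquely determined.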

Exactly the same results are obtained by fitting the regression described above after dropping the dummy columns that the formula version leaves out:

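import random

import pandas as pd
import statsmodels.api as sm

# Dummy columns that the formula version leaves out (one per categorical variable)
deletex = [
        'make_alfa-romero',
        'fuel_system_1bbl',
        'engine_type_dohc',
        'num_of_doors_four'
        ]
df_num.drop(deletex, axis=1, inplace=True)
# Convert all values to numeric; anything that cannot be parsed becomes NaN
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis=1)
X = df_num.drop('price', axis=1)
y = df_num.price.values
Xc = sm.add_constant(X)  # Adds a constant (intercept) column to the model
random.seed(1234)  # Kept from the original; the OLS fit itself is deterministic
linear_regression = sm.OLS(y, Xc)
linear_regression.fit().summary()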

which prints the result:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:16:08   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
bore                3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio  -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height               -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm              -0.5903      0.790     -0.747      0.456      -2.150       0.970
make_audi           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make_bmw            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make_chevrolet      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make_dodge         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make_honda          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make_isuzu          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make_jaguar         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make_mazda           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make_mercedes-benz  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make_mercury        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make_mitsubishi    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make_nissan        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make_peugot         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make_plymouth       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make_porsche        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make_renault       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make_saab           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make_subaru        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make_toyota         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make_volkswagen      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make_volvo          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system_2bbl    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system_4bbl     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system_idi     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system_mfi     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system_mpfi    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system_spdi    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system_spfi     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type_dohcv  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type_l      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type_ohc    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type_ohcf    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type_ohcv    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type_rotor   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors_two    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

These results are identical to those of the first call, which uses the Statsmodels formula interface:

import statsmodels.formula.api as smf

random.seed(1234)
lm_python = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
lm_python.fit().summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:17:37   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
make[T.audi]           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make[T.bmw]            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make[T.chevrolet]      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make[T.dodge]         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make[T.honda]          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make[T.isuzu]          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make[T.jaguar]         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make[T.mazda]           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make[T.mercedes-benz]  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make[T.mercury]        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make[T.mitsubishi]    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make[T.nissan]        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make[T.peugot]         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make[T.plymouth]       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make[T.porsche]        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make[T.renault]       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make[T.saab]           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make[T.subaru]        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make[T.toyota]         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make[T.volkswagen]      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make[T.volvo]          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system[T.2bbl]    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system[T.4bbl]     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system[T.idi]     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system[T.mfi]     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system[T.mpfi]    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system[T.spdi]    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system[T.spfi]     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type[T.dohcv]  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type[T.l]      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type[T.ohc]    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type[T.ohcf]    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type[T.ohcv]    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type[T.rotor]   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors[T.two]    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
bore                   3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio     -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height                  -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm                 -0.5903      0.790     -0.747      0.456      -2.150       0.970
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================
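The two summaries list the same terms under different names (`const` versus `Intercept`, `make_audi` versus `make[T.audi]`). To compare them programmatically, the names can be mapped onto each other. The helper below, `patsy_to_dummy_name`, is a hypothetical illustration and not part of Statsmodels or Patsy; it assumes the naming patterns visible in the two tables above.

```python
import re


def patsy_to_dummy_name(term: str) -> str:
    """Map a formula-style term name to the get_dummies-style column name.

    Hypothetical helper: it only handles the patterns seen in the two
    summaries above ("Intercept" and "variable[T.level]").
    """
    if term == "Intercept":
        return "const"
    match = re.fullmatch(r"(\w+)\[T\.(.+)\]", term)
    if match:
        return f"{match.group(1)}_{match.group(2)}"
    return term  # numeric predictors keep their name


formula_terms = [
    "Intercept",
    "make[T.audi]",
    "make[T.mercedes-benz]",
    "fuel_system[T.2bbl]",
    "num_of_doors[T.two]",
    "bore",
]
dummy_names = [patsy_to_dummy_name(t) for t in formula_terms]
print(dummy_names)
```

Applied to the parameter index of the formula fit, such a mapping would let you line up both coefficient tables row by row and confirm that no term is missing on either side.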

You therefore need to check that the predictor variables match, because pd.get_dummies generates a dummy column for every level of a categorical variable, whereas the Statsmodels formula interface uses only N-1 levels per categorical variable.
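Instead of listing the extra columns by hand, pandas can leave out the first level of each categorical variable through the `drop_first` parameter of `pd.get_dummies`. The following is a minimal sketch on hypothetical toy data (the column names and values are assumptions, not the automobile dataset). Whether the omitted level coincides with the reference level chosen by the formula interface depends on how the levels are ordered, so the resulting columns should still be checked.

```python
import pandas as pd

# Hypothetical toy data, not the dataset from the question
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "gas"],
    "doors": ["two", "four", "four", "two"],
    "price": [10.0, 12.0, 11.0, 9.5],
})

# Default: one dummy column per level (N columns per variable)
full = pd.get_dummies(df, columns=["fuel", "doors"])

# drop_first=True: the first level of each variable is omitted (N-1 columns)
reduced = pd.get_dummies(df, columns=["fuel", "doors"], drop_first=True)

# Columns present only in the full encoding
dropped = sorted(set(full.columns) - set(reduced.columns))
print(dropped)                # ['doors_four', 'fuel_diesel']
print(list(reduced.columns))  # ['price', 'fuel_gas', 'doors_two']
```

In this toy case the omitted columns are the alphabetically first levels, the same pattern as the `deletex` list in the answer.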

answered by NuValue on 19 February 2019 at 12:39