Python学习之线性回归

2020-02-21

在处理数据时，经常会用到数据拟合，其中最常用的应当是线性回归。本文整理了 Python 中一些常见的进行线性回归的方法。

使用 numpy 进行多项式拟合

from __future__ import division
import numpy as np 

d1=[1,2,3,4,5,6]
d2=[0.9,2.3,3.2,3.7,4.8,6.5]
X=np.array(d1)
Y=np.array(d2)

z1 = np.polyfit(X,Y, deg=1)  #最高次幂为 1，相当于线性拟合
p1 = np.poly1d(z1)
print(z1)  #多项式系数
print(p1)  #多项式方程
p1(X)      #代入求值  

plt.plot(X,marker='o',label='X')
plt.plot(Y,marker='o',label='Y')
plt.plot(p1(X),marker='o',label='Z')
plt.legend()

1、np.poly1d()此函数有两个参数：

　　参数1：为一个数组，若没有参数2，则生成一个多项式，例如：

　　　　　　p = np.poly1d([2,3,5,7])

　　　　　　print(p) ==>>2x^3 + 3x^2 + 5x + 7 数组中的数值为coefficient（系数），从后往前 0，1，2.。。为位置书的次数

参数2：若参数2为True，则表示把数组中的值作为根，然后反推多项式，例如：

　　　　　　q = np.poly1d([2,3,5],True)

　　　　　　print(q) ===>>(x - 2)(x - 3)(x - 5) = x^3 - 10x^2 + 31x -30

　　参数3：variable=‘z’表示改变未知数的字母，例如：

　　　　　　q = np.poly1d([2,3,5],True,varibale = ‘z’)

　　　　　　print(q) ===>>(z - 2)(z - 3)(z - 5) = z^3 - 10z^2 + 31z -30

2、多项式poly1d()的方法

　　　a.　　p(0.5)表示当x = 0.5时，多项式的值为多少

　　　b.　　p.r表示当多项式为 0 时，此等式的根

　　　c.　　p.c表示生成多项式的系数数组

　　　d.　　p.order表示返回最高项的次方数

　　　e.　　p[1]表示返回第一项的系数

　　　f.　　多项式支持实数的四则运算

使用 statsmodels 进行多项式拟合

from __future__ import division
import numpy as np 

import statsmodels.api as sm
import statsmodels.formula.api as smf

d1=[1,2,3,4,5,6]
d2=[0.9,2.3,3.2,3.7,4.8,6.5]
X=np.array(d1)
Y=np.array(d2)

# Construct the columns for the different powers of x
# xpoly 的列数（即 k 值）表示多项式最高项的次数
k=1
xpoly = np.column_stack([X**i for i in range(k+1)])    
regr = sm.OLS(Y, xpoly).fit()
print(regr.summary())  # 会详细列出拟合的结果  
regr.params            # 拟合得到的系数 
regr.fittedvalues      # 拟合得到的 Yhat 值
regr.predict(xpoly)    # 带入求值
regr.rsquared          # 返回 rsquared

plt.plot(regr.predict(xpoly),marker='o',label='Z1')
plt.legend()

使用 scipy 进行一次多项式拟合

from __future__ import division
import numpy as np
from scipy import stats

d1=[1,2,3,4,5,6]
d2=[0.9,2.3,3.2,3.7,4.8,6.5]
X=np.array(d1)
Y=np.array(d2)

slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)

决定系数

　　统计学中，决定系数 (Coefficient of determination) 表示为 R^2 或者 r^2 发音为”R squared”，可定义为已被模式中全部自变量说明的自变量的变差对自变量总变差的比值。

定义

　　数据集中包含 n 个观测值 y_1,…,y_n，与之对应的预测值为${\widehat {y}}1,…,{\widehat {y}}_n
$，使用 \bar {y} 表示观测值的平均值：
$$
{\bar{y} = {\frac {1}{n}}\sum _{i=1}^{n}y{i}}
$$
则：

The total sum of squares (proportional to the variance of the data):
$$
SS_{\text{tot}}=\sum {i}(y{i}-{\bar {y}})^{2}
$$
The sum of squares of residuals, also called the residual sum of squares:
$$
SS_{\text{res}}=\sum {i}(y{i}-\widehat {y}_i)^{2}
$$

对应的决定系数的最普遍的定义为
$$
R^{2}\equiv 1-{SS_{\rm {res}} \over SS_{\rm {tot}}}
$$

对应的 Python 计算公式

from __future__ import division
import numpy as np 

def get_r2_numpy(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    yhat = slope * x + intercept
    r_squared = 1 - (sum((y - yhat)**2) / ((len(y) - 1) * np.var(y, ddof=1)))  # np.var() 表示方差 
    return r_squared
    
d1=[1,2,3,4,5,6]
d2=[0.9,2.3,3.2,3.7,4.8,6.5]

x=np.array(d2)
y=np.array(d1)
get_r2_numpy(x, y)

相关系数

　　相关系数 (correlation coefficient) 是表示变量之间线性相关程度的量，一般用字母 r 表示，具有不同的定义方式，常见的为 Pearson product-moment correlation coefficient 和 Spearman’s rank correlation coefficient。

皮尔逊相关系数的定义

$$
{\rho _{X,Y}={\frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}}}
$$

式中 $\operatorname {cov}$表示协方差，${\sigma {X}}和{\sigma {Y}}$ 分别表示 X 和 Y 的标准差，由标准差和协方差的定义：
$$
\sigma _{X}^{2}=\operatorname {E} [(X-\operatorname {E} [X])^{2}]=\operatorname {E} [X^{2}]-[\operatorname {E} [X]]^{2}
$$

$$
\sigma _{Y}^{2}=\operatorname {E} [(Y-\operatorname {E} [Y])^{2}]=\operatorname {E} [Y^{2}]-[\operatorname {E} [Y]]^{2}
$$

$$
\operatorname {E} [(X-\mu _{X})(Y-\mu _{Y})]=\operatorname {E} [(X-\operatorname {E} [X])(Y-\operatorname {E} [Y])]=\operatorname {E} [XY]-\operatorname {E} [X]\operatorname {E} [Y]
$$

皮尔逊相关系数也可以写为：
$$
{\rho _{X,Y}=\frac {\operatorname {E} [XY]-\operatorname {E} [X]\operatorname {E} [Y]}{\sigma _{X}~\sigma _{Y}}}
$$

Python 中直接计算皮尔逊相关系数：

from __future__ import division
import numpy as np 

d1=[1,2,3,4,5,6]
d2=[0.9,2.3,3.2,3.7,4.8,6.5]
x=np.array(d2)
y=np.array(d1)

rp = np.corrcoef(x, y)[0, 1]

　　由上可知，对于最高项次数为 1 的线性回归，其决定系数等于 x, y 相关系数的平方；对于其它拟合，起决定系数等于${\widehat {y}}, y$相关系数的平方。