Logistic Regression

The logistic distribution

Let X be a continuous random variable. X follows the logistic distribution if X has the following distribution function and density function:
$ F(x)=P(X \le x) = \frac{1}{1+e^{-(x - \mu)/\gamma}} $
$ f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma(1+e^{-(x-\mu)/\gamma})^2} $ where $ \mu $ is the location parameter and $ \gamma \gt 0 $ is the shape parameter.

When $ \mu \ne 0 $, the curve is simply shifted left or right by $ \mu $ units; the larger $ \gamma $ is, the flatter the curve. Plot code:
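
A minimal sketch of these effects, assuming numpy and matplotlib (the parameter values are illustrative only):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 500)
for mu, gamma in [(0, 1), (0, 2), (2, 1)]:
    F = 1 / (1 + np.exp(-(x - mu) / gamma))          # distribution function F(x)
    plt.plot(x, F, label=r'$\mu=%d, \gamma=%d$' % (mu, gamma))
plt.legend()
plt.title('Logistic distribution function F(x)')
plt.show()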


The logistic regression model

The binomial logistic regression model is the following conditional probability distribution:
$ P(Y=1|x) = \frac{\exp(w \cdot x + b)} {1+\exp(w \cdot x + b)} $
$ P(Y=0|x) = \frac{1} {1+\exp(w \cdot x + b)} $
Here, $ x \in \mathcal R^n $ is the input, $ Y \in \{0,1\} $ is the output, and $ w \in \mathcal R^n $ and $ b \in \mathcal R $ are the parameters: w is called the weight vector, b the bias, and $ w \cdot x $ denotes the inner product of w and x. Logistic regression is essentially a linear model whose output is passed through the sigmoid function, mapping it into the interval (0, 1); equivalently, the log-odds $ \log \frac{P(Y=1|x)}{P(Y=0|x)} = w \cdot x + b $ is linear in x.
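
A quick numeric check of the two formulas; w, b, and x below are made-up values, purely for illustration:

import numpy as np

# made-up parameters and input, only to evaluate the formulas above
w, b = np.array([0.5, -0.25]), 0.1
x = np.array([2.0, 4.0])
z = np.dot(w, x) + b                # w·x + b = 0.1
p1 = np.exp(z) / (1 + np.exp(z))    # P(Y=1|x) ≈ 0.525
print(p1, 1 - p1)                   # P(Y=0|x) = 1 - P(Y=1|x); they sum to 1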

Model parameter estimation
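
The parameters are fit by maximum likelihood, following 统计学习方法. Write $ \pi(x) = P(Y=1|x) $ and absorb b into w through the constant 1.0 feature that loadDataSet2 prepends. For training data $ (x_1,y_1),\dots,(x_N,y_N) $ with $ y_i \in \{0,1\} $, the log-likelihood is
$ L(w) = \sum_{i=1}^{N} [y_i (w \cdot x_i) - \log(1 + \exp(w \cdot x_i))] $
with gradient
$ \frac{\partial L(w)}{\partial w} = \sum_{i=1}^{N} x_i (y_i - \pi(x_i)) $
Gradient ascent therefore updates $ w \leftarrow w + \alpha \sum_i x_i (y_i - \pi(x_i)) $, which is exactly the weights + alpha * dataMatrix.transpose() * error step in the code below.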


import numpy as np

# Map any real-valued score into (0, 1)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Load the dataset: each row of testSet5.txt is "x1 x2 label";
# a leading 1.0 is prepended so the bias b is folded into the weights
def loadDataSet2():
    dataMat, labelMat = [], []
    with open('./dataset/testSet5.txt') as fr:
        for line in fr:
            lineArr = line.strip().split()
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat


# Batch gradient ascent on the log-likelihood (every iteration sweeps the
# whole training set, so it is a poor fit for large datasets)
def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.mat(dataMatIn)              # m x n design matrix
    labelMat = np.mat(classLabels).transpose()  # m x 1 column of labels
    m, n = dataMatrix.shape

    alpha = 0.001       # learning rate
    max_iter = 500
    weights = np.ones((n, 1))
    for k in range(max_iter):
        h = sigmoid(dataMatrix * weights)       # m x 1 vector of P(Y=1|x)
        error = labelMat - h                    # y - h, direction of the likelihood gradient
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
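
A hypothetical driver for the two functions above (it assumes ./dataset/testSet5.txt exists in the format loadDataSet2 expects):

dataArr, labelMat = loadDataSet2()
weights = gradAscent(dataArr, labelMat)
print(weights)    # 3 x 1 matrix: [bias, w1, w2]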


# Stochastic gradient ascent: each update uses a single training sample,
# reducing the per-update cost (one pass over the data, fixed alpha)
def stocGradAscent0(dataMatrix, classLabels):
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))   # scalar prediction for sample i
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights


# Improved stochastic gradient ascent: the learning rate decays as training
# progresses, and each epoch visits the samples in random order without
# replacement, which damps the oscillation of the weights
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = dataMatrix.shape
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))    # indices not yet visited this epoch
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01   # decays but never reaches 0
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            sample = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sample] * weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * dataMatrix[sample]
            del dataIndex[randIndex]  # sample without replacement
    return weights
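
On the 2-D toy data the fit can be inspected visually: the decision boundary is the line $ w_0 + w_1 x_1 + w_2 x_2 = 0 $. Below is a minimal plotting sketch in the spirit of 机器学习实战's plotBestFit, assuming matplotlib and the 1-D weights returned by stocGradAscent1 (for gradAscent's matrix result, pass weights.getA().ravel()); the axis range is illustrative:

import numpy as np
import matplotlib.pyplot as plt

def plotBestFit(weights, dataMat, labelMat):
    dataArr = np.array(dataMat)
    labels = np.array(labelMat)
    pos, neg = dataArr[labels == 1], dataArr[labels == 0]
    plt.scatter(pos[:, 1], pos[:, 2], marker='s', label='Y=1')
    plt.scatter(neg[:, 1], neg[:, 2], label='Y=0')
    x1 = np.arange(-3.0, 3.0, 0.1)
    x2 = (-weights[0] - weights[1] * x1) / weights[2]   # solve w·x = 0 for x2
    plt.plot(x1, x2)
    plt.xlabel('X1'); plt.ylabel('X2'); plt.legend()
    plt.show()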


# Classify: predict 1 when the estimated probability P(Y=1|x) exceeds 0.5
def classifyVec(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1
    return 0

# Train on the horse-colic training file, then report the error rate on the test file
def colicTest():
    trainingSet, trainingLabels = [], []
    with open('dataset/horseColicTraining.txt') as frTrain:
        for line in frTrain:
            currLine = line.strip().split('\t')
            trainingSet.append([float(currLine[i]) for i in range(21)])  # 21 features
            trainingLabels.append(float(currLine[21]))                   # last column is the label
    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 500)
    errorCount, numTestVec = 0, 0
    with open('dataset/horseColicTest.txt') as frTest:
        for line in frTest:
            numTestVec += 1
            currLine = line.strip().split('\t')
            lineArr = [float(currLine[i]) for i in range(21)]
            if int(classifyVec(np.array(lineArr), trainWeights)) != int(currLine[21]):
                errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

Full code


Ref:
1. 统计学习方法 (Statistical Learning Methods), Li Hang
2. 机器学习实战 (Machine Learning in Action), Peter Harrington