Naive Bayes

1. Conditional probability (definition): let A and B be two events with $ P(A) > 0 $. Then $ P(B|A) = \frac{P(AB)}{P(A)} $ is called the probability that event B occurs given that event A has occurred.
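
For instance, with made-up numbers: if $ P(A) = 0.5 $ and $ P(AB) = 0.2 $, then $ P(B|A) = 0.2 / 0.5 = 0.4 $.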

2. Law of total probability: let S be the sample space of an experiment E, let A be an event of E, and let $ B_{1},B_{2},\cdots,B_{n} $ be a partition of S with $ P(B_{i})>0\ (i=1,2,\cdots,n) $. Then $ P(A) = P(A|B_{1})P(B_{1})+P(A|B_{2})P(B_{2})+\cdots+P(A|B_{n})P(B_{n}) $, which is called the law of total probability.
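
As a quick illustration with made-up numbers: if $ B_{1},B_{2} $ partition S with $ P(B_{1})=0.6 $, $ P(B_{2})=0.4 $, $ P(A|B_{1})=0.5 $, and $ P(A|B_{2})=0.25 $, then $ P(A) = 0.5 \times 0.6 + 0.25 \times 0.4 = 0.4 $.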

3. Bayes' formula: let S be the sample space of an experiment E, let A be an event of E, and let $ B_{1},B_{2},\cdots,B_{n} $ be a partition of S with $ P(A)>0 $ and $ P(B_{i})>0\ (i=1,2,\cdots,n) $. Then $ P(B_{i}|A) = \frac{P(A|B_{i})P(B_{i})}{\sum_{j=1}^{n}P(A|B_{j})P(B_{j})}, \quad i=1,2,\cdots,n $, which is called Bayes' formula.
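
Continuing the illustrative numbers above: $ P(B_{1}|A) = \frac{P(A|B_{1})P(B_{1})}{P(A)} = \frac{0.5 \times 0.6}{0.4} = 0.75 $, i.e., observing A raises the probability of $ B_{1} $ from 0.6 to 0.75.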


Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence among features. Given a training data set, it first learns the joint probability distribution of the input and output under the conditional independence assumption; then, for a given input x, it uses Bayes' theorem to find the output y with the largest posterior probability. Naive Bayes is simple to implement and efficient in both learning and prediction, which makes it a commonly used method.


The naive Bayes algorithm

Input: training data $ T = \{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{N},y_{N})\} $, where $ x_{i} = (x_{i}^{(1)},x_{i}^{(2)},\cdots,x_{i}^{(n)})^{T} $, $ x_{i}^{(j)} $ is the j-th feature of the i-th sample, $ x_{i}^{(j)} \in \{a_{j1},a_{j2},\cdots,a_{jS_{j}}\} $, $ a_{jl} $ is the l-th value the j-th feature can take, $ j=1,2,\cdots,n $, $ l=1,2,\cdots,S_{j} $, $ y_{i} \in \{c_{1},c_{2},\cdots,c_{K}\} $; an instance x;
Output: the class of instance x.
(1) Compute the prior probabilities and the conditional probabilities:
$ P(Y=c_{k}) = \frac{\sum_{i=1}^{N}I(y_{i}=c_{k})}{N}, \quad k=1,2,\cdots,K $
$ P(X^{(j)}=a_{jl}|Y=c_{k}) = \frac{\sum_{i=1}^{N}I(x_{i}^{(j)}=a_{jl},y_{i}=c_{k})}{\sum_{i=1}^{N}I(y_{i}=c_{k})} $
$ j=1,2,\cdots,n; \quad l=1,2,\cdots,S_{j}; \quad k=1,2,\cdots,K $
(2) For the given instance $ x=(x^{(1)},x^{(2)},\cdots,x^{(n)})^{T} $, compute
$ P(Y=c_{k})\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_{k}), \quad k=1,2,\cdots,K $
(3) Determine the class of instance x:
$ y = \arg\max_{c_{k}} P(Y=c_{k})\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_{k}) $
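
The three steps above translate almost line by line into code. Below is a minimal sketch for categorical features; the names naive_bayes_fit and naive_bayes_predict are made up here (this is not the reference implementation, and no smoothing is applied), and the toy data mirrors Example 4.1 in 统计学习方法.

import numpy as np
from collections import Counter, defaultdict


# Minimal sketch of steps (1)-(3) for categorical features; illustrative only.
def naive_bayes_fit(X, y):
    N, n = X.shape
    # Step (1a): prior P(Y=c_k) from label counts
    priors = {c: cnt / N for c, cnt in Counter(y).items()}
    # Step (1b): conditional P(X^(j)=a_jl | Y=c_k) from per-class counts
    cond = defaultdict(dict)
    for c in priors:
        Xc = X[y == c]
        for j in range(n):
            cond[c][j] = {a: cnt / len(Xc) for a, cnt in Counter(Xc[:, j]).items()}
    return priors, cond


def naive_bayes_predict(x, priors, cond):
    # Steps (2)-(3): score each class, return the arg max
    scores = {}
    for c, prior in priors.items():
        p = prior
        for j, a in enumerate(x):
            p *= cond[c][j].get(a, 0.0)  # unseen value gives probability 0 (no smoothing)
        scores[c] = p
    return max(scores, key=scores.get)


# Toy data mirroring Example 4.1 in 统计学习方法
X = np.array([[1, 'S'], [1, 'M'], [1, 'M'], [1, 'S'], [1, 'S'],
              [2, 'S'], [2, 'M'], [2, 'M'], [2, 'L'], [2, 'L'],
              [3, 'L'], [3, 'M'], [3, 'M'], [3, 'L'], [3, 'L']], dtype=object)
y = np.array([-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1])
priors, cond = naive_bayes_fit(X, y)
print(naive_bayes_predict(np.array([2, 'S'], dtype=object), priors, cond))  # prints -1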


Implementation code

import numpy as np
from math import log


# Build the vocabulary: the set of unique words across all documents
def createVecabList(dataSet):
    vocabSet = set([])
    for doc in dataSet:
        vocabSet = vocabSet | set(doc)
    return list(vocabSet)


# Map a document onto the vocabulary: the entry is 1 if the word appears in the document, 0 otherwise
def setOfWords2Vec(vocabList, inputSet):
    retVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            retVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return retVec
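
For example, with a hypothetical four-word vocabulary:

vocab = ['love', 'stupid', 'my', 'garbage']  # hypothetical vocabulary
print(setOfWords2Vec(vocab, ['my', 'stupid', 'dog']))
# prints "the word: dog is not in my vocabulary!" followed by [0, 1, 1, 0]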


# Compute p(w_i|c0), p(w_i|c1), and pAbusive (the prior probability of class 1)
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    # Initialize counts to 1 and denominators to 2 (Laplace smoothing),
    # so a word unseen in one class does not get probability 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom, p1Denom = 2.0, 2.0

    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])

    # Return log probabilities so classifyNB can add instead of multiply
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive
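
The ones/2.0 initialization in trainNB0 is Laplace smoothing: with zero-initialized counts, a word that never appears in one class would get probability 0 and wipe out the entire product in step (2). In the notation of the algorithm section, the smoothed (Bayesian) estimate with $ \lambda = 1 $ is

$ P_{\lambda}(X^{(j)}=a_{jl}|Y=c_{k}) = \frac{\sum_{i=1}^{N}I(x_{i}^{(j)}=a_{jl},y_{i}=c_{k}) + \lambda}{\sum_{i=1}^{N}I(y_{i}=c_{k}) + S_{j}\lambda} $

(The code offsets the denominator by 2 rather than by the vocabulary size $ S_{j} $; this is the simplification used in 机器学习实战.)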


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Compare log posteriors: sum the log P(w|c) of the words present, plus log P(c)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
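
classifyNB compares log posteriors: since the logarithm is monotonic, $ \log\big(P(c_{k})\prod_{j}P(w_{j}|c_{k})\big) = \log P(c_{k}) + \sum_{j}\log P(w_{j}|c_{k}) $ picks out the same class as the raw product, while summing avoids the numerical underflow that multiplying many small probabilities would cause. This is why trainNB0 returns log-probability vectors.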

def testingNB():
    listPosts, listClasses = loadDataSet()
    myVocabList = createVecabList(listPosts)
    trainMat = []
    for p in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, p))

    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
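
loadDataSet is not defined in this section. A minimal stand-in, assuming the toy forum-post data used in 机器学习实战 (any labeled word lists containing the test words would behave the same way), could be:

# Minimal stand-in for loadDataSet, assuming the toy forum-post data
# from 机器学习实战; the exact posts are illustrative.
def loadDataSet():
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive, 0 = not abusive
    return postingList, classVec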

Complete code


Ref:
1. 机器学习实战 (Machine Learning in Action)
2. 统计学习方法 (Statistical Learning Methods)