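It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word | topic) on the fly for every word, while you could easily precompute these values. (I'm assuming you want to classify multiple documents here. If you are only classifying a single document, computing on the fly is fine, since you never calculate values for words that are not in the document.)
Similarly, the calculation of p(topic) can be moved outside the loop.
Finally, you don't need to sort the entire array to find the maximum; tracking the best value seen so far is enough.
All small points! But that's what you asked for :)
I've written some untested PHP code below showing how I would implement this: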
    <?php
    // Get word counts from database: topic => (word => count)
    $nWordPerTopic = mystery_sql();

    // Calculate p(word|topic) = nWord / sum(nWord for every word)
    $nTopics = array();
    $pWordPerTopic = array();
    foreach ($nWordPerTopic as $topic => $wordCounts) {
        // Get total word count in topic
        $nTopic = array_sum($wordCounts);

        // Calculate p(word|topic)
        $pWordPerTopic[$topic] = array();
        foreach ($wordCounts as $word => $count) {
            $pWordPerTopic[$topic][$word] = $count / $nTopic;
        }

        // Save $nTopic for next step
        $nTopics[$topic] = $nTopic;
    }

    // Calculate p(topic)
    $nTotal = array_sum($nTopics);
    $pTopics = array();
    foreach ($nTopics as $topic => $nTopic) {
        $pTopics[$topic] = $nTopic / $nTotal;
    }

    // Classify
    foreach ($documents as $document) {
        $title = $document['title'];
        $tokens = tokenizer($title);

        // Track the best topic so far instead of sorting all scores
        $pMax = -1;
        $selectedTopic = null;
        foreach ($pTopics as $topic => $pTopic) {
            $p = $pTopic;
            foreach ($tokens as $word) {
                // Words never seen in this topic are skipped
                if (!array_key_exists($word, $pWordPerTopic[$topic])) {
                    continue;
                }
                $p *= $pWordPerTopic[$topic][$word];
            }

            if ($p > $pMax) {
                $selectedTopic = $topic;
                $pMax = $p;
            }
        }
        // $selectedTopic now holds the most probable topic for $document
    }
    ?>
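One thing to watch with the scoring loop above: it multiplies many probabilities below 1, so for long documents the product can underflow to 0 in floating point. A common workaround is to add logarithms instead, which leaves the arg max unchanged because log is monotonic. The following is only a sketch of that idea; `classifyLogSpace()` is a hypothetical helper name and the probabilities are made-up illustration values, not output from your database.

```php
<?php
/**
 * Pick the topic with the highest log-score for one tokenized document.
 * Hypothetical helper, assuming the same array layouts as the code above:
 *   $pTopics       : topic => p(topic)
 *   $pWordPerTopic : topic => (word => p(word|topic))
 */
function classifyLogSpace(array $tokens, array $pTopics, array $pWordPerTopic): ?string
{
    $bestTopic = null;
    $bestScore = -INF;
    foreach ($pTopics as $topic => $pTopic) {
        // score = log p(topic) + sum over words of log p(word|topic)
        $score = log($pTopic);
        foreach ($tokens as $word) {
            // Same policy as the loop above: words unseen in this topic are skipped
            if (!isset($pWordPerTopic[$topic][$word])) {
                continue;
            }
            $score += log($pWordPerTopic[$topic][$word]);
        }
        // Keep the running maximum; no sorting needed
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestTopic = $topic;
        }
    }
    return $bestTopic;
}

// Made-up probabilities, for illustration only
$pTopics = ['sports' => 0.5, 'tech' => 0.5];
$pWordPerTopic = [
    'sports' => ['ball' => 0.6, 'game' => 0.3, 'code' => 0.1],
    'tech'   => ['ball' => 0.1, 'game' => 0.2, 'code' => 0.7],
];

$first  = classifyLogSpace(['game', 'ball'], $pTopics, $pWordPerTopic);
$second = classifyLogSpace(['code', 'code', 'game'], $pTopics, $pWordPerTopic);
```

With these toy numbers the first document scores highest for `sports` and the second for `tech`, the same winners the plain product would give.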
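As for the math...
You want to maximize p(topic | words), so find

$$\arg\max_{\text{topic}} \; p(\text{topic} \mid \text{words})$$

(that is, the argument topic for which p(topic | words) is highest).
By Bayes' theorem,

$$p(\text{topic} \mid \text{words}) = \frac{p(\text{topic}) \, p(\text{words} \mid \text{topic})}{p(\text{words})}$$

So you are looking for

$$\arg\max_{\text{topic}} \; \frac{p(\text{topic}) \, p(\text{words} \mid \text{topic})}{p(\text{words})} = \arg\max_{\text{topic}} \; p(\text{topic}) \, p(\text{words} \mid \text{topic})$$

The denominator p(words) is the same for every topic, so it does not affect which topic wins and can be dropped.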
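The naive Bayes assumption (which is what makes this a naive Bayes classifier) is

$$p(\text{words} \mid \text{topic}) = p(\text{word}_1 \mid \text{topic}) \, p(\text{word}_2 \mid \text{topic}) \cdots$$

so the quantity to find becomes

$$\arg\max_{\text{topic}} \; p(\text{topic}) \, p(\text{word}_1 \mid \text{topic}) \, p(\text{word}_2 \mid \text{topic}) \cdots$$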
where

$$p(\text{topic}) = \frac{\text{number of words in topic}}{\text{number of words in total}}$$

and

$$
\begin{aligned}
p(\text{word} \mid \text{topic})
  &= \frac{p(\text{word}, \text{topic})}{p(\text{topic})}
   = p(\text{word}, \text{topic}) \cdot \frac{1}{p(\text{topic})} \\
  &= \frac{\text{number of times word occurs in topic}}{\text{number of words in total}}
     \cdot \frac{\text{number of words in total}}{\text{number of words in topic}} \\
  &= \frac{\text{number of times word occurs in topic}}{\text{number of words in topic}}
\end{aligned}
$$
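To make the last derivation concrete, here is a small sketch that computes both estimates from word counts and checks that the long form, p(word, topic) / p(topic), agrees with the short form, count / words in topic. The counts are invented for illustration, and names such as `$pWordGivenTopic` and `$maxDiff` are mine rather than anything from your code.

```php
<?php
// Hypothetical word counts per topic (topic => (word => count)); not real data
$nWordPerTopic = [
    'sports' => ['ball' => 6, 'game' => 3, 'code' => 1],
    'tech'   => ['ball' => 2, 'game' => 4, 'code' => 14],
];

// Number of words in each topic, and number of words in total
$nTopics = array_map('array_sum', $nWordPerTopic);
$nTotal  = array_sum($nTopics);

$pTopic = [];
$pWordGivenTopic = [];
$maxDiff = 0.0; // largest gap between the two ways of computing p(word|topic)

foreach ($nWordPerTopic as $topic => $wordCounts) {
    // p(topic) = number of words in topic / number of words in total
    $pTopic[$topic] = $nTopics[$topic] / $nTotal;

    foreach ($wordCounts as $word => $count) {
        // Long form: p(word, topic) / p(topic)
        $pJoint   = $count / $nTotal;
        $viaJoint = $pJoint / $pTopic[$topic];

        // Short form: occurrences of word in topic / number of words in topic
        $direct = $count / $nTopics[$topic];

        $pWordGivenTopic[$topic][$word] = $direct;
        $maxDiff = max($maxDiff, abs($viaJoint - $direct));
    }
}
```

Up to floating-point rounding, `$maxDiff` stays at zero, which is the cancellation of "number of words in total" shown in the derivation.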