
PHP implementation of a Bayes classifier: assigning topics to texts

  •  6
  • caw  · Tech Community  · 14 years ago

    In my news page project, I have a database table news with the following structure:

     - id: [integer] unique number identifying the news entry, e.g.: *1983*
     - title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
     - topic: [string] category which should be chosen by the classifier, e.g.: *Sports*
    

    In addition, there is a table bayes that holds the word frequencies:

     - word: [string] a word for which the frequencies are given, e.g.: *real estate*
     - topic: [string] same content as "topic" field above, e.g.: *Economics*
     - count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g.: *100*
    

    Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to each of them.

    Is this the correct way to implement it? Can you improve it?

    <?php
    include 'mysqlLogin.php';
    $get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
    $get2 = mysql_abfrage($get1);
    // pTOPICS BEGIN
    $pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
    $pTopics2 = mysql_abfrage($pTopics1);
    $pTopics = array();
    while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
        $pTopics[$pTopics3['topic']] = $pTopics3['count'];
    }
    // pTOPICS END
    // pWORDS BEGIN
    $pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
    $pWords2 = mysql_abfrage($pWords1);
    $pWords = array();
    while ($pWords3 = mysql_fetch_assoc($pWords2)) {
        if (!isset($pWords[$pWords3['topic']])) {
            $pWords[$pWords3['topic']] = array();
        }
        $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
    }
    // pWORDS END
    while ($get3 = mysql_fetch_assoc($get2)) {
        $pTextInTopics = array();
        $tokens = tokenizer($get3['title']);
        foreach ($pTopics as $topic=>$documentsInTopic) {
            if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
            foreach ($tokens as $token) {
                echo '....'.$token;
                if (isset($pWords[$topic][$token])) {
                    $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
                }
            }
            $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
        }
        asort($pTextInTopics); // pick topic with lowest value
        if ($chosenTopic = each($pTextInTopics)) {
            echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
        }
    }
    ?>
    

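    Training is done manually and is not included in this code. If the text "You can make money if you sell real estate" is assigned to the category/topic "Economics", then all of its words (you, can, make, ...) are inserted into the table bayes with "Economics" as the topic and a default count of 1. If a word is already stored in combination with the same topic, its count is incremented instead.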

    Sample learning data:

    word        topic        count
    kaczynski   Politics     1
    sony        Technology   1
    bank        Economics    1
    phone       Technology   1
    ericsson    Technology   2
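
    The training step described above is not shown in the question. A minimal sketch of it might look like the following, assuming an in-memory array in place of the bayes table. simpleTokenizer() and train() are hypothetical names, and the real tokenizer() may well split text differently.

```php
<?php
// Hypothetical stand-in for the question's tokenizer(), which is not shown:
// lowercases the text and splits on anything that is not a letter or digit.
function simpleTokenizer(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// In-memory stand-in for the bayes table: $counts[topic][word] = count.
function train(array &$counts, string $text, string $topic): void
{
    foreach (simpleTokenizer($text) as $word) {
        // A new word starts at 1; a word already stored for this topic is incremented.
        $counts[$topic][$word] = ($counts[$topic][$word] ?? 0) + 1;
    }
}

$counts = [];
train($counts, 'You can make money if you sell real estate', 'Economics');
print_r($counts['Economics']);
```

    Because "you" occurs twice in the sentence, its count ends up at 2 while every other word gets 1.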

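    Sample output/result:

    Politics

    ....phone ....test ....sony ....ericsson ....aspen ....sensitive ....winberry

    Technology

    ....sony FOUND ....ericsson FOUND ....sensitive ....winberry

    ....phone ....test ....sony FOUND ....aspen ....winberry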

    1 answer  |  14 years ago
  •  7
  •   Michael Clerx    14 years ago

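    It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word | topic) on the fly for every word, while you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here. If you're only doing a single document, I suppose this is okay, since you don't calculate it for words that are not in the document.)

    Similarly, the calculation of p(topic) can be moved outside of the loop.

    Finally, you don't need to sort the entire array to find the maximum.

    All small points! But that's what you asked for :)

    I've written some untested PHP code below that shows how I'd implement this: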

    <?php
    
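    // Get word counts from database, shaped as $nWordPerTopic[topic][word] = count
    // (mystery_sql() is a placeholder for your own query code)
    $nWordPerTopic = mystery_sql();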
    
    // Calculate p(word|topic) = nWord / sum(nWord for every word)
    $nTopics = array();
    $pWordPerTopic = array();
    foreach($nWordPerTopic as $topic => $wordCounts)
    {
        // Get total word count in topic
        $nTopic = array_sum($wordCounts);
    
        // Calculate p(word|topic)
        $pWordPerTopic[$topic] = array();
        foreach($wordCounts as $word => $count)
            $pWordPerTopic[$topic][$word] = $count / $nTopic;
    
        // Save $nTopic for next step
        $nTopics[$topic] = $nTopic;
    }
    
    // Calculate p(topic)
    $nTotal = array_sum($nTopics);
    $pTopics = array();
    foreach($nTopics as $topic => $nTopic)
        $pTopics[$topic] = $nTopic / $nTotal;
    
    // Classify ($documents is assumed to hold the news rows still to be classified)
    foreach($documents as $document)
    {
        $title = $document['title'];
        $tokens = tokenizer($title);
        $pMax = -1;
        $selectedTopic = null;
        foreach($pTopics as $topic => $pTopic)
        {
            $p = $pTopic;
            foreach($tokens as $word)
            {
                if (!array_key_exists($word, $pWordPerTopic[$topic]))
                    continue;
                $p *= $pWordPerTopic[$topic][$word];
            }
    
            if ($p > $pMax)
            {
                $selectedTopic = $topic;
                $pMax = $p;
            }
        }
        // $selectedTopic now holds the highest-scoring topic for $document
    }
    ?>
    

    As for the maths...

    You want to maximize p(topic|words), so find

    arg max p(topic|words)
    

    (i.e. the argument topic for which p(topic|words) is highest)

                      p(topic)*p(words|topic)
    p(topic|words) = -------------------------
                            p(words)
    

    So you're looking for

             p(topic)*p(words|topic)
    arg max -------------------------
                   p(words)
    

    Since p(words) is the same for every topic, this is equivalent to finding

    arg max p(topic)*p(words|topic)
    

    The naive Bayes assumption (which is what makes this a naive Bayes classifier) is that

    p(words|topic) = p(word1|topic) * p(word2|topic) * ...
    

    Substituting this in, you are looking for

    arg max p(topic) * p(word1|topic) * p(word2|topic) * ...
    

    where

    p(topic) = number of words in topic / number of words in total
    

    and

                       p(word, topic)                         1
    p(word | topic) = ---------------- = p(word, topic) * ----------
                          p(topic)                         p(topic)
    
          number of times word occurs in topic     number of words in total
       = -------------------------------------- * --------------------------
                number of words in total           number of words in topic
    
          number of times word occurs in topic 
       = --------------------------------------
                number of words in topic
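
    To make these formulas concrete, here is a small sketch that evaluates them on the sample learning data from the question. topicPriors() and wordLikelihood() are hypothetical helper names, not part of the code above, and the two-word title "sony ericsson" is an assumed example.

```php
<?php
// Counts per topic, taken from the sample learning data in the question.
$counts = [
    'Politics'   => ['kaczynski' => 1],
    'Technology' => ['sony' => 1, 'phone' => 1, 'ericsson' => 2],
    'Economics'  => ['bank' => 1],
];

// p(topic) = number of words in topic / number of words in total
function topicPriors(array $counts): array
{
    $perTopic = array_map('array_sum', $counts); // keeps the topic names as keys
    $total = array_sum($perTopic);
    $priors = [];
    foreach ($perTopic as $topic => $n) {
        $priors[$topic] = $n / $total;
    }
    return $priors;
}

// p(word|topic) = times word occurs in topic / number of words in topic
function wordLikelihood(array $counts, string $word, string $topic): float
{
    return ($counts[$topic][$word] ?? 0) / array_sum($counts[$topic]);
}

$priors = topicPriors($counts);
$pSony = wordLikelihood($counts, 'sony', 'Technology');
$pEricsson = wordLikelihood($counts, 'ericsson', 'Technology');

// p(topic) * p(word1|topic) * p(word2|topic) for the topic Technology
$score = $priors['Technology'] * $pSony * $pEricsson;

printf("p(Technology)          = %.4f\n", $priors['Technology']); // 0.6667
printf("p(ericsson|Technology) = %.4f\n", $pEricsson);            // 0.5000
printf("score(Technology)      = %.4f\n", $score);                // 0.0833
```

    Here Technology holds 4 of the 6 counted words, so p(Technology) = 4/6, and the score for the title is 4/6 * 1/4 * 2/4 = 1/12.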