
Error when stemming Arabic text from a file using ISRIStemmer

arabic nltk python

Waheeb Al-Abyadh · 8 years ago

I am trying to stem the contents of an Arabic text file (text.txt) using nltk.stem.isri.


I used the following code, based on an earlier question, "Python Stemming words in a File":

# -*- coding: UTF-8 -*-

from nltk.stem.isri import ISRIStemmer
def stemming_text_1():
    with open('test.txt', 'r') as f:
        for line in f:
            print line
            singles = []

            stemmer = ISRIStemmer()
            for plural in line.split():
                singles.append(stemmer.stem(plural))
            print ' '.join(singles)

stemming_text_1()

/home/waheeb/anaconda2/lib/python2.7/site-packages/nltk/stem/isri.py:154: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if token in self.stop_words:
Traceback (most recent call last):
  File "Arabic_stem.py", line 15, in <module>
    stemming_text_1()
  File "Arabic_stem.py", line 12, in stemming_text_1
    singles.append(stemmer.stem(plural))
  File "/home/waheeb/anaconda2/lib/python2.7/site-packages/nltk/stem/isri.py", line 156, in stem
    token = self.pre32(token)     # remove length three and length two prefixes in this order
  File "/home/waheeb/anaconda2/lib/python2.7/site-packages/nltk/stem/isri.py", line 198, in pre32
    if word.startswith(pre3):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)

1 answer | last active 7 years ago

mhawke · 8 years ago

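Try decoding the lines from the file to unicode before passing them to the stemmer. I am assuming that your input file is encoded as UTF-8 (which looks likely, judging by the error), but you can change the encoding as required: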

for line in f:
    line = line.decode('utf8')    # use the correct encoding here
    ...
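To see why the traceback mentions byte 0xd8, here is a small sketch. It is not taken from the question's code: it assumes the file is UTF-8 and uses a single Arabic letter, alef (U+0627), as stand-in input. Reading the file in byte mode yields UTF-8 bytes, and the stemmer's comparison against its unicode prefixes triggers an implicit ASCII decode, which fails on the first byte.

```python
from __future__ import print_function

# The Arabic letter alef (U+0627), written as an escape so the example
# does not depend on the encoding of this source file.
alef = u'\u0627'

# What a byte-mode read of a UTF-8 file would hand to the stemmer.
raw = alef.encode('utf8')
print(repr(raw))  # two bytes; the first one is 0xd8

# Decoding those bytes as ASCII fails on the first byte, which is the
# same failure reported in the traceback.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print(exc)

# Decoding with the file's actual encoding recovers the original text.
decoded = raw.decode('utf8')
print(decoded == alef)
```

Many Arabic letters encode to a UTF-8 sequence that starts with 0xd8 or 0xd9, which is why that byte is a fair hint that the input is UTF-8.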

Alternatively, you can use io.open() and specify the encoding; Python will then decode the incoming stream to unicode for you:

import io

with io.open('test.txt', encoding='utf8') as f:
    ...
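Putting the io.open() approach together, a minimal sketch might look like the following. The function name stem_lines and its parameters are hypothetical, chosen only for illustration. The stemmer is passed in as a callable so that the file-handling part can be tried without NLTK, and the file is assumed to be UTF-8 unless another encoding is given.

```python
import io


def stem_lines(path, stem, encoding='utf8'):
    """Return one stemmed string per line of the text file at `path`.

    `stem` is any callable that maps a unicode word to its stem, for
    example ISRIStemmer().stem from nltk.stem.isri. The name stem_lines
    is hypothetical and not part of NLTK.
    """
    stemmed = []
    # io.open decodes while reading, so each line is already unicode text
    # by the time it reaches the stemmer.
    with io.open(path, encoding=encoding) as f:
        for line in f:
            words = line.split()
            stemmed.append(u' '.join(stem(word) for word in words))
    return stemmed
```

With NLTK installed, calling stem_lines('test.txt', ISRIStemmer().stem) should give one stemmed line per input line. Because the words arrive as unicode, the stemmer's startswith() checks no longer trigger an implicit ASCII decode.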
