
Searching for a UTF-8 word in a list

utf-8 search python

user3286053 · 7 years ago

I have a lookup list generated from a UTF-8 file:

import itertools

with open('stop_word_Tiba.txt') as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten the 2d list into a 1d list

When I open the file, I can see that the word "الو" is there, so it is in the list. However, the list now looks like ["\xd8\xa7\xd9\x84\xd9\x88", "\xd8\xa3\xd9\x84\xd9\x88", "\xd8\xa7\xd9\x88\xd9\x88\xd9\x8a", "\xd8\xa7\xd9\x84", "\xd8\xa7\xd9\x87", "\xd8\xa3\xd9\x87", "\xd9\x84\xd9\x88", "\xd8\xa3\xd9\x88", "\xd9\x83\xd9\x8a", "\xd9\x88"]

Then I want to check whether a specific word is in newStopWords1d. The word "الو" is "\xd8\xa7\xd9\x84\xd9\x88".

word='الو'
for w in newStopWords1d:
    if word == w.encode("utf-8"):
        print 'found'

The word is not found. I also tried:

    if word in newStopWords1d:
        print 'found'

but the word is still not found. It looks like an encoding problem, but I cannot solve it. Can you help me?

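2 answers | as of 7 years ago

radzak · 7 years ago

It is worth mentioning that you are using Python 2.7. There, the entries read from the file are byte strings, so decode each one before comparing it with a unicode word: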

word = u'الو'  # unicode literal, so both sides of the comparison are unicode
for w in newStopWords1d:
    if word == w.decode("utf-8"):
        print 'found'
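To illustrate why the decode step matters, here is a minimal sketch that is not part of the original answer. It takes the first byte string from the list in the question and shows that it only equals the search word once both sides have the same type. The variable names are assumptions chosen for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the variable names are assumptions, not from the thread.

# First entry of the list shown in the question, as raw UTF-8 bytes.
raw = b"\xd8\xa7\xd9\x84\xd9\x88"

# Decoding turns the six bytes into a three-character text string.
text = raw.decode("utf-8")

# The search word, written with escapes so the file encoding does not matter.
word = u"\u0627\u0644\u0648"

# Comparing text with text succeeds.
matches_as_text = (word == text)

# Comparing bytes with bytes also succeeds.
matches_as_bytes = (word.encode("utf-8") == raw)
```

Mixing the two types (a byte string on one side, a text string on the other) is what makes the comparison in the question fail.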

A better solution is to use the io module:

import io

with io.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

or the codecs module:

import codecs

with codecs.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

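because the built-in open function in Python 2.7 does not support specifying an encoding.

user3286053 · 7 years ago

I solved it by changing the open-file statement to: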
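Putting the io.open suggestion together with the lookup, a minimal sketch might look like the following. The helper name load_stop_words, the temporary demo file, and its contents are assumptions made for illustration; the real file would be stop_word_Tiba.txt. The words are stored in a set, which is one possible choice for membership tests, not something the thread prescribes.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def load_stop_words(path):
    """Read a UTF-8 file and return its whitespace-separated words as a set of text strings."""
    with io.open(path, encoding="utf-8") as f:
        return set(word for line in f for word in line.split())


# Hypothetical demo data standing in for stop_word_Tiba.txt.
sample = u"\u0627\u0644\u0648 \u0623\u0644\u0648\n\u0644\u0648 \u0648\n"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(sample)

stop_words = load_stop_words(demo_path)

# The search word is a text string, so it compares equal to the decoded entries.
found = u"\u0627\u0644\u0648" in stop_words

os.remove(demo_path)
```

Because io.open decodes each line while reading, no per-word decode call is needed at comparison time.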

with codecs.open("stop_word_Tiba.txt", "r", "utf-8") as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten into a 1d list
for w in newStopWords1d:
    if word.encode("utf-8") == w.encode("utf-8"):
        return 'found'  # the return implies this fragment sits inside a function

Thank you.
