
Removing stop words and string.punctuation

punctuation nltk python

1

Giacomo Ciampoli · Tech Community · 7 years ago

import nltk
from nltk.corpus import stopwords
import string

with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    stop = set(stopwords.words('english'))
    moby_tokens = nltk.word_tokenize(moby_raw)
    text_no_stop_words_punct = [t for t in moby_tokens if t not in stop or t not in string.punctuation]

    print(text_no_stop_words_punct)

From the output, I get:

[...';', 'surging', 'from', 'side', 'to', 'side', ';', 'spasmodically', 'dilating', 'and', 'contracting',...]

3 replies | last activity 7 years ago

1

9

DYZ 7 years ago

It must be "and", not "or":

if t not in stop and t not in string.punctuation

Or, equivalently:

if not (t in stop or t in string.punctuation):

Or build one combined set first:

all_stops = stop | set(string.punctuation)
if t not in all_stops:
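
The three forms above can be checked against each other with a minimal sketch. It assumes a small hand-written token list and stop set (both made up for illustration, standing in for moby_tokens and the NLTK English stop word list) so that it runs without moby.txt or any NLTK download.

```python
import string

# Hypothetical sample data standing in for moby_tokens and
# stopwords.words('english'); not taken from the real corpus.
tokens = [';', 'surging', 'from', 'side', 'to', 'side', ';',
          'spasmodically', 'dilating', 'and', 'contracting']
stop = {'from', 'to', 'and'}

# Form 1: keep a token only if it is neither a stop word nor punctuation.
with_and = [t for t in tokens
            if t not in stop and t not in string.punctuation]

# Form 2: the same condition rewritten with De Morgan's law.
with_not_or = [t for t in tokens
               if not (t in stop or t in string.punctuation)]

# Form 3: merge both collections into one set, then do a single lookup.
all_stops = stop | set(string.punctuation)
with_union = [t for t in tokens if t not in all_stops]

print(with_and)
# ['surging', 'side', 'side', 'spasmodically', 'dilating', 'contracting']
```

Note that t in string.punctuation is a substring test on a str, while set(string.punctuation) tests single characters. The two agree for single-character tokens such as ';'.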

2

4

vielkind 7 years ago

text_no_stop_words = [t for t in moby_tokens if t not in stop and t not in string.punctuation]

3

1

Caleb Gates 7 years ago

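You need to use "and", not "or", in your comparison. With "or", if a token such as ';' is not in stop, the condition is already true, so Python short-circuits and never checks whether the token is in string.punctuation.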

text_no_stop_words_punct = [t for t in moby_tokens if t not in stop and t not in string.punctuation]
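
To see why the "or" version keeps every token, here is a small sketch that evaluates both conditions on three tokens. The stop set and the helper names keep_with_or and keep_with_and are made up for illustration. A token would have to be both a stop word and punctuation at the same time for the "or" condition to drop it.

```python
import string

# Hypothetical tiny stop set, standing in for stopwords.words('english').
stop = {'from', 'to', 'and'}

def keep_with_or(t):
    # True unless t is BOTH a stop word and punctuation.
    return t not in stop or t not in string.punctuation

def keep_with_and(t):
    # True only if t is NEITHER a stop word NOR punctuation.
    return t not in stop and t not in string.punctuation

for t in [';', 'from', 'surging']:
    print(repr(t), keep_with_or(t), keep_with_and(t))
# ';' True False
# 'from' True False
# 'surging' True True
```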