Remove stop words and string.punctuation

  •  1
  • Giacomo Ciampoli  · Tech Community  · 7 years ago

    import nltk
    from nltk.corpus import stopwords
    import string
    
    with open('moby.txt', 'r') as f:
        moby_raw = f.read()
        stop = set(stopwords.words('english'))
        moby_tokens = nltk.word_tokenize(moby_raw)
        text_no_stop_words_punct = [t for t in moby_tokens if t not in stop or t not in string.punctuation]
    
        print(text_no_stop_words_punct)
    

    From the output, I still have stop words and punctuation:

    [...';', 'surging', 'from', 'side', 'to', 'side', ';', 'spasmodically', 'dilating', 'and', 'contracting',...]
    

    3 answers  |  as of 7 years ago
        1
  •  9
  •   DYZ    7 years ago

    It must be "and", not "or". Any one of the following three conditions works:

    if t not in stop and t not in string.punctuation
    

    if not (t in stop or t in string.punctuation):
    

    all_stops = stop | set(string.punctuation)
    if t not in all_stops:
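
The three conditions above can be compared side by side. This is a minimal sketch, not the asker's actual data: the stop set and the token list are small hypothetical stand-ins so that it runs without NLTK or moby.txt.

```python
import string

# Hypothetical stand-ins for stopwords.words('english') and the Moby Dick tokens.
stop = {"from", "to", "and"}
tokens = [";", "surging", "from", "side", "to", "side", ";", "and", "contracting"]

# Form 1: keep a token only if it is neither a stop word nor punctuation.
form_and = [t for t in tokens if t not in stop and t not in string.punctuation]

# Form 2: the same test rewritten with De Morgan's law.
form_not_or = [t for t in tokens if not (t in stop or t in string.punctuation)]

# Form 3: one lookup against the union of stop words and punctuation characters.
all_stops = stop | set(string.punctuation)
form_union = [t for t in tokens if t not in all_stops]

print(form_and)     # ['surging', 'side', 'side', 'contracting']
print(form_and == form_not_or == form_union)  # True
```

One caveat: string.punctuation is a str, so "t in string.punctuation" is a substring test, while the set in form 3 only matches single characters. For a multi-character token such as "<=>" the first two forms treat it as punctuation and the third does not, so the forms agree on this sample but are not guaranteed to agree on every token.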
    

        2
  •  4
  •   vielkind    7 years ago

    text_no_stop_words = [t for t in moby_tokens if t not in stop and t not in string.punctuation]
    
        3
  •  1
  •   Caleb Gates    7 years ago

    You need to use "and" rather than "or" in your comparison. With "or", if a token is not in stop, Python doesn't check whether it is in string.punctuation.

    text_no_stop_words_punct = [t for t in moby_tokens if t not in stop and t not in string.punctuation]
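
To see why the "or" version filters nothing, it may help to evaluate both conditions on a few tokens. This is a rough sketch under assumptions: the stop set is a hypothetical subset of the English stop list, and keep_with_or / keep_with_and are illustrative helper names that do not appear in the question.

```python
import string

stop = {"from", "to", "and"}  # hypothetical subset of the English stop list

def keep_with_or(t):
    # True as soon as either test passes; "or" short-circuits on the first True.
    return t not in stop or t not in string.punctuation

def keep_with_and(t):
    # True only if both tests pass.
    return t not in stop and t not in string.punctuation

for t in [";", "from", "surging"]:
    print(repr(t), keep_with_or(t), keep_with_and(t))
# ';' True False
# 'from' True False
# 'surging' True True
```

With "or", a token would only be dropped if it were a stop word and punctuation at the same time, which no token in this sample is, so everything is kept.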