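The everygrams() function returns the n-grams of every consecutive order in a given range. For example, the following returns all 1-grams through 3-grams: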
>>> from nltk import everygrams
>>> everygrams('a b c d'.split(), 1, 3)
<generator object everygrams at 0x1147e3410>
>>> list(everygrams('a b c d'.split(), 1, 3))
[('a',), ('b',), ('c',), ('d',), ('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'b', 'c'), ('b', 'c', 'd')]
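To make the behaviour above concrete, here is a minimal pure-Python sketch of the same idea. It is not NLTK's implementation: the name everygrams_sketch is hypothetical, padding options are left out, and it yields shorter n-grams first to match the listing above (the order that everygrams() itself yields may differ between NLTK versions).

```python
def everygrams_sketch(tokens, min_len=1, max_len=None):
    """Yield every n-gram with min_len <= n <= max_len as a tuple.

    Hypothetical helper for illustration only; NLTK's everygrams()
    supports more options (e.g. padding) that are not modelled here.
    """
    tokens = list(tokens)
    if max_len is None:
        # Assumption: with no upper bound, go up to the full sequence length.
        max_len = len(tokens)
    for n in range(min_len, max_len + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


result = list(everygrams_sketch('a b c d'.split(), 1, 3))
```

Like the NLTK function, the sketch is a generator, so it has to be wrapped in list() to see the n-grams.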
To run it on a DataFrame column, use apply:
>>> import pandas as pd
>>> from itertools import chain
>>> from nltk import everygrams, word_tokenize
>>> df = pd.read_csv('x.tsv', sep='\t')
>>> df
   Pattern                    String
0      101          hi, how are you?
1      104       what are you doing?
2      108  Python is good to learn.
>>> df['1to3grams'] = df['String'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 3)])
>>> df['1to3grams']
0    [hi, ,, how, are, you, ?, hi ,, , how, how are...
1    [what, are, you, doing, ?, what are, are you, ...
2    [Python, is, good, to, learn, ., Python is, is...
Name: 1to3grams, dtype: object
>>> list(chain.from_iterable(df['1to3grams']))
['hi', ',', 'how', 'are', 'you', '?', 'hi ,', ', how', 'how are', 'are you', 'you ?', 'hi , how', ', how are', 'how are you', 'are you ?', 'what', 'are', 'you', 'doing', '?', 'what are', 'are you', 'you doing', 'doing ?', 'what are you', 'are you doing', 'you doing ?', 'Python', 'is', 'good', 'to', 'learn', '.', 'Python is', 'is good', 'good to', 'to learn', 'learn .', 'Python is good', 'is good to', 'good to learn', 'to learn .']
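The per-row step and the flattening step can also be tried without x.tsv, pandas, or the NLTK tokenizer data. The sketch below is only an approximation under stated assumptions: simple_tokenize is a hypothetical regex stand-in for word_tokenize that happens to split these three sentences the same way, joined_ngrams is a hypothetical helper rather than an NLTK function, and a plain list stands in for the String column.

```python
import re
from itertools import chain


def simple_tokenize(text):
    # Assumption: words and single punctuation marks as separate tokens.
    # This is a rough stand-in for nltk.word_tokenize, not a replacement.
    return re.findall(r"\w+|[^\w\s]", text)


def joined_ngrams(tokens, min_len=1, max_len=3):
    # Hypothetical helper: space-joined n-grams, shortest orders first.
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Stand-in for the String column of the DataFrame above.
strings = [
    "hi, how are you?",
    "what are you doing?",
    "Python is good to learn.",
]

# Per-row n-gram lists (the role played by apply above).
per_row = [joined_ngrams(simple_tokenize(s)) for s in strings]

# One flat list across all rows (the role played by chain above).
flat = list(chain.from_iterable(per_row))
```

chain.from_iterable consumes the row lists lazily, so it avoids unpacking every row into function arguments the way chain(*...) does.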