Generating n-grams from strings with pandas

nltk dataframe pandas python

Sandy · 7 years ago

I have a DataFrame df that looks like this:

Pattern    String                                       
101        hi, how are you?
104        what are you doing?
108        Python is good to learn.

I want to create n-grams for the String column. So far I have split each string into words using split() and stack():

new= df.String.str.split(expand=True).stack()

However, I want to create n-grams (bigrams, trigrams, 4-grams, and so on).

2 answers | last active 5 years ago

cs95 abhishek58g 5 years ago

Do a little preprocessing on the text column, then a little shifting and concatenation:

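import pandas as pd

# generate unigrams: lowercase, strip everything except letters and whitespace,
# then split into one word per row
unigrams  = (
    df['String'].str.lower()
                .str.replace(r'[^a-z\s]', '', regex=True)  # regex=True is required on pandas >= 2.0
                .str.split(expand=True)
                .stack())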

# generate bigrams by concatenating unigram columns
bigrams = unigrams + ' ' + unigrams.shift(-1)
# generate trigrams by concatenating unigram and bigram columns
trigrams = bigrams + ' ' + unigrams.shift(-2)

# concatenate all series vertically, and remove NaNs
pd.concat([unigrams, bigrams, trigrams]).dropna().reset_index(drop=True)

0                   hi
1                  how
2                  are
3                  you
4                 what
5                  are
6                  you
7                doing
8               python
9                   is
10                good
11                  to
12               learn
13              hi how
14             how are
15             are you
16            you what
17            what are
18             are you
19           you doing
20        doing python
21           python is
22             is good
23             good to
24            to learn
25          hi how are
26         how are you
27        are you what
28        you what are
29        what are you
30       are you doing
31    you doing python
32     doing python is
33      python is good
34          is good to
35       good to learn
dtype: object
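
Note that the shift above runs over the whole stacked Series, so some n-grams span two rows (for example "you what" and "doing python"). If that is not wanted, one possible variation is to shift within each original row by grouping on the first index level. This is only a sketch and not part of the original answer; the sample frame, the extra dropna() after stack(), and the name per_row_ngrams are assumptions made for illustration.

```python
import pandas as pd

# sample data reconstructed from the question (assumption)
df = pd.DataFrame({
    "Pattern": [101, 104, 108],
    "String": ["hi, how are you?", "what are you doing?", "Python is good to learn."],
})

unigrams = (
    df["String"].str.lower()
                .str.replace(r"[^a-z\s]", "", regex=True)
                .str.split(expand=True)
                .stack()
                .dropna())  # guard in case stack() keeps the padding values

# shift within each original row (index level 0) so n-grams never cross rows
by_row = unigrams.groupby(level=0)
bigrams = unigrams + " " + by_row.shift(-1)
trigrams = bigrams + " " + by_row.shift(-2)

# incomplete n-grams at the end of each row are NaN and get dropped here
per_row_ngrams = (
    pd.concat([unigrams, bigrams, trigrams])
      .dropna()
      .reset_index(drop=True))
```

With this grouping, the result holds only n-grams taken from a single input string, so the count is lower than in the output above.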

alvas 7 years ago

NLTK's everygrams() function returns the n-grams for every order n in a given range; for example, the following returns 1-grams through 3-grams:

>>> from nltk import everygrams
>>> everygrams('a b c d'.split(), 1, 3)
<generator object everygrams at 0x1147e3410>
>>> list(everygrams('a b c d'.split(), 1, 3))
[('a',), ('b',), ('c',), ('d',), ('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'b', 'c'), ('b', 'c', 'd')]
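
For readers without NLTK installed, the same idea can be sketched in plain Python by zipping shifted slices of the token list. The helper names ngrams_of and all_ngrams are hypothetical and are not NLTK functions; the result here is grouped by n, which may not match the ordering produced by every NLTK release.

```python
def ngrams_of(tokens, n):
    """Return the n-grams of a token list as tuples (hypothetical helper)."""
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))


def all_ngrams(tokens, min_len, max_len):
    """Collect n-grams for every n from min_len to max_len, grouped by n."""
    result = []
    for n in range(min_len, max_len + 1):
        result.extend(ngrams_of(tokens, n))
    return result


grams = all_ngrams("a b c d".split(), 1, 3)
```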

Using apply:

>>> import pandas as pd
>>> from itertools import chain
>>> from nltk import everygrams, word_tokenize
>>> df = pd.read_csv('x.tsv', sep='\t')
>>> df
   Pattern                    String
0      101          hi, how are you?
1      104       what are you doing?
2      108  Python is good to learn.

>>> df['String'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 3)])
0    [hi, ,, how, are, you, ?, hi ,, , how, how are...
1    [what, are, you, doing, ?, what are, are you, ...
2    [Python, is, good, to, learn, ., Python is, is...
Name: String, dtype: object

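>>> # store the n-grams in a new column, then flatten the per-row lists
>>> df['1to3grams'] = df['String'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 3)])
>>> list(chain(*list(df['1to3grams'])))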
['hi', ',', 'how', 'are', 'you', '?', 'hi ,', ', how', 'how are', 'are you', 'you ?', 'hi , how', ', how are', 'how are you', 'are you ?', 'what', 'are', 'you', 'doing', '?', 'what are', 'are you', 'you doing', 'doing ?', 'what are you', 'are you doing', 'you doing ?', 'Python', 'is', 'good', 'to', 'learn', '.', 'Python is', 'is good', 'good to', 'to learn', 'learn .', 'Python is good', 'is good to', 'good to learn', 'to learn .']
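
If the n-grams need to stay associated with their Pattern value instead of being flattened into one list, one option is DataFrame.explode, which turns each list element into its own row. The sketch below is an illustration only: simple_ngrams is a hypothetical helper that tokenizes on whitespace (so punctuation stays attached to words, unlike word_tokenize), and the sample frame is reconstructed from the question.

```python
import pandas as pd

# sample data reconstructed from the question (assumption)
df = pd.DataFrame({
    "Pattern": [101, 104, 108],
    "String": ["hi, how are you?", "what are you doing?", "Python is good to learn."],
})


def simple_ngrams(text, min_len=1, max_len=3):
    """Whitespace-tokenize text and return its n-grams as strings (hypothetical helper)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


# build one list of n-grams per row, then emit one output row per n-gram
long_df = (
    df.assign(ngram=df["String"].apply(simple_ngrams))
      .explode("ngram")[["Pattern", "ngram"]]
      .reset_index(drop=True)
)
```

The resulting long_df has one row per (Pattern, ngram) pair, which is convenient for grouping or counting n-grams per pattern.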
