
Generating n-grams from strings with pandas

  •  6
  • Sandy  · Tech Community  · 6 years ago

    I have a DataFrame df like this:

    Pattern    String                                       
    101        hi, how are you?
    104        what are you doing?
    108        Python is good to learn.
    

    I want to create n-grams for the String column. So far I have used split() and stack():

    new = df.String.str.split(expand=True).stack()
    

    However, that only gives me single words. I want to create n-grams (bigrams, trigrams, quadgrams, etc.).

    2 answers  |  up to 5 years ago
        1
  •  5
  •   cs95 abhishek58g    5 years ago

    Do a little preprocessing on the text column, then a little shifting and concatenation:

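    import pandas as pd
    
    # generate unigrams: lowercase, strip everything except letters and whitespace,
    # then split into one token per row
    unigrams = (
        df['String'].str.lower()
                    .str.replace(r'[^a-z\s]', '', regex=True)  # regex=True is required on pandas >= 2.0
                    .str.split(expand=True)
                    .stack())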
    
    # generate bigrams by concatenating unigram columns
    bigrams = unigrams + ' ' + unigrams.shift(-1)
    # generate trigrams by concatenating unigram and bigram columns
    trigrams = bigrams + ' ' + unigrams.shift(-2)
    
    # concatenate all series vertically, and remove NaNs
    pd.concat([unigrams, bigrams, trigrams]).dropna().reset_index(drop=True)
    

    0                   hi
    1                  how
    2                  are
    3                  you
    4                 what
    5                  are
    6                  you
    7                doing
    8               python
    9                   is
    10                good
    11                  to
    12               learn
    13              hi how
    14             how are
    15             are you
    16            you what
    17            what are
    18             are you
    19           you doing
    20        doing python
    21           python is
    22             is good
    23             good to
    24            to learn
    25          hi how are
    26         how are you
    27        are you what
    28        you what are
    29        what are you
    30       are you doing
    31    you doing python
    32     doing python is
    33      python is good
    34          is good to
    35       good to learn
    dtype: object
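
    Note that in the output above, shift() runs over the whole stacked Series, so some n-grams span two rows ("you what", "doing python"). If the n-grams should stay inside each original string, one option is to shift within each row's group instead. The following is only a sketch under that assumption: it rebuilds the example df inline, and row_ngrams is a hypothetical helper name, not part of pandas. It also generalizes to any n, such as the quadgrams the question asks about.

```python
import pandas as pd

# the example frame from the question, rebuilt inline so the sketch is self-contained
df = pd.DataFrame({
    'Pattern': [101, 104, 108],
    'String': ['hi, how are you?', 'what are you doing?', 'Python is good to learn.'],
})

# same preprocessing as in the answer; dropna() discards the padding cells
# that split(expand=True) creates for shorter rows
unigrams = (
    df['String'].str.lower()
                .str.replace(r'[^a-z\s]', '', regex=True)
                .str.split(expand=True)
                .stack()
                .dropna())

def row_ngrams(tokens, n):
    """Hypothetical helper: n-grams that never cross a row boundary.

    `tokens` is the stacked Series; index level 0 identifies the original row.
    """
    by_row = tokens.groupby(level=0)
    out = tokens
    for k in range(1, n):
        # shifting inside each group leaves NaN at the end of every row,
        # so incomplete n-grams become NaN and are dropped below
        out = out + ' ' + by_row.shift(-k)
    return out.dropna()

bigrams = row_ngrams(unigrams, 2)
quadgrams = row_ngrams(unigrams, 4)
print(bigrams.tolist())
print(quadgrams.tolist())
```

    With this variant the bigrams for the first row stop at "are you", and the next bigram starts fresh with "what are".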
    
        2
  •  3
  •   alvas    6 years ago

    The everygrams() function returns n-grams for a consecutive range of n. For example, the following returns all 1-grams through 3-grams:

    >>> from nltk import everygrams
    >>> everygrams('a b c d'.split(), 1, 3)
    <generator object everygrams at 0x1147e3410>
    >>> list(everygrams('a b c d'.split(), 1, 3))
    [('a',), ('b',), ('c',), ('d',), ('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'b', 'c'), ('b', 'c', 'd')]
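
    To make clear what is being computed, here is a rough pure-Python sketch of the same idea without NLTK. The names ngrams and all_ngrams are hypothetical stand-ins for illustration, not NLTK's implementation, and the order in which NLTK itself yields the tuples may differ between versions.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples (sliding window of width n)."""
    # zip over n staggered copies of the list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

def all_ngrams(tokens, min_len, max_len):
    """Hypothetical stand-in for everygrams(): n-grams for every n in
    [min_len, max_len], grouped by length, shortest first."""
    result = []
    for n in range(min_len, max_len + 1):
        result.extend(ngrams(tokens, n))
    return result

grams = all_ngrams('a b c d'.split(), 1, 3)
print(grams)
```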
    

    Using apply:

    >>> import pandas as pd
    >>> from itertools import chain
    >>> from nltk import everygrams, word_tokenize
    >>> df = pd.read_csv('x.tsv', sep='\t')
    >>> df
       Pattern                    String
    0      101          hi, how are you?
    1      104       what are you doing?
    2      108  Python is good to learn.
    
    >>> # store the per-row n-gram lists in a new column so they can be flattened below
    >>> df['1to3grams'] = df['String'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 3)])
    >>> df['1to3grams']
    0    [hi, ,, how, are, you, ?, hi ,, , how, how are...
    1    [what, are, you, doing, ?, what are, are you, ...
    2    [Python, is, good, to, learn, ., Python is, is...
    Name: 1to3grams, dtype: object
    
    >>> list(chain(*list(df['1to3grams'])))
    ['hi', ',', 'how', 'are', 'you', '?', 'hi ,', ', how', 'how are', 'are you', 'you ?', 'hi , how', ', how are', 'how are you', 'are you ?', 'what', 'are', 'you', 'doing', '?', 'what are', 'are you', 'you doing', 'doing ?', 'what are you', 'are you doing', 'you doing ?', 'Python', 'is', 'good', 'to', 'learn', '.', 'Python is', 'is good', 'good to', 'to learn', 'learn .', 'Python is good', 'is good to', 'good to learn', 'to learn .']
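
    If each n-gram should keep a link to its Pattern value rather than end up in one flat list, DataFrame.explode can turn the per-row lists into one row per n-gram. This is a sketch, not part of the answer above: word_ngrams is a hypothetical helper, and it assumes plain whitespace tokenization instead of word_tokenize, so punctuation stays attached to the words.

```python
import pandas as pd

# the example frame from the question, rebuilt inline so the sketch is self-contained
df = pd.DataFrame({
    'Pattern': [101, 104, 108],
    'String': ['hi, how are you?', 'what are you doing?', 'Python is good to learn.'],
})

def word_ngrams(text, max_n):
    """Hypothetical helper: all 1- to max_n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

long_df = (
    df.assign(ngram=df['String'].apply(lambda s: word_ngrams(s, 3)))
      .explode('ngram')              # one row per n-gram, Pattern is repeated
      .reset_index(drop=True)[['Pattern', 'ngram']])
print(long_df.head())
```

    From here, something like long_df.groupby('Pattern') or long_df['ngram'].value_counts() can be used for per-pattern or overall n-gram counts.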