
Custom data structure for counting words paragraph-by-paragraph in Python 3.7

  •   artemis Roberto  ·  asked 5 years ago

    I have the following requirements:

    • For a given word or token, determine how many paragraphs it appears in (known as the document frequency)
    • Create a data structure (dict, pandas DataFrame, etc.) that contains each word, its collection (overall) frequency, and its document frequency

    A sample dataset looks like this:

    <P ID=1>
    I have always wanted to try like, multiple? Different rasteraunts. Not quite sure which kind, maybe burgers!
    </P>
    
    <P ID=2>
    Nice! I love burgers. Cheeseburgers, too. Have you ever gone to a diner type restauraunt? I have always wanted to try every diner in the country.
    </P>
    
    <P ID=3>
    I am not related to the rest of these paragraphs at all.
    </P>
    

    A "paragraph" is whatever sits between the <P ID=x> and </P> tags.

    What I need is to create a data structure something like this (I think it would be a dict), where X is the word's collection frequency and Y is its document frequency:

    {'i': X Y, 'have': X Y, etc}
    

    Or, alternatively, a pandas DataFrame that looks like this:

    | Word | Collection Frequency | Document Frequency |
    |   i  |         4         |          3         |
    | have |         3         |          2         |
    | etc  |         etc       |          etc       |
    

    Currently, I can find the collection frequency without a problem using the code below.

    import re
    from nltk.tokenize import RegexpTokenizer
    
    # Requisite
    def get_input(filepath):
        with open(filepath, 'r') as f:
            return f.read()
    
    # 1
    def normalize_text(file):
        # strip the paragraph tags, then tokenize on word characters
        file = re.sub(r'<P ID=(\d+)>', '', file)
        file = re.sub(r'</P>', '', file)
        tokenizer = RegexpTokenizer(r'\w+')
        all_words = tokenizer.tokenize(file)
        return [word.lower() for word in all_words]
    
    # Requisite for 3
    # Answer for 4
    def get_collection_frequency(a):
        g = {}
        for i in a:
            if i in g:
                g[i] += 1
            else:
                g[i] = 1
        return g
    
    myfile = get_input('example.txt')
    words = normalize_text(myfile)
    
    ## ANSWERS
    collection_frequency = get_collection_frequency(words)
    print("Collection frequency: ", collection_frequency)
    

    This returns:

    Collection frequency:  {'i': 4, 'have': 3, 'always': 2, 'wanted': 2, 
    'to': 4, 'try': 2, 'like': 1, 'multiple': 1, 'different': 1,
    'rasteraunts': 1, 'not': 2, 'quite': 1, 'sure': 1, 'which': 1,
    'kind': 1, 'maybe': 1, 'burgers': 2, 'nice': 1, 'love': 1,
    'cheeseburgers': 1, 'too': 1, 'you': 1, 'ever': 1, 'gone': 1, 'a': 1,
    'diner': 2, 'type': 1, 'restauraunt': 1, 'every': 1, 'in': 1, 'the': 2,
    'country': 1, 'am': 1, 'related': 1, 'rest': 1, 'of': 1, 'these': 1, 
    'paragraphs': 1, 'at': 1, 'all': 1}
    

    However, I am currently stripping the paragraph tags in my normalize_text function with the lines:

    file = re.sub(r'<P ID=(\d+)>', '', file)
    file = re.sub(r'</P>', '', file)
    
    so that P, ID, 1, 2, and 3 do not get counted in my dictionary, since those are just paragraph markers.

    So, how can I tie a word's occurrences back to its appearance within a paragraph, in order to produce the desired results above? I'm not even sure what the logic of building such a data structure should look like.

    3 Answers  |  5 years ago
        1
  •   wwii  ·  5 years ago

    So, how can I tie a word's occurrences back to its appearance within a paragraph, in order to produce the desired results above?

    Separate the process into two parts: finding the paragraphs and finding the words.

    from nltk.tokenize import RegexpTokenizer
    import re, collections
    
    # one tokenizer pulls out whole paragraphs, the other pulls out words;
    # RegexpTokenizer compiles with re.DOTALL by default, so (.*?) spans newlines
    p = r'<P ID=\d+>(.*?)</P>'
    paras = RegexpTokenizer(p)
    words = RegexpTokenizer(r'\w+')
    

    While parsing, keep two dictionaries: one for the collection frequency and one for the document frequency.

    col_freq = collections.Counter()
    doc_freq = collections.Counter()
    

    Iterate over the paragraphs; get the words of each paragraph; feed the words to the col_freq dict and the set of those words to the doc_freq dict:

    for para in paras.tokenize(text):
        tokens = [word.lower() for word in words.tokenize(para)]
        col_freq.update(tokens)
        doc_freq.update(set(tokens))
    

    Combine the two dictionaries:

    d = {word:(col_freq[word], doc_freq[word]) for word in col_freq}
    
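    Run against the sample text from the question, the combined dict lines up with the numbers in your desired table, e.g.:
    
    print(d['i'])        # (4, 3): collection frequency 4, document frequency 3
    print(d['have'])     # (3, 2)
    print(d['burgers'])  # (2, 2)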

    There is some inefficiency in this (the text is parsed twice), but it can be tuned if that becomes a problem.

    RegexpTokenizer really doesn't do anything more than re.findall() would in this case, but it hides some of the details and makes this less verbose, so I used it.
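
    If you'd rather end up with the pandas DataFrame from the question, a minimal sketch (assuming pandas is installed) that builds it from the two Counters:
    
    import pandas as pd
    
    df = pd.DataFrame({
        'Word': list(col_freq),
        'Collection Frequency': [col_freq[w] for w in col_freq],
        'Document Frequency': [doc_freq[w] for w in col_freq],
    })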


    Sometimes re doesn't cope well with malformed markup. Parsing the paragraphs can instead be done with BeautifulSoup.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(text,"html.parser")
    for para in soup.find_all('p'):
        tokens = [word.lower() for word in words.tokenize(para.text)]
        print(tokens)
    ##    col_freq.update(tokens)
    ##    doc_freq.update(set(tokens))
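
    BeautifulSoup lives in the third-party beautifulsoup4 package, so it may need installing first:
    
    pip install beautifulsoup4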
    
        2
  •   Lord Elrond Mureinik  ·  5 years ago

    Try this:

    import re
    from nltk.tokenize import RegexpTokenizer
    
    def normalize_text(file):
        file = re.sub(r'<P ID=(\d+)>', '', file)
        file = re.sub(r'</P>', '', file)
        tokenizer = RegexpTokenizer(r'\w+')
        all_words = tokenizer.tokenize(file)
        lower_case = []
        for word in all_words:
            curr = word.lower()
            lower_case.append(curr)
    
        return lower_case
    
    def find_words(filepath):
        with open(filepath, 'r') as f:
            file = f.read()
        word_list = normalize_text(file)
        # lowercase the text before splitting so the lowercased words can match it
        data = file.lower().replace('</p>', '').split('<p id=')
        result = {}
        for word in word_list:
            result[word] = {}
            for p in data:
                if p:
                    # p starts with the paragraph ID, e.g. '1>...' (single-digit IDs assumed)
                    result[word][f'paragraph_{p[0]}'] = p[2:].count(word)
        print(result)
        return result
    
    find_words('./test.txt')
    

    If you want to group by paragraph instead, and then by word occurrences within each paragraph:

    def find_words(filepath):
        with open(filepath, 'r') as f:
            file = f.read()
        word_list = normalize_text(file)
        data = file.lower().replace('</p>', '').split('<p id=')
        result = {}
        for p in data:
            if p:
                result[f'paragraph_{p[0]}'] = {}
                for word in word_list:
                    result[f'paragraph_{p[0]}'][word] = p[2:].count(word)
    
        print(result)
        return result

    It's still a bit hard to read, though. If pretty-printed output matters to you, you can try the pretty printing package.
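
    A minimal sketch using the standard library's pprint module:
    
    from pprint import pprint
    pprint(result)  # prints the nested dict with one key per line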

    To find the number of paragraphs in which each word appears:

    def find_paragraph_occurrences(filepath):
        with open(filepath, 'r') as f:
            file = f.read()
        word_list = normalize_text(file)
        # lowercase first, then split on the lowercased tag so the split actually matches
        data = file.lower().replace('</p>', '').split('<p id=')
        result = {}
        for word in word_list:
            result[word] = 0
            for p in data:
                if word in p:
                    result[word] += 1
    
        print(result)
        return result
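
    One caveat with these snippets: str.count() and the in operator match substrings rather than whole words, so e.g. 'burgers' is also found inside 'cheeseburgers'. A word-boundary regex avoids that; a minimal sketch (count_word is a hypothetical helper, not part of the code above):
    
    import re
    
    def count_word(text, word):
        # \b anchors restrict the match to whole words,
        # so 'burgers' no longer matches inside 'cheeseburgers'
        return len(re.findall(rf'\b{re.escape(word)}\b', text.lower()))
    
    print(count_word("I love burgers. Cheeseburgers, too.", "burgers"))  # 1, not 2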
    
        3
  •   DarrylG  ·  5 years ago
    import re
    from collections import defaultdict, Counter
    
    def create_dict(text):
        "Dictionary of paragraph strings, keyed by paragraph ID"
        d = defaultdict(lambda: "")
        lines = text.splitlines()
        for line in lines:
            matchObj = re.match(r'<P ID=(\d+)>', line)
            if matchObj:
                dictName = matchObj.group(0)
                continue  # skip line containing paragraph ID
            elif re.match(r'</P>', line):
                continue  # skip line containing paragraph ending token
            d[dictName] += line.lower()
        return d
    
    def document_frequency(d):
        "Frequency of words across the whole document (the question's collection frequency)"
        c = Counter()
        for paragraph in d.values():
            words = re.findall(r'\w+', paragraph)
            c.update(words)
        return c
    
    def paragraph_frequency(d):
        "Number of paragraphs each word appears in (the question's document frequency)"
        c = Counter()
        for sentences in d.values():
            words = re.findall(r'\w+', sentences)
            set_words = set(words)  # a set yields at most one occurrence
                                    # of each word per paragraph
            c.update(set_words)
        return c
    
    text = """<P ID=1>
    I have always wanted to try like, multiple? Different rasteraunts. Not quite sure which kind, maybe burgers!
    </P>
    
    <P ID=2>
    Nice! I love burgers. Cheeseburgers, too. Have you ever gone to a diner type restauraunt? I have always wanted to try every diner in the country.
    </P>
    
    <P ID=3>
    I am not related to the rest of these paragraphs at all.
    </P>"""
    
    d = create_dict(text)
    doc_freq = document_frequency(d)    # collection frequency, in the question's terms
    para_freq = paragraph_frequency(d)  # document frequency, in the question's terms
    print("document:", doc_freq)
    print("paragraph: ", para_freq)
    

    Result

    document: Counter({'i': 4, 'to': 4, 'have': 3, 'always': 2, 'wanted': 2, 'try': 2, 'not': 2, 'burgers': 2, 'diner': 2, 'the': 2, 'like': 1, 'multiple': 1, 'different': 1, 'rasteraunts': 1, 'quite': 1, 'sure': 1, 'which': 1, 'kind': 1, 'maybe': 1, 'nice': 1, 'love': 1, 'cheeseburgers': 1, 'too': 1, 'you': 1, 'ever': 1, 'gone': 1, 'a': 1, 'type': 1, 'restauraunt': 1, 'every': 1, 'in': 1, 'country': 1, 'am': 1, 'related': 1, 'rest': 1, 'of': 1, 'these': 1, 'paragraphs': 1, 'at': 1, 'all': 1})
    paragraph: Counter({'to': 3, 'i': 3, 'try': 2, 'have': 2, 'burgers': 2, 'wanted': 2, 'always': 2, 'not': 2, 'the': 2, 'which': 1, 'multiple': 1, 'quite': 1, 'rasteraunts': 1, 'kind': 1, 'like': 1, 'maybe': 1, 'sure': 1, 'different': 1, 'love': 1, 'too': 1, 'in': 1, 'restauraunt': 1, 'every': 1, 'nice': 1, 'cheeseburgers': 1, 'diner': 1, 'ever': 1, 'a': 1, 'type': 1, 'you': 1, 'country': 1, 'gone': 1, 'at': 1, 'related': 1, 'paragraphs': 1, 'rest': 1, 'of': 1, 'am': 1, 'these': 1, 'all': 1})
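
    In the question's terminology, doc_freq above is the collection frequency and para_freq is the document frequency, so the desired word-to-pair dict can be assembled the same way as in the first answer:
    
    combined = {word: (doc_freq[word], para_freq[word]) for word in doc_freq}
    print(combined['i'])  # (4, 3)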