
Custom data structure for counting words per paragraph in Python 3.7

  •  0
  • artemis Roberto  · Tech Community  · 5 years ago


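    • For a given word or token, determine how many paragraphs it appears in (its document frequency)
    • Create a data structure (dict, pandas DataFrame, etc.) that holds each word, its collection (overall) frequency, and its document frequency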


    <P ID=1>
    I have always wanted to try like, multiple? Different rasteraunts. Not quite sure which kind, maybe burgers!
    <P ID=2>
    Nice! I love burgers. Cheeseburgers, too. Have you ever gone to a diner type restauraunt? I have always wanted to try every diner in the country.
    <P ID=3>
    I am not related to the rest of these paragraphs at all.

    A "paragraph" is delimited by <P ID=x> </P> tags.

    What I need is to create a data structure like this (I think it would be a dict):

    {'i': X Y, 'have': X Y, etc}

    Or, possibly, a pandas DataFrame like this:

    | Word | Content Frequency | Document Frequency |
    |   i  |         4         |          3         |
    | have |         3         |          2         |
    | etc  |         etc       |          etc       |
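Both shapes can be built from two word counters. The following is a minimal sketch, assuming the counts have already been computed; the variable names `as_dict` and `rows` are illustrative, and the numbers are simply the two rows of the table above.

```python
from collections import Counter

# Counts copied from the example table; a real run would compute them.
collection_frequency = Counter({"i": 4, "have": 3})
document_frequency = Counter({"i": 3, "have": 2})

# Shape 1: a dict mapping each word to (collection, document) counts.
as_dict = {
    word: (collection_frequency[word], document_frequency[word])
    for word in collection_frequency
}

# Shape 2: a list of row dicts with the column names from the table.
rows = [
    {"Word": word, "Content Frequency": coll, "Document Frequency": doc}
    for word, (coll, doc) in as_dict.items()
]

print(as_dict)  # {'i': (4, 3), 'have': (3, 2)}
print(rows[0])
```

A list of dicts such as `rows` is one of the inputs that `pandas.DataFrame` accepts, so the second shape converts directly if a DataFrame is preferred.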


    import re
    from nltk.tokenize import RegexpTokenizer

    # Requisite
    def get_input(filepath):
        # Read the whole file; the context manager closes it afterwards.
        with open(filepath, 'r') as f:
            return f.read()

    # 1
    def normalize_text(file):
        # Strip the paragraph tags so they are not tokenized as words.
        file = re.sub(r'<P ID=(\d+)>', '', file)
        file = re.sub(r'</P>', '', file)
        tokenizer = RegexpTokenizer(r'\w+')
        all_words = tokenizer.tokenize(file)
        lower_case = []
        for word in all_words:
            curr = word.lower()
            lower_case.append(curr)
        return lower_case

    # Requisite for 3
    # Answer for 4
    def get_collection_frequency(a):
        g = {}
        for i in a:
            if i in g:
                g[i] += 1
            else:
                g[i] = 1
        return g

    myfile = get_input('example.txt')
    words = normalize_text(myfile)

    ## ANSWERS
    collection_frequency = get_collection_frequency(words)
    print("Collection frequency: ", collection_frequency)


    Collection frequency:  {'i': 4, 'have': 3, 'always': 2, 'wanted': 2, 
    'to': 4, 'try': 2, 'like': 1, 'multiple': 1, 'different': 1,
    'rasteraunts': 1, 'not': 2, 'quite': 1, 'sure': 1, 'which': 1,
    'kind': 1, 'maybe': 1, 'burgers': 2, 'nice': 1, 'love': 1,
    'cheeseburgers': 1, 'too': 1, 'you': 1, 'ever': 1, 'gone': 1, 'a': 1,
    'diner': 2, 'type': 1, 'restauraunt': 1, 'every': 1, 'in': 1, 'the': 2,
    'country': 1, 'am': 1, 'related': 1, 'rest': 1, 'of': 1, 'these': 1, 
    'paragraphs': 1, 'at': 1, 'all': 1}

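    However, I am currently removing the paragraph tags in the normalize_text function with these lines:

    file = re.sub(r'<P ID=(\d+)>', '', file)
    file = re.sub(r'</P>', '', file)

    I do this because I don't want P, ID, 1, 2, 3 counted in my dictionary, since those are just paragraph headers.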
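Deleting the tags also throws away the paragraph boundaries, which the document frequency depends on. One way to keep both is to split on the opening tag instead of deleting it. This is a minimal sketch, not the asker's code; the helper name `split_paragraphs` and the short sample string are assumptions for illustration.

```python
import re

def split_paragraphs(text):
    """Map paragraph ID -> paragraph text with the tags removed."""
    # With one capturing group, re.split returns
    # [text_before, id1, body1, id2, body2, ...].
    parts = re.split(r"<P ID=(\d+)>", text)
    ids, bodies = parts[1::2], parts[2::2]
    # Drop any closing tag and surrounding whitespace from each body.
    return {
        int(pid): body.replace("</P>", "").strip()
        for pid, body in zip(ids, bodies)
    }

sample = "<P ID=1>\nI love burgers.\n</P>\n<P ID=2>\nNice!\n"
print(split_paragraphs(sample))  # {1: 'I love burgers.', 2: 'Nice!'}
```

Because the split keys on the opening tag only, it works whether or not each paragraph has a closing `</P>`.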


  •  1
  •   wwii    5 years ago



    from nltk.tokenize import RegexpTokenizer
    import re, collections
    p = r'<P ID=\d+>(.*?)</P>'
    paras = RegexpTokenizer(p)
    words = RegexpTokenizer(r'\w+')


    col_freq = collections.Counter()
    doc_freq = collections.Counter()

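    Iterate over the paragraphs; get the words in each paragraph; feed the words to the col_freq counter, and feed a set of the words to the doc_freq counter.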

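    for para in paras.tokenize(text):
        tokens = [word.lower() for word in words.tokenize(para)]
        col_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))  # each word at most once per paragraph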


    d = {word:(col_freq[word], doc_freq[word]) for word in col_freq}


    RegexpTokenizer doesn't really offer anything over re.findall() in this case, but it hides some details, which makes this less verbose, so I used it.
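For comparison, the same pipeline can be written with `re.findall()` and no NLTK dependency. This is a minimal sketch under the assumption that every paragraph has a closing `</P>`; the short `text` sample and the names `PARA` and `WORD` are made up for illustration.

```python
import re
from collections import Counter

# re.DOTALL lets "." match newlines, so a paragraph may span lines.
PARA = re.compile(r"<P ID=\d+>(.*?)</P>", re.DOTALL)
WORD = re.compile(r"\w+")

text = """<P ID=1>
I love burgers. Burgers!
</P>
<P ID=2>
I am unrelated.
</P>"""

col_freq, doc_freq = Counter(), Counter()
for para in PARA.findall(text):
    tokens = [w.lower() for w in WORD.findall(para)]
    col_freq.update(tokens)       # total occurrences
    doc_freq.update(set(tokens))  # paragraphs containing the word

d = {word: (col_freq[word], doc_freq[word]) for word in col_freq}
print(d["i"], d["burgers"])  # (2, 2) (2, 1)
```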

    Sometimes re doesn't handle malformed markup well. Parsing the paragraphs can be done with BeautifulSoup instead.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(text,"html.parser")
    for para in soup.find_all('p'):
        tokens = [word.lower() for word in words.tokenize(para.text)]
        # Then update the counters as described above:
        # col_freq.update(tokens)
        # doc_freq.update(set(tokens))
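If installing BeautifulSoup is not an option, the standard library's `html.parser.HTMLParser` can play a similar role. Note that this swaps in a different technique from the answer's own code. The sketch below assumes each `<P>` start tag begins a new paragraph, so missing closing tags are tolerated; the class name `ParagraphCollector` is hypothetical.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text that follows each <p> start tag."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":  # HTMLParser lowercases tag names
            self.paragraphs.append("")

    def handle_data(self, data):
        if self.paragraphs:  # ignore text before the first <p>
            self.paragraphs[-1] += data

collector = ParagraphCollector()
# The first paragraph deliberately has no closing tag.
collector.feed("<P ID=1>\nI love burgers.\n<P ID=2>\nNice!\n</P>")
collector.close()
paragraphs = [p.strip() for p in collector.paragraphs]
print(paragraphs)  # ['I love burgers.', 'Nice!']
```

Each string in `paragraphs` can then be tokenized and fed to the two counters in the same way as before.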
  •  1
  •   Lord Elrond Mureinik    5 years ago


    import re
    from nltk.tokenize import word_tokenize, RegexpTokenizer
    def normalize_text(file):
        # Strip the paragraph tags so they are not tokenized as words.
        file = re.sub(r'<P ID=(\d+)>', '', file)
        file = re.sub(r'</P>', '', file)
        tokenizer = RegexpTokenizer(r'\w+')
        all_words = tokenizer.tokenize(file)
        lower_case = []
        for word in all_words:
            curr = word.lower()
            lower_case.append(curr)
        return lower_case
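    def find_words(filepath):
        with open(filepath, 'r') as f:
            file = f.read()
        word_list = normalize_text(file)
        data = file.replace('</P>','').split('<P ID=')
        result = {}
        for word in word_list:
            result[word] = {}
            for p in data:
                if p:
                    # p looks like "1>\ntext"; p[0] is the ID (single-digit IDs only).
                    # Count whole lowercased tokens rather than substrings.
                    tokens = re.findall(r'\w+', p[2:].lower())
                    result[word][f'paragraph_{p[0]}'] = tokens.count(word)
        return result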


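    def find_words(filepath):
        with open(filepath, 'r') as f:
            file = f.read()
        word_list = normalize_text(file)
        data = file.replace('</P>','').split('<P ID=')
        result = {}
        for p in data:
            if p:
                # p looks like "1>\ntext"; p[0] is the ID (single-digit IDs only).
                tokens = re.findall(r'\w+', p[2:].lower())
                result[f'paragraph_{p[0]}'] = {}
                for word in word_list:
                    # Count whole lowercased tokens rather than substrings.
                    result[f'paragraph_{p[0]}'][word] = tokens.count(word)
        return result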

    It's still a bit hard to read, though. If pretty-printing objects is important to you, you could try using a pretty printing package.
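One such option is the standard library's `pprint` module. This is a minimal sketch; the nested `result` dict holds made-up counts purely to show the layout and is not output from the functions above.

```python
from pprint import pformat

# Illustrative counts only, shaped like the per-paragraph result dict.
result = {
    "paragraph_1": {"burgers": 1, "diner": 0, "i": 1},
    "paragraph_2": {"burgers": 1, "diner": 2, "i": 2},
}

# A narrow width forces the nested dicts onto separate lines.
formatted = pformat(result, width=40)
print(formatted)
```

`pprint.pprint(result, width=40)` prints the same text directly; `pformat` is used here only so the string can be inspected or logged.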


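    def find_paragraph_occurrences(filepath):
        with open(filepath, 'r') as f:
            file = f.read()
        word_list = normalize_text(file)
        # The text is lowercased first, so split on the lowercased tag.
        data = file.lower().replace('</p>','').split('<p id=')
        # Tokenize each paragraph body (the part after "N>") into a set of whole words.
        paragraph_words = [set(re.findall(r'\w+', p.split('>', 1)[-1])) for p in data if p]
        result = {}
        for word in word_list:
            # Whole-word membership avoids substring hits such as 'a' in 'always'.
            result[word] = sum(word in words for words in paragraph_words)
        return result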
  •  1
  •   DarrylG    5 years ago
    import re
    from collections import defaultdict, Counter

    def create_dict(text):
      " Dictionary contains strings for each paragraph using paragraph ID as key"
      d = defaultdict(lambda: "")
      lines = text.splitlines()
      for line in lines:
        matchObj = re.match( r'<P ID=(\d+)>', line)
        if matchObj:
          dictName = matchObj.group(1)  # captured paragraph ID
          continue  #skip line containing paragraph ID
        elif re.match(r'</P>', line):
          continue  #skip line containing paragraph ending token
        d[dictName] += line.lower() + " "  # space keeps words on adjacent lines apart
      return d

    def document_frequency(d):
      " frequency of words in document "
      c = Counter()
      for paragraph in d.values():
        words = re.findall(r'\w+', paragraph)
        c.update(words)  # count every occurrence
      return c

    def paragraph_frequency(d):
      "Frequency of words in paragraph "
      c = Counter()
      for sentences in d.values():
        words = re.findall(r'\w+', sentences)
        set_words = set(words)  # Set causes at most one occurrence 
                                # of word in paragraph
        c.update(set_words)
      return c

    text = """<P ID=1>
    I have always wanted to try like, multiple? Different rasteraunts. Not quite sure which kind, maybe burgers!
    <P ID=2>
    Nice! I love burgers. Cheeseburgers, too. Have you ever gone to a diner type restauraunt? I have always wanted to try every diner in the country.
    <P ID=3>
    I am not related to the rest of these paragraphs at all.
    """
    d = create_dict(text)
    doc_freq = document_frequency(d)    # Number of times in document
    para_freq = paragraph_frequency(d)  # Number of paragraphs containing the word
    print("document:", doc_freq)
    print("paragraph: ", para_freq)


    document: Counter({'i': 4, 'to': 4, 'have': 3, 'always': 2, 'wanted': 2, 'try': 2, 'not': 2,'burgers': 2, 'diner': 2, 'the': 2, 'like': 1, 'multiple': 1, 'different': 1, 'rasteraunts':1, 'quite': 1, 'sure': 1, 'which': 1, 'kind': 1, 'maybe': 1, 'nice': 1, 'love': 1, 'cheeseburgers': 1, 'too': 1, 'you': 1, 'ever': 1, 'gone': 1, 'a': 1, 'type': 1, 'restauraunt': 1, 'every': 1, 'in': 1, 'country': 1, 'am': 1, 'related': 1, 'rest': 1, 'of': 1, 'these': 1, 'paragraphs': 1, 'at': 1, 'all': 1})
    paragraph:  Counter({'to': 3, 'i': 3, 'try': 2, 'have': 2, 'burgers': 2, 'wanted': 2, 'always': 2, 'not': 2, 'the': 2, 'which': 1, 'multiple': 1, 'quite': 1, 'rasteraunts': 1, 'kind': 1, 'like': 1, 'maybe': 1, 'sure': 1, 'different': 1, 'love': 1, 'too': 1, 'in': 1, 'restauraunt': 1, 'every': 1, 'nice': 1, 'cheeseburgers': 1, 'diner': 1, 'ever': 1, 'a': 1, 'type': 1, 'you': 1, 'country': 1, 'gone': 1, 'at': 1, 'related': 1, 'paragraphs': 1, 'rest': 1, 'of': 1,'am': 1, 'these': 1, 'all': 1})