
Removing HTML tags from free-flowing text to form separate sentences

  •  0
  • Anurag Sharma  · Tech Community  · 7 years ago

    I want to extract sentences from a large chunk of text. My text looks something like this:

    <ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>
    

    I want to extract proper sentences from the text above, so the expected output is a list:

    ['Registered Nurse in Missouri, License number xxxxxxxx, 2017',
    'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018',
    'AHA PALS - Pediatric Advanced Life Support 2017-2019',
    'AHA Basic Life Support 2016-2018']
    

    I am using Python's built-in HTMLParser:

    import re
    from html.parser import HTMLParser

    class HTMLStripper(HTMLParser):

        def __init__(self):
            # convert_charrefs=True decodes entities such as &amp; before handle_data
            super().__init__(convert_charrefs=True)
            self.fed = []

        def handle_data(self, chunk):
            # Called once per run of text between two tags
            self.fed.append(chunk.strip())

        def get_data(self):
            # Drop chunks that were empty or whitespace-only
            return [x for x in self.fed if x]


    def strip_html_tags(html):
        try:
            s = HTMLStripper()
            s.feed(html)
            s.close()  # flush any text buffered after the last tag
            return s.get_data()
        except Exception:
            # Fallback: strip anything that looks like a tag with a regex
            p = re.compile(r'<.*?>')
            return p.sub('', html)
    

    Calling strip_html_tags gives the following result:

    ['Registered Nurse in', 'Missouri', ', License number', 'xxxxxxxx', ',', '2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification', '2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
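
    The fragments appear to come from the way HTMLParser reports text: handle_data fires once for every run of text between two tags, including inline tags such as <font>. The sketch below records each callback for a shortened version of the first list item so the order of events is visible. EventTracer is a hypothetical name used only for this illustration.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Record every parser callback in the order it fires."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


tracer = EventTracer()
# Shortened sample: one list item with a single inline <font> tag
tracer.feed("<li>Registered Nurse in <font>Missouri</font>, 2017</li>")
tracer.close()

for event in tracer.events:
    print(event)
```

    The one list item produces three separate data events, split at the <font> boundaries, which is why the stripped output above is chopped into pieces.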
    

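    I can't check strictly for <ul> or <li> tags, because different texts may use different HTML tags. Is there a way to split text like the above only at the outer HTML tags, rather than at every HTML tag encountered?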

    2 answers  |  as of 7 years ago
        1
  •  1
  •   Ofer Sadan    7 years ago

    Why not use a tool that already parses HTML efficiently, such as BeautifulSoup:

    from bs4 import BeautifulSoup
    
    demo = '<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>'
    soup = BeautifulSoup(demo, 'lxml')
    sentences = [item.text for item in soup.find_all('li')]  # find_all is the current name for findAll
    

    sentences now holds the text of each <li> element, with the nested <font> tags removed.

    Based on your comment, I would use the following code instead:

    text_without_tags = soup.text
    

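    Now you no longer need to worry about the tags. You just have a plain string, which you can turn into a list with split(','), for example on commas (although if the text does not always contain commas or periods, I would not bother and would just use the string itself).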
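
    A small sketch of that split. The string below is written by hand as a stand-in for the tag-free text of the first list item; it is not produced by BeautifulSoup here.

```python
# Hand-written stand-in for the tag-free text of the first list item (an assumption)
text_without_tags = "Registered Nurse in Missouri, License number xxxxxxxx, 2017"

# Split on commas and trim the whitespace around each piece
parts = [part.strip() for part in text_without_tags.split(",")]
print(parts)
# ['Registered Nurse in Missouri', 'License number xxxxxxxx', '2017']
```

    Note that splitting on commas also cuts inside a single item, so whether this is useful depends on what counts as a sentence in your data.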

        2
  •  0
  •   Anurag Sharma    7 years ago

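    BeautifulSoup works if I know in advance which tags I have to extract text from (so that I can apply soup.find_all(specific_tag)), but that is not my situation. There can be several different tags that I have to extract text from. For example: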

    <p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style="text-decoration: underline;">Nature Methods</span> 2017,</div>
    

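    Here the sentences have to be extracted from both the <p> tag and the <div> tags, so I wrote a custom HTMLParser subclass that tracks the tags itself: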

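    import re
    from html.parser import HTMLParser
    from sample_htmls import *  # author's local module; presumably provides the sample string mystr used below

    class HTMLStripper(HTMLParser):

        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.feeds = []         # finished sentences
            self.sentence = ''      # text collected for the sentence in progress
            self.current_path = []  # tracked tags seen since the last flush
            self.tree = []          # one tag path per flush, kept for debugging
            self.lookup_tags = ['div', 'span', 'p', 'ul', 'li']

        def update_feed(self):
            # Flush: store the tag path and the sentence with repeated spaces collapsed
            self.tree.append(list(self.current_path))
            self.current_path.clear()
            self.feeds.append(re.sub(' +', ' ', self.sentence).strip())
            self.sentence = ''

        def handle_starttag(self, tag, attrs):
            if tag in self.lookup_tags:
                # A new <li> closes whatever was collected before it
                if tag == 'li' and len(self.current_path) > 0:
                    self.update_feed()
                self.current_path.append(tag)

        def handle_endtag(self, tag):
            if tag in self.lookup_tags:
                self.current_path.append(tag)
                # Flush when the tag that opened the current path is closed
                if tag == self.current_path[0]:
                    self.update_feed()

        def handle_data(self, data):
            # Text inside untracked tags (e.g. <font>) joins the current sentence
            self.sentence += ' ' + data

        def get_tree(self):
            return self.tree

        def get_data(self):
            # Skip the empty strings produced by flushes that had no text
            return [x for x in self.feeds if x]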
    

    Running the code on the example above:

    parser = HTMLStripper()
    parser.feed(mystr)
    l1 = parser.get_tree()
    feed = parser.get_data()
    print(l1)
    print("\n", mystr)
    print("\n", feed)
    print("\n\n")
    

    And the output:

    [['ul'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['ul']]
    
    <ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>
    
    ['Registered Nurse in Missouri , License number xxxxxxxx , 2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
    

    It also works for the mixed-tag HTML string:

    [['p', 'p'], ['div', 'div'], ['div', 'span', 'span', 'div']]
    
    <p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style="text-decoration: underline;">Nature Methods</span> 2017,</div>
    
    ['Science', 'Biology', 'Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nature Methods 2017,']
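
    A possible variation on the same idea, offered as a sketch rather than a drop-in replacement: treat a fixed set of block-level tags as sentence boundaries and let every other tag pass through as inline markup. The names BLOCK_TAGS, BlockTextSplitter and split_html_sentences are hypothetical, and the tag set is an assumption that would need adjusting for real data. Because the text chunks are joined without an extra space, no stray space appears before the commas.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags mark sentence boundaries; every other tag is inline.
BLOCK_TAGS = frozenset({"p", "div", "ul", "ol", "li", "br", "tr", "td",
                        "h1", "h2", "h3", "h4", "h5", "h6"})


class BlockTextSplitter(HTMLParser):
    """Collect text and cut it only where a block-level tag opens or closes."""

    def __init__(self, block_tags=BLOCK_TAGS):
        super().__init__(convert_charrefs=True)
        self.block_tags = set(block_tags)
        self.sentences = []
        self._parts = []  # text chunks of the sentence in progress

    def _flush(self):
        # Join the chunks as they appeared, then collapse runs of whitespace
        text = re.sub(r"\s+", " ", "".join(self._parts)).strip()
        if text:
            self.sentences.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.block_tags:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.block_tags:
            self._flush()

    def handle_data(self, data):
        self._parts.append(data)

    def close(self):
        super().close()  # lets the parser emit any buffered trailing text
        self._flush()


def split_html_sentences(html):
    parser = BlockTextSplitter()
    parser.feed(html)
    parser.close()
    return parser.sentences


sample = ("<ul><li>Registered Nurse in <font>Missouri</font>, "
          "License number <font>xxxxxxxx</font>, <font>2017</font></li>"
          "<li>AHA Basic Life Support 2016-2018</li></ul>")
print(split_html_sentences(sample))
# ['Registered Nurse in Missouri, License number xxxxxxxx, 2017',
#  'AHA Basic Life Support 2016-2018']
```

    Unlike the path-tracking version above, this sketch keeps no tag stack, so nested block tags simply produce an extra cut at each boundary instead of relying on the first tag of the path being closed.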