
Extracting tag information from HTML text

  •  0
  • surendra  · Tech Community  · 8 years ago

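    I am trying to scrape a web page and got the text below. How can I extract the src value from the following string? Could someone walk me through the process? More generally, how can I extract any key-value (attribute) data from text like this?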

    <img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
    

    I also need the text inside the textarea tag.

      <textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
    
    2 replies  |  as of 8 years ago
        1
  •  0
  •   Shane    8 years ago

    Since you tagged the question with beautifulsoup, I assume you want to use it to parse the HTML content.

    import bs4
    
    content = """<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
    <textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
    """
    
    soup = bs4.BeautifulSoup(content, 'lxml')
    
    img = soup.find('img') # locate img tag
    text_area = soup.find('textarea') # locate textarea tag
    
    print(img['id'])  # print value of 'id' attribute in img tag
    print(img['src'])  # print value of 'src' attribute
    print(text_area.text)  # print the text content of the textarea tag
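
    A page being scraped often contains several img tags, and some of them may lack a src attribute, in which case img['src'] raises a KeyError. The following is a minimal sketch of one way to handle that, not part of the original answer. The markup is a shortened, hypothetical variant of the question's snippet, and 'html.parser' is used only so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup modelled on the question's snippet.
html = """
<img id="imgsglx2" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg"/>
<img id="no-src"/>
<textarea id="sgmsbck" name="sgms">  Mr. Song injured.  </textarea>
"""

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

# Collect src from every img tag; .get() returns None when the attribute is absent.
sources = [img.get("src") for img in soup.find_all("img")]

# Keep only the images that actually carry a src attribute.
present = [src for src in sources if src is not None]

# get_text(strip=True) trims the whitespace around the tag's text.
description = soup.find("textarea").get_text(strip=True)

print(sources)
print(present)
print(description)
```

    Using .get() instead of square brackets trades an exception for a None value, which tends to be easier to filter when looping over many tags.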
    
        2
  •  0
  •   宏杰李    8 years ago

    BeautifulSoup can help:

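    A tag may have any number of attributes. In this example the tag has a "class" attribute whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary: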

    tag['class']
    
    # u'boldest' as originally posted; in bs4, class is multi-valued, so this is ['boldest']
    

    You can access that dictionary directly as .attrs:

    tag.attrs
    # {u'class': u'boldest'} as originally posted; in bs4 this is {'class': ['boldest']}
    

    You can get the text inside a tag via .text:

    tag.text
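
    The same "attributes as a dictionary" idea can also be reproduced with only the Python standard library, which may help when BeautifulSoup cannot be installed. This is a rough sketch of a swapped-in technique, not something either answer proposed. TagCollector and VOID_TAGS are hypothetical names introduced here, and the sketch keeps only the last occurrence of each tag name.

```python
from html.parser import HTMLParser

# Tags that never have content, so no text should be attributed to them.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class TagCollector(HTMLParser):
    """Hypothetical helper: records attributes and text for chosen tag names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.attributes = {}  # tag name -> dict of its attributes
        self.text = {}        # tag name -> text found inside the tag
        self._open = None     # wanted tag whose text is being collected

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            # attrs arrives as a list of (name, value) pairs.
            self.attributes[tag] = dict(attrs)
            self.text[tag] = ""
            if tag not in VOID_TAGS:
                self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self.text[self._open] += data


html = (
    '<img id="imgsglx2" '
    'src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg"/>'
    '<textarea id="sgmsbck" name="sgms">Mr. Song injured.</textarea>'
)

collector = TagCollector(["img", "textarea"])
collector.feed(html)
collector.close()

print(collector.attributes["img"]["src"])
print(collector.text["textarea"])
```

    For real pages BeautifulSoup is usually the more forgiving choice, since it builds a full tree and copes with nested or malformed markup that this sketch ignores.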