Is there anything for Python that is similar to readability.js?

heuristics html-content-extraction python javascript

Emre Sevinç · 15 years ago

I'm looking for a package, module, function, or similar that is roughly the Python equivalent of Arc90's readability.js.

http://lab.arc90.com/experiments/readability

http://lab.arc90.com/experiments/readability/js/readability.js

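The idea is that I give it some input.html and get back a cleaned-up version of that page containing only the "main text". I want this so that I can use it on the server side (unlike the JS version, which runs only in the browser).

Any ideas?

PS: I have tried Rhino + env.js. That combination works, but the performance is unacceptable: it takes minutes to clean up most HTML content :( (I still haven't found out why there is such a big performance difference).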

6 answers | latest 14 years ago

Yuri Baburov 14 years ago

Please try my fork, https://github.com/buriy/python-readability. It is fast and has all the features of the latest JavaScript version.

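Martin 15 years ago

We have just launched a new natural language processing API at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. It is implemented in Python. Check it out and compare the results to readability.js; I think you'll find they are almost 100% the same.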

Sridhar Ratnakumar 14 years ago

hn.py, via Readability's blog. Readable Feeds, an App Engine application, makes use of it.

I have bundled it as a pip-installable module here: http://github.com/srid/readability

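Alec Thomas 15 years ago

I did some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup before applying the algorithm, such as removing head/script/iframe elements, hidden elements, and so on, but this was the core of it.

Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a high link-to-text ratio (i.e. navigation bars, menus, ads, etc.):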

# Requires lxml: drop_tree() is a method of lxml.html elements.
def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link to text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml.html element or tree (modified in place).
    :param min_links: Number of link text nodes an element must exceed
                      before the block is considered for deletion.
    :param ratio: Ratio of link text to all text above which an element
                  is deleted.
    """
    def collapse(strings):
        # Join the stripped, non-empty text fragments into one string.
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        if el.getparent() is None:
            # The root element cannot be dropped from its own tree.
            continue
        anchor_text = el.xpath('.//a//text()')
        anchor_count = len(anchor_text)
        anchor_chars = float(len(collapse(anchor_text)))
        # Named total_chars so the builtin all() is not shadowed.
        total_chars = float(len(collapse(el.xpath('.//text()'))))
        if anchor_count > min_links and total_chars and anchor_chars / total_chars > ratio:
            el.drop_tree()

It actually worked quite well on the test corpus I used, but achieving high reliability requires a lot of tweaking.
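The cleanup step mentioned above (dropping head/script/iframe and hidden elements before running the discriminator) is not shown in the answer. The following is only a rough sketch of what such a pass might look like, using the standard library's `html.parser` rather than lxml. The names `pre_clean`, `PreCleaner`, `DROP_TAGS`, and `is_hidden` are made up for illustration, the hidden-element check only looks at inline `style` attributes, and the depth counting assumes reasonably well-balanced markup.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the answer names head/script/iframe; "style" is added here as a guess.
DROP_TAGS = {"head", "script", "style", "iframe"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """Crude check: inline style hides the element."""
    style = (dict(attrs).get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


class PreCleaner(HTMLParser):
    """Re-emit the HTML while skipping dropped subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    @staticmethod
    def _render(tag, attrs):
        parts = [tag] + ['%s="%s"' % (k, escape(v or "", quote=True))
                         for k, v in attrs]
        return "<%s>" % " ".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void elements have no end tag, so they never change the depth.
            if not self.skip_depth and not is_hidden(attrs):
                self.out.append(self._render(tag, attrs))
            return
        if self.skip_depth or tag in DROP_TAGS or is_hidden(attrs):
            self.skip_depth += 1
            return
        self.out.append(self._render(tag, attrs))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def pre_clean(html_text):
    """Return html_text without head/script/style/iframe and hidden blocks."""
    parser = PreCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The cleaned string could then be parsed with lxml and handed to `link_list_discriminator`. Elements hidden through external stylesheets are not detected by this sketch.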

Vinay Sajip 15 years ago

Why not try using Google V8/Node.js instead of Rhino? It should be acceptably fast.

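eikes 15 years ago

I think BeautifulSoup is the best HTML parser for Python. But you still need to figure out what the "main" part of a site is.

If you are only parsing a single domain, it is fairly straightforward, but finding a pattern that works for any site is not so easy.

Maybe you could port the readability.js approach to Python?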
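To give a feel for what such a port might involve, here is a heavily simplified sketch, loosely modeled on the paragraph-scoring idea in readability.js and not a faithful port. Each sufficiently long `<p>` earns points for existing, for its commas, and for its length. The points go to its parent and, halved, to its grandparent, and the best container is picked after discounting by link density. All names (`Node`, `TreeBuilder`, `extract_main_text`) and the 25-character threshold are assumptions made for this illustration. It uses only the standard library.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM node; children are a mix of str (text) and Node."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.score = 0.0

    def text(self):
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)

    def link_chars(self):
        """Number of text characters that sit inside <a> descendants."""
        total = 0
        for c in self.children:
            if isinstance(c, Node):
                total += len(c.text()) if c.tag == "a" else c.link_chars()
        return total

    def walk(self):
        yield self
        for c in self.children:
            if isinstance(c, Node):
                yield from c.walk()


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in ("script", "style"):
            self.current.children.append(data)


def extract_main_text(html_text, min_chars=25):
    """Return the text of the best-scoring container, or None."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    candidates = []
    for node in builder.root.walk():
        text = node.text().strip()
        if node.tag != "p" or len(text) < min_chars:
            continue
        # One base point, one per comma, one per 100 characters (max 3).
        points = 1 + text.count(",") + min(len(text) // 100, 3)
        parent = node.parent
        grandparent = parent.parent if parent is not None else None
        for target, share in ((parent, 1.0), (grandparent, 0.5)):
            if target is not None:
                target.score += points * share
                if target not in candidates:
                    candidates.append(target)
    if not candidates:
        return None

    def final_score(n):
        density = n.link_chars() / max(len(n.text()), 1)
        return n.score * (1 - density)

    best = max(candidates, key=final_score)
    return " ".join(best.text().split())
```

Calling `extract_main_text(page_source)` returns whitespace-normalized text of the chosen container. A real port would also need class/id weighting, sibling merging, and far more cleanup than this.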