Grouping similar URLs / finding common URL patterns (Python)

url parsing python

The Wanderer · Tech Community · 8 years ago

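I have about 100K URLs, each labeled as positive or negative. I want to see which kinds of URLs correspond to the positive class (and likewise for the negative class).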

I started by grouping the URLs by subdomain and identified the most common positive and negative subdomains.
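
For reference, the subdomain grouping step could look roughly like the sketch below. It assumes the data is available as (url, label) pairs with labels 'positive' and 'negative'; the function name count_by_host and the sample labels are made up for illustration and are not my real data.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def count_by_host(labeled_urls):
    """Return {hostname: Counter({'positive': n, 'negative': m})}."""
    counts = defaultdict(Counter)
    for url, label in labeled_urls:
        # urlsplit(...).hostname is already lowercased; it is None if missing
        host = urlsplit(url).hostname or ""
        counts[host][label] += 1
    return counts


# Made-up sample labels, for illustration only.
sample = [
    ("http://www.clarin.com/politica/", "negative"),
    ("http://www.clarin.com/tema/manifestaciones.html", "positive"),
    ("http://www.clarin.com/buscador?q=protesta", "negative"),
    ("http://example.org/news/1", "positive"),
]

host_counts = count_by_host(sample)
for host, c in sorted(host_counts.items()):
    total = c["positive"] + c["negative"]
    print(host, c["positive"], c["negative"], total)
```

Hosts whose positive and negative counts come out roughly equal are the ones that need the finer-grained analysis.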

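Now, for the subdomains where the positive-to-negative ratio is roughly equal, I want to dig further and look for patterns. Example patterns: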

http://www.clarin.com/politica/ (pattern: domain/section)
http://www.clarin.com/tema/manifestaciones.html (pattern: domain/tag/tag_name)
http://www.clarin.com/buscador?q=protesta (pattern: domain/search?=search_term)

The links are not limited to clarin.com.

Any suggestions on how to discover such patterns?

1 reply | as of 8 years ago

The Wanderer 8 years ago

I solved this by treating it as a "finding largest common substring" problem (in practice, the code below looks for common URL prefixes).

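The solution builds a prefix tree from the characters of each URL (the code below stores it as a dictionary keyed by prefix). Each node in the tree stores the positive, negative, and total counts. Finally, the tree is pruned to return the most common patterns.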
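
To make the "node with counts" idea concrete, here is a minimal sketch of the same structure as an explicit nested tree. It is only an illustration: the names new_node, build_tree and lookup, the (url, label) input format, and the sample labels are assumptions, and the actual code further down uses a flat dictionary instead.

```python
def new_node():
    # Each node keeps per-class counts plus its children, keyed by character.
    return {"positive": 0, "negative": 0, "total": 0, "children": {}}


def build_tree(labeled_urls, max_depth=100):
    """Insert every URL character by character, updating counts on the way."""
    root = new_node()
    for url, label in labeled_urls:
        node = root
        for ch in url.lower()[:max_depth]:
            child = node["children"].get(ch)
            if child is None:
                child = node["children"][ch] = new_node()
            node = child
            node[label] += 1
            node["total"] += 1
    return root


def lookup(root, prefix):
    """Return the node reached by following prefix, or None if absent."""
    node = root
    for ch in prefix.lower():
        node = node["children"].get(ch)
        if node is None:
            return None
    return node


# Made-up sample labels, for illustration only.
sample = [
    ("http://www.clarin.com/politica/a", "negative"),
    ("http://www.clarin.com/politica/b", "negative"),
    ("http://www.clarin.com/tema/c.html", "positive"),
]
tree = build_tree(sample)
node = lookup(tree, "http://www.clarin.com/politica/")
print(node["positive"], node["negative"], node["total"])  # 0 2 2
```

The node for "http://www.clarin.com/" sees all three URLs, while the node for ".../politica/" sees only the two negative ones, which is the signal the pruning step looks for.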

Code:

def find_patterns(incoming_urls):
    urls = {}
    # make the tree
    for url in incoming_urls:
        # Assumes each entry of incoming_urls is a string of the form url____class,
        # where class is 'positive' or 'negative'.
        url, atype = url.strip().rsplit("____", 1)
        # Use at most the first 100 characters to avoid building a sparse tree
        bound = min(len(url), 100) + 1
        for x in range(1, bound):
            if url[:x].lower() not in urls:
                urls[url[:x].lower()] = {'positive': 0, 'negative': 0, 'total': 0}
            urls[url[:x].lower()][atype] += 1
            urls[url[:x].lower()]['total'] += 1

    new_urls = {}
    # prune the tree
    for url in urls:
        if urls[url]['total'] < 5:  # A common pattern must occur at least 5 times.
            continue
        urls[url]['negative_percentage'] = (float(urls[url]['negative']) * 100) / urls[url]['total']
        if urls[url]['negative_percentage'] < 85.0: # Assuming I am interested in finding url patterns for negative class
            continue
        length = len(url)
        found = False
        # iterate to see if a len+1 url is present with same total count
        for second in urls:
            if len(second) <= length:
                continue
            if url == second[:length] and urls[url]['total'] == urls[second]['total']:
                found = True
                break
        # keep only patterns longer than 20 characters
        if not found and len(url) > 20:
            new_urls[url] = urls[url]

    print("URL Pattern; Positive; Negative; Total; Negative (%)")
    for url in new_urls:
        print("%s; %d; %d; %d; %.2f" % (
            url, new_urls[url]['positive'], new_urls[url]['negative'], new_urls[url]['total'],
            new_urls[url]['negative_percentage']))
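
One note on the pruning loop: it compares every candidate against every other key, which can get slow with 100K URLs. Since every prefix of a stored key is itself a key, it should be enough to check whether a prefix one character longer has the same total. Below is a possible sketch of that variant, not the code I originally ran. The names count_prefixes and maximal_patterns, the (url, label) input format, and the sample data are made up for illustration; the thresholds mirror the ones above.

```python
from collections import defaultdict


def count_prefixes(labeled_urls, max_len=100):
    """Count positive/negative/total for every prefix of every URL."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "total": 0})
    for url, label in labeled_urls:
        url = url.lower()
        for end in range(1, min(len(url), max_len) + 1):
            node = counts[url[:end]]
            node[label] += 1
            node["total"] += 1
    return dict(counts)


def maximal_patterns(counts, min_total=5, min_negative_pct=85.0, min_len=21):
    # A prefix is "extended" if a prefix one character longer has the same
    # total, i.e. every URL that reaches it continues the same way.
    extended = {p[:-1] for p, c in counts.items()
                if len(p) > 1 and counts[p[:-1]]["total"] == c["total"]}
    result = {}
    for prefix, c in counts.items():
        if c["total"] < min_total or len(prefix) < min_len or prefix in extended:
            continue
        pct = 100.0 * c["negative"] / c["total"]
        if pct >= min_negative_pct:
            result[prefix] = dict(c, negative_percentage=pct)
    return result


# Made-up sample data, for illustration only.
sample = [("http://www.clarin.com/politica/nota-%d" % i, "negative") for i in range(5)]
sample += [("http://www.clarin.com/tema/t%d.html" % i, "positive") for i in range(5)]

patterns = maximal_patterns(count_prefixes(sample))
for prefix, c in patterns.items():
    print("%s; %d; %d; %d; %.2f" % (
        prefix, c["positive"], c["negative"], c["total"], c["negative_percentage"]))
```

On this sample the only surviving pattern is http://www.clarin.com/politica/nota-, because shorter prefixes are either shared with the positive URLs or are extended by a longer prefix with the same count.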