
How to remove duplicates from a list while adding a duplicate count to each list item?

list python

-2

Crickets · Tech Community · 6 years ago

My question is the same as

How to remove case-insensitive duplicates from a list, while maintaining the original list order?

but I also want each duplicated item to show its number of occurrences in the item string itself (in parentheses).

Example input:

myList = ["paper", "Plastic", "aluminum", "PAPer", "TIN", " paper", "glass", "tin", "PAPER", "Polypropylene Plastic"]

The only acceptable output:

myList = ["paper (3)", "Plastic", "aluminum", "TIN (2)", " paper", "glass", "Polypropylene Plastic"]

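Notes:

Note that if one item ("Polypropylene Plastic") happens to contain another item ("Plastic"), I still want to keep both items.
So the case may differ, but an item must match another character for character to be removed.
The original list order must be preserved.
All duplicates after the first instance of an item should be removed. The original case of that first instance should be preserved, as should the original case of all non-duplicated items.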

I'm looking for the fastest way to accomplish this in Python 2.7.

3 replies | as of 6 years ago

chthonicdaemon 6 years ago

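Here is a version that uses a Counter and avoids the extra set used in @RoadRunner's solution by popping items from the counter as we pass them. If there are many duplicates, this may be slightly slower than the OrderedDict solution, but it uses less memory: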

from collections import Counter

words = ["paper", "Plastic", "aluminum", "PAPer", "TIN", " paper", "glass", "tin", "PAPER", "Polypropylene Plastic"]

counter = Counter(w.lower() for w in words)

result = []
for word in words:
    key = word.lower()
    if key in counter:
        count = counter[key]
        if count == 1:
            result.append(word)
        else:
            result.append('{} ({})'.format(word, count))
        counter.pop(key)

Note: on Python >= 3.3 you should use casefold instead of lower.
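To illustrate why casefold is the safer choice for caseless comparison, here is a small sketch. It needs Python 3.3 or later (str.casefold does not exist on 2.7), and the sample strings are chosen for illustration only; they are not from the question.

```python
# Python 3.3+ only: str.casefold() does not exist on Python 2.7.
a = "STRASSE"
b = "straße"

# lower() leaves the German sharp s untouched, so the two spellings differ.
lower_match = a.lower() == b.lower()

# casefold() maps "ß" to "ss", so the two spellings compare equal.
casefold_match = a.casefold() == b.casefold()

print(lower_match, casefold_match)  # False True
```

For plain ASCII input such as the list in the question, lower and casefold give the same keys.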

abarnert 6 years ago

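In the original question, you presumably (I only glanced at it) used a set of case-folded strings to check whether each item is new or a duplicate, building up a new list as you go.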

You could use a Counter instead of the set. But then you would need to build up the list first, and then go back and edit it with the counts.

So, instead, replace both the set/Counter and the output list with an OrderedDict that stores an item-count pair for each case-folded item:

import collections

d = collections.OrderedDict()
for item in myList:
    caseless = item.lower()
    try:
        d[caseless][1] += 1
    except KeyError:
        d[caseless] = [item, 1]

Then pass over that dict to generate the output list:

myList = []
for item, count in d.values():
    if count > 1:
        item = '{} ({})'.format(item, count)
    myList.append(item)

You can make this more concise (e.g., myList = ['{} ({})'.format(item, count) if count > 1 else item for item, count in d.values()]), which will also make it a bit faster, by a small constant factor.
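Putting the two steps together with the concise comprehension might look like the following sketch. The function name dedupe_with_counts is made up for illustration; the logic is the OrderedDict approach described above.

```python
from collections import OrderedDict


def dedupe_with_counts(items):
    """Collapse case-insensitive duplicates, keeping first-seen spelling and order."""
    seen = OrderedDict()  # lowercased key -> [first spelling, count]
    for item in items:
        key = item.lower()
        try:
            seen[key][1] += 1
        except KeyError:
            seen[key] = [item, 1]
    # Append " (n)" only to items that occurred more than once.
    return ['{} ({})'.format(item, count) if count > 1 else item
            for item, count in seen.values()]


my_list = ["paper", "Plastic", "aluminum", "PAPer", "TIN", " paper",
           "glass", "tin", "PAPER", "Polypropylene Plastic"]
print(dedupe_with_counts(my_list))
# ['paper (3)', 'Plastic', 'aluminum', 'TIN (2)', ' paper', 'glass', 'Polypropylene Plastic']
```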

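You may save a little more time by using % instead of format, and possibly a bit more with %d instead of %s (although I think that last part is no longer true, even in 2.7).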
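If you want to check the formatting claim on your own interpreter, a rough sketch with timeit could look like this. The values of item and count are arbitrary, and the timings will vary by machine and Python version, so no numbers are shown.

```python
import timeit

item, count = "paper", 3

# Both spellings produce the same string.
with_format = '{} ({})'.format(item, count)
with_percent = '%s (%d)' % (item, count)

# Time each expression in isolation; compare the two printed numbers.
setup = 'item, count = "paper", 3'
print(timeit.timeit("'{} ({})'.format(item, count)", setup=setup, number=100000))
print(timeit.timeit("'%s (%d)' % (item, count)", setup=setup, number=100000))
```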

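Depending on your platform, a[0] += 1 may be faster than a[1] += 1. So try both ways, and if a[0] is faster, use [count, item] pairs instead of [item, count]. If you have a large number of dups, you might consider a class with __slots__, which is actually slightly faster to update than a list, but significantly slower to create.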
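A class with __slots__ in place of the two-element list might be sketched as follows. The names Entry and dedupe_with_slots are hypothetical, and whether this is actually faster depends on your data, so measure before adopting it.

```python
from collections import OrderedDict


class Entry(object):
    """Hypothetical record: the first spelling seen and its running count."""
    __slots__ = ('item', 'count')

    def __init__(self, item):
        self.item = item
        self.count = 1


def dedupe_with_slots(items):
    seen = OrderedDict()  # lowercased key -> Entry
    for item in items:
        key = item.lower()
        try:
            seen[key].count += 1    # cheap attribute update on a duplicate
        except KeyError:
            seen[key] = Entry(item)  # costlier object creation on first sight
    return ['%s (%d)' % (e.item, e.count) if e.count > 1 else e.item
            for e in seen.values()]


print(dedupe_with_slots(["paper", "TIN", "PAPer", "tin", "glass"]))
# ['paper (2)', 'TIN (2)', 'glass']
```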

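Also, using an in test, or possibly storing d.__contains__ as a local, may be faster than the try, or it may be slower, depending on how many duplicates you expect to have. So try all three ways on your actual data rather than on a toy dataset.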
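The three variants could be compared with a sketch like the one below. The function names and the sample list are placeholders; swap in your real data, since the duplicate ratio decides which variant wins.

```python
from collections import OrderedDict
import timeit


def count_with_try(items):
    d = OrderedDict()
    for item in items:
        key = item.lower()
        try:
            d[key][1] += 1
        except KeyError:
            d[key] = [item, 1]
    return d


def count_with_in(items):
    d = OrderedDict()
    for item in items:
        key = item.lower()
        if key in d:
            d[key][1] += 1
        else:
            d[key] = [item, 1]
    return d


def count_with_local_contains(items):
    d = OrderedDict()
    contains = d.__contains__  # bound method looked up once, outside the loop
    for item in items:
        key = item.lower()
        if contains(key):
            d[key][1] += 1
        else:
            d[key] = [item, 1]
    return d


# Placeholder data with many duplicates; replace with your actual list.
sample = ["paper", "Plastic", "PAPer", "TIN", "tin", "PAPER"] * 1000

for func in (count_with_try, count_with_in, count_with_local_contains):
    print(func.__name__, timeit.timeit(lambda: func(sample), number=20))
```

All three build the same dictionary; only the cost of the membership check differs.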

RoadRunner 6 years ago

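You can also try using a collections.Counter() object to keep track of the counts, and use it to track the words already seen, with the caseless word as the key. Then, once you have finished iterating over the input list, update the result list so that every word with a count greater than 1 takes the form %s (%d).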

Code:

from collections import Counter

words = ["paper", "Plastic", "aluminum", "PAPer", "TIN", " paper", "glass", "tin", "PAPER", "Polypropylene Plastic"]

counts = Counter()
result = []

for word in words:
    # str.casefold() needs Python 3.3+; on Python 2.7 use word.lower() instead
    caseless = word.casefold()

    if caseless not in counts:
        result.append(word)

    counts[caseless] += 1

result = ['%s (%d)' % (w, counts[w.casefold()]) if counts[w.casefold()] > 1 else w
          for w in result]

print(result)

Output:

['paper (3)', 'Plastic', 'aluminum', 'TIN (2)', ' paper', 'glass', 'Polypropylene Plastic']