代码之家 › 专栏 › 技术社区 › foosion

在表示数字的字符串集合中查找最近的匹配

search python

foosion · 技术社区 · 15 年前

我有一个文本格式的日期时间排序列表。每个条目的格式为“2009-09-10t12:00:00”。

我想找到离目标最近的入口。有比我要做的搜索数量更多的条目。

我可以将每个条目更改为一个数字,然后用数字搜索(例如 these 方法),但这似乎是过度的努力。

有没有比这更好的方法:

def mid(res, target): 
#res is a list of entries, sorted by dt (dateTtime) 
#each entry is a dict with a dt and some other info
    n = len(res)
    low = 0
    high = n-1

    # find the first res greater than target
    while low < high:
        mid = (low + high)/2
        t = res[int(mid)]['dt']
        if t < target:
            low = mid + 1
        else:
            high = mid

    # check if the prior value is closer
    i = max(0, int(low)-1)
    a = dttosecs(res[i]['dt'])
    b = dttosecs(res[int(low)]['dt'])
    t = dttosecs(target)
    if abs(a-t) < abs(b-t):
        return int(low-1)
    else:
        return int(low)

import time
def dttosecs(dt):
    # string to seconds since the beginning
    date,tim = dt.split('T')
    y,m,d = date.split('-')
    h,mn,s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59,int(float(s)+0.5)) # round to neatest second
    s = int(s)
    secs = time.mktime((y,m,d,h,mn,s,0,0,-1))
    return secs

4 回复 | 直到 15 年前

Alex Martelli 15 年前

“复制粘贴编码”(获取 bisect 不建议使用的源代码),因为它会带来各种成本(许多额外的源代码供您测试和维护、处理所复制上游代码升级的困难等);重用标准库模块的最佳方法是简单地导入并使用它们。

然而,将字典转换成有意义的可比较条目的一次传递是O(n),它(即使传递的每个步骤都很简单)最终会淹没搜索的O(log n)时间。自从 二等分 不能支持 key= 钥匙拔出器 sort 是的,解决这个难题的方法是什么——如何重用 二等分 通过导入和调用,无需初步的O(N)步骤…?

如引用 here 答案在大卫惠勒的名言中,“计算机科学中的所有问题都可以通过另一个间接层次来解决”。考虑例如:

import bisect

listofdicts = [
  {'dt': '2009-%2.2d-%2.2dT12:00:00' % (m,d) }
  for m in range(4,9) for d in range(1,30)
  ]

class Indexer(object):
  def __init__(self, lod, key):
    self.lod = lod
    self.key = key
  def __len__(self):
    return len(self.lod)
  def __getitem__(self, idx):
    return self.lod[idx][self.key]


lookfor = listofdicts[len(listofdicts)//2]['dt']

def mid(res=listofdicts, target=lookfor):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]

def midi(res=listofdicts, target=lookfor):
    wrap = Indexer(res, 'dt')
    return res[bisect.bisect_left(wrap, target)]

if __name__ == '__main__':
  print '%d dicts on the list' % len(listofdicts)
  print 'Looking for', lookfor
  print mid(), midi()
assert mid() == midi()

输出(只是运行这个 indexer.py 作为支票,然后 timeit ,两种方式):

$ python indexer.py 
145 dicts on the list
Looking for 2009-06-15T12:00:00
{'dt': '2009-06-15T12:00:00'} {'dt': '2009-06-15T12:00:00'}
$ python -mtimeit -s'import indexer' 'indexer.mid()'
10000 loops, best of 3: 27.2 usec per loop
$ python -mtimeit -s'import indexer' 'indexer.midi()'
100000 loops, best of 3: 9.43 usec per loop

如您所见,即使在列表中有145个条目的中等任务中,间接方法的性能也可以比“密钥提取传递”方法高出三倍。由于我们比较了O(n)和O(log n),间接方法的优势随着n的增加而无限增长。(对于非常小的n,由于间接性,较高的乘法常数使键提取方法更快,但这很快就被大的o差所超越)。诚然,indexer类是额外的代码——但是,它可以在所有的二进制搜索任务中重用——在每个dict中按一个条目对dict列表进行排序,因此将其放在“容器实用程序的诀窍后面”中可以获得很好的投资回报。

对于主搜索循环来说就这么多了。对于将两个条目(一个位于目标下方,一个位于目标上方)和目标转换为若干秒的次要任务,再次考虑更高的重用方法,即:

import time

adt = '2009-09-10T12:00:00'

def dttosecs(dt=adt):
    # string to seconds since the beginning
    date,tim = dt.split('T')
    y,m,d = date.split('-')
    h,mn,s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59,int(float(s)+0.5)) # round to neatest second
    s = int(s)
    secs = time.mktime((y,m,d,h,mn,s,0,0,-1))
    return secs

def simpler(dt=adt):
  return time.mktime(time.strptime(dt, '%Y-%m-%dT%H:%M:%S'))

if __name__ == '__main__':
  print adt, dttosecs(), simpler()
assert dttosecs() == simpler()

这里,重用方法没有性能优势(事实上,相反, dttosecs 速度更快),但是,每次搜索只需要执行三次转换,不管您的听写列表中有多少条条目,所以还不清楚性能问题是否有重大影响。与此同时,用 simpler 您只需要编写、测试和维护一行简单的代码,而 DTTOSECS公司 是一打线;考虑到这个比率,在大多数情况下(即排除绝对瓶颈),我更喜欢 简单的 .重要的是要注意这两种方法以及它们之间的权衡,以确保明智地做出选择。

Eli Courtwright 15 年前

你想要 bisect module 来自标准库。它将进行二进制搜索,并将新值的正确插入点告知已排序的列表。下面是一个示例,它将打印列表中的位置 target 将插入:

from bisect import bisect
dates = ['2009-09-10T12:00:00', '2009-09-11T12:32:00', '2009-09-11T12:43:00']
target = '2009-09-11T12:40:00'
print bisect(dates, target)

从这里,您可以比较插入点前后的内容,在本例中是 dates[i-1] 和 dates[i] 看看哪一个离你最近 目标 .

nosklo 15 年前

import bisect

def mid(res, target):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]

S.Lott 15 年前

首先,改成这个。

import datetime
def parse_dt(dt):
    return datetime.strptime( dt, "%Y-%m-%dT%H:%M:%S" )

这就消除了很多“努力”。

把这当作搜索。

def mid( res, target ):
    """res is a list of entries, sorted by dt (dateTtime) 
       each entry is a dict with a dt and some other info
    """
    times = [ parse_dt(r['dt']) for r in res ]
    index= bisect( times, parse_dt(target) )
    return times[index]

这似乎不是很“努力”。这也不取决于时间戳的格式是否正确。您可以更改为任何时间戳格式,并确保这始终有效。