Grouping items into time-series bins in Python

  •  1
  • satellite9  · Tech Community  · 10 years ago

    I have data like the following:

    [[datetime1, label1],
     [datetime2, label2],
     [datetime3, label3]]
    

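    The labels are strings. I have a binning parameter (delta), which is a datetime.timedelta.

    What I want to do is:

    1. Come up with a set of datetime bins, equally spaced by delta. In other words, in the output below, datetimebin2 - datetimebin1 = datetimebin3 - datetimebin2 = delta.
    2. Put the labels into these bins.

    So I would end up with something like this: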

    [[datetimebin1, [label1, label2]],
     [datetimebin2, []],
     [datetimebin3, []],
     [datetimebin4, [label3]]]
    

    I was pointed to pandas, but could not find what I was looking for there. Any help is much appreciated!

    2 replies  |  latest 10 years ago
        1
  •  3
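  •   agmangas    10 years ago

    I think @DrV's answer is the right one, but I have put together an example that tries to show how to achieve something similar with Pandas: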

    import numpy
    import pandas
    import datetime
    import time
    
    # Binning delta
    
    delta = datetime.timedelta(hours=1)
    
    # Sample data
    
    sample = [
        ['2014-08-09 16:30:00', 'label1'],
        ['2014-08-09 15:30:00', 'label2'],
        ['2014-08-09 14:30:00', 'label3'],
        ['2014-08-09 14:00:00', 'label4']
    ]
    
    # Create dataframe and append UNIX timestamp column
    
    df = pandas.DataFrame(sample)
    df.columns = ['Datetime', 'Label']
    df['Datetime'] = pandas.to_datetime(df['Datetime'])
    df['UnixStamp'] = df['Datetime'].apply(lambda d: time.mktime(d.timetuple()))
    df = df.set_index('Datetime')
    
    # Calculate bins
    
    # total_seconds() is used because delta.seconds drops the days part of a timedelta
    step = delta.total_seconds()
    bins = numpy.arange(min(df['UnixStamp']), max(df['UnixStamp']) + step, step)
    
    # Group columns by datetime bin
    
    def bin_from_tstamp(tstamp):
        # Return the bin edge closest to tstamp (ties go to the earlier edge)
        diffs = [abs(tstamp - edge) for edge in bins]
        return bins[diffs.index(min(diffs))]
    
    grouped = df.groupby(df['UnixStamp'].map(
        lambda t: datetime.datetime.fromtimestamp(bin_from_tstamp(t))
    ))
    

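    At this point, grouped holds the dataset grouped by datetime bin.

    Here is the result of printing grouped.groups (the keys are the datetime bins, the values are the datetimes grouped into each bin):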

    {
        numpy.datetime64('2014-08-09T18:00:00.000000000+0200'): [
            Timestamp('2014-08-09 16:30:00')
        ], 
        numpy.datetime64('2014-08-09T17:00:00.000000000+0200'): [
            Timestamp('2014-08-09 15:30:00')
        ], 
        numpy.datetime64('2014-08-09T16:00:00.000000000+0200'): [
            Timestamp('2014-08-09 14:30:00'), 
            Timestamp('2014-08-09 14:00:00')
        ]
    }
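
    As a complementary sketch (not part of the original answer), the bin number of each row can also be computed by floor-dividing its offset from the earliest timestamp by delta. This avoids the UNIX-timestamp round trip and makes it easy to fill in empty bins, producing the list-of-lists shape asked for in the question. The helper name bin_with_pandas is hypothetical, and the bins are assumed to start at the earliest datetime in the data.

```python
import datetime

import pandas as pd


def bin_with_pandas(rows, delta):
    """Group labels into equally spaced bins of width `delta` (hypothetical helper).

    Assumption: the first bin starts at the earliest datetime in `rows`.
    Returns a list of [bin_start, [labels...]], including empty bins.
    """
    if not rows:
        return []

    df = pd.DataFrame(rows, columns=['Datetime', 'Label'])
    df['Datetime'] = pd.to_datetime(df['Datetime'])

    start = df['Datetime'].min()
    # Integer bin number per row: floor((t - start) / delta)
    bin_index = ((df['Datetime'] - start) // delta).rename('bin')

    # Collect the labels of every non-empty bin
    labels_by_bin = df.groupby(bin_index)['Label'].apply(list).to_dict()

    # Walk over all bin numbers so that empty bins appear as well
    n_bins = int(bin_index.max()) + 1
    return [[start + i * delta, labels_by_bin.get(i, [])] for i in range(n_bins)]


delta = datetime.timedelta(hours=1)
sample = [
    ['2014-08-09 16:30:00', 'label1'],
    ['2014-08-09 15:30:00', 'label2'],
    ['2014-08-09 14:30:00', 'label3'],
    ['2014-08-09 14:00:00', 'label4'],
]
result = bin_with_pandas(sample, delta)
for bin_start, labels in result:
    print(bin_start, labels)
```

    Unlike the nearest-edge lookup in bin_from_tstamp, this sketch assigns each timestamp to the half-open interval [bin_start, bin_start + delta) that contains it.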
    
        2
  •  2
  •   DrV    10 years ago

    Something like the following should do it:

    # data: a list of [datetime, label] pairs, assumed to be sorted by datetime
    # res: resulting list of [bin start, labels] pairs
    # delta: bin width (datetime.timedelta)
    
    # output list (will be a list of lists, as in the question)
    
    res = []
    # start of the first bin:
    binstart = data[0][0]
    res.append([binstart, []])
    
    # iterate through the data items
    for d in data:
        # if the data item belongs to the current bin, append it to that bin
        if d[0] < binstart + delta:
            res[-1][1].append(d[1])
            continue
    
        # otherwise, create new empty bins until this data fits into a bin
        binstart += delta
        # use >= so that an item exactly on a bin boundary goes into the later bin
        while d[0] >= binstart + delta:
            res.append([binstart, []])
            binstart += delta
    
        # create a bin with the data
        res.append([binstart, [d[1]]])
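
    A compact variant of the same idea, offered as a sketch rather than as the answer's own method: in Python 3 a timedelta can be floor-divided by another timedelta, which gives the bin number of each item directly. The function name bin_labels is hypothetical. It sorts the input first, and it assumes that the first bin starts at the earliest datetime and that delta is positive.

```python
import datetime


def bin_labels(data, delta):
    """Return [[bin_start, [labels...]], ...] with equally spaced bins.

    Assumptions (not from the original answer): the first bin starts at the
    earliest datetime in `data`, and `delta` is a positive timedelta.
    """
    if not data:
        return []

    # Sort by datetime so the earliest item defines the first bin
    items = sorted(data, key=lambda item: item[0])
    start = items[0][0]

    # timedelta // timedelta yields an int: the number of whole bins spanned
    n_bins = (items[-1][0] - start) // delta + 1
    res = [[start + i * delta, []] for i in range(n_bins)]

    # Drop each label into the bin whose half-open interval contains it
    for stamp, label in items:
        res[(stamp - start) // delta][1].append(label)
    return res


d0 = datetime.datetime(2014, 8, 9, 14, 0)
delta = datetime.timedelta(hours=1)
data = [
    [d0, 'label1'],
    [d0 + datetime.timedelta(minutes=30), 'label2'],
    [d0 + datetime.timedelta(hours=3), 'label3'],
]
res = bin_labels(data, delta)
for bin_start, labels in res:
    print(bin_start, labels)
```

    With this sample, the first bin holds label1 and label2, the next two bins are empty, and label3 (which falls exactly on a bin boundary) opens the fourth bin, matching the shape requested in the question.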