代码之家  ›  专栏  ›  技术社区  ›  Stacey

比较dataframe日期列并返回新列

  •  1
  • Stacey  · 技术社区  · 6 年前

    我有一个数据框 df 由一列标识符和四列日期组成。请参见下面的数据帧头部。

       Identifier                    A \
    0       12253  1989-11-09 00:00:00   
    1       11413  1990-09-03 00:00:00   
    2       12337  1977-09-07 00:00:00   
    3       10610  1994-08-24 00:00:00   
    4        7493  1993-08-22 00:00:00   
    
                          B                              C  \
    0   2012-03-01 00:00:00            2015-04-01 00:00:00   
    1   2009-08-16 00:00:00            2015-05-18 00:00:00   
    2   1977-09-07 00:00:00            1977-09-07 00:00:00   
    3   2009-01-09 00:00:00            2015-03-01 00:00:00   
    4   2002-06-03 00:00:00            2015-02-16 00:00:00   
    
                          D  \
    0   2012-01-03 00:00:00   
    1   2015-05-18 00:00:00   
    2   1977-09-07 00:00:00   
    3   2015-03-01 00:00:00   
    4   2015-02-16 00:00:00 
    

    我试图将日期列相互比较,如果A>=B>=C>=然后在一个新的列中,check True应该返回,否则返回False。

    因此,生成的数据帧将如下所示:

       Identifier                    A \
    0       12253  1989-11-09 00:00:00   
    1       11413  1990-09-03 00:00:00   
    2       12337  1977-09-07 00:00:00   
    3       10610  1994-08-24 00:00:00   
    4        7493  1993-08-22 00:00:00   
    
                          B                              C  \
    0   2012-03-01 00:00:00            2015-04-01 00:00:00   
    1   2009-08-16 00:00:00            2015-05-18 00:00:00   
    2   1977-09-07 00:00:00            1977-09-07 00:00:00   
    3   2009-01-09 00:00:00            2015-03-01 00:00:00   
    4   2002-06-03 00:00:00            2015-02-16 00:00:00   
    
                          D       Check\
    0   2012-01-03 00:00:00       False
    1   2015-05-18 00:00:00       True
    2   1977-09-07 00:00:00       True 
    3   2015-03-01 00:00:00       True
    4   2015-02-16 00:00:00       True
    

    我试过了

    df['Check'] = np.where(df['A'] >= df['B']>= df['C']>= df['D'], 'True', 'False')
    

    但我没法让它发挥作用。下一步我能试试什么?

    3 回复  |  直到 4 年前
        1
  •  2
  •   jezrael    6 年前

    使用 DataFrame.eval :

    df['Check'] = df.eval('A <= B <= C <= D')
    print (df)
       Identifier          A          B          C          D  Check
    0       12253 1989-11-09 2012-03-01 2015-04-01 2012-01-03  False
    1       11413 1990-09-03 2009-08-16 2015-05-18 2015-05-18   True
    2       12337 1977-09-07 1977-09-07 1977-09-07 1977-09-07   True
    3       10610 1994-08-24 2009-01-09 2015-03-01 2015-03-01   True
    4        7493 1993-08-22 2002-06-03 2015-02-16 2015-02-16   True
    
        2
  •  1
  •   Community holdenweb    4 年前

    使用df。diff(),类似于 np.diff() 并检查是否有任何元素低于pd。时间差(0)。

    df['check'] = ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1)
    

    完整示例:

    import pandas as pd
    import numpy
    
    np.random.seed(333)
    
    # Random dates from:  https://stackoverflow.com/questions/50559078/
    def pp(start, end, n):
        start_u = start.value//10**9
        end_u = end.value//10**9
        return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n)).view('M8[ns]'))
    
    n = 10
    
    df = pd.DataFrame({
        'A': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n),
        'B': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n),
        'C': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n),
        'D': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n)
    })
    
    df['check'] = ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1)
    print(df)
    

    返回:

                        A                   B                   C  \
    0 2018-07-30 04:54:04 2018-03-13 00:28:13 2018-08-24 11:01:29   
    1 2018-12-26 21:22:20 2018-09-23 14:25:11 2018-08-19 07:21:59   
    2 2018-04-29 17:15:57 2018-05-28 12:35:35 2018-10-16 00:19:11   
    3 2018-12-11 06:56:35 2018-08-15 00:12:12 2018-08-05 23:47:08   
    4 2018-03-04 11:00:03 2018-07-03 07:22:30 2018-09-09 01:45:09   
    5 2018-08-22 03:24:30 2018-12-17 17:38:34 2018-01-29 13:02:29   
    6 2018-04-21 01:10:14 2018-06-09 20:37:08 2018-04-30 12:30:00   
    7 2018-06-27 18:40:46 2018-09-15 10:26:06 2018-05-13 03:51:36   
    8 2018-03-18 06:31:24 2018-11-10 06:24:12 2018-02-25 02:58:15   
    9 2018-11-08 17:52:19 2018-03-27 01:02:12 2018-03-06 00:10:02   
    
                        D  check  
    0 2018-07-30 16:16:03  False  
    1 2018-07-21 23:38:59  False  
    2 2018-10-25 03:46:37   True  
    3 2018-12-01 07:43:53  False  
    4 2018-12-07 16:11:31   True  
    5 2018-09-17 14:58:20  False  
    6 2018-07-02 09:36:35  False  
    7 2018-03-16 23:21:27  False  
    8 2018-10-30 11:24:01  False  
    9 2018-04-03 12:17:52  False
    

    标杆管理

    %timeit ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1)
    %timeit df.eval('A <= B <= C <= D')    
    

    10000行:

    #1000 loops, best of 3: 1.58 ms per loop
    #100 loops, best of 3: 3.31 ms per loop
    

    10.000.000行:

    #1 loop, best of 3: 2.27 s per loop
    #1 loop, best of 3: 243 ms per loop
    
        3
  •  0
  •   ipramusinto    6 年前

    也许有更聪明的方法,但这应该行得通

    df['Check'] = (df['A'] <= df['B']) & (df['B']<= df['C']) & (df['C'] <= df['D'])