
Check whether a string ends with a column's value, then convert the start of the string to a date for comparison

  • daiyue  ·  6 years ago

    I have a df,

    cluster_id    amount    inv_id            inv_date
    1             309.9     07121830990       2018-07-12
    1             309.9     07121830990       2018-07-12
    2             3130.0    20180501313000B   2018-05-01
    2             3130.0    20180501313000B   2018-05-01
    3             3330.50   201804253330.50   2018-04-25             
    3             3330.50   201804253330.50   2018-04-25
    4             70.0      61518             2018-06-15
    4             70.0      61518             2018-06-15
    5             100.0     011318            2018-01-13
    5             100.0     011318            2018-01-13
    6             50.0      12202017          2017-12-20
    6             50.0      12202017          2017-12-20
    7             101.0     0000014482        2017-10-01
    7             101.0     0000014482        2017-10-01
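
    For reference, a minimal reconstruction of this frame (dtypes are assumptions; inv_date is parsed to datetime here, and every row is duplicated as in the sample above):

    import pandas as pd

    df = pd.DataFrame({
        'cluster_id': [1, 2, 3, 4, 5, 6, 7],
        'amount': [309.9, 3130.0, 3330.50, 70.0, 100.0, 50.0, 101.0],
        'inv_id': ['07121830990', '20180501313000B', '201804253330.50',
                   '61518', '011318', '12202017', '0000014482'],
        'inv_date': pd.to_datetime(['2018-07-12', '2018-05-01', '2018-04-25',
                                    '2018-06-15', '2018-01-13', '2017-12-20',
                                    '2017-10-01']),
    })
    # every cluster_id appears twice in the sample data
    df = df.loc[df.index.repeat(2)].reset_index(drop=True)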
    

    I want to create a boolean column dummy_inv_id via groupby on cluster_id, where dummy_inv_id is True for a group if either,

    1. inv_id (stripped of non-numerics) ends with amount, and the remaining part of inv_id can be coerced into a valid date within +/- 180 days of inv_date
    

    2. inv_id (stripped of non-numerics) can be coerced into a date within +/- 180 days of inv_date
    

    First I stripped the non-numeric characters from inv_id and grouped by cluster_id,

    df['inv_id_stp'] = df.inv_id.str.replace(r'\D+', '', regex=True)
    grouped = df.groupby('cluster_id')
    

    Then I converted amount * 100 to a string to make matching easier,

    df['amount'] = df['amount']*100
    df['amt_str'] = df['amount'].apply(str) 
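
    Note that apply(str) on a float column keeps the decimal point (and possibly floating-point noise), e.g. '30990.0' rather than '30990', so the result will not match the end of the digit-only inv_id. A minimal sketch of the difference (the round/int conversion is the one used in the answer below):

    s = pd.Series([309.9, 3130.0]) * 100
    print(s.apply(str).tolist())                        # strings keep '.0' / precision noise
    print(s.round().astype(int).astype(str).tolist())   # ['30990', '313000']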
    

    e.g. 309.9 should give '30990' and 3130.0 should give '313000'. Here I am wondering how to check whether inv_id ends with amount and the remaining part of inv_id can be converted to a datetime within +/- 180 days of inv_date, or whether inv_id can be converted to a date directly. In particular, the dates come in several formats, i.e.

    071218 - 2018-07-12
    20180501 - 2018-05-01
    61518 - 2018-06-15
    12202017 - 2017-12-20
    0000014482 - cannot be converted to a date
    

    The resulting df would look like,

    cluster_id    amount    inv_id            inv_date      dummy_inv_id
    1             309.9     07121830990       2018-07-12    True
    1             309.9     07121830990       2018-07-12    True
    2             3130.0    20180501313000B   2018-05-01    True
    2             3130.0    20180501313000B   2018-05-01    True
    3             3330.50   201804253330.50   2018-04-25    True          
    3             3330.50   201804253330.50   2018-04-25    True
    4             70.0      61518             2018-06-15    True
    4             70.0      61518             2018-06-15    True
    5             100.0     011318            2018-01-13    True
    5             100.0     011318            2018-01-13    True
    6             50.0      12202017          2017-12-20    True
    6             50.0      12202017          2017-12-20    True
    7             101.0     0000014482        2017-10-01    False
    7             101.0     0000014482        2017-10-01    False
    
    1 Answer

  •   jezrael  ·  6 years ago

    The idea is to create a helper dictionary of possible datetime formats, each mapped to the number of characters to slice, and to convert them in a list comprehension - errors='coerce' produces NaT for values that do not match:

    import pandas as pd
    from functools import reduce
    
    #strip non-digits and left-pad with zeros to at least length 6
    s = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)
    
    formats = {'%m%d%y':6, 
               '%y%m%d':6,
               '%Y%m%d':8, 
               '%m%d%Y':8}
    L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats.items()]
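
    For instance, each element of L behaves like this scalar sketch - the first six digits of '07121830990' parse cleanly, '61518' only fits the 6-character formats after the zfill, and the all-zero prefix of '0000014482' is coerced to NaT:

    print(pd.to_datetime('071218', format='%m%d%y', errors='coerce'))           # 2018-07-12 00:00:00
    print(pd.to_datetime('61518'.zfill(6), format='%m%d%y', errors='coerce'))   # 2018-06-15 00:00:00
    print(pd.to_datetime('000001', format='%m%d%y', errors='coerce'))           # NaT (month '00' is invalid)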
    

    Then replace datetimes outside the range 2000-01-01 to now with NaT:

    L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]
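
    This window matters because combine_first below prefers earlier formats: without it, '12202017'[:6] = '122020' would win as a spurious 2020-12-20 under '%m%d%y' instead of falling back to the correct '%m%d%Y' parse of the full string. A small sketch, using a fixed upper bound instead of now() for reproducibility (a fixed or inv_date-derived bound also avoids this parse slipping through once now() is past 2020):

    x = pd.to_datetime(pd.Series(['122020']), format='%m%d%y', errors='coerce')
    print(x[0])                                               # 2020-12-20 00:00:00
    print(x.where(x.between('2000-01-01', '2018-12-31'))[0])  # NaT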
    

    Then join all the Series together with Series.combine_first:

    s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
    print (s2)
    0    2018-07-12
    1    2018-07-12
    2    2018-05-01
    3    2018-05-01
    4    2018-04-25
    5    2018-04-25
    6    2018-06-15
    7    2018-06-15
    8    2018-01-13
    9    2018-01-13
    10   2017-12-20
    11   2017-12-20
    12          NaT
    13          NaT
    Name: inv_id, dtype: datetime64[ns]
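
    reduce folds the list from left to right, so formats listed earlier in the dictionary take priority; combine_first keeps the caller's value wherever it is not NaT and only falls back to the other Series. A tiny sketch:

    a = pd.Series([pd.NaT, pd.Timestamp('2018-05-01')])
    b = pd.Series([pd.Timestamp('2017-12-20'), pd.Timestamp('2020-01-01')])
    print(a.combine_first(b).tolist())   # position 0 falls back to b, position 1 keeps a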
    

    Last, check that the parsed date is within +/- 180 days of inv_date:

    inv_date = pd.to_datetime(df['inv_date'])
    df['new'] = s2.between(inv_date - pd.Timedelta(180, unit='d'), inv_date + pd.Timedelta(180, unit='d'))
    

    print (df)
        cluster_id  amount           inv_id    inv_date    new
    0            1   309.9      07121830990  2018-07-12   True
    1            1   309.9      07121830990  2018-07-12   True
    2            2  3130.0  20180501313000B  2018-05-01   True
    3            2  3130.0  20180501313000B  2018-05-01   True
    4            3  3330.5  201804253330.50  2018-04-25   True
    5            3  3330.5  201804253330.50  2018-04-25   True
    6            4    70.0            61518  2018-06-15   True
    7            4    70.0            61518  2018-06-15   True
    8            5   100.0           011318  2018-01-13   True
    9            5   100.0           011318  2018-01-13   True
    10           6    50.0         12202017  2017-12-20   True
    11           6    50.0         12202017  2017-12-20   True
    12           7   101.0       0000014482  2017-10-01  False
    13           7   101.0       0000014482  2017-10-01  False
    

    Added a solution that first removes the amount substring from the end of inv_id:

    import re
    import pandas as pd
    from functools import reduce
    
    df['amt_str'] = (df['amount']*100).round().astype(int).astype(str) 
    df['inv_str'] = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)
    
    #https://stackoverflow.com/a/1038845/2901002
    df['inv_str'] = df.apply(lambda x: re.sub('{}$'.format(x['amt_str']),'', x['inv_str']),axis=1)
    print (df)
        cluster_id  amount           inv_id    inv_date amt_str     inv_str
    0            1   309.9      07121830990  2018-07-12   30990      071218
    1            1   309.9      07121830990  2018-07-12   30990      071218
    2            2  3130.0  20180501313000B  2018-05-01  313000    20180501
    3            2  3130.0  20180501313000B  2018-05-01  313000    20180501
    4            3  3330.5  201804253330.50  2018-04-25  333050    20180425
    5            3  3330.5  201804253330.50  2018-04-25  333050    20180425
    6            4    70.0            61518  2018-06-15    7000      061518
    7            4    70.0            61518  2018-06-15    7000      061518
    8            5   100.0           011318  2018-01-13   10000      011318
    9            5   100.0           011318  2018-01-13   10000      011318
    10           6    50.0         12202017  2017-12-20    5000    12202017
    11           6    50.0         12202017  2017-12-20    5000    12202017
    12           7   101.0       0000014482  2017-10-01   10100  0000014482
    13           7   101.0       0000014482  2017-10-01   10100  0000014482
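
    The '$' anchor in re.sub means the amount digits are only removed when they really form a suffix; ids that do not end with the amount (like cluster 7) are left untouched. A quick sketch:

    import re

    print(re.sub('{}$'.format('30990'), '', '07121830990'))   # 071218 - suffix removed
    print(re.sub('{}$'.format('10100'), '', '0000014482'))    # 0000014482 - no match, unchanged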
    

    formats = {'%m%d%y':6, 
               '%y%m%d':6,
               '%Y%m%d':8, 
               '%m%d%Y':8}
    L=[pd.to_datetime(df['inv_str'].str[:v],format=k, errors='coerce') for k,v in formats.items()]
    L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]
    
    s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
    inv_date = pd.to_datetime(df['inv_date'])
    df['new'] = s2.between(inv_date - pd.Timedelta(180, unit='d'), inv_date + pd.Timedelta(180, unit='d'))
    print (df)
        cluster_id  amount           inv_id    inv_date amt_str     inv_str    new
    0            1   309.9      07121830990  2018-07-12   30990      071218   True
    1            1   309.9      07121830990  2018-07-12   30990      071218   True
    2            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
    3            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
    4            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
    5            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
    6            4    70.0            61518  2018-06-15    7000      061518   True
    7            4    70.0            61518  2018-06-15    7000      061518   True
    8            5   100.0           011318  2018-01-13   10000      011318   True
    9            5   100.0           011318  2018-01-13   10000      011318   True
    10           6    50.0         12202017  2017-12-20    5000    12202017   True
    11           6    50.0         12202017  2017-12-20    5000    12202017   True
    12           7   101.0       0000014482  2017-10-01   10100  0000014482  False
    13           7   101.0       0000014482  2017-10-01   10100  0000014482  False
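
    If a single per-group flag named dummy_inv_id is wanted, as in the question, the row-wise result can then be folded per cluster_id - a sketch (both rows of each cluster are identical here, so all and any agree):

    df['dummy_inv_id'] = df.groupby('cluster_id')['new'].transform(lambda g: g.all())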