
Check whether a string ends with a column's value, then convert the start of the string to a date for comparison

  • daiyue  ·  6 years ago

    I have a df,

    cluster_id    amount    inv_id            inv_date
    1             309.9     07121830990       2018-07-12
    1             309.9     07121830990       2018-07-12
    2             3130.0    20180501313000B   2018-05-01
    2             3130.0    20180501313000B   2018-05-01
    3             3330.50   201804253330.50   2018-04-25             
    3             3330.50   201804253330.50   2018-04-25
    4             70.0      61518             2018-06-15
    4             70.0      61518             2018-06-15
    5             100.0     011318            2018-01-13
    5             100.0     011318            2018-01-13
    6             50.0      12202017          2017-12-20
    6             50.0      12202017          2017-12-20
    7             101.0     0000014482        2017-10-01
    7             101.0     0000014482        2017-10-01
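
    For reference, a minimal reconstruction of this frame (dtypes are assumptions; inv_date is parsed to datetime here, and every row is duplicated as in the sample above):

    import pandas as pd

    df = pd.DataFrame({
        'cluster_id': [1, 2, 3, 4, 5, 6, 7],
        'amount': [309.9, 3130.0, 3330.50, 70.0, 100.0, 50.0, 101.0],
        'inv_id': ['07121830990', '20180501313000B', '201804253330.50',
                   '61518', '011318', '12202017', '0000014482'],
        'inv_date': pd.to_datetime(['2018-07-12', '2018-05-01', '2018-04-25',
                                    '2018-06-15', '2018-01-13', '2017-12-20',
                                    '2017-10-01']),
    })
    # every cluster_id appears twice in the sample data
    df = df.loc[df.index.repeat(2)].reset_index(drop=True)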
    

    I want to create a boolean column dummy_inv_id via groupby on cluster_id, where dummy_inv_id is True for a group if either,

    1. inv_id (stripped of non-numerics) ends with amount, and the remaining part of inv_id can be coerced into a valid date within +/- 180 days of inv_date
    

    2. inv_id (stripped of non-numerics) can be coerced into a date within +/- 180 days of inv_date
    

    First I stripped the non-numeric characters from inv_id and grouped by cluster_id,

    df['inv_id_stp'] = df.inv_id.str.replace(r'\D+', '', regex=True)
    grouped = df.groupby('cluster_id')
    

    Then I converted amount * 100 to a string to make matching easier,

    df['amount'] = df['amount']*100
    df['amt_str'] = df['amount'].apply(str) 
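
    Note that apply(str) on a float column keeps the decimal point (and possibly floating-point noise), e.g. '30990.0' rather than '30990', so the result will not match the end of the digit-only inv_id. A minimal sketch of the difference (the round/int conversion is the one used in the answer below):

    s = pd.Series([309.9, 3130.0]) * 100
    print(s.apply(str).tolist())                        # strings keep '.0' / precision noise
    print(s.round().astype(int).astype(str).tolist())   # ['30990', '313000']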
    

    e.g. 309.9 should give '30990' and 3130.0 should give '313000'. Here I am wondering how to check whether inv_id ends with amount and the remaining part of inv_id can be converted to a datetime within +/- 180 days of inv_date, or whether inv_id can be converted to a date directly. In particular, the dates come in several formats, i.e.

    071218 - 2018-07-12
    20180501 - 2018-05-01
    61518 - 2018-06-15
    12202017 - 2017-12-20
    0000014482 - cannot be converted to a date
    

    The resulting df would look like,

    cluster_id    amount    inv_id            inv_date      dummy_inv_id
    1             309.9     07121830990       2018-07-12    True
    1             309.9     07121830990       2018-07-12    True
    2             3130.0    20180501313000B   2018-05-01    True
    2             3130.0    20180501313000B   2018-05-01    True
    3             3330.50   201804253330.50   2018-04-25    True          
    3             3330.50   201804253330.50   2018-04-25    True
    4             70.0      61518             2018-06-15    True
    4             70.0      61518             2018-06-15    True
    5             100.0     011318            2018-01-13    True
    5             100.0     011318            2018-01-13    True
    6             50.0      12202017          2017-12-20    True
    6             50.0      12202017          2017-12-20    True
    7             101.0     0000014482        2017-10-01    False
    7             101.0     0000014482        2017-10-01    False
    
    1 Answer

  •   jezrael  ·  6 years ago

    The idea is to create a helper dictionary of possible datetime formats, each mapped to the number of characters to slice, and to convert them in a list comprehension - errors='coerce' produces NaT for values that do not match:

    import pandas as pd
    from functools import reduce
    
    #strip non-digits and left-pad with zeros to at least length 6
    s = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)
    
    formats = {'%m%d%y':6, 
               '%y%m%d':6,
               '%Y%m%d':8, 
               '%m%d%Y':8}
    L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats.items()]
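
    For instance, each element of L behaves like this scalar sketch - the first six digits of '07121830990' parse cleanly, '61518' only fits the 6-character formats after the zfill, and the all-zero prefix of '0000014482' is coerced to NaT:

    print(pd.to_datetime('071218', format='%m%d%y', errors='coerce'))           # 2018-07-12 00:00:00
    print(pd.to_datetime('61518'.zfill(6), format='%m%d%y', errors='coerce'))   # 2018-06-15 00:00:00
    print(pd.to_datetime('000001', format='%m%d%y', errors='coerce'))           # NaT (month '00' is invalid)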
    

    Then replace datetimes outside the range 2000-01-01 to now with NaT:

    L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]
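
    This window matters because combine_first below prefers earlier formats: without it, '12202017'[:6] = '122020' would win as a spurious 2020-12-20 under '%m%d%y' instead of falling back to the correct '%m%d%Y' parse of the full string. A small sketch, using a fixed upper bound instead of now() for reproducibility (a fixed or inv_date-derived bound also avoids this parse slipping through once now() is past 2020):

    x = pd.to_datetime(pd.Series(['122020']), format='%m%d%y', errors='coerce')
    print(x[0])                                               # 2020-12-20 00:00:00
    print(x.where(x.between('2000-01-01', '2018-12-31'))[0])  # NaT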
    

    Then join all the Series together with Series.combine_first:

    s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
    print (s2)
    0    2018-07-12
    1    2018-07-12
    2    2018-05-01
    3    2018-05-01
    4    2018-04-25
    5    2018-04-25
    6    2018-06-15
    7    2018-06-15
    8    2018-01-13
    9    2018-01-13
    10   2017-12-20
    11   2017-12-20
    12          NaT
    13          NaT
    Name: inv_id, dtype: datetime64[ns]
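
    reduce folds the list from left to right, so formats listed earlier in the dictionary take priority; combine_first keeps the caller's value wherever it is not NaT and only falls back to the other Series. A tiny sketch:

    a = pd.Series([pd.NaT, pd.Timestamp('2018-05-01')])
    b = pd.Series([pd.Timestamp('2017-12-20'), pd.Timestamp('2020-01-01')])
    print(a.combine_first(b).tolist())   # position 0 falls back to b, position 1 keeps a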
    

    Last, check that the parsed date is within +/- 180 days of inv_date:

    inv_date = pd.to_datetime(df['inv_date'])
    df['new'] = s2.between(inv_date - pd.Timedelta(180, unit='d'), inv_date + pd.Timedelta(180, unit='d'))
    

    print (df)
        cluster_id  amount           inv_id    inv_date    new
    0            1   309.9      07121830990  2018-07-12   True
    1            1   309.9      07121830990  2018-07-12   True
    2            2  3130.0  20180501313000B  2018-05-01   True
    3            2  3130.0  20180501313000B  2018-05-01   True
    4            3  3330.5  201804253330.50  2018-04-25   True
    5            3  3330.5  201804253330.50  2018-04-25   True
    6            4    70.0            61518  2018-06-15   True
    7            4    70.0            61518  2018-06-15   True
    8            5   100.0           011318  2018-01-13   True
    9            5   100.0           011318  2018-01-13   True
    10           6    50.0         12202017  2017-12-20   True
    11           6    50.0         12202017  2017-12-20   True
    12           7   101.0       0000014482  2017-10-01  False
    13           7   101.0       0000014482  2017-10-01  False
    

    Added a solution that first removes the amount substring from the end of inv_id:

    import re
    import pandas as pd
    from functools import reduce
    
    df['amt_str'] = (df['amount']*100).round().astype(int).astype(str) 
    df['inv_str'] = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)
    
    #https://stackoverflow.com/a/1038845/2901002
    df['inv_str'] = df.apply(lambda x: re.sub('{}$'.format(x['amt_str']),'', x['inv_str']),axis=1)
    print (df)
        cluster_id  amount           inv_id    inv_date amt_str     inv_str
    0            1   309.9      07121830990  2018-07-12   30990      071218
    1            1   309.9      07121830990  2018-07-12   30990      071218
    2            2  3130.0  20180501313000B  2018-05-01  313000    20180501
    3            2  3130.0  20180501313000B  2018-05-01  313000    20180501
    4            3  3330.5  201804253330.50  2018-04-25  333050    20180425
    5            3  3330.5  201804253330.50  2018-04-25  333050    20180425
    6            4    70.0            61518  2018-06-15    7000      061518
    7            4    70.0            61518  2018-06-15    7000      061518
    8            5   100.0           011318  2018-01-13   10000      011318
    9            5   100.0           011318  2018-01-13   10000      011318
    10           6    50.0         12202017  2017-12-20    5000    12202017
    11           6    50.0         12202017  2017-12-20    5000    12202017
    12           7   101.0       0000014482  2017-10-01   10100  0000014482
    13           7   101.0       0000014482  2017-10-01   10100  0000014482
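
    The '$' anchor in re.sub means the amount digits are only removed when they really form a suffix; ids that do not end with the amount (like cluster 7) are left untouched. A quick sketch:

    import re

    print(re.sub('{}$'.format('30990'), '', '07121830990'))   # 071218 - suffix removed
    print(re.sub('{}$'.format('10100'), '', '0000014482'))    # 0000014482 - no match, unchanged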
    

    formats = {'%m%d%y':6, 
               '%y%m%d':6,
               '%Y%m%d':8, 
               '%m%d%Y':8}
    L=[pd.to_datetime(df['inv_str'].str[:v],format=k, errors='coerce') for k,v in formats.items()]
    L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]
    
    s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
    inv_date = pd.to_datetime(df['inv_date'])
    df['new'] = s2.between(inv_date - pd.Timedelta(180, unit='d'), inv_date + pd.Timedelta(180, unit='d'))
    print (df)
        cluster_id  amount           inv_id    inv_date amt_str     inv_str    new
    0            1   309.9      07121830990  2018-07-12   30990      071218   True
    1            1   309.9      07121830990  2018-07-12   30990      071218   True
    2            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
    3            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
    4            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
    5            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
    6            4    70.0            61518  2018-06-15    7000      061518   True
    7            4    70.0            61518  2018-06-15    7000      061518   True
    8            5   100.0           011318  2018-01-13   10000      011318   True
    9            5   100.0           011318  2018-01-13   10000      011318   True
    10           6    50.0         12202017  2017-12-20    5000    12202017   True
    11           6    50.0         12202017  2017-12-20    5000    12202017   True
    12           7   101.0       0000014482  2017-10-01   10100  0000014482  False
    13           7   101.0       0000014482  2017-10-01   10100  0000014482  False
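
    If a single per-group flag named dummy_inv_id is wanted, as in the question, the row-wise result can then be folded per cluster_id - a sketch (both rows of each cluster are identical here, so all and any agree):

    df['dummy_inv_id'] = df.groupby('cluster_id')['new'].transform(lambda g: g.all())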