Recreating the dataframe (thanks for using the `StringIO` approach):
In [82]: df4['RB'].values
Out[82]: array([46, 47, 47, 48, 48, 48, 50, 50, 50])
In [83]: test(46)
Out[83]: array([42, 42, 42, 42, 42, 42, 42, 42, 42])
In [84]: test(50)
Out[84]: 1
In [85]: [test(i) for i in df4['RB'].values]
Out[85]:
[array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
1,
1,
1]
In [86]: vfunc=np.vectorize(test)
In [87]: vfunc(df4['RB'].values)
TypeError: only size-1 arrays can be converted to Python scalars
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-87-8db8cd5dc5ab>", line 1, in <module>
vfunc(df4['RB'].values)
File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
return self._vectorize_call(func=func, args=vargs)
File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2249, in _vectorize_call
res = asanyarray(outputs, dtype=otypes[0])
ValueError: setting an array element with a sequence.
Note the full traceback. `vectorize` is having trouble creating the return array from this mix of array sizes. It guessed, based on a trial calculation, that it should return an `int` dtype.
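The dtype guess can be seen in isolation with a much simpler function. The sketch below is only an illustration: `half_step` is a hypothetical function, not the `test` function from the question. It returns an integer for the first input, so `vectorize` settles on an integer dtype and the later float results lose their fractional part unless `otypes` is given.

```python
import numpy as np


def half_step(x):
    # Hypothetical function: returns an integer for small inputs, a float otherwise.
    return x if x < 2 else x + 0.5


values = np.array([1, 2, 3])

# No otypes: the dtype is taken from a trial call on the first element (an integer).
guessed = np.vectorize(half_step)(values)

# Explicit otypes: no guessing, the float results are kept.
explicit = np.vectorize(half_step, otypes=[float])(values)

print(guessed.dtype, guessed)
print(explicit.dtype, explicit)
```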
If we tell it to return an object dtype array, we get:
In [88]: vfunc=np.vectorize(test, otypes=['object'])
In [89]: vfunc(df4['RB'].values)
Out[89]:
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
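For readers without the original dataframe, the same behaviour can be reproduced with a small stand-in. This is a sketch under assumptions: `mixed` is a hypothetical replacement for `test` that returns an array for some inputs and a scalar for others, and `rb` plays the role of `df4['RB'].values`.

```python
import numpy as np


def mixed(i):
    # Hypothetical stand-in for `test`: an array for some inputs, a scalar for others.
    if i < 50:
        return np.full(3, 50 - i)
    return 1


rb = np.array([46, 48, 50])

# Without otypes, vectorize tries to pack the mixed results into an int array and fails.
try:
    np.vectorize(mixed)(rb)
    failed = False
except (ValueError, TypeError):
    failed = True

# With an object dtype, each result is stored as-is in its own slot.
result = np.vectorize(mixed, otypes=[object])(rb)

print(failed)
print(result)
```

The result is a 1-d object array with one entry per input; the entries themselves keep whatever shape `mixed` returned.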
We can assign that object dtype array (`_`, the last output in IPython) to a df column:
In [90]: df4['n']=_
In [91]: df4
Out[91]:
contract RB BeginDate ... 49 50 n
2 A00118 46 19850100 ... 7 7 [42, 42, 42, 42, 42, 42, 42, 42, 42]
3 A00118 47 19000100 ... 7 7 [21, 21, 21, 21, 21, 21, 21, 21, 21]
5 A00118 47 19850100 ... 7 7 [21, 21, 21, 21, 21, 21, 21, 21, 21]
6 A00253 48 19000100 ... 7 7 [7, 7, 7, 7, 7, 7, 7, 7, 7]
7 A00253 48 19820100 ... 7 7 [7, 7, 7, 7, 7, 7, 7, 7, 7]
8 A00253 48 19850100 ... 7 7 [7, 7, 7, 7, 7, 7, 7, 7, 7]
9 A00253 50 19000100 ... 7 7 1
10 A00253 50 19790100 ... 7 7 1
11 A00253 50 19850100 ... 7 7 1
We could also assign the `Out[85]` list:
df4['n']=Out[85]
The timings are about the same:
In [94]: timeit vfunc(df4['RB'].values)
211 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: timeit [test(i) for i in df4['RB'].values]
217 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
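Usually `vectorize` is slower, but `test` itself is probably slow enough that the iteration method doesn't make much difference. Remember (reread the docs if necessary), `vectorize` is not a performance tool. It does not "compile" your function or make it run faster.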
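One way to see why the timings match is to count how often the Python function actually runs. The sketch below assumes a hypothetical counting wrapper (`make_counter`, `doubled`); it is not part of the original code. With `otypes` given, `vectorize` calls the function once per element, exactly like the list comprehension.

```python
import numpy as np


def make_counter():
    # Hypothetical helper: returns a function plus a dict recording its call count.
    count = {"n": 0}

    def doubled(x):
        count["n"] += 1
        return x * 2

    return doubled, count


data = np.arange(5)

# Plain Python iteration: one call per element.
loop_func, loop_count = make_counter()
looped = [loop_func(x) for x in data]

# np.vectorize with otypes given: still one Python-level call per element.
vec_func, vec_count = make_counter()
vectorized = np.vectorize(vec_func, otypes=[int])(data)

print(loop_count["n"], vec_count["n"])
```

Both counters end up equal to the number of elements, so the per-element Python call overhead is the same in both approaches.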
Another way to return an object dtype array is:
In [96]: vfunc=np.frompyfunc(test,1,1)
In [97]: vfunc(df4['RB'].values)
Out[97]:
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
In [98]: timeit vfunc(df4['RB'].values)
202 µs ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
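A self-contained sketch of the `frompyfunc` route, again with hypothetical stand-ins (`mixed`, `remaining`, `rb`) rather than the original `test` and dataframe. The two trailing arguments are the number of inputs and outputs. `frompyfunc` always returns object dtype; when every result is a plain scalar, the array can be converted afterwards with `astype`.

```python
import numpy as np


def mixed(i):
    # Hypothetical stand-in for `test`: an array for some inputs, a scalar for others.
    if i < 50:
        return np.full(3, 50 - i)
    return 1


def remaining(i):
    # Hypothetical scalar-only function, to show the astype conversion.
    return 50 - i


rb = np.array([46, 48, 50])

# frompyfunc(func, nin, nout): one input, one output, object dtype result.
mixed_result = np.frompyfunc(mixed, 1, 1)(rb)

# With uniform scalar results, the object array converts cleanly to a numeric dtype.
scalar_result = np.frompyfunc(remaining, 1, 1)(rb).astype(int)

print(mixed_result)
print(scalar_result)
```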