pandas numpy: "setting an array element with a sequence" when doing a math operation

  • William  · 3 years ago

    I have a DataFrame named df4; you can build it with the following code:

    import numpy as np
    import pandas as pd
    from io import StringIO

    df4s = """
    contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
    2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
    3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
    5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
    6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
    7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
    8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
    9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
    10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
    11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7
    
    """
    
    # The header has one fewer field than the data rows, so the first column becomes the index.
    df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                      dtype={"RB": int, "BeginDate": int, "EndDate": int, 'ValIssueDate': int, 'Valindex0': int})
    

    The result is:

    contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
    2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
    3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
    5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
    6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
    7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
    8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
    9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
    10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
    11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7
    

    I am trying to build a new column with the following logic; its value depends on the values of existing columns:

    def test(RB):
        n=1
        for i in np.arange(RB,50):
            n = n * df4[str(i)].values
        return  n
    
    
    vfunc=np.vectorize(test)
    df4['n']=vfunc(df4['RB'].values)
    

    Then I get this error:

        res = array(outputs, copy=False, subok=True, dtype=otypes[0])
    
    ValueError: setting an array element with a sequence.
    
  •   hpaulj    3 years ago

    Rebuilding the DataFrame (thanks for using the StringIO approach):

    In [82]: df4['RB'].values
    Out[82]: array([46, 47, 47, 48, 48, 48, 50, 50, 50])
    In [83]: test(46)
    Out[83]: array([42, 42, 42, 42, 42, 42, 42, 42, 42])
    In [84]: test(50)
    Out[84]: 1
    In [85]: [test(i) for i in df4['RB'].values]
    Out[85]: 
    [array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
     array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
     array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
     array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
     array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
     array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
     1,
     1,
     1]
    In [86]: vfunc=np.vectorize(test)
    In [87]: vfunc(df4['RB'].values)
    TypeError: only size-1 arrays can be converted to Python scalars
    
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
      File "<ipython-input-87-8db8cd5dc5ab>", line 1, in <module>
        vfunc(df4['RB'].values)
      File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
        return self._vectorize_call(func=func, args=vargs)
      File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2249, in _vectorize_call
        res = asanyarray(outputs, dtype=otypes[0])
    ValueError: setting an array element with a sequence.
    

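    Note the full traceback. vectorize has trouble building the return array from this collection of arrays and scalars of mixed sizes. Based on a trial calculation with the first element, it guessed that it should return an int dtype, and the array results cannot be stored as single integers.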
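    The same failure can be reproduced without the DataFrame. The sketch below is only an illustration: f is a hypothetical stand-in for test, returning a length-2 array for some inputs and a scalar for others. The exact exception text may differ between NumPy versions, so the code catches both exception types seen in the traceback.

```python
import numpy as np

def f(x):
    # Hypothetical stand-in for test(): an array for x < 50, a scalar otherwise.
    return np.arange(2) * x if x < 50 else 1

rb = np.array([46, 50])

# Without otypes, vectorize evaluates f on the first element and infers
# the output dtype from that trial result.
print(np.asarray(f(rb[0])).dtype)  # an integer dtype

try:
    np.vectorize(f)(rb)
    failed = False
except (TypeError, ValueError) as err:
    # The array result cannot be stored in a single integer slot.
    failed = True
    print(type(err).__name__, err)

# With an explicit object dtype, each slot may hold an array or a scalar.
out = np.vectorize(f, otypes=[object])(rb)
print(out.dtype, len(out))
```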

    If we tell it to return an object dtype array, we get:

    In [88]: vfunc=np.vectorize(test, otypes=['object'])
    In [89]: vfunc(df4['RB'].values)
    Out[89]: 
    array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
           array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
           array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
           array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
           array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
           array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
    

    We can assign that to a df column:

    In [90]: df4['n']=_
    In [91]: df4
    Out[91]: 
       contract  RB  BeginDate  ...  49  50                                     n
    2    A00118  46   19850100  ...   7   7  [42, 42, 42, 42, 42, 42, 42, 42, 42]
    3    A00118  47   19000100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
    5    A00118  47   19850100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
    6    A00253  48   19000100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
    7    A00253  48   19820100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
    8    A00253  48   19850100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
    9    A00253  50   19000100  ...   7   7                                     1
    10   A00253  50   19790100  ...   7   7                                     1
    11   A00253  50   19850100  ...   7   7                                     1
    

    We could also assign the Out[85] list:

    df4['n']=Out[85]
    

    The timings are about the same:

    In [94]: timeit vfunc(df4['RB'].values)
    211 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    In [95]: timeit [test(i) for i in df4['RB'].values]
    217 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

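    Usually vectorize is slower, but test itself is probably slow enough that the choice of iteration method makes little difference. Remember (reread the docs if necessary) that vectorize is not a performance tool. It does not "compile" your function, and it does not make it run faster.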

    Another way to return an object dtype array is:

    In [96]: vfunc=np.frompyfunc(test,1,1)
    In [97]: vfunc(df4['RB'].values)
    Out[97]: 
    array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
           array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
           array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
           array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
           array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
           array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
    In [98]: timeit vfunc(df4['RB'].values)
    202 µs ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
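
    All three variants above loop in Python, which is why their timings match. If the goal is actually one number per row (the product of that row's columns from RB up to 49), the loop can be dropped with a boolean mask. This is a sketch under that assumption, which the question does not confirm. The small df frame is made-up stand-in data with the same column layout, and RB is assumed to lie between 46 and 50.

```python
import numpy as np
import pandas as pd

# Small stand-in for df4 (hypothetical data, same column layout as the question).
df = pd.DataFrame({
    "RB": [46, 47, 48, 50],
    "48": [1, 1, 1, 1],
    "46": [2, 2, 2, 2],
    "47": [3, 3, 3, 3],
    "49": [7, 7, 7, 7],
    "50": [7, 7, 7, 7],
})

# Columns that test() can multiply: np.arange(RB, 50) stops before 50.
col_numbers = np.arange(46, 50)
values = df[[str(c) for c in col_numbers]].to_numpy()  # shape (n_rows, 4)

# mask[r, c] is True where column c takes part in the product for row r.
mask = col_numbers >= df["RB"].to_numpy()[:, None]

# Swap excluded entries for 1 (the multiplicative identity), then multiply along each row.
df["n"] = np.where(mask, values, 1).prod(axis=1)
print(df["n"].tolist())  # [42, 21, 7, 1]
```

    Each row gets a scalar here, so the column keeps an integer dtype instead of object, and the per-row values agree with the 42, 21, 7, 1 pattern seen in Out[85].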