代码之家  ›  专栏  ›  技术社区  ›  Sachin Myneni

正在删除丢失数据的组

  •  1
  • Sachin Myneni  · 技术社区  · 7 年前

    我有一个熊猫数据框,有1000行,看起来像这样:

    >>> y=x.groupby('wbdqueue_id')
    >>> y.head()
               id                        jname  orderid  wbdqueue_id  platform_id  \
    59    1341127    ondemand_build_baspen-w7g       15        26581    1341122.0
    60    1341126       ondemand_qa_qforchecka       41        26581    1341125.0
    61    1341125       ondemand_build_bchecka       17        26581    1341123.0
    63    1341123    ondemand_build_baspen-w7f       14        26581    1341122.0
    64    1341122   ondemand_update_waspen-w7a        2        26581    1341073.0
    116   1340927              qa_db_insertall       56        26578          NaN
    117   1340926      ondemand_qa_qca-pc20rha       39        26578    1340925.0
    118   1340925   ondemand_build_bca-pc20rha       16        26578    1340924.0
    119   1340924  ondemand_update_wca-pc20rha        3        26578    1340871.0
    120   1340923    ondemand_qa_qaspen-w7_qa2       35        26578    1340922.0
    173   1340870              qa_db_insertall       56        26577          NaN
    174   1340869           ondemand_qa_qtopia       52        26577    1340868.0
    175   1340868        ondemand_build_btopia       33        26577    1340867.0
    176   1340867       ondemand_update_wtopia        9        26577    1340814.0
    177   1340866            ondemand_qa_qmoed       47        26577    1340865.0
    230   1340813              qa_db_insertall       56        26576          NaN
    231   1340812            ondemand_qa_qmoeb       46        26576    1340811.0
    232   1340811         ondemand_build_bmoed       22        26576    1340810.0
    233   1340810         ondemand_build_bmoee       23        26576    1340809.0
    234   1340809        ondemand_update_wmoeb        5        26576    1340757.0
    287   1340756              qa_db_insertall       56        26575          NaN
    293   1340750      ondemand_qa_qca-pc20rha       39        26575    1340749.0
    294   1340749   ondemand_build_bca-pc20rha       16        26575    1340748.0
    295   1340748  ondemand_update_wca-pc20rha        3        26575    1340700.0
    296   1340747      ondemand_qa_qvmwin7-64i       55        26575    1340746.0
    344   1340699              qa_db_insertall       56        26574          NaN
    345   1340698          ondemand_qa_qslotha       51        26574    1340697.0
    346   1340697       ondemand_build_bmousef       28        26574    1340684.0
    347   1340696           ondemand_qa_qboarb       38        26574    1340695.0
    348   1340695       ondemand_build_bmouseg       29        26574    1340684.0
    ...       ...                          ...      ...          ...          ...
    9659  1327031              qa_db_insertall       56        26311          NaN
    9660  1327030       ondemand_qa_qforchecka       41        26311    1327029.0
    9661  1327029       ondemand_build_bchecka       17        26311    1327027.0
    9662  1327028    ondemand_qa_qaspen-w7_qa1       34        26311    1327027.0
    9663  1327027    ondemand_build_baspen-w7f       14        26311    1327024.0
    9716  1326974              qa_db_insertall       56        26310          NaN
    9717  1326973            ondemand_qa_qmoeb       46        26310    1326972.0
    9718  1326972         ondemand_build_bmoed       22        26310    1326971.0
    9719  1326971         ondemand_build_bmoee       23        26310    1326970.0
    9720  1326970        ondemand_update_wmoeb        5        26310    1326918.0
    9773  1326917              qa_db_insertall       56        26309          NaN
    9774  1326916           ondemand_qa_qtopia       52        26309    1326915.0
    9775  1326915        ondemand_build_btopia       33        26309    1326914.0
    9776  1326914       ondemand_update_wtopia        9        26309    1326861.0
    9777  1326913    ondemand_qa_qaspen-w7_qa2       35        26309    1326912.0
    9830  1326860              qa_db_insertall       56        26308          NaN
    9831  1326859    ondemand_build_balder-w7d       12        26308    1326852.0
    9832  1326858      ondemand_qa_qvmwin7-64i       55        26308    1326857.0
    9833  1326857    ondemand_build_balder-w7h       13        26308    1326852.0
    9834  1326856      ondemand_qa_qvmwin7-64d       54        26308    1326855.0
    9887  1326803              qa_db_insertall       56        26307          NaN
    9888  1326802    ondemand_qa_qaspen-w7_qa1       34        26307    1326799.0
    9889  1326801       ondemand_qa_qforchecka       41        26307    1326800.0
    9890  1326800       ondemand_build_bchecka       17        26307    1326799.0
    9891  1326799    ondemand_build_baspen-w7f       14        26307    1326796.0
    9944  1326746              qa_db_insertall       56        26306          NaN
    9950  1326740         ondemand_qa_qkrakena       43        26306    1326737.0
    9951  1326739          ondemand_qa_qkirina       42        26306    1326738.0
    9952  1326738       ondemand_build_bkirina       18        26306    1326737.0
    9953  1326737      ondemand_build_bkrakena       19        26306    1326736.0
    
               startdatetime         enddatetime  runtime
    59   2017-07-31 23:14:56 2017-07-31 23:19:12 00:04:16
    60   2017-07-31 23:15:35 2017-07-31 23:34:12 00:18:37
    61   2017-07-31 23:14:56 2017-07-31 23:15:30 00:00:34
    63   2017-07-31 23:10:05 2017-07-31 23:14:56 00:04:51
    64   2017-07-31 23:09:32 2017-07-31 23:10:00 00:00:28
    116  2017-07-31 21:42:28 2017-07-31 21:42:55 00:00:27
    117  2017-07-31 21:10:15 2017-07-31 21:17:46 00:07:31
    118  2017-07-31 21:09:37 2017-07-31 21:10:10 00:00:33
    119  2017-07-31 21:09:22 2017-07-31 21:09:32 00:00:10
    120  2017-07-31 21:17:57 2017-07-31 21:33:22 00:15:25
    173  2017-07-31 21:05:15 2017-07-31 21:05:17 00:00:02
    174  2017-07-31 20:27:07 2017-07-31 20:27:19 00:00:12
    175  2017-07-31 20:27:00 2017-07-31 20:27:02 00:00:02
    176  2017-07-31 20:26:52 2017-07-31 20:26:54 00:00:02
    177  2017-07-31 20:48:56 2017-07-31 20:48:59 00:00:03
    230  2017-07-31 21:04:50 2017-07-31 21:05:15 00:00:25
    231  2017-07-31 20:20:02 2017-07-31 20:23:24 00:03:22
    232  2017-07-31 20:19:43 2017-07-31 20:19:57 00:00:14
    233  2017-07-31 20:19:11 2017-07-31 20:19:38 00:00:27
    234  2017-07-31 20:17:26 2017-07-31 20:18:49 00:01:23
    287  2017-07-31 20:58:36 2017-07-31 20:59:00 00:00:24
    293  2017-07-31 20:08:57 2017-07-31 20:16:33 00:07:36
    294  2017-07-31 20:07:44 2017-07-31 20:08:52 00:01:08
    295  2017-07-31 20:07:06 2017-07-31 20:07:39 00:00:33
    296  2017-07-31 20:36:23 2017-07-31 20:36:58 00:00:35
    344  2017-07-31 20:38:11 2017-07-31 20:38:35 00:00:24
    345  2017-07-31 19:58:32 2017-07-31 20:00:04 00:01:32
    346  2017-07-31 19:57:48 2017-07-31 19:58:26 00:00:38
    347  2017-07-31 19:59:17 2017-07-31 20:04:59 00:05:42
    348  2017-07-31 19:58:26 2017-07-31 19:59:12 00:00:46
    ...                  ...                 ...      ...
    9659 2017-07-18 18:44:05 2017-07-18 18:44:29 00:00:24
    9660 2017-07-18 18:08:17 2017-07-18 18:25:53 00:17:36
    9661 2017-07-18 18:07:35 2017-07-18 18:08:12 00:00:37
    9662 2017-07-18 18:07:40 2017-07-18 18:15:01 00:07:21
    9663 2017-07-18 18:03:20 2017-07-18 18:07:35 00:04:15
    9716 2017-07-18 18:10:53 2017-07-18 18:11:15 00:00:22
    9717 2017-07-18 17:40:52 2017-07-18 17:44:15 00:03:23
    9718 2017-07-18 17:40:29 2017-07-18 17:40:48 00:00:19
    9719 2017-07-18 17:39:51 2017-07-18 17:40:25 00:00:34
    9720 2017-07-18 17:37:42 2017-07-18 17:39:23 00:01:41
    9773 2017-07-18 16:45:15 2017-07-18 16:45:39 00:00:24
    9774 2017-07-18 16:02:44 2017-07-18 16:06:24 00:03:40
    9775 2017-07-18 16:02:09 2017-07-18 16:02:39 00:00:30
    9776 2017-07-18 16:01:55 2017-07-18 16:02:04 00:00:09
    9777 2017-07-18 16:10:46 2017-07-18 16:26:18 00:15:32
    9830 2017-07-18 16:10:22 2017-07-18 16:10:46 00:00:24
    9831 2017-07-18 15:41:47 2017-07-18 15:44:07 00:02:20
    9832 2017-07-18 16:01:19 2017-07-18 16:01:55 00:00:36
    9833 2017-07-18 15:44:07 2017-07-18 15:46:02 00:01:55
    9834 2017-07-18 15:41:52 2017-07-18 15:42:36 00:00:44
    9887 2017-07-18 15:02:56 2017-07-18 15:03:20 00:00:24
    9888 2017-07-18 14:34:26 2017-07-18 14:41:15 00:06:49
    9889 2017-07-18 14:35:03 2017-07-18 14:51:17 00:16:14
    9890 2017-07-18 14:34:21 2017-07-18 14:34:58 00:00:37
    9891 2017-07-18 14:30:03 2017-07-18 14:34:22 00:04:19
    9944 2017-07-18 13:50:01 2017-07-18 13:50:25 00:00:24
    9950 2017-07-18 13:18:38 2017-07-18 13:39:37 00:20:59
    9951 2017-07-18 13:19:35 2017-07-18 13:24:42 00:05:07
    9952 2017-07-18 13:18:38 2017-07-18 13:19:30 00:00:52
    9953 2017-07-18 13:17:16 2017-07-18 13:18:33 00:01:17
    

    我想,对于每个wbdqueue\u id,得到 ondemand_update_waspen-w7a和 结束日期时间 以下是拟议的解决方案,如果不是缺少数据,其中任何一种都会非常有效:

    df.set_index('jname').groupby('wbdqueue_id').apply(
        lambda x: x.at['ondemand_update_waspen-w7a', 'startdatetime'] \
                - x.at['ondemand_build_baspen-w7f', 'enddatetime'] )
    

    def get_time_diff(dff):
        start_time = dff[dff.jname.eq('ondemand_update_waspen-w7a')].startdatetime.values[0]
        end_time = dff[dff.jname.eq('ondemand_build_baspen-w7g')].enddatetime.values[0]
        return pd.Timedelta(end_time - start_time)
    

    如何跳过或删除没有所需数据的组?这似乎不是一个简单的方法。

    1 回复  |  直到 7 年前
        1
  •  1
  •   Abdou    7 年前

    通过添加 try-except 要处理缺少指定字符串的组,请执行以下操作:

    def get_time_diff(dff):
        try:
            start_time = dff[dff.jname.eq('ondemand_update_waspen-w7a')].startdatetime.values[0]
            end_time = dff[dff.jname.eq('ondemand_build_baspen-w7g')].enddatetime.values[0]
            return pd.Timedelta(end_time - start_time)
        except (KeyError, IndexError):
            return 0
    

    出于调试目的,我放置了 尝试例外

    def get_time_diff2(dff, str1='ondemand_update_waspen-w7a', str2='ondemand_build_baspen-w7g'):
        missing = ""
        try:
            start_time = dff[dff.jname.eq(str1)].startdatetime.values[0]
        except (KeyError, IndexError):
            missing += "{} is not in startdatetime. ".format(str1)
        try:
            end_time = dff[dff.jname.eq(str2)].enddatetime.values[0]
        except (KeyError, IndexError):
            missing += "{} is not in enddatetime".format(str2)
        if missing:
            return missing
        return pd.Timedelta(end_time - start_time)
    

    如果将此函数与数据帧和两个字符串(如“hello”和“world”)一起使用,它将告诉您在应用数据帧的特定组中是否缺少这两个字符串或其中一个字符串 get_time_diff 作用请参见以下内容:

    df.groupby('wbdqueue_id').apply(get_time_diff2, 'hello', 'world')
    

    应返回:

    # wbdqueue_id
    # 26578    hello is not in startdatetime. world is not in enddatetime
    # 26581    hello is not in startdatetime. world is not in enddatetime
    # dtype: object
    

    如果其中只有一个丢失:

    df.groupby('wbdqueue_id').apply(get_time_diff2, 'ondemand_update_waspen-w7a', 'hello')
    

    wbdqueue_id
    26578    hello is not in enddatetime
    26581    hello is not in enddatetime
    dtype: object