Why does a pandas 2min bucket print NaN although all my row values are numbers (not NaN)?

  • Jas  · 6 years ago

    I know there are no NaN values in the response_bytes column of my data, because when I run data[data.response_bytes.isna()].count() the result is 0.

    When I compute a 2-minute bucket mean and then call head(), I get NaN:

    print(data.reset_index().set_index('time').resample('2min').mean().head())
    
                         index  identity  user  http_code  response_bytes  unknown
    time                                                                          
    2018-01-31 09:26:00    0.5       NaN   NaN      200.0           264.0      NaN
    2018-01-31 09:28:00    NaN       NaN   NaN        NaN             NaN      NaN
    2018-01-31 09:30:00    NaN       NaN   NaN        NaN             NaN      NaN
    2018-01-31 09:32:00    NaN       NaN   NaN        NaN             NaN      NaN
    2018-01-31 09:34:00    NaN       NaN   NaN        NaN             NaN      NaN
    

    Why does the 2-minute bucketed mean of response_bytes contain NaN values?

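    I wanted to experiment and learn about time bucketing in pandas. I used the log file http://www.cs.tufts.edu/comp/116/access.log as input data, loaded it into a pandas DataFrame, applied 2-minute time buckets (the first time I have ever done this), and ran mean(). I did not expect to see NaN in the response_bytes column, because none of its values are NaN.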

    Here is my full code:

    import urllib.request
    import pandas as pd
    import re
    from datetime import datetime
    import pytz
    
    pd.set_option('display.max_columns', 10)  # full option name; the short form is ambiguous in newer pandas
    
    def parse_str(x):
        """
        Returns the string delimited by two characters.
    
        Example:
            `>>> parse_str('[my string]')`
            `'my string'`
        """
        return x[1:-1]
    
    def parse_datetime(x):
        '''
        Parses datetime with timezone formatted as:
            `[day/month/year:hour:minute:second zone]`
    
        Example:
            `>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
            `datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)`
    
        Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
        timezone will be obtained using the `pytz` library.
        '''
        dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
        dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
        return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
    
    # data = pd.read_csv(StringIO(accesslog))
    url = "http://www.cs.tufts.edu/comp/116/access.log"
    accesslog =  urllib.request.urlopen(url).read().decode('utf-8')
    fields = ['host', 'identity', 'user', 'time_part1', 'time_part2', 'cmd_path_proto', 
              'http_code', 'response_bytes', 'referer', 'user_agent', 'unknown']
    
    data = pd.read_csv(url, sep=' ', header=None, names=fields, na_values=['-'])
    
    # The pandas parser splits the date into two columns at the space, so we must concatenate them
    time = data.time_part1 + data.time_part2
    time_trimmed = time.map(lambda s: re.split('[-+]', s.strip('[]'))[0]) # Drop the timezone for simplicity
    data['time'] = pd.to_datetime(time_trimmed, format='%d/%b/%Y:%H:%M:%S')
    
    data.head()
    
    print(data.reset_index().set_index('time').resample('2min').mean().head())
    

    I expected the mean of the response_bytes column not to be NaN for any time interval.

    1 answer
  • jezrael  · 6 years ago

    This is expected behavior: resampling converts the data to regular time intervals, so any interval that contains no samples gets NaN.

    In other words, some 2-minute intervals contain no datetimes at all, e.g. between 2018-01-31 09:28:00 and 2018-01-31 09:30:00, so mean has nothing to compute and returns NaN for them.

    print (data[data['time'].between('2018-01-31 09:28:00','2018-01-31 09:30:00')])
    Empty DataFrame
    Columns: [host, identity, user, time_part1, time_part2, cmd_path_proto,
              http_code, response_bytes, referer, user_agent, unknown, time]
    Index: []
    
    [0 rows x 12 columns]
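
    The same effect can be reproduced on a small example. The sketch below uses a hypothetical toy frame (the timestamps and byte counts are made up for illustration, not taken from the access log) with a gap between 09:28 and 09:32. Comparing count() with mean() shows that the NaN bins are exactly the empty ones, and dropna() is one way to keep only the bins that contain data.

```python
import pandas as pd

# Hypothetical toy data: two requests in the first 2-minute bin,
# none in the next two bins, then one more request.
toy = pd.DataFrame(
    {"response_bytes": [200, 328, 500]},
    index=pd.to_datetime(
        ["2018-01-31 09:26:10", "2018-01-31 09:27:30", "2018-01-31 09:32:05"]
    ),
)

resampled = toy.resample("2min")["response_bytes"]
means = resampled.mean()    # NaN for bins that contain no rows
counts = resampled.count()  # 0 for those same bins

# Side by side: every NaN mean lines up with a count of 0.
print(pd.DataFrame({"mean": means, "count": counts}))

# Keep only the bins that actually contain data.
non_empty = means.dropna()
print(non_empty)
```

    If the empty bins should stay in the result, filling them (for example with fillna(0)) is another option, though whether a zero is meaningful depends on what the mean is used for.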