
Python code slows down when wrapped in a function

  • ZeerakW  ·  asked 4 years ago

    I am reading in and processing a file with the same bit of code, and it runs at two drastically different speeds: 1. scripted (50K+ iterations per second) and 2. wrapped in a function (around 300 iterations per second). I really cannot figure out why there is such a huge difference in reading time.
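
    To make the comparison concrete, here is a minimal, self-contained sketch of the two set-ups, with a dummy per-line workload standing in for the real file processing (all the names here are placeholders, not my actual code):

        import time

        def process(line):
            # stand-in for the real per-line work (process_doc etc.)
            return line.strip().split()

        lines = ['a b c\n'] * 100_000

        # 1. Scripted: the loop runs at module level.
        start = time.perf_counter()
        for line in lines:
            process(line)
        print('scripted:', len(lines) / (time.perf_counter() - start), 'it/s')

        # 2. Wrapped: the same loop inside a function.
        def run(lines):
            for line in lines:
                process(line)

        start = time.perf_counter()
        run(lines)
        print('wrapped: ', len(lines) / (time.perf_counter() - start), 'it/s')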

    Module structure (unused and irrelevant files omitted; code at the end):

    | experiments/
    |--| experiment_runner.py
    |
    | module/
    |--| shared/
    |--|--| dataloaders.py
    |--|--| data.py
    

    In data.py we have a method (load, on a class that inherits from torch.utils.data.Dataset) which loads the files. In dataloaders.py, the loader function handles splitting the datasets and so on.

    When experiment_runner.py calls loader from dataloaders.py, loading happens at roughly 300 iterations/second. If I copy the code out of the function and put it directly into experiment_runner.py (so there is no function wrapped around each dataset), loading runs at roughly 50,000 iterations/second. I completely fail to see why wrapping the code in a function would change its speed so drastically.
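
    If anything, I would have expected the wrapped version to be slightly faster: in CPython, names local to a function are resolved with the array-indexed LOAD_FAST opcode, whereas module-level code resolves every name through dictionary lookups (LOAD_NAME / LOAD_GLOBAL). A generic illustration of this (not my code):

        import dis

        def wrapped():
            total = 0
            for i in range(1000):
                total += i  # total and i are locals -> LOAD_FAST / STORE_FAST
            return total

        dis.dis(wrapped)  # the same loop at module level would use LOAD_NAME instead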

    data.py:

        def load(self, dataset: str = 'train', skip_header = True, **kwargs) -> None:
            fp = open(self.data_files[dataset])
    
            if skip_header:
                next(fp)
    
            data = []
            for line in tqdm(self.reader(fp), desc = f'loading {self.name} ({dataset})'):
                data_line, datapoint = {}, base.Datapoint()
    
                for field in self.train_fields:
                    idx = field.index if self.ftype in ['CSV', 'TSV'] else field.cname
                    data_line[field.name] = self.process_doc(line[idx].rstrip())
                    data_line['original'] = line[idx].rstrip()
    
                for field in self.label_fields:
                    idx = field.index if self.ftype in ['CSV', 'TSV'] else field.cname
                    if self.label_preprocessor:
                        data_line[field.name] = self.label_preprocessor(line[idx].rstrip())
                    else:
                        data_line[field.name] = line[idx].rstrip()
    
                for key, val in data_line.items():
                    setattr(datapoint, key, val)
                data.append(datapoint)
            fp.close()
    
            if self.length is None:
                # Get the max length
                lens = []
                for doc in data:
                    for f in self.train_fields:
                        lens.append(len([tok for tok in getattr(doc, getattr(f, 'name'))]))
                self.length = max(lens)
    
            if dataset == 'train':
                self.data = data
            elif dataset == 'dev':
                self.dev = data
            elif dataset == 'test':
                self.test = data
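
    For reference, the cost of a single call can be checked independently of tqdm's displayed rate by timing load directly (a rough sketch; it assumes an already-constructed GeneralDataset instance named dataset):

        import time

        start = time.perf_counter()
        dataset.load('train')  # the method shown above
        elapsed = time.perf_counter() - start
        print(f'load() took {elapsed:.2f}s ({len(dataset.data) / elapsed:.0f} datapoints/s)')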
    

    def loader(args: dict, **kwargs):
        """Loads the dataset.
        :args (dict): Dict containing arguments to load the dataset.
        :returns: Loaded and split dataset.
        """
        dataset = GeneralDataset(**args)
        dataset.load('train', **kwargs)
    
        if (args['dev'], args['test']) == (None, None):  # Only train set is given.
            dataset.split(dataset.data, [0.8, 0.1, 0.1], **kwargs)
    
        elif args['dev'] is not None and args['test'] is None:  # Dev set is given, test is not.
            dataset.load('dev', **kwargs)
            dataset.split(dataset.data, [0.8], **kwargs)
    
        elif args['dev'] is None and args['test'] is not None:  # Test is given, dev is not.
            dataset.split(dataset.data, [0.8], **kwargs)
            dataset.dev_set = dataset.test
            dataset.load('test', **kwargs)
    
        else:  # Both dev and test sets are given.
            dataset.load('dev', **kwargs)
            dataset.load('test', **kwargs)
    
        return dataset
    
    
    def binarize(label: str) -> str:
        if label in ['0', '1']:
            return 'pos'
        else:
            return 'neg'
    
    
    def datal(path: str, cleaners: base.Callable, preprocessor: base.Callable = None):
        args = {'data_dir': path,
                'ftype': 'csv',
                'fields': None,
                'train': 'dataset.csv', 'dev': None, 'test': None,
                'train_labels': None, 'dev_labels': None, 'test_labels': None,
                'sep': ',',
                'tokenizer': lambda x: x.split(),
                'preprocessor': preprocessor,
                'transformations': None,
                'length': None,
                'label_preprocessor': binarize,
                'name': 'First dataset.'
                }
    
        ignore = base.Field('ignore', train = False, label = False, ignore = True)
        d_text = base.Field('text', train = True, label = False, ignore = False, ix = 6, cname = 'text')
        d_label = base.Field('label', train = False, label = True, cname = 'label', ignore = False, ix = 5)
    
        args['fields'] = [ignore, ignore, ignore, ignore, ignore, d_label, d_text]
    
        return loader(args)
    

    And in experiment_runner.py:

    from module.dataloaders import datal, loader
    
    dataset = datal(path, cleaners)  # Slow: 300-ish iterations/second
    
    # Fast version: 50000 iter/second
    def binarize(label: str) -> str:
        if label in ['0', '1']:
            return 'pos'
        else:
            return 'neg'
    
    args = {'data_dir': path,
            'ftype': 'csv',
            'fields': None,
            'train': 'dataset.csv', 'dev': None, 'test': None,
            'train_labels': None, 'dev_labels': None, 'test_labels': None,
            'sep': ',',
            'tokenizer': lambda x: x.split(),
            'preprocessor': preprocessor,
            'transformations': None,
            'length': None,
            'label_preprocessor': binarize,
            'name': 'First dataset.'
            }
    
    ignore = base.Field('ignore', train = False, label = False, ignore = True)
    d_text = base.Field('text', train = True, label = False, ignore = False, ix = 6, cname = 'text')
    d_label = base.Field('label', train = False, label = True, cname = 'label', ignore = False, ix = 5)
    
    args['fields'] = [ignore, ignore, ignore, ignore, ignore, d_label, d_text]
    
    dataset = loader(args)
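
    A profile of the slow call should show where the time actually goes; something along these lines (standard library only) would narrow it down:

        import cProfile
        import pstats

        cProfile.run('loader(args)', 'loader.prof')  # profile the slow, wrapped path
        pstats.Stats('loader.prof').sort_stats('cumulative').print_stats(20)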
    

    Ideally, I would like to keep the dataset functions (e.g. datal) wrapped so the logic stays separated, but with this slowdown that is not viable.

    0 answers  |  last activity 4 years ago