
Image array for classification in an sklearn pipeline - ValueError: setting an array element with a sequence

  •  1
  • Lomtrur EK.  · 7 years ago

    from pandas import DataFrame
    from scipy.misc import imread, imresize
    rows = []
    index = []  # row labels for the DataFrame; appended to in the loop below
    for product in products:
        try:
            relevant = product.categoryrelevant.all()[0].relevant
        except IndexError:
            relevant = False
        if relevant:
            relevant = "A"
        else:
            relevant = "B"
        # this exists for all pictures
        image_array = imread("{}/{}".format(MEDIA_ROOT, product.picture_file.url))
        image_array = imresize(image_array, (160, 160))
        image_array = image_array.reshape(-1)
        print(image_array)
        # [254 254 252 ..., 255 255 253]
        print(image_array.shape)
        # (76800,)
        rows.append({"id": product.pk, "image": image_array, "class": relevant})
        index.append(product)
    df = DataFrame(rows, index=index)
    

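    I am using the ItemSelector from this scikit-learn example:

    http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

    It takes the values in the "image" column. Alternatively, I could do train_X = df.iloc[train_indices]["image"].values, but I want to add other columns later.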
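For reference, a minimal sketch of what that ItemSelector does, written from memory of the linked example rather than copied from it. The toy DataFrame, with 8 values per image instead of 76800, is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(TransformerMixin, BaseEstimator):
    """Pick a single column out of a DataFrame (or a key out of a dict)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; the selector is stateless.
        return self

    def transform(self, data_dict):
        # Returns the column unchanged, i.e. a pandas Series for a DataFrame input.
        return data_dict[self.key]


# Hypothetical stand-in for the real data: 4 rows, 8 pixel values each.
rows = [{"id": i, "image": np.zeros(8, dtype=np.uint8), "class": "A"} for i in range(4)]
df = pd.DataFrame(rows)

selected = ItemSelector(key="image").transform(df)
print(type(selected).__name__, selected.shape)
```

The selector hands back one entry per row, and each entry is itself a whole array.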

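    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import FeatureUnion, Pipeline

    # ItemSelector is the custom transformer defined in the scikit-learn example linked above.
    def randomforest_image_pipeline():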
        """Returns a RandomForest pipeline."""
        return Pipeline([
            ("union", FeatureUnion(
                transformer_list=[
                    ("image", Pipeline([
                        ("selector", ItemSelector(key="image")),
                    ]))
                ],
                transformer_weights={
                    "image": 1.0
                },
            )),
            ("classifier", RandomForestClassifier()),
        ])
    

    from sklearn.model_selection import KFold

    def kfold(tested_pipeline=None, df=None, splits=6):
        k_fold = KFold(n_splits=splits)
        for train_indices, test_indices in k_fold.split(df):
            # training set
            train_X = df.iloc[train_indices]
            train_y = df.iloc[train_indices]['class'].values
            # test set
            test_X = df.iloc[test_indices]
            test_y = df.iloc[test_indices]['class'].values
            for val in train_X["image"]:
                print(len(val), val.dtype, val.shape)
                # 76800 uint8 (76800,) for all
            tested_pipeline.fit(train_X, train_y) # crashes in this call
            pipeline_predictions = tested_pipeline.predict(test_X)
            ...

    # called after the definition above
    kfold(tested_pipeline=randomforest_image_pipeline(), df=df)
    

    When calling .fit I get the following error:

    Traceback (most recent call last):
      File "<path>/project/classifier/classify.py", line 362, in <module>
        best = best_pipeline(dataframe=data, f1_scores=f1_dict, get_fp=True)
      File "<path>/project/classifier/classify.py", line 351, in best_pipeline
        confusion_list=confusion_list, get_fp=get_fp)
      File "<path>/project/classifier/classify.py", line 65, in kfold
        tested_pipeline.fit(train_X, train_y)
      File "/usr/local/lib/python3.5/dist-packages/sklearn/pipeline.py", line 270, in fit
        self._final_estimator.fit(Xt, y, **fit_params)
      File "/usr/local/lib/python3.5/dist-packages/sklearn/ensemble/forest.py", line 247, in fit
        X = check_array(X, accept_sparse="csc", dtype=DTYPE)
      File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py", line 382, in check_array
        array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: setting an array element with a sequence.
    

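    I found other people with the same problem, and for them the issue was that their rows had different lengths. That does not seem to be the case for me, since all rows are one-dimensional with length 76800: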

        for val in train_X["image"]:
            print(len(val), val.dtype, val.shape)
            # 76800 uint8 (76800,) for all
    

    The array variable on the line where it crashes looks like this:

    [array([ 255.,  255.,  255., ...,  255.,  255.,  255.])
     array([ 255.,  255.,  255., ...,  255.,  255.,  255.])
     array([ 255.,  255.,  255., ...,  255.,  255.,  255.]) ...,
     array([ 255.,  255.,  255., ...,  255.,  255.,  255.])
     array([ 255.,  255.,  255.
    

    What can I do to fix this?

    1 answer

  •  1
  •   Vivek Kumar    7 years ago

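    The error occurs because you are storing all of an image's data (i.e. 76800 features) in a single array, and that array is stored in one cell of a DataFrame column.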

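    So when ItemSelector selects that column, its output is a one-dimensional object array of shape (Train_len, ). The inner dimension of 76800 is hidden inside each element, so the classifier cannot convert the result into a numeric 2-D array.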
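The shape problem can be reproduced without scikit-learn. The following is a minimal sketch on made-up data (3 rows of 12 values instead of 76800); the variable names are illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd

# Three fake flattened "images", built row by row as in the question.
rows = [{"image": np.full(12, 255, dtype=np.uint8), "class": "A"} for _ in range(3)]
df = pd.DataFrame(rows)

# Selecting the column gives a 1-D object array whose elements are arrays.
column = df["image"].values
print(column.shape, column.dtype)

# Roughly what check_array attempts: cast the object array to a float dtype.
try:
    np.array(column, dtype=np.float32)
    conversion_failed = False
except (ValueError, TypeError) as exc:
    conversion_failed = True
    print("conversion failed:", exc)

# Stacking the per-row arrays yields the 2-D layout estimators expect.
stacked = np.stack(column)
print(stacked.shape, stacked.dtype)
```

Every row having the same length is necessary but not sufficient: the rows still have to be laid out as one 2-D numeric array instead of an array of arrays.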

    Change the transform() method of ItemSelector so that it returns a proper two-dimensional array of shape (Train_len, 76800). Only then will it work.

    Change it to:

    def transform(self, data_dict):
        return np.array([np.array(x) for x in data_dict[self.key]])
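To show the change in context, here is a hedged end-to-end sketch: an ItemSelector with the modified transform(), plugged into the same pipeline layout as in the question and fitted on synthetic data. The random 48-value "images", the labels, and the n_estimators and random_state settings are assumptions made only so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select one column and return it as a 2-D array (n_rows, n_features)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, data_dict):
        # Turn the column of per-row arrays into one 2-D array.
        return np.array([np.array(x) for x in data_dict[self.key]])


# Synthetic stand-in for the real DataFrame: 20 rows, 48 pixel values each.
rng = np.random.RandomState(0)
rows = [
    {"image": rng.randint(0, 256, size=48).astype(np.uint8),
     "class": "A" if i % 2 == 0 else "B"}
    for i in range(20)
]
df = pd.DataFrame(rows)

pipeline = Pipeline([
    ("union", FeatureUnion(
        transformer_list=[
            ("image", Pipeline([("selector", ItemSelector(key="image"))])),
        ],
        transformer_weights={"image": 1.0},
    )),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

pipeline.fit(df, df["class"].values)
predictions = pipeline.predict(df)
print(ItemSelector(key="image").transform(df).shape, predictions.shape)
```

Because the selector still takes the whole DataFrame as input, further transformers for other columns can be added to the FeatureUnion later, which is what the question asked for.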