
How can I take samples at random for validation?

  •  1
  •  Hagbard  ·  6 years ago

    I am currently training a Keras model whose corresponding fit call looks as follows:

    model.fit(X, y_train, batch_size=myBatchSize, epochs=myAmountOfEpochs, validation_split=0.1, callbacks=myCallbackList)
    

    This comment made me aware that the validation data is not necessarily taken from every class: Keras simply selects it from the last samples of the provided data, before any shuffling.

    My question now is: is there an easy way to randomly select, say, 10% of the training data as validation data? The reason I want randomly picked samples is that, in my case, the last 10% of the data does not necessarily contain all classes.

    6 replies  |  latest reply 4 years ago
        1
  •  3
  •   Dr. Snoopy    6 years ago

    Keras does not provide anything more advanced than just taking a fraction of the training data for validation. If you need something more advanced, such as stratified sampling to make sure the classes are well represented in the sample, then you need to do this manually outside of Keras (e.g. using scikit-learn or numpy) and then pass that validation data to model.fit through the validation_data parameter.
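
    As an illustration, a minimal sketch of such a manual split with plain numpy (a uniformly random, non-stratified split; the names X, y_train, myBatchSize, myAmountOfEpochs and myCallbackList are borrowed from the question, not from this answer):

    import numpy as np
    
    # Pick a random 10% of the indices for validation and train on the rest.
    indices = np.random.permutation(len(X))
    n_val = int(0.1 * len(X))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    
    model.fit(X[train_idx], y_train[train_idx],
              batch_size=myBatchSize, epochs=myAmountOfEpochs,
              validation_data=(X[val_idx], y_train[val_idx]),
              callbacks=myCallbackList)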

        2
  •  2
  •   Hagbard    6 years ago

    Thanks to Matias Valdenegro's answer, I was inspired to look a little further and came up with the following solution to my problem:

    from sklearn.model_selection import train_test_split
    
    # [input: X and Y]
    XTraining, XValidation, YTraining, YValidation = train_test_split(X, Y, stratify=Y, test_size=0.1)  # before model building
    
    # [The model is built here...]
    model.fit(XTraining, YTraining, batch_size=batchSize, epochs=amountOfEpochs, validation_data=(XValidation, YValidation), callbacks=callbackList)
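
    For reference, the stratify=Y argument is what makes train_test_split preserve the class proportions of Y in both splits, which is exactly what addresses the missing-classes issue from the question.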
    
        3
  •  1
  •   ljusten    4 years ago

    In this post I suggest a solution that uses the split-folders package to randomly split your main data directory into training and validation directories while maintaining the class subfolders. You can then use the Keras .flow_from_directory method to specify the training and validation paths.

    import split_folders
    
    # Split with a ratio.
    # To split only into training and validation sets, pass a tuple to `ratio`, i.e., `(.8, .2)`.
    split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values
    
    # Split val/test with a fixed number of items, e.g. 100 for each set.
    # To split only into training and validation sets, pass a single number to `fixed`, i.e., `10`.
    split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
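
    Note: on PyPI the package is published as split-folders; newer releases import as splitfolders instead of split_folders (with the same ratio and fixed functions), so the import line may need adjusting depending on the installed version.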
    

    The input folder should have the following format:

    input/
        class1/
            img1.jpg
            img2.jpg
            ...
        class2/
            imgWhatever.jpg
            ...
        ...
    

    in order to give you this:

    output/
        train/
            class1/
                img1.jpg
                ...
            class2/
                imga.jpg
                ...
        val/
            class1/
                img2.jpg
                ...
            class2/
                imgb.jpg
                ...
        test/            # optional
            class1/
                img3.jpg
                ...
            class2/
                imgc.jpg
                ...
    

    You can then use Keras' ImageDataGenerator to build the training and validation datasets:

    import tensorflow as tf
    import split_folders
    import os
    
    main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
    output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'
    
    split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))
    
    train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale=1./255)  # rescale pixel values from [0, 255] to [0, 1]
    
    train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                        class_mode='categorical',
                                                        batch_size=32,
                                                        target_size=(224,224),
                                                        shuffle=True)
    
    validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                            target_size=(224, 224),
                                                            batch_size=32,
                                                            class_mode='categorical',
                                                            shuffle=True) # set as validation data
    
    IMG_SHAPE = (224, 224, 3)  # matches the generators' 224x224 target_size, plus RGB channels
    
    base_model = tf.keras.applications.ResNet50V2(
        input_shape=IMG_SHAPE,
        include_top=False,
        weights=None)  # train from scratch instead of loading pretrained weights
    
    maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
    prediction_layer = tf.keras.layers.Dense(4, activation='softmax')
    
    model = tf.keras.Sequential([
        base_model,
        maxpool_layer,
        prediction_layer
    ])
    
    opt = tf.keras.optimizers.Adam(learning_rate=0.004)
    model.compile(optimizer=opt,
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=['accuracy'])
    
    model.fit(
        train_generator,
        steps_per_epoch = train_generator.samples // 32,
        validation_data = validation_generator,
        validation_steps = validation_generator.samples // 32,
        epochs = 20)
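
    Note that flow_from_directory infers the class labels from the subfolder names of the train and val directories, so the Dense(4, activation='softmax') output layer above assumes four class subfolders in the input data.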
    
        4
  •  0
  •   user358041    6 years ago

    According to the Keras Getting Started FAQ, you can use the shuffle argument in model.fit.

        5
  •  0
  •   Jason    4 years ago

    According to the documentation of the model.fit() arguments, validation_data will override validation_split, so there is no need to configure both at the same time:

    validation_split: Float between 0 and 1.
                Fraction of the training data to be used as validation data.
                The model will set apart this fraction of the training data,
                will not train on it, and will evaluate
                the loss and any model metrics
                on this data at the end of each epoch.
    
    validation_data: Data on which to evaluate
                the loss and any model metrics at the end of each epoch.
                The model will not be trained on this data.
                `validation_data` will override `validation_split`
    

    There is, however, also the shuffle argument:

    shuffle: Boolean (whether to shuffle the training data
                before each epoch) or str (for 'batch').
                'batch' is a special option for dealing with the
                limitations of HDF5 data; it shuffles in batch-sized chunks.
    

    So what you could do is:

    model.fit(**other_kwargs, validation_split=0.1, shuffle=True)
    
        6
  •  0
  •   Jason    4 years ago

    Say you have 1000 training samples, 100 test samples, validation_split=0.1 and batch_size=100. What it does is split on the training data (batch 1: 90 training and 10 validation, batch 2: 90 training and 10 validation, ..., all in the original order: 90, 10, 90, 10, ..., 90, 10). It has nothing to do with the 100 test samples (your model never sees them). So, I guess, you only want to shuffle all the size-10 validation sets without touching the size-90 training blocks. What I would probably do is manually shuffle the 10% portion of my data, because that is all shuffle=True does: it shuffles the indices and replaces the old training data with the data at the shuffled indices, like this:

    import numpy as np
    
    split = 0.1
    batch_size = 100
    n_val = int(split * batch_size)            # 10 validation samples per batch
    train_index = np.arange(1000, dtype=np.int32)
    num_batch = len(train_index) // batch_size
    train_index = np.reshape(train_index, (num_batch, batch_size))
    
    # Shuffle only the last `n_val` indices of each batch-sized chunk,
    # leaving the first 90 training indices of every chunk untouched.
    for i in range(num_batch):
        r = np.random.choice(n_val, n_val, replace=False)
        print(r)
        train_index[i, batch_size - n_val:] = r + (batch_size - n_val) + i * batch_size
        print(train_index[i])
    
    flatten_index = train_index.reshape(-1)
    print(flatten_index)
    
    # Reorder the training data with the partially shuffled indices.
    x_train = np.arange(1000, 2000)
    x_train = x_train[flatten_index]
    print(x_train)
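
    To tie this back to the question's fit call, a possible follow-up (a sketch, not part of the answer above; y_train is the label array from the question) is to reorder the labels with the same indices and keep using validation_split:

    # Sketch under the assumptions above: apply the same partially shuffled
    # index order to the labels, then let validation_split take the last 10%
    # of each batch as before.
    y_train_shuffled = y_train[flatten_index]
    model.fit(x_train, y_train_shuffled,
              batch_size=batch_size,
              validation_split=split,
              shuffle=False)  # the validation portion was already shuffled manually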