Explanation by Tilman Kamp
This is actually not a bug, because the current epoch is derived from your current parameters and the step count stored in the snapshot. Take a close look at the following excerpt:
# Number of GPUs per worker - fixed for now by local reality or cluster setup
gpus_per_worker = len(available_devices)
# Number of batches processed per job per worker
batches_per_job = gpus_per_worker * max(1, FLAGS.iters_per_worker)
# Number of batches per global step
batches_per_step = gpus_per_worker * max(1, FLAGS.replicas_to_agg)
# Number of global steps per epoch - to be at least 1
steps_per_epoch = max(1, model_feeder.train.total_batches // batches_per_step)
# The start epoch of our training
self._epoch = step // steps_per_epoch
So your set size during training differs from your current set size. That's where the strange epoch number comes from.
Simplified example (ignoring batch sizes): if you once trained for 5 epochs on a set of 1000 samples, you ended up with 5000 "global steps" (persisted as a number in the snapshot). After that training you changed the command-line parameters to a set of size 1 (your --limit_* parameters). "Suddenly" you see epoch 5000, because 5000 global steps now mean that a data set of size 1 was applied 5000 times.
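To make that arithmetic concrete, here is a minimal sketch (the helper name epoch_from_snapshot and the numbers are purely illustrative, and it assumes batches_per_step is 1 so that steps_per_epoch equals the set size):

def epoch_from_snapshot(step, total_batches, batches_per_step=1):
    # Same formula as in the excerpt: steps per epoch is at least 1
    steps_per_epoch = max(1, total_batches // batches_per_step)
    return step // steps_per_epoch

# Original training: 5 epochs over a set of 1000 samples -> 5000 global steps
step = 5 * 1000

# Epoch reported against the original set size
print(epoch_from_snapshot(step, total_batches=1000))  # 5

# Epoch reported after restricting the set to 1 sample via --limit_*
print(epoch_from_snapshot(step, total_batches=1))     # 5000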
Take-away: use the --checkpoint_dir argument to avoid problems like this.