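This example shows how to use an HDF5 file for the process you describe. First, create an HDF5 file with a dataset of shape (2_000_000, 2_000) and dtype=float64 values. I used variables for the dimensions so you can tinker with them.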
import numpy as np
import h5py
import random

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763.h5','w') as h5f:
    # h5py defaults to float32, so request float64 explicitly
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64')
    incr = 1_000        # number of blocks to write
    a0 = h5_a0//incr    # rows per block
    for i in range(incr):
        arr = np.random.random(a0*h5_a1).reshape(a0,h5_a1)
        dset[i*a0:i*a0+a0, :] = arr
    print(dset[-1,0:10])  # quick dataset check of values in last row
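Next, open the file in read mode, read 10_000 random array slices of shape (16, 2_000), and append them to a list L. Finally, convert the list to an array WINS. Note that by default the array will have 2 axes; you need to use .reshape() if you want 3 axes as in your comment (the reshape is also shown).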
with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    L = []
    ds0, ds1 = dset.shape[0], dset.shape[1]
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)  # randint includes both endpoints
        window = dset[ir:ir+16, :]  # window from dset of shape (16, 2000) starting at random row ir
        L.append(window)

WINS = np.concatenate(L)  # shape (160_000, 2_000) of float64
print(WINS.shape, WINS.dtype)
WINS = WINS.reshape(10_000, 16, ds1)  # reshaped to (10_000, 16, 2_000) of float64
print(WINS.shape, WINS.dtype)
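For a sense of scale, here is a rough back-of-the-envelope estimate of how much memory the window data needs. It is only a sketch that assumes float64 values (8 bytes each) and ignores Python and NumPy object overhead; the variable names are illustrative.

```python
# Rough size of the window data, assuming float64 (8 bytes per value).
n_windows, win_rows, n_cols = 10_000, 16, 2_000
bytes_per_value = 8  # float64

window_bytes = n_windows * win_rows * n_cols * bytes_per_value
print(f"one copy of the windows: {window_bytes / 1e9:.2f} GB")

# While np.concatenate(L) runs, the list L and the new array both exist.
print(f"list L plus array WINS:  {2 * window_bytes / 1e9:.2f} GB")
```

So each copy of the sliced data is about 2.56 GB, and holding the list and the array at the same time needs roughly twice that.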
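The procedure above is not memory efficient. You end up with two copies of the randomly sliced data: one in list L and one in array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read the random slices of data directly into an array. Doing this simplifies the code and reduces the memory footprint. That method is shown below (WINS2 is a 2-axis array and WINS3 is a 3-axis array).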
with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape[0], dset.shape[1]
    WINS2 = np.empty((10_000*16, ds1))   # 2-axis result, float64 by default
    WINS3 = np.empty((10_000, 16, ds1))  # 3-axis result
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        WINS2[i*16:(i+1)*16, :] = dset[ir:ir+16, :]
        WINS3[i, :, :] = dset[ir:ir+16, :]
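If you want to convince yourself that the two approaches give the same result before running them on the full file, a scaled-down check along these lines may help. This is only a sketch: the sizes, the temporary file name `small_test.h5`, and the variable names are made up for the test, and the start rows are drawn once up front so both methods read identical windows.

```python
import os
import random
import tempfile

import h5py
import numpy as np

# Small, hypothetical sizes so the check runs in a moment.
n_rows, n_cols, win, n_win = 1_000, 20, 16, 50

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small_test.h5")  # hypothetical file name
    with h5py.File(path, "w") as h5f:
        # Passing a float64 array as data= creates a float64 dataset.
        h5f.create_dataset("test", data=np.random.random((n_rows, n_cols)))

    # Draw the start rows once so both methods read the same windows.
    starts = [random.randint(0, n_rows - win) for _ in range(n_win)]

    with h5py.File(path, "r") as h5f:
        dset = h5f["test"]
        # Method 1: collect slices in a list, then concatenate and reshape.
        via_list = np.concatenate([dset[s:s + win, :] for s in starts])
        via_list = via_list.reshape(n_win, win, n_cols)
        # Method 2: read each slice straight into a preallocated array.
        direct = np.empty((n_win, win, n_cols), dtype=dset.dtype)
        for i, s in enumerate(starts):
            direct[i] = dset[s:s + win, :]

print(direct.shape, direct.dtype, np.array_equal(via_list, direct))
```

Both arrays should hold the same values; the difference is that the second method never keeps a list of intermediate slices in memory.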