ml4gw.dataloading.hdf5_dataset

Classes

Hdf5TimeSeriesDataset(fnames, channels, ...)

Iterable dataset that samples and loads windows of timeseries data uniformly from a set of HDF5 files.

Exceptions

ContiguousHdf5Warning

exception ml4gw.dataloading.hdf5_dataset.ContiguousHdf5Warning

Bases: Warning

class ml4gw.dataloading.hdf5_dataset.Hdf5TimeSeriesDataset(fnames, channels, kernel_size, batch_size, batches_per_epoch, coincident, num_files_per_batch=None)

Bases: IterableDataset

Iterable dataset that samples and loads windows of timeseries data uniformly from a set of HDF5 files. It is _strongly_ recommended that these files have been written using chunked storage, which has been shown to improve read speeds by over an order of magnitude.
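For reference, a minimal sketch of producing such files with h5py; the file name, channel names, sample rate, and chunk size below are illustrative assumptions, not requirements of ml4gw:

    import h5py
    import numpy as np

    sample_rate = 2048  # Hz; placeholder value
    duration = 1024     # seconds of data per channel; placeholder value
    strain = {
        "H1": np.random.randn(duration * sample_rate).astype("float32"),
        "L1": np.random.randn(duration * sample_rate).astype("float32"),
    }

    # Writing each channel as a chunked dataset (here roughly 1 s per chunk)
    # lets HDF5 read individual windows without loading the full timeseries.
    with h5py.File("background.hdf5", "w") as f:
        for channel, x in strain.items():
            f.create_dataset(channel, data=x, chunks=(sample_rate,))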

Parameters:
  • fnames (Sequence[str]) -- Paths to HDF5 files from which to sample data.

  • channels (Sequence[str]) -- Datasets to read from the indicated files, which will be stacked along dim 1 of the generated batches during iteration.

  • kernel_size (int) -- Size of the windows to read, in number of samples. This will be the size of the last dimension of the generated batches.

  • batch_size (int) -- Number of windows to sample at each iteration.

  • batches_per_epoch (int) -- Number of batches to generate during each call to __iter__.

  • coincident (Union[bool, str]) -- Whether windows for each channel in a given batch element should be sampled coincidentally, i.e. corresponding to the same time indices from the same files, or should be sampled independently. For the latter case, users can either specify False, which will sample filenames independently for each channel, or "files", which will sample windows independently within a given file for each channel. The latter setting limits the amount of entropy in the effective dataset, but can provide over 2x improvement in total throughput.

  • num_files_per_batch (Optional[int]) -- The number of unique files from which to sample batch elements each epoch. If left as None, all available files will be used. Useful when reading from many files creates a dataloading bottleneck.
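A minimal usage sketch based on the constructor signature above; the file names, channels, and sizes are placeholders, assuming 2048 Hz data and 2-second kernels:

    from ml4gw.dataloading.hdf5_dataset import Hdf5TimeSeriesDataset

    sample_rate = 2048  # Hz; placeholder value
    dataset = Hdf5TimeSeriesDataset(
        fnames=["background-1.hdf5", "background-2.hdf5"],
        channels=["H1", "L1"],
        kernel_size=2 * sample_rate,
        batch_size=32,
        batches_per_epoch=100,
        coincident=True,
    )

    # The dataset assembles full batches internally, so it can be iterated
    # over directly; each batch has shape (batch_size, len(channels), kernel_size).
    for X in dataset:
        ...

If a torch.utils.data.DataLoader is layered on top of this dataset, batch_size=None is the natural setting, since batching already happens inside the dataset itself.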

sample_batch()

Sample a single batch of multichannel timeseries.

Return type:

Float[Tensor, 'batch num_ifos time']
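For instance, reusing the dataset constructed in the sketch above, a single batch can be drawn outside of iteration; the shape follows the return type annotation:

    X = dataset.sample_batch()
    # X.shape == (batch_size, len(channels), kernel_size), i.e. (32, 2, 4096) here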

sample_fnames(size)

Return type:

ndarray
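A brief illustration of the return type, assuming sample_fnames is used to draw file paths for a batch; the size value is arbitrary:

    fnames = dataset.sample_fnames(size=8)  # numpy array of sampled file paths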