ml4gw.dataloading package

Submodules

ml4gw.dataloading.chunked_dataset module

class ml4gw.dataloading.chunked_dataset.ChunkedTimeSeriesDataset(chunk_it, kernel_size, batch_size, batches_per_chunk, coincident=True, device='cpu')

Bases: IterableDataset

Wrapper dataset that will loop through chunks of timeseries data produced by another iterable and sample windows from these chunks.

Parameters:
  • chunk_it (Iterable) -- Iterator which will produce chunks of timeseries data to sample windows from. Should have shape (N, C, T), where N is the number of chunks to sample from, C is the number of channels, and T is the number of samples along the time dimension for each chunk.

  • kernel_size (float) -- Size of windows to be sampled from each chunk. Should be less than the size of each chunk along the time dimension.

  • batch_size (int) -- Number of windows to sample at each iteration.

  • batches_per_chunk (int) -- Number of batches of windows to sample from each chunk before moving on to the next one. Sampling fewer batches from each chunk means a lower likelihood of sampling duplicate windows, but an increase in chunk-loading overhead.

  • coincident (bool) -- Whether the windows sampled from individual channels in each batch element should be sampled coincidentally, i.e. consisting of the same timesteps, or whether each window should be sampled independently from the others.

  • device (str) -- Which device chunks should be moved to upon loading.
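
The difference between coincident and independent sampling can be illustrated with a small standard-library sketch. This is not the library's implementation: the helper name sample_window_starts is hypothetical, and it assumes kernel_size and the chunk length are both given as integer numbers of samples.

```python
import random


def sample_window_starts(num_channels, chunk_length, kernel_size,
                         batch_size, coincident=True, rng=None):
    """Return a batch_size x num_channels nested list of window start indices.

    Hypothetical helper for illustration only; sizes are in samples.
    """
    rng = rng or random.Random()
    max_start = chunk_length - kernel_size  # last valid start index (inclusive)
    if max_start < 0:
        raise ValueError("kernel_size must not exceed the chunk length")

    batch = []
    for _ in range(batch_size):
        if coincident:
            # One start index shared by every channel: same timesteps.
            start = rng.randint(0, max_start)
            batch.append([start] * num_channels)
        else:
            # Each channel draws its own start index.
            batch.append(
                [rng.randint(0, max_start) for _ in range(num_channels)]
            )
    return batch


coincident_starts = sample_window_starts(
    num_channels=2, chunk_length=1000, kernel_size=100,
    batch_size=4, coincident=True, rng=random.Random(0),
)
independent_starts = sample_window_starts(
    num_channels=2, chunk_length=1000, kernel_size=100,
    batch_size=4, coincident=False, rng=random.Random(0),
)
```

With coincident=True every row holds one repeated index, so all channels of a batch element cover the same timesteps. With coincident=False the indices in a row are drawn separately and will usually differ.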

ml4gw.dataloading.hdf5_dataset module

exception ml4gw.dataloading.hdf5_dataset.ContiguousHdf5Warning

Bases: Warning

class ml4gw.dataloading.hdf5_dataset.Hdf5TimeSeriesDataset(fnames, channels, kernel_size, batch_size, batches_per_epoch, coincident)

Bases: IterableDataset

Iterable dataset that samples and loads windows of timeseries data uniformly from a set of HDF5 files. It is strongly recommended that these files have been written using chunked storage (https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage). This has been shown to produce increases in read-time speeds of over an order of magnitude.

Parameters:
  • fnames (Sequence[str]) -- Paths to HDF5 files from which to sample data.

  • channels (Sequence[str]) -- Datasets to read from the indicated files, which will be stacked along dim 1 of the generated batches during iteration.

  • kernel_size (int) -- Size of the windows to read, in number of samples. This will be the size of the last dimension of the generated batches.

  • batch_size (int) -- Number of windows to sample at each iteration.

  • batches_per_epoch (int) -- Number of batches to generate during each call to __iter__.

  • coincident (Union[bool, str]) -- Whether windows for each channel in a given batch element should be sampled coincidentally, i.e. corresponding to the same time indices from the same files, or should be sampled independently. For the latter case, users can either specify False, which will sample filenames independently for each channel, or "files", which will sample windows independently within a given file for each channel. The latter setting limits the amount of entropy in the effective dataset, but can provide over 2x improvement in total throughput.
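
The three settings of coincident can be compared with a small standard-library sketch that picks a (filename, start index) pair per channel for one batch element. It is a simplification, not the library's code: the helper sample_sources, the file names, and the channel names are hypothetical, and every file is assumed to hold the same number of samples.

```python
import random


def sample_sources(fnames, channels, file_length, kernel_size, coincident, rng):
    """Pick a (filename, start index) pair for each channel of one batch element.

    Hypothetical helper; assumes all files contain file_length samples.
    """
    max_start = file_length - kernel_size  # last valid start index (inclusive)

    if coincident is True:
        # Same file and same time indices for every channel.
        fname = rng.choice(fnames)
        start = rng.randint(0, max_start)
        return {ch: (fname, start) for ch in channels}

    if coincident == "files":
        # Same file for every channel, independent windows within it.
        fname = rng.choice(fnames)
        return {ch: (fname, rng.randint(0, max_start)) for ch in channels}

    if coincident is False:
        # File and window are both drawn independently per channel.
        return {
            ch: (rng.choice(fnames), rng.randint(0, max_start))
            for ch in channels
        }

    raise ValueError(f"Unsupported value for coincident: {coincident!r}")


rng = random.Random(0)
fnames = ["a.h5", "b.h5"]    # hypothetical file names
channels = ["H1", "L1"]      # hypothetical dataset names
same_window = sample_sources(fnames, channels, 4096, 256, True, rng)
same_file = sample_sources(fnames, channels, 4096, 256, "files", rng)
independent = sample_sources(fnames, channels, 4096, 256, False, rng)
```

In the "files" case only one file needs to be opened per batch element, which is one plausible reason for the throughput gain described above.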

sample_batch()

Sample a single batch of multichannel timeseries.

Return type:

Tensor

sample_fnames(size)

Return type:

ndarray

ml4gw.dataloading.in_memory_dataset module

class ml4gw.dataloading.in_memory_dataset.InMemoryDataset(X, kernel_size, y=None, batch_size=32, stride=1, batches_per_epoch=None, coincident=True, shuffle=True, device='cpu')

Bases: IterableDataset

Dataset for iterating through in-memory multi-channel timeseries

Dataset for arrays of timeseries data which can be stored in-memory all at once. Iterates through the data by sampling fixed-length windows from all channels. The precise mechanism for this iteration is determined by combinations of the keyword arguments. See their descriptions for details.

Parameters:
  • X (Tensor) -- Timeseries data to be iterated through. Should have shape (num_channels, length * sample_rate). Windows will be sampled from the time (1st) dimension for all channels along the channel (0th) dimension.

  • kernel_size (int) -- The length of the windows to sample from X in units of samples.

  • y (Optional[Tensor]) -- Target timeseries to be iterated through. If specified, should be a single channel and have shape (length * sample_rate,). If left as None, only windows sampled from X will be returned during iteration. Otherwise, windows sampled from both arrays will be returned. Note that if sampling is performed non-coincidentally, there's no sensible way to align windows sampled from this array with the windows sampled from X, so this combination of arguments is not permitted.

  • batch_size (int) -- Maximum number of windows to return at each iteration. Will be the length of the 0th dimension of the returned array(s). If batches_per_epoch is specified, this will be the length of _every_ array returned during iteration. Otherwise, it's possible that the last array will be shorter due to the number of windows in the timeseries being a non-integer multiple of batch_size.

  • stride (int) -- The resolution at which windows will be sampled from the specified timeseries, in units of samples. E.g. if stride=2, the first sample of each window can only be from an index of X which is a multiple of 2. This reduces the number of windows which can be iterated through by roughly a factor of stride.

  • batches_per_epoch (Optional[int]) -- Number of batches of windows to produce during iteration before raising a StopIteration. Must be specified if performing non-coincident sampling. Otherwise, if left as None, windows will be sampled until the entire timeseries has been exhausted. Note that batch_size * batches_per_epoch must be small enough to be fulfilled by the number of windows in the timeseries, otherwise a ValueError will be raised.

  • coincident (bool) -- Whether to sample windows from the channels of X using the same indices or independently. Can't be False if batches_per_epoch is None or y is not None.

  • shuffle (bool) -- Whether to sample windows from timeseries randomly or in order along the time axis. If coincident=False and shuffle=False, channels will be iterated through with the index along the last channel moving fastest.

  • device (str) -- Which device to host the timeseries arrays on.
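
One reading of the ordering used when coincident=False and shuffle=False ("the index along the last channel moving fastest") is a nested loop over per-channel start indices, with the last channel in the innermost loop. The sketch below illustrates that reading with itertools.product. The function name ordered_noncoincident_starts is hypothetical and this is an interpretation of the description, not a copy of the library's logic.

```python
from itertools import islice, product


def ordered_noncoincident_starts(num_channels, num_kernels, stride=1):
    """Yield one tuple of per-channel start indices per window combination.

    Hypothetical helper: product() varies its last argument fastest, which
    matches "the index along the last channel moving fastest".
    """
    starts = range(0, num_kernels * stride, stride)
    return product(starts, repeat=num_channels)


# Two channels, three windows each, stride of one sample.
first_four = list(islice(ordered_noncoincident_starts(2, 3), 4))
# first_four == [(0, 0), (0, 1), (0, 2), (1, 0)]
```

Under this reading the number of distinct combinations grows as the number of windows raised to the number of channels, which is consistent with batches_per_epoch being required for non-coincident sampling.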

init_indices()

Initialize arrays of indices we'll use to slice through X and y at iteration time. This helps by taking care of building in any randomness upfront.

property num_kernels: int

The number of windows contained in the timeseries if we sample at the specified stride.
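
As a rough guide to how kernel_size, stride, batch_size and batches_per_epoch interact, the following sketch counts windows and batches for coincident sampling. The counting formula is an assumption (the usual sliding-window count), and the function names are hypothetical; the ValueError mirrors the condition described for batches_per_epoch.

```python
import math


def count_kernels(length, kernel_size, stride=1):
    """Number of windows of kernel_size samples at the given stride.

    Assumed formula: one window per valid start index 0, stride, 2*stride, ...
    """
    return (length - kernel_size) // stride + 1


def count_batches(length, kernel_size, batch_size, stride=1,
                  batches_per_epoch=None):
    """Number of batches produced in one pass under coincident sampling."""
    n = count_kernels(length, kernel_size, stride)
    if batches_per_epoch is None:
        # Iterate until the timeseries is exhausted; last batch may be short.
        return math.ceil(n / batch_size)
    if batch_size * batches_per_epoch > n:
        raise ValueError(
            f"Requested {batch_size * batches_per_epoch} windows, "
            f"but only {n} are available"
        )
    return batches_per_epoch


n_stride_1 = count_kernels(1000, 100, stride=1)   # 901 windows
n_stride_2 = count_kernels(1000, 100, stride=2)   # 451 windows
n_batches = count_batches(1000, 100, batch_size=32)  # 29, the last holds 5
```

Here 901 windows split into batches of 32 give 28 full batches plus a final batch of 5, which is the shorter last array mentioned in the description of batch_size.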

Module contents