ml4gw.dataloading.in_memory_dataset

Classes

InMemoryDataset(X, kernel_size[, y, ...])

Dataset for iterating through in-memory multi-channel timeseries

class ml4gw.dataloading.in_memory_dataset.InMemoryDataset(X, kernel_size, y=None, batch_size=32, stride=1, batches_per_epoch=None, coincident=True, shuffle=True, device='cpu')

Bases: IterableDataset

Dataset for iterating through in-memory multi-channel timeseries

Dataset for arrays of timeseries data which can be stored in-memory all at once. Iterates through the data by sampling fixed-length windows from all channels. The precise mechanism for this iteration is determined by combinations of the keyword arguments. See their descriptions for details.

Parameters:

X (Float[Tensor, 'channels time']) -- Timeseries data to be iterated through. Should have shape (num_channels, length * sample_rate). Windows will be sampled from the time (1st) dimension for all channels along the channel (0th) dimension.
kernel_size (int) -- The length of the windows to sample from X in units of samples.
y (Float[Tensor, 'time'] | None) -- Target timeseries to be iterated through. If specified, should be a single channel and have shape (length * sample_rate,). If left as None, only windows sampled from X will be returned during iteration. Otherwise, windows sampled from both arrays will be returned. Note that if sampling is performed non-coincidentally, there's no sensible way to align windows sampled from this array with the windows sampled from X, so this combination of arguments is not permitted.
batch_size (int) -- Maximum number of windows to return at each iteration. Will be the length of the 0th dimension of the returned array(s). If batches_per_epoch is specified, this will be the length of every array returned during iteration. Otherwise, it's possible that the last array will be shorter due to the number of windows in the timeseries being a non-integer multiple of batch_size.
stride (int) -- The resolution at which windows will be sampled from the specified timeseries, in units of samples. E.g. if stride=2, the first sample of each window can only be from an index of X which is a multiple of 2. This reduces the number of windows which can be iterated through by roughly a factor of stride.
batches_per_epoch (int | None) -- Number of batches of windows to produce during iteration before raising a StopIteration. Must be specified if performing non-coincident sampling. Otherwise, if left as None, windows will be sampled until the entire timeseries has been exhausted. Note that batch_size * batches_per_epoch must be small enough to be fulfilled by the number of windows in the timeseries, otherwise a ValueError will be raised.
coincident (bool) -- Whether to sample windows from the channels of X using the same indices or independently. Can't be False if batches_per_epoch is None or y is not None.
shuffle (bool) -- Whether to sample windows from timeseries randomly or in order along the time axis. If coincident=False and shuffle=False, channels will be iterated through with the index along the last channel moving fastest.
device (str) -- Which device to host the timeseries arrays on.
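The interaction between batch_size and batches_per_epoch described above can be sketched in plain Python. This is an illustration of the documented behavior for coincident sampling only, not the library's implementation; the function name epoch_batch_sizes and its total_windows argument are hypothetical and stand in for the number of windows available in the timeseries.

```python
def epoch_batch_sizes(total_windows, batch_size=32, batches_per_epoch=None):
    """Return the size of each batch produced in one epoch (hypothetical helper).

    Mirrors the documented rules for coincident sampling:
    - batches_per_epoch=None: iterate until the windows run out, so the
      last batch may be shorter than batch_size.
    - batches_per_epoch set: every batch is full, and the request must fit
      within the available windows.
    """
    if batches_per_epoch is None:
        full, remainder = divmod(total_windows, batch_size)
        return [batch_size] * full + ([remainder] if remainder else [])

    requested = batch_size * batches_per_epoch
    if requested > total_windows:
        raise ValueError(
            f"Requested {requested} windows but only {total_windows} are available"
        )
    return [batch_size] * batches_per_epoch


# 91 windows with the default batch size: two full batches and a short one
print(epoch_batch_sizes(91))  # [32, 32, 27]

# Fixing batches_per_epoch guarantees uniform batch sizes
print(epoch_batch_sizes(91, batches_per_epoch=2))  # [32, 32]
```

For non-coincident sampling the number of available windows is larger, since each channel is indexed independently, so the check above would need a different total.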

init_indices(): Initialize arrays of indices we'll use to slice through X and y at iteration time. This helps by taking care of building in any randomness upfront.

property num_kernels: int: The number of windows contained in the timeseries if we sample at the specified stride.
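The window count can be illustrated with the standard sliding-window formula. This sketch assumes that a window must fit entirely within the timeseries and that starts fall on multiples of stride; the helper names count_kernels and kernel_starts are hypothetical, and the library's own property may treat edge cases differently.

```python
def count_kernels(num_samples, kernel_size, stride=1):
    """Number of full windows of kernel_size samples at the given stride."""
    if kernel_size > num_samples:
        return 0
    return (num_samples - kernel_size) // stride + 1


def kernel_starts(num_samples, kernel_size, stride=1):
    """Start sample of each window counted by count_kernels."""
    return [i * stride for i in range(count_kernels(num_samples, kernel_size, stride))]


print(count_kernels(100, 10))            # 91
print(count_kernels(100, 10, stride=2))  # 46
print(kernel_starts(10, 4, stride=2))    # [0, 2, 4, 6]
```

Doubling the stride roughly halves the count (91 versus 46 here), which is the reduction by a factor of stride noted in the parameter description.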