DatasetCreator

class myoverse.datasets.DatasetCreator(modalities, sampling_frequency=2048.0, tasks_to_use=(), save_path=PosixPath('dataset.zip'), test_ratio=0.2, val_ratio=0.2, time_chunk_size=256, debug_level=0)

Creates datasets stored in zarr for direct tensor loading.

The dataset is written in zarr format together with metadata recording each array's dimension names, so it can be loaded directly into GPU tensors with named dimensions.

Parameters:
  • modalities (dict[str, Modality]) – Dictionary mapping modality names to Modality configs.

  • sampling_frequency (float) – Sampling frequency in Hz.

  • tasks_to_use (Sequence[str]) – Task keys to include. An empty sequence selects all tasks.

  • save_path (Path | str) – Output path for the zarr zip file (e.g., "dataset.zip").

  • test_ratio (float) – Fraction of the data assigned to the test split, between 0.0 and 1.0.

  • val_ratio (float) – Fraction of the data assigned to the validation split, between 0.0 and 1.0.

  • time_chunk_size (int) – Chunk size along time dimension for zarr storage.

  • debug_level (int) – Debug output level (0=none, 1=text, 2=text+graphs).
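
As a rough illustration of how test_ratio, val_ratio and time_chunk_size interact, the sketch below computes split sizes and chunk counts for a recording of a given length. It assumes that both ratios are fractions of the total number of time samples and that the remainder is used for training. The helper names split_lengths and n_time_chunks are hypothetical and not part of myoverse, and the library's own split logic may differ.

```python
import math


def split_lengths(
    n_samples: int, test_ratio: float = 0.2, val_ratio: float = 0.2
) -> tuple[int, int, int]:
    """Return (train, test, val) sample counts along the time axis.

    Assumption: both ratios are fractions of the full recording and
    whatever is left over goes to training.
    """
    if not (0.0 <= test_ratio <= 1.0 and 0.0 <= val_ratio <= 1.0):
        raise ValueError("ratios must lie in [0.0, 1.0]")
    if test_ratio + val_ratio > 1.0:
        raise ValueError("test_ratio + val_ratio must not exceed 1.0")
    # Truncate so the two held-out splits never exceed the recording length.
    n_test = int(n_samples * test_ratio)
    n_val = int(n_samples * val_ratio)
    return n_samples - n_test - n_val, n_test, n_val


def n_time_chunks(n_samples: int, time_chunk_size: int = 256) -> int:
    """Number of chunks needed to cover n_samples along the time axis."""
    return math.ceil(n_samples / time_chunk_size)
```

Under these assumptions, a recording of 10240 samples with the default ratios yields 6144 training samples and 2048 samples each for test and validation. With time_chunk_size=256 the training portion then spans 24 chunks along the time axis.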

Examples

>>> from myoverse.datasets import DatasetCreator, Modality
>>>
>>> creator = DatasetCreator(
...     modalities={
...         "emg": Modality(path="emg.pkl", dims=("channel", "time")),
...         "kinematics": Modality(path="kin.pkl", dims=("joint", "time")),
...     },
...     sampling_frequency=2048.0,
...     save_path="dataset.zip",
... )
>>> creator.create()
>>>
>>> # Load directly to GPU tensors
>>> from myoverse.datasets import DataModule
>>> dm = DataModule("dataset.zip", device="cuda")

Methods

__init__(modalities[, sampling_frequency, ...])

_extract_center(data, ratio)

Extract center portion of data along last axis.

_print_config()

_print_data_structure()

_print_header([title])

_print_summary()

Print dataset summary.

_process_all_tasks(store)

_process_task(task, store)

Process a single task for all modalities.

_split_continuous(data)

Split continuous data along time axis (last dimension).

_store_array(group, name, data)

Store an array with time-chunked layout.

create()

Create the dataset.
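
The summary of _extract_center above ("Extract center portion of data along last axis") can be sketched with plain NumPy slicing. This is a hypothetical re-implementation for illustration only, not the library's code: the function name extract_center, the truncating rounding and the handling of odd remainders are all assumptions.

```python
import numpy as np


def extract_center(data: np.ndarray, ratio: float) -> np.ndarray:
    """Return the central `ratio` fraction of `data` along its last axis.

    Assumption: the window length is truncated to an integer and any
    odd remainder is left on the right-hand side.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0.0, 1.0]")
    n = data.shape[-1]
    keep = int(n * ratio)      # number of samples to keep
    start = (n - keep) // 2    # left edge of the centered window
    # The ellipsis leaves all leading axes (e.g. channel, joint) untouched.
    return data[..., start:start + keep]
```

Because the slice uses an ellipsis, the same function works for arrays with dims ("channel", "time") or ("joint", "time") as in the example above, as long as time is the last axis.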

create()

Create the dataset.

Return type:

None