returnn.datasets.huggingface
HuggingFace dataset wrapper
See https://github.com/rwth-i6/returnn/issues/1257 for some initial discussion.
- class returnn.datasets.huggingface.HuggingFaceDataset(dataset_opts: Dict[str, Any] | str | os.PathLike | Sequence[str | os.PathLike] | Callable[[], Dict[str, Any] | str | os.PathLike | Sequence[str | os.PathLike] | datasets.Dataset], *, use_file_cache: bool = False, map_func: Callable[[datasets.Dataset], datasets.Dataset] | None = None, rename_columns: Dict[str, str] | None = None, cast_columns: Dict[str, Dict[str, Any]] | None = None, data_format: Dict[str, Dict[str, Any]], seq_tag_column: str | None = 'id', sorting_seq_len_column_data: str | None = None, sorting_seq_len_column: str | None = None, **kwargs)
HuggingFace dataset wrapper.
- Parameters:
dataset_opts – either a dict of options for datasets.load_dataset(), a path to a local dataset for datasets.load_from_disk(), or a list of Arrow filenames to load with datasets.Dataset.from_file() and concatenate. It can also be a callable returning one of the above, or returning a datasets.Dataset directly.
use_file_cache – if True, will cache the dataset files on local disk using file_cache. This only works if dataset_opts is a str or a list of str (or a callable returning that).
map_func – optional function to apply to the dataset after loading
rename_columns – if given, will rename these columns
cast_columns – if given, will cast these columns to the specified types. This is useful if the dataset does not have the expected types. See datasets.Dataset.cast() for details. You can also, e.g., enforce some sample_rate for audio.
data_format – for each column name (data key), specify the format as a dict with entries for “dim”, “ndim”, “shape”, and/or “dtype”, compatible with Tensor. It can be a subset of the available columns. If “vocab” is specified and the underlying HF datasets column is of dtype “string”, the string is automatically tokenized using the vocab.
seq_tag_column – key (column name) in the dataset to use as sequence tag. If None, will use the sequence index as tag.
sorting_seq_len_column_data – key (column name) in the dataset to use for sorting by sequence length. It will take len(dataset[sorting_seq_len_column_data]) as sequence length (only for sorting/shuffling).
sorting_seq_len_column – key (column name) in the dataset to use for sorting by sequence length. It will take the value of dataset[sorting_seq_len_column] as sequence length (only for sorting/shuffling). E.g. some datasets provide “duration”, “duration_ms”, “wav_filesize” or similar information which can be used.
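As a rough illustration of how these parameters fit together, the following sketch shows two dataset dicts as they might appear in a RETURNN config. It assumes the usual RETURNN convention of selecting the dataset class via a "class" entry. The dataset path "some_org/some_dataset", the directory "/path/to/hf_dataset_dir", and the column names "input_ids" and "id" are hypothetical placeholders, not names taken from any real dataset.

```python
# Sketch only: dataset path, directory, and column names are hypothetical placeholders.

train = {
    "class": "HuggingFaceDataset",
    # Dict form: used as options for datasets.load_dataset().
    "dataset_opts": {"path": "some_org/some_dataset", "split": "train"},
    # Only the columns listed here are used; this may be a subset of all columns.
    "data_format": {"input_ids": {"shape": (None,), "dtype": "int32"}},
    # Column holding the sequence tag ("id" is also the documented default).
    "seq_tag_column": "id",
    # Sort/shuffle by len(dataset["input_ids"]).
    "sorting_seq_len_column_data": "input_ids",
}


def _get_shard_files():
    """Callable form of dataset_opts: returns a list of Arrow shard filenames."""
    # Imported lazily, so merely defining the config does not import RETURNN here.
    from returnn.datasets.huggingface import get_arrow_shard_files_from_hf_dataset_dir

    return get_arrow_shard_files_from_hf_dataset_dir("/path/to/hf_dataset_dir")


dev = {
    "class": "HuggingFaceDataset",
    "dataset_opts": _get_shard_files,
    # File caching is documented to work for str / list-of-str dataset_opts
    # (or a callable returning that), which is the case here.
    "use_file_cache": True,
    "data_format": {"input_ids": {"shape": (None,), "dtype": "int32"}},
    # None: use the sequence index as tag.
    "seq_tag_column": None,
}
```

The callable form defers locating the shard files until the dataset is actually constructed, which can be convenient when the config is parsed on a machine where the data directory is not available.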
- get_total_num_seqs(*, fast: bool = False) → int
- Returns:
total number of sequences in the dataset
- init_seq_order(epoch: int | None = None, seq_list: Sequence[str] | None = None, seq_order: Sequence[int] | None = None) → bool
- Parameters:
epoch
seq_list – List of sequence tags, to set a predefined order.
seq_order – List of corpus sequence indices, to set a predefined order.
- Returns:
whether the order changed (True is always safe to return)
- returnn.datasets.huggingface.get_arrow_shard_files_from_hf_dataset_dir(hf_data_dir: str | PathLike) → List[str]
Given some HF datasets directory (created via datasets.Dataset.save_to_disk()), return the list of Arrow shard files (data-*-of-*.arrow). This also verifies that the directory looks like a valid HF datasets directory. The returned list is ordered by shard index. Note that this does not load the dataset; it only inspects the directory structure.
- Parameters:
hf_data_dir – directory
- Returns:
list of Arrow shard files
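To make the "ordered by shard index" behaviour concrete, here is a minimal standard-library sketch of how such a directory listing could be done. It is not the RETURNN implementation: the helper name list_arrow_shards_sketch is hypothetical, and unlike the real function it only checks that at least one shard file exists instead of validating the whole directory layout.

```python
import os
import re
import tempfile
from typing import List

# Matches shard names of the documented form data-*-of-*.arrow, e.g. data-00000-of-00002.arrow.
_SHARD_RE = re.compile(r"^data-(\d+)-of-(\d+)\.arrow$")


def list_arrow_shards_sketch(hf_data_dir: str) -> List[str]:
    """Hypothetical stand-in: list Arrow shard files in a directory, sorted by shard index."""
    shards = []
    for name in os.listdir(hf_data_dir):
        match = _SHARD_RE.match(name)
        if match:
            # Keep the numeric shard index for sorting; non-shard files are ignored.
            shards.append((int(match.group(1)), os.path.join(hf_data_dir, name)))
    if not shards:
        raise ValueError(f"no data-*-of-*.arrow files found in {hf_data_dir!r}")
    shards.sort()
    return [path for _, path in shards]


# Small demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["data-00001-of-00002.arrow", "data-00000-of-00002.arrow", "state.json"]:
        open(os.path.join(tmp_dir, name), "w").close()
    shard_names = [os.path.basename(p) for p in list_arrow_shards_sketch(tmp_dir)]
```

Because only file names are inspected, no Arrow data is read, which mirrors the documented property that the dataset itself is not loaded.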