returnn.datasets.huggingface

HuggingFace dataset wrapper

See https://github.com/rwth-i6/returnn/issues/1257 for some initial discussion.

class returnn.datasets.huggingface.HuggingFaceDataset(dataset_opts: Dict[str, Any] | str | os.PathLike | Sequence[str | os.PathLike] | Callable[[], Dict[str, Any] | str | os.PathLike | Sequence[str | os.PathLike] | datasets.Dataset], *, use_file_cache: bool = False, map_func: Callable[[datasets.Dataset], datasets.Dataset] | None = None, rename_columns: Dict[str, str] | None = None, cast_columns: Dict[str, Dict[str, Any]] | None = None, data_format: Dict[str, Dict[str, Any]], seq_tag_column: str | None = 'id', sorting_seq_len_column_data: str | None = None, sorting_seq_len_column: str | None = None, **kwargs)

HuggingFace dataset wrapper.

Parameters:
  • dataset_opts – one of: a dict of options for datasets.load_dataset(), a path to a local dataset for datasets.load_from_disk(), or a list of Arrow filenames to load with datasets.Dataset.from_file() and concatenate. It can also be a callable returning one of the above, or returning a datasets.Dataset directly.

  • use_file_cache – if True, will cache the dataset files on local disk using file_cache. This only works when dataset_opts is a str or a list of str (or a callable returning one of these).

  • map_func – optional function to apply to the dataset after loading

  • rename_columns – if given, will rename these columns

  • cast_columns – if given, will cast these columns to the specified types. This is useful if the dataset does not have the expected types. See datasets.Dataset.cast() for details. You can also use this to, e.g., enforce a certain sample_rate for audio.

  • data_format – For each column name (data key), specify the format, as a dict with entries for “dim”, “ndim”, “shape”, and/or “dtype”, compatible with Tensor. It can be a subset of the available columns. If “vocab” is specified, and the underlying HF datasets column is of dtype “string”, it will automatically tokenize the string using the vocab.

  • seq_tag_column – key (column name) in the dataset to use as sequence tag. If None, will use the sequence index as tag.

  • sorting_seq_len_column_data – key (column name) in the dataset to use for sorting by sequence length. It will take len(dataset[sorting_seq_len_column_data]) as sequence length (only for sorting/shuffling).

  • sorting_seq_len_column – key (column name) in the dataset to use for sorting by sequence length. It will take the value of dataset[sorting_seq_len_column] as sequence length (only for sorting/shuffling). E.g. some datasets provide “duration”, “duration_ms”, “wav_filesize” or similar such information which can be used.
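To illustrate how these parameters fit together, here is a minimal sketch of a dataset dict as it might appear in a RETURNN config. The dataset path, split, column names ("features", "duration_ms") and the feature dimension are hypothetical placeholders, not taken from a real dataset. The "class" key follows the usual RETURNN convention of selecting a dataset class by name; treat the exact values as assumptions.

```python
# Hypothetical RETURNN config fragment. All dataset and column names are placeholders.
train = {
    "class": "HuggingFaceDataset",
    # Dict form of dataset_opts: options for datasets.load_dataset().
    "dataset_opts": {"path": "my_org/my_speech_corpus", "split": "train"},
    # Format per data key. Only a subset of the available columns needs to be listed.
    # Assumed here: a float feature sequence with dynamic length and 80 features per frame.
    "data_format": {
        "features": {"shape": (None, 80), "dtype": "float32"},
    },
    # Column providing the sequence tag (this is also the documented default).
    "seq_tag_column": "id",
    # Use a precomputed length-like column for sorting/shuffling only.
    "sorting_seq_len_column": "duration_ms",
}

# Alternative forms of dataset_opts described above (paths are placeholders):
local_dir_opts = "/path/to/saved_hf_dataset"  # for datasets.load_from_disk()
arrow_files_opts = [  # loaded via datasets.Dataset.from_file() and concatenated
    "/path/to/saved_hf_dataset/data-00000-of-00002.arrow",
    "/path/to/saved_hf_dataset/data-00001-of-00002.arrow",
]
```

The two string-based forms are the ones for which use_file_cache=True is documented to work.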

get_data_keys() → List[str]
Returns:

list of data keys

get_target_list() → List[str]
Returns:

list of target keys

get_data_shape(key: str) → List[int]
Returns:

data shape for the given key

get_data_dim(key: str) → int
Returns:

data dimension for the given key

is_data_sparse(key: str) → bool
Returns:

whether the data is sparse for the given key

get_data_dtype(key: str) → str
Returns:

dtype

property num_seqs: int
Returns:

number of sequences

get_tag(sorted_seq_idx: int) → str
Returns:

tag of the sequence

get_all_tags() → List[str]
Returns:

all tags

get_total_num_seqs(*, fast: bool = False) → int
Returns:

total number of sequences in the dataset

init_seq_order(epoch: int | None = None, seq_list: Sequence[str] | None = None, seq_order: Sequence[int] | None = None) → bool
Parameters:
  • epoch

  • seq_list – List of sequence tags, to set a predefined order.

  • seq_order – List of corpus sequence indices, to set a predefined order.

Returns:

whether the order changed (True is always safe to return)

supports_sharding() → bool
Returns:

whether this dataset supports sharding

get_current_seq_order() → Sequence[int]
Returns:

list of corpus seq idx

get_corpus_seq_idx(sorted_seq_idx: int) → int
Returns:

corpus seq idx
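The relationship between the sequence-order methods above can be sketched with a toy stand-in. This is not RETURNN code: the class ToySeqOrder and its internals are hypothetical, and it only models the assumed bookkeeping, namely that a sorted sequence index is a position in the current order and the corpus sequence index is a position in the underlying dataset.

```python
from typing import List, Optional, Sequence


class ToySeqOrder:
    """Toy stand-in (not RETURNN code) for sequence-order bookkeeping."""

    def __init__(self, tags: Sequence[str]):
        self._tags = list(tags)  # corpus order: corpus seq idx -> tag
        self._seq_order: List[int] = list(range(len(self._tags)))

    def init_seq_order(
        self,
        epoch: Optional[int] = None,
        seq_list: Optional[Sequence[str]] = None,
        seq_order: Optional[Sequence[int]] = None,
    ) -> bool:
        if seq_list is not None:
            # Predefined order given as sequence tags.
            tag_to_idx = {tag: i for i, tag in enumerate(self._tags)}
            self._seq_order = [tag_to_idx[tag] for tag in seq_list]
        elif seq_order is not None:
            # Predefined order given as corpus sequence indices.
            self._seq_order = list(seq_order)
        else:
            # Fallback in this toy: plain corpus order (no sorting or shuffling).
            self._seq_order = list(range(len(self._tags)))
        return True  # True is always safe to return

    def get_current_seq_order(self) -> Sequence[int]:
        return self._seq_order

    def get_corpus_seq_idx(self, sorted_seq_idx: int) -> int:
        return self._seq_order[sorted_seq_idx]

    def get_tag(self, sorted_seq_idx: int) -> str:
        return self._tags[self.get_corpus_seq_idx(sorted_seq_idx)]


toy = ToySeqOrder(["seq-a", "seq-b", "seq-c"])
toy.init_seq_order(epoch=1, seq_list=["seq-c", "seq-a"])
current_order = list(toy.get_current_seq_order())  # corpus indices of "seq-c", "seq-a"
first_tag = toy.get_tag(0)  # tag of the first sequence in the current order
```

In this toy, passing seq_list or seq_order yields the same kind of result: a list of corpus sequence indices that get_corpus_seq_idx() then indexes into.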

returnn.datasets.huggingface.get_arrow_shard_files_from_hf_dataset_dir(hf_data_dir: str | PathLike) → List[str]

Given some HF datasets directory (created via datasets.Dataset.save_to_disk()), return the list of Arrow shard files (data-*-of-*.arrow). This also verifies that the directory looks like a valid HF datasets directory. The returned list is ordered by shard index. Note that this does not load the dataset; it only inspects the directory structure.

Parameters:

hf_data_dir – directory

Returns:

list of Arrow shard files
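The directory inspection described above can be approximated with the standard library alone. The following is a sketch under stated assumptions, not the RETURNN implementation: the function name list_arrow_shards_sketch is hypothetical, and the only validity check it performs is that the data-*-of-*.arrow shard names are complete and consistent. Which checks the real function performs is not specified here.

```python
import os
import re
from typing import List, Union

# Matches shard files such as "data-00000-of-00002.arrow".
_SHARD_RE = re.compile(r"^data-(\d+)-of-(\d+)\.arrow$")


def list_arrow_shards_sketch(hf_data_dir: Union[str, os.PathLike]) -> List[str]:
    """Hypothetical helper: list Arrow shard files in shard-index order."""
    hf_data_dir = os.fspath(hf_data_dir)
    shards = []
    for name in os.listdir(hf_data_dir):
        match = _SHARD_RE.match(name)
        if match:
            shards.append((int(match.group(1)), int(match.group(2)), name))
    if not shards:
        raise ValueError(f"no data-*-of-*.arrow files found in {hf_data_dir!r}")
    shards.sort()  # sort by numeric shard index, not by file name
    total = shards[0][1]
    indices = [index for index, _, _ in shards]
    if any(t != total for _, t, _ in shards) or indices != list(range(total)):
        raise ValueError(f"incomplete or inconsistent shard set in {hf_data_dir!r}")
    # Only file names are inspected; no Arrow data is read.
    return [os.path.join(hf_data_dir, name) for _, _, name in shards]
```

A list of this kind is one of the accepted forms of dataset_opts for HuggingFaceDataset, and since it is a list of str, it is also a form for which use_file_cache is documented to work.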