`returnn.datasets.huggingface`

HuggingFace dataset wrapper

See https://github.com/rwth-i6/returnn/issues/1257 for some initial discussion.

Parameters:

dataset_opts – either a dict of options for datasets.load_dataset(), a path to a local dataset for datasets.load_from_disk(), or a list of Arrow filenames to load with datasets.Dataset.from_file() and concatenate. It can also be a callable returning one of the above, or returning a datasets.Dataset directly.
use_file_cache – if True, will cache the dataset files on local disk using file_cache. This only works when dataset_opts is a str or a list of str (or a callable returning one of those).
map_func – optional function to apply to the dataset after loading
rename_columns – if given, will rename these columns
cast_columns – if given, will cast these columns to the specified types. This is useful if the dataset does not have the expected types. See datasets.Dataset.cast() for details. You can also use this to enforce, e.g., a specific sample_rate for audio.
data_format – For each column name (data key), specify the format as a dict with entries for “dim”, “ndim”, “shape”, and/or “dtype”, compatible with Tensor. It can cover a subset of the available columns. If “vocab” is specified and the underlying HF datasets column is of dtype “string”, the string is automatically tokenized using the vocab.
seq_tag_column – key (column name) in the dataset to use as sequence tag. If None, will use the sequence index as tag.
sorting_seq_len_column_data – key (column name) in the dataset to use for sorting by sequence length. It will take len(dataset[sorting_seq_len_column_data]) as sequence length (only for sorting/shuffling).
sorting_seq_len_column – key (column name) in the dataset to use for sorting by sequence length. It will take the value of dataset[sorting_seq_len_column] as sequence length (only for sorting/shuffling). E.g. some datasets provide “duration”, “duration_ms”, “wav_filesize” or similar such information which can be used.
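
As a minimal sketch of how the options above might be combined in a RETURNN config: the dict below assumes the dataset class is registered under the name `HuggingFaceDataset` (the class name is not stated on this page), that `rename_columns` accepts an old-name to new-name mapping, and that `Utf8ByteTargets` is available as a vocab class. The dataset path `my_org/my_text_corpus` and the column names `id` and `sentence` are hypothetical placeholders.

```python
# Sketch of a RETURNN dataset config dict using the options described above.
# Assumptions (not confirmed by this page): the class name "HuggingFaceDataset",
# the mapping form of rename_columns, and the "Utf8ByteTargets" vocab class.
# The dataset path and the column names are hypothetical placeholders.
train = {
    "class": "HuggingFaceDataset",
    # Dict form: options for datasets.load_dataset().
    "dataset_opts": {"path": "my_org/my_text_corpus", "split": "train"},
    # File caching only applies to str / list-of-str dataset_opts, so it stays off here.
    "use_file_cache": False,
    # Rename the hypothetical "sentence" column to "text".
    "rename_columns": {"sentence": "text"},
    # Only the columns listed here are described as data keys.
    # The "vocab" entry asks for string columns to be tokenized.
    "data_format": {
        "text": {"dtype": "int32", "shape": [None], "vocab": {"class": "Utf8ByteTargets"}},
    },
    # Use the hypothetical "id" column as sequence tag.
    "seq_tag_column": "id",
    # Sort/shuffle by len() of the "text" column entries.
    "sorting_seq_len_column_data": "text",
}
```

Depending on the setup, further Tensor options may be needed in `data_format`; the entries shown are limited to the keys named in the parameter description.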

get_data_keys() → List[str]

Returns: list of data keys

get_target_list() → List[str]

Returns: list of target keys

get_data_shape(key: str) → List[int]

Returns: data shape for the given key

get_data_dim(key: str) → int

Returns: data dimension for the given key

is_data_sparse(key: str) → bool

Returns: whether the data is sparse for the given key

get_data_dtype(key: str) → str

Returns: dtype

property num_seqs: int

Returns: number of sequences

get_tag(sorted_seq_idx: int) → str

Returns: tag of the sequence

get_all_tags() → List[str]

Returns: all tags

get_total_num_seqs(*, fast: bool = False) → int

Returns: total number of sequences in the dataset

init_seq_order(epoch: int | None = None, seq_list: Sequence[str] | None = None, seq_order: Sequence[int] | None = None) → bool

Parameters:

epoch
seq_list – List of sequence tags, to set a predefined order.
seq_order – List of corpus sequence indices, to set a predefined order.

Returns: whether the order changed (True is always safe to return)
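
To illustrate the two ways of passing a predefined order, here is a minimal standalone sketch that resolves either `seq_list` (tags) or `seq_order` (corpus indices) into a list of corpus sequence indices. The helper `resolve_seq_order` is hypothetical and not part of RETURNN; rejecting the case where both arguments are given is a choice made for this sketch only.

```python
from typing import Dict, List, Optional, Sequence


def resolve_seq_order(
    all_tags: Sequence[str],
    seq_list: Optional[Sequence[str]] = None,
    seq_order: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return corpus sequence indices in the order they should be visited.

    Hypothetical helper for illustration, not the RETURNN implementation.
    """
    if seq_list is not None and seq_order is not None:
        # A choice made for this sketch: treat the two options as exclusive.
        raise ValueError("give either seq_list or seq_order, not both")
    if seq_order is not None:
        # Already corpus sequence indices: use them as given.
        return list(seq_order)
    if seq_list is not None:
        # Map each sequence tag back to its corpus sequence index.
        tag_to_idx: Dict[str, int] = {tag: i for i, tag in enumerate(all_tags)}
        return [tag_to_idx[tag] for tag in seq_list]
    # No predefined order: fall back to the plain corpus order.
    return list(range(len(all_tags)))


tags = ["utt-a", "utt-b", "utt-c"]
by_tags = resolve_seq_order(tags, seq_list=["utt-c", "utt-a"])
by_indices = resolve_seq_order(tags, seq_order=[1, 1, 0])
print(by_tags)  # [2, 0]
print(by_indices)  # [1, 1, 0]
```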

supports_sharding() → bool

Returns: whether this dataset supports sharding

get_current_seq_order() → Sequence[int]

Returns: list of corpus seq idx

get_corpus_seq_idx(sorted_seq_idx: int) → int

Returns: corpus seq idx
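
The relation between a sorted sequence index and a corpus sequence index can be sketched in plain Python. The example also contrasts the two length sources from the parameter list: a precomputed value column (`sorting_seq_len_column`) versus `len()` of a data column (`sorting_seq_len_column_data`). The rows, the column names `text` and `num_tokens`, and the helper functions are hypothetical; ascending sort is just one possible ordering and this is not the RETURNN implementation.

```python
from typing import Any, Dict, List, Optional

# Hypothetical rows standing in for a loaded dataset, in corpus order.
rows: List[Dict[str, Any]] = [
    {"text": "a longer sentence", "num_tokens": 3},
    {"text": "short", "num_tokens": 1},
    {"text": "medium one", "num_tokens": 2},
]


def seq_len(
    row: Dict[str, Any],
    *,
    len_column: Optional[str] = None,
    len_column_data: Optional[str] = None,
) -> int:
    """Sequence length used only for sorting, from one of the two sources."""
    if len_column is not None:
        # Like sorting_seq_len_column: the column value itself is the length.
        return int(row[len_column])
    if len_column_data is not None:
        # Like sorting_seq_len_column_data: len() of the column entry.
        return len(row[len_column_data])
    raise ValueError("need len_column or len_column_data")


def length_sorted_order(rows: List[Dict[str, Any]], **len_opts: Optional[str]) -> List[int]:
    """Corpus sequence indices, ordered by ascending sequence length."""
    return sorted(range(len(rows)), key=lambda i: seq_len(rows[i], **len_opts))


order_by_value = length_sorted_order(rows, len_column="num_tokens")
order_by_data = length_sorted_order(rows, len_column_data="text")


def corpus_seq_idx(order: List[int], sorted_seq_idx: int) -> int:
    """Position in the current order -> index in the original corpus."""
    return order[sorted_seq_idx]


print(order_by_value)  # [1, 2, 0]
print(corpus_seq_idx(order_by_value, 0))  # 1
```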

returnn.datasets.huggingface.get_arrow_shard_files_from_hf_dataset_dir(hf_data_dir: str | PathLike) → List[str]

Given some HF datasets directory (created via datasets.save_to_disk()), return the list of Arrow shard files (data-*-of-*.arrow). This also verifies that the directory looks like a valid HF datasets directory. The order of the returned list is by shard index. Note that this does not load the dataset, just inspects the directory structure.

Parameters: hf_data_dir – directory
Returns: list of Arrow shard files
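
The directory inspection described here can be approximated with the standard library alone. The following is a rough sketch of the idea, not the RETURNN implementation: the helper name `list_arrow_shards` is hypothetical, and its validity check (all shard indices present, consistent shard count) is a simplified assumption about what "looks like a valid HF datasets directory" could mean.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Matches shard files such as data-00000-of-00003.arrow.
_SHARD_RE = re.compile(r"^data-(\d+)-of-(\d+)\.arrow$")


def list_arrow_shards(hf_data_dir: str) -> List[str]:
    """Return the data-*-of-*.arrow files in hf_data_dir, ordered by shard index.

    Hypothetical helper for illustration. Only file names are inspected;
    no dataset is loaded.
    """
    directory = Path(hf_data_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"not a directory: {directory}")
    shards = []
    for path in directory.iterdir():
        match = _SHARD_RE.match(path.name)
        if match:
            shards.append((int(match.group(1)), int(match.group(2)), path))
    if not shards:
        raise ValueError(f"no Arrow shard files found in {directory}")
    shards.sort(key=lambda shard: shard[0])
    num_total = shards[0][1]
    # Simplified sanity check: indices 0..N-1 are all present and N is consistent.
    indices = [idx for idx, _, _ in shards]
    if indices != list(range(num_total)) or any(n != num_total for _, n, _ in shards):
        raise ValueError(f"incomplete or inconsistent shard set in {directory}")
    return [str(path) for _, _, path in shards]


# Usage with a throwaway directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "data-00001-of-00003.arrow",
        "data-00000-of-00003.arrow",
        "data-00002-of-00003.arrow",
        "state.json",
    ]:
        (Path(tmp) / name).touch()
    shard_names = [Path(p).name for p in list_arrow_shards(tmp)]

print(shard_names)
```

A list of this form matches the "list of Arrow filenames" variant that dataset_opts accepts.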
