FederatedDataset#

class FederatedDataset(*, dataset: str, subset: str | None = None, resplitter: Callable[[DatasetDict], DatasetDict] | Dict[str, Tuple[str, ...]] | None = None, partitioners: Dict[str, Partitioner | int], shuffle: bool = True, seed: int | None = 42)[source]#

Bases: object

Representation of a dataset for federated learning/evaluation/analytics.

Download, partition data among clients (edge devices), or load full dataset.

Partitions are created using IidPartitioner when an int is given in partitioners. Support for further partitioner specifications and types will come in future releases.

Parameters:
  • dataset (str) – The name of the dataset in the Hugging Face Hub.

  • subset (Optional[str]) – Secondary information about the dataset, most often a subset or version (passed as the name argument to datasets.load_dataset).

  • resplitter (Optional[Union[Resplitter, Dict[str, Tuple[str, ...]]]]) – Callable that transforms DatasetDict splits, or configuration dict for MergeResplitter.

  • partitioners (Dict[str, Union[Partitioner, int]]) – A dictionary mapping the Dataset split (a str) to a Partitioner or an int (representing the number of IID partitions that this split should be partitioned into). One or multiple Partitioner objects can be specified in that manner, but at most, one per split.

  • shuffle (bool) – Whether to randomize the order of samples. Applied prior to resplitting, separately to each of the splits present in the dataset. It uses the seed argument. Defaults to True.

  • seed (Optional[int]) – Seed used for dataset shuffling. It has no effect if shuffle is False. The seed cannot be set in the later stages. If None, then fresh, unpredictable entropy will be pulled from the OS. Defaults to 42.
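When resplitter is given as a dict, each key names a new split and each value is a tuple of existing splits merged, in order, to form it (the configuration consumed by MergeResplitter). A minimal pure-Python sketch of that merge rule, illustrative only and not the library's implementation:

```python
# Illustrative sketch (NOT the library's code) of the dict-based resplit rule:
# each key is a new split name; each value is a tuple of existing splits whose
# rows are concatenated, in order, to form the new split.
def merge_resplit(splits, config):
    return {
        new_name: [row for old_name in old_names for row in splits[old_name]]
        for new_name, old_names in config.items()
    }

splits = {"train": [1, 2], "valid": [3], "test": [4]}
merged = merge_resplit(splits, {"train": ("train", "valid"), "test": ("test",)})
# merged == {"train": [1, 2, 3], "test": [4]}
```

Splits not mentioned in the configuration (here, none) are dropped from the result, which is why "test" must be carried over explicitly.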

Examples

Use MNIST dataset for Federated Learning with 100 clients (edge devices):

>>> mnist_fds = FederatedDataset(dataset="mnist", partitioners={"train": 100})
>>> # Load partition for client with ID 10.
>>> partition = mnist_fds.load_partition(10, "train")
>>> # Use test split for centralized evaluation.
>>> centralized = mnist_fds.load_split("test")
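The int form of partitioners corresponds to IID partitioning: the (deterministically shuffled) split is divided into equally sized chunks. A rough sketch of that behavior, assuming the split size divides evenly (the real IidPartitioner also handles remainders):

```python
import random

def iid_partition(samples, num_partitions, seed=42):
    # Sketch only: shuffle deterministically with the given seed, then slice
    # into num_partitions contiguous, equally sized chunks (remainder ignored).
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    size = len(shuffled) // num_partitions
    return [shuffled[i * size:(i + 1) * size] for i in range(num_partitions)]

# MNIST's train split has 60,000 samples, so 100 clients get 600 samples each.
parts = iid_partition(range(60000), 100)
```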

Methods

load_partition(partition_id[, split])

Load the partition specified by partition_id from the selected split.

load_split(split)

Load the full split of the dataset.

Attributes

partitioners

Dictionary mapping each split to its associated partitioner.

load_partition(partition_id: int, split: str | None = None) Dataset[source]#

Load the partition specified by partition_id from the selected split.

The dataset is downloaded only when the first call to load_partition or load_split is made.

Parameters:
  • partition_id (int) – Partition index within the selected split; an integer in {0, …, num_partitions - 1}.

  • split (Optional[str]) – Name of the (partitioned) split (e.g. "train", "test"). You can skip this parameter if there is only one partitioner for the dataset; the name will then be inferred automatically. For example, with partitioners={"train": 10} you do not need to provide this argument, but with partitioners={"train": 10, "test": 100} you need to set it to indicate which partitioner should be used.

Returns:

partition – Single partition from the dataset split.

Return type:

Dataset
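The split-inference rule described above can be sketched in plain Python (a hypothetical helper, not the library's code): with a single partitioner the split name is taken from the partitioners dict; with several, an explicit split is required.

```python
def infer_split(partitioners, split=None):
    # Hypothetical sketch of the inference rule; not the library's implementation.
    # An explicitly given split always wins.
    if split is not None:
        return split
    # With exactly one partitioner, its split name can be inferred.
    if len(partitioners) == 1:
        return next(iter(partitioners))
    raise ValueError("`split` is required when multiple partitioners are configured.")

infer_split({"train": 10})                       # -> "train"
infer_split({"train": 10, "test": 100}, "test")  # -> "test"
```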

load_split(split: str) Dataset[source]#

Load the full split of the dataset.

The dataset is downloaded only when the first call to load_partition or load_split is made.

Parameters:

split (str) – Split name of the downloaded dataset (e.g. "train", "test").

Returns:

dataset_split – Part of the dataset identified by its split name.

Return type:

Dataset

property partitioners: Dict[str, Partitioner]#

Dictionary mapping each split to its associated partitioner.

The returned partitioners have the splits of the dataset assigned to them.