DirichletPartitioner#

class DirichletPartitioner(num_partitions: int, partition_by: str, alpha: int | float | List[float] | ndarray[Any, dtype[float64]], min_partition_size: int = 10, self_balancing: bool = False, shuffle: bool = True, seed: int | None = 42)[source]#

Bases: Partitioner

Partitioner based on Dirichlet distribution.

Implementation based on "Bayesian Nonparametric Federated Learning of Neural Networks" (https://arxiv.org/abs/1905.12022).

The algorithm divides the data label by label. For each label, the fractions of its samples that go to each partition are drawn from a Dirichlet distribution (and adjusted if self-balancing is enabled), and the samples are assigned accordingly. If the resulting partitions do not satisfy min_partition_size, the whole process is repeated (the fractions change between runs because the sampling is random, even though alpha stays the same).
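A minimal, self-contained sketch of that idea (an illustration only, not the partitioner's internal code; all variable names are made up):

import numpy as np

rng = np.random.default_rng(42)
labels = np.array([0, 1, 0, 1, 2, 2, 0, 1, 2, 0])  # toy targets
num_partitions, alpha = 3, 0.5

partition_indices = {pid: [] for pid in range(num_partitions)}
for label in np.unique(labels):
    label_idx = np.nonzero(labels == label)[0]
    rng.shuffle(label_idx)
    # Per-partition fractions for this label, drawn from Dirichlet(alpha, ..., alpha)
    fractions = rng.dirichlet([alpha] * num_partitions)
    split_points = (np.cumsum(fractions)[:-1] * len(label_idx)).astype(int)
    for pid, chunk in enumerate(np.split(label_idx, split_points)):
        partition_indices[pid].extend(chunk.tolist())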

The notion of self-balancing is made explicit here (it is not mentioned in the paper but is implemented in the authors' code). It is a mechanism that excludes a partition from receiving new samples once its current number of samples exceeds the average number it would hold under a perfectly even data distribution. It is controlled by the `self_balancing` parameter.
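Continuing the sketch above, the balancing rule can be pictured as follows (again an illustration of the assumed behavior, not the actual implementation):

# A partition stops being eligible for new samples once it already holds
# more than the average share it would get under a perfectly even split.
avg_size = len(labels) / num_partitions
eligible = [
    pid for pid in range(num_partitions)
    if len(partition_indices[pid]) <= avg_size
]
# Subsequent samples would only be distributed among `eligible` partitions.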

Parameters:
  • num_partitions (int) – The total number of partitions that the data will be divided into.

  • partition_by (str) – Column name of the labels (targets) based on which Dirichlet sampling works.

  • alpha (Union[int, float, List[float], NDArrayFloat]) – Concentration parameter of the Dirichlet distribution (see the note after this parameter list).

  • min_partition_size (int) – The minimum number of samples that each partition will have (the sampling process is repeated if any partition is too small).

  • self_balancing (bool) – Whether to enable self-balancing, i.e., stop assigning further samples to a partition once its number of samples exceeds the average number of samples per partition (set to True in the original paper’s code, although not mentioned in the paper itself).

  • shuffle (bool) – Whether to randomize the order of samples. Shuffling is applied after the samples are assigned to partitions.

  • seed (Optional[int]) – Seed used for dataset shuffling. It has no effect if shuffle is False.
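Note on alpha: a single scalar applies the same concentration to every partition, while a list or array is presumably interpreted as a per-partition concentration vector of length num_partitions, as the type hint suggests. A hedged illustration (values chosen arbitrarily):

>>> partitioner = DirichletPartitioner(
>>>     num_partitions=3, partition_by="label",
>>>     alpha=[0.1, 0.5, 1.0],  # hypothetical per-partition concentrations
>>> )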

Examples

>>> from flwr_datasets import FederatedDataset
>>> from flwr_datasets.partitioner import DirichletPartitioner
>>>
>>> partitioner = DirichletPartitioner(num_partitions=10, partition_by="label",
>>>                                    alpha=0.5, min_partition_size=10,
>>>                                    self_balancing=True)
>>> fds = FederatedDataset(dataset="mnist", partitioners={"train": partitioner})
>>> partition = fds.load_partition(0)
>>> print(partition[0])  # Print the first example
{'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=28x28 at 0x127B92170>,
'label': 4}
>>> partition_sizes = [
>>>     len(fds.load_partition(partition_id)) for partition_id in range(10)
>>> ]
>>> print(sorted(partition_sizes))
[2134, 2615, 3646, 6011, 6170, 6386, 6715, 7653, 8435, 10235]
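
To inspect the label skew produced by a given draw, one can count labels per partition using the objects created above (the counts vary from run to run and depend on alpha):

>>> from collections import Counter
>>> for partition_id in range(3):
>>>     partition = fds.load_partition(partition_id)
>>>     print(partition_id, Counter(partition["label"]))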

Methods

is_dataset_assigned()

Check if a dataset has been assigned to the partitioner.

load_partition(partition_id)

Load a partition based on the partition index.

Attributes

dataset

Dataset property.

num_partitions

Total number of partitions.

property dataset: Dataset#

Dataset property.

is_dataset_assigned() bool#

Check if a dataset has been assigned to the partitioner.

This method returns True if a dataset is already set for the partitioner, otherwise, it returns False.

Returns:

dataset_assigned – True if a dataset is assigned, otherwise False.

Return type:

bool

load_partition(partition_id: int) Dataset[source]#

Load a partition based on the partition index.

Parameters:

partition_id (int) – the index that corresponds to the requested partition

Returns:

dataset_partition – single partition of a dataset

Return type:

Dataset

property num_partitions: int#

Total number of partitions.