Use with Local Data#

You can partition your local files and Python objects in Flower Datasets library using any available Partitioner.

This guide details how to create a Hugging Face Dataset which is the required type of input for Partitioners. We will cover:

  • local files: CSV, JSON, image, audio,

  • in-memory data: dictionary, list, pd.DataFrame, np.ndarray.

General Overview#

An all-in-one dataset preparation (downloading, preprocessing, partitioning) happens using FederatedDataset. However, we will use only the Partitioner here since we use locally accessible data.

The rest of this guide will explain how to create a Dataset from local files and existing (in memory) Python objects.

Local Files#

CSV#

from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner

# Single file
data_files = "path-to-my-file.csv"

# Multiple Files
data_files = [ "path-to-my-file-1.csv", "path-to-my-file-2.csv", ...]
dataset = load_dataset("csv", data_files=data_files)

# Divided Dataset
data_files = {
  "train": single_train_file_or_list_of_files,
  "test": single_test_file_or_list_of_files,
  "can-have-more-splits": ...
}
dataset = load_dataset("csv", data_files=data_files)

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

JSON#

from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner

# Single file
data_files = "path-to-my-file.json"

# Multitple Files
data_files = [ "path-to-my-file-1.json", "path-to-my-file-2.json", ...]
dataset = load_dataset("json", data_files=data_files)

# Divided Dataset
data_files = {
  "train": single_train_file_or_list_of_files,
   "test": single_test_file_or_list_of_files,
   "can-have-more-splits": ...
}
dataset = load_dataset("json", data_files=data_files)

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

Image#

You can create an image dataset in two ways:

  1. give a path the directory

The directory needs to be structured in the following way: dataset-name/split/class/name. For example:

mnist/train/1/unique_name.png
mnist/train/1/unique_name.png
mnist/train/2/unique_name.png
...
mnist/test/1/unique_name.png
mnist/test/1/unique_name.png
mnist/test/2/unique_name.png

Then, the path you can give is ./mnist.

from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner

# Directly from a directory
dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
  1. create a dataset from a CSV/JSON file and cast the path column to Image.

from datasets import Image, load_dataset
from flwr_datasets.partitioner import ChosenPartitioner

dataset = load_dataset(...)
dataset = dataset.cast_column("path", Image())

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

Audio#

Analogously to the image datasets, there are two methods here:

  1. give a path to the directory

from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner

dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
  1. create a dataset from a CSV/JSON file and cast the path column to Audio.

from datasets import Audio, load_dataset
from flwr_datasets.partitioner import ChosenPartitioner

dataset = load_dataset(...)
dataset = dataset.cast_column("path", Audio())

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

In-Memory#

From dictionary#

from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
dataset = Dataset.from_dict(data)

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

From list#

from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner

my_list = [
  {"features": 1, "labels": 0},
  {"features": 2, "labels": 0},
  {"features": 3, "labels": 1}
]
dataset = Dataset.from_list(my_list)

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

From pd.DataFrame#

from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner

data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

From np.ndarray#

The np.ndarray will be first transformed to pd.DataFrame

from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner

data = np.array([[1, 2, 3], [0, 0, 1]]).T
# You can add the column names by passing columns=["features", "labels"]
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)

partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)

Partitioner Details#

Partitioning is triggered automatically during the first load_partition call. You do not need to call any “do_partitioning” method.

Partitioner abstraction is designed to allow for a single dataset assignment.

partitioner.dataset = your_dataset

If you need to do the same partitioning on a different dataset, create a new Partitioner for that, e.g.:

from flwr_datasets.partitioner import IidPartitioner

iid_partitioner_for_mnist = IidPartitioner(num_partitions=10)
iid_partitioner_for_mnist.dataset = mnist_dataset

iid_partitioner_for_cifar = IidPartitioner(num_partitions=10)
iid_partitioner_for_cifar.dataset = cifar_dataset

More Resources#

If you are looking for more details or you have not found the format you are looking for, please visit the HuggingFace Datasets docs. This guide is based on the following ones: