Federated Dataset and DataPartitioner#

In the real world, FL has to handle a variety of data distribution scenarios, both IID and non-IID. Although datasets and partition schemes already exist for published benchmarks, partitioning a dataset to fit a specific research problem, and keeping the partition results consistent during simulation, can still be messy and difficult. FedLab provides fedlab.utils.dataset.partition.DataPartitioner, which lets you use pre-partitioned datasets as well as your own data. Given a data partition scheme, DataPartitioner stores the sample indices for each client. FedLab also provides some extra datasets that are used in current FL research but are not yet available in PyTorch's official torchvision.datasets.
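To make the idea of "storing sample indices for each client" concrete, here is a minimal standard-library sketch of an IID partition. It is not FedLab's implementation: the function name `iid_partition` and the variable `client_dict` are illustrative assumptions, and a real partitioner would typically work on the dataset's targets rather than only on a sample count.

```python
import random


def iid_partition(num_samples, num_clients, seed=0):
    """Hypothetical helper: map each client id to a list of sample indices.

    Indices are shuffled once, then dealt out round-robin, so every sample
    belongs to exactly one client and client sizes differ by at most one.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    # Client `cid` takes every `num_clients`-th shuffled index, starting at `cid`.
    return {cid: sorted(indices[cid::num_clients]) for cid in range(num_clients)}


client_dict = iid_partition(num_samples=10, num_clients=3)
for cid, idxs in client_dict.items():
    print(f"client {cid}: {idxs}")
```

Because only indices are stored, the same mapping can be reused across simulation runs to select each client's subset of the underlying dataset.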

Note

The current implementation and design of this part are based on LEAF [2], Acar et al. [5], Yurochkin et al. [6] and NIID-Bench [7].

Vision Data#

CIFAR10#

FedLab provides a number of pre-defined partitioners for some datasets (such as CIFAR10). Each one subclasses fedlab.utils.dataset.partition.DataPartitioner and implements functions specific to a particular partition scheme. They can be used to prototype and benchmark your FL algorithms.

Tutorial for CIFAR10Partitioner: CIFAR10 tutorial.
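As an illustration of what a non-IID partition scheme can look like, the sketch below splits each class across clients using Dirichlet-distributed proportions, a label-skew scheme commonly used in FL benchmarks. It is a standard-library approximation under stated assumptions, not the code behind CIFAR10Partitioner: the function name `dirichlet_label_partition`, its parameters, and the toy labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def dirichlet_label_partition(targets, num_clients, alpha=0.5, seed=0):
    """Hypothetical helper: label-skew partition with Dirichlet proportions.

    For every class, draw client proportions from Dirichlet(alpha) and hand
    out that class's sample indices accordingly. Smaller alpha gives a more
    skewed (more non-IID) split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(targets):
        by_class[label].append(idx)

    client_dict = {cid: [] for cid in range(num_clients)}
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        start, cumulative = 0, 0.0
        for cid in range(num_clients):
            cumulative += gammas[cid] / total
            if cid == num_clients - 1:
                end = len(idxs)  # last client takes whatever remains
            else:
                end = min(len(idxs), round(cumulative * len(idxs)))
            client_dict[cid].extend(idxs[start:end])
            start = end
    return client_dict


# Toy labels standing in for a 10-class dataset such as CIFAR10.
targets = [i % 10 for i in range(1000)]
client_dict = dirichlet_label_partition(targets, num_clients=5, alpha=0.5)
for cid, idxs in client_dict.items():
    print(cid, len(idxs), dict(Counter(targets[i] for i in idxs)))
```

Printing the per-client class counts is a quick way to check how skewed a partition is before running a simulation on it.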

CIFAR100#

Notebook tutorial for CIFAR100Partitioner: CIFAR100 tutorial.

FMNIST#

Notebook tutorial for data partition of FMNIST (FashionMNIST): FMNIST tutorial.

MNIST#

MNIST is very similar to FMNIST; please check the FMNIST tutorial.

SVHN#

Data partition tutorial for SVHN: SVHN tutorial.

CelebA#

Data partition for CelebA: CelebA tutorial.

FEMNIST#

Data partition of FEMNIST: FEMNIST tutorial.

Text Data#

Shakespeare#

Data partition of Shakespeare dataset: Shakespeare tutorial.

Sent140#

Data partition of Sent140: Sent140 tutorial.

Reddit#

Data partition of Reddit: Reddit tutorial.

Tabular Data#

Adult#

Adult is from LIBSVM Data. Its original source is from UCI/Adult. FedLab provides both Dataset and DataPartitioner for Adult. Notebook tutorial for Adult: Adult tutorial.

Covtype#

Covtype is from LIBSVM Data. Its original source is from UCI/Covtype. FedLab provides both Dataset and DataPartitioner for Covtype. Notebook tutorial for Covtype: Covtype tutorial.

RCV1#

RCV1 is from LIBSVM Data. Its original source is from UCI/RCV1. FedLab provides both Dataset and DataPartitioner for RCV1. Notebook tutorial for RCV1: RCV1 tutorial.

Synthetic Data#

FCUBE#

FCUBE is a synthetic dataset for federated learning. FedLab provides both Dataset and DataPartitioner for FCUBE. Tutorial for FCUBE: FCUBE tutorial.

LEAF-Synthetic#

LEAF-Synthetic is a federated dataset proposed by LEAF. The number of clients, the number of classes, and the feature dimension can all be customized by the user.

Please check LEAF-Synthetic for more details.
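To show what customizing the client number, class number, and feature dimension might look like in practice, here is a generic standard-library sketch of a synthetic federated dataset generator. It does not reproduce LEAF's actual generation procedure: the function `make_synthetic_clients`, its parameters, and the per-client linear labeling rule are assumptions chosen only for illustration.

```python
import random


def make_synthetic_clients(num_clients, num_classes, dimension,
                           samples_per_client=20, seed=0):
    """Hypothetical generator: one (features, labels) pair per client.

    Each client gets its own random linear scoring model, so the mapping
    from features to labels differs between clients (a non-IID setting).
    """
    rng = random.Random(seed)
    clients = {}
    for cid in range(num_clients):
        # One weight row per class, drawn independently for this client.
        weights = [[rng.gauss(0.0, 1.0) for _ in range(dimension)]
                   for _ in range(num_classes)]
        features, labels = [], []
        for _ in range(samples_per_client):
            x = [rng.gauss(0.0, 1.0) for _ in range(dimension)]
            scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
            features.append(x)
            labels.append(scores.index(max(scores)))  # argmax over classes
        clients[cid] = (features, labels)
    return clients


clients = make_synthetic_clients(num_clients=4, num_classes=3, dimension=5)
for cid, (features, labels) in clients.items():
    print(cid, len(features), sorted(set(labels)))
```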