dataset#

Package Contents#

FedDataset

BaseDataset

Base dataset iterator

Subset

A data subset that supports different augmentations for different clients.

FCUBE

FCUBE data set.

Covtype

Covtype binary dataset from LIBSVM Data.

RCV1

RCV1 binary dataset from LIBSVM Data.

PathologicalMNIST

The partition strategy in FedAvg. See http://proceedings.mlr.press/v54/mcmahan17a?ref=https://githubhelp.com

RotatedMNIST

Rotate MNIST and partition them.

RotatedCIFAR10

Rotate CIFAR10 and partition them.

PartitionedMNIST

FedDataset with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.

PartitionedCIFAR10

FedDataset with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.

SyntheticDataset

class FedDataset#

Bases: object

preprocess()#

Define the dataset partition process

abstract get_dataset(id, type='train')#

Get dataset class

Parameters:
  • id (int) – ID of the client whose partial dataset is to be retrieved.

  • type (str, optional) – Type of dataset, one of ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError

abstract get_dataloader(id, batch_size, type='train')#

Get data loader

__len__()#
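
As a rough illustration of the interface above, the sketch below mirrors the documented methods on a stand-in base class so that it runs without FedLab or PyTorch installed. `FedDatasetSketch`, `ListFedDataset`, and the list-of-batches loader are hypothetical names and simplifications, not part of the library. A real subclass would derive from `FedDataset` and would typically return a `torch.utils.data.DataLoader`. Having `__len__` return the number of clients is also an assumption.

```python
from abc import ABC, abstractmethod


class FedDatasetSketch(ABC):
    """Stand-in for FedDataset (hypothetical; mirrors the documented methods)."""

    def preprocess(self):
        raise NotImplementedError()

    @abstractmethod
    def get_dataset(self, id, type="train"):
        raise NotImplementedError()

    @abstractmethod
    def get_dataloader(self, id, batch_size, type="train"):
        raise NotImplementedError()


class ListFedDataset(FedDatasetSketch):
    """Keeps one in-memory list of samples per client and per split."""

    def __init__(self, data):
        # data: {client_id: {"train": [...], "test": [...]}}
        self.data = data

    def get_dataset(self, id, type="train"):
        return self.data[id][type]

    def get_dataloader(self, id, batch_size, type="train"):
        samples = self.get_dataset(id, type)
        # A plain list of batches stands in for a torch DataLoader here.
        return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

    def __len__(self):
        # Assumption: the length of a federated dataset is its client count.
        return len(self.data)


fed = ListFedDataset({
    0: {"train": [1, 2, 3, 4, 5], "test": [6]},
    1: {"train": [7, 8], "test": [9]},
})
batches = fed.get_dataloader(0, batch_size=2)
```

Because `get_dataset` and `get_dataloader` are abstract, instantiating the base class directly fails; only subclasses that implement both can be created.
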
class BaseDataset(x, y)#

Bases: torch.utils.data.Dataset

Base dataset iterator

__len__()#
__getitem__(index)#
class Subset(dataset, indices, transform=None, target_transform=None)#

Bases: torch.utils.data.Dataset

A data subset that supports different augmentations for different clients.

Parameters:
  • dataset (Dataset) – The whole Dataset

  • indices (List[int]) – Indices of the samples in dataset that form the subset.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

__getitem__(index)#

Get item

Parameters:

index (int) – index

Returns:

(image, target) where target is index of the target class.

__len__()#
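
The idea behind `Subset` (one shared dataset, per-client indices and transforms) can be sketched in plain Python. `SubsetSketch` is a hypothetical stand-in that follows the documented constructor arguments; it does not inherit from `torch.utils.data.Dataset`, and the toy data and lambdas are illustrative only.

```python
class SubsetSketch:
    """Index-based view of a dataset with its own transforms (illustrative)."""

    def __init__(self, dataset, indices, transform=None, target_transform=None):
        self.dataset = dataset
        self.indices = indices
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, index):
        # Map the subset-local index to an index in the whole dataset.
        x, y = self.dataset[self.indices[index]]
        if self.transform is not None:
            x = self.transform(x)
        if self.target_transform is not None:
            y = self.target_transform(y)
        return x, y

    def __len__(self):
        return len(self.indices)


# Toy dataset of (feature, label) pairs shared by all clients.
full = [(float(i), i % 2) for i in range(10)]

# Two clients see different samples and apply different augmentations.
client_a = SubsetSketch(full, [0, 2, 4], transform=lambda x: x * 10)
client_b = SubsetSketch(full, [1, 3], transform=lambda x: -x)
```

The underlying data is stored once; each client only holds a list of indices and its own callables.
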
class FCUBE(root, train=True, generate=True, transform=None, target_transform=None, num_samples=4000)#

Bases: torch.utils.data.Dataset

FCUBE data set.

From paper Federated Learning on Non-IID Data Silos: An Experimental Study.

Parameters:
  • root (str) – Root for data file.

  • train (bool, optional) – Training set or test set. Defaults to True.

  • generate (bool, optional) – Whether to generate the synthetic dataset. If True, new synthetic FCUBE data is generated even if data already exists. Defaults to True.

  • transform (callable, optional) – A function/transform that takes in a numpy.ndarray and returns a transformed version.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

  • num_samples (int, optional) – Total number of samples to generate. We suggest using 4000 for the training set and 1000 for the test set. Defaults to 4000, the suggested training set size.

train_files#
test_files#
num_clients = 4#
_generate_train()#
_generate_test()#
_save_data()#
__len__()#
__getitem__(index)#
Parameters:

index (int) – Index

Returns:

(features, target) where target is index of the target class.

Return type:

tuple

class Covtype(root, train=True, train_ratio=0.75, transform=None, target_transform=None, download=False, generate=False, seed=None)#

Bases: torch.utils.data.Dataset

Covtype binary dataset from LIBSVM Data.

Parameters:
  • root (str) – Root directory of raw dataset to download if download is set to True.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. Defaults to None.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it. Defaults to None.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.

num_classes = 2#
num_features = 54#
url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2'#
source_file_name = 'covtype.libsvm.binary.bz2'#
download()#
generate()#
_local_npy_existence()#
_local_source_file_existence()#
__getitem__(index)#
Parameters:

index (int) – Index

Returns:

(features, target) where target is index of the target class.

Return type:

tuple

__len__()#
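
The `train_ratio` and `seed` arguments in the signature are not described in the parameter list. One plausible reading is a seeded random train/test split, sketched below with the standard library only. `train_test_indices` is a hypothetical helper, and the shuffle-then-cut procedure is an assumption rather than the library's confirmed behaviour.

```python
import random


def train_test_indices(num_samples, train_ratio=0.75, seed=None):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    # The first train_ratio share goes to training, the rest to testing.
    cut = int(num_samples * train_ratio)
    return indices[:cut], indices[cut:]


train_idx, test_idx = train_test_indices(100, train_ratio=0.75, seed=42)
again, _ = train_test_indices(100, train_ratio=0.75, seed=42)
```

Fixing `seed` makes the split reproducible across runs, which matters when the train and test objects are constructed separately.
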
class RCV1(root, train=True, train_ratio=0.75, transform=None, target_transform=None, download=False, generate=False, seed=None)#

Bases: torch.utils.data.Dataset

RCV1 binary dataset from LIBSVM Data.

Parameters:
  • root (str) – Root directory of raw dataset to download if download is set to True.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. Defaults to None.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it. Defaults to None.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.

num_classes = 2#
num_features = 47236#
url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2'#
source_file_name = 'rcv1_train.binary.bz2'#
download()#
generate()#
_local_npy_existence()#
_local_source_file_existence()#
__getitem__(index)#
Parameters:

index (int) – Index

Returns:

(features, target) where target is index of the target class.

Return type:

tuple

__len__()#
class PathologicalMNIST(root, path, num_clients=100, shards=200, download=True, preprocess=False)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

The partition strategy in FedAvg. See http://proceedings.mlr.press/v54/mcmahan17a?ref=https://githubhelp.com

Parameters:
  • root (str) – Path to download raw dataset.

  • path (str) – Path to save partitioned subdataset.

  • num_clients (int) – Number of clients.

  • shards (int, optional) – Sort the dataset by the label, and uniformly partition them into shards. Then

  • download (bool, optional) – Download. Defaults to True.

preprocess(download=True)#

Define the dataset partition process
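
The shard-based strategy from the FedAvg paper (sort by label, cut into equal shards, hand each client a few shards) can be sketched as follows. `shard_partition` is a hypothetical helper, not the library's implementation, and giving every client exactly `shards // num_clients` shards is an assumption taken from the paper's setup.

```python
import random


def shard_partition(labels, num_clients, shards, seed=0):
    """Map each client ID to a list of sample indices (illustrative)."""
    rng = random.Random(seed)
    # Sort sample indices by label so each shard holds few distinct labels.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shard_size = len(order) // shards
    # Shuffle shard IDs, then deal them out evenly to the clients.
    shard_ids = list(range(shards))
    rng.shuffle(shard_ids)
    per_client = shards // num_clients
    partition = {}
    for cid in range(num_clients):
        picked = shard_ids[cid * per_client:(cid + 1) * per_client]
        partition[cid] = [
            idx
            for s in picked
            for idx in order[s * shard_size:(s + 1) * shard_size]
        ]
    return partition


# Toy labels: 10 classes with 100 samples each.
labels = [i % 10 for i in range(1000)]
parts = shard_partition(labels, num_clients=10, shards=20)
```

With 20 shards of 50 samples and 100 samples per class, every shard contains a single class, so each client ends up with at most two classes. This is what makes the partition "pathological" non-IID.
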

get_dataset(id, type='train')#

Load the subdataset for the client with client ID id from a local file.

Parameters:
  • id (int) – Client ID.

  • type (str, optional) – Dataset type, can be "train", "val" or "test". Defaults to "train".

Returns:

Dataset

get_dataloader(id, batch_size=None, type='train')#

Return the DataLoader for the client with client ID id.

Parameters:
  • id (int) – Client ID.

  • batch_size (int, optional) – Batch size in DataLoader.

  • type (str, optional) – Dataset type, can be "train", "val" or "test". Defaults to "train".

class RotatedMNIST(root, path, num)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

Rotate MNIST and partition them.

Parameters:
  • root (str) – Path to download raw dataset.

  • path (str) – Path to save partitioned subdataset.

  • num (int) – Number of clients.

preprocess(thetas=[0, 90, 180, 270], download=True)#

Define the dataset partition process

get_dataset(id, type='train')#

Get dataset class

Parameters:
  • id (int) – ID of the client whose partial dataset is to be retrieved.

  • type (str, optional) – Type of dataset, one of ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError

get_data_loader(id, batch_size=None, type='train')#
class RotatedCIFAR10(root, save_dir, num_clients)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

Rotate CIFAR10 and partition them.

Parameters:
  • root (str) – Path to download raw dataset.

  • save_dir (str) – Path to save partitioned subdataset.

  • num_clients (int) – Number of clients.

preprocess(shards, thetas=[0, 180])#

Define the dataset partition process

Parameters:
  • shards

  • thetas (list, optional) – Rotation angles. Defaults to [0, 180].

get_dataset(id, type='train')#

Get dataset class

Parameters:
  • id (int) – ID of the client whose partial dataset is to be retrieved.

  • type (str, optional) – Type of dataset, one of ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError

get_data_loader(id, batch_size=None, type='train')#
class PartitionedMNIST(root, path, num_clients, download=True, preprocess=False, partition='iid', dir_alpha=None, verbose=True, seed=None, transform=None, target_transform=None)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

FedDataset with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.

Parameters:
  • root (str) – Path to download raw dataset.

  • path (str) – Path to save partitioned subdataset.

  • num_clients (int) – Number of clients.

  • download (bool) – Whether to download the raw dataset.

  • preprocess (bool) – Whether to preprocess the dataset.

  • partition (str, optional) – Partition name. Only the "noniid-#label", "noniid-labeldir", "unbalance" and "iid" partition schemes are supported.

  • dir_alpha (float, optional) – Dirichlet distribution parameter for non-iid partition. Only works if partition="noniid-labeldir". Defaults to None.

  • verbose (bool, optional) – Whether to print the partition process. Defaults to True.

  • seed (int, optional) – Random seed. Defaults to None.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

preprocess(partition='iid', dir_alpha=None, verbose=True, seed=None, download=True, transform=None, target_transform=None)#

Perform FL partition on the dataset, and save each subset for each client into data{cid}.pkl file.

For details of partition schemes, please check Federated Dataset and DataPartitioner.
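
To make the role of `dir_alpha` concrete, here is a minimal sketch of a label-wise Dirichlet partition using only the standard library. `dirichlet_label_partition` is a hypothetical helper and not the library's partitioner; the Dirichlet draw is built from normalised Gamma samples, and the per-class splitting rule is an assumption about how such a scheme is commonly implemented.

```python
import random


def dirichlet_label_partition(labels, num_clients, dir_alpha, seed=0):
    """Split each class across clients with Dirichlet(dir_alpha) proportions."""
    rng = random.Random(seed)
    partition = {cid: [] for cid in range(num_clients)}
    for label in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalised.
        gammas = [rng.gammavariate(dir_alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        start = 0
        cumulative = 0.0
        for cid in range(num_clients):
            cumulative += gammas[cid] / total
            # The last client takes whatever is left, so no sample is dropped.
            if cid == num_clients - 1:
                end = len(idxs)
            else:
                end = int(round(cumulative * len(idxs)))
            partition[cid].extend(idxs[start:end])
            start = end
    return partition


# Toy labels: 5 classes with 40 samples each.
labels = [i % 5 for i in range(200)]
parts = dirichlet_label_partition(labels, num_clients=4, dir_alpha=0.5, seed=1)
```

Smaller `dir_alpha` values concentrate each class on fewer clients (more heterogeneous), while larger values approach an even split.
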

get_dataset(cid, type='train')#

Load the subdataset for the client with client ID cid from a local file.

Parameters:
  • cid (int) – Client ID.

  • type (str, optional) – Dataset type, can be "train", "val" or "test". Defaults to "train".

Returns:

Dataset
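
The save-then-load round trip described above (one `data{cid}.pkl` file per client, read back by client ID) might look like the sketch below. The helper names, the use of `pickle`, and the `path/<type>/data{cid}.pkl` directory layout are assumptions for illustration; the library may serialise and organise files differently.

```python
import os
import pickle
import tempfile


def save_client_subsets(path, subsets, type="train"):
    """Write one data{cid}.pkl file per client under path/<type>/."""
    split_dir = os.path.join(path, type)
    os.makedirs(split_dir, exist_ok=True)
    for cid, subset in subsets.items():
        with open(os.path.join(split_dir, f"data{cid}.pkl"), "wb") as f:
            pickle.dump(subset, f)


def load_client_subset(path, cid, type="train"):
    """Read back the subset saved for a single client."""
    with open(os.path.join(path, type, f"data{cid}.pkl"), "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    save_client_subsets(tmp, {0: [(0.1, 1), (0.2, 0)], 1: [(0.3, 1)]})
    loaded = load_client_subset(tmp, cid=1)
    files = sorted(os.listdir(os.path.join(tmp, "train")))
```

Persisting each client's subset separately means a simulated client only needs to read its own file, not the whole dataset.
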

get_dataloader(cid, batch_size=None, type='train')#

Return the DataLoader for the client with client ID cid.

Parameters:
  • cid (int) – Client ID.

  • batch_size (int, optional) – Batch size in DataLoader.

  • type (str, optional) – Dataset type, can be "train", "val" or "test". Defaults to "train".

class PartitionedCIFAR10(root, path, dataname, num_clients, download=True, preprocess=False, balance=True, partition='iid', unbalance_sgm=0, num_shards=None, dir_alpha=None, verbose=True, seed=None, transform=None, target_transform=None)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

FedDataset with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.

Parameters:
  • root (str) – Path to download raw dataset.

  • path (str) – Path to save partitioned subdataset.

  • dataname (str) – "cifar10" or "cifar100".

  • num_clients (int) – Number of clients.

  • download (bool) – Whether to download the raw dataset.

  • preprocess (bool) – Whether to preprocess the dataset.

  • balance (bool, optional) – Whether the partition is balanced over all clients. Defaults to True.

  • partition (str, optional) – Partition type; only "iid", "shards" and "dirichlet" are supported. Defaults to "iid".

  • unbalance_sgm (float, optional) – Log-normal distribution variance for unbalanced data partition over clients. Defaults to 0 for balanced partition.

  • num_shards (int, optional) – Number of shards in non-iid "shards" partition. Only works if partition="shards". Defaults to None.

  • dir_alpha (float, optional) – Dirichlet distribution parameter for non-iid partition. Only works if partition="dirichlet". Defaults to None.

  • verbose (bool, optional) – Whether to print the partition process. Defaults to True.

  • seed (int, optional) – Random seed. Defaults to None.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

preprocess(balance=True, partition='iid', unbalance_sgm=0, num_shards=None, dir_alpha=None, verbose=True, seed=None, download=True)#

Perform FL partition on the dataset, and save each subset for each client into data{cid}.pkl file.

For details of partition schemes, please check Federated Dataset and DataPartitioner.
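
The effect of `unbalance_sgm` can be illustrated by drawing per-client sample counts from a log-normal distribution and rescaling them to the dataset size. `lognormal_client_sizes` is a hypothetical helper; treating `unbalance_sgm` as the sigma argument of the log-normal draw, and the way rounding leftovers are assigned, are both assumptions.

```python
import math
import random


def lognormal_client_sizes(num_samples, num_clients, unbalance_sgm, seed=0):
    """Return one sample count per client; counts sum to num_samples."""
    rng = random.Random(seed)
    avg = num_samples / num_clients
    if unbalance_sgm == 0:
        # Zero spread degenerates to the balanced case.
        raw = [avg] * num_clients
    else:
        raw = [
            rng.lognormvariate(math.log(avg), unbalance_sgm)
            for _ in range(num_clients)
        ]
    # Rescale so that the counts add up to the dataset size.
    scale = num_samples / sum(raw)
    sizes = [int(r * scale) for r in raw]
    # Give the rounding remainder to the first client so the total is exact.
    sizes[0] += num_samples - sum(sizes)
    return sizes


balanced = lognormal_client_sizes(1000, 10, unbalance_sgm=0)
unbalanced = lognormal_client_sizes(1000, 10, unbalance_sgm=0.5)
```

A larger `unbalance_sgm` spreads the client sizes further apart, while 0 reproduces the balanced split.
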

get_dataset(cid, type='train')#

Load the subdataset for the client with client ID cid from a local file.

Parameters:
  • cid (int) – Client ID.

  • type (str, optional) – Dataset type, can be "train", "val" or "test". Defaults to "train".

Returns:

Dataset

get_dataloader(cid, batch_size=None, type='train')#

Return the DataLoader for the client with client ID cid.

Parameters:
  • cid (int) – Client ID.

  • batch_size (int, optional) – Batch size in DataLoader.

  • type (str, optional) – Dataset type, can be "train", "val" or "test". Defaults to "train".

class SyntheticDataset(root, path, preprocess=False)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

preprocess(root, path, partition=0.2)#

Preprocess the raw data to fedlab dataset format.

Parameters:
  • root (str) – Path to the raw data.

  • path (str) – Path to save the preprocessed datasets.

  • partition (float, optional) – The proportion of the test set. Defaults to 0.2.

get_dataset(id, type='train')#

Get dataset class

Parameters:
  • id (int) – ID of the client whose partial dataset is to be retrieved.

  • type (str, optional) – Type of dataset, one of ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError

get_dataloader(id, batch_size, type='train')#

Get data loader