dataset#

Package Contents#

`FedDataset`
`BaseDataset`	Base dataset iterator
`Subset`	Data subset that applies a different augmentation for each client.
`FCUBE`	FCUBE data set.
`Covtype`	Covtype binary dataset from LIBSVM Data.
`RCV1`	RCV1 binary dataset from LIBSVM Data.
`PathologicalMNIST`	The partition strategy in FedAvg. See http://proceedings.mlr.press/v54/mcmahan17a?ref=https://githubhelp.com
`RotatedMNIST`	Rotate MNIST and partition them.
`RotatedCIFAR10`	Rotate CIFAR10 and partition them.
`PartitionedMNIST`	`FedDataset` with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.
`PartitionedCIFAR10`	`FedDataset` with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.
`SyntheticDataset`

class FedDataset#

Bases: object

preprocess()#: Define the dataset partition process

abstract get_dataset(id, type='train')#

Get dataset class

Parameters:

id (int) – ID of the client whose partial dataset is retrieved.
type (str, optional) – Type of dataset, can be chosen from ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError –

abstract get_dataloader(id, batch_size, type='train')#: Get data loader

__len__()#
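The interface above can be pictured with a small dependency-free sketch. Everything in it is hypothetical: `FedDatasetSketch` only mirrors the method names documented here, `ToyFedDataset` keeps its partitions in memory, plain lists stand in for PyTorch `Dataset` and `DataLoader` objects, and returning the number of clients from `__len__` is an assumption rather than documented behaviour.

```python
from abc import ABC, abstractmethod


class FedDatasetSketch(ABC):
    """Stand-in mirroring the documented method names (not the FedLab class)."""

    def preprocess(self):
        """Define the dataset partition process."""
        raise NotImplementedError

    @abstractmethod
    def get_dataset(self, id, type="train"):
        """Return the partial dataset of client `id`."""

    @abstractmethod
    def get_dataloader(self, id, batch_size, type="train"):
        """Return batches over the partial dataset of client `id`."""


class ToyFedDataset(FedDatasetSketch):
    """Hypothetical subclass that partitions an in-memory list round-robin."""

    def __init__(self, samples, num_clients):
        self.samples = samples
        self.num_clients = num_clients
        self.parts = {}
        self.preprocess()

    def preprocess(self):
        # Client cid receives samples cid, cid + num_clients, cid + 2 * num_clients, ...
        for cid in range(self.num_clients):
            self.parts[cid] = {"train": self.samples[cid::self.num_clients]}

    def get_dataset(self, id, type="train"):
        # `id` and `type` shadow builtins; the names follow the documented signature.
        return self.parts[id][type]

    def get_dataloader(self, id, batch_size, type="train"):
        data = self.get_dataset(id, type)
        # A list of batches stands in for a DataLoader.
        return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

    def __len__(self):
        # Assumption: the length of a federated dataset is its number of clients.
        return self.num_clients


fed = ToyFedDataset(list(range(10)), num_clients=3)
batches = fed.get_dataloader(0, batch_size=2)
```

A concrete FedLab dataset would typically write the partitions to disk in `preprocess()` and read them back in `get_dataset()`, as the classes further down do.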

class BaseDataset(x, y)#

Bases: torch.utils.data.Dataset

Base dataset iterator

__len__()#

__getitem__(index)#

class Subset(dataset, indices, transform=None, target_transform=None)#

Bases: torch.utils.data.Dataset

Data subset that applies a different augmentation for each client.

Parameters:

dataset (Dataset) – The whole Dataset
indices (List[int]) – Indices of the samples in dataset that make up the subset.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

__getitem__(index)#

Get item

Parameters:

index (int) – Index

Returns:

(image, target) where target is index of the target class.

__len__()#
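The idea behind `Subset` (an index-based view on a shared dataset, with transforms chosen per client) can be sketched without PyTorch. `SubsetSketch` below is a hypothetical stand-in that reuses the documented constructor arguments; it assumes the wrapped dataset yields `(sample, target)` pairs, which may not match how the real class stores its data.

```python
class SubsetSketch:
    """Index-based view on a dataset with its own transforms (illustrative only)."""

    def __init__(self, dataset, indices, transform=None, target_transform=None):
        self.dataset = dataset
        self.indices = indices
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, index):
        # Map the subset-local index to an index in the whole dataset.
        sample, target = self.dataset[self.indices[index]]
        if self.transform is not None:
            sample = self.transform(sample)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return sample, target

    def __len__(self):
        return len(self.indices)


# Toy "dataset": (sample, target) pairs held in a list.
full = [(value, value % 2) for value in range(6)]

# Two clients share the same data source but apply different augmentations.
client_a = SubsetSketch(full, [0, 2, 4], transform=lambda x: x * 10)
client_b = SubsetSketch(full, [1, 3, 5], transform=lambda x: -x)
```

Because each view only stores indices, the underlying data is held once no matter how many clients are simulated.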

class FCUBE(root, train=True, generate=True, transform=None, target_transform=None, num_samples=4000)#

Bases: torch.utils.data.Dataset

FCUBE data set.

From paper Federated Learning on Non-IID Data Silos: An Experimental Study.

Parameters:

root (str) – Root for data file.
train (bool, optional) – Training set or test set. Default as True.
generate (bool, optional) – Whether to generate the synthetic dataset. If True, new synthetic FCUBE data is generated even if data already exists. Default as True.
transform (callable, optional) – A function/transform that takes in a numpy.ndarray and returns a transformed version.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
num_samples (int, optional) – Total number of samples to generate. We suggest using 4000 for the training set and 1000 for the test set. Defaults to 4000, the suggested training-set size.

train_files#

test_files#

num_clients = 4#

_generate_train()#

_generate_test()#

_save_data()#

__len__()#

__getitem__(index)#

Parameters:

index (int) – Index

Returns:

(features, target) where target is index of the target class.

Return type:

tuple

class Covtype(root, train=True, train_ratio=0.75, transform=None, target_transform=None, download=False, generate=False, seed=None)#

Bases: torch.utils.data.Dataset

Covtype binary dataset from LIBSVM Data.

Parameters:

root (str) – Root directory of raw dataset to download if download is set to True.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. Default as None.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it. Default as None.
download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.

num_classes = 2#

num_features = 54#

url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2'#

source_file_name = 'covtype.libsvm.binary.bz2'#

download()#

generate()#

_local_npy_existence()#

_local_source_file_existence()#

__getitem__(index)#

Parameters:

index (int) – Index

Returns:

(features, target) where target is index of the target class.

Return type:

tuple

__len__()#
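The source file is distributed in the LIBSVM text format, where each line reads `<label> <index>:<value> ...` with 1-based feature indices and omitted features equal to zero. The helper below is a hypothetical sketch of turning one such line into a dense vector of `num_features` entries; how `generate()` actually parses the file is not stated here and may rely on a library loader instead.

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM-format line into (dense_features, label).

    Illustrative helper, not part of the documented API.
    """
    parts = line.split()
    label = int(float(parts[0]))
    features = [0.0] * num_features
    for item in parts[1:]:
        index, value = item.split(":")
        # LIBSVM indices start at 1; Python lists start at 0.
        features[int(index) - 1] = float(value)
    return features, label


# Made-up line using the 54 features of Covtype.
features, label = parse_libsvm_line("2 1:0.5 3:1 54:0.25", num_features=54)
```

For a dataset as wide as RCV1 (47236 features) a dense list per sample is wasteful, so a sparse representation would be the more natural choice there.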

class RCV1(root, train=True, train_ratio=0.75, transform=None, target_transform=None, download=False, generate=False, seed=None)#

Bases: torch.utils.data.Dataset

RCV1 binary dataset from LIBSVM Data.

Parameters:

root (str) – Root directory of raw dataset to download if download is set to True.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. Default as None.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it. Default as None.
download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.

num_classes = 2#

num_features = 47236#

url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2'#

source_file_name = 'rcv1_train.binary.bz2'#

download()#

generate()#

_local_npy_existence()#

_local_source_file_existence()#

__getitem__(index)#

Parameters:

index (int) – Index

Returns:

(features, target) where target is index of the target class.

Return type:

tuple

__len__()#

class PathologicalMNIST(root, path, num_clients=100, shards=200, download=True, preprocess=False)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

The partition strategy in FedAvg. See http://proceedings.mlr.press/v54/mcmahan17a?ref=https://githubhelp.com

Parameters:

root (str) – Path to download raw dataset.
path (str) – Path to save partitioned subdataset.
num_clients (int) – Number of clients.
shards (int, optional) – Sort the dataset by the label, and uniformly partition them into shards. Then
download (bool, optional) – Download. Defaults to True.
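The FedAvg paper describes the pathological split as sorting the data by digit label, cutting it into 200 shards of 300 samples, and giving each of 100 clients 2 shards, which matches the defaults `num_clients=100` and `shards=200`. A minimal sketch of that procedure follows; the function name, the `seed` argument and the shard-shuffling step are assumptions for illustration, not FedLab's implementation.

```python
import random


def shard_partition(labels, num_clients, shards, seed=0):
    """Sort indices by label, cut them into shards, deal shards to clients."""
    # Indices ordered by label, so each shard covers very few classes.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shard_size = len(order) // shards
    shard_list = [order[s * shard_size:(s + 1) * shard_size] for s in range(shards)]

    # Shuffle shard ids so the assignment of shards to clients is random.
    rng = random.Random(seed)
    shard_ids = list(range(shards))
    rng.shuffle(shard_ids)

    per_client = shards // num_clients
    clients = {}
    for cid in range(num_clients):
        picked = shard_ids[cid * per_client:(cid + 1) * per_client]
        clients[cid] = [i for s in picked for i in shard_list[s]]
    return clients


# Toy labels: 1000 samples, 100 per class, 10 classes.
labels = [i % 10 for i in range(1000)]
parts = shard_partition(labels, num_clients=10, shards=20)
```

With two shards per client, each client sees at most two classes, which is what makes the split "pathological".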

preprocess(download=True)#: Define the dataset partition process

get_dataset(id, type='train')#

Load the subdataset of the client identified by id from a local file.

Parameters:

id (int) – client id
type (str, optional) – Dataset type, can be "train", "val" or "test". Default as "train".

Returns:

Dataset

get_dataloader(id, batch_size=None, type='train')#

Return the data loader of the client identified by id.

Parameters:

id (int) – client id
batch_size (int, optional) – batch size in DataLoader.
type (str, optional) – Dataset type, can be "train", "val" or "test". Default as "train".

class RotatedMNIST(root, path, num)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

Rotate MNIST and partition them.

Parameters:

root (str) – Path to download raw dataset.
path (str) – Path to save partitioned subdataset.
num (int) – Number of clients.

preprocess(thetas=[0, 90, 180, 270], download=True)#: Define the dataset partition process
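The default `thetas` are multiples of 90 degrees, so the rotation itself can be illustrated on a nested list without any imaging library. The sketch below is hypothetical in two respects: it assigns angles to clients round-robin (`cid % len(thetas)`), and it rotates counterclockwise. Neither the real client-to-angle mapping nor the rotation direction is specified in this reference.

```python
def rotate90(image, times):
    """Rotate a 2D grid counterclockwise by 90 degrees, `times` times."""
    for _ in range(times % 4):
        # Transpose, then reverse the row order: one counterclockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def rotate_for_client(image, cid, thetas=(0, 90, 180, 270)):
    """Pick an angle for client `cid` (assumed round-robin) and rotate."""
    theta = thetas[cid % len(thetas)]
    if theta % 90 != 0:
        raise ValueError("this sketch only handles multiples of 90 degrees")
    return rotate90(image, theta // 90)


image = [[1, 2],
         [3, 4]]
rotated_90 = rotate_for_client(image, cid=1)
rotated_180 = rotate_for_client(image, cid=2)
```

Rotating each client's images by a fixed angle yields feature-distribution skew: every client holds the same classes, but with differently oriented inputs.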

get_dataset(id, type='train')#

Get dataset class

Parameters:

id (int) – ID of the client whose partial dataset is retrieved.
type (str, optional) – Type of dataset, can be chosen from ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError –

get_data_loader(id, batch_size=None, type='train')#

class RotatedCIFAR10(root, save_dir, num_clients)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

Rotate CIFAR10 and partition them.

Parameters:

root (str) – Path to download raw dataset.
save_dir (str) – Path to save partitioned subdataset.
num_clients (int) – Number of clients.

preprocess(shards, thetas=[0, 180])#

_summary_

Parameters:

shards (_type_) – _description_
thetas (list, optional) – _description_. Defaults to [0, 180].

get_dataset(id, type='train')#

Get dataset class

Parameters:

id (int) – ID of the client whose partial dataset is retrieved.
type (str, optional) – Type of dataset, can be chosen from ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError –

get_data_loader(id, batch_size=None, type='train')#

class PartitionedMNIST(root, path, num_clients, download=True, preprocess=False, partition='iid', dir_alpha=None, verbose=True, seed=None, transform=None, target_transform=None)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

FedDataset with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.

Parameters:

root (str) – Path to download raw dataset.
path (str) – Path to save partitioned subdataset.
num_clients (int) – Number of clients.
download (bool) – Whether to download the raw dataset.
preprocess (bool) – Whether to preprocess the dataset.
partition (str, optional) – Partition name. Only supports "noniid-#label", "noniid-labeldir", "unbalance" and "iid" partition schemes.
dir_alpha (float, optional) – Dirichlet distribution parameter for non-iid partition. Only works if partition="dirichlet". Default as None.
verbose (bool, optional) – Whether to print partition process. Default as True.
seed (int, optional) – Random seed. Default as None.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

preprocess(partition='iid', dir_alpha=None, verbose=True, seed=None, download=True, transform=None, target_transform=None)#

Perform FL partition on the dataset, and save each subset for each client into data{cid}.pkl file.

For details of partition schemes, please check Federated Dataset and DataPartitioner.
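The save-then-load pattern described above (one `data{cid}.pkl` file per client) can be sketched with the standard library. The helper names and the flat directory layout are assumptions; the real class may organise its files differently, for example in separate folders per dataset type.

```python
import os
import pickle
import tempfile


def save_client_subsets(client_indices, samples, out_dir):
    """Write one data{cid}.pkl file per client (illustrative helper)."""
    os.makedirs(out_dir, exist_ok=True)
    for cid, indices in client_indices.items():
        subset = [samples[i] for i in indices]
        with open(os.path.join(out_dir, f"data{cid}.pkl"), "wb") as f:
            pickle.dump(subset, f)


def load_client_subset(cid, out_dir):
    """Read back the subset of client `cid`."""
    with open(os.path.join(out_dir, f"data{cid}.pkl"), "rb") as f:
        return pickle.load(f)


# Toy (sample, label) pairs and a two-client partition.
samples = [("img%d" % i, i % 10) for i in range(6)]
client_indices = {0: [0, 1, 2], 1: [3, 4, 5]}

out_dir = tempfile.mkdtemp()
save_client_subsets(client_indices, samples, out_dir)
restored = load_client_subset(1, out_dir)
```

Persisting the partition once means later runs can skip preprocessing and load each client's file directly, which is what the `preprocess=False` default suggests.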

get_dataset(cid, type='train')#

Load subdataset for client with client ID cid from local file.

Parameters:

cid (int) – client id
type (str, optional) – Dataset type, can be "train", "val" or "test". Default as "train".

Returns:

Dataset

get_dataloader(cid, batch_size=None, type='train')#

Return the data loader for the client with client ID cid.

Parameters:

cid (int) – client id
batch_size (int, optional) – batch size in DataLoader.
type (str, optional) – Dataset type, can be "train", "val" or "test". Default as "train".

class PartitionedCIFAR10(root, path, dataname, num_clients, download=True, preprocess=False, balance=True, partition='iid', unbalance_sgm=0, num_shards=None, dir_alpha=None, verbose=True, seed=None, transform=None, target_transform=None)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

FedDataset with partitioning preprocess. For detailed partitioning, please check Federated Dataset and DataPartitioner.

Parameters:

root (str) – Path to download raw dataset.
path (str) – Path to save partitioned subdataset.
dataname (str) – "cifar10" or "cifar100"
num_clients (int) – Number of clients.
download (bool) – Whether to download the raw dataset.
preprocess (bool) – Whether to preprocess the dataset.
balance (bool, optional) – Balanced partition over all clients or not. Default as True.
partition (str, optional) – Partition type, only "iid", "shards" and "dirichlet" are supported. Default as "iid".
unbalance_sgm (float, optional) – Log-normal distribution variance for unbalanced data partition over clients. Default as 0 for balanced partition.
num_shards (int, optional) – Number of shards in non-iid "shards" partition. Only works if partition="shards". Default as None.
dir_alpha (float, optional) – Dirichlet distribution parameter for non-iid partition. Only works if partition="dirichlet". Default as None.
verbose (bool, optional) – Whether to print partition process. Default as True.
seed (int, optional) – Random seed. Default as None.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
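To give a feel for what `dir_alpha` controls, here is one common formulation of a Dirichlet label partition: for each class, draw client proportions from a Dirichlet distribution and split that class's samples accordingly. It is a sketch under stated assumptions (function name, per-class splitting, Dirichlet draws built from normalised Gamma variates) and is not claimed to be the algorithm FedLab runs for `partition="dirichlet"`.

```python
import random


def dirichlet_label_partition(labels, num_clients, dir_alpha, seed=None):
    """Split each class across clients with Dirichlet(dir_alpha) proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients = {cid: [] for cid in range(num_clients)}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample equals independent Gamma(alpha, 1) draws, normalised.
        gammas = [rng.gammavariate(dir_alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        start, cumulative = 0, 0.0
        for cid, gamma in enumerate(gammas):
            cumulative += gamma / total
            # The last client takes the remainder so no sample is dropped.
            if cid == num_clients - 1:
                end = len(indices)
            else:
                end = int(round(cumulative * len(indices)))
            clients[cid].extend(indices[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]
parts = dirichlet_label_partition(labels, num_clients=5, dir_alpha=0.5, seed=0)
```

Small values of `dir_alpha` concentrate each class on a few clients (strong label skew), while large values approach an even split.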

preprocess(balance=True, partition='iid', unbalance_sgm=0, num_shards=None, dir_alpha=None, verbose=True, seed=None, download=True)#

Perform FL partition on the dataset, and save each subset for each client into data{cid}.pkl file.

For details of partition schemes, please check Federated Dataset and DataPartitioner.
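The role of `unbalance_sgm` can be illustrated by drawing per-client sample counts from a log-normal distribution and rescaling them to the dataset size. This sketch is hypothetical: it passes `unbalance_sgm` as the sigma parameter of `random.lognormvariate` and repairs rounding by handing leftover samples to the first clients, which need not match what the library does.

```python
import random


def lognormal_sample_counts(num_samples, num_clients, unbalance_sgm, seed=None):
    """Return per-client sample counts that sum to num_samples."""
    rng = random.Random(seed)
    if unbalance_sgm == 0:
        # Zero spread: every client gets the same weight (balanced partition).
        weights = [1.0] * num_clients
    else:
        weights = [rng.lognormvariate(0.0, unbalance_sgm) for _ in range(num_clients)]

    total = sum(weights)
    counts = [int(num_samples * w / total) for w in weights]
    # Flooring loses a few samples; hand them out one by one.
    for i in range(num_samples - sum(counts)):
        counts[i % num_clients] += 1
    return counts


balanced = lognormal_sample_counts(50000, 10, unbalance_sgm=0)
skewed = lognormal_sample_counts(50000, 10, unbalance_sgm=0.3, seed=1)
```

The counts say only how many samples each client holds; which samples they are is decided separately by the `partition` scheme.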

get_dataset(cid, type='train')#

Load subdataset for client with client ID cid from local file.

Parameters:

cid (int) – client id
type (str, optional) – Dataset type, can be "train", "val" or "test". Default as "train".

Returns:

Dataset

get_dataloader(cid, batch_size=None, type='train')#

Return the data loader for the client with client ID cid.

Parameters:

cid (int) – client id
batch_size (int, optional) – batch size in DataLoader.
type (str, optional) – Dataset type, can be "train", "val" or "test". Default as "train".

class SyntheticDataset(root, path, preprocess=False)#

Bases: fedlab.contrib.dataset.basic_dataset.FedDataset

preprocess(root, path, partition=0.2)#

Preprocess the raw data to fedlab dataset format.

Parameters:

root (str) – path to the raw data.
path (str) – path to save the preprocessed datasets.
partition (float, optional) – The proportion of the test set. Defaults to 0.2.
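The `partition` argument is a test-set proportion, so with the default of 0.2 one fifth of the samples are held out. A minimal sketch of such a split follows; the helper name, the shuffling and the `seed` argument are assumptions, and whether the real method shuffles or splits per client is not stated here.

```python
import random


def split_train_test(samples, partition=0.2, seed=None):
    """Hold out a `partition` fraction of the samples as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)

    num_test = int(len(samples) * partition)
    test = [samples[i] for i in indices[:num_test]]
    train = [samples[i] for i in indices[num_test:]]
    return train, test


train, test = split_train_test(list(range(10)), partition=0.2, seed=0)
```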

get_dataset(id, type='train')#

Get dataset class

Parameters:

id (int) – ID of the client whose partial dataset is retrieved.
type (str, optional) – Type of dataset, can be chosen from ["train", "val", "test"]. Defaults to "train".

Raises:

NotImplementedError –

get_dataloader(id, batch_size, type='train')#: Get data loader