FCUBE#
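FCUBE [7] is a synthetic dataset designed for non-IID scenarios; its non-IID type is feature distribution skew. The dataset was proposed in Federated Learning on Non-IID Data Silos: An Experimental Study.
Each FCUBE data point has 3 features, i.e. \(\mathcal{D}_{\text{FCUBE}} = \{ (\mathbf{x}, y) \}\), where each data point is written as \(\mathbf{x} = (x_1, x_2, x_3)\) and the label satisfies \(y \in \{ 0, 1 \}\). The data points are distributed in a cube in three-dimensional space, with \(y = 0\) when \(x_1 > 0\) and \(y = 1\) when \(x_1 < 0\). By default, we suggest 4000 data points for the training set and 1000 for the test set.
For more details, see Section IV-B-b of the original paper.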
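To make the labeling rule concrete, here is a minimal NumPy sketch of FCUBE-like data. It is not the generator shipped with `FCUBE`: the function name `make_fcube_like`, the uniform sampling, and the seeds are assumptions made for illustration; only the cube \((-1, 1)^3\) and the rule "label by the sign of \(x_1\)" come from the description on this page.

```python
import numpy as np

def make_fcube_like(num_samples, seed=0):
    """Hypothetical helper: sample points in the cube and label them by the sign of x1."""
    rng = np.random.default_rng(seed)
    # Assumption: points are drawn uniformly from [-1, 1) in each coordinate.
    X = rng.uniform(-1.0, 1.0, size=(num_samples, 3))
    # y = 0 when x1 > 0, y = 1 when x1 < 0 (an exact zero would fall into class 1 here).
    y = (X[:, 0] <= 0).astype(np.int64)
    return X, y

# Suggested default sizes: 4000 training points and 1000 test points.
X_train, y_train = make_fcube_like(4000, seed=0)
X_test, y_test = make_fcube_like(1000, seed=1)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Because the label depends only on \(x_1\) and the sampling is symmetric around 0, the two classes come out roughly balanced in this sketch.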
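If generate=True, the FCUBE dataset generates .npy files locally. FCUBE also accepts the common dataset arguments transform and target_transform, which are used for data augmentation.
Import the required packages and apply the basic settings: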
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
%matplotlib notebook
import pandas as pd
import numpy as np
import sys
import fedlab_benchmarks
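from fedlab_benchmarks.datasets import FCUBE
# needed by the partition sections below; import paths assumed from the FedLab package layout
from fedlab.utils.dataset import FCUBEPartitioner
from fedlab.utils.functional import partition_report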
import torch
from torch.utils.data import DataLoader
sns.set_style('darkgrid')
Dataset#
Generation#
Generate the training set:
trainset = FCUBE('../../../../data/FCUBE/', train=True, generate=True,
                 num_samples=1000)
train_loader = DataLoader(trainset, batch_size=20, shuffle=True)
Generate FCUBE data now...
../../../../data/FCUBE/fcube_train_X_1000.npy generated.
../../../../data/FCUBE/fcube_train_y_1000.npy generated.
Generate the test set:
testset = FCUBE('../../../../data/FCUBE/', train=False, generate=True,
                num_samples=250)
test_loader = DataLoader(testset, batch_size=20, shuffle=False)
Generate FCUBE data now...
../../../../data/FCUBE/fcube_test_X_250.npy generated.
../../../../data/FCUBE/fcube_test_y_250.npy generated.
Visualization#
For visualization, we first build a DataFrame from the dataset:
train_df = pd.DataFrame({'x1': trainset.data[:,0],
                         'x2': trainset.data[:,1],
                         'x3': trainset.data[:,2],
                         'y': trainset.targets,
                         'split': ['train'] * trainset.targets.shape[0]})
test_df = pd.DataFrame({'x1': testset.data[:,0],
                        'x2': testset.data[:,1],
                        'x3': testset.data[:,2],
                        'y': testset.targets,
                        'split': ['test'] * testset.targets.shape[0]})
fcube_df = pd.concat([train_df, test_df], ignore_index=True)
The class distribution of FCUBE is balanced. Visualization of the class distribution in the training and test sets:
sns.displot(fcube_df, x="y", col="split", bins=2, height=4, aspect=.6)
plt.savefig(f"../imgs/fcube_class_dist.png", dpi=400, bbox_inches = 'tight')
Distribution of the data points in the training set:
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("x3")
# get colormap from seaborn
cmap = ListedColormap(sns.color_palette("RdBu", 2).as_hex())
ax.scatter(train_df['x1'], train_df['x2'], train_df['x3'], c=train_df['y'], marker='o',
cmap=cmap,
alpha=0.7)
plt.title("Trainset Distribution")
plt.show()
plt.savefig("../imgs/fcube_train_dist_vis.png", dpi=400, bbox_inches='tight')
Distribution of the data points in the test set:
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("x3")
# get colormap from seaborn
cmap = ListedColormap(sns.color_palette("RdBu", 2).as_hex())
scatter = ax.scatter(test_df['x1'], test_df['x2'], test_df['x3'], c=test_df['y'], marker='o',
cmap=cmap,
alpha=0.7)
plt.legend(handles=scatter.legend_elements()[0], labels=['class 0','class 1'])
plt.title("Testset Distribution")
plt.show()
plt.savefig("../imgs/fcube_test_dist_vis.png", dpi=400, bbox_inches='tight')
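Data partition#
FCUBE supports only 2 partition methods:
- Feature distribution skew: synthetic
- IID
Because of how the synthetic partition scheme works, an FCUBE partition can only have 4 clients.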
num_clients = 4
num_classes = 2
col_names = [f"class{i}" for i in range(num_classes)]
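Synthetic partition#
From the dataset visualization, we can see that the FCUBE data points lie in the cube \(-1 < x_1 < 1\), \(-1 < x_2 < 1\), \(-1 < x_3 < 1\).
In the synthetic partition, the data cube is split into 8 parts by the coordinate planes \(x_1=0\), \(x_2=0\), and \(x_3=0\). Each pair of parts that is symmetric about \((0,0,0)\) is assigned to one client. The resulting partition has the following property: the feature distribution differs from client to client, while the labels stay balanced.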
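The octant pairing described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code of `FCUBEPartitioner`: the helper name `octant_pair_partition` and the specific mapping from octant pairs to client ids 0 to 3 are hypothetical, and the actual client numbering in FedLab may differ.

```python
import numpy as np

def octant_pair_partition(X):
    """Hypothetical helper: give octants that are symmetric about the origin to the same client."""
    # One bit per coordinate: 1 if the coordinate is positive, else 0.
    bits = (X > 0).astype(np.int64)
    octant = bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]  # octant id in 0..7
    # Reflecting a point through the origin flips all three bits, i.e. maps octant o to 7 - o.
    client = np.minimum(octant, 7 - octant)  # pair id in 0..3
    return {cid: np.flatnonzero(client == cid) for cid in range(4)}

# Toy data following the FCUBE labeling rule (uniform sampling is an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 3))
y = (X[:, 0] <= 0).astype(np.int64)

parts = octant_pair_partition(X)
for cid, idx in parts.items():
    # Each client holds one octant with x1 > 0 and one with x1 < 0, so both labels appear.
    print(f"client {cid}: {len(idx)} samples, fraction of class 1 = {y[idx].mean():.2f}")
```

Each symmetric pair contains one octant on either side of the plane \(x_1 = 0\), which is why every client sees both classes even though it only sees two of the eight regions of feature space.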
Visualization from the original paper:
# perform partition
synthetic_part = FCUBEPartitioner(trainset.data, partition="synthetic")
print(f"Client number: {len(synthetic_part)}")
# Client number: 4
csv_file = "../partition-reports/fcube_synthetic.csv"
partition_report(trainset.targets, synthetic_part.client_dict,
class_num=num_classes,
verbose=False, file=csv_file)
synthetic_part_df = pd.read_csv(csv_file,header=1)
synthetic_part_df = synthetic_part_df.set_index('client')
col_names = [f"class{i}" for i in range(num_classes)]
for col in col_names:
    synthetic_part_df[col] = (synthetic_part_df[col] * synthetic_part_df['Amount']).astype(int)
# select first 4 clients for bar plot
synthetic_part_df[col_names].plot.barh(stacked=True)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('sample num')
plt.savefig(f"../imgs/fcube_synthetic.png", dpi=400, bbox_inches = 'tight')
Visualization of the data points held by each client:
# extract data and targets for each clients
client_data = [trainset.data[synthetic_part[cid]] for cid in range(num_clients)]
client_targets = [trainset.targets[synthetic_part[cid]] for cid in range(num_clients)]
fig = plt.figure(figsize=(10,10))
# get colormap from seaborn
cmap = ListedColormap(sns.color_palette("RdBu", 2).as_hex())
for row in range(2):
    for col in range(2):
        cid = int(2*row + col)
        ax = fig.add_subplot(2, 2, cid+1, projection='3d', title=f"Client {cid}")
        # label each of the three axes
        ax.set_xlabel("x1")
        ax.set_ylabel("x2")
        ax.set_zlabel("x3")
        scatter = ax.scatter(client_data[cid][:,0],
                             client_data[cid][:,1],
                             client_data[cid][:,2],
                             c=client_targets[cid],
                             marker='o',
                             cmap=cmap,
                             alpha=0.7)
        ax.legend(handles=scatter.legend_elements()[0], labels=['class 0','class 1'])
plt.show()
plt.savefig("../imgs/fcube_synthetic_part.png", dpi=500, bbox_inches='tight')
IID partition#
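For comparison with the synthetic scheme, an IID partition can be thought of as a random shuffle of the sample indices followed by an even split across clients. The sketch below shows that idea only; the helper name `iid_partition` and the seed are hypothetical, and it is an assumption that `FCUBEPartitioner` with `partition="iid"` behaves similarly.

```python
import numpy as np

def iid_partition(num_samples, num_clients=4, seed=0):
    """Hypothetical helper: shuffle the indices and split them evenly across clients."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)
    # array_split keeps chunk sizes within one of each other if the split is uneven.
    chunks = np.array_split(perm, num_clients)
    return {cid: chunk for cid, chunk in enumerate(chunks)}

parts = iid_partition(1000)
print({cid: len(idx) for cid, idx in parts.items()})
# {0: 250, 1: 250, 2: 250, 3: 250}
```

Since every client receives a uniformly random subset, each client's points cover the whole cube rather than a single pair of octants.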
# perform partition
iid_part = FCUBEPartitioner(trainset.data, partition="iid")
csv_file = "../partition-reports/fcube_iid.csv"
partition_report(trainset.targets, iid_part.client_dict,
class_num=num_classes,
verbose=False, file=csv_file)
iid_part_df = pd.read_csv(csv_file,header=1)
iid_part_df = iid_part_df.set_index('client')
for col in col_names:
    iid_part_df[col] = (iid_part_df[col] * iid_part_df['Amount']).astype(int)
# select first 4 clients for bar plot
iid_part_df[col_names].plot.barh(stacked=True)
# plt.tight_layout()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('sample num')
plt.savefig(f"../imgs/fcube_iid.png", dpi=400, bbox_inches = 'tight')
Visualization of the data points held by each client:
# extract data and targets for each clients
client_data = [trainset.data[iid_part[cid]] for cid in range(num_clients)]
client_targets = [trainset.targets[iid_part[cid]] for cid in range(num_clients)]
fig = plt.figure(figsize=(10,10))
# get colormap from seaborn
cmap = ListedColormap(sns.color_palette("RdBu", 2).as_hex())
for row in range(2):
    for col in range(2):
        cid = int(2*row + col)
        ax = fig.add_subplot(2, 2, cid+1, projection='3d', title=f"Client {cid}")
        # label each of the three axes
        ax.set_xlabel("x1")
        ax.set_ylabel("x2")
        ax.set_zlabel("x3")
        scatter = ax.scatter(client_data[cid][:,0],
                             client_data[cid][:,1],
                             client_data[cid][:,2],
                             c=client_targets[cid],
                             marker='o',
                             cmap=cmap,
                             alpha=0.7)
        ax.legend(handles=scatter.legend_elements()[0], labels=['class 0','class 1'])
plt.show()
plt.savefig("../imgs/fcube_iid_part.png", dpi=500, bbox_inches='tight')
Note
The complete code for the FCUBE tutorial is available at this link.