sent140#
Module Contents#
- BASE_DIR#
- class Sent140Dataset(client_id: int, client_str: str, data: list, targets: list, is_to_tokens: bool = True, tokenizer: fedlab.contrib.dataset.utils.Tokenizer = None)#
Bases: torch.utils.data.Dataset
- _process_data_target()#
Process the client's data and targets.
- _data2token()#
Tokenize each sentence in data using the tokenizer.
- encode(vocab: fedlab.contrib.dataset.utils.Vocab, fix_len: int)#
Transform token data into an indices sequence using Vocab.
- Parameters:
  - vocab (fedlab.contrib.dataset.utils.Vocab) – vocab for data_token
  - fix_len (int) – max length of a sentence
- Returns:
  a list of integer lists for data_token, and a list of tensor targets
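A minimal usage sketch: the sample sentences, labels, and the pre-built vocab below are illustrative placeholders, not FedLab's data pipeline; only the Sent140Dataset constructor and encode call come from this API.

```python
# `vocab` is assumed to be a fedlab.contrib.dataset.utils.Vocab built
# elsewhere from the training corpus; its construction is not shown here.
dataset = Sent140Dataset(
    client_id=0,
    client_str="user_0",
    data=["i love this movie", "worst film ever"],  # placeholder sentences
    targets=[1, 0],                                 # placeholder labels
)                                    # is_to_tokens=True tokenizes on construction
dataset.encode(vocab, fix_len=300)   # every sentence becomes 300 indices
```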
- __encode_tokens(tokens, pad_idx) → torch.Tensor#
Encode token data into a fix_len-length index list over self.vocab. If a sentence is shorter than fix_len, it is padded with the pad word up to fix_len; if it is longer, it is truncated to its first fix_len words.
- Parameters:
  - tokens (list[str]) – data after tokenization
  - pad_idx (int) – index of the padding word in self.vocab
- Returns:
  a tensor of vocabulary indices with length fix_len for the tokens input
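Since __encode_tokens is private, the pad/truncate rule it applies can be illustrated standalone. This is a sketch of the behavior described above, not FedLab's implementation; word2idx stands in for the vocab's token-to-index mapping.

```python
import torch

def encode_tokens_sketch(tokens, word2idx, pad_idx, fix_len):
    """Sketch of the fixed-length encoding rule described above.

    `word2idx` is a plain dict standing in for self.vocab; unknown
    tokens fall back to `pad_idx` here purely for brevity.
    """
    indices = [word2idx.get(tok, pad_idx) for tok in tokens[:fix_len]]  # truncate
    indices += [pad_idx] * (fix_len - len(indices))                     # pad
    return torch.tensor(indices)
```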
- __len__()#
- __getitem__(item)#
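Because Sent140Dataset inherits from torch.utils.data.Dataset and implements __len__ and __getitem__, it plugs directly into a DataLoader. A minimal sketch, assuming dataset was constructed and encoded as in the example above:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch_data, batch_targets in loader:
    # batch_data: (32, fix_len) tensor of vocab indices
    # batch_targets: (32,) tensor of sentiment labels
    ...  # train on one client's sent140 batch
```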