
All Questions

0 votes
0 answers
21 views

Why does it print the content automatically when I use the BERT tokenizer?

class BertEncoder: def __init__(self): self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') self.model = BertModel.from_pretrained('bert-base-uncased') self.device = torch.device("cuda:...
dylan xie
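Not a confirmed diagnosis, but a hedged sketch: from_pretrained() often routes warning/progress text through the transformers logger, which can look like the class is printing on its own. Silencing that logger is one thing to try; the device fallback below is an added assumption, since the excerpt is truncated.

```python
import torch
from transformers import BertModel, BertTokenizer, logging as hf_logging

hf_logging.set_verbosity_error()  # show only errors, no info/warning chatter

class BertEncoder:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")
        # assumed completion of the truncated line from the excerpt
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
```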
0 votes
1 answer
203 views

Map BERT token indices to Spacy token indices

I'm trying to make BERT's (bert-base-uncased) tokenization token indices (not ids, token indices) map to spaCy's tokenization token indices. In the following example, my approach doesn't work because ...
lrthistlethwaite
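One way to do this alignment is through character offsets rather than token positions. A sketch under two assumptions: a fast tokenizer (BertTokenizerFast) is acceptable, since only fast tokenizers expose return_offsets_mapping, and the spaCy model en_core_web_sm is installed.

```python
import spacy
from transformers import BertTokenizerFast

nlp = spacy.load("en_core_web_sm")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "Tokenizers don't always line up neatly."
doc = nlp(text)
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=True)

# Map every character position to the index of the spaCy token covering it.
char_to_spacy = {}
for tok in doc:
    for i in range(tok.idx, tok.idx + len(tok.text)):
        char_to_spacy[i] = tok.i

# For each BERT token, look up the spaCy token via the token's start character.
bert_to_spacy = []
for start, end in enc["offset_mapping"]:
    if start == end:                      # special tokens ([CLS]/[SEP]) get (0, 0)
        bert_to_spacy.append(None)
    else:
        bert_to_spacy.append(char_to_spacy.get(start))

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), bert_to_spacy)))
```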
1 vote
1 answer
35 views

bert-base-uncased does not use newly added suffix token

I want to add custom tokens to the BertTokenizer. However, the model does not use the new token. from transformers import BertTokenizer tokenizer = BertTokenizer.from_pretrained("bert-base-...
Lulacca · 13
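A sketch of the usual explanation, offered tentatively: tokens added with add_tokens are matched as standalone words in a separate pass before WordPiece runs, so a continuation piece like "##xyz" never fires; a whole-word custom token does, provided the model's embedding matrix is resized to match.

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

num_added = tokenizer.add_tokens(["mycustomword"])   # whole word, not a "##..." piece
model.resize_token_embeddings(len(tokenizer))        # keep the embedding matrix in sync

print(tokenizer.tokenize("this is mycustomword here"))
# expected: ['this', 'is', 'mycustomword', 'here']
```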
0 votes
1 answer
4k views

Loading local tokenizer

I'm trying to load a local tokenizer using: from transformers import RobertaTokenizerFast tokenizer = RobertaTokenizerFast.from_pretrained(r'file path\tokenizer') however, this gives me the ...
Jon · 91
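A minimal sketch of the round trip, assuming the folder was produced by save_pretrained() and therefore contains the files the tokenizer expects (vocab.json, merges.txt, tokenizer.json, ...). The path is hypothetical, and from_pretrained should point at the folder, not an individual file.

```python
from transformers import RobertaTokenizerFast

save_dir = r"C:\models\my-roberta-tokenizer"   # hypothetical local folder

# Save once (e.g. after customizing a tokenizer) ...
RobertaTokenizerFast.from_pretrained("roberta-base").save_pretrained(save_dir)

# ... then load it back from the same directory.
tokenizer = RobertaTokenizerFast.from_pretrained(save_dir)
print(tokenizer("hello world")["input_ids"])
```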
0 votes
0 answers
55 views

ValueError when using add_tokens: 'the truth value of an array with more than one element is ambiguous'

I'm trying to improve a basic pretrained BERT tokenizer model. I'm adding new tokens using add_tokens, but running into issues with the built-in method. Namely: ValueError ...
Manny · 35
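Only a guess without the full traceback: that ValueError is what NumPy raises when an array ends up in a boolean test, so one plausible cause is that the new tokens reached add_tokens as NumPy objects (e.g. rows sliced from a DataFrame) rather than plain Python strings. A sketch of the coercion:

```python
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

new_tokens = np.array(["covid19", "mrna", "lockdown"])   # stand-in for the real source

# Hand add_tokens a flat list of plain Python str, not NumPy objects.
tokenizer.add_tokens([str(t) for t in new_tokens.ravel()])
print(len(tokenizer))
```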
2 votes
1 answer
195 views

bert_vocab.bert_vocab_from_dataset returning wrong vocabulary [closed]

I'm trying to build a tokenizer following the TensorFlow tutorial https://www.tensorflow.org/text/guide/subwords_tokenizer. I'm basically doing the same thing, only with a different dataset. The dataset in ...
Niccolò Tiezzi
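The call below mirrors the linked TensorFlow tutorial; it is a sketch, not a diagnosis. Two things worth checking when the learned vocabulary looks wrong: the dataset must yield raw text strings, and on a small corpus the learner will mostly return characters and reserved tokens, because vocab_size is an upper bound rather than a target.

```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

# Toy corpus standing in for the real dataset; it must yield plain text strings.
train_ds = tf.data.Dataset.from_tensor_slices(
    ["the quick brown fox", "jumps over the lazy dog", "tokenizers build vocabularies"]
)

reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

vocab = bert_vocab.bert_vocab_from_dataset(
    train_ds.batch(1000).prefetch(2),
    vocab_size=8000,                               # upper bound on vocabulary size
    reserved_tokens=reserved_tokens,
    bert_tokenizer_params=dict(lower_case=True),
    learn_params={},
)
print(vocab[:20])
```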
1 vote
0 answers
183 views

How to obtain the [CLS] sentence embedding of multiple sentences successively without facing a RAM crash?

I would like to obtain the [CLS] token's sentence embedding (as it represents the whole sentence's meaning) using BERT. I have many sentences (about 40) that belong to a Document, and 246 such ...
Aadithya Seshadri
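A sketch of batched [CLS] extraction, under the assumption that the memory blow-up comes from keeping autograd graphs and GPU tensors alive across sentences; the sentence list below is placeholder data.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentences = ["first sentence of a document", "second sentence"] * 100   # placeholder

cls_embeddings = []
batch_size = 32
with torch.no_grad():                                   # no autograd graphs kept around
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=128, return_tensors="pt").to(device)
        out = model(**enc)
        # [CLS] is the first position of last_hidden_state; detach it from the GPU.
        cls_embeddings.append(out.last_hidden_state[:, 0, :].cpu())

cls_embeddings = torch.cat(cls_embeddings)              # (num_sentences, hidden_size)
print(cls_embeddings.shape)
```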
1 vote
1 answer
635 views

NER classification DeBERTa tokenizer error: You need to instantiate DebertaTokenizerFast

I'm trying to perform an NER classification task using DeBERTa, but I'm stuck with a tokenizer error. This is my code (my input sentence must be split word by word by ","): from transformers ...
Chiara · 510
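A sketch of the usual way around that exact message: when the input is already split into words, the fast DeBERTa tokenizer has to be created with add_prefix_space=True and called with is_split_into_words=True. The model name and example words are placeholders.

```python
from transformers import DebertaTokenizerFast

tokenizer = DebertaTokenizerFast.from_pretrained(
    "microsoft/deberta-base", add_prefix_space=True
)

words = ["John", "lives", "in", "New", "York"]          # sentence pre-split into words
enc = tokenizer(words, is_split_into_words=True, truncation=True)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc.word_ids())   # maps each sub-token back to a word index, handy for NER labels
```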
0 votes
1 answer
305 views

How to get access to the tokenizer after loading a saved custom BERT model using Keras and TF2?

I am working on an intent classification problem and need your help. I fine-tuned one of the BERT models for text classification, then trained and evaluated it on a small dataset for detecting five intents. I ...
Rohit · 7,189
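A sketch of one common pattern, assuming the Keras model was saved with model.save(): the tokenizer is not part of that file, so it gets persisted and reloaded separately via save_pretrained / from_pretrained. Paths and the commented model lines are placeholders.

```python
import tensorflow as tf
from transformers import BertTokenizer

# --- training time ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# model = ...  build / fine-tune the Keras intent classifier ...
# model.save("intent_model")                          # saves the network only
tokenizer.save_pretrained("intent_model_tokenizer")   # saves vocab + tokenizer config

# --- inference time ---
# model = tf.keras.models.load_model("intent_model")
tokenizer = BertTokenizer.from_pretrained("intent_model_tokenizer")
enc = tokenizer(["book a flight to paris"], padding=True, return_tensors="tf")
# preds = model.predict(dict(enc))
```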
0 votes
1 answer
2k views

How to replace BERT tokenizer special tokens

I am using an AutoTokenizer --> tokenizer1 = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True) which is more complete than the tokenizer of bert-base-uncased. The ...
javafest
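One tentative angle on this: BERTweet is RoBERTa-style, so its special tokens (<s>, </s>, <pad>) simply differ from BERT's [CLS]/[SEP]/[PAD]. Reading them from the tokenizer's own attributes, instead of hard-coding BERT's strings, usually removes the need to "replace" anything.

```python
from transformers import AutoTokenizer

tokenizer1 = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# Inspect what this tokenizer actually uses instead of assuming [CLS]/[SEP]/[PAD].
print(tokenizer1.cls_token, tokenizer1.sep_token, tokenizer1.pad_token)

ids = tokenizer1.encode("hello world")        # special tokens are added automatically
print(tokenizer1.convert_ids_to_tokens(ids))  # starts with '<s>' and ends with '</s>'
```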
1 vote
1 answer
2k views

How to preprocess a dataset for BERT model implemented in Tensorflow 2.x?

Overview: I have a dataset made for a classification problem. There are two columns: one is sentences and the other is labels (total: 10 labels). I'm trying to convert this dataset to implement it in a ...
Y4RD13 · 984
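A minimal sketch of one route, with toy data standing in for the real 10-label table: tokenize the sentence column in one go and wrap the result in a tf.data.Dataset that a TFBertForSequenceClassification head can train on. The commented lines show how it would plug into a model.

```python
import tensorflow as tf
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

sentences = ["i want to book a flight", "play some jazz music"]   # placeholder column
labels = [3, 7]                                                   # integer labels in [0, 10)

enc = tokenizer(sentences, padding=True, truncation=True, max_length=64,
                return_tensors="tf")

dataset = (
    tf.data.Dataset.from_tensor_slices((dict(enc), labels))
    .shuffle(len(sentences))
    .batch(16)
)

# from transformers import TFBertForSequenceClassification
# model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=10)
# model.compile(optimizer="adam",
#               loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(dataset, epochs=3)
```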
2 votes
2 answers
6k views

"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers." ValueError: Input is not valid

I am using a BERT tokenizer for French and I am getting this error, but I can't seem to solve it. Any suggestions? Traceback (most recent call last): File "training_cross_data_2....
emma · 363
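Without the full traceback this is only a guess: that ValueError is raised when the tokenizer receives something that is not a string (a NaN read from a CSV is the classic case). A sketch of the usual defensive coercion; camembert-base stands in for whichever French model is actually used, and the DataFrame is fabricated.

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")   # placeholder French model

df = pd.DataFrame({"text": ["Bonjour le monde", None, 42]})   # messy placeholder data

texts = df["text"].fillna("").astype(str).tolist()            # every item becomes a str
enc = tokenizer(texts, padding=True, truncation=True)
print(enc["input_ids"])
```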
2 votes
0 answers
800 views

UnparsedFlagAccessError: Trying to access flag --preserve_unused_tokens before flags were parsed

Hello, I am a beginner in ML. I tried to use BERT, but the tokenizer didn't work, as shown below. train_input = bert_encode(train.text.values, tokenizer, max_len=160) test_input = bert_encode(test.text.values, ...
Tony · 21
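This error comes from the research BERT code reading an absl flag before any flags were parsed, which is typical in notebooks. A commonly cited workaround, shown here as a sketch, is to parse the flags manually (with their defaults) before calling the tokenizer helpers.

```python
from absl import flags

# Any dummy program name will do; after this call absl considers all flags parsed
# and --preserve_unused_tokens takes its default value.
flags.FLAGS(["bert_notebook"])

# train_input = bert_encode(train.text.values, tokenizer, max_len=160)  # as in the question
```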
0 votes
1 answer
2k views

Split a sentence by words just as BERT Tokenizer would do?

I'm trying to locate all the [UNK] tokens produced by the BERT tokenizer in my text. Once I have the position of an [UNK] token, I need to identify which word it belongs to. For that, I tried to get the position ...
Andrea NR · 1,747
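A sketch of one way to do this with a fast tokenizer's offset mapping, which records the character span each wordpiece came from, so an [UNK] can be mapped straight back to the original word; the example sentence is made up.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "the new emoji 🥘 looks great"              # the emoji has no wordpiece -> [UNK]
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])

for tok, (start, end) in zip(tokens, enc["offset_mapping"]):
    if tok == tokenizer.unk_token:                 # "[UNK]"
        print(f"[UNK] covers characters {start}:{end} -> {text[start:end]!r}")
```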
7 votes
2 answers
10k views

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained("...
JayJay · 203
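A sketch of the two helpers that usually cover "untokenizing": convert_tokens_to_string for a slice of wordpieces, and decode for ids; both merge "##" continuation pieces back into words. The slicing here is illustrative, not the asker's exact windowing logic.

```python
from transformers import BertTokenizer

tz = BertTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog"
tokens = tz.tokenize(text)
ids = tz.convert_tokens_to_ids(tokens)

# Rebuild text from a window of tokens (e.g. N tokens around a target word) ...
window = tokens[2:6]
print(tz.convert_tokens_to_string(window))

# ... or decode from ids, which likewise merges wordpieces back together.
print(tz.decode(ids))
```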
