
All Questions

1 vote
1 answer
71 views

How to handle German language specific characters like (ä, ö, ü, ß) while tokenizing using GPT2Tokenizer?

I am working with German texts, where I need to tokenize texts using GPT2Tokenizer. To tokenize the text, I wrote the implementation as follows: from transformers import GPT2Tokenizer text = "...
RajibTheKing • 1,362
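Worth noting for this question: GPT-2's tokenizer is byte-level BPE, so German characters like ä, ö, ü, ß need no special handling at all; they are represented through their UTF-8 bytes, every one of which has a base token. A stdlib sketch of the underlying idea (the sample string is illustrative; no transformers install needed):

```python
# GPT-2's tokenizer operates on UTF-8 bytes (byte-level BPE), so umlauts and
# eszett never fall outside the vocabulary -- they are simply encoded as
# their multi-byte UTF-8 sequences.

german = "Grüße aus München: ä ö ü ß"

for ch in "äöüß":
    encoded = ch.encode("utf-8")
    print(ch, "->", list(encoded))  # each of these is two UTF-8 bytes

# Because every possible byte has a base token, round-tripping is lossless:
assert german.encode("utf-8").decode("utf-8") == german
```

With the library itself, `GPT2Tokenizer.from_pretrained("gpt2")` should encode and decode such text losslessly out of the box; `tokenizer.decode(tokenizer.encode(text))` restores the original string.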
1 vote
1 answer
69 views

How do I remove escape characters from output of nltk.word_tokenize?

How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...
green_ruby
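The usual fix here is to strip control and format characters (Unicode category C*) from each token after tokenizing. A minimal stdlib sketch — the `clean_tokens` helper and the sample tokens are illustrative, not part of nltk:

```python
import unicodedata

def clean_tokens(tokens):
    """Strip control/format characters (Unicode category 'C*') from each
    token and drop any token that ends up empty."""
    cleaned = []
    for tok in tokens:
        tok = "".join(ch for ch in tok
                      if not unicodedata.category(ch).startswith("C"))
        if tok:
            cleaned.append(tok)
    return cleaned

# tokens as nltk.word_tokenize might return them, with a stray BOM,
# a zero-width space, and a form feed mixed in
raw = ["\ufeffMonty", "Python\u200b", "holy", "\x0c", "grail"]
print(clean_tokens(raw))  # ['Monty', 'Python', 'holy', 'grail']
```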
5 votes
1 answer
565 views

Is there a way to save a pre-compiled AutoTokenizer?

Sometimes, we'll have to do something like this to extend a pre-trained tokenizer: from transformers import AutoTokenizer from datasets import load_dataset ds_de = load_dataset("mc4", 'de') ...
alvas • 123k
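The standard answer is `save_pretrained()` plus `from_pretrained()` pointed at the saved directory. A hedged sketch, assuming the transformers library and access to the Hugging Face Hub; the directory name and the added tokens are illustrative:

```python
# After extending a pretrained tokenizer, persist it with save_pretrained()
# and reload the extended copy with from_pretrained() on the directory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["grüße", "münchen"])       # example new tokens

tokenizer.save_pretrained("extended-tokenizer")  # writes tokenizer files + config

reloaded = AutoTokenizer.from_pretrained("extended-tokenizer")
assert len(reloaded) == len(tokenizer)           # added tokens survive the round trip
```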
-1 votes
1 answer
28 views

nltk python library word tokenization error [closed]

I am trying to tokenize a file. `AttributeError Traceback (most recent call last) <ipython-input-8-81ae6f78b554> in <cell line: 4>() 2 robert = open('...
i222025 Amna Javaid
1 vote
1 answer
554 views

Splitting a word into two words with spaCy

I'm facing an issue where I need to split a single 'word' into two words due to missing spaces or new lines in the received text. My intention is to establish a pipeline (spaCy 3.5.4) for this task ...
Maciek • 21
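spaCy exposes exactly this through `Doc.retokenize()` and `Retokenizer.split()`: the subtoken texts must concatenate back to the original token's text, and heads must be supplied for the new pieces. A hedged sketch, assuming spaCy 3.x is installed; the sentence and split are illustrative:

```python
# Split one glued-together token into two using spaCy's retokenizer.
import spacy

nlp = spacy.blank("en")
doc = nlp("I received helloworld yesterday")

target = doc[2]  # the glued token "helloworld"
with doc.retokenize() as retokenizer:
    # "hello" + "world" must equal the original token text;
    # here the second piece is made the head of the first.
    retokenizer.split(target, ["hello", "world"],
                      heads=[(target, 1), target])

print([t.text for t in doc])
# ['I', 'received', 'hello', 'world', 'yesterday']
```

For a real pipeline this can live in a custom component registered after the tokenizer, deciding where to split (e.g. via a dictionary lookup).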
6 votes
1 answer
2k views

When to set `add_special_tokens=False` in huggingface transformers tokenizer?

this is the default way of setting tokenizer in the Hugging Face "transformers" library: from transformers import BertForSequenceClassification,BertTokenizer tokenizer=BertTokenizer....
Yilmaz • 50.1k
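In short: with the flag on (the default), BERT-style tokenizers wrap the input in [CLS] … [SEP]; set it to False when you add those tokens yourself or are concatenating pre-encoded chunks. A pure-Python stand-in for what the flag controls — the `encode` helper below is hypothetical, not the transformers API:

```python
# Illustrates what `add_special_tokens` toggles, without the library:
# True wraps the sequence in [CLS]...[SEP]; False returns bare word pieces.

def encode(tokens, add_special_tokens=True):
    # stand-in for tokenizer(text)["input_ids"]; real integer IDs omitted
    return ["[CLS]"] + tokens + ["[SEP]"] if add_special_tokens else list(tokens)

pieces = ["hello", "world"]
print(encode(pieces))                            # ['[CLS]', 'hello', 'world', '[SEP]']
print(encode(pieces, add_special_tokens=False))  # ['hello', 'world']
```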
0 votes
0 answers
33 views

Proper treatment of lists of word tokens in a df for clustering - advice needed

I have been provided with tokenized text that has been previously generated. I would like to keep these tokens and use them in my analysis. For each item of interest, there are multiple lists of ...
Linda Smith
1 vote
1 answer
1k views

Can't find model 'en_core_web_lg'. It doesn't seem to be a Python package or a valid path to a data directory. Even though they are in the same directory

I am trying different text processing model. I am trying to use spacy and it's model en_core_web_lg. import spacy import spacy.language from spacy_langdetect import LanguageDetector from spacy....
jill gosrani
0 votes
1 answer
375 views

'BpeTrainer' object cannot be converted to 'Sequence' when training Bpetokenizer

I found this class on import os from tokenizers.models import BPE from tokenizers import Tokenizer from tokenizers.decoders import ByteLevel as ByteLevelDecoder from tokenizers.normalizers import NFKC,...
Brian Hode
1 vote
1 answer
35 views

bert-base-uncased does not use newly added suffix token

I want to add custom tokens to the BertTokenizer. However, the model does not use the new token. from transformers import BertTokenizer tokenizer = BertTokenizer.from_pretrained("bert-base-...
Lulacca • 13
0 votes
1 answer
656 views

Tokenizing very large text datasets (cannot fit in RAM/GPU Memory) with Tensorflow

How do we tokenize very large text datasets that don't fit into memory in Tensorflow? For image datasets, there is the ImageDataGenerator that loads the data per batch to the model, and preprocesses ...
jgoh • 49
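Within TensorFlow, `tf.data.TextLineDataset` plus `map`/`batch` does this natively; the underlying pattern is simply streaming the file line by line and yielding fixed-size batches. A stdlib sketch of that pattern — `batched_tokens`, the whitespace tokenizer, and the tiny demo corpus are all illustrative:

```python
import os
import tempfile

def batched_tokens(path, batch_size=2, tokenize=str.split):
    """Yield batches of tokenized lines without loading the file into memory."""
    batch = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            batch.append(tokenize(line.strip()))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch

# demo on a small temporary "corpus"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("the quick fox\njumps over\nthe lazy dog\n")
    corpus = f.name

batches = list(batched_tokens(corpus, batch_size=2))
os.remove(corpus)
print(batches)
# [[['the', 'quick', 'fox'], ['jumps', 'over']], [['the', 'lazy', 'dog']]]
```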
1 vote
1 answer
796 views

How to stop spaCy tokenizer from tokenizing words enclosed within brackets

I'm trying to make the spaCy tokenizer avoid certain words enclosed by brackets, like [intervention]. However, no matter what I try, I cannot get the right code to include a rule or an exception. ...
ignacioct • 345
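The usual answer is a tokenizer special case: spaCy checks special cases before applying its prefix/suffix punctuation rules, so a bracketed marker registered this way survives as one token. A hedged sketch, assuming spaCy 3.x is installed; the marker and sentence are illustrative:

```python
# Keep "[intervention]" as a single token via a tokenizer special case.
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("[intervention]", [{ORTH: "[intervention]"}])

doc = nlp("The patient received [intervention] twice")
print([t.text for t in doc])
# ['The', 'patient', 'received', '[intervention]', 'twice']
```

Note that special cases match whitespace-delimited substrings, so the marker must be surrounded by spaces in the input for the rule to fire.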
0 votes
0 answers
116 views

How to Sentence Tokenize a List of Strings while maintaining the information of what strings constitute each sentence?

I have a list of strings as below (found from an OCR on a PDF), and for each string in the list, I also have the co-ordinates of their position in the pdf ["Much of Singapore's infrastructure had ...
newbie101 • 199
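One stdlib approach: join the fragments while recording each fragment's character span, sentence-split the joined text with offsets, then map each sentence back to the overlapping fragment indices (and from there to the coordinates). A sketch with a naive regex splitter standing in for `nltk.sent_tokenize`; the helper name and sample fragments are illustrative:

```python
import re

def sentences_with_sources(strings):
    """Join OCR fragments, split into sentences, and report which source
    strings (by index) each sentence draws from."""
    # record the [start, end) character span of each fragment in the joined text
    spans, pos = [], 0
    for s in strings:
        spans.append((pos, pos + len(s)))
        pos += len(s) + 1  # +1 for the joining space
    text = " ".join(strings)

    out = []
    # naive sentence splitter: runs of non-terminators plus optional terminator
    for m in re.finditer(r"[^.!?]+[.!?]?", text):
        sent = m.group().strip()
        if not sent:
            continue
        start, end = m.start(), m.end()
        sources = [i for i, (a, b) in enumerate(spans) if a < end and b > start]
        out.append((sent, sources))
    return out

fragments = ["Much of the city was rebuilt.", "A new era", "began soon."]
print(sentences_with_sources(fragments))
```

Swapping the regex loop for `nltk.sent_tokenize` with `span_tokenize` (or spaCy sentence boundaries with token offsets) keeps the same span-overlap mapping intact.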
0 votes
0 answers
55 views

Value Error when using add_tokens, 'the truth value of an array with more than one element is ambiguous'

I'm trying to improve a basic BERT, pretrained tokenizer model. I'm adding new tokens using add_tokens, but am running into issues with the built-in method. Namely: ValueError ...
Manny • 35
0 votes
2 answers
266 views

What is Stanford CoreNLP's recipe for tokenization?

Whether you're using Stanza or Corenlp (now deprecated) python wrappers, or the original Java implementation, the tokenization rules that StanfordCoreNLP follows are super hard for me to figure out ...
lrthistlethwaite
