All Questions
195 questions
1 vote · 1 answer · 71 views
How to handle German language specific characters like (ä, ö, ü, ß) while tokenizing using GPT2Tokenizer?
I am working with German texts, which I need to tokenize using GPT2Tokenizer.
To tokenize the text, I wrote the implementation as follows:
from transformers import GPT2Tokenizer
text = "...
1 vote · 1 answer · 69 views
How do I remove escape characters from output of nltk.word_tokenize?
How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...
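
A minimal sketch of one way to strip non-printing characters from word_tokenize output; the cleanup rule is an assumption, not something from the book:

from nltk.tokenize import word_tokenize

# nltk.download("punkt") may be required once for word_tokenize.
raw = "It\x92s a test\x0c sentence."  # illustrative text with stray control bytes
tokens = word_tokenize(raw)

# Remove non-printable characters from each token and drop anything left empty.
clean = [t for t in ("".join(ch for ch in tok if ch.isprintable()) for tok in tokens) if t]
print(clean)
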
5 votes · 1 answer · 565 views
Is there a way to save a pre-compiled AutoTokenizer?
Sometimes, we'll have to do something like this to extend a pre-trained tokenizer:
from transformers import AutoTokenizer
from datasets import load_dataset
ds_de = load_dataset("mc4", 'de')
...
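
A minimal sketch of persisting the extended tokenizer with save_pretrained and loading it back later; the model name, added token and paths are illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
tokenizer.add_tokens(["Datenschutzgrundverordnung"])  # illustrative extension

# Writes vocab, added tokens and config to disk ...
tokenizer.save_pretrained("./extended-tokenizer")

# ... so the extension step never has to be repeated.
reloaded = AutoTokenizer.from_pretrained("./extended-tokenizer")
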
-1 votes · 1 answer · 28 views
nltk python library word tokenization error [closed]
I am trying to tokenize a file.
AttributeError Traceback (most recent call last)
<ipython-input-8-81ae6f78b554> in <cell line: 4>()
2 robert = open('...
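
The truncated traceback suggests the open file object, rather than its contents, ended up in the tokenizer; a minimal sketch that reads the text first (the file name is illustrative):

from nltk.tokenize import word_tokenize

# nltk.download("punkt") may be required once for word_tokenize.
# Read the file's contents before tokenizing; passing the file object itself
# to word_tokenize is a common source of this kind of AttributeError.
with open("robert.txt", encoding="utf-8") as f:
    text = f.read()

tokens = word_tokenize(text)
print(tokens[:20])
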
1 vote · 1 answer · 554 views
Splitting a word into two words with spaCy
I'm facing an issue where I need to split a single 'word' into two words due to missing spaces or new lines in the received text. My intention is to establish a pipeline (spaCy 3.5.4) for this task ...
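
A minimal sketch using Doc.retokenize's split method, which is one way to do this in spaCy 3.x; the example word and split point are assumptions:

import spacy

nlp = spacy.blank("en")
doc = nlp("The patient receivedtreatment today.")  # missing space, illustrative

token = doc[2]  # "receivedtreatment"
with doc.retokenize() as retokenizer:
    # Split one token into two; the heads here simply attach the second piece
    # to the first and leave the first as its own head.
    retokenizer.split(token, ["received", "treatment"],
                      heads=[(token, 0), (token, 0)])

print([t.text for t in doc])
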
6 votes · 1 answer · 2k views
When to set `add_special_tokens=False` in huggingface transformers tokenizer?
This is the default way of setting up a tokenizer in the Hugging Face "transformers" library:
from transformers import BertForSequenceClassification, BertTokenizer
tokenizer=BertTokenizer....
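
A minimal sketch contrasting the two settings on bert-base-uncased:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Default: [CLS] and [SEP] are added, which is what classification models expect.
print(tokenizer("hello world")["input_ids"])

# add_special_tokens=False: raw wordpiece ids only, useful when you assemble the
# final sequence (and its special tokens) yourself, e.g. for custom pairing.
print(tokenizer("hello world", add_special_tokens=False)["input_ids"])
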
0 votes · 0 answers · 33 views
Proper treatment of lists of word tokens in a df for clustering - advice needed
I have been provided with tokenized text that has been previously generated. I would like to keep these tokens and use them in my analysis. For each item of interest, there are multiple lists of ...
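
A minimal sketch of one common route: vectorize the pre-tokenized lists directly and cluster; the column name, sample tokens and cluster count are assumptions:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

df = pd.DataFrame({"tokens": [["solar", "panel", "roof"],
                              ["wind", "turbine", "blade"],
                              ["solar", "energy", "grid"]]})

# analyzer=lambda toks: toks tells TfidfVectorizer the input is already tokenized.
vectorizer = TfidfVectorizer(analyzer=lambda toks: toks)
X = vectorizer.fit_transform(df["tokens"])

df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(df)
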
1 vote · 1 answer · 1k views
Can't find model 'en_core_web_lg'. It doesn't seem to be a Python package or a valid path to a data directory. Even though they are in the same directory
I am trying different text processing model. I am trying to use spacy and it's model en_core_web_lg.
import spacy
import spacy.language
from spacy_langdetect import LanguageDetector
from spacy....
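
A minimal sketch of the usual fix: install the model package into the same Python environment before loading it:

import spacy
from spacy.cli import download

# Equivalent to running: python -m spacy download en_core_web_lg
try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    download("en_core_web_lg")
    nlp = spacy.load("en_core_web_lg")

print(nlp.pipe_names)
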
0 votes · 1 answer · 375 views
'BpeTrainer' object cannot be converted to 'Sequence' when training a BPE tokenizer
I found this class on
import os
from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC,...
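
This error typically means the trainer landed in the position where the list of files is expected; a minimal sketch with the arguments in the expected order (file paths and settings are illustrative):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])

# train() takes (files, trainer); swapping them raises
# "'BpeTrainer' object cannot be converted to 'Sequence'".
tokenizer.train(["corpus_part1.txt", "corpus_part2.txt"], trainer)
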
1 vote · 1 answer · 35 views
bert-base-uncased does not use newly added suffix token
I want to add custom tokens to the BertTokenizer. However, the model does not use the new token.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-...
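
A minimal sketch of adding a token and resizing the model's embeddings; the token itself is illustrative. Note that add_tokens registers whole tokens matched verbatim in the input, so a "##"-style suffix piece will not be applied inside other words:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["covid19"])              # illustrative whole token
model.resize_token_embeddings(len(tokenizer))  # keep embeddings in sync

print(tokenizer.tokenize("covid19 outbreak"))  # -> ['covid19', 'outbreak']
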
0 votes · 1 answer · 656 views
Tokenizing very large text datasets (cannot fit in RAM/GPU Memory) with Tensorflow
How do we tokenize very large text datasets that don't fit into memory in Tensorflow? For image datasets, there is the ImageDataGenerator that loads the data per batch to the model, and preprocesses ...
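
A minimal sketch that streams text from disk with tf.data and tokenizes batch by batch, so the corpus never has to fit in memory; the file pattern, vocabulary size and batch sizes are assumptions:

import tensorflow as tf

# Stream lines from many text files instead of loading everything at once.
files = tf.data.Dataset.list_files("corpus/*.txt")
lines = files.interleave(tf.data.TextLineDataset, cycle_length=4)

# adapt() builds the vocabulary from a streaming dataset as well.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=50000,
                                               output_sequence_length=128)
vectorizer.adapt(lines.batch(1024))

# Tokenize lazily, batch by batch, as the model consumes the pipeline.
token_ds = lines.batch(256).map(vectorizer, num_parallel_calls=tf.data.AUTOTUNE)
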
1 vote · 1 answer · 796 views
How to stop spaCy tokenizer from tokenizing words enclosed within brackets
I'm trying to make the spaCy tokenizer avoid certain words enclosed by brackets, like [intervention]. However, no matter what I try, I cannot get the right code to include a rule or an exception. ...
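
A minimal sketch of one approach: let the tokenizer's token_match keep bracketed substrings intact; the regex and sample text are assumptions, and overriding token_match replaces its default behaviour:

import re
import spacy

nlp = spacy.blank("en")

# A whitespace-separated substring matching token_match is kept as one token,
# so the brackets are not split off.
bracket_re = re.compile(r"^\[[^\[\]\s]+\]$")
nlp.tokenizer.token_match = bracket_re.match

doc = nlp("The patient received [intervention] twice daily.")
print([t.text for t in doc])  # '[intervention]' stays in one piece
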
0 votes · 0 answers · 116 views
How to Sentence Tokenize a List of Strings while maintaining the information of what strings constitute each sentence?
I have a list of strings as below (found from OCR on a PDF), and for each string in the list, I also have the coordinates of its position in the PDF
["Much of Singapore's infrastructure had ...
0 votes · 0 answers · 55 views
ValueError when using add_tokens: 'the truth value of an array with more than one element is ambiguous'
I'm trying to improve a basic pretrained BERT tokenizer model. I'm adding new tokens using add_tokens, but running into issues with the built-in method.
Namely:
ValueError ...
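
That ValueError usually appears when a NumPy array (or pandas Series) of tokens is passed instead of a plain Python list of strings; a minimal sketch of the conversion, with illustrative tokens:

import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

new_tokens = np.array(["genomics", "proteomics"])  # e.g. taken from a dataframe column

# Convert to a list of str before calling add_tokens; passing the array itself
# can trigger "the truth value of an array ... is ambiguous".
num_added = tokenizer.add_tokens([str(t) for t in new_tokens.tolist()])
print(num_added)
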
0 votes · 2 answers · 266 views
What is Stanford CoreNLP's recipe for tokenization?
Whether you're using the Stanza or CoreNLP (now deprecated) Python wrappers, or the original Java implementation, the tokenization rules that Stanford CoreNLP follows are very hard for me to figure out ...
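
For what it's worth, the behaviour can be probed empirically through Stanza's CoreNLP client, which runs CoreNLP's own rule-based tokenizer rather than Stanza's neural one (requires a local CoreNLP install with CORENLP_HOME set; the sample sentence is illustrative):

from stanza.server import CoreNLPClient

text = 'They\'re selling the U.S.-based co. for $1.2m ("cheap").'

with CoreNLPClient(annotators=["tokenize", "ssplit"], be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        print([token.word for token in sentence.token])
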