
I'm currently working with a dataset where profanity is censored. Basically, fuck becomes 4 heart emojis. Considering I'm trying to run topic modelling with BERTopic, what kind of preprocessing would be adequate to handle this situation?

  • Should I replace it all with a placeholder like [CENSORED] or [PROFANITY]?
  • Should I try to change it back to the actual word? Considering a four-letter curse could be either "shit" or "fuck", for example, it could be really hard.

1 Answer


Interesting question. There are a few directions you could take.

First, consider how much profanity is in your data and whether it actually impacts the topics.

From an NLP perspective, profanity can take different roles:

  • noun (he is an ****)
  • verb (**** you)
  • adjective (this place is ****)
  • adverb (I **** love this place)

So treating everything the same way may not be ideal.

Here are some options:

1. Do nothing

It might not affect your topics much. In that case, leaving it as is can work.

2. Try to uncensor

You could reconstruct words using heuristics or a masked language model. This is noisy since multiple words can map to the same pattern.
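A minimal length-matching heuristic already shows that ambiguity. This sketch assumes one heart emoji per censored letter and uses a made-up mini lexicon (a real profanity word list would be much larger):

```python
import re

# Made-up mini lexicon; a real profanity word list would be much larger.
LEXICON = ["shit", "fuck", "damn", "crap", "bastard"]

def candidate_words(censored: str) -> list[str]:
    """Return lexicon words whose length matches the censor run,
    assuming one heart emoji per censored letter."""
    n = len(re.findall(r"[❤♥]", censored))
    return [w for w in LEXICON if len(w) == n]

print(candidate_words("❤❤❤❤"))     # ['shit', 'fuck', 'damn', 'crap']
print(candidate_words("❤❤❤❤❤❤❤"))  # ['bastard']
```

Even with a full lexicon, the four-heart case stays ambiguous, which is why a masked language model (fill-mask over the surrounding sentence) is the usual next step, at the cost of extra noise.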

3. Use a generic placeholder

Replace censored text with something like [PROFANITY] or [CENSORED].

You can also normalize all censored tokens to one form. Check that the token exists in the model vocabulary so it gets a meaningful embedding.

Another option is to use the model’s [MASK] token: it does not encode profanity specifically, but it is a token the model already knows how to represent.
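A minimal normalization sketch, assuming the censor marks are runs of heart emojis (with or without the U+FE0F variation selector); adjust the pattern to whatever symbols your dataset actually uses:

```python
import re

# Assumes censor marks are runs of heart emojis, optionally with the
# U+FE0F variation selector; adjust the class to your dataset's symbols.
CENSOR = re.compile(r"(?:[❤♥]\uFE0F?)+")

def normalize_censored(text: str, placeholder: str = "[PROFANITY]") -> str:
    return CENSOR.sub(placeholder, text)

print(normalize_censored("this place is ❤❤❤❤ amazing"))
# this place is [PROFANITY] amazing
```

If you go this route, check how the placeholder tokenizes in your embedding model; if it splits into many subword pieces, an ordinary in-vocabulary word may embed more meaningfully.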

4. Use POS-based placeholders

If you can estimate part of speech, use tokens like:

  • [PROFANITY_NOUN]
  • [PROFANITY_VERB]

This keeps some structure.
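A rough context-rule sketch of this idea; the neighbour word lists are illustrative only, and a real tagger (e.g. spaCy, run after substituting a dummy word) would be more robust:

```python
import re

CENSOR = re.compile(r"(?:[❤♥]\uFE0F?)+")

def pos_placeholder(tokens: list[str], i: int) -> str:
    """Guess a POS tag for a censored token from its neighbours.
    The context word lists here are illustrative, not exhaustive."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if prev in {"a", "an", "the"}:
        return "[PROFANITY_NOUN]"
    if nxt in {"you", "off"}:
        return "[PROFANITY_VERB]"
    if prev in {"is", "was", "are"} and nxt == "":
        return "[PROFANITY_ADJ]"
    return "[PROFANITY]"  # fall back to the generic placeholder

def tag_censored(text: str) -> str:
    tokens = text.split()
    return " ".join(
        pos_placeholder(tokens, i) if CENSOR.fullmatch(tok) else tok
        for i, tok in enumerate(tokens)
    )

print(tag_censored("he is an ❤❤❤❤"))       # he is an [PROFANITY_NOUN]
print(tag_censored("this place is ❤❤❤❤"))  # this place is [PROFANITY_ADJ]
```

Note this naive split-on-whitespace version would miss censored tokens with attached punctuation; a real tokenizer fixes that.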

5. Handle it in embeddings

Instead of changing text, map censored tokens to a general profanity vector. For example, an average of known profanity embeddings. This only helps if the model has seen such words.
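A toy sketch of the idea; the `embed` dict here is a stand-in for a real embedding lookup (with BERTopic you would pull these vectors from the sentence-transformer backbone instead), and the tiny vectors are made up:

```python
import numpy as np

# Stand-in for a real embedding lookup; with BERTopic you would pull
# these vectors from the sentence-transformer backbone instead.
embed = {
    "fuck": np.array([0.9, 0.1, 0.0]),
    "shit": np.array([0.8, 0.3, 0.1]),
    "damn": np.array([0.7, 0.2, 0.2]),
}

# One generic "profanity" vector: the mean of known profanity embeddings.
# Every censored token then maps to this vector instead of to text.
profanity_vector = np.mean(list(embed.values()), axis=0)

print(profanity_vector.round(2))  # [0.8 0.2 0.1]
```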


The best choice depends on whether profanity matters for your topics. If it does, keep the signal. If not, simple normalization is enough.

Interesting problem.
