Interesting question. There are a few directions you could take.
First, consider how much profanity is in your data and whether it actually impacts the topics.
From an NLP perspective, profanity can take different roles:
- noun (he is an ****)
- verb (**** you)
- adjective (this place is ****)
- adverb (I **** love this place)
So treating all censored tokens the same way may not be ideal.
Here are some options:
1. Do nothing
It might not affect your topics much. In that case, leaving it as is can work.
2. Try to uncensor
You could reconstruct the original words using heuristics or a masked language model. This is inherently noisy, since multiple words can map to the same censored pattern.
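A minimal sketch of the heuristic route, assuming the censoring keeps some visible letters (e.g. `d**n`) and that you have a word list to match against. The word list here is a toy stand-in; fully masked tokens like `****` cannot be disambiguated this way at all.

```python
import re

# Toy word list; in practice you would use a curated profanity lexicon.
KNOWN_WORDS = ["damn", "darn", "dang", "hell"]

def candidates(censored: str) -> list[str]:
    """Return known words matching a partially censored token,
    where '*' stands for any single letter."""
    pattern = "^" + "".join(
        "[a-z]" if ch == "*" else re.escape(ch) for ch in censored.lower()
    ) + "$"
    return [w for w in KNOWN_WORDS if re.match(pattern, w)]

print(candidates("d**n"))  # ['damn', 'darn'] -- already ambiguous
```

The ambiguity in the output is exactly the noise mentioned above: even with two visible letters, several reconstructions remain plausible.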
3. Use a generic placeholder
Replace censored text with something like [PROFANITY] or [CENSORED].
You can also normalize all censored tokens to one form. Check that the placeholder exists in the model vocabulary so it maps to a single, meaningful embedding instead of being split into subwords.
Another option is to use the model’s [MASK] token. It does not encode profanity, but the model understands it.
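A sketch of the normalization step, assuming censored tokens show up as runs of asterisks (possibly with letters attached, as in `f***`). The vocabulary check is shown only as a comment, since it needs a loaded tokenizer.

```python
import re

def normalize(text: str, placeholder: str = "[CENSORED]") -> str:
    """Replace any token containing a run of asterisks with one placeholder."""
    return re.sub(r"\w*\*+\w*", placeholder, text)

print(normalize("this place is ****"))   # this place is [CENSORED]
print(normalize("I f*** love it"))       # I [CENSORED] love it

# With a Hugging Face tokenizer you could then verify the placeholder
# is a known token (not run here):
#   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
#   placeholder in tokenizer.get_vocab()   # or register it via add_tokens()
```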
4. Use POS-based placeholders
If you can estimate part of speech, use tokens like:
[PROFANITY_NOUN]
[PROFANITY_VERB]
This keeps some structure.
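If you don't want to run a full tagger, even a crude context heuristic can assign the POS-tagged placeholders. The rules below are illustrative assumptions keyed to the four example patterns earlier in this answer; a real tagger (e.g. spaCy) would be far more robust.

```python
def pos_placeholder(tokens: list[str], i: int) -> str:
    """Guess a POS-tagged placeholder for a censored token at position i,
    using only the neighboring words. Crude heuristic, not a real tagger."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if prev in {"a", "an", "the"}:
        return "[PROFANITY_NOUN]"   # "he is an ****"
    if prev in {"is", "are", "was", "were"}:
        return "[PROFANITY_ADJ]"    # "this place is ****"
    if nxt in {"you", "him", "her", "them", "it"}:
        return "[PROFANITY_VERB]"   # "**** you"
    return "[PROFANITY_ADV]"        # fallback, e.g. "I **** love this place"

print(pos_placeholder(["he", "is", "an", "****"], 3))
```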
5. Handle it in embeddings
Instead of changing text, map censored tokens to a general profanity vector. For example, an average of known profanity embeddings. This only helps if the model has seen such words.
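A sketch of the embedding-level route with a toy 4-dimensional embedding table standing in for your real one: censored tokens are routed to the centroid of known profanity vectors instead of being rewritten in the text.

```python
import numpy as np

# Toy embeddings; in practice these come from your trained model's
# embedding table, looked up for a curated profanity lexicon.
embeddings = {
    "damn": np.array([0.9, 0.1, 0.0, 0.2]),
    "hell": np.array([0.8, 0.2, 0.1, 0.1]),
    "crap": np.array([0.7, 0.3, 0.0, 0.3]),
}

# Centroid of the known profanity vectors.
profanity_vector = np.mean(list(embeddings.values()), axis=0)

def embed(token: str) -> np.ndarray:
    """Look up a token, routing censored forms to the shared vector."""
    if "*" in token:
        return profanity_vector
    return embeddings.get(token, np.zeros(4))

print(embed("****"))
```

This keeps the surface text untouched, which matters if other parts of your pipeline (e.g. keyword extraction) read the raw strings.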
The best choice depends on whether profanity carries signal for your topics: if it does, preserve that signal (POS-aware placeholders or the embedding mapping); if not, simple normalization to one placeholder is enough.