
I need to process small amounts of text (i.e. strings in Python).

I want to remove certain punctuation (like '.', ',', ':', ';')

but keep punctuation that indicates emotion (like '...', '?', '??', '???', '!', '!!', '!!!').

I also want to remove non-informative words such as 'a', 'an', 'the'. The biggest challenge so far is how to parse "I've" or "we've" so that I eventually get "I have" and "we have". The apostrophe makes this difficult for me.

What is the best/simplest way to do this in Python?

For example:

"I've got an A mark!!! Such a relief... I should've partied more."

The result I want to get:

['I', 'have', 'got', 'A', 'mark', '!!!', 'Such', 'relief', '...',
 'I', 'should', 'have', 'partied', 'more']
  • Have you tried anything to accomplish this? Commented Feb 12, 2016 at 19:20
  • Yes! I have tried several regular expressions, but each one achieves only one goal or another, never all of them together.
    – Uylenburgh
    Commented Feb 12, 2016 at 19:21
  • Then post them & explain what was wrong, and maybe someone can help fix them. Commented Feb 12, 2016 at 19:22
  • Make a Python list of all the things you want to remove, then apply str.replace(item, "") for each item in the list. That's not very efficient, though, if you have a lot of strings and a lot of substrings to replace. Commented Feb 12, 2016 at 19:27

1 Answer


This can become complicated, depending on how many more rules you want to apply.

You could make use of \b in your regular expressions to match the beginning or end of a word. Inserting a space at each such boundary isolates runs of punctuation from the words around them, and you can then check whether an isolated run is a single character from a set like [.,;:].
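To see the \b step on its own, here is a minimal sketch (the sample string and the variable names are just for illustration). It inserts a space at every word boundary, splits on whitespace, and then drops the tokens that are a single unwanted punctuation character:

```python
import re

sample = "mark!!! Such, a relief..."

# \b matches the empty position between a word character and a
# non-word character, so this puts a space around every word.
spaced = re.sub(r"\b", " ", sample)

# str.split() with no arguments splits on runs of whitespace
# and never returns empty strings.
tokens = spaced.split()

# Single-character punctuation to discard; longer runs such as
# '!!!' or '...' are different strings, so they survive.
unwanted = {".", ",", ";", ":"}
kept = [t for t in tokens if t not in unwanted]

print(tokens)
print(kept)
```

Because '!!!' and '...' are compared as whole tokens, they never match the single-character entries in the set.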

These ideas are used in this code:

import re

def tokenise(txt):
    # Expand "'ve" after a word character (case-insensitive)
    txt = re.sub(r"(?i)(\w)'ve\b", r'\1 have', txt)
    # Separate punctuation from words by adding a space at every word boundary
    txt = re.sub(r'\b', ' ', txt)
    # Remove isolated, single-character punctuation,
    # and articles (a, an, the); a capital "A" is kept on purpose
    txt = re.sub(r'(^|\s)([.,;:]|[Aa]n|a|[Tt]he)($|\s)', r'\1\3', txt)
    # Split on whitespace and return the non-empty strings as a list
    return [word for word in re.split(r'\s+', txt) if word]

# Example use
txt = "I've got an A mark!!! Such a relief... I should've partied more."
words = tokenise(txt)
print(','.join(words))

Output:

I,have,got,A,mark,!!!,Such,relief,...,I,should,have,partied,more
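If more rules are needed, one possible variation is to tokenise first and filter afterwards, instead of deleting with substitutions. The sketch below is only an illustration of that idea: the names tokenise_v2, CONTRACTIONS and DROP are hypothetical, and the contraction table is deliberately incomplete.

```python
import re

# Hypothetical, incomplete table of contraction suffixes to expand.
CONTRACTIONS = {"'ve": " have", "'ll": " will", "'re": " are"}

# Tokens to discard: single punctuation marks and articles.
# A capital "A" is not listed, so "A mark" keeps its "A".
DROP = {".", ",", ";", ":", "a", "an", "An", "the", "The"}

def tokenise_v2(txt):
    # Expand each listed suffix when it directly follows a word character.
    for suffix, expansion in CONTRACTIONS.items():
        pattern = r"(?<=\w)" + re.escape(suffix) + r"\b"
        txt = re.sub(pattern, expansion, txt, flags=re.IGNORECASE)
    # A token is either a run of word characters or a run of
    # characters that are neither word characters nor whitespace.
    tokens = re.findall(r"\w+|[^\w\s]+", txt)
    # Filter whole tokens, so '...' and '!!!' are never confused with '.'.
    return [t for t in tokens if t not in DROP]

txt = "I've got an A mark!!! Such a relief... I should've partied more."
print(tokenise_v2(txt))
```

Filtering a token list also handles removable tokens that sit next to each other, as in "I saw the.", where a substitution that consumes the whitespace on both sides of a match can leave the second token behind. Contractions that are not in the table are simply split at the apostrophe, so the table would need extending for real text.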

