I have written the following function to preprocess some text data as input to a machine learning algorithm. It lowercases, tokenises, removes stop words and lemmatizes, returning a string of space-separated tokens. However, this code runs extremely slowly. What can I do to optimise it?
import os
import re
import csv
import time
import nltk
import string
import pickle
import numpy as np
import pandas as pd
import pyparsing as pp
import matplotlib.pyplot as plt
from sklearn import preprocessing
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
def preprocessText(text, lemmatizer, lemma, ps):
    '''
    Lowercases, tokenises, removes stop words and lemmatises using WordNet.
    Returns a string of space-separated tokens.
    '''
    # Lowercase and strip everything except letters.
    words = text.lower()
    words = re.sub("[^a-zA-Z]", " ", words)
    words = word_tokenize(words)
    stemmed_words = []
    # Remove stop words.
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    text = ""
    if lemmatizer == True:
        # POS-tag each token and lemmatise it with the matching WordNet tag.
        pos_translate = {'J':'a', 'V':'v', 'N':'n', 'R':'r'}
        meaningful_words = [lemma.lemmatize(w, pos=pos_translate[pos[0]] if pos[0] in pos_translate else 'n') for w, pos in nltk.pos_tag(meaningful_words)]
        # Rebuild the string, keeping only tokens longer than one character.
        for each in meaningful_words:
            if len(each) > 1:
                text = text + " " + each
        return text
    else:
        # Otherwise stem the tokens with the Porter stemmer.
        words_again = []
        for each in meaningful_words:
            words_again.append(ps.stem(each))
        text = ""
        for each in words_again:
            if len(each) > 1:
                text = text + " " + each
        return(text)
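
For context, this is roughly how the function gets called; the file name, DataFrame and column names below are only placeholders to show the call pattern, not the exact names from my project:

# Illustrative driver code - file, DataFrame and column names are placeholders.
lemma = WordNetLemmatizer()
ps = PorterStemmer()

df = pd.read_csv("data.csv")  # assumed input file
df["clean_text"] = df["text"].apply(
    lambda t: preprocessText(t, lemmatizer=True, lemma=lemma, ps=ps)
)

So the function is applied once per row of the column, which is where the slowdown shows up.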