I am using a function on a pandas dataframe as follows:

import spacy
from collections import Counter


# Load English language model
nlp = spacy.load("en_core_web_sm")

# Function to filter out only nouns from a list of words
def filter_nouns(words):
    SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
    filtered_nouns = []
    
    # Preprocess the text by removing symbols and splitting into words
    words = [word.translate({ord(SYM): None for SYM in SYMBOLS}).strip() for word in words.split()]
    
    # Process each word and filter only nouns
    filtered_nouns = [token.text for token in nlp(" ".join(words)) if token.pos_ == "NOUN"]
    
    return filtered_nouns



# Apply filtering logic to all rows in the 'NOTE' column
df['filtered_nouns'] = df['NOTE'].apply(lambda x: filter_nouns(x))

I have a dataset containing 6400 rows, and each value in df['NOTE'] is a very long paragraph converted from the Oracle CLOB datatype.

The function runs quickly on 5-10 rows, but on all 6400 rows it takes a very long time.

Are there any ways to optimize this?

  • Can you share a few rows of sample data and expected output? – Nick, Commented Mar 14, 2024 at 4:05

2 Answers

The first thing you should do is remove all the repetition in your function. In this line:

words = [word.translate({ord(SYM): None for SYM in SYMBOLS}).strip() for word in words.split()]

You are building the translation dictionary every time you translate a word, and calling translate for each word in the text. It is far more efficient to do each of those once:

tr = str.maketrans('', '', SYMBOLS)
words = words.strip().translate(tr).split()

This makes about a 50x speed-up on a 1000-word string on my computer.
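
If you want to verify the speed-up yourself, here is a rough timing sketch (per_word mirrors the original line, once the revised version; the exact numbers will vary by machine):

import timeit

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
text = 'alpha beta42 gamma(3) ' * 333  # roughly a 1000-word string

def per_word():
    # Original approach: rebuild the mapping and call translate per word
    return [w.translate({ord(s): None for s in SYMBOLS}).strip()
            for w in text.split()]

TR = str.maketrans('', '', SYMBOLS)  # built once

def once():
    # Revised approach: one precomputed table, one translate call
    return text.strip().translate(TR).split()

print('per word:', timeit.timeit(per_word, number=100))
print('once:    ', timeit.timeit(once, number=100))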

In the next line you then join all the words back together before passing them to nlp. Pull that out into its own step:

text = ' '.join(words)
filtered_nouns = [token.text for token in nlp(text) if token.pos_ == "NOUN"]

But note that you just split on spaces and then join on spaces again, so you might as well skip that round trip completely. Putting it all together:

def filter_nouns(text):
    SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
    tr = str.maketrans('', '', SYMBOLS)
    
    # Preprocess the text by removing symbols
    words = text.strip().translate(tr)
    
    # Process each word and filter only nouns
    filtered_nouns = [token.text for token in nlp(words) if token.pos_ == "NOUN"]
    
    return filtered_nouns

Finally, note that .apply(lambda x: filter_nouns(x)) is the same as .apply(filter_nouns).
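
As a final sketch, the rewritten function applied to the dataframe; hoisting SYMBOLS and the translation table to module level (an extra step beyond the version above, in the same spirit of doing things once) builds them a single time instead of on every one of the 6400 calls. nlp is the model already loaded at the top of the question:

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
TR = str.maketrans('', '', SYMBOLS)  # built once, reused for every row

def filter_nouns(text):
    # Strip symbols in one pass, then keep only the nouns spaCy finds
    cleaned = text.strip().translate(TR)
    return [token.text for token in nlp(cleaned) if token.pos_ == "NOUN"]

df['filtered_nouns'] = df['NOTE'].apply(filter_nouns)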

A simple way is to use the built-in multiprocessing module: split the data into multiple parts and process them independently.
Check the documentation for details and examples: https://docs.python.org/3/library/multiprocessing.html
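
A minimal sketch of that idea, reusing the question's setup (assumed: en_core_web_sm is installed; the stand-in DataFrame below is only for illustration, in the question it would be the 6400-row NOTE column). Each worker process loads its own copy of the spaCy model via the pool initializer, since loading it once per worker is cheaper than shipping it with every task:

import multiprocessing as mp

import numpy as np
import pandas as pd
import spacy

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
TR = str.maketrans('', '', SYMBOLS)

nlp = None  # each worker loads its own model in init_worker

def init_worker():
    global nlp
    nlp = spacy.load("en_core_web_sm")

def filter_nouns(text):
    # Strip symbols in one pass, then keep only the nouns spaCy finds
    cleaned = text.strip().translate(TR)
    return [token.text for token in nlp(cleaned) if token.pos_ == "NOUN"]

def process_chunk(texts):
    return [filter_nouns(t) for t in texts]

if __name__ == "__main__":
    # Stand-in data; replace with the real dataframe
    df = pd.DataFrame({'NOTE': ['The quick brown fox jumps over the lazy dog.'] * 100})

    n_workers = mp.cpu_count()
    chunks = np.array_split(df['NOTE'].to_numpy(), n_workers)
    with mp.Pool(processes=n_workers, initializer=init_worker) as pool:
        results = pool.map(process_chunk, chunks)

    # Reassemble the per-chunk results in the original row order
    df['filtered_nouns'] = [nouns for chunk in results for nouns in chunk]
    print(df.head())

Note that on Windows and macOS the default start method is "spawn", so the __main__ guard is required for the pool to start cleanly.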

  • While sometimes useful and required, multiprocessing should be brought in only after making sure that the code runs efficiently in a single process, which it does not at the moment. It could be considered after trying out, for example, what Nick suggested, and only if the speed is still not enough after that. – Commented Mar 14, 2024 at 7:45
