I am using the following function on a pandas DataFrame:
```python
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
# Translation table that strips the symbols above (built once, not per word)
SYMBOL_TABLE = {ord(sym): None for sym in SYMBOLS}

# Filter out only the nouns from a text
def filter_nouns(text):
    # Preprocess the text by removing symbols and splitting it into words
    words = [word.translate(SYMBOL_TABLE).strip() for word in text.split()]
    # Run spaCy over the cleaned text and keep only the nouns
    return [token.text for token in nlp(" ".join(words)) if token.pos_ == "NOUN"]

# Apply the filtering logic to every row of the 'NOTE' column
df['filtered_nouns'] = df['NOTE'].apply(filter_nouns)
```
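For reference, on a short made-up sentence the function returns something like this (the exact output depends on the model's tagging):

```python
filter_nouns("The patient reported severe pain in the lower back on 12/03/2021.")
# e.g. ['patient', 'pain', 'back']
```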
My dataset contains 6,400 rows, and each value in df['NOTE'] is a very long paragraph converted from an Oracle CLOB column.
The function runs quickly on 5-10 rows, but on all 6,400 rows it takes a very long time.
Is there any way to optimize this?
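For what it's worth, from the spaCy docs I gather that calling nlp() once per row is the slow part, and that nlp.pipe streams texts through the model in batches. Below is a minimal sketch of how I imagine that would look here (the clean helper and the batch size are my own guesses), though I am not sure it is the right approach:

```python
import spacy

# Assumption: parser and ner are not needed for token.pos_, so disable them for speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
SYMBOL_TABLE = {ord(sym): None for sym in SYMBOLS}

def clean(text):
    # Same symbol stripping as filter_nouns above
    return " ".join(word.translate(SYMBOL_TABLE).strip() for word in text.split())

# Stream all notes through the model in batches instead of one nlp() call per row
texts = [clean(t) for t in df['NOTE']]
df['filtered_nouns'] = [
    [token.text for token in doc if token.pos_ == "NOUN"]
    for doc in nlp.pipe(texts, batch_size=64)
]
```

Would this be a reasonable direction, or is there a better way?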