3
\$\begingroup\$

I have a list of strings to search along with a list of words and want to produce boolean feature vectors indicating presence of each word within the strings. What is the fastest way to build these feature vectors in Python?

In the example below the first string would output [1, 0, 1, 1] for example. I'm currently using the Aho Corasick algorithm (https://stackoverflow.com/questions/44310516/python-find-occurrences-of-list-of-strings-within-string) to search each string in a list comprehension but think it's possible there's a faster way. The code below uses this method and averages the time over 10 runs.

import time
import numpy as np
import ahocorasick

def check_strings(A, search_list, string_to_search):
    """Use Aho Corasick algorithm to produce boolean list indicating
    prescence of strings within a longer string"""
    index_list = []
    for item in A.iter(string_to_search):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()


word_list = ["foo", "bar", "hello", "world"]
strings_to_check = ["hello world foo", "foo bar", "bar world"]

A = ahocorasick.Automaton()
for idx, s in enumerate(word_list):
    A.add_word(s, (idx, s))
A.make_automaton()

run_times = []
for i in range(10):
    t0 = time.time()
    feature_vectors = [check_strings(A, word_list, s) for s in strings_to_check]
    run_times.append(time.time()-t0)

print(feature_vectors)
print(np.mean(np.array(run_times)))

Output is:

[[1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
1.65939331055e-05

Looking for a more efficient way of doing this, I thought it might be possible somehow by concatenating the strings to avoid the loop but can't quite figure out how to get it to work.

\$\endgroup\$

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.