I have a data frame with several columns, one of which contains HTML objects (each containing tables). I want to turn that into a column of table arrays.
My problem is that this piece of code takes a long time to run. Is there any way I can optimize it? I tried a list comprehension, which doesn't significantly improve the run time.
Some have suggested that I restructure the logic. Any suggestions on how?
# imports implied by the snippet (adjust to match your script)
from bs4 import BeautifulSoup as bs
from country_list import countries_for_language
from lxml.etree import tostring

df = htmldf
countries = dict(countries_for_language('en'))
countrylist = list(countries.values())
arrayoftableswithcountry = []
arrayofhtmltables = []

for idx, row in df.iterrows():
    # print("We are now at row", idx + 1, "of", len(df), ".")
    inner_html = tostring(row['html'])
    soup = bs(inner_html, 'lxml')
    tableswithcountry = []
    outputr = []
    for idex, item in enumerate(soup.select('table')):
        # print("Extracting", idex + 1, "of", len(soup.select('table')), ".")
        table = soup.select('table')[idex]
        rows = table.find_all('tr')
        output = []
        for tr in rows:
            cols = tr.find_all('td')
            cols = [cell.text.strip() for cell in cols]
            output.append([cell for cell in cols if cell])
        if methodsname == 'revseg_geo':  # methodsname is defined elsewhere in the script
            if '$' in str(output):
                for country in countrylist:
                    if country in str(output):
                        tableswithcountry.append(output)
                        outputr.append(table)
    arrayoftableswithcountry.append(tableswithcountry)
    arrayofhtmltables.append(outputr)

df['arrayoftables'] = arrayoftableswithcountry
df['arrayofhtmltables'] = arrayofhtmltables
print('Made array of tables.')
df = df.drop(columns=['html'])  # drop() returns a new frame, so keep the result
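For reference, this is roughly the kind of restructuring I have been considering, as an untested sketch: parse each document once, reuse the table element that enumerate() already yields instead of calling soup.select('table') again, and build str(output) only once per table. The helper name extract_tables is just a placeholder, and any() records a matching table once rather than once per matching country, which differs slightly from the loop above.

from bs4 import BeautifulSoup as bs
from country_list import countries_for_language
from lxml.etree import tostring

countrylist = list(dict(countries_for_language('en')).values())

def extract_tables(html_element, check_country):
    # Parse one HTML object and return (table arrays with a country, matching table tags).
    soup = bs(tostring(html_element), 'lxml')
    tableswithcountry, htmltables = [], []
    for table in soup.select('table'):
        output = [
            [cell for cell in (td.text.strip() for td in tr.find_all('td')) if cell]
            for tr in table.find_all('tr')
        ]
        text = str(output)  # build the search string once per table
        if check_country and '$' in text and any(c in text for c in countrylist):
            tableswithcountry.append(output)
            htmltables.append(table)
    return tableswithcountry, htmltables

results = df['html'].apply(lambda el: extract_tables(el, methodsname == 'revseg_geo'))
df['arrayoftables'] = results.str[0]
df['arrayofhtmltables'] = results.str[1]
df = df.drop(columns=['html'])

Would this direction actually help, or is the repeated parsing/iterrows() not the main cost here?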