
Long-time user, but I've never had to ask my own question.

I want to use Python to parse a table from an HTML document into a dataframe. The table is NOT an HTML table; I think it is JavaScript-generated HTML that just uses a bunch of divs with oddly named classes to create the format and layout.

The data is workers and their working hours, sorted by work area. The problem is that the divs are not nested, so I can't easily assign each worker to their work area. I am using BeautifulSoup.

Here is a simplified sample:

<html>
<body>
<div class="workarea">construction
</div>
  <div class="name">Anna
  </div>
  <div class="Muell">w23f84md2o
  </div>
  <div class="time">8:23
  </div>
    <div class="name">Tom
  </div>
  <div class="Muell">w23f84md2o
  </div>
  <div class="time">10:20
  </div>
 <div class="workarea">cleaning
</div>
  <div class="name">Max
  </div>
  <div class="Muell">w23f84md2o
  </div>
  <div class="time">9:30
  </div>
</body>
</html>

Here is what I want as the data frame:

WORKAREA NAME TIME
construction Anna 8:23
construction Tom 10:20
cleaning Max 9:30

Note: the real data has thousands of divs for formatting and layout, which is why I wanted to use a proper parser and not just read the document line by line into Python and parse it myself.

I did not get very far:

# read html with bs4
from bs4 import BeautifulSoup

with open("testdoc1.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# note: the keyword is class_ with a trailing underscore, since class is reserved in Python
wa = soup.find_all("div", class_="workarea")
# here I wanted to add a for loop through wa, but wa doesn't actually contain the info

I can't loop through wa to get the details, because it only contains the construction and cleaning divs, nothing in between.

Is there a solution to parse kind of line-by-line, but actually div-by-div?

Can I make find_all find all divs whose class is workarea, time, or name, and keep them in reading order?

I know there are already a lot of Stack Overflow questions about BeautifulSoup and parsing HTML documents, but I couldn't really find a solution, since the original table is not really just one table and, sadly, the HTML preserves no hierarchy.

Thank you so much for your help! Any hints are much appreciated!

1 Answer

Try:

import pandas as pd
from bs4 import BeautifulSoup

html_text = """\
<html>
<body>
<div class="workarea">construction
</div>
  <div class="name">Anna
  </div>
  <div class="Muell">w23f84md2o
  </div>
  <div class="time">8:23
  </div>
    <div class="name">Tom
  </div>
  <div class="Muell">w23f84md2o
  </div>
  <div class="time">10:20
  </div>
 <div class="workarea">cleaning
</div>
  <div class="name">Max
  </div>
  <div class="Muell">w23f84md2o
  </div>
  <div class="time">9:30
  </div>
</body>
</html>"""

soup = BeautifulSoup(html_text, "html.parser")

data = []
for name in soup.select(".name"):
    workarea = name.find_previous(class_="workarea")
    data.append(
        {
            "workarea": workarea.text.strip(),
            "name": name.text.strip(),
            "time": name.find_next(class_="time").text.strip(),
        }
    )

df = pd.DataFrame(data)
print(df)

Prints:

       workarea  name   time
0  construction  Anna   8:23
1  construction   Tom  10:20
2      cleaning   Max   9:30
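To the asker's second question: yes, find_all can match several classes at once and keeps document (reading) order, because class_ accepts a list. A single pass with a small state machine then works too; here is a sketch against the same sample markup:

```python
import pandas as pd
from bs4 import BeautifulSoup

html_text = """<html><body>
<div class="workarea">construction</div>
<div class="name">Anna</div>
<div class="time">8:23</div>
<div class="name">Tom</div>
<div class="time">10:20</div>
<div class="workarea">cleaning</div>
<div class="name">Max</div>
<div class="time">9:30</div>
</body></html>"""

soup = BeautifulSoup(html_text, "html.parser")

rows = []
current_area = None
row = {}
# class_ accepts a list: matches divs having ANY of these classes, in document order
for div in soup.find_all("div", class_=["workarea", "name", "time"]):
    cls = div["class"][0]  # assumes each div carries exactly one class
    if cls == "workarea":
        current_area = div.text.strip()   # remember area until the next one appears
    elif cls == "name":
        row = {"workarea": current_area, "name": div.text.strip()}
    elif cls == "time":
        row["time"] = div.text.strip()
        rows.append(row)                  # a time closes out the current worker's row

df = pd.DataFrame(rows)
print(df)
```

Note this pass assumes every name is eventually followed by a time; the find_previous/find_next approach above degrades more gracefully if a field is missing.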
  • Thanks for the quick answer. How does soup.select() work? In my real doc the div class is not name, it's "text-bold ellipsis pl-0", so I can't get it to work.
    – tailor
    Commented Feb 22, 2024 at 23:41
  • 1
    @tailor .select() accepts CSS selectors, so you can do soup.select(".text-bold.ellipsis.pl-0"), or soup.find_all(class_="text-bold ellipsis pl-0") using the bs4 API. Commented Feb 22, 2024 at 23:43
  • 1
    oh thanks perfect! the find_all solution worked!
    – tailor
    Commented Feb 23, 2024 at 0:01
  • 1
    thank you so so much!!! I figured out all the other problems, and now I went from thousands of lines of HTML to a dataframe that I can use in R or Python, or even copy to Google Docs! I used to copy this data every week by typing it all out into a spreadsheet, which I found so annoying and such a waste of time, and I knew there had to be a solution. Now I just copy the HTML source, run Python, and get my results. Ideally the third-party company should offer a download button, but I guess this is the next best thing :-)
    – tailor
    Commented Feb 23, 2024 at 0:41
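
The multi-class lookup from the comments can be sketched like this (using a hypothetical one-line document mimicking the asker's real class names):

```python
from bs4 import BeautifulSoup

# hypothetical markup where one div carries several classes
html_text = '<div class="text-bold ellipsis pl-0">Anna</div>'
soup = BeautifulSoup(html_text, "html.parser")

# CSS selector: chain the classes with dots (div must have all three)
via_select = soup.select(".text-bold.ellipsis.pl-0")

# bs4 API: the exact class-attribute string also matches
via_find_all = soup.find_all(class_="text-bold ellipsis pl-0")

print(via_select[0].text, via_find_all[0].text)
```

One caveat: the exact-string form of find_all is order-sensitive (class="ellipsis text-bold pl-0" would not match), while the CSS selector is not.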
