Long-time user, but I have never had to ask my own question.
I want to use Python to parse a table from an HTML document into a pandas DataFrame. The table is NOT an HTML table; I think it is JavaScript-generated HTML that just uses a bunch of divs with weirdly named classes to create the format and layout.
The data is workers and their working hours, grouped by work area. The problem is that the divs are not nested, so I can't easily assign each worker their work area. I am using BeautifulSoup.
Here is a simplified sample:
<html>
<body>
<div class="workarea">construction
</div>
<div class="name">Anna
</div>
<div class="Muell">w23f84md2o
</div>
<div class="time">8:23
</div>
<div class="name">Tom
</div>
<div class="Muell">w23f84md2o
</div>
<div class="time">10:20
</div>
<div class="workarea">cleaning
</div>
<div class="name">Max
</div>
<div class="Muell">w23f84md2o
</div>
<div class="time">9:30
</div>
</body>
</html>
Here is what I want as the data frame:
WORKAREA | NAME | TIME |
---|---|---|
construction | Anna | 8:23 |
construction | Tom | 10:20 |
cleaning | Max | 9:30 |
Note: the real data has thousands of divs for formatting and layout, which is why I wanted to use a proper parser rather than read the document line by line into Python and parse it myself.
I did not get very far:
### read html with bs4
from bs4 import BeautifulSoup

with open("testdoc1.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# class is a reserved word in Python, so bs4 takes the keyword class_
wa = soup.find_all("div", class_="workarea")
## here I wanted to add a for loop through wa, but wa doesn't actually contain the info
I can't loop through wa to get the details, because it only contains the construction and the cleaning divs, nothing in between.
Is there a way to parse the document div by div, the way one would parse a file line by line?
Can I make find_all find all divs whose class is workarea, time, or name, and keep them in reading order?
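To make the idea concrete, here is a minimal sketch of what I have in mind. It relies on two things I believe are true of BeautifulSoup: passing a list to class_ matches any of the listed classes, and find_all returns matches in document order. It also assumes that every time div comes after the name div it belongs to, and that every worker comes after their workarea div. The names rows, current_area and current_name are just placeholders I made up, and the HTML is the sample from above, inlined so the snippet runs on its own.

```python
from bs4 import BeautifulSoup
import pandas as pd

# The simplified sample from above, inlined instead of read from testdoc1.html.
html = """
<html><body>
<div class="workarea">construction
</div>
<div class="name">Anna
</div>
<div class="Muell">w23f84md2o
</div>
<div class="time">8:23
</div>
<div class="name">Tom
</div>
<div class="Muell">w23f84md2o
</div>
<div class="time">10:20
</div>
<div class="workarea">cleaning
</div>
<div class="name">Max
</div>
<div class="Muell">w23f84md2o
</div>
<div class="time">9:30
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
current_area = None  # most recent workarea seen so far
current_name = None  # most recent name seen so far

# A list passed to class_ matches divs carrying any of these classes;
# the "Muell" divs and other layout divs are skipped.
for div in soup.find_all("div", class_=["workarea", "name", "time"]):
    classes = div.get("class", [])
    text = div.get_text(strip=True)
    if "workarea" in classes:
        current_area = text
    elif "name" in classes:
        current_name = text
    elif "time" in classes:
        # Assumption: a time div closes the record started by the last name div.
        rows.append({"WORKAREA": current_area, "NAME": current_name, "TIME": text})

df = pd.DataFrame(rows, columns=["WORKAREA", "NAME", "TIME"])
print(df)
```

The loop just carries the last seen work area and name forward, so it never needs the divs to be nested. If a name in the real data has no time div (or the order differs), this would silently pair the wrong values, so that would need an extra check.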
I know there are already a lot of Stack Overflow questions about BeautifulSoup and parsing HTML documents, but I couldn't find one that fits, since the original table is not really a single table and, sadly, the HTML does not preserve any hierarchy.
Thank you so much for your help! Any hints are much appreciated!