I am trying to scrape a web page whose data is embedded in JavaScript. After reading some related posts, I managed to write the following:
from bs4 import BeautifulSoup
import requests
page_html = requests.get('https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm').text
soup = BeautifulSoup(page_html, 'lxml')
print(soup.prettify())
and then retrieve the relevant script element with:
soup.find_all('script')[3]
which gives:
<script type="text/javascript">
// Initialize script parameters.
var exportTitle ="Centralised medicinal products for human use by ATC code";
// Initialise the dataset.
var dataSet = [
{"id":"A","parent":"#","text":"A - Alimentary tract and metabolism"},
{"id":"A02","parent":"A","text":"A02 - Drugs for acid related disorders"},
{"id":"A02B","parent":"A02","text":"A02B - Drugs for treatment of peptic ulcer"},
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
...
{"id":"h154","parent":"V09IA05","text":"NeoSpect (withdrawn)","type":"pl"},
{"id":"V09IA09","parent":"V09IA","text":"V09IA09 - technetium (<sup>99m</sup>Tc) tilmanocept"},
{"id":"h955","parent":"V09IA09","text":"Lymphoseek (active)","type":"pl"},
{"id":"V09IB","parent":"V09I","text":"V09IB - Indium (<sup>111</sup>In) compounds"},
{"id":"V09IB03","parent":"V09IB","text":"V09IB03 - indium (<sup>111</sup>In) antiovariumcarcinoma antibody"},{"id":"h025","parent":"V09IB03","text":"Indimacis 125 (withdrawn)","type":"pl"},
...
]; </script>
The problem I am facing is getting the text out of soup.find_all('script')[3] so that I can recover the JSON data from it. When I use .text on that element, the result is an empty string: ''.
So my question is: why is that, and how can I get at the script contents? Ideally I would like to end up with each product's parent ATC code followed by its name:
A02BC01 Losec and associated names (referral)
...
V09IA05 NeoSpect (withdrawn)
V09IA09 Lymphoseek (active)
V09IB03 Indimacis 125 (withdrawn)
...
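
To make the goal concrete, here is a minimal sketch of the post-processing I have in mind, assuming the script body can be obtained as a plain string (for example via soup.find_all('script')[3].string rather than .text, which I have not verified on this page). The function names parse_dataset and product_lines are hypothetical, and the trailing-comma handling is an assumption about how the array ends on the real page:

```python
import json
import re


def parse_dataset(script_source: str) -> list:
    """Extract the array assigned to `var dataSet = [...]` and parse it as JSON."""
    # Capture everything from the opening '[' to the first ']' that is followed by ';'.
    match = re.search(r"var\s+dataSet\s*=\s*(\[.*?\])\s*;", script_source, re.DOTALL)
    if match is None:
        return []
    # Assumption: the array may end with a trailing comma, which JSON does not allow.
    payload = re.sub(r",\s*\]\s*$", "]", match.group(1))
    return json.loads(payload)


def product_lines(dataset: list) -> list:
    """Return '<parent ATC code> <product name>' for entries marked as products."""
    return [
        f'{item["parent"]} {item["text"]}'
        for item in dataset
        if item.get("type") == "pl"  # only product entries carry "type": "pl"
    ]
```

With the script text in a variable such as script_source, calling product_lines(parse_dataset(script_source)) should give one line per product entry, in the order the entries appear in the array.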