I'm very new to Python and am trying to scrape a table from a website, but the table data seems to come from a JavaScript variable that is populated with JSON.parse. The argument to JSON.parse is not in a form I am used to, and I am unsure how to use it in Python.

The code is from this website; specifically, it is var playersData = JSON.parse('\x5B\x7B\x22id\x3A,... (roughly 250,000 characters), nestled in a script tag.

So far I have managed to fetch the page, parse it with bs4, find the specific script, and use re.search to locate the JSON.parse call. The search returns <re.Match object; span=(2, 259126), match="var playersData\t= JSON.parse('\\x5B\\x7B\\x22id\>.

Once the JSON is loaded, I would like to export the data somewhere else.

Here is my code so far:

import requests
from bs4 import BeautifulSoup
import json
import re

response = requests.get('https://understat.com/league/EPL/2018')
soup = BeautifulSoup(response.text, 'lxml')

playerscript = soup.find_all('script')[3].string
m = re.search("var playersData  = (.*)", playerscript)

Thanks for any help.

  • Did you have a question? Commented Nov 7, 2018 at 17:06
  • Yes, mainly how do I use the javascript variable with the JSON.parse in python to get the table data from the website? Commented Nov 7, 2018 at 17:29

1 Answer

You don't need BeautifulSoup for this. Python's json.loads is the equivalent of JavaScript's JSON.parse, but the captured string first needs its \x escapes decoded: use .decode('string_escape') in Python 2, or bytes('....', 'utf-8').decode('unicode_escape') in Python 3.

import requests
import json
import re

response = requests.get('https://understat.com/league/EPL/2018')
# Raw string so \s and \( reach the regex engine unchanged;
# the group captures everything up to the closing single quote.
playersData = re.search(r"playersData\s+=\s+JSON\.parse\('([^']+)", response.text)
# Python 2.7:
# decoded_string = playersData.groups()[0].decode('string_escape')
# Python 3: turn the literal \xNN escapes into real characters
decoded_string = bytes(playersData.groups()[0], 'utf-8').decode('unicode_escape')
playerObj = json.loads(decoded_string)

print(playerObj[0]['player_name'])
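
Since the question also asks about exporting the data, here is a minimal offline sketch of the decode step followed by a CSV export. The escaped sample string, its field values, and the to_csv helper are made up for illustration; the real payload is much longer and its keys may differ. It also assumes each record is a flat dict.

```python
import csv
import io
import json

# Hypothetical short sample imitating the escaped payload (not real site data).
escaped = r'\x5B\x7B\x22id\x22\x3A\x221\x22,\x22player_name\x22\x3A\x22Sample Player\x22,\x22goals\x22\x3A\x225\x22\x7D\x5D'

# Turn the literal \xNN sequences into the characters they stand for,
# then parse the resulting JSON text.
decoded = escaped.encode('utf-8').decode('unicode_escape')
players = json.loads(decoded)


def to_csv(rows):
    """Serialise a list of flat dicts to CSV text (hypothetical helper).

    The header is taken from the keys of the first row.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(players)
print(csv_text)
```

To write a file instead of a string, the same DictWriter can be pointed at open('players.csv', 'w', newline=''). One caveat: unicode_escape treats the bytes as Latin-1, so if the captured text contained raw non-ASCII characters (rather than escapes) they could come out mangled.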

1 Comment

I have tried this code, but I receive the error: AttributeError: 'str' object has no attribute 'decode'. If I remove the .decode part, I instead get: raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
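
Both errors are consistent with running the commented-out Python 2 line on Python 3, although that is a guess from the messages alone. A Python 3 str has no .decode method, and the still-escaped text starts with a backslash, which json.loads rejects at character 0. The short sample string below is made up for illustration.

```python
import json

# Hypothetical short sample of the escaped text captured by the regex.
raw = r'\x5B\x7B\x22id\x22\x3A\x221\x22\x7D\x5D'

# In Python 3, .decode() exists on bytes, not on str.
has_decode = hasattr(raw, 'decode')

# Without unescaping, the first character is a backslash, which is not valid JSON.
try:
    json.loads(raw)
    failed = False
except json.JSONDecodeError:
    failed = True

# Encoding to bytes first makes .decode('unicode_escape') available.
data = json.loads(raw.encode('utf-8').decode('unicode_escape'))
print(has_decode, failed, data)
```

So on Python 3 the bytes(...).decode('unicode_escape') line from the answer is the one to keep, with the Python 2 line left commented out.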
