I'm starting to learn Python and I've written the following Python code (some of it omitted) and it works fine, but I'd like to understand it better. So I do the following:
html_doc = requests.get('[url here]')
Followed by:
if html_doc.status_code == 200:
    soup = BeautifulSoup(html_doc.text, 'html.parser')
    line = soup.find('a', class_="some_class")
    value = re.search('[regex]', str(line))
    print(value.group(0))
My questions are:
- What does html_doc.text really do? I understand that it makes "text" (a string?) out of html_doc, but why isn't it text already? What is it? Bytes? Maybe a stupid question, but why doesn't requests.get create a really long string containing the HTML code?
- The only way that I could get the result of re.search was by value.group(0), but I have literally no idea what this does. Why can't I just look at value directly? I'm passing it a string, there's only one match, why is the resulting value not a string?
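On the first question, here is a minimal sketch of the bytes-versus-text distinction that needs no network access. In requests, the response body arrives as raw bytes (available as html_doc.content), and html_doc.text is that same body decoded into a str using the encoding requests infers for the response. The sample byte string and the choice of UTF-8 below are made-up stand-ins for a real response body, not part of the original code.

```python
# Stand-in for the raw body of an HTTP response: bytes, not str.
# (Hypothetical sample data; a real body would come from html_doc.content.)
raw_body = '<a class="some_class">café 42</a>'.encode("utf-8")

# Decoding turns the bytes into text. This is roughly what .text does,
# except that requests picks the encoding for you.
decoded = raw_body.decode("utf-8")

print(type(raw_body).__name__)  # bytes
print(type(decoded).__name__)   # str

# The same bytes decoded with a different encoding give different
# (garbled) text, so bytes alone are not yet a string.
wrong = raw_body.decode("latin-1")
print(decoded == wrong)         # False
```

So requests.get cannot hand back one long string directly: what comes over the wire is bytes, and an encoding has to be applied before it becomes text.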
With the re module and its search method, the return value is not a plain string but a Match object (or None if nothing matched). Calling group(0) on that object returns the entire matched text; group(1), group(2) and so on return the captured subgroups, if the pattern defines any.
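To make the Match object concrete, here is a small sketch using only the standard library. The sample string and the pattern are invented for illustration; they stand in for str(line) and the [regex] that the question leaves out.

```python
import re

# Hypothetical stand-in for str(line) from the question.
text = '<a class="some_class" href="/item/1234">Item</a>'

# re.search returns a Match object, or None when nothing matches.
match = re.search(r'/item/(\d+)', text)

if match is not None:
    print(match.group(0))  # whole match: /item/1234
    print(match.group(1))  # first captured group: 1234
    print(match.span())    # (start, end) indices of the whole match

# A pattern that does not occur yields None, so calling .group() on it
# would raise AttributeError; check before using the result.
no_match = re.search(r'/order/(\d+)', text)
print(no_match)  # None
```

Besides the matched text, the Match object carries positions and captured groups, which is why re.search does not return a bare string.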