
Solution: The action for this particular site is action="user/ajax/login", so that path is what has to be appended to the site's base URL to build the login endpoint for the payload. (The action can be found by searching the page source with Ctrl+F for "action".) The url is the page that is going to be scraped. The with requests.Session() as s: block is what maintains the cookies from within the site, which is what allows consistent scraping. The res variable is the response from posting the payload to the login URL, which logs the session in and allows the user to scrape from a specific account page. After the POST, requests will then get the specified url with the same session. With this in place, BeautifulSoup can now grab and parse the HTML from within the account's page. "html.parser" and "lxml" are both compatible in this case. If there is HTML from within an iframe, it's doubtful it can be grabbed and parsed using only requests, so I recommend using Selenium, preferably with Firefox.

import requests
from bs4 import BeautifulSoup

payload = {"username": "?????", "password": "?????"}
url = "https://9anime.to/user/watchlist"        # the page to scrape
loginurl = "https://9anime.to/user/ajax/login"  # base URL + form action

with requests.Session() as s:
    res = s.post(loginurl, data=payload)  # log in; the session keeps the cookies
    res = s.get(url)                      # fetch the protected page

soup = BeautifulSoup(res.text, "html.parser")
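As a quick illustration of the parsing step, here is the same BeautifulSoup call run on a made-up HTML snippet standing in for res.text (the class name and link are invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for res.text; the real page comes from the session above
html = '<div class="watchlist"><a href="/watch/1">Show One</a></div>'

soup = BeautifulSoup(html, "html.parser")  # "lxml" works the same way here
link = soup.find("a")                      # first matching tag, or None
print(link.text)     # Show One
print(link["href"])  # /watch/1
```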

[Windows 10] To install Selenium: pip3 install selenium, and for the drivers - (Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads) (Firefox: https://github.com/mozilla/geckodriver/releases). How to place "geckodriver" into PATH for Firefox Selenium: Control Panel > "Environment Variables" > "Path" > "New" > enter the folder location of "geckodriver" > Enter. Then you're all set. Also, in order to grab the iframes when using Selenium, try import time and time.sleep(5) after 'getting' the URL with your driver. This will give the site more time to load those extra iframes. Example:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()  # The WebDriver for this script
driver.get("https://www.google.com/")
time.sleep(5)  # Extra time for the iframe(s) to load
soup = BeautifulSoup(driver.page_source, "lxml")

print(soup.prettify())  # To see full HTML content
print(soup.find_all("iframe"))  # Finds all iframes

print(soup.find("iframe")["src"])  # If you need the 'src' from within an iframe
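One caveat with the line above: find() returns None when the page has no iframe, and subscripting None raises a TypeError. A small guarded sketch (the HTML and the player URL below are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page with one embedded player iframe
html = '<body><iframe src="https://player.example/embed/1"></iframe></body>'
soup = BeautifulSoup(html, "html.parser")

iframe = soup.find("iframe")
# Guard against None before indexing, otherwise a missing iframe crashes
src = iframe["src"] if iframe is not None else None
print(src)  # https://player.example/embed/1
```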
  • Try to attach the cookies of a logged-in user? Also, it seems quite unlikely that the website would want the username and password in the clear in a GET request, so maybe you put your credentials in the wrong place. Commented May 24, 2020 at 20:44
  • The site is probably not accepting the username and password as a data payload for login via GET. Commented May 24, 2020 at 20:45
  • @lurker How would I go about finding the required credentials, e.g. (username, password, csrf token, etc.), for the payload? Commented May 24, 2020 at 21:35
  • You're misunderstanding my comment. The website probably doesn't accept any credentials via web access with GET unless they have defined a web API. You would have to contact the website owners for information. Commented May 24, 2020 at 23:38

3 Answers


You're trying to make a GET request to a URL that requires being logged in, so it produces a 403 ("Forbidden") error: the request is not authenticated to view the content.

If you think about it in terms of the URL you're constructing in your GET request, you would literally be exposing the username (x) and password (y) within the URL, like so:

https://9anime.to/user/watchlist?username=x&password=y

... which would of course be a security risk.

Without knowing what specific access you have to this particular site, in principle, you need to simulate authentication with a POST request first and then perform the GET request on that page afterwards. A successful response would return a 200 status code ('OK') and then you would be in a position to use BeautifulSoup to parse the content and target your desired part of that content from between the relevant HTML tags.
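One way to structure that POST-then-GET flow is sketched below. The function name and shape are mine, not the site's API; in real use you would pass a requests.Session() as `session`, and raise_for_status() turns a 403 into an explicit exception instead of silently parsing an error page:

```python
def fetch_after_login(session, login_url, payload, target_url):
    """POST the credentials first, then GET the protected page with the
    same session object so the login cookie is reused (sketch only)."""
    login_res = session.post(login_url, data=payload)
    login_res.raise_for_status()   # bail out if the login POST itself failed
    page_res = session.get(target_url)
    page_res.raise_for_status()    # a 403 here means the login didn't stick
    return page_res.text           # HTML ready for BeautifulSoup
```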




I suggest, to start, opening the login page address and logging in there yourself. Then add an

input('Enter something')

to pause the script while you connect. (You must hit the ENTER key in the terminal to continue the process once connected, and voilà.)



Solved: The action attribute was user/ajax/login in this case. So by appending that to the site's main URL - not https://9anime.to/user/watchlist, but https://9anime.to - you get https://9anime.to/user/ajax/login, and this gives you the login URL.

import requests
from bs4 import BeautifulSoup as bs

url = "https://9anime.to/user/watchlist"
loginurl = "https://9anime.to/user/ajax/login"
payload = {"username": "?????", "password": "?????"}

with requests.Session() as s:
    res = s.post(loginurl, data=payload)  # log in first
    res = s.get(url)                      # then fetch the watchlist

soup = bs(res.text, "html.parser")  # parse the logged-in page
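The URL-appending step described above can also be done with the standard library's urljoin rather than by hand; a small sketch (note the trailing slash on the base, which urljoin needs in order to append rather than replace the last path segment):

```python
from urllib.parse import urljoin

base = "https://9anime.to"
action = "user/ajax/login"  # the value of the form's action attribute

# urljoin(base + "/", action) appends the action path to the base URL
login_url = urljoin(base + "/", action)
print(login_url)  # https://9anime.to/user/ajax/login
```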

