
I am attempting to scrape the NSE website for a particular company in Python, using the requests library and its corresponding get() method. This should return the page's HTML, which I can then process further with Beautiful Soup 4.

import requests

url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'

response = requests.get(url)
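Since a GET with no timeout can block indefinitely, a timeout guard is worth adding while experimenting; this guard is my own addition, and the 10-second value is arbitrary:

import requests

url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'

try:
    # Fail fast instead of hanging forever; 10 s is an arbitrary choice.
    response = requests.get(url, timeout=10)
except requests.exceptions.Timeout:
    print('request timed out')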

However, the plain call does not return a response even after a few minutes, and instead it made my system crash (the guarded variant would simply raise Timeout). On a suggestion from the question Python requests GET takes a long time to respond to some requests,

The server might only allow specific user-agent strings

I added a User-Agent string to the headers of the request as follows,

url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
}

response = requests.get(url, headers=headers)

This did not remedy the issue either, making it clear that this is unlikely to be a performance problem. Instead, I tried a different approach mentioned in the aforementioned question,

IPv6 does not work, but IPv4 does

This was, in fact, the case: when I ran curl in IPv6 mode, I got an error.

$ curl --ipv6 -v 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'          
* Could not resolve host: www.nseindia.com
* Closing connection
curl: (6) Could not resolve host: www.nseindia.com
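Since IPv6 name resolution fails here, one known workaround (an assumption on my part, not from the linked question) is to pin requests to IPv4 by monkey-patching urllib3, the library requests uses underneath:

import socket
import urllib3.util.connection
import requests

# Force urllib3 (and therefore requests) to resolve hosts over IPv4 only,
# mirroring curl's --ipv4 flag.
urllib3.util.connection.allowed_gai_family = lambda: socket.AF_INET

response = requests.get('https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS', timeout=10)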

But IPv4 mode did not fare much better in this case either.

$ curl --ipv4 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS' # -v removed to focus on the error
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)

Clearly, the website also does not support the HTTP/2 protocol here, and the request finally ran when the mode was set to HTTP/1.1.

$ curl --ipv4 -v 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS' --http1.1
* Host www.nseindia.com:443 was resolved.
* IPv6: (none)
* IPv4: 184.29.25.143
*   Trying 184.29.25.143:443...
* Connected to www.nseindia.com (184.29.25.143) port 443
* ALPN: curl offers http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: C=IN; ST=Maharashtra; L=Mumbai; O=National Stock Exchange of India Ltd.; CN=www.nseindia.com
*  start date: May 28 00:00:00 2024 GMT
*  expire date: May 22 23:59:59 2025 GMT
*  subjectAltName: host "www.nseindia.com" matched cert's "www.nseindia.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=GeoTrust RSA CA 2018
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha1WithRSAEncryption
* using HTTP/1.x
> GET /get-quotes/equity?symbol=20MICRONS HTTP/1.1
> Host: www.nseindia.com
> User-Agent: curl/8.8.0
> Accept: */*
> 
* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing

However, the cURL request still never completes. I then looked at questions concerning the NSE India website itself, and found the question Python Requests get returns response code 401 for nse india website,

To access the NSE (api's) site multiple times then set cookies in each subsequent requests

essentially recommending adding cookies to the request.

import requests

baseurl = 'https://www.nseindia.com/'
url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'accept-language': 'en,gu;q=0.9,hi;q=0.8',
    'accept-encoding': 'gzip, deflate, br'
}

session = requests.Session()

# Hit the home page first to pick up the cookies NSE sets there.
request = session.get(baseurl, headers=headers, timeout=5)
cookies = dict(request.cookies)

# Reuse those cookies on the actual quote page.
response = session.get(url, headers=headers, timeout=5, cookies=cookies)

This still hit the same issue: the request simply never completes. That question was asked and answered for the NSE India API rather than for the website itself, so it makes sense that its fix does not carry over. (A requests.Session also already persists cookies between calls, so passing cookies= explicitly is redundant.) I also checked by adding the Accept header, according to @GTK's comment,

you need the accept header [...]

import requests

url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "Accept": "text/html"
}

response = requests.get(url, headers=headers)

Sadly, this does not affect the problem, perhaps because requests already sets the Accept header to */* by default, which allows any MIME type.
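This can be verified by inspecting what a fresh session sends by default (the exact values vary with the installed requests version, so the output below is only illustrative):

import requests

# The default headers attached to every request made through this session.
print(requests.Session().headers)
# e.g. {'User-Agent': 'python-requests/2.32.3', 'Accept-Encoding': 'gzip, deflate',
#       'Accept': '*/*', 'Connection': 'keep-alive'}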

How can I proceed in this situation, and why is this occurring? Is the request truly that slow, or is there some error occurring?

  • you need the accept header, and the user-agent.
    – GTK
    Commented Jun 26, 2024 at 20:54
  • @GTK Do requests not have the accept header including text/html already?
    – Shirsak
    Commented Jun 26, 2024 at 21:27
  • it's your user agent (linux), try this user agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
    – GTK
    Commented Jun 26, 2024 at 21:49
  • maybe first use DevTools in Chrome/Firefox to see what headers a real web browser sends - and start with all the headers from the browser. Later you can check which headers you can skip. It may also be necessary to behave like a real user and first visit the main page to get cookies - and then use those cookies on the next pages. But I think there can be another problem - this page uses JavaScript to add some elements to the HTML - so if you plan to get data from the HTML then requests can be useless, and you may need Selenium to control a real web browser that can run JavaScript.
    – furas
    Commented Jun 26, 2024 at 23:26
  • BTW: when I use DevTools (Network tab) to see the request, it shows me that this server uses HTTP/2. As far as I know, requests doesn't support HTTP/2 - it would need httpx instead (see the sketch after these comments). But the real problem is still JavaScript, and even httpx can't resolve that.
    – furas
    Commented Jun 26, 2024 at 23:32
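As a follow-up to furas's httpx suggestion, a minimal HTTP/2 probe could look like the sketch below (assuming httpx is installed with its HTTP/2 extra, i.e. pip install 'httpx[http2]'; untested against this endpoint):

import httpx

url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
headers = {
    # The Windows Chrome user agent suggested by GTK in the comments.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
}

# http2=True lets httpx negotiate HTTP/2 via ALPN, which requests cannot do.
with httpx.Client(http2=True, headers=headers, timeout=10) as client:
    response = client.get(url)
    print(response.http_version, response.status_code)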

1 Answer


I was able to get a response from the server with a different User-Agent header (the server most probably blacklists some specific user agents):

import requests

url = "https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
}

response = requests.get(url, headers=headers)

print(response.text)

Prints:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" />
<title>
    20 Microns Limited Share Price Today, Stock Price, Live NSE News, Quotes, Tips – NSE India
</title>
...
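From here the HTML can be handed to Beautiful Soup, as the question intended - for example, pulling out the page title (a minimal sketch, assuming beautifulsoup4 is installed):

from bs4 import BeautifulSoup

# Parse the HTML fetched above and print the <title> text.
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.get_text(strip=True))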
  • This User-Agent header worked, but how could I have known that this was the required header value?
    – Shirsak
    Commented Jun 27, 2024 at 16:00
  • @Shirsak There's no documentation - it's trial and error. Commented Jun 27, 2024 at 17:33
