Creating a scraper using multithreading

Question

I've written a script in Python using the "threading" module to scrape two sites simultaneously. It parses both sites without errors. Any insight into how I can improve this script would be appreciated.

Here is what I did:

import requests ; from lxml import html
import threading ; import time

Yp_link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page=2"
Tuts_link = "http://www.wiseowl.co.uk/videos/"

def create_links(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        street = title.cssselect("span.street-address")[0].text
        phone = title.cssselect("div[itemprop=telephone]")[0].text if title.cssselect("div[itemprop=telephone]") else ""
        time.sleep(1)
        print(name, street, phone)

def process_links(link):
    response = requests.get(link).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
        title = titles.xpath('.//a')[0]
        time.sleep(1)
        print(title.text, title.attrib['href'])

th1 = threading.Thread(target=create_links, args=(Yp_link,))
th2 = threading.Thread(target=process_links, args=(Tuts_link,))

th1.start()
th2.start()

th1.join()
th2.join()

Do you want to be polite? (You are likely to get blacklisted if you're not.) Also, since this counts as a robot, did you check the robots.txt file to make sure you are not scraping pages that are not meant for robots? Ignoring it usually results in scraping randomly generated junk pages designed to trap bad robots.
– Loki Astari, Commented Sep 7, 2017 at 17:30
Politeness: stackoverflow.com/q/8236046/14065. I suppose a sleep of 1 second counts as being polite.
– Loki Astari, Commented Sep 7, 2017 at 17:31
@Loki Astari, I'm dreadfully sorry that I could not respond to your comment right away because the internet was down on my end. As I am not from a programming background, I sometimes find it hard to understand things easily. As far as I knew, "robotstxt" was only available in scrapy. After seeing your comment, though, I'm even more confused about where to find it. By the way, I'm going to follow the link you provided. Thanks.
– MITHU, Commented Sep 8, 2017 at 7:32
The robots.txt file (if it exists) is supposed to be at the root of a site: https://www.yellowpages.com/robots.txt and http://www.wiseowl.co.uk/robots.txt
– Loki Astari, Commented Sep 8, 2017 at 15:52
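
As an illustration of the robots.txt check suggested in the comments, here is a minimal sketch using the standard library's urllib.robotparser. The sample rules, the "my-scraper" user agent, and the example.com URLs are made up for the demonstration; they are not the actual contents of either site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("my-scraper", "http://example.com/videos/")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/page")
delay = parser.crawl_delay("my-scraper")  # None when no Crawl-delay is declared
print(allowed, blocked, delay)
```

Against a live site, one would presumably call parser.set_url() with the site's robots.txt address and then parser.read() instead of parsing a string.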

alecxe · Accepted Answer · 2017-09-07 21:57:14Z

1

First of all, I think you are putting the time.sleep() calls in the wrong places: inside the loops where you iterate over the extracted elements. At that point the elements are already extracted and no requests are being issued. Add the delays between requests instead, at the end of your functions.
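
A rough sketch of that placement, assuming the download and the extraction are passed in as callables so the example runs offline. The names scrape_page, fetch and parse are hypothetical; in the original script they would correspond to requests.get(url).text and the lxml loop.

```python
import time


def scrape_page(url, fetch, parse, delay=1.0):
    """Fetch one page, collect its records, then pause once.

    `fetch` and `parse` are hypothetical stand-ins for the HTTP request
    and for the loop over the extracted elements.
    """
    page_source = fetch(url)            # the only step that hits the network
    records = list(parse(page_source))  # in-memory work: no sleep in here
    time.sleep(delay)                   # single pause after the request is done
    return records


# Demonstration with stand-ins, so nothing is actually downloaded.
records = scrape_page(
    "http://www.wiseowl.co.uk/videos/",
    fetch=lambda url: "first title\nsecond title",
    parse=str.splitlines,
    delay=0,
)
print(records)
```

With the sleep outside the loop, the pause is paid once per request rather than once per extracted element.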

I would also improve the naming: Yp_link and Tuts_link can be renamed to the more explicit YELLOW_PAGES_URL and WISEOWL_URL. Note that these two should be defined as proper constants, in upper case.
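
One way the renamed constants might look when wired into the threads. The worker names scrape_yellow_pages and scrape_wiseowl are hypothetical replacements for create_links and process_links, and here they only record the URL they receive instead of downloading anything.

```python
import threading

# Module-level constants, written in upper case.
YELLOW_PAGES_URL = (
    "https://www.yellowpages.com/search"
    "?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page=2"
)
WISEOWL_URL = "http://www.wiseowl.co.uk/videos/"

visited = []  # filled in by the stand-in workers below


def scrape_yellow_pages(url):
    # Stand-in for create_links(): records the call instead of fetching.
    visited.append(("yellow_pages", url))


def scrape_wiseowl(url):
    # Stand-in for process_links(): records the call instead of fetching.
    visited.append(("wiseowl", url))


threads = [
    threading.Thread(target=scrape_yellow_pages, args=(YELLOW_PAGES_URL,)),
    threading.Thread(target=scrape_wiseowl, args=(WISEOWL_URL,)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(sorted(name for name, _ in visited))
```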

I would also switch to CSS selectors in the process_links() function, as create_links() already does.

As far as imports go, don't put them on the same line; put each import on its own line, as per the PEP8 import guidelines.


Thanks, alecxe, for the review. If I put time.sleep() just before the loop starts, will that be the right way to follow your suggestion?
– MITHU, Commented Sep 8, 2017 at 7:35
@Mithu right, just move it out of the loop, please.
– alecxe, Commented Sep 8, 2017 at 11:44
