I've been exploring web scraping with Python and RSS feeds, but I can't work out how to restrict Google News search results to a particular year. Ideally, I'd like to retrieve headlines, publication dates, and possibly summaries for news articles from a specific year (such as 2020). The code below scrapes current news fine, but news from a specific year doesn't come back. Even in the Google News search box, the date filter only covers the past year. However, when I scroll down, I can see articles from 2013 and 2017. Could someone provide a Python script or pointers on how to solve this?

Here's what I've attempted so far:

import feedparser
import pandas as pd
from datetime import datetime

class GoogleNewsFeedScraper:
    def __init__(self, query):
        self.query = query

    def scrape_google_news_feed(self):
        formatted_query = '%20'.join(self.query.split())
        rss_url = f'https://news.google.com/rss/search?q={formatted_query}&hl=en-IN&gl=IN&ceid=IN%3Aen'
        feed = feedparser.parse(rss_url)
        titles = []
        links = []
        pubdates = []

        if feed.entries:
            for entry in feed.entries:
                # Title
                title = entry.title
                titles.append(title)
                # URL link
                link = entry.link
                links.append(link)
                # Date
                pubdate = entry.published
                date_str = str(pubdate)
                date_obj = datetime.strptime(date_str, "%a, %d %b %Y %H:%M:%S %Z")
                formatted_date = date_obj.strftime("%Y-%m-%d")
                pubdates.append(formatted_date)

        else:
            print("Nothing Found!")

        data = {'URL link': links, 'Title': titles, 'Date': pubdates}
        return data

    def convert_data_to_csv(self):
        d1 = self.scrape_google_news_feed()
        df = pd.DataFrame(d1)
        csv_name = self.query + ".csv"
        csv_name_new = csv_name.replace(" ", "_")
        df.to_csv(csv_name_new, index=False)


if __name__ == "__main__":
    query = 'forex rate news'
    scraper = GoogleNewsFeedScraper(query)
    scraper.convert_data_to_csv()

1 Answer

You can add date filters to your rss_url by modifying the query part to use the format below:

Format: q=query+after:yyyy-mm-dd+before:yyyy-mm-dd

Example: https://news.google.com/rss/search?q=forex%20rate%20news+after:2023-11-01+before:2023-12-01&hl=en-IN&gl=IN&ceid=IN:en

The URL above returns articles related to forex rate news that were published between November 1st, 2023, and December 1st, 2023.

Please refer to this article for more information.
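
To cover a whole year such as 2020, the date-filtered format above can be generated programmatically. This is only a minimal sketch: the helper names build_dated_rss_url and month_windows are made up for illustration, and splitting the year into monthly windows assumes that a single feed returns only a limited number of entries, so smaller windows should miss fewer articles. Whether the before: date is itself included in the results is not confirmed here, so adjacent windows may overlap by a day.

```python
from datetime import date
from urllib.parse import quote


def build_dated_rss_url(query, after, before, hl="en-IN", gl="IN", ceid="IN:en"):
    """Build a Google News RSS search URL restricted to a date window.

    Hypothetical helper: it only reproduces the q=query+after:...+before:...
    format from the answer, with spaces percent-encoded as %20.
    """
    # Append the after:/before: operators to the search terms.
    dated_query = f"{query} after:{after.isoformat()} before:{before.isoformat()}"
    # Keep ':' literal inside the query; encode it in ceid as the question's URL does.
    return (
        "https://news.google.com/rss/search"
        f"?q={quote(dated_query, safe=':')}"
        f"&hl={hl}&gl={gl}&ceid={quote(ceid, safe='')}"
    )


def month_windows(year):
    """Yield (start, end) date pairs, one per calendar month of the given year."""
    for month in range(1, 13):
        start = date(year, month, 1)
        # The window ends on the first day of the following month.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        yield start, end


if __name__ == "__main__":
    # Print one feed URL per month of 2020 for the query used in the question.
    for start, end in month_windows(2020):
        print(build_dated_rss_url("forex rate news", start, end))
```

Each printed URL can be passed to feedparser.parse() exactly as rss_url is in the question's scrape_google_news_feed method, and the per-month results concatenated before writing the CSV.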
