5
\$\begingroup\$

From my previous post, I tried to take into account all of the nice answers made by Kate, Chris, J_H, Booboo and mudskipper

Changes made

In short, I attempted to apply all of the following:

  1. API key handling
  2. Better URL construction
  3. Input cleaning improvements
  4. Category detection rewrite
  5. More functions
  6. Article formatting cleanup
  7. Added error handling
  8. Replaced hard-coded strings with constants
  9. PEP8 & Style corrections
  10. Added a __main__ entry point
  11. Simplified logic flow

API key changes

I noticed not long ago that NewsAPI prefers to keep all API keys for private use only. You may get yours here. I am in no way promoting them.

Code

Here is the updated Python code:


"""
This script fetches the most recent news
headline from the NewsAPI service based 
on a natural-language query given by the 
user (e.g., “latest technology news”, 
“what’s happening in sports today”, etc.).
"""
import os
import re
import requests
import json5

# Constants / configuration
API_URL = "https://newsapi.org/v2/top-headlines"
CATEGORIES = ("business", "entertainment", "general", "health", "science", "sports", "technology")
STOP_WORDS = [
    "news", "new", "for", "concerning", "can", "you", "give", "me",
    "the", "most", "recent", "in", "category", "currently", "current",
    "now", "what", "whats", "is", "about", "today"
]

def clean_query(text: str) -> str:
    """
    Clean the input text by removing stop words, then detect if
    any category is mentioned. Return query parameter as string.
    """
    text = text.lower()
    # remove stop words with word boundaries
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in STOP_WORDS) + r")\b", flags=re.IGNORECASE)
    text = pattern.sub("", text).strip()

    # detect categories
    for cat in CATEGORIES:
        if re.search(rf"\b{re.escape(cat)}\b", text, flags=re.IGNORECASE):
            return {"category": cat}

    # if nothing remains, default to general
    if not text:
        return {"category": "general"}

    # else treat as search query
    return {"q": text}

def fetch_top_headline(params: dict, api_key: str) -> dict:
    """
    Fetches the top headline from NewsAPI with the given params.
    Raises on HTTP error.
    """
    params = params.copy()
    params.update({
        "sortBy": "publishedAt",
        "pageSize": 1,
    })
    # Prefer putting API key in header for security
    headers = {"X-Api-Key": api_key}
    response = requests.get(API_URL, params=params, headers=headers)
    response.raise_for_status()
    return response.json()

def format_article(article: dict) -> str:
    """
    Given a single article dict, extract its fields and return a nice string.
    """
    source_name = article.get("source", {}).get("name", "")
    author = article.get("author")
    title = article.get("title", "")
    desc = article.get("description") or ""

    # clean up title
    title = title.replace(source_name, "").replace("\n", " ")
    title = title.replace(" - ", " ").replace("LISTEN | ", "")

    # clean up description (if HTML-like)
    if "<p>" in desc:
        desc = desc.split("<p>", 1)[1]
    # here you might want to use BeautifulSoup instead of manual replace
    desc = desc.replace("&quot;", '"')

    # author formatting
    if author and author.lower() != source_name.lower():
        # clean commas
        cleaned = author.replace(",", " and ")
        author_part = f" by {cleaned}"
    else:
        author_part = ""

    # source formatting
    source_part = f" on {source_name}" if source_name else ""

    return f"Here is the recently published news I found{source_part}{author_part}:\n" \
           f"The title is {title}...\n{desc}"

def news_concerning(text: str) -> str:
    """
    Main function that takes user text, queries NewsAPI, and returns a formatted result.
    """
    api_key = os.getenv("NEWS_API_KEY")
    if not api_key:
        raise RuntimeError("API key not found. Please set NEWS_API_KEY in your environment.")

    params = clean_query(text)
    data = fetch_top_headline(params, api_key)

    articles = data.get("articles", [])
    if not articles:
        # Nothing found
        query_term = params.get("q") or params.get("category")
        return f"I couldn’t find anything about {query_term}. Could you try a different query?"

    first = articles[0]
    try:
        return format_article(first)
    except Exception as e:
        # Fallback error message
        return "Sorry, I couldn’t format the article properly."

if __name__ == "__main__":
    user_input = input("What news would you like to hear about? ")
    print(news_concerning(user_input))

To run the script, enter the following: NEWS_API_KEY="your_api_key_here" python3 NewsAPI.py

Critique request

Please, tell me anything that comes to mind.

\$\endgroup\$

3 Answers 3

4
\$\begingroup\$

Reinventing the wheel

The following loop simply reinvents next.

    for cat in CATEGORIES:
        if re.search(rf"\b{re.escape(cat)}\b", text, flags=re.IGNORECASE):
            return {"category": cat}

Might be written:

    cat = next(
        (cat 
         for cat in CATEGORIES 
         if re.search(rf"\b{re.escape(cat)}\b", text, flags=re.I)),
        None
    )

    if cat:
        return {"category": cat}

Or:

    categories_in_text = (
        cat 
        for cat in CATEGORIES
        if re.search(rf"\b{re.escape(cat)}\b", text, flags=re.I)
    )

    if cat := next(categories_in_text, None):
        return {"category": cat}

Inconsistent chaining and DRY

In the following it's odd that you've chosen to chain the calls together the way you have.

    # clean up title
    title = title.replace(source_name, "").replace("\n", " ")
    title = title.replace(" - ", " ").replace("LISTEN | ", "")

Consider whether you really want to define a dictionary with strings to replace and what to replace them with. This will let you extend this logic with additional replacements with very little additional effort.

E.g.

    replacements = {
        source_name: "",
        "\n": "",
        " - ": " ",
        "LISTEN | ": ""
    }

    for orig, repl in replacements.items():
        title = title.replace(orig, repl)

Exception handling

    try:
        return format_article(first)
    except Exception as e:
        # Fallback error message
        return "Sorry, I couldn’t format the article properly."

This is very generic.

What exceptions can format_article actually raise?

Well, author could be None.

    author = article.get("author")

But you've checked for that before doing anything with it that might raise an exception.

    if author and author.lower() != source_name.lower():

Honestly, I can't figure out anything format_article itself can be doing to raise an exception.

As such, I think it would be more appropriate to simply let this unforeseen exception that's out of your control propagate out to the caller. Maybe in your calling code that's main guarded you choose to do something about this or maybe you just let the program exit with the exception, but generic Exception handlers should always raise an eyebrow.

\$\endgroup\$
5
  • 3
    \$\begingroup\$ Agreed about the exception. You should only catch exceptions if you actually can do something about fixing the problem, and only those specific exceptions you can fix. If you can't do anything about the problem, then don't catch the exception; instead, let it bubble up the stack – you never know if there's someone else who knows how to fix it! (Even if it is just the OS cleaning up the memory and resources.) \$\endgroup\$ Commented Nov 26 at 19:23
  • \$\begingroup\$ Your idea of using next() is really clean. Does it search for a specific category in order of the first found or in order of the ones listed? In other words, if I enter the query « sports business », which would be selected? \$\endgroup\$ Commented Nov 26 at 19:46
  • \$\begingroup\$ @JörgWMittag By the way, I actually do not remember why the exception is there, and can assure you that it was not because an error occurred without it. \$\endgroup\$ Commented Nov 26 at 19:49
  • \$\begingroup\$ @Chip01 because "business" occurs first in CATEGORIES it'd be the one returned. A slightly more sophisticated search might find how many times each keyword occurs in the text and return the one that occurs most often. Which one occurs first would be a decent tie-breaker. \$\endgroup\$ Commented Nov 26 at 19:57
  • 1
    \$\begingroup\$ To see the order, simply convert that generator expression to a list instead. \$\endgroup\$ Commented Nov 26 at 19:58
5
\$\begingroup\$

Type hints

I like the usage of type hints for your functions. However, the return hint looks wrong for:

def clean_query(text: str) -> str:

The function always seems to return a dict. For example:

return {"q": text}

Simpler

Since you only use the variable first in the news_concerning function once, you could eliminate it:

return format_article(articles[0])

Comments

This comment can be deleted since it looks like a note you made to yourself during code development:

# here you might want to use BeautifulSoup instead of manual replace

Comments are intended to explain code behavior.

Unused import

According to ruff , the following import is unused:

import json5
\$\endgroup\$
4
\$\begingroup\$

Session

This may not a big deal since you are using requests at only one place, but using a session has some benefits. You can set all your headers at session level once for all, and then there is no need to repeat them when doing requests.get or post etc.

So this code:

headers = {"X-Api-Key": api_key}
response = requests.get(API_URL, params=params, headers=headers)

can be:

session = requests.session()
session.headers = headers
response = session.get(API_URL, params=params)

And you can still add additional parameters on a case by case basis if necessary, on top of what session will provide automatically for you.

This is useful when working with APIs for example, and you have plenty of functions connecting to different API endpoints.

Separation of concerns

The function news_concerning does at least three completely different things: check an environment variable, and then call a function, and then extract some data.

To begin with, you could move this code to a dedicated function, that checks environment variables on startup:

api_key = os.getenv("NEWS_API_KEY")
if not api_key:
    raise RuntimeError("API key not found. Please set NEWS_API_KEY in your environment.")

For the moment, that's the only check you are doing but there may be more in the future. It is fine to have very small functions, each focused on a specific purpose.

However, it's only 3 lines currently. So it would be okay to have this inside the __main__ guard instead. This check should be performed as early as possible. If you refactor your code later (and I suspect you will), that check may not take place when it should, because the code is executed in a certain order and follows a certain flow.

If we look at your current code, the check is actually done almost immediately, because you call news_concerning very early:

if __name__ == "__main__":
    user_input = input("What news would you like to hear about? ")
    print(news_concerning(user_input))

But you should request user input only after making sure that the environment variable is set, so that the user does not enter data for nothing, since the application will stop with an error right away.

Future improvements

I would be tempted to use a config file for the stopwords. For example in YAML format. So that a user can maintain the list without touching the code, at the risk of breaking things.

Ditto for the categories.

I would also sort the keywords using Python, to have a better overview and reduce the chance of adding duplicates when editing that list. In fact, you can use the set function to process a list and remove duplicates and get a clean list.

\$\endgroup\$
3
  • \$\begingroup\$ I like the idea of moving stopwords/categories into a config file. Do you think YAML is the best choice here, or would JSON/TOML be more practical for portability? \$\endgroup\$ Commented Nov 26 at 19:51
  • 1
    \$\begingroup\$ @Chip01 I believe YAML with pyyaml is just fine for the config file. Also, you can structure your data hierarchically eg. by theme to make it even better organized. YAML would be more convenient than JSON for editing I think, but beware indentation. \$\endgroup\$ Commented Nov 26 at 19:59
  • 1
    \$\begingroup\$ YAML is a bad format: you'd want to use a library like StrictYAML to work around that. TOML is a simpler format with support in the Python standard library, and this use-case doesn't motivate the complexity of YAML. Portability isn't much of a concern considering how many third-party libraries are already being brought in, but for ultra-portability you'd want JSON: the main reason to avoid it is that JSON is hard to edit manually: it's easy to get those braces unbalanced by mistake. (cc @Chip01) \$\endgroup\$ Commented Nov 27 at 10:13

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.