I have gone through many of the regex questions on here and applied their advice, but I still can't get my code to work. I have a list of strings, and I am trying to find the entries in this list that contain one of the following patterns:

  • a BLANK of a BLANK
  • an BLANK of an BLANK
  • a BLANK of an BLANK
  • an BLANK of a BLANK
  • that BLANK of a BLANK
  • that BLANK of an BLANK
  • the BLANK of a BLANK
  • the BLANK of an BLANK

For example, I should be able to find sentences that contain phrases like "an idiot of a doctor" or "the hard-worker of a student."

Once found, I want to make a list of the sentences that satisfy these criteria. So far, this is my code:

for sentence in sentences:
    matched = re.search(r"a [.*]of a " \
                        r"an [.*]of an " \
                        r"a [.*]of an" \
                        r"an [.*]of a " \
                        r"that [.*]of a " \
                        r"that [.*]of an " \
                        r"the [.*]of a " \
                        r"the [.*]of an ", sentence)
    if matched:
        bnp.append(matched)

#Below two lines for testing purposes only
print(matched)
print(bnp)

This code turns up no results, despite the fact that there are phrases that should satisfy the criteria in the list.

  • Why do you write things like [.*]? Take the time to read a regex tutorial first; don't try random things. Commented Jan 17, 2017 at 20:14
  • I thought that [.*] would let me search for a substring of any length with any characters. Did I misunderstand this? Commented Jan 17, 2017 at 20:17
  • Brackets match a single character from a set; use (.*) instead. Commented Jan 17, 2017 at 20:31

2 Answers

[.*] is a character class, so you are asking the regex engine to match a single literal dot or star character. Quoting from the re docs:

[]

Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

...
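To see the difference concretely, here is a minimal sketch. The sample strings are made up for illustration and are not from the question's data:

```python
import re

# Inside brackets, '.' and '*' lose their special meaning and stand for themselves.
char_class = re.compile(r"a [.*]of a ")

# Matches only when a literal '.' or '*' sits right before "of".
literal_hit = char_class.search("a .of a doctor") is not None
natural_miss = char_class.search("a fool of a doctor") is None

# Outside brackets, '.*' means "any run of characters".
wildcard_hit = re.search(r"a .*of a ", "a fool of a doctor") is not None

print(literal_hit, natural_miss, wildcard_hit)
```

All three flags should come out True: the bracketed form only finds the artificial string, while the unbracketed .* finds the natural phrase.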

So, here is one way to do it:

(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*

This expression matches the, that, a, or an, then any characters up to a later a or an, and then the rest of the line.

Here in this link, there is a demonstration of its process.

And here is the actual demonstration:

>>> import re
>>>
>>> regex = r"(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*"
>>> test_str = ("an idiot of a doctor\n"
...             "the hard-worker of a student.\n"
...             "an BLANK of an BLANK\n"
...             "a BLANK of an BLANK\n"
...             "an BLANK of a BLANK\n"
...             "that BLANK of a BLANK\n"
...             "the BLANK of a BLANK\n"
...             "the BLANK of an BLANK\n")
>>>
>>> matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
>>>
>>> for m in matches:
...     print(m.group())
...


an idiot of a doctor
the hard-worker of a student.
an BLANK of an BLANK
a BLANK of an BLANK
an BLANK of a BLANK
that BLANK of a BLANK
the BLANK of a BLANK
the BLANK of an BLANK

As it stands, Python concatenates your adjacent pattern strings into one long string with no operators between them. In effect, you are searching for the regex "a [.*]of a an [.*]of an a [.*]of an ..."
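A quick way to confirm this is to build the string without calling re at all. This sketch uses only the first two pattern pieces from the question:

```python
# Adjacent string literals are joined into one string, with nothing added between them.
pattern = (r"a [.*]of a "
           r"an [.*]of an ")

# Show the single string that re.search would actually receive.
print(repr(pattern))
```

The printed string contains no | anywhere, so the regex engine sees one long sequence rather than a set of alternatives.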

You are missing the 'or' operator: |. A simpler regex to accomplish this task would be something like:

(a|an|that|the) \b.*\b of (a|an) \b.*\b
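Putting it together, here is one possible sketch of the filtering loop. It is a variant of the pattern above, not a drop-in copy: the leading \b, the lazy .+?, and re.IGNORECASE are my own assumptions about what you want, and the sample sentences are invented for illustration. It also stores the sentence itself rather than the match object, since you said you want a list of sentences.

```python
import re

# Article, then the first BLANK, then "of", then a/an, then the start of the second BLANK.
# The leading \b keeps the article from matching inside a longer word.
pattern = re.compile(r"\b(?:a|an|that|the)\s+.+?\s+of\s+an?\s+\w+", re.IGNORECASE)

# Illustrative data only.
sentences = [
    "He is an idiot of a doctor.",
    "She is the hard-worker of a student.",
    "Nothing to see here.",
    "A cup of tea.",
]

# Keep every sentence that contains the pattern anywhere.
bnp = [sentence for sentence in sentences if pattern.search(sentence)]

print(bnp)
```

With this data, only the first two sentences are kept. "A cup of tea." is rejected because "of" is not followed by a or an.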
