525

I need some help on declaring a regex. My inputs are like the following:

this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. 
and there are many other lines in the txt files
with<[3> such tags </[3>

The required output is:

this is a paragraph with in between and then there are cases ... where the number ranges from 1-100. 
and there are many other lines in the txt files
with such tags

I've tried this:

#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    # Open each file so that `reader` is actually defined
    with open(infile) as reader:
        for line in reader:
            line2 = line.replace('<[1> ', '')
            line = line2.replace('</[1> ', '')
            line2 = line.replace('<[1>', '')
            line = line2.replace('</[1>', '')

            print(line)

I've also tried this (but it seems like I'm using the wrong regex syntax):

        line2 = line.replace('<[*> ', '')
        line = line2.replace('</[*> ', '')
        line2 = line.replace('<[*>', '')
        line = line2.replace('</[*>', '')

I don't want to hard-code the replace from 1 to 99.

0

7 Answers 7

877
+150

This tested snippet should do it:

import re
line = re.sub(r"</?\[\d+>", "", line)

Edit: Here's a commented version explaining how it works:

line = re.sub(r"""(?x) # Use free-spacing mode (the flag must come first in the pattern).
  <    # Match a literal '<'
  /?   # Optionally match a '/'
  \[   # Match a literal '['
  \d+  # Match one or more digits
  >    # Match a literal '>'
  """, "", line)

Regexes are fun! But I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: the "metacharacters" that must be escaped with a backslash placed in front in order to match literally (and the rules are different inside and outside character classes). There is an excellent online tutorial at www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
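To make the point about metacharacters concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only `re.sub`, `re.findall` and `re.escape` come from the standard library.

```python
import re

# Hypothetical sample line, not taken from the asker's files.
text = "with<[1> tags</[1> and a price of $5 (approx.)"

# Outside a character class, '[' would open a class, so it is escaped as '\['.
cleaned = re.sub(r"</?\[\d+>", "", text)
print(cleaned)

# Inside a character class, '$', '(', ')' and '.' are plain literals.
punctuation = re.findall(r"[$().]", text)
print(punctuation)

# re.escape() adds the backslashes needed to match a string literally.
only_opening = re.sub(re.escape("<[1>"), "", text)
print(only_opening)
```

If you only ever need to match one fixed string, `re.escape()` saves you from working out which characters need a backslash.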

8
  • 19
    Also don't neglect The Book on Regular Expressions - Mastering Regular Expressions, by Jeffrey Friedl
    – pcurry
    Commented May 14, 2013 at 5:05
  • 4
    Another good reference: see w3schools.com/python/python_regex.asp
    – Carson
    Commented May 15, 2020 at 11:23
  • 3
    The commented version mentions (?x) free-spacing mode, but that is not in the snippet. Is that a default or something?
    – RufusVS
    Commented Sep 24, 2020 at 21:38
  • 2
    @RufusVS - The '(?x)' inside the regex text tells the regex compiler that this regex is written in free-spacing mode. Alternatively, you could pass the 're.VERBOSE' compilation flag to the function call. Commented Aug 29, 2021 at 21:21
  • 6
    @ridgerunner Actually, my point was, you have the expression (?x) present in your commented version of the pattern, but NOT in the uncommented version above it that you called tested snippet. (Edit: I googled it and found out the free-spacing expression is only needed because you have comments and spaces in the explanatory version, not needed in the tested snippet version. Got it now.)
    – RufusVS
    Commented Aug 30, 2021 at 3:27
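As a rough illustration of the comments above, the sketch below compiles the same pattern three ways: compact, with the inline `(?x)` flag, and with `re.VERBOSE` passed as an argument. The variable names and the sample line are assumptions made for this example.

```python
import re

line = "with<[1> in between</[1> text"

# Compact form: no whitespace or comments, so no flag is needed.
compact = re.compile(r"</?\[\d+>")

# Inline flag: (?x) is the very first thing in the pattern.
inline = re.compile(r"""(?x)
    <     # literal '<'
    /?    # optional '/'
    \[    # literal '['
    \d+   # one or more digits
    >     # literal '>'
""")

# Same pattern, with the flag passed to re.compile() instead.
flagged = re.compile(r"""
    <  /?  \[  \d+  >
""", re.VERBOSE)

results = [p.sub("", line) for p in (compact, inline, flagged)]
print(results)
```

All three produce the same result; free-spacing mode only changes how the pattern is written, not what it matches.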
66

str.replace() does fixed replacements. Use re.sub() instead.
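A small sketch of the difference, using a shortened, made-up version of the asker's input: `str.replace()` searches for the literal text `<[*>`, while `re.sub()` treats its first argument as a pattern.

```python
import re

line = "with<[1> in between</[1> and the<[99> rest</[99>"

# str.replace() looks for the exact characters '<[*>' and finds none,
# so the line comes back unchanged.
unchanged = line.replace("<[*>", "")
print(unchanged == line)

# re.sub() interprets the first argument as a regular expression.
cleaned = re.sub(r"</?\[\d+>", "", line)
print(cleaned)
```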

4
  • 4
    Also worth noting that your pattern should look something like "</{0,1}\[\d{1,2}>" (Python writes repetition ranges as {m,n}).
    – user684934
    Commented Apr 14, 2011 at 4:05
  • 7
    What does fixed replacements mean?
    – avi
    Commented Jul 3, 2015 at 11:35
  • @avi He probably meant replacing a fixed, literal string, as opposed to locating partial matches with a regex. Commented Jul 11, 2017 at 8:48
  • 2
    fixed (literal, constant) strings
    – vstepaniuk
    Commented Jul 31, 2019 at 11:54
32

I would do it like this (regex explained in comments):

import re

# If you need to use the regex more than once it is suggested to compile it.
pattern = re.compile(r"</{0,}\[\d+>")

# </{0,}\[\d+>
# 
# Match the character “<” literally «<»
# Match the character “/” literally «/{0,}»
#    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d+»
#    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»

subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. 
and there are many other lines in the txt files
with<[3> such tags </[3>"""

result = pattern.sub("", subject)

print(result)

If you want to learn more about regex, I recommend reading Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.

1
  • 5
    From the python docs: {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It’s better to use *, +, or ? when you can, simply because they’re shorter and easier to read.
    – winklerrr
    Commented Aug 17, 2017 at 14:07
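To illustrate the quantifier equivalence from the docs, here is a minimal sketch with made-up sample tags. It also shows that `*` and `{0,}` accept any number of slashes, whereas `?` accepts at most one.

```python
import re

patterns = {
    "*": re.compile(r"</*\[\d+>"),
    "{0,}": re.compile(r"</{0,}\[\d+>"),
    "?": re.compile(r"</?\[\d+>"),
}
# The last sample has two slashes, which is not a tag from the question.
samples = ["<[1>", "</[1>", "<//[1>"]

# For each quantifier, record whether it matches each whole sample.
matches = {
    name: [bool(p.fullmatch(s)) for s in samples]
    for name, p in patterns.items()
}
for name, row in matches.items():
    print(name, row)
```

For the tags in the question, `?` is probably the closest fit, since a closing tag has exactly one slash.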
17

The easiest way

import re

txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.  and there are many other lines in the txt files with<[3> such tags </[3>'

out = re.sub("(<[^>]+>)", '', txt)
print(out)
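One caveat worth sketching: `<[^>]+>` removes anything between a `<` and the next `>`, not just the numbered tags. The sample text below is invented to show the difference from the narrower pattern used in other answers.

```python
import re

# Hypothetical line that contains comparison operators as well as tags.
text = "if a < b and c > d then keep<[1> this</[1>"

# Broad pattern: also swallows '< b and c >'.
broad = re.sub(r"<[^>]+>", "", text)
print(broad)

# Narrow pattern: only removes '<[N>' and '</[N>'.
narrow = re.sub(r"</?\[\d+>", "", text)
print(narrow)
```

If the files never contain other angle brackets, both patterns give the same result.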
0
16

The replace method of string objects does not accept regular expressions, only fixed strings (see the documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).

You have to use re module:

import re
newline = re.sub(r"</?\[[0-9]+>", "", line)
1
  • 6
    You should use \d+ instead of [0-9]+
    – winklerrr
    Commented Aug 17, 2017 at 14:05
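A note on that suggestion, as a sketch under the assumption that you are on Python 3 and matching `str` (not `bytes`): there, `\d` also matches non-ASCII decimal digits, while `[0-9]` does not. The sample text is made up and uses U+0663 (ARABIC-INDIC DIGIT THREE).

```python
import re

# Second tag uses U+0663, a non-ASCII decimal digit.
text = "<[3> and <[\u0663>"

# \d on a str pattern matches any Unicode decimal digit.
unicode_digits = re.sub(r"</?\[\d+>", "", text)

# [0-9] matches ASCII digits only.
ascii_class = re.sub(r"</?\[[0-9]+>", "", text)

# re.ASCII restricts \d to [0-9] as well.
ascii_flag = re.sub(r"</?\[\d+>", "", text, flags=re.ASCII)

print(repr(unicode_digits), repr(ascii_class), repr(ascii_flag))
```

For plain ASCII input like the question's, the two spellings behave the same, so the choice is mostly about readability.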
5

You don't have to use a regular expression (for your sample string):

>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'

>>> for w in s.split(">"):
...   if "<" in w:
...      print(w.split("<")[0])
...
this is a paragraph with
 in between
 and then there are cases ... where the
 number ranges from 1-100
.
and there are many other lines in the txt files
with
 such tags
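The split idea above can also be sketched as a single expression that rebuilds the cleaned string instead of printing fragments. This assumes, as in the sample, that `<` and `>` appear only as part of tags; the helper name `strip_tags` is made up for this example.

```python
import re

def strip_tags(s):
    # Split at every '>' and keep only the text before the next '<'.
    # Pieces without a '<' (such as the tail of the string) are kept whole.
    return "".join(piece.split("<")[0] for piece in s.split(">"))

s = "with<[1> in between</[1> cases.\nwith<[3> such tags </[3>\n"
print(strip_tags(s))

# On tag-only input the result matches the regex approach.
print(strip_tags(s) == re.sub(r"</?\[\d+>", "", s))
```

Unlike the regex, this version would also cut text after a stray `<`, so it only suits input shaped like the sample.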
4
import os, sys, re, glob

# Match an opening or closing tag such as <[1> or </[99>
pattern = re.compile(r"</?\[\d+>")

for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
   # Open each file so that `reader` is defined
   with open(infile) as reader:
      for line in reader:
         # pattern.sub(replacement, string): replace every tag with nothing
         retline = pattern.sub("", line)
         sys.stdout.write(retline)
