0

I need to parse a text file that contains the logins and ids of users:

+----+---------------+---------------+
| Id | Login         | Name          |
+----+---------------+---------------+
| 1  | admin         | admin         |
| 2  | admin2        | admin2        |
| 3  | ekaterina     | Ekaterina     |
| 4  | commarik      | commarik      |
| 5  | basildrescher | BasilDrescher |
| 6  | danielalynn   | DanielaLynn   |
| 7  | rosez13yipfj  | RoseZ13yipfj  |
| 8  | veolanoyes    | VeolaNoyes    |
| 9  | angel         | Angel         |
| 10 | michalea44    | MichaleA44    |
+----+---------------+---------------+

So I use re, like this:

import re
fh = open('test1.txt')
lines = fh.readlines()
for line in lines:
        #print line
        p = re.compile(r"|(.*?)|")
        m2 = p.search(line)
        if m2:
                print m2.group(0)

The problem is that I can't get the result I need. I've tried various combinations with spaces and tabs, but none of them worked. I solved this with split(), but I still want to understand where I went wrong. Any help would be appreciated. Thank you!

5
  • 1
    p = re.compile(...) could be outside the for loop.
    – galath
    Commented Jul 17, 2015 at 16:15
  • 1
    As an alternative, consider m2 = line.strip('|').split('|')
    – Robᵩ
    Commented Jul 17, 2015 at 16:19
  • The code should parse wpscan logs into a convenient form for users. Commented Jul 17, 2015 at 16:20
  • I agree with @Robᵩ that using strip and split is probably the better solution here. Commented Jul 17, 2015 at 16:25
  • 1
    "Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems." --jwz
    – Robᵩ
    Commented Jul 17, 2015 at 16:31

5 Answers 5

4

You have multiple errors:

  • The | is not escaped
  • You only have one group, so you are extracting only the first column.

The regex should be like this:

\|(.*?)\|(.*?)\|(.*?)\|

You can see a demo here.
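
A minimal sketch of the corrected pattern applied to one row from the question. The names `row_pattern` and `sample` are illustrative only, and the `print()` call uses Python 3 syntax, whereas the question's code is Python 2. Because the lazy groups also capture the padding spaces, each field is stripped afterwards.

```python
import re

# Escaped pipes match literal "|" characters; each lazy group captures one column.
row_pattern = re.compile(r"\|(.*?)\|(.*?)\|(.*?)\|")

sample = "| 5  | basildrescher | BasilDrescher |"
match = row_pattern.search(sample)

# The groups still contain the padding around each value, so strip it.
fields = [field.strip() for field in match.groups()]
print(fields)
```

On the `+----+` border lines the same `search()` call returns `None`, so a loop over the whole file should check the result before calling `groups()`.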

0
4

If you don't expect fancy data, you can just use word chars and digits.

r"([\d\w]+)"

Sample usage below

In [27]: data = """+----+---------------+---------------+
....:     | Id | Login         | Name          |
....:     +----+---------------+---------------+
....:     | 1  | admin         | admin         |
....:     | 2  | admin2        | admin2        |
....:     | 3  | ekaterina     | Ekaterina     |
....:     | 4  | commarik      | commarik      |
....:     | 5  | basildrescher | BasilDrescher |
....:     | 6  | danielalynn   | DanielaLynn   |
....:     | 7  | rosez13yipfj  | RoseZ13yipfj  |
....:     | 8  | veolanoyes    | VeolaNoyes    |
....:     | 9  | angel         | Angel         |
....:     | 10 | michalea44    | MichaleA44    |
....:     +----+---------------+---------------+"""

In [32]: matches = re.findall(r"([\d\w]+)", data)
In [36]: matches
Out[36]: ['Id', 'Login', 'Name', '1', 'admin', 'admin', '2', 'admin2', 'admin2', '3', 'ekaterina', 'Ekaterina', '4', 'commarik', 'commarik', '5', 'basildrescher', 'BasilDrescher', '6', 'danielalynn', 'DanielaLynn', '7', 'rosez13yipfj', 'RoseZ13yipfj', '8', 'veolanoyes', 'VeolaNoyes', '9', 'angel', 'Angel', '10', 'michalea44', 'MichaleA44']
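
The flat list above loses the row boundaries. One way to recover them, sketched here on a shortened copy of the table, is to slice the list in steps of three. This assumes every field is a single run of word characters; a name containing a space would shift the grouping. The pattern is written as `\w+` because `\w` already matches digits. The names `rows`, `header` and `records` are illustrative.

```python
import re

data = """\
+----+---------------+---------------+
| Id | Login         | Name          |
+----+---------------+---------------+
| 1  | admin         | admin         |
| 2  | admin2        | admin2        |
+----+---------------+---------------+
"""

# Border lines contain only "+" and "-", so they contribute no matches.
matches = re.findall(r"\w+", data)

# Three columns per row: regroup the flat list into tuples of three.
rows = [tuple(matches[i:i + 3]) for i in range(0, len(matches), 3)]
header, records = rows[0], rows[1:]
```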
1
  • This seems a much more elegant solution if you don't expect anything apart from words and digits in the data. Commented Jul 17, 2015 at 16:25
3

| is a special character in regular expressions for "or"ing two expressions together. You need to escape it as \| to match the actual character. Also, search() will find one match. You may want to look through other methods such as findall.
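
A short sketch of both points, using one row from the question. With the pipes unescaped, the pattern is an alternation whose first alternative is empty, so `search()` succeeds immediately with an empty match; that is why the original loop printed blank lines. The `findall` pattern with a lookahead is one possible choice of mine, not something taken from this answer.

```python
import re

line = "| 1  | admin         | admin         |"

# Unescaped: "|" separates alternatives, and the empty first alternative
# matches at position 0 without consuming anything.
unescaped = re.search(r"|(.*?)|", line)
empty_match = unescaped.group(0)

# Escaped: each match starts at a literal "|" and runs up to the next one.
# The lookahead leaves the closing "|" available to start the next match.
cells = [cell.strip() for cell in re.findall(r"\|(.*?)(?=\|)", line)]
print(cells)
```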

0
1

Try using this regex to capture each column of a data row in its own capture group:

\|\s*([0-9]+)\s*\|\s*([\w]+)\s*\|\s*([\w]+)\s*\|

Or, use this one to capture the same way you're trying above (which will also get the headers):

\|\s*(.*?)\s*\|\s*(.*?)\s*\|\s*(.*?)\s*\|

Here's a demo of the first.

As two other people have already said, you didn't escape your pipe character, which is what broke your pattern.

Also, you weren't taking into account the whitespace around the words, so I added \s* and kept it outside the captured groups, so that the captured values come out without padding.
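
As a rough sketch, the first pattern can be applied line by line to a shortened copy of the table. The names `table`, `row_pattern` and `users` are made up for this example, and converting the id to `int` is an extra step that the answer does not require.

```python
import re

table = """\
+----+---------------+---------------+
| Id | Login         | Name          |
+----+---------------+---------------+
| 1  | admin         | admin         |
| 10 | michalea44    | MichaleA44    |
+----+---------------+---------------+
"""

# Compiled once, outside the loop. The [0-9]+ in the first group means the
# header row ("Id") and the border lines do not match at all.
row_pattern = re.compile(r"\|\s*([0-9]+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\|")

users = []
for line in table.splitlines():
    match = row_pattern.search(line)
    if match:
        user_id, login, name = match.groups()
        users.append((int(user_id), login, name))
```

Because the `\s*` sits outside the groups, no `strip()` call is needed on the captured values.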

1

Yes, something like the following would work:

import re

# Compile once, outside the loop. Named groups label the three columns.
p = re.compile(r"\|(?P<id>.*)\|(?P<login>.*)\|(?P<name>.*)\|")

with open('test1.txt') as fh:
    lines = fh.readlines()

# Skip the top border and the header row; border lines simply fail to match.
for line in lines[2:]:
    m = p.search(line)
    if m:
        print("%s %s %s" % (m.group('id').strip(),
                            m.group('login').strip(),
                            m.group('name').strip()))
