3

I am using regular expressions in Python to search through a page source and find all the JSON information in the JavaScript. A typical example looks something like this:

var fooData = {
    id: 123456789,
    name : "foo bar",
    country_name: "foo",
    country_is_eu: null,
    foo_bars: null,
    foo_email: null,
    foo_rate: 1.0,
    foo_id: 0987654321
};

I'm fairly new to regular expressions, and I'm not sure whether what I'm doing is correct. I can match some individual lines, but I'm not sure how to use re.MULTILINE. This is the code I have right now:

prog = re.compile('[var ]?\w+ ?= ?{[^.*]+\n};', re.MULTILINE)
vars = prog.findall(text)

Why is this not working?

To be more clear, I really need it to match everything in between these brackets like this:

var fooData = {

};

So, essentially I can't figure out a way to match every line except one that looks like this:

};
4
  • Check out my response. I updated it, and you may want to give it a try.
    – user1006989
    Commented Dec 22, 2012 at 5:34
  • Yes, thank you for helping me! I didn't realize that it was as simple as [^}]+; I did not know you could do that. Commented Dec 22, 2012 at 5:39
  • There's a built in json module that is useful. It sounds like you should be using that instead of regex.
    – ninMonkey
    Commented Dec 22, 2012 at 5:47
  • Obviously, when I have the entire page source, I have to find the JSON before I can parse it. Commented Dec 22, 2012 at 5:52

3 Answers

2

This is what you are looking for; it matches the contents without the brackets:

(?<=var fooData = {)[^}]+(?=};)
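As a minimal sketch of how that pattern might be used from Python: the `page` string below is a made-up stand-in for the real page source, and the braces are escaped only for readability. The lookbehind and lookahead assert the delimiters without including them in the match.

```python
import re

# Hypothetical page source containing a shortened copy of the question's object.
page = """<script>
var fooData = {
    id: 123456789,
    name : "foo bar",
    foo_rate: 1.0
};
</script>"""

# (?<=...) and (?=...) check for the delimiters but keep them out of the match.
# The lookbehind has a fixed width, which Python's re module requires.
pattern = re.compile(r'(?<=var fooData = \{)[^}]+(?=\};)')

match = pattern.search(page)
body = match.group(0) if match else None
print(body)
```

Because the lookbehind spells out `var fooData = {` exactly, this only finds that one variable with that exact spacing; a page with several differently named objects would need a more general pattern.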
1
  • 1
    Thank you! I appreciate your diligence in helping me even after I got it to work. ^_^ Commented Dec 22, 2012 at 5:40
0

When you're not sure, always consult the documentation (it's quite good for Python).

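Multi-line mode (re.MULTILINE) changes only what the anchors mean: the caret (^) then matches at the start of the string and immediately after every newline character (\n), and the dollar sign ($) matches at the end of the string and immediately before every newline. Without the flag, ^ matches only at the start of the string and $ only at the end (or just before a trailing newline).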
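A small sketch of the difference, using a made-up three-line string (the variable names are only for illustration):

```python
import re

text = "first line\nsecond line\nthird line"

# Without the flag, ^ anchors only at the very start of the string.
default_hits = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also anchors right after every newline.
multiline_hits = re.findall(r'^\w+', text, re.MULTILINE)

# In multi-line mode, $ also matches just before each newline.
line_ends = re.findall(r'\w+$', text, re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second', 'third']
print(line_ends)       # ['line', 'line', 'line']
```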

Your pattern contains no ^ or $ anchors, though, so re.MULTILINE has no effect on it: you already match the newline explicitly with \n, and findall() returns every match either way.
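As a rough check, the pattern from the question can be run with and without the flag against a shortened copy of the sample (the `sample` string and variable names here are assumptions for illustration). Both calls give the same result, since the pattern has no anchors. They find nothing because `[^.*]` is a character class that excludes a literal `.` and `*` rather than meaning "anything", so matching stops at the dot in `1.0`.

```python
import re

# Shortened copy of the question's sample data.
sample = """var fooData = {
    id: 123456789,
    foo_rate: 1.0
};"""

# The pattern from the question, written as a raw string.
question_pattern = r'[var ]?\w+ ?= ?{[^.*]+\n};'

# No ^ or $ in the pattern, so the flag cannot change the result.
plain = re.findall(question_pattern, sample)
flagged = re.findall(question_pattern, sample, re.MULTILINE)

# Removing the '.' from the data lets [^.*]+ reach the closing "};".
no_dot = re.findall(question_pattern, sample.replace("1.0", "1"))

print(plain, flagged)
print(no_dot)
```

Even in the dot-free case the match starts at the space before `fooData`, not at `var`: `[var ]?` is a character class that matches at most one of the characters v, a, r, or space, which is not the same as the optional group `(?:var )?`.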

3
  • I'm using findall() because I'm looking for all instances of the stated pattern, and I've already read the documentation. Commented Dec 22, 2012 at 4:19
  • @yentup findall() matches ALL instances of the regex and so if you have \n at the beginning and end of your regular expression then multiline mode is useless.
    – Alex W
    Commented Dec 22, 2012 at 4:24
  • I know how findall() works, I'm not stupid. I'm trying to find everything like that in the entire page, the multiline is only for finding all the random information in between the brackets. Commented Dec 22, 2012 at 4:27
0

I got it! It turns out multiline mode was not even needed. I just matched all the lines between the brackets that didn't end in a ;. I also slightly modified the regex for finding the brackets and such. Here is my code:

re.findall(r'(?:var )?\w+[ ]?=[ ]?{\n(?:.+(?!(?<=;))\n)+};', text)

Thanks to X.Jacobs, I simplified (and fixed) my code to this:

re.findall(r'(?:var )?\w+\s*=\s*{[^;]+};', text)
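Here is a sketch of that final pattern run against a made-up page with two objects. The `page` contents and the `barData` name are hypothetical, and the braces are escaped only for readability.

```python
import re

# Hypothetical page source with two object literals in different spacing styles.
page = """<script>
var fooData = {
    id: 123456789,
    name : "foo bar"
};
barData={
    foo_rate: 1.0
};
</script>"""

# Optional "var ", a name, "=", then everything up to the first "};".
pattern = re.compile(r'(?:var )?\w+\s*=\s*\{[^;]+\};')

# There are no capturing groups, so findall returns the full matched text.
blocks = pattern.findall(page)

for block in blocks:
    print(block)
    print('---')
```

One caveat with this approach: `[^;]+` cannot cross a semicolon, so an object whose values contain a `;` (inside a string, for example) would not be matched.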
