2

I'm looking for help to create a regular expression that can get a certain text after a given string using Python.

I'm trying to extract a JSON from a page and it's like this:

    var config = {aslkdjsakljdkalsj{asdasdas}askldjaskljd};

I need a regex that can get from the first { to the } => without the semicolon

I've tried using

    config = .*?(?=\}\;)

but the output is

    config = {sadasdasdas{a}asdasdasd

It gets the config = part and doesn't get the last }.

How can I fix it?

10
  • What language are you implementing this in? Commented Jan 3, 2019 at 1:43
  • Also, {aslkdjsakljdkalsj{asdasdas}askldjaskljd}; is not JSON Commented Jan 3, 2019 at 1:44
  • I'm not implementing yet. I'm trying to find the right pattern using link. Commented Jan 3, 2019 at 1:48
  • And yes, in this example it's not a JSON. Commented Jan 3, 2019 at 1:48
  • Which language you're using really matters - in some languages, this is impossible, in others, it's doable. Commented Jan 3, 2019 at 1:50

1 Answer 1

1

If your line of JS there is guaranteed to contain no newline characters before the terminating ;, then the problem is simple enough - match var config =, followed by non-newline characters captured in a group, and then matcha semicolon and the end of the line. If the JSON is delimited with 's, then, for example, use the pattern

var config = '(.+)';$

and extract the first group.

input = '''
  var config = '{ "foo": "b\\ar", "ba{{}}}{{z": ["buzz}", "qux", {"innerprop": "innerval"}]}';
  var someOtherVar = 'bar';
'''
match = re.search("(?m)var config = '(.+)';$", input);

If the JSON isn't guaranteed to be on its own line, then it's a lot more complicated. Parsing nested structures like JSON is difficult - the only way the general problem is solvable with regular expressions is if the structure is known beforehand (which often isn't the case, and can require a lot of repetitive code in the pattern), or if the RE engine being used supports recursive matches. Without that, there's no way to to express the need for a balanced number of {s with }s in the pattern.

Luckily, if you're working with Python, even though Python's native REs don't support recursion, there'a a regex module available that does. You'll also need to make sure that the { and }s that may come inside of strings in the JSON don't affect the current nesting level. For a raw string, you'd need a pattern like

var config = String\.raw`\K({(?:"(?:\\|\\"|[^"])*"|[^{}]|(?1))*})(?=`;)

The outside of the capture group is

var config = String\.raw`\K({ ... })(?=`;)

matching the line you want and the string delimiters, with a capturing group of

{(?:"(?:\\|\\"|[^"])*"|[^{}]|(?1))*}

which means - {, followed by any number of: either

  • "(?:\\|\\"|[^"])*" - match a string inside the JSON (either a key or a value), from its starting delimiter to its ending delimiter, ignoring escaped "s, or
  • [^{}] - Match anything that isn't a { or } - other characters can be ignored, since we just want to get the nesting level right, or
  • (?1) - Recurse the whole first capture group (the one that matches the { ... })

This will ensure that the { } brackets are balanced by the end of the pattern.


But - the above is an example where String.raw was used, where literal backslashes in the Javascript code indicate literal backslashes in the string. With ' delimiters, on the other hand, literal backslashes need to be double-escaped in the JS, so the above input would look like

var config = '{ "foo": "b\\\\ar", "ba{{}}}{{z": ["buzz}", "qux", {"innerprop": "innerval"}]}';

requiring double-escaping the backslashes in the pattern as well:

var config = '\K({(?:"(?:\\\\|\\\\"|[^"])*"|[^{}]|(?1))*})(?=';)

https://regex101.com/r/8rSrGf/1

It's pretty complicated. I'd recommend going with the first approach or a variation on it instead, if at all possible.

1
  • Thanks for the editing and answer. Really helped me. Commented Jan 6, 2019 at 3:37

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.