Extract json values using just regex

Question

I have a description field embedded within JSON, and I'm unable to use JSON libraries to parse this data.

I use {0,23} in an attempt to extract the first 23 characters of the string. How can I extract the entire value associated with description?

    import re

    description = "'\description\" : \"this is a tesdt \n another test\" "

    re.findall(r'description(?:\w+){0,23}', description, re.IGNORECASE)

For the above code, just ['description'] is displayed.

There are no characters matching \w immediately after description so this is completely expected. Perhaps you are looking for .{0,23}? — tripleee, Commented Apr 23, 2018 at 16:35
Even if you are unable to import json (but why??) using regex for this seems misdirected, especially if you are unfamiliar with regex. — tripleee, Commented Apr 23, 2018 at 16:38
It may be helpful to know why you can't use any JSON libraries. — Alex Hall, Commented Apr 23, 2018 at 16:43
In case the problem with JSON libraries is that the JSON is embedded in a larger document like a webpage and you don't know how to parse only the JSON, check out github.com/alexmojaki/jsonfinder — Alex Hall, Commented Apr 23, 2018 at 17:02
This is a typical bad question. "Have some problem (which is not demonstrated in the question) and I want to solve it with a regex". A regex is obviously the wrong approach here. — hek2mgl, Commented Apr 23, 2018 at 17:10
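
Following up on the comments above: if the obstacle is that the JSON sits inside a larger document, rather than the json module being unavailable, the standard library's json.JSONDecoder.raw_decode can decode an object starting at a given index and report where it ended. The sketch below is only an illustration under that assumption; find_json_objects and the sample page string are hypothetical names made up for this example.

```python
import json

def find_json_objects(text):
    """Yield every JSON object that decodes successfully starting at a '{'."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find('{', pos)
        if start == -1:
            return
        try:
            # raw_decode returns the decoded value and the index where it ended
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON at this brace: move on to the next candidate
            pos = start + 1
            continue
        yield obj
        pos = end

# Hypothetical input: JSON surrounded by unrelated text
page = '<p>noise {not json}</p> {"description": "this is a test \\n another test"} trailing'

descriptions = [obj["description"] for obj in find_json_objects(page)
                if "description" in obj]
print(descriptions)
```

This only inspects top-level objects it finds; a description nested deeper would need a recursive walk over the decoded value.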

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

You could try this code out:

import re

description = "description\" : \"this is a tesdt \n another test\" "

result = re.findall(r'(?<=description")(?:\s*\:\s*)(".*?(?=")")', description, re.IGNORECASE | re.DOTALL)[0]

print(result)

Which gives you the result of:

"this is a tesdt 
 another test"

Which is essentially:

\"this is a tesdt \n another test\"

And is what you have asked for in the comments.

Explanation -

(?<=description") is a positive look-behind that tells the regex to match the text preceded by description"
(?:\s*\:\s*) is a non-capturing group that tells the regex that description" will be followed by zero or more whitespace characters, a colon (:) and again zero or more whitespace characters.
(".*?(?=")") is the actual match desired, which consists of a double quote ("), as few characters as possible (.*? is lazy, and re.DOTALL lets . match the newline), and a double quote (") at the end. A bounded quantifier such as .{0,23}? would cap the value at 23 characters and would not match this 30-character sample at all.
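
To see how the quantifier inside the last group affects the result, here is a small sketch that runs the same pattern twice on the sample input: once with a bounded .{0,23}? and once with an unbounded lazy .*?. The helper find_value is a hypothetical name used only for this comparison.

```python
import re

# Same sample value as in the answer: 30 characters between the quotes.
text = 'description" : "this is a tesdt \n another test" '

def find_value(quantifier):
    """Run the answer's pattern with the given quantifier after the dot."""
    pattern = r'(?<=description")(?:\s*:\s*)(".' + quantifier + r'(?=")")'
    return re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

bounded = find_value('{0,23}?')   # at most 23 characters between the quotes
unbounded = find_value('*?')      # lazy, any number of characters

print(bounded)
print(unbounded)
```

With the bounded form the closing quote is never reached within 23 characters, so findall returns an empty list; the lazy unbounded form stops at the first closing quote. Neither form handles a value that contains an escaped quote.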

edited Jun 20, 2020 at 9:12
Community Bot

answered Apr 23, 2018 at 16:46
Robo Mop

How do I match until a double quote is met?
– blue-sky
Commented Apr 23, 2018 at 16:49
@blue-sky You'll have to elaborate on that, because in your sample input, description is immediately followed by a double quotation mark.
– Robo Mop
Commented Apr 23, 2018 at 16:51
1

@hek2mgl I apologize for using regex here, even though I have heard that using a JSON library is better than regex in such cases, as comments on my previous answers have pointed out. However, I know absolutely nothing about JSON, or its libraries, and I am accustomed to using regex. The question seemed simple enough, so I used regex in my answer.
– Robo Mop
Commented Apr 23, 2018 at 17:19
1

@hek2mgl I thought that since the OP tagged the question with regex, he would be familiar with it, and I also saw some comments telling him to use JSON Libraries. So I thought I might as well add whatever limited information I knew as an answer, solve his problem, and then he would also be able to later learn about JSON Parsing.
– Robo Mop
Commented Apr 23, 2018 at 17:22
1

@hek2mgl Perhaps you could add an answer using JSON, and tell him how it would be easier to use that instead of regex. I'm sure that would be better appreciated :)
– Robo Mop
Commented Apr 23, 2018 at 17:24


Alex Hall · Accepted Answer · 2018-04-23 16:58:18Z

# First just creating some test JSON

import json

data = {
    'items': [
        {
            'description': 'A "good" thing',

            # This is ignored because I'm assuming we only want the exact key 'description'
            'full_description': 'Not a good thing'
        },
        {
            'description': 'Test some slashes: \\ \\\\ \" // \\/ \n\r',
        },
    ]
}

j = json.dumps(data)

print(j)

# The actual code

import re

pattern = r'"description"\s*:\s*("(?:\\"|[^"])*?")'
descriptions = [

    # I'm using json.loads just to parse the matched string to interpret
    # escapes properly. If this is not acceptable then ast.literal_eval
    # will probably also work
    json.loads(d)
    for d in re.findall(pattern, j)]

# Testing that it works

assert descriptions == [item['description'] for item in data['items']]
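
One edge case worth checking with the pattern above is a value that ends in a backslash: in the JSON text it appears as \\ followed by the closing quote, and the \\" alternative can consume that closing quote as if it were escaped. A possible variation, offered as a sketch rather than a full JSON string grammar, treats every backslash together with the character after it as one unit. STRING and the sample data below are assumptions made up for this example.

```python
import json
import re

# A string body is a run of either a character that is neither a quote
# nor a backslash, or a backslash followed by exactly one character.
STRING = r'"(?:[^"\\]|\\.)*"'
pattern = re.compile(r'"description"\s*:\s*(' + STRING + r')')

# Hypothetical test data: the first description ends with a backslash
data = {
    'items': [
        {'description': 'ends with a backslash \\', 'other': 'x "quoted"'},
        {'description': 'A "good" thing'},
    ]
}
j = json.dumps(data)

# json.loads is used only to interpret the escapes inside each matched string
descriptions = [json.loads(d) for d in pattern.findall(j)]
print(descriptions)
```

Because a backslash is always consumed as a pair, an escaped backslash can no longer be confused with an escaped quote.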

Honestly, what's the point here? You encourage the OP to parse json with regular expressions? — hek2mgl, Commented Apr 23, 2018 at 17:08
