I have a description field embedded in JSON, and I'm unable to use JSON libraries to parse this data.

I use {0,23} in an attempt to extract the first 23 characters of the string. How can I extract the entire value associated with description?

    import re

    description = "'\description\" : \"this is a tesdt \n another test\" "

    re.findall(r'description(?:\w+){0,23}', description, re.IGNORECASE)

The code above displays only ['description'].
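For illustration, here is a small sketch of why the pattern stops right after description. It assumes the sample was meant to be a quoted key followed by a quoted value, so the stray leading characters are dropped; the variable names `word_only` and `any_char` are only for this example.

```python
import re

# Assumed clean form of the sample from the question.
description = '"description" : "this is a tesdt \n another test" '

# \w matches only letters, digits and underscore. The character right after
# "description" is a double quote, so the optional group matches zero times.
word_only = re.findall(r'description(?:\w+){0,23}', description, re.IGNORECASE)
print(word_only)

# .{0,23} accepts up to 23 characters of any kind except a newline,
# so it runs through the quotes and colon and stops at the line break.
any_char = re.findall(r'description.{0,23}', description, re.IGNORECASE)
print(any_char)
```

Neither form returns the whole value: the first consumes nothing after the key, and the second is capped both by the count and by the newline inside the value.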

  • 1
    There are no characters matching \w immediately after description so this is completely expected. Perhaps you are looking for .{0,23}?
    – tripleee
    Commented Apr 23, 2018 at 16:35
  • 1
    Even if you are unable to import json (but why??) using regex for this seems misdirected, especially if you are unfamiliar with regex.
    – tripleee
    Commented Apr 23, 2018 at 16:38
  • 2
    It may be helpful to know why you can't use any JSON libraries.
    – Alex Hall
    Commented Apr 23, 2018 at 16:43
  • 1
    In case the problem with JSON libraries is that the JSON is embedded in a larger document like a webpage and you don't know how to parse only the JSON, check out github.com/alexmojaki/jsonfinder
    – Alex Hall
    Commented Apr 23, 2018 at 17:02
  • 1
    This is a typical bad question. "Have some problem (which is not demonstrated in the question) and I want to solve it with a regex". A regex is obviously the wrong approach here.
    – hek2mgl
    Commented Apr 23, 2018 at 17:10
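If the obstacle is that the JSON sits inside a larger document, as one of the comments above suggests, the standard library may already be enough. The following is a minimal sketch using `json.JSONDecoder.raw_decode`; the helper name `find_json_objects` and the sample `page` string are hypothetical, and this is not a description of how jsonfinder works.

```python
import json

def find_json_objects(text):
    """Yield every JSON object that decodes cleanly from somewhere in text."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find('{', pos)
        if start == -1:
            return
        try:
            # raw_decode parses one JSON value at the start of the string and
            # reports how many characters it consumed, ignoring what follows.
            obj, length = decoder.raw_decode(text[start:])
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here, try the next brace
            continue
        yield obj
        pos = start + length

# Hypothetical input: one JSON object surrounded by unrelated text.
page = 'noise before {"description": "this is a tesdt \\n another test"} noise after'
objects = list(find_json_objects(page))
print(objects[0]['description'])
```

Slicing the text on every attempt is simple but not efficient for large inputs; it is meant only to show the idea.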

2 Answers

1

You could try this code out:

import re

description = "description\" : \"this is a tesdt \n another test\" "

result = re.findall(r'(?<=description")(?:\s*\:\s*)(".*?(?=")")', description, re.IGNORECASE | re.DOTALL)[0]

print(result)

Which gives you the result of:

"this is a tesdt 
 another test"

Which is essentially:

\"this is a tesdt \n another test\"

And is what you have asked for in the comments.


Explanation -

(?<=description") is a positive look-behind that tells the regex to match the text preceded by description"
(?:\s*\:\s*) is a non-capturing group that tells the regex that description" will be followed by zero-or-more whitespace characters, a colon (:) and again zero-or-more whitespace characters.
(".*?(?=")") is the actual match desired, which consists of a double quote ("), as few characters as possible up to the next double quote, and a double quote (") at the end. The {0,23} bound from the question is not used here, because the sample value is 30 characters long and a 23-character limit would find no match. re.DOTALL lets . match the newline inside the value.

  • how to match until a double quotes is met ?
    – blue-sky
    Commented Apr 23, 2018 at 16:49
  • @blue-sky You'll have to elaborate on that, because in your sample input, description is immediately followed by a double quotation mark.
    – Robo Mop
    Commented Apr 23, 2018 at 16:51
  • 1
    @hek2mgl I apologize for using regex here, even though I have heard that using a JSON library is better than regex in such cases, as comments on my previous answers have pointed out. However, I know absolutely nothing about JSON or its libraries, and I am accustomed to using regex. The question seemed simple enough, so I used regex in my answer.
    – Robo Mop
    Commented Apr 23, 2018 at 17:19
  • 1
    @hek2mgl I thought that since the OP tagged the question with regex, he would be familiar with it, and I also saw some comments telling him to use JSON Libraries. So I thought I might as well add whatever limited information I knew as an answer, solve his problem, and then he would also be able to later learn about JSON Parsing.
    – Robo Mop
    Commented Apr 23, 2018 at 17:22
  • 1
    @hek2mgl Perhaps you could add an answer using JSON, and tell him how it would be easier to use that instead of regex. I'm sure that would be better appreciated :)
    – Robo Mop
    Commented Apr 23, 2018 at 17:24
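As a rough idea of what the JSON-based answer suggested in these comments might look like, here is a minimal sketch. It assumes the data can be obtained as one complete JSON document, which the question does not confirm; the helper `collect_descriptions` and the sample `raw` string are hypothetical.

```python
import json

def collect_descriptions(node):
    """Recursively gather the value of every key named exactly 'description'."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == 'description':
                found.append(value)
            # Keep descending in case the value holds nested objects.
            found.extend(collect_descriptions(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_descriptions(item))
    return found

# Hypothetical document; escapes such as \n and \" are decoded by json.loads.
raw = ('{"items": [{"description": "this is a tesdt \\n another test"}, '
       '{"description": "A \\"good\\" thing", "full_description": "ignored"}]}')
descriptions = collect_descriptions(json.loads(raw))
print(descriptions)
```

The parser takes care of quoting and escaping, so there is no need to decide where a string value ends.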
0
# First just creating some test JSON

import json

data = {
    'items': [
        {
            'description': 'A "good" thing',

            # This is ignored because I'm assuming we only want the exact key 'description'
            'full_description': 'Not a good thing'
        },
        {
            'description': 'Test some slashes: \\ \\\\ \" // \\/ \n\r',
        },
    ]
}

j = json.dumps(data)

print(j)

# The actual code

import re

pattern = r'"description"\s*:\s*("(?:\\"|[^"])*?")'
descriptions = [

    # I'm using json.loads just to parse the matched string to interpret
    # escapes properly. If this is not acceptable then ast.literal_eval
    # will probably also work
    json.loads(d)
    for d in re.findall(pattern, j)]

# Testing that it works

assert descriptions == [item['description'] for item in data['items']]
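One caveat with the pattern above: when a value ends in a backslash, the encoded text ends in `\\"`, and the `\\"` alternative can treat the second backslash plus the closing quote as an escaped quote, so the match runs past the end of the string. A possible variant, offered as a sketch rather than a guaranteed fix for every input, consumes any backslash together with the character that follows it. The names `original_pattern`, `robust_pattern` and `tricky` are only for this example.

```python
import json
import re

original_pattern = r'"description"\s*:\s*("(?:\\"|[^"])*?")'

# Either a backslash plus whatever follows it, or any character that is
# neither a double quote nor a backslash.
robust_pattern = r'"description"\s*:\s*("(?:\\.|[^"\\])*")'

# Test value that ends in a backslash, followed by another quoted field.
tricky = {'description': 'ends with a backslash \\', 'other': 'later "quoted" text'}
encoded = json.dumps(tricky)

# The original pattern overruns the closing quote on this input.
overrun = re.findall(original_pattern, encoded)[0]
print(overrun)

# The variant stops at the real end of the string.
values = [json.loads(m) for m in re.findall(robust_pattern, encoded)]
print(values == [tricky['description']])
```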
1
  • 3
    Honestly, what's the point here? You encourage the OP to parse json with regular expressions?
    – hek2mgl
    Commented Apr 23, 2018 at 17:08
