Parse JSON Object in python without the json library (Using only regex)

Question

I'm currently building a small application using the Instagram API which replies with JSON "objects" for the GET operations. To get the response I'm currently using urllib2.

This is part of an assignment for one of the courses I'm currently attending, and the biggest challenge is that we are not allowed to use the JSON library to quickly parse and retrieve the information from the Instagram response. We are forced to use the regex library (and only that) to parse the information.

The Instagram response format for a user's feed page, for example, follows the structure shown in this link.

I have honestly spent 3 hours trying to figure this out by myself and have also searched the internet, but most answered questions point to using the JSON library.

Any tips or suggestions would come in handy.

Additionally, other than urllib2 (which may be considered external), I am not allowed to use any third-party library, only the ones that ship with Python 2.7.

Thanks in advance.

This seems like a really unfair assignment. JSON is sufficiently complex that Regex just isn't suited for it. See this classic post: stackoverflow.com/a/1732454/755900 - In an attempt to be helpful though, you'll need to write your own tokenization methods. You most definitely will not be able to do all of it with just a regex, you'll need plenty of custom parsing. — Sean Johnson, Commented May 23, 2014 at 6:24
I happen to agree. The code required to write a JSON parser is not trivial. If this really is an assignment I hope it's a 4th Year Software Engineering course or similar. — James Mills, Commented May 23, 2014 at 6:26
This is currently my 3rd year, first semester, and the course's name is "Programming languages". The course objective is to teach us the different language paradigms (imperative, scripting, etc.) in order to have more tools to face problems in the future (my professor's words). It seems that the assignment was intended to teach us about the useful and powerful regex library, but as far as I can see it got a bit out of hand :P — selarom.epilef, Commented May 23, 2014 at 6:38
@felipeimm The thing is, it's quite non-trivial to implement a JSON parser in pure regex alone (regardless of the language). At some point you have nested structures to parse, and so you need to write a parser. See my response below. — James Mills, Commented May 23, 2014 at 6:51

JoseMiguel · Accepted Answer · 2014-05-23 13:12:24Z

It's really not that complicated. When you do the GET request you get back a large blob of text, of which you only need small parts. For example, to parse a user's news feed and get the images and their captions:

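import re
from urllib2 import urlopen  # Python 2.7

# profile_id and token are assumed to be defined earlier
query = "https://api.instagram.com/v1/users/" + profile_id + "/media/recent?access_token=" + token
response = urlopen(query)
the_page = response.read()
feed = {}
feed['images'] = []
feed['captions'] = []
# Grab the URL that follows each "standard_resolution" key
matchImage = re.findall(r'"standard_resolution":{"url":"(.*?)"', the_page)
# Grab the first two comma-separated chunks after each "caption" key
matchCaption = re.findall(r'"caption":(.*?),(.*?),', the_page)
if len(matchImage) > 0:
    for x in xrange(0, len(matchImage)):
        # Slashes arrive escaped as \/ so drop the backslashes
        image = matchImage[x].replace('\\', '')
        if matchCaption[x][0] == 'null':
            feed['images'].append(image)
            feed['captions'].append('No Caption')
        else:
            # Assumes "text" is the second field of the caption object
            caption = re.search(r'"text":"(.*?)"', matchCaption[x][1])
            caption = caption.group(1).replace('\\', '')
            feed['images'].append(image)
            feed['captions'].append(caption)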
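
To try the patterns above without hitting the API, here is a small sketch that runs them against a made-up response fragment. SAMPLE and extract_feed are hypothetical names, the data is invented, and the real response has many more fields, so the patterns may need adjusting. It is written to run on Python 3 as well.

```python
import re

# Hypothetical, heavily trimmed stand-in for the API response (not real data).
SAMPLE = (
    '{"data":['
    '{"caption":null,"likes":{"count":3},'
    '"images":{"standard_resolution":{"url":"http:\\/\\/example.com\\/a.jpg"}}},'
    '{"caption":{"created_time":"1400000000","text":"Sunset"},'
    '"images":{"standard_resolution":{"url":"http:\\/\\/example.com\\/b.jpg"}}}'
    ']}'
)


def extract_feed(page):
    """Collect image URLs and captions from the raw response text."""
    feed = {'images': [], 'captions': []}
    images = re.findall(r'"standard_resolution":{"url":"(.*?)"', page)
    captions = re.findall(r'"caption":(.*?),(.*?),', page)
    # Pair the n-th image with the n-th caption; this assumes every
    # media item contributes exactly one of each, in the same order.
    for image, (first, second) in zip(images, captions):
        feed['images'].append(image.replace('\\', ''))
        text = re.search(r'"text":"(.*?)"', second)
        if first == 'null' or text is None:
            feed['captions'].append('No Caption')
        else:
            feed['captions'].append(text.group(1).replace('\\', ''))
    return feed


print(extract_feed(SAMPLE))
```

Note that the pairing by position is fragile: if any other part of the response contains a "caption" key, or the field order inside the caption object changes, images and captions will drift out of step.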

James Mills · Accepted Answer · 2014-05-23 06:32:17Z

How about using a functional parser library and a bit of regex?

def parse(seq):
    'Sequence(Token) -> object'
    ...
    n = lambda s: a(Token('Name', s)) >> tokval
    def make_array(n):
        if n is None:
            return []
        else:
            return [n[0]] + n[1]
    ...
    null = n('null') >> const(None)
    true = n('true') >> const(True)
    false = n('false') >> const(False)
    number = toktype('Number') >> make_number
    string = toktype('String') >> make_string
    value = forward_decl()
    member = string + op_(':') + value >> tuple
    object = (
        op_('{') +
        maybe(member + many(op_(',') + member)) +
        op_('}')
        >> make_object)
    array = (
        op_('[') +
        maybe(value + many(op_(',') + value)) +
        op_(']')
        >> make_array)
    value.define(
          null
        | true
        | false
        | object
        | array
        | number
        | string)
    json_text = object | array
    json_file = json_text + skip(finished)

    return json_file.parse(seq)

You will need the funcparserlib library for this.

Note: Doing this with pure regex alone is too hard. You need to write some kind of "parser", so you may as well use a parser library to help with some of the boring bits.

Thanks for the response. Unfortunately, since my professor told us not to use any external library, I cannot really make use of this. However, I'll write this down in case I need it in the future. Thanks again! — selarom.epilef, Commented May 23, 2014 at 7:57
You can just borrow the parts of the library that you need. This is basically a recursive descent parser. Not that hard to implement without a 3rd party library. — James Mills, Commented May 23, 2014 at 8:48
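
As a rough illustration of that last comment, here is a minimal sketch of a regex tokenizer feeding a hand-written recursive descent parser, using only the re module. All names (tokenize, parse_value, parse_json and so on) are my own and not taken from funcparserlib. It is written for Python 3 (on 2.7, chr would need to become unichr) and it is lenient in places: surrogate pairs are not combined and raw control characters inside strings are accepted. Treat it as a starting point, not a complete JSON implementation.

```python
import re

# One named alternative per token kind; leading whitespace is skipped.
TOKEN_RE = re.compile(r'''
    \s*(?:
        (?P<number>-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)
      | (?P<string>"(?:[^"\\]|\\.)*")
      | (?P<name>true|false|null)
      | (?P<op>[{}\[\]:,])
    )''', re.VERBOSE)

ESCAPES = {'"': '"', '\\': '\\', '/': '/', 'b': '\b',
           'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}
NAMES = {'true': True, 'false': False, 'null': None}


def tokenize(text):
    """Split JSON text into (kind, text) pairs, ending with an 'end' marker."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('bad input at position %d' % pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(('end', ''))
    return tokens


def unescape(raw):
    """Turn a quoted JSON string token into a Python string."""
    def repl(m):
        if m.group(1) is not None:  # \uXXXX escape
            return chr(int(m.group(1), 16))
        if m.group(2) not in ESCAPES:
            raise ValueError('bad escape \\%s' % m.group(2))
        return ESCAPES[m.group(2)]
    return re.sub(r'\\(?:u([0-9a-fA-F]{4})|(.))', repl, raw[1:-1])


def parse_value(tokens, i):
    """Parse one value starting at tokens[i]; return (value, next_index)."""
    kind, text = tokens[i]
    if kind == 'number':
        is_float = '.' in text or 'e' in text.lower()
        return (float(text) if is_float else int(text)), i + 1
    if kind == 'string':
        return unescape(text), i + 1
    if kind == 'name':
        return NAMES[text], i + 1
    if text == '[':
        return parse_array(tokens, i + 1)
    if text == '{':
        return parse_object(tokens, i + 1)
    raise ValueError('unexpected token %r' % text)


def parse_array(tokens, i):
    """Parse array items; tokens[i] is the token right after '['."""
    items = []
    if tokens[i] == ('op', ']'):
        return items, i + 1
    while True:
        value, i = parse_value(tokens, i)
        items.append(value)
        if tokens[i] == ('op', ']'):
            return items, i + 1
        if tokens[i] != ('op', ','):
            raise ValueError('expected , or ] but got %r' % tokens[i][1])
        i += 1


def parse_object(tokens, i):
    """Parse object members; tokens[i] is the token right after '{'."""
    obj = {}
    if tokens[i] == ('op', '}'):
        return obj, i + 1
    while True:
        kind, key = tokens[i]
        if kind != 'string' or tokens[i + 1] != ('op', ':'):
            raise ValueError('expected "key": near %r' % key)
        value, i = parse_value(tokens, i + 2)
        obj[unescape(key)] = value
        if tokens[i] == ('op', '}'):
            return obj, i + 1
        if tokens[i] != ('op', ','):
            raise ValueError('expected , or } but got %r' % tokens[i][1])
        i += 1


def parse_json(text):
    """Parse a complete JSON document into Python objects."""
    tokens = tokenize(text)
    value, i = parse_value(tokens, 0)
    if tokens[i][0] != 'end':
        raise ValueError('trailing data after JSON value')
    return value
```

With something like this in place, a field can be reached by ordinary indexing, for example parse_json(the_page)['data'][0]['images']['standard_resolution']['url'], assuming the response has that shape. The regex only does the tokenizing; the nesting is handled by the mutually recursive functions, which is the part a single regex cannot express.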
