0

I have a text file formatted like JSON, except that everything is on a single line (it could be a MongoDB file). Could someone point me in the right direction for extracting values from it with a Python regex?

The text shows up like this:

{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":

I want to get the "fileAssetId" and "filename" values. I've tried to load the file with Python's json module, but I get an error.

For the fileAssetId I tried this regex:

regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")

But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991

I'm not too sure how to get the data as it's displayed.

  • can you put some data of your file here?
    – Frank AK
    Commented Nov 23, 2017 at 11:42
  • The text shows up like this: {"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content": I am wanting to get the "fileAssetId" and filename"
    – iHaag
    Commented Nov 23, 2017 at 11:54
  • The dictionary is not complete. You are missing [ at the beginning and }] at the end.
    – ezdazuzena
    Commented Nov 23, 2017 at 12:01
  • I would love to extract the value after "fileAssetId": and the value after "filename":, but I'm not too sure how to do it.
    – iHaag
    Commented Nov 23, 2017 at 12:06
  • Using a JSON parser must be a better option?
    – StefanE
    Commented Nov 23, 2017 at 12:18

4 Answers

1

How about using positive lookahead and lookbehind:

(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")

captures the fileAssetId and

(?<=\"filename\":\").+?(?=\")

matches the filename.

For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)

To get a list of all matches use re.findall or re.finditer instead of re.match.

re.findall(pattern, string) returns a list of matching strings.

re.finditer(pattern, string) returns an iterator of match objects.
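
As a minimal sketch of the two calls above: the sample text below is a shortened, made-up string in the same shape as the question's data (the second record is hypothetical), and the variable names are only illustrative.

```python
import re

# Shortened sample in the same shape as the question's data.
# The second record is hypothetical, added to show multiple matches.
text = (
    '{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},'
    '{"fileAssetId":"1a2b3c4d-0000-1111-2222-333344445555","filename":"Notes.txt"}'
)

asset_id_pattern = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename_pattern = re.compile(r'(?<="filename":").+?(?=")')

# findall returns every match as a plain string.
asset_ids = asset_id_pattern.findall(text)
filenames = filename_pattern.findall(text)
print(asset_ids)
print(filenames)

# finditer yields match objects, which also expose the position of each match.
for m in filename_pattern.finditer(text):
    print(m.start(), m.group())
```

Since the patterns contain no capture groups, findall returns the whole matched text for each hit.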

  • That works, thank you so much, but it's only showing the first match, not all the values. I'm doing it this way: import re f=open("jsonfile.txt") f=f.readlines() for line in f: m = re.search(r'(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")|(?<=\"filename\":\").+?(?=\")', line) print m.group()
    – iHaag
    Commented Nov 23, 2017 at 14:12
  • As I said in my answer edit, use findall or finditer, not search.
    – Igl3
    Commented Nov 23, 2017 at 14:13
  • That works a treat, thank you. Is there a way I could store all the values for "filename" and "fileAssetId" so I could do something like wget = urllib.urlopen('samplewebsite.com' + fileAssetId_value + filename_value)? Thank you for your help.
    – iHaag
    Commented Nov 23, 2017 at 14:26
  • If one asset id is always associated with one filename, I would try to fix your json data and load it instead of using regex as it'll be a very complex regex to get the associated values. Can you do with open('jsonfile', 'r') as f: distros_dict = json.load(f) for distro in distros_dict: print(distro) and share the output? Then I can maybe tell you why you can't access the filename.
    – Igl3
    Commented Nov 23, 2017 at 14:28
  • Output for your code loading it as a JSON file was just the letter d
    – iHaag
    Commented Nov 23, 2017 at 14:35
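
On the output above: iterating over a dict yields its keys, which is why the loop printed only d. If the text really is truncated, as the sample in the question is, one possible approach is to decode the complete objects one at a time with json.JSONDecoder.raw_decode and stop at the first broken one. This is only a sketch: the helper name iter_complete_objects is made up here, and the "d" nesting is an assumption taken from the sample.

```python
import json

# Truncated sample shaped like the question's data:
# one complete object followed by one that is cut off.
text = (
    '{"d":{"author":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531,"viewLevel":2},'
    '{"__type":"WikiNode","children":[],"content":'
)

def iter_complete_objects(s):
    """Yield each complete top-level JSON value in s, stopping at the first broken one."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(s):
        # Skip commas and whitespace between consecutive values.
        while pos < len(s) and s[pos] in ", \t\r\n":
            pos += 1
        if pos >= len(s):
            break
        try:
            obj, pos = decoder.raw_decode(s, pos)
        except json.JSONDecodeError:
            break  # truncated or malformed remainder
        yield obj

# Keep each asset id together with its filename (assumes the "d" nesting of the sample).
pairs = [
    (obj["d"]["fileAssetId"], obj["d"].get("filename"))
    for obj in iter_complete_objects(text)
    if isinstance(obj, dict) and isinstance(obj.get("d"), dict) and "fileAssetId" in obj["d"]
]
print(pairs)
```

Each tuple keeps one asset id with its filename, so the two values can be combined into a URL later.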
1

You can use python's walk method and check each entry with re.match.

If the string you have cannot be converted to a Python dict, you can use just a regex:

print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))

Solution for your example:

import re

example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'

regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))

Executing this yields:

fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
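
One thing to keep in mind with the pattern above: the leading greedy .* means re.match reports the last fileAssetId in the string, not the first. A rough sketch of collecting every id together with its filename using capture groups follows. It assumes that "filename" directly follows "fileAssetId" in each record, as in the sample; the second record in the text is made up for illustration.

```python
import re

# Shortened sample; the second record is hypothetical.
text = (
    '{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},'
    '{"fileAssetId":"1a2b3c4d-0000-1111-2222-333344445555","filename":"Notes.txt"}'
)

# Assumption: "filename" comes immediately after "fileAssetId" in every record.
pair_pattern = re.compile(r'"fileAssetId":"([^"]+)","filename":"([^"]+)"')

# One (fileAssetId, filename) tuple per record.
pairs = [m.groups() for m in pair_pattern.finditer(text)]
print(pairs)

# The greedy .* consumes as much as possible, so re.match finds the last id.
last_id = re.match(r'.*fileAssetId":"([^"]+)".*', text).group(1)
# re.search without the leading .* finds the first id.
first_id = re.search(r'fileAssetId":"([^"]+)"', text).group(1)
print(first_id, last_id)
```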
  • coming up red in sublime text \":\"([^\"]+)\".*`).group(1)
    – iHaag
    Commented Nov 23, 2017 at 12:13
  • File "test2.py", line 18 fileAsset = re.match(r.*fileAssetId\":\"([^\"]+)\".*, regex).group(1) ^ SyntaxError: invalid syntax
    – iHaag
    Commented Nov 23, 2017 at 12:16
  • you are missing '
    – ezdazuzena
    Commented Nov 23, 2017 at 12:22
  • Thank you, that's better but errors out. return _compile(pattern, flags).match(string) TypeError: expected string or buffer
    – iHaag
    Commented Nov 23, 2017 at 12:26
  • 1
    @KhaledAhmedSobhy [^\"]+ matches at least one character that is not ", so int and float will be matched as well. Though, you might want to cast it to an int or float once matched.
    – ezdazuzena
    Commented Nov 25, 2019 at 8:09
0

Try adding \n to the string that you are writing to the file (\n means new line).

0

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:

import regex  # third-party module: https://pypi.org/project/regex/

json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d+)?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)

json_regex = regex.compile(json_pattern)

# json_document_text is the string to check
match = json_regex.match(json_document_text)

You can change the last line of json_pattern to match individual objects rather than a whole document by replacing (?&document) with (?&object). I think the regex is simpler than I expected, but I have not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issues when running it.
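
Since the pattern has not been tested extensively, one way to cross-check it is to compare its verdict against the standard library parser on the same inputs. The helper below is only a sketch of that idea; the name is_valid_json_document is made up here and it is not part of the regex approach itself.

```python
import json

def is_valid_json_document(text):
    """Return True if text parses as a JSON object or array with the stdlib parser."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    # Mirror the pattern's "document" rule: only objects and arrays count.
    return isinstance(parsed, (dict, list))

samples = ['{"a": [1, 2.50, true, null]}', '[]', '{"a": }', '"just a string"']
results = [is_valid_json_document(s) for s in samples]
print(results)
```

Note that json.loads is slightly more lenient than the JSON standard by default (it accepts NaN and Infinity, for example), so a disagreement between the two does not automatically mean the pattern is wrong.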
