Parsing text files with "magic" values

Question

Background

I have some large text files used in an automation script for audio tuning. Each line in the text file looks roughly like:

A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]] BANANA # BANANA

The text gets fed to an old command-line program which searches for keywords, and swaps them out. Sample output would be:

A[0] + B[100] - C[0x1000] [[0]] 0 # 0
A[2] + B[200] - C[0x100A] [[2]] 0 # 0

Problem

Sometimes the text files contain keywords that are meant to be left untouched (i.e. cases where we don't want "BANANA" substituted). I'd like to modify the text files to use some kind of keyword or delimiter that is unlikely to appear in normal text, for example:

A[#1] + B[#2] - C[#3] [[#1]] #1 # #1

Question

Does Python's text file parser have any special indexing or escape sequences I could use instead of simple keywords?

You don't mean that all of the input lines look like that? Do you have a grammar for the input lines? — Bill Bell, Commented Jan 31, 2018 at 18:41
This is a strange post, almost confusing. What is this, like mail merge? A simple key/value pair that needs to be swapped out? It's pretty simple, why is it confusing me ? — user557597, Commented Jan 31, 2018 at 19:08
If you don't want "BANANA" substituted for, then why would you substitute in #1? — President James K. Polk, Commented Jan 31, 2018 at 21:05

Jean-François Fabre · Accepted Answer · 2018-01-31 18:58:51Z


Use a regular expression replacement function with a dictionary.

Match everything between a pair of brackets (a negated character class keeps the match from spanning other brackets) and replace it with the value from the dict; if the key is not found, keep the original text:

import re

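# Each key appears once; a repeated key in a dict literal silently keeps only the last value.
d = {"PINEAPPLE": "20", "CHERRY": "100", "BANANA": "400"}
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"

# Raw string so the backslashes reach the regex engine unchanged.
print(re.sub(r"\[([^\[\]]*)\]", lambda m: "[{}]".format(d.get(m.group(1), m.group(1))), s))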

prints:

A[400] + B[20] - C[100] [[400]]
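
The same idea can be applied to many lines by compiling the pattern once and wrapping the lookup in a function. This is only a minimal sketch: the helper name `substitute_line`, the `VALUES` table and the second sample line are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical lookup table; real values would come from the tuning data.
VALUES = {"BANANA": "400", "PINEAPPLE": "20", "CHERRY": "100"}

# Compiled once: one bracket pair enclosing text that contains no brackets.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def substitute_line(line, table):
    """Replace each [KEY] with [table[KEY]]; unknown keys are left as they are."""
    def repl(match):
        key = match.group(1)
        return "[{}]".format(table.get(key, key))
    return BRACKETED.sub(repl, line)


lines = [
    "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]",
    "D[MANGO] + E[BANANA]",
]
results = [substitute_line(line, VALUES) for line in lines]
for result in results:
    print(result)
```

Because unknown keys fall through unchanged, a keyword that should stay untouched only needs to be absent from the table.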


Ajax1234 · Accepted Answer · 2018-01-31 18:45:35Z

You can use re.sub to perform the substitution. This answer creates a list of randomized values to demonstrate; the list can be replaced with the data you are using:

import re
import random
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
# Raw string so the backslashes reach the regex engine unchanged.
new_s = re.sub(r'(?<=\[)[a-zA-Z0-9]+(?=\])', '{}', s)
random_data = [[random.randint(1, 2000) for i in range(4)] for _ in range(10)]
final_results = [new_s.format(*i) for i in random_data]
for command in final_results:
    print(command)

Output:

A[51] + B[134] - C[864] [[1344]]
A[468] + B[1761] - C[1132] [[1927]]
A[1236] + B[34] - C[494] [[1009]]
A[1330] + B[1002] - C[1751] [[1813]]
A[936] + B[567] - C[393] [[560]]
A[1926] + B[936] - C[906] [[1596]]
A[1532] + B[1881] - C[871] [[1766]]
A[506] + B[1505] - C[1096] [[491]]
A[290] + B[1841] - C[664] [[38]]
A[1552] + B[501] - C[500] [[373]]

Lookarounds are much more "expensive"; rather, match the brackets and put them back in the substitution. See mine vs. yours: 27 vs. 81 steps, a third. — Jan, Commented Jan 31, 2018 at 18:51

Jan · Accepted Answer · 2018-01-31 19:44:57Z

Just use

\[([^][]+)\]

And replace this with the desired result, e.g. 123.

Broken down, this says

\[       # opening bracket
([^][]+) # capture anything not brackets, 1+ times
\]       # closing bracket
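
In Python, the commented breakdown above can be kept inside the pattern itself with the `re.VERBOSE` flag. A small sketch, assuming the same sample line used elsewhere in this thread; the names `BRACKETED`, `sample`, `replaced` and `keys` are illustrative:

```python
import re

# Same pattern as above; re.VERBOSE lets the comments live inside the regex.
BRACKETED = re.compile(
    r"""
    \[          # opening bracket
    ([^][]+)    # capture anything that is not a bracket, one or more times
    \]          # closing bracket
    """,
    re.VERBOSE,
)

sample = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
replaced = BRACKETED.sub("[123]", sample)  # brackets are matched, so put them back
keys = BRACKETED.findall(sample)           # the captured keywords, in order
print(replaced)
print(keys)
```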

See a demo on regex101.com.

For your changed requirements, you could use an OrderedDict:

import re
from collections import OrderedDict

rx = re.compile(r'\[([^][]+)\]')
d = OrderedDict()

def replacer(match):
    item = match.group(1)
    d[item] = 1
    return '[#{}]'.format(list(d.keys()).index(item) + 1)

string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
string = rx.sub(replacer, string)
print(string)

Which yields

A[#1] + B[#2] - C[#3] [[#1]]

The idea here is to put every (potentially) new item in the dict, then look up its index. An OrderedDict remembers the order in which entries were inserted.
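
A variation on the same idea, offered as a sketch rather than as the answer's own method: store the 1-based index as the dict value, so no `list(...).index(...)` scan is needed on each match. The function name `number_keywords` and the returned mapping are assumptions made for illustration.

```python
import re

BRACKETED = re.compile(r"\[([^][]+)\]")


def number_keywords(text):
    """Replace each [KEY] with [#n], where n is the order of first appearance."""
    seen = {}  # keyword -> 1-based index assigned on first sight

    def repl(match):
        # setdefault stores len(seen) + 1 only if the keyword is not yet known,
        # and otherwise returns the index that was stored earlier.
        index = seen.setdefault(match.group(1), len(seen) + 1)
        return "[#{}]".format(index)

    return BRACKETED.sub(repl, text), seen


numbered, mapping = number_keywords("A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]")
print(numbered)
print(mapping)
```

Returning the mapping alongside the text makes it possible to translate the `#n` placeholders back to keywords later.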

For the sake of academic completeness, you could do it all on your own as well:

import re

class Replacer:
    rx = re.compile(r'\[([^][]+)\]')

    def __init__(self):
        # Per-instance list, so separate Replacer objects do not share keywords.
        self.keywords = []

    def do_replace(self, match):
        idx = self.lookup(match.group(1))
        return '[#{}]'.format(idx + 1)

    def replace(self, string):
        return self.rx.sub(self.do_replace, string)

    def lookup(self, item):
        for idx, key in enumerate(self.keywords):
            if key == item:
                return idx

        self.keywords.append(item)
        return len(self.keywords)-1

string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"

rpl = Replacer()
string = rpl.replace(string)
print(string)

Bill Bell · Accepted Answer · 2018-01-31 19:54:32Z

This can also be done using pyparsing.

This parser defines noun as the uppercase word inside square brackets, and complete as a sequence of such bracketed items making up one line of input.

To replace the identified items with other things, define a class derived from dict so that any key not in the dict is returned unchanged.

>>> import pyparsing as pp
>>> noun = pp.Word(pp.alphas.upper())
>>> between = pp.CharsNotIn('[]')
>>> leftbrackets = pp.OneOrMore('[')
>>> rightbrackets = pp.OneOrMore(']')
>>> stmt = 'A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]'
>>> one = between + leftbrackets + noun + rightbrackets
>>> complete = pp.OneOrMore(one)
>>> complete.parseString(stmt)
(['A', '[', 'BANANA', ']', ' + B', '[', 'PINEAPPLE', ']', ' - C', '[', 'CHERRY', ']', ' ', '[', '[', 'BANANA', ']', ']'], {})
>>> class Replace(dict):
...     def __missing__(self, key):
...         return key
...     
>>> replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
>>> new = []
>>> for item in complete.parseString(stmt).asList():
...     new.append(replace[item])
... 
>>> ''.join(new)
'A[1] + B[2] - C[CHERRY] [[1]]'
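
For comparison, the same tokenize-then-look-up approach can be sketched with only the standard library, swapping pyparsing for `re.split`. This is an illustrative alternative, not the method used above; the variable names `tokens` and `new` are assumptions.

```python
import re


class Replace(dict):
    """A dict that returns the key itself when the key is missing."""
    def __missing__(self, key):
        return key


replace = Replace({"BANANA": "1", "PINEAPPLE": "2"})
stmt = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"

# The capturing group makes re.split keep each bracket as its own token.
tokens = re.split(r"([\[\]])", stmt)

# Every token is looked up; anything not in the dict comes back unchanged.
new = "".join(replace[token] for token in tokens)
print(new)
```

Note that every token is looked up, not only the ones between brackets, so text outside brackets that exactly equals a key would also be replaced.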

Bill Bell · Accepted Answer · 2018-02-01 23:36:02Z

I think it's easier, and clearer, using plex. The snag is that it appears to be available only for Py2. It took me an hour or two of conversion work to get it running on Py3.

There are just three types of tokens to watch for, and a similar number of branches within a while statement.

from plex import *
from io import StringIO

stmt = StringIO('A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]')

lexicon = Lexicon([
    (Rep1(AnyBut('[]')), 'not_brackets'),
    (Str('['), 'left_bracket'),
    (Str(']'), 'right_bracket'),
])

class Replace(dict):
    def __missing__(self, key):
        return key

replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})

scanner = Scanner(lexicon, stmt)
new_statement = []
while True:
    token = scanner.read()
    if token[0] is None:
        break
    elif token[0] == 'not_brackets':  # must match the token name used in the lexicon
        new_statement.append(replace[token[1]])
    else:
        new_statement.append(token[1])

print(''.join(new_statement))

Result:

A[1] + B[2] - C[CHERRY] [[1]]
