
Background

I have some large text files used in an automation script for audio tuning. Each line in the text file looks roughly like:

A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]] BANANA # BANANA

The text gets fed to an old command-line program which searches for keywords, and swaps them out. Sample output would be:

A[0] + B[100] - C[0x1000] [[0]] 0 # 0
A[2] + B[200] - C[0x100A] [[2]] 0 # 0

Problem

Sometimes the text files contain keywords that are meant to be left untouched (i.e. cases where we don't want "BANANA" substituted). I'd like to modify the text files to use some kind of keyword/delimiter that is unlikely to show up in normal text, e.g.:

A[#1] + B[#2] - C[#3] [[#1]] #1 # #1

Question

Does Python's text file parser have any special indexing/escape sequences I could use instead of simple keywords?

  • What's the desired output from this?
    – heemayl
    Commented Jan 31, 2018 at 18:36
  • You don't mean that all of the input lines look like that? Do you have a grammar for the input lines?
    – Bill Bell
    Commented Jan 31, 2018 at 18:41
  • This is a strange post, almost confusing. What is this, like mail merge? A simple key/value pair that needs to be swapped out? It's pretty simple, so why is it confusing me?
    – user557597
    Commented Jan 31, 2018 at 19:08
  • If you don't want "BANANA" substituted for, then why would you substitute in #1?
    Commented Jan 31, 2018 at 21:05

5 Answers


Use a regular expression replacement function with a dictionary.

Match everything between a pair of brackets (a negated character class keeps the brackets themselves out of the captured text) and replace it with the value from the dict, keeping the original text if the key is not found:

import re

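# each keyword appears once; a repeated key in a dict literal silently keeps only the last value
d = {"BANANA": "400", "PINEAPPLE": "20", "CHERRY": "100"}
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"

# raw string so the backslashes reach the regex engine unchanged
print(re.sub(r"\[([^\[\]]*)\]", lambda m: "[{}]".format(d.get(m.group(1), m.group(1))), s))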

prints:

A[400] + B[20] - C[100] [[400]]
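
The same idea can be wrapped in a small function to make the "leave it alone if it isn't in the dict" behaviour easier to see. This is only a sketch: the names `substitute` and `table` are made up for illustration, and the values are borrowed from the sample output in the question. CHERRY is deliberately missing from the table, so it passes through unchanged.

```python
import re

# Bracketed keyword: "[", then anything that is not a bracket, then "]".
BRACKETED = re.compile(r"\[([^\[\]]*)\]")

def substitute(line, table):
    """Replace [KEY] with [value]; keys missing from table are left as they are."""
    def repl(match):
        key = match.group(1)
        return "[{}]".format(table.get(key, key))
    return BRACKETED.sub(repl, line)

# Hypothetical lookup table; CHERRY is intentionally absent.
table = {"BANANA": "0", "PINEAPPLE": "100"}
line = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
print(substitute(line, table))  # A[0] + B[100] - C[CHERRY] [[0]]
```

Note that this only touches keywords inside brackets; a bare BANANA outside brackets is not matched by this pattern.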

You can use re.sub to perform the substitution. This answer creates a list of random values to demonstrate; the list can be replaced with the data you are using:

import re
import random

s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
# turn every bracketed keyword into a positional "{}" placeholder
new_s = re.sub(r'(?<=\[)[a-zA-Z0-9]+(?=\])', '{}', s)
# ten rows of four random values, one value per placeholder
random_data = [[random.randint(1, 2000) for _ in range(4)] for _ in range(10)]
final_results = [new_s.format(*row) for row in random_data]
for command in final_results:
    print(command)

Output:

A[51] + B[134] - C[864] [[1344]]
A[468] + B[1761] - C[1132] [[1927]]
A[1236] + B[34] - C[494] [[1009]]
A[1330] + B[1002] - C[1751] [[1813]]
A[936] + B[567] - C[393] [[560]]
A[1926] + B[936] - C[906] [[1596]]
A[1532] + B[1881] - C[871] [[1766]]
A[506] + B[1505] - C[1096] [[491]]
A[290] + B[1841] - C[664] [[38]]
A[1552] + B[501] - C[500] [[373]]
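
With real data in place of the random values it might look like the sketch below. The rows are taken from the bracketed part of the sample output in the question; the names `template` and `rows` are just for illustration. One caveat I'm fairly sure of: because this goes through str.format, any literal `{` or `}` already present in the input line would need to be doubled first.

```python
import re

s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"

# Replace each bracketed keyword with a positional placeholder.
template = re.sub(r"(?<=\[)[a-zA-Z0-9]+(?=\])", "{}", s)

# One tuple per output line, one value per placeholder, in order of appearance.
rows = [
    (0, 100, "0x1000", 0),
    (2, 200, "0x100A", 2),
]

commands = [template.format(*row) for row in rows]
for command in commands:
    print(command)
# A[0] + B[100] - C[0x1000] [[0]]
# A[2] + B[200] - C[0x100A] [[2]]
```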
  • Lookarounds are much more "expensive" - rather match the brackets and put them back again in the substitution. See mine vs. yours: 27 vs 81 steps, a third.
    – Jan
    Commented Jan 31, 2018 at 18:51

Just use

\[([^][]+)\]

and replace the match with the desired result, e.g. 123.


Broken down, this says

\[       # opening bracket
([^][]+) # capture anything not brackets, 1+ times
\]       # closing bracket

See a demo on regex101.com.
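
In Python, the breakdown above can be kept inside the pattern itself with re.VERBOSE. A minimal sketch, assuming you want to keep the brackets and only swap what is inside them (so the replacement is `[123]` rather than a bare `123`); the name `BRACKETED` is my own:

```python
import re

# Same pattern as above, with the explanation embedded via re.VERBOSE.
BRACKETED = re.compile(r"""
    \[          # opening bracket
    ([^][]+)    # capture anything that is not a bracket, 1+ times
    \]          # closing bracket
""", re.VERBOSE)

line = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
result = BRACKETED.sub("[123]", line)
print(result)  # A[123] + B[123] - C[123] [[123]]
```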


For your changed requirements, you could use an OrderedDict:

import re
from collections import OrderedDict

rx = re.compile(r'\[([^][]+)\]')
d = OrderedDict()

def replacer(match):
    item = match.group(1)
    d[item] = 1
    return '[#{}]'.format(list(d.keys()).index(item) + 1)

string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
string = rx.sub(replacer, string)
print(string)

Which yields

A[#1] + B[#2] - C[#3] [[#1]]

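The idea here is to put every (potentially) new item in the dict, then look up its position among the keys. An OrderedDict remembers the order in which keys were inserted, so the first keyword seen gets #1, the second #2, and so on.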
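
A possible variation, not what the code above does: store the number as the dict value, so no list has to be built and searched on every match. On Python 3.7+ a plain dict keeps insertion order as well. The names `number_keywords` and `seen` are invented for this sketch.

```python
import re

BRACKETED = re.compile(r"\[([^][]+)\]")

def number_keywords(line, seen):
    """Replace [KEY] with [#n], where n is the order in which KEY was first seen."""
    def replacer(match):
        # setdefault returns the stored number, or stores the next free one.
        index = seen.setdefault(match.group(1), len(seen) + 1)
        return "[#{}]".format(index)
    return BRACKETED.sub(replacer, line)

seen = {}
numbered = number_keywords("A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]", seen)
print(numbered)  # A[#1] + B[#2] - C[#3] [[#1]]
print(seen)      # {'BANANA': 1, 'PINEAPPLE': 2, 'CHERRY': 3}
```

Passing the same `seen` dict to later calls keeps the numbering consistent across all lines of a file.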


For the sake of academic completeness, you could do it all on your own as well:

import re

class Replacer:
    rx = re.compile(r'\[([^][]+)\]')

    def __init__(self):
        # per-instance list, so separate Replacer objects don't share keywords
        self.keywords = []

    def do_replace(self, match):
        idx = self.lookup(match.group(1))
        return '[#{}]'.format(idx + 1)

    def replace(self, string):
        return self.rx.sub(self.do_replace, string)

    def lookup(self, item):
        for idx, key in enumerate(self.keywords):
            if key == item:
                return idx

        self.keywords.append(item)
        return len(self.keywords)-1

string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"

rpl = Replacer()
string = rpl.replace(string)
print(string)

Can also be done using pyparsing.

This parser defines noun as the uppercase word inside square brackets, one as a bracketed noun preceded by any non-bracket text, and complete as a sequence of those, i.e. one full line of input.

To replace the identified items, define a class derived from dict whose __missing__ method returns the key itself, so that anything not in the mapping is left unchanged.

>>> import pyparsing as pp
>>> noun = pp.Word(pp.alphas.upper())
>>> between = pp.CharsNotIn('[]')
>>> leftbrackets = pp.OneOrMore('[')
>>> rightbrackets = pp.OneOrMore(']')
>>> stmt = 'A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]'
>>> one = between + leftbrackets + noun + rightbrackets
>>> complete = pp.OneOrMore(one)
>>> complete.parseString(stmt)
(['A', '[', 'BANANA', ']', ' + B', '[', 'PINEAPPLE', ']', ' - C', '[', 'CHERRY', ']', ' ', '[', '[', 'BANANA', ']', ']'], {})
>>> class Replace(dict):
...     def __missing__(self, key):
...         return key
...     
>>> replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
>>> new = []
>>> for item in complete.parseString(stmt).asList():
...     new.append(replace[item])
... 
>>> ''.join(new)
'A[1] + B[2] - C[CHERRY] [[1]]'
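
If installing pyparsing isn't an option, the dict-with-`__missing__` trick works with any tokenizer. Here is a rough standard-library sketch that swaps in re.split for the parser (my substitution, not part of the pyparsing approach): splitting on a captured bracket keeps the brackets as tokens, and the empty strings that re.split produces between adjacent brackets pass through harmlessly.

```python
import re

class Replace(dict):
    def __missing__(self, key):
        # Unknown tokens (brackets, operators, unlisted keywords) map to themselves.
        return key

replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
stmt = 'A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]'

# The capturing group makes re.split return the brackets as separate tokens.
tokens = re.split(r'([\[\]])', stmt)
new_stmt = ''.join(replace[token] for token in tokens)
print(new_stmt)  # A[1] + B[2] - C[CHERRY] [[1]]
```

As with the pyparsing version, every token is looked up, so a token outside brackets that happens to equal a key (a bare BANANA, say) would be replaced too.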

I think it's easier, and clearer, using plex. The snag is that it appears to be available only for Py2. It took me an hour or two of conversion work to get it running on Py3.

Just three types of tokens to watch for, then a similar number of branches within a while statement.

from plex import *
from io import StringIO

stmt = StringIO('A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]')

lexicon = Lexicon([
    (Rep1(AnyBut('[]')), 'not_brackets'),
    (Str('['), 'left_bracket'),
    (Str(']'), 'right_bracket'),
])

class Replace(dict):
    def __missing__(self, key):
        return key

replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})

scanner = Scanner(lexicon, stmt)
new_statement = []
while True:
    token = scanner.read()
    if token[0] is None:
        break
    elif token[0]=='not_brackets':  # must match the token name used in the lexicon
        new_statement.append(replace[token[1]])
    else:
        new_statement.append(token[1])

print(''.join(new_statement))

Result:

A[1] + B[2] - C[CHERRY] [[1]]