1

I am relatively inexperienced at Python, and I've hit a wall using it to clean some text data into a usable format.

Essentially, I have paired names and values separated by a variable number of periods. This feature of the text is thankfully regular, but the surrounding format varies a lot: there can be multiple (name, value) pairs on a single line, there can be additional useless text on any given line (and this "useless text" can include any characters, not just alphabetic ones), there can be entire lines with no useful data, etc.

An example of what the data looks like follows

string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'

Just to make the string easier for you to see on StackOverflow, this is what "the data" looks like visually, splitting on the newlines

apples, red .... 0.15 apples, green ... 0.99
bananas (bunch).......... 0.111
fruit salad, small........1.35 [unwanted stuff #1.11 here]
unwanted line here
fruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here

Another lucky feature of the string is that "unwanted text" will always follow the values and be at the end of the line. I do not need to worry about unwanted text being next to the (name).

At the end of the day, I want to get

apples, red | 0.15
apples, green | 0.99
bananas (bunch) | 0.111
fruit salad, small | 1.35
fruit salad, large | 1.77
strawberry | 0.66

or something similar that can be loaded into R, excel, etc.

I have tried using split and regular expressions by splitting on the variable number of periods, but I'm struggling to write an expression that gives me what I want. For example, I tried

string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'

import re

text = re.split(r"\.{3,}|\n", string)
print(text)

which splits on either a newline or 3+ periods and gives

['apples, red ', ' 0.15 apples, green ', ' 0.99', 'bananas (bunch)', ' 0.111', 'fruit salad, small', '1.35 [unwanted stuff #1.11 here]', 'unwanted line here', 'fruit salad, large ', ' 1.77 strawberry ', ' 0.66 unwanted 00-11info here'] 

which is close, but the problems with this solution are:

(1) Each element in the list is not a correct (name, value) pair, as the split occurs between the (name) and (value) elements. E.g., the 0.15 should be associated with "apples, red", but instead it shares the list element with the subsequent "apples, green".

(2) There is some additional unnecessary text hanging about after some of the values. I could probably brute force some additional post-processing, but I feel like there should be a more elegant solution given the regularity of the string. I.e., there should be some regex out there that can look for "alphabetic characters" followed by "3 or more periods" followed by "number", with any additional text following the "number" being tossed out as useless.

Any help would be much appreciated. Thank you!

3
  • Is it normal that some \n are missing? And what about [unwanted stuff #1.11 here]\nunwanted line here? How can unwanted text or rows be identified? Commented May 26 at 16:44
  • 2
    It'd be great to see a minimal reproducible example of what you've already tried and why it doesn't meet your requirements. Commented May 26 at 16:54
  • Maybe use a lookahead assertion, something like r"(?=\d\.\d+)", together with some cleaning pre-processing. Commented May 26 at 17:03

6 Answers

1

Try matching on:

([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+\.\d+)

See: regex101

Then you can just join the strings:

import re

string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'

# the decimal point is escaped (\.) so it only matches a literal dot
pattern = r"([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+\.\d+)"

matches = re.findall(pattern, string)

print("\n".join(" | ".join(m.strip() for m in pair) for pair in matches))

Explanation

  • ( ... ): capture into group 1
    • [^\d.\n]+: one or more characters that are neither a digit, a dot, nor a line break.

Between the item and the value you find

  • [^\S\n]*: zero or more whitespace characters other than a newline
  • \.{3,}: followed by at least 3 dots
  • [^\S\n]*: again, zero or more non-newline whitespace characters.

Then match the amount

  • ( ... ): and capture into group 2
    • \d+\.\d+: one or more digits, a literal dot (escaped as \., because a bare . would match any character), and one or more digits
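For readers who find the one-line pattern hard to scan, here is a minimal sketch of the same structure written with re.VERBOSE so each piece carries its own comment. The names PAIR, sample, and pairs are only illustrative, and the decimal point is escaped here, which is an assumption about what the pattern is meant to do.

```python
import re

# Same structure as the pattern explained above, spelled out with re.VERBOSE.
# The decimal point is escaped so it only matches a literal dot.
PAIR = re.compile(
    r"""
    ([^\d.\n]+)     # group 1: the name (no digits, dots, or newlines)
    [^\S\n]*        # optional whitespace, but never a newline
    \.{3,}          # three or more dots
    [^\S\n]*        # optional whitespace again
    (\d+\.\d+)      # group 2: digits, a literal dot, digits
    """,
    re.VERBOSE,
)

sample = "apples, red .... 0.15 apples, green ... 0.99\nunwanted line here"
pairs = [(name.strip(), value) for name, value in PAIR.findall(sample)]
print(pairs)
```

The stripping is still done in Python afterwards, because group 1 is greedy and picks up the spaces around the name.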

1 Comment

Thank you very much, especially for the detailed explanation on the regex! I did not consider using this "findall" method, and it looks like it successfully worked on a small subsample of my text data.
1

First of all, I would not use split here, as in this case it is a bit easier to match what you want to keep (in two capture groups) than what you want to exclude.

The first capture group could just take anything (trimmed of leading and trailing white space) before a run of at least two dots, and the second capture group could require the number format.

Here is one way of doing it:

import re

regex = r"\s*(.+?) *\.{2,} *(\d+(?:\.\d*)?)"
result = "\n".join(" | ".join(pair) for pair in re.findall(regex, string))

This sets the following string to result:

apples, red | 0.15
apples, green | 0.99
bananas (bunch) | 0.111
fruit salad, small | 1.35
fruit salad, large | 1.77
strawberry | 0.66
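Since the question asks for something that loads into R or Excel, one possible follow-up is to write the pairs out as CSV. This is only a sketch: the header names and the in-memory buffer (standing in for a real file opened with newline="") are assumptions, and the sample is shortened. The csv module quotes the names that contain commas, which a hand-joined string would not do.

```python
import csv
import io
import re

string = "apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111"
regex = r"\s*(.+?) *\.{2,} *(\d+(?:\.\d*)?)"

buffer = io.StringIO()  # stand-in for open("some_file.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["name", "value"])           # header names are an assumption
writer.writerows(re.findall(regex, string))  # one row per (name, value) pair
csv_text = buffer.getvalue()
print(csv_text)
```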

Comments

0

You can probably collect these with a regular expression, but you want to be careful not to exclude too much!

>>> for line in string.splitlines():
...     print(re.findall(r"([a-z][a-z\s]+,\s\w+)\s*\.+\s*([\d\.]+)", line))
... 
[('apples, red', '0.15'), ('apples, green', '0.99')]
[]
[('fruit salad, small', '1.35')]
[]
[('fruit salad, large', '1.77')]
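One rough way to notice that a pattern is excluding too much is to compare, per line, the number of matches with the number of dot runs. The sketch below assumes that every run of three or more dots marks one pair, which holds for the sample but may not hold for real data; the names narrow and suspicious are just illustrative.

```python
import re

string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'
narrow = r"([a-z][a-z\s]+,\s\w+)\s*\.+\s*([\d\.]+)"

suspicious = []
for line in string.splitlines():
    found = re.findall(narrow, line)
    leaders = re.findall(r"\.{3,}", line)  # assumed: one dot run per pair
    if len(found) < len(leaders):
        suspicious.append(line)            # some pair on this line was missed

for line in suspicious:
    print("check:", line)
```

On the sample this flags the bananas line and the line containing strawberry, which are exactly the pairs missing from the output above.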

Otherwise (and especially if you have a line-oriented file), a stack or a latching/categorizing function may solve this:

content = """\
[header]
foo = bar
zap = baz
[header2]
...
"""

position = 0
values = []
for line in content.splitlines():  # in a real example iterate on open() result
    if line.startswith("["):  # can be a more complex check
        # it's a header!
        position = 0  # reset some position if using
        # finalize the previous block (this is a new one)
        ...
        # increment position or whatever else
        continue
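    elif "=" in line:  # later conditions or perhaps just position >= 1 (this "=" check is only a placeholder)
        # parse non-header line
        key, _, value = line.partition("=")
        values.append((key.strip(), value.strip()))
    else:
        # opportunity to complain about malformed lines
        print(f"malformed line: {line!r}")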

Comments

0

Although you probably should use a regular expression, it can be done (given the sample data) without the re module.

The idea is to replace every period that is not both preceded and followed by a digit with a space.

You can then split the resulting string on any/all whitespace.

There seems to be a pattern that indicates that a floating point number will be preceded by, at most, 3 other tokens.

Therefore:

string = "apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here"

output = ""

for i, c in enumerate(string):
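    if c == ".":
        # keep a period only between two digits; the bounds check avoids an IndexError on a trailing period
        if 0 < i < len(string) - 1 and string[i - 1].isdecimal() and string[i + 1].isdecimal():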
            output += c
        else:
            output += " "
    else:
        output += c

stack = []

for token in output.split():
    try:
        float(token)
        print(*stack[-3:], "|", token)
        stack.clear()
    except ValueError:
        stack.append(token)

Output:

apples, red | 0.15
apples, green | 0.99
bananas (bunch) | 0.111
fruit salad, small | 1.35
fruit salad, large | 1.77
strawberry | 0.66

Note:

While this produces the required output from the given sample data, it may not be a general solution.

Comments

0

@Ramrab's solution is creative, and avoids regular expressions, but it breaks just by removing the hash (#) symbol from #1.11. But that's just the nature of cleaning unstructured text.
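To make that concrete, here is a small sketch that wraps the regex-free approach in a function (token_pairs is a name made up for this example, and the body is a condensed restatement, not the original code) and runs it on one sample line with and without the hash.

```python
def token_pairs(text):
    """Regex-free extraction, condensed from the approach described above."""
    chars = []
    for i, c in enumerate(text):
        between_digits = (
            0 < i < len(text) - 1
            and text[i - 1].isdecimal()
            and text[i + 1].isdecimal()
        )
        # keep decimal points, turn every other period into a space
        chars.append(c if c != "." or between_digits else " ")
    pairs, stack = [], []
    for token in "".join(chars).split():
        try:
            float(token)
        except ValueError:
            stack.append(token)  # not a number: part of a name or junk
        else:
            pairs.append((" ".join(stack[-3:]), token))
            stack.clear()
    return pairs

line = "fruit salad, small........1.35 [unwanted stuff #1.11 here]"
with_hash = token_pairs(line)
without_hash = token_pairs(line.replace("#", ""))
print(with_hash)
print(without_hash)
```

With the hash in place, "#1.11" is not a valid float and gets ignored. Without it, "1.11" parses as a number and the preceding junk tokens are reported as a name.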

You say,

there should be some regex out there that can look for "alphabetic characters" followed by "3 or more periods" followed by "number"

But according to your desired result:

  • You want more than just "alphabetic characters": the names also contain commas, spaces, and parentheses.
  • According to your example, the "3 or more periods" may or may not be surrounded by whitespace.
  • It's unclear what you mean by "number". Should they all be in decimal format (#.##)?

It's hard to manage regular expressions unless you have some definite pattern to work with.

Assuming your example sets the pattern, this

regex = r'(.*?) *\.{3,} *(\d+\.\d+)'
results = re.findall(regex, string)

for item in results:
    print(f'{item[0].strip()} | {item[1].strip()}')

will get you what you want. But, even then, if there is some unwanted data that fits that pattern (e.g., [unwanted stuff ... 1.11 here]), it'll still break.
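A quick sketch of that failure, plus one possible guard. The guard (a genuine name starts with a letter) is purely an assumption for illustration and is not something the question guarantees; real data may need a different rule.

```python
import re

regex = r"(.*?) *\.{3,} *(\d+\.\d+)"
tricky = "fruit salad, small........1.35 [unwanted stuff ... 1.11 here]"

raw = [(name.strip(), value) for name, value in re.findall(regex, tricky)]
# Assumed rule, not from the question: a genuine name starts with a letter.
filtered = [(name, value) for name, value in raw if name[:1].isalpha()]

print(raw)
print(filtered)
```

The unfiltered list contains the bogus pair built from the bracketed text; the filter drops it only because that text happens to start with "[".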

Comments

0

Answer Updated:

Use re.finditer() with the re.IGNORECASE flag to find all the matches, then use m.group('group_name') to extract the matched groups for output. I used named groups in this example. (re.MULTILINE is also passed below, but it only changes how ^ and $ behave, and this pattern uses neither.)

###PYTHON (Updated Regex Pattern - raw string notation removed from string value)
import re

string = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'

pattern = r'(?P<fruit>[a-z][a-z (]+[a-z)](?:,[ ]*[a-z]+)?)[ .]*(?P<amount>\d+\.\d+)'

matches = re.finditer(pattern, string, re.IGNORECASE | re.MULTILINE)

for m in matches: 
    print(f"{m.group('fruit')} | {m.group('amount')}")

     

OUTPUT:

apples, red | 0.15
apples, green | 0.99
bananas (bunch) | 0.111
fruit salad, small | 1.35
fruit salad, large | 1.77
strawberry | 0.66

REGEX PATTERN (Python re flavor)(Updated):

(?P<fruit>[a-z][a-z (]+[a-z)](?:,[ ]*[a-z]+)?)[ .]*(?P<amount>\d+\.\d+)

Regex demo: https://regex101.com/r/TT8hKa/4 (Updated)

REGEX PATTERN NOTES (Updated):

  • (?P<fruit>[a-z][a-z (]+[a-z)](?:,[ ]*[a-z]+)?) Named group. Match and capture the 'fruit pattern'. This pattern makes sure the name starts with a letter and ends in a letter or a closing parenthesis. Note that the non-capturing group (?:...) is made optional by the ? at the end.
  • [ .]* Then match, but do not capture, 0 or more (*) literal spaces or literal dots.
  • (?P<amount>\d+\.\d+) Named group. Then, match and capture the amount that consists of 1 or more (+) digits, followed by a literal dot, followed by one or more digits.
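Because the groups are named, each match can also be turned into a dictionary with m.groupdict(), which may be handier than printing if the values are going into further processing. A minimal sketch on a shortened sample; converting the amount to float is an extra step added here, not part of the answer above.

```python
import re

pattern = re.compile(
    r"(?P<fruit>[a-z][a-z (]+[a-z)](?:,[ ]*[a-z]+)?)[ .]*(?P<amount>\d+\.\d+)",
    re.IGNORECASE,
)
sample = "apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111"

records = []
for m in pattern.finditer(sample):
    row = m.groupdict()                   # keys are the group names
    row["amount"] = float(row["amount"])  # numeric conversion is an added step
    records.append(row)

print(records)
```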

2 Comments

Based on the question, the input string is not a raw string, so the "\n" is not a literal backslash followed by n but a line break.
Thank you, @DuesserBaest! I have updated the regex pattern and removed the raw-notation from the 'string' value. So, much better now. TY! : ) : )
