
Given an unknown string of unknown size, e.g. a ScriptBlock expression or something like:

$Text = @'
LOREM IPSUM

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
'@

I would like to summarize the string on a single line (replacing each run of consecutive whitespace with a single space) and truncate it to a specific $Length:

$Length = 32
$Text = $Text -Replace '\s+', ' '
if ($Text.Length -gt $Length) { $Text = $Text.SubString(0, $Length) }
$Text
LOREM IPSUM Lorem Ipsum is simpl

The issue is that this is inefficient for a large string: it replaces the whitespace in the whole $Text string, whereas I only need to replace whitespace until I have a string of the required size ($Length = 32).
Swapping the -replace and SubString operations isn't an option either, as that could return fewer characters than required, or even a single space for any $Text string that starts with 32 or more whitespace characters.
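
To make the problem with swapping the two operations concrete, here is a minimal sketch. The sample input (40 leading spaces) is an assumption chosen purely for illustration:

```powershell
$Length = 32
$Text   = (' ' * 40) + 'Lorem Ipsum'   # assumed sample input: 40 leading spaces

# Truncate first, then collapse: the cut contains nothing but whitespace
$Swapped = $Text.Substring(0, $Length) -replace '\s+', ' '

# Collapse first: the words survive
$Collapsed = $Text -replace '\s+', ' '

"[$Swapped]"     # [ ]
"[$Collapsed]"   # [ Lorem Ipsum]
```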

Question:
How can I efficiently merge the two operations (-replace and SubString) so that I don't replace more whitespace than necessary and still get a string of the required length (when the $Text string is longer than the required length)?


Update

I think I am close by using a MatchEvaluator Delegate:

$Length = 8
$TotalSpaces = 0
$Delegate = {
    if ($Args[0].Index - $TotalSpaces -gt $Length) {
        '{break}'
        ([Ref]$TotalSpaces).Value = [int]::MaxValue
    }
    else { ([Ref]$TotalSpaces).Value += $Args[0].Value.Length }
}
[regex]::Replace('test 0 1 2 3 4 5 6 7 8 9', '\s+', $Delegate)
test01234{break}56789

Now the question is: how can I break out of the regex processing at the {break}?
Note that for performance reasons I really want to break out, rather than substitute each remaining match with itself (which would only make it look like the processing stopped).

  • 1
    I don't have proper access to a computer right now, but if you're processing a string of arbitrary length you might want to look at the Match(String, Int32) overload of the Regex class - see learn.microsoft.com/en-us/dotnet/api/… - and just pull everything up to the first whitespace match off the front of the string, then call it again starting from after the current match until you've got the desired length of output... Commented Jul 24, 2024 at 10:46
  • 1
    Or even the Match(String, Int32, Int32) overload- learn.microsoft.com/en-us/dotnet/api/… - which does the same thing but stops after a given number of input characters have been read, so you don't process the entire string if there's no more whitespace... Commented Jul 24, 2024 at 10:50
  • 1
    Np. Note that might still incur an expensive string copy into its return value for the remaining portion of the string that it leaves “un-replaced” though. Depends on what performance you need as to whether it’s worth rolling something more complicated or not. Commented Jul 24, 2024 at 10:58
  • 1
    you can't break out of the regex matchevaluator, the best you can do there is very similar to what you would do in a ForEach-Object, if ($imDoneReplacingCondition) { return } Commented Jul 24, 2024 at 13:02
  • 1
    I tried a solution that uses Match(String, Int32, Int32) and the performance is strangely bad if the input string is long - it's like it's doing something to scan the entire string regardless of how short the search window is. Try $regex = [regex] "\s+"; $text = "LOREM IPSUM" * 2000; $regex.Match($text, 0, 100) for example and then change * 2000 to * 20000000. @SantiagoSquarzon's manual parsing performs much better on very long strings... Commented Jul 24, 2024 at 13:55
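
The Match(String, Int32) idea from the first comment could be sketched roughly as follows. The function name Get-CollapsedPrefix is hypothetical, and this is only an illustration of the loop, not a measured solution: the last comment reports poor performance for the three-argument overload on very long input, and whether the two-argument overload behaves better has not been checked here.

```powershell
# Hypothetical helper: collapse whitespace runs only until $Length characters are collected
function Get-CollapsedPrefix {
    param([string]$Text, [int]$Length)

    $regex = [regex]'\s+'
    $sb    = [System.Text.StringBuilder]::new($Length)
    $pos   = 0

    while ($sb.Length -lt $Length -and $pos -lt $Text.Length) {
        # Find the next whitespace run at or after the current position
        $m   = $regex.Match($Text, $pos)
        $end = if ($m.Success) { $m.Index } else { $Text.Length }

        # Copy the non-whitespace segment, but never more than is still needed
        $take = [Math]::Min($end - $pos, $Length - $sb.Length)
        $null = $sb.Append($Text, $pos, $take)

        # Replace the whole run with one space if there is still room
        if ($m.Success -and $sb.Length -lt $Length) { $null = $sb.Append(' ') }

        $pos = if ($m.Success) { $m.Index + $m.Length } else { $Text.Length }
    }

    $sb.ToString()
}

Get-CollapsedPrefix -Text 'test 0 1 2 3 4 5 6 7 8 9' -Length 8   # test 0 1
```

The loop stops as soon as the builder holds $Length characters, so no match is requested beyond that point.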

3 Answers


Perhaps a more manual approach is faster than trying to do it with regex, although it's a lot more code.

$Text = @'
LOREM IPSUM
Lorem   Ipsum is
   simply dummy    text
'@

$Length = 32
$sb = [System.Text.StringBuilder]::new($Length)
$prevSpace = $false  # true while the previous character was whitespace

foreach ($char in $Text.GetEnumerator()) {
    if ($sb.Length -eq $Length) {
        break
    }

    if ([char]::IsWhiteSpace($char)) {
        if (-not $prevSpace) {
            $sb = $sb.Append(' ')
        }

        $prevSpace = $true
        continue
    }

    $sb = $sb.Append($char)
    $prevSpace = $false
}

$sb.ToString()

A very similar approach using String.Create would probably be even faster, but it would need to be pre-compiled or added with Add-Type. You can find an example here.


7 Comments

Nice. As a performance tweak you could keep track of the first non-whitespace character's position in a run and then use substring to copy the whole run rather than character by character in every iteration - it's swings and roundabouts though as it makes the code a bit more complicated, but if $Length is large it might be worth it...
Also, the whitespace class - \s - has a few extra characters in it - learn.microsoft.com/en-us/dotnet/standard/base-types/…, if that matters to the OP...
@mclayton not sure using substring would be a performance boost since it's already processing char by char and appending to a stringbuilder is already very efficient. substring would have to iterate over the original string once again when this is already doing that. as for the rest of whitespace character, i.e. a tab, I'm not sure how iRon wants to deal with those. will wait for his feedback.
Nice, came here to suggest the same ^_^ FWIW you can simplify the branching logic with a single elseif([char]::IsWhiteSpace()){ if(!$prevSpace){$sb.Append(' ')} $prevSpace = $true } block in the middle
that's much better @MathiasR.Jessen ! thanks
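
The simplified branching suggested in the comments might look like the sketch below. The wrapper function and its name, Get-TruncatedSingleLine, are hypothetical and only there to make the snippet easy to call. The logic is the same character-by-character loop as in the answer.

```powershell
# Hypothetical wrapper around the loop from the answer, using a single if/elseif/else chain
function Get-TruncatedSingleLine {
    param([string]$Text, [int]$Length)

    $sb        = [System.Text.StringBuilder]::new($Length)
    $prevSpace = $false   # true while inside a run of whitespace

    foreach ($char in $Text.GetEnumerator()) {
        if ($sb.Length -eq $Length) {
            break
        }
        elseif ([char]::IsWhiteSpace($char)) {
            # Emit one space for the first whitespace character of a run only
            if (-not $prevSpace) { $null = $sb.Append(' ') }
            $prevSpace = $true
        }
        else {
            $null = $sb.Append($char)
            $prevSpace = $false
        }
    }

    $sb.ToString()
}

Get-TruncatedSingleLine -Text "LOREM IPSUM`nLorem   Ipsum is`n   simply dummy    text" -Length 32
# LOREM IPSUM Lorem Ipsum is simpl
```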

Benchmark:

$Length = 32

$Sizes = 50, 100, 200, 400, 800, 1600 # words
$Strings = @(
    foreach ($Size in $Sizes) {
        -Join @('Word ') * $Size
    }
)

$Iterations = 1000

@(
    $Results = [Ordered]@{ Name = 'Question' }
    for ($i = 1; $i -le $Iterations; $i++) {
        foreach ($String in $Strings) {
            $Results["$($String.Length)"] += (Measure-Command {
                $Text = $String -Replace '\s+', ' '
                if ($Text.Length -gt $Length) { $Text = $Text.SubString(0, $Length) }
                $Void = $Text
            }).TotalMilliseconds
        }
    }
    $Void.Length | Should -be $Length
    [PSCustomObject]$Results

    $Results = [Ordered]@{ Name = 'Santiago' }
    for ($i = 1; $i -le $Iterations; $i++) {
        foreach ($String in $Strings) {
            $Results["$($String.Length)"] += (Measure-Command {
                $sb = [System.Text.StringBuilder]::new($Length)

                foreach ($char in $String.GetEnumerator()) {
                    if ($sb.Length -eq $Length) {
                        break
                    }

                    if ([char]::IsWhiteSpace($char)) {
                        if (-not $prevSpace) {
                            $sb = $sb.Append(' ')
                        }

                        $prevSpace = $true
                        continue
                    }

                    $sb = $sb.Append($char)
                    $prevSpace = $false
                }

                $Void = $sb.ToString()
            }).TotalMilliseconds
        }
    }
    $Void.Length | Should -be $Length
    [PSCustomObject]$Results
) | Format-Table
Name       250   500  1000  2000  4000   8000
----       ---   ---  ----  ----  ----   ----
Question 28.15 12.18 20.19 35.81 68.50 129.03
Santiago 79.05 57.73 54.25 55.33 54.73  54.12


You cannot.
Your best bet for improving efficiency (and I'm not sure by how much) is to first cut the original string down to a substring: you already know you are going to reduce its size anyway, so there is no reason to process a 10MB file if you only end up needing the first 100kB.

Something like ($Text.Substring(0, $Length * 2) -replace '\s+', ' ').Substring(0, $Length)
I've used $Length * 2, but you can use any factor you want, depending on how many consecutive spaces you realistically expect in the original (sub)string.
I'm guessing anything from $Length * 1.25 upwards should be enough.
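
A fixed factor can come up short when the input has long whitespace runs, and Substring throws when the requested range exceeds the string. One way to guard against both is to grow the window until enough characters are available. The sketch below is an assumption layered on top of this answer, not part of it: the function name Get-SummaryByWindow and the doubling strategy are hypothetical.

```powershell
# Hypothetical helper: collapse a growing prefix until it yields $Length characters
function Get-SummaryByWindow {
    param([string]$Text, [int]$Length)

    $window = $Length * 2   # initial guess, as in the answer
    do {
        # Never ask for more characters than the string holds
        $size   = [Math]::Min($window, $Text.Length)
        $result = $Text.Substring(0, $size) -replace '\s+', ' '
        $window *= 2        # widen the window for the next attempt
    } while ($result.Length -lt $Length -and $size -lt $Text.Length)

    if ($result.Length -gt $Length) { $result.Substring(0, $Length) } else { $result }
}

Get-SummaryByWindow -Text ((' ' * 100) + 'abc') -Length 3   # one space followed by 'ab'
```

In the common case the first window is enough, so only about 2 * $Length characters are scanned. The loop only repeats for whitespace-heavy input.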
