7

I have a PowerShell script whose main purpose is to search through HTML files within a folder, find specific HTML markup, and replace it with what I tell it to.

I have been able to do 3/4 of my find and replaces perfectly. The one I am having trouble with involves a Regular Expression.

This is the markup that I am trying to make my regex find and replace:

<a href="programsactivities_skating.html"><br />
                                           </a>

Here is the regex I have so far, along with the function I am using it in:

automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>)' -replace ''

And here is the automate function:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        $text = Get-Content $file
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}

I have been trying to figure out the solution to this for about 2 days now, and just can't seem to get it to work. I have determined that the problem is that my regex needs to match across multiple lines, and that's what I'm having trouble with.

Any help anyone can provide is greatly appreciated.

Thanks in Advance.

3 Answers

20

Get-Content produces an array of strings, where each string contains a single line from your input file, so you won't be able to match text passages spanning more than one line. You need to merge the array into a single string if you want to be able to match more than one line:

$text = Get-Content $file | Out-String

or

[String]$text = Get-Content $file

or

$text = [IO.File]::ReadAllText($file)

Note that the first and second methods don't preserve the line breaks of the input file. Method 2 replaces every line break with a space, as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.
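
To make the difference between the three methods concrete, here is a minimal sketch that writes a small sample file and reads it back each way. The temp-file name, the sample markup, and the variable names are made up for this illustration; the pattern is the one from the question.

```powershell
# Sample file with Unix (LF) line endings; path and content are hypothetical demo values.
$file = Join-Path ([IO.Path]::GetTempPath()) 'linebreak-demo.html'
$original = "<a href=`"page.html`"><br />`n    </a>"
[IO.File]::WriteAllText($file, $original)

# Method 1: lines are rejoined with the platform newline, one after every line.
$viaOutString = Get-Content $file | Out-String

# Method 2: lines are joined with $OFS (a single space unless $OFS was changed).
[String]$viaCast = Get-Content $file

# Method 3: the file content exactly as stored on disk.
$viaReadAll = [IO.File]::ReadAllText($file)

# The pattern from the question, tested against each single-string result.
$query = '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>)'
$viaOutString -match $query
$viaCast -match $query
$viaReadAll -match $query

Remove-Item $file
```

Since `\s` matches spaces as well as line breaks, the pattern can match in all three cases; the methods differ only in what the rest of the file looks like when it is written back.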

1
  • 7
    Or if you're on V3 or greater $text = Get-Content $file -raw. BTW be careful with that last example as it does NOT preserve line breaks.
    – Keith Hill
    Commented Feb 20, 2014 at 18:36
1

I don't get what it is you're trying to do with those Exclude elements, but I find multi-line regex is usually easier to construct in a here-string:

$text = @'
<a href="programsactivities_skating.html"><br />
                                       </a>
'@

$regex = @'
(?mis)<a href="programsactivities_skating.html"><br />
\s+?</a>
'@

$text -match $regex

True
-1

Get-Content returns an array of strings; you need to concatenate them into a single string:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
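        $text = ""
        # Append each line plus CRLF. Do not assign the pipeline result to $text:
        # the pipeline emits nothing, so the assignment would reset $text to $null.
        Get-Content $file | % { $text += $_ + "`r`n" }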
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}
1
  • Why not $text = (Get-Content $file) -join "`r`n" or, as mentioned above, $Text = Get-Content $file | Out-String?
    – dwarfsoft
    Commented Feb 3, 2015 at 2:27
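
For reference, the -join suggestion from the comment could be dropped into the question's function roughly as follows. This is a sketch rather than a tested drop-in replacement: it assumes CRLF is the line ending you want in the output, and it uses $file.FullName (not in the original) so the path does not depend on how the file object converts to a string.

```powershell
function automate($school, $query, $replace) {
    # -Include is matched case-insensitively, so two patterns cover all four spellings.
    $processFiles = Get-ChildItem -Path $school -Recurse -Include '*.html', '*.htm' -Exclude '*.bak'
    foreach ($file in $processFiles) {
        # Join the lines into one string so the pattern can span line breaks.
        $text = (Get-Content $file.FullName) -join "`r`n"
        $text = $text -replace $query, $replace
        $text | Out-File $file.FullName -Force -Encoding utf8
    }
}
```

It is called the same way as in the question, with -school, -query and -replace.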
