Extract text using word VBA regex then save to a variable as string

Question

I am trying to create code in Word VBA that will automatically save (as PDF) and name a document based on its content, which is in text and not fields. Luckily the formatting is standardized and I already know how to save it. I tested my regex elsewhere to make sure it pulls what I am looking for. The trouble is I need to extract the matched statement, convert it to a string, and save it to an object (so I have something to pass on to the code where it names the document).

The part of the document I need to match is below, from the start of "Program" through the end of the line and looks like:

Program: Program Name (abr)

and the regex I worked out for this is "Program:[^\n]"

The code I have so far is below, but I don't know how to execute the regex in the active document, convert the output to a string and save to an object:

Sub RegExProgram()

Dim regEx
Dim pattern As String

Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = False
regEx.pattern = "Program\:[^\n]"

(missing code here)

End Sub

Any ideas are welcome, and I am sorry if this is simple and I am just overlooking something obvious. This is my first VBA project, and most of the resources I can find suggest replacing using regex, not saving extracted text as string. Thank you!

While that was a good description of how to create regular expressions, it did not address my largest problem... I do not want to replace text, I want to extract it and save it to an object as a string. Also I am not using Excel to work within cells or worksheets, I am using Word so when I use regEx.test() or regEx.Execute() I don't know where to reference.
– schradera, Commented Oct 21, 2016 at 17:37
I'm assuming that the "Program Name (abr)" part of your string will be different things depending on the document?
– Pat Jones, Commented Oct 21, 2016 at 18:11
@Pat_Jones, yes it will be different. I have tested the regex above using sample docs and an online regex tester, and that seems to work. It grabs everything from "Program: ..." through the end of the line.
– schradera, Commented Oct 21, 2016 at 21:33

mklement0 · Accepted Answer · 2016-10-24 18:27:15Z

Try this:

You can find documentation for the RegExp class here.

Dim regEx As Object
Dim matchCollection As Object
Dim extractedString As String

Set regEx = CreateObject("VBScript.RegExp")
With regEx
  .IgnoreCase = True
  .Global = False    ' Only look for 1 match; False is actually the default.
  .Pattern = "Program: ([^\r]+)"  ' Word separates lines with CR (\r)
End With

' Pass the text of your document as the text to search through to regEx.Execute().
' For a quick test of this statement, pass "Program: Program Name (abr)"
Set matchCollection = regEx.Execute(ActiveDocument.Content.Text)

' Extract the first submatch's (capture group's) value -
' e.g., "Program Name (abr)" - and assign it to variable extractedString.
' Guard against an empty collection: indexing (0) fails when nothing matched.
If matchCollection.Count > 0 Then
  extractedString = matchCollection(0).SubMatches(0)
End If

- I've modified your regex based on the assumption that you want to capture everything after Program: through the end of the line; your original regex would only have captured Program:<space>.
  - Enclosing [^\r]+ (all chars. through the end of the line) in (...) defines a so-called subexpression (a.k.a. capture group), which allows selective extraction of only the substring of interest from what the overall pattern captures.
- The .Execute() method, to which you pass the string to search in, always returns a collection of matches (Match objects).
  - Since the .Global property is set to False in your code, the output collection has (at most) 1 entry (at index 0) in this case.
- If the regular expression has subexpressions (1 in our case), then each entry of the match collection has a nonempty .SubMatches collection, with one entry for each subexpression, but note that the .SubMatches entries are strings, not Match objects.
- Match objects have properties .FirstIndex, .Length, and .Value (the captured string). Since the .Value property is the default property, it is sufficient to access the object itself, without needing to reference the .Value property: instead of the more verbose matchCollection(0).Value to access the captured string (in full), you can use the shortcut matchCollection(0). Again, by contrast, .SubMatches entries are strings only.
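To make the difference between a Match object and its .SubMatches entries concrete, here is a minimal sketch that runs the same pattern against a literal test string instead of the document. The names DemoMatchProperties and sampleText are made up for illustration, and the sample text is an assumption about what the document looks like (paragraphs separated by vbCr).

```vba
Sub DemoMatchProperties()
    ' Hypothetical demo routine; sampleText stands in for ActiveDocument.Content.Text.
    Dim regEx As Object
    Dim matches As Object
    Dim sampleText As String

    sampleText = "Header" & vbCr & "Program: Program Name (abr)" & vbCr & "Footer"

    Set regEx = CreateObject("VBScript.RegExp")
    With regEx
        .IgnoreCase = True
        .Global = False
        .Pattern = "Program: ([^\r]+)"
    End With

    Set matches = regEx.Execute(sampleText)

    If matches.Count = 0 Then
        Debug.Print "No match found."
        Exit Sub
    End If

    ' Properties of the Match object (the whole match, including "Program: ").
    Debug.Print "FirstIndex: " & matches(0).FirstIndex   ' zero-based offset into sampleText
    Debug.Print "Length:     " & matches(0).Length
    Debug.Print "Value:      " & matches(0).Value

    ' The capture group is a plain string: only the text after "Program: ".
    Debug.Print "SubMatch:   " & matches(0).SubMatches(0)
End Sub
```

Run it from the VBA editor and look at the Immediate window (Ctrl+G). Once the output looks right, swapping sampleText for ActiveDocument.Content.Text should give the same behavior on a real document.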

This looks like exactly what I am looking for. Can't wait to try it Monday morning. Thanks @mklement0
Exactly what I needed, but finding one problem... It isn't stopping at the end of the line (or any of the other lines or carriage returns). I pasted the variable extractedString into a message box to double check the output. I know your regEx is right because I tested it in a tester. Any ideas? I made sure the escape character goes the right way and everything. I am copying and pasting so you can see what I am using... .pattern = "Program: ([^\n]+)"
Oops! They are carriage returns and require \r, not \n. Your code was perfect; it was my mistake that made it not work. Thanks!
@schradera: Glad to hear it worked; thanks for telling me that \r rather than \n is needed - I've updated the answer.

Pat Jones · Answer · 2016-10-21 18:37:55Z


If you're just looking for a string that starts with "Program:" and want to go to the end of the line from there, you don't need a regular expression:

Public Sub ReadDocument()

Dim aLine As Paragraph
Dim aLineText As String
Dim my_str As String   ' Receives the matching text; must be declared under Option Explicit.

Dim start As Long

For Each aLine In ActiveDocument.Paragraphs

    aLineText = aLine.Range.Text
    start = InStr(aLineText, "Program:")

    If start > 0 Then
        my_str = Mid(aLineText, start)
    End If

Next aLine

End Sub

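This reads through the document paragraph by paragraph and stores your match in the variable "my_str" when it encounters a paragraph that contains "Program:". Note that a paragraph's Range.Text includes the trailing paragraph mark, and that if several paragraphs match, the last one wins.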
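If you want only the name itself (without the "Program:" label or the trailing paragraph mark), the loop can be wrapped in a function along these lines. This is a rough sketch, not a drop-in solution: the function name ProgramNameFromParagraphs and the constant PREFIX are made up for illustration, and it assumes the text is in an ordinary paragraph rather than a table cell.

```vba
Function ProgramNameFromParagraphs() As String
    ' Hypothetical helper: returns the text after "Program:" from the first
    ' matching paragraph, or an empty string if no paragraph matches.
    Const PREFIX As String = "Program:"
    Dim para As Paragraph
    Dim paraText As String
    Dim pos As Long

    For Each para In ActiveDocument.Paragraphs
        paraText = para.Range.Text
        pos = InStr(1, paraText, PREFIX, vbTextCompare)   ' case-insensitive search

        If pos > 0 Then
            ' Keep everything after the label.
            paraText = Mid$(paraText, pos + Len(PREFIX))

            ' Drop the paragraph mark (vbCr) that Range.Text includes.
            If Right$(paraText, 1) = vbCr Then
                paraText = Left$(paraText, Len(paraText) - 1)
            End If

            ProgramNameFromParagraphs = Trim$(paraText)
            Exit Function   ' Stop at the first match.
        End If
    Next para
End Function
```

Calling Debug.Print ProgramNameFromParagraphs() from another Sub prints the extracted name to the Immediate window. Unlike the loop above, this version stops at the first matching paragraph instead of the last.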

3 Comments

mklement0 Over a year ago

While you don't need the RegExp object in this simple case, why wouldn't you use it, given that it offers more powerful matching when you need it, makes for shorter code, and is probably faster, since you don't have to loop in VBA code?

Pat Jones Over a year ago

All quite true, just giving the OP another option if it is desired, that's all.

schradera Over a year ago

Interesting. I was excited to finally learn regex; it has been on my to-do list for a while, but I like being able to look at things from different perspectives too. Thanks @PatJones

Slai · Answer · 2016-10-22 02:40:43Z


Lazier version:

a = Split(ActiveDocument.Range.Text, "Program:")
If UBound(a) > 0 Then 
    extractedString = Trim(Split(a(1), vbCr)(0))
End If

If I remember correctly, paragraphs in Word end with vbCr (\r, not \n).
