Using Regular Expressions for Search/Replace
|
Normally, when you search for a sub-string in a string, the match should be exact. So if we search for a sub-string "abc" then the string being searched should contain these exact letters in the same sequence for a match to be found. We can extend this kind of search to a case insensitive search where the sub-string "abc" will find strings like "Abc", "ABC" etc. That is, the case is ignored but the sequence of the letters should be exactly the same. Sometimes, a case insensitive search is also not enough. For example, if we want to search for numeric digit, then we basically end up searching for each digit independantly. This is where regular expressions come in to our help.
Regular expressions are text patterns that are used for string matching. Regular expressions are strings that contains a mix of plain text and special characters to indicate what kind of matching to do. Here's a very brief turorial on using regular expressions before we move on to the code for handling regular expressions.
Suppose, we are looking for a numeric digit then the regular expression we would search for is "[0-9]". The brackets indicate that the character being compared should match any one of the characters enclosed within the bracket. The dash (-) between 0 and 9 indicates that it is a range from 0 to 9. Therefore, this regular expression will match any character between 0 and 9, that is, any digit. If we want to search for a special character literally we must use a backslash before the special character. For example, the single character regular expression "\*" matches a single asterisk. In the table below the special characters are briefly described.
Character |
Description |
^ |
Beginning of the string. The expression "^A" will match an A only at the beginning of the string. |
^ |
The caret (^) immediately following the left-bracket ([) has a different meaning. It is used to exclude the remaining characters within brackets from matching the target string. The expression "[^0-9]" indicates that the target character should not be a digit. |
$ |
The dollar sign ($) will match the end of the string. The expression "abc$" will match the sub-string "abc" only if it is at the end of the string. |
| |
The alternation character (|) allows either expression on its side to match the target string. The expression "a|b" will match a as well as b. |
. |
The dot (.) will match any character. |
* |
The asterix (*) indicates that the character to the left of the asterix in the expression should match 0 or more times. |
+ |
The plus (+) is similar to asterix but there should be at least one match of the character to the left of the + sign in the expression. |
? |
The question mark (?) matches the character to its left 0 or 1 times. |
() |
The parenthesis affects the order of pattern evaluation and also serves as a tagged expression that can be used when replacing the matched sub-string with another expression. |
[] |
Brackets ([ and ]) enclosing a set of characters indicates that any of the enclosed characters may match the target character. |
The parenthesis, besides affecting the evaluation order of the regular expression, also serves as tagged expression which is something like a temporary memory. This memory can then be used when we want to replace the found expression with a new expression. The replace expression can specify a & character which means that the & represents the sub-string that was found. So, if the sub-string that matched the regular expression is "abcd", then a replace expression of "xyz&xyz;" will change it to "xyzabcdxyz". The replace expression can also be expressed as "xyz\0xyz". The "\0" indicates a tagged expression representing the entire sub-string that was matched. Similarly we can have other tagged expression represented by "\1", "\2" etc. Note that although the tagged expression 0 is always defined, the tagged expression 1,2 etc. are only defined if the regular expression used in the search had enough sets of parenthesis. Here are few examples.
String |
Search |
Replace |
Result |
Mr. |
(Mr)(\.) |
\1s\2 |
Mrs. |
abc |
(a)b(c) |
&-\1-\2 |
abc-a-c |
bcd |
(a|b)c*d |
&-\1 |
bcd-b |
abcde |
(.*)c(.*) |
&-\1-\2 |
abcde-ab-de |
cde |
(ab|cd)e |
&-\1 |
cde-cd |
The code for regular expression search is given below. This code was derived from work done by Henry Spencer. Besides adding a C++ wrapper around C code, a couple of functions were added to enable search and replace operations.
//////////////////////////////////////////////////////////////////////// // RegExp.h // // This code has been derived from work by Henry Spencer. // The main changes are // 1. All char variables and functions have been changed to TCHAR // counterparts // 2. Added GetFindLen() & GetReplaceString() to enable search // and replace operations. // 3. And of course, added the C++ Wrapper // // The original copyright notice follows: // // Copyright (c) 1986, 1993, 1995 by University of Toronto. // Written by Henry Spencer. Not derived from licensed software. // // Permission is granted to anyone to use this software for any // purpose on any computer system, and to redistribute it in any way, // subject to the following restrictions: // // 1. The author is not responsible for the consequences of use of // this software, no matter how awful, even if they arise // from defects in it. // // 2. The origin of this software must not be misrepresented, either // by explicit claim or by omission. // // 3. Altered versions must be plainly marked as such, and must not // be misrepresented (by explicit claim or omission) as being // the original software. // // 4. This notice must not be removed or altered. ///////////////////////////////////////////////////////////////////////////// #define NSUBEXP 10 class CRegExp { public: CRegExp(); ~CRegExp(); CRegExp *RegComp( const TCHAR *re ); int RegFind(const TCHAR *str); TCHAR* GetReplaceString( const TCHAR* sReplaceExp ); int GetFindLen() { if( startp[0] == NULL || endp[0] == NULL ) return 0; return endp[0] - startp[0]; } private: TCHAR *regnext(TCHAR *node); void reginsert(TCHAR op, TCHAR *opnd); int regtry(TCHAR *string); int regmatch(TCHAR *prog); size_t regrepeat(TCHAR *node); TCHAR *reg(int paren, int *flagp); TCHAR *regbranch(int *flagp); void regtail(TCHAR *p, TCHAR *val); void regoptail(TCHAR *p, TCHAR *val); TCHAR *regpiece(int *flagp); TCHAR *regatom(int *flagp); // Inline functions private: TCHAR OP(TCHAR *p) {return *p;}; TCHAR *OPERAND( TCHAR *p) {return (TCHAR*)((short *)(p+1)+1); }; // regc - emit (if appropriate) a byte of code void regc(TCHAR b) { if (bEmitCode) *regcode++ = b; else regsize++; }; // regnode - emit a node TCHAR * regnode(TCHAR op) { if (!bEmitCode) { regsize += 3; return regcode; } *regcode++ = op; *regcode++ = _T('\0'); /* Null next pointer. */ *regcode++ = _T('\0'); return regcode-3; }; private: BOOL bEmitCode; BOOL bCompiled; TCHAR *sFoundText; TCHAR *startp[NSUBEXP]; TCHAR *endp[NSUBEXP]; TCHAR regstart; // Internal use only. TCHAR reganch; // Internal use only. TCHAR *regmust; // Internal use only. int regmlen; // Internal use only. TCHAR *program; // Unwarranted chumminess with compiler. TCHAR *regparse; // Input-scan pointer. int regnpar; // () count. TCHAR *regcode; // Code-emit pointer; ®dummy; = don't. TCHAR regdummy[3]; // NOTHING, 0 next ptr long regsize; // Code size. TCHAR *reginput; // String-input pointer. TCHAR *regbol; // Beginning of input, for ^ check. }; //////////////////////////////////////////////////////////////////////////////// // RegExp.cpp //////////////////////////////////////////////////////////////////////////////// #include "stdafx.h" #include "RegExp.h" // definition number opnd? meaning #define END 0 // no End of program. #define BOL 1 // no Match beginning of line. #define EOL 2 // no Match end of line. #define ANY 3 // no Match any character. #define ANYOF 4 // str Match any of these. #define ANYBUT 5 // str Match any but one of these. #define BRANCH 6 // node Match this, or the next..\&. #define BACK 7 // no "next" ptr points backward. #define EXACTLY 8 // str Match this string. #define NOTHING 9 // no Match empty string. #define STAR 10 // node Match this 0 or more times. #define PLUS 11 // node Match this 1 or more times. #define OPEN 20 // no Sub-RE starts here. // OPEN+1 is number 1, etc. #define CLOSE 30 // no Analogous to OPEN. // Utility definitions. #define FAIL(m) { regerror(m); return(NULL); } #define ISREPN(c) ((c) == _T('*') || (c) == _T('+') || (c) == _T('?')) #define META "^$.[()|?+*\\" // Flags to be passed up and down. #define HASWIDTH 01 // Known never to match null string. #define SIMPLE 02 // Simple enough to be STAR/PLUS operand. #define SPSTART 04 // Starts with * or +. #define WORST 0 // Worst case. CRegExp::CRegExp() { bCompiled = FALSE; program = NULL; sFoundText = NULL; for( int i = 0; i < NSUBEXP; i++ ) { startp[i] = NULL; endp[i] = NULL; } } CRegExp::~CRegExp() { delete program; delete sFoundText; } CRegExp* CRegExp::RegComp(const TCHAR *exp) { TCHAR *scan; int flags; if (exp == NULL) return NULL; bCompiled = TRUE; // First pass: determine size, legality. bEmitCode = FALSE; regparse = (TCHAR *)exp; regnpar = 1; regsize = 0L; regdummy[0] = NOTHING; regdummy[1] = regdummy[2] = 0; regcode = regdummy; if (reg(0, &flags;) == NULL) return(NULL); // Allocate space. delete program; program = new TCHAR[regsize]; memset( program, 0, regsize * sizeof(TCHAR) ); if (program == NULL) return NULL; // Second pass: emit code. bEmitCode = TRUE; regparse = (TCHAR *)exp; regnpar = 1; regcode = program; if (reg(0, &flags;) == NULL) return NULL; // Dig out information for optimizations. regstart = _T('\0'); // Worst-case defaults. reganch = 0; regmust = NULL; regmlen = 0; scan = program; // First BRANCH. if (OP(regnext(scan)) == END) { // Only one top-level choice. scan = OPERAND(scan); // Starting-point info. if (OP(scan) == EXACTLY) regstart = *OPERAND(scan); else if (OP(scan) == BOL) reganch = 1; // If there's something expensive in the r.e., find the // longest literal string that must appear and make it the // regmust. Resolve ties in favor of later strings, since // the regstart check works with the beginning of the r.e. // and avoiding duplication strengthens checking. Not a // strong reason, but sufficient in the absence of others. if (flags&SPSTART;) { char *longest = NULL; size_t len = 0; for (; scan != NULL; scan = regnext(scan)) if (OP(scan) == EXACTLY && _tcslen(OPERAND(scan)) >= len) { longest = OPERAND(scan); len = _tcslen(OPERAND(scan)); } regmust = longest; regmlen = (int)len; } } return this; } // reg - regular expression, i.e. main body or parenthesized thing // // Caller must absorb opening parenthesis. // // Combining parenthesis handling with the base level of regular expression // is a trifle forced, but the need to tie the tails of the branches to what // follows makes it hard to avoid. TCHAR *CRegExp::reg(int paren, int *flagp) { char *ret; char *br; char *ender; int parno; int flags; *flagp = HASWIDTH; // Tentatively. if (paren) { // Make an OPEN node. if (regnpar >= NSUBEXP) { TRACE1("Too many (). NSUBEXP is set to %d\n", NSUBEXP ); return NULL; } parno = regnpar; regnpar++; ret = regnode(OPEN+parno); } // Pick up the branches, linking them together. br = regbranch(&flags;); if (br == NULL) return(NULL); if (paren) regtail(ret, br); // OPEN -> first. else ret = br; *flagp &= ~(~flags&HASWIDTH;); // Clear bit if bit 0. *flagp |= flags&SPSTART; while (*regparse == _T('|')) { regparse++; br = regbranch(&flags;); if (br == NULL) return(NULL); regtail(ret, br); // BRANCH -> BRANCH. *flagp &= ~(~flags&HASWIDTH;); *flagp |= flags&SPSTART; } // Make a closing node, and hook it on the end. ender = regnode((paren) ? CLOSE+parno : END); regtail(ret, ender); // Hook the tails of the branches to the closing node. for (br = ret; br != NULL; br = regnext(br)) regoptail(br, ender); // Check for proper termination. if (paren && *regparse++ != _T(')')) { TRACE0("unterminated ()\n"); return NULL; } else if (!paren && *regparse != _T('\0')) { if (*regparse == _T(')')) { TRACE0("unmatched ()\n"); return NULL; } else { TRACE0("internal error: junk on end\n"); return NULL; } // NOTREACHED } return(ret); } // // regbranch - one alternative of an | operator // // Implements the concatenation operator. TCHAR *CRegExp::regbranch(int *flagp) { TCHAR *ret; TCHAR *chain; TCHAR *latest; int flags; int c; *flagp = WORST; // Tentatively. ret = regnode(BRANCH); chain = NULL; while ((c = *regparse) != _T('\0') && c != _T('|') && c != _T(')')) { latest = regpiece(&flags;); if (latest == NULL) return(NULL); *flagp |= flags&HASWIDTH; if (chain == NULL) // First piece. *flagp |= flags&SPSTART; else regtail(chain, latest); chain = latest; } if (chain == NULL) // Loop ran zero times. (void) regnode(NOTHING); return(ret); } // // regpiece - something followed by possible [*+?] // // Note that the branching code sequences used for ? and the general cases // of * and + are somewhat optimized: they use the same NOTHING node as // both the endmarker for their branch list and the body of the last branch. // It might seem that this node could be dispensed with entirely, but the // endmarker role is not redundant. TCHAR *CRegExp::regpiece(int *flagp) { TCHAR *ret; TCHAR op; TCHAR *next; int flags; ret = regatom(&flags;); if (ret == NULL) return(NULL); op = *regparse; if (!ISREPN(op)) { *flagp = flags; return(ret); } if (!(flags&HASWIDTH;) && op != _T('?')) { TRACE0("*+ operand could be empty\n"); return NULL; } switch (op) { case _T('*'): *flagp = WORST|SPSTART; break; case _T('+'): *flagp = WORST|SPSTART|HASWIDTH; break; case _T('?'): *flagp = WORST; break; } if (op == _T('*') && (flags&SIMPLE;)) reginsert(STAR, ret); else if (op == _T('*')) { // Emit x* as (x&|), where & means "self". reginsert(BRANCH, ret); // Either x regoptail(ret, regnode(BACK)); // and loop regoptail(ret, ret); // back regtail(ret, regnode(BRANCH)); // or regtail(ret, regnode(NOTHING)); // null. } else if (op == _T('+') && (flags&SIMPLE;)) reginsert(PLUS, ret); else if (op == _T('+')) { // Emit x+ as x(&|), where & means "self". next = regnode(BRANCH); // Either regtail(ret, next); regtail(regnode(BACK), ret); // loop back regtail(next, regnode(BRANCH)); // or regtail(ret, regnode(NOTHING)); // null. } else if (op == _T('?')) { // Emit x? as (x|) reginsert(BRANCH, ret); // Either x regtail(ret, regnode(BRANCH)); // or next = regnode(NOTHING); // null. regtail(ret, next); regoptail(ret, next); } regparse++; if (ISREPN(*regparse)) { TRACE0("nested *?+\n"); return NULL; } return(ret); } // // regatom - the lowest level // // Optimization: gobbles an entire sequence of ordinary characters so that // it can turn them into a single node, which is smaller to store and // faster to run. Backslashed characters are exceptions, each becoming a // separate node; the code is simpler that way and it's not worth fixing. TCHAR *CRegExp::regatom(int *flagp) { TCHAR *ret; int flags; *flagp = WORST; // Tentatively. switch (*regparse++) { case _T('^'): ret = regnode(BOL); break; case _T('$'): ret = regnode(EOL); break; case _T('.'): ret = regnode(ANY); *flagp |= HASWIDTH|SIMPLE; break; case _T('['): { int range; int rangeend; int c; if (*regparse == _T('^')) { // Complement of range. ret = regnode(ANYBUT); regparse++; } else ret = regnode(ANYOF); if ((c = *regparse) == _T(']') || c == _T('-')) { regc(c); regparse++; } while ((c = *regparse++) != _T('\0') && c != _T(']')) { if (c != _T('-')) regc(c); else if ((c = *regparse) == _T(']') || c == _T('\0')) regc(_T('-')); else { range = (unsigned) (TCHAR)*(regparse-2); rangeend = (unsigned) (TCHAR)c; if (range > rangeend) { TRACE0("invalid [] range\n"); return NULL; } for (range++; range <= rangeend; range++) regc(range); regparse++; } } regc(_T('\0')); if (c != _T(']')) { TRACE0("unmatched []\n"); return NULL; } *flagp |= HASWIDTH|SIMPLE; break; } case _T('('): ret = reg(1, &flags;); if (ret == NULL) return(NULL); *flagp |= flags&(HASWIDTH|SPSTART); break; case _T('\0'): case _T('|'): case _T(')'): // supposed to be caught earlier TRACE0("internal error: \\0|) unexpected\n"); return NULL; break; case _T('?'): case _T('+'): case _T('*'): TRACE0("?+* follows nothing\n"); return NULL; break; case _T('\\'): if (*regparse == _T('\0')) { TRACE0("trailing \\\n"); return NULL; } ret = regnode(EXACTLY); regc(*regparse++); regc(_T('\0')); *flagp |= HASWIDTH|SIMPLE; break; default: { size_t len; TCHAR ender; regparse--; len = _tcscspn(regparse, META); if (len == 0) { TRACE0("internal error: strcspn 0\n"); return NULL; } ender = *(regparse+len); if (len > 1 && ISREPN(ender)) len--; // Back off clear of ?+* operand. *flagp |= HASWIDTH; if (len == 1) *flagp |= SIMPLE; ret = regnode(EXACTLY); for (; len > 0; len--) regc(*regparse++); regc(_T('\0')); break; } } return(ret); } // reginsert - insert an operator in front of already-emitted operand // // Means relocating the operand. void CRegExp::reginsert(TCHAR op, TCHAR *opnd) { TCHAR *place; if (!bEmitCode) { regsize += 3; return; } (void) memmove(opnd+3, opnd, (size_t)((regcode - opnd)*sizeof(TCHAR))); regcode += 3; place = opnd; // Op node, where operand used to be. *place++ = op; *place++ = _T('\0'); *place++ = _T('\0'); } // // regtail - set the next-pointer at the end of a node chain void CRegExp::regtail(TCHAR *p, TCHAR *val) { TCHAR *scan; TCHAR *temp; // int offset; if (!bEmitCode) return; // Find last node. for (scan = p; (temp = regnext(scan)) != NULL; scan = temp) continue; *((short *)(scan+1)) = (OP(scan) == BACK) ? scan - val : val - scan; } // regoptail - regtail on operand of first argument; nop if operandless void CRegExp::regoptail(TCHAR *p, TCHAR *val) { // "Operandless" and "op != BRANCH" are synonymous in practice. if (!bEmitCode || OP(p) != BRANCH) return; regtail(OPERAND(p), val); } // RegFind - match a regexp against a string // Returns - Returns position of regexp or -1 // if regular expression not found // Note - The regular expression should have been // previously compiled using RegComp int CRegExp::RegFind(const TCHAR *str) { TCHAR *string = (TCHAR *)str; // avert const poisoning TCHAR *s; // Delete any previously stored found string delete sFoundText; sFoundText = NULL; // Be paranoid. if(string == NULL) { TRACE0("NULL argument to regexec\n"); return(-1); } // Check validity of regex if (!bCompiled) { TRACE0("No regular expression provided yet.\n"); return(-1); } // If there is a "must appear" string, look for it. if (regmust != NULL && _tcsstr(string, regmust) == NULL) return(-1); // Mark beginning of line for ^ regbol = string; // Simplest case: anchored match need be tried only once. if (reganch) { if( regtry(string) ) { // Save the found substring in case we need it sFoundText = new TCHAR[GetFindLen()+1]; sFoundText[GetFindLen()] = _T('\0'); _tcsncpy(sFoundText, string, GetFindLen() ); return 0; } //String not found return -1; } // Messy cases: unanchored match. if (regstart != _T('\0')) { // We know what TCHAR it must start with. for (s = string; s != NULL; s = _tcschr(s+1, regstart)) if (regtry(s)) { int nPos = s-str; // Save the found substring in case we need it later sFoundText = new TCHAR[GetFindLen()+1]; sFoundText[GetFindLen()] = _T('\0'); _tcsncpy(sFoundText, s, GetFindLen() ); return nPos; } return -1; } else { // We don't -- general case for (s = string; !regtry(s); s++) if (*s == _T('\0')) return(-1); int nPos = s-str; // Save the found substring in case we need it later sFoundText = new TCHAR[GetFindLen()+1]; sFoundText[GetFindLen()] = _T('\0'); _tcsncpy(sFoundText, s, GetFindLen() ); return nPos; } // NOTREACHED } // regtry - try match at specific point int CRegExp::regtry(TCHAR *string) { int i; TCHAR **stp; TCHAR **enp; reginput = string; stp = startp; enp = endp; for (i = NSUBEXP; i > 0; i--) { *stp++ = NULL; *enp++ = NULL; } if (regmatch(program)) { startp[0] = string; endp[0] = reginput; return(1); } else return(0); } // regmatch - main matching routine // // Conceptually the strategy is simple: check to see whether the current // node matches, call self recursively to see whether the rest matches, // and then act accordingly. In practice we make some effort to avoid // recursion, in particular by going through "ordinary" nodes (that don't // need to know whether the rest of the match failed) by a loop instead of // by recursion. int CRegExp::regmatch(TCHAR *prog) { TCHAR *scan; // Current node. TCHAR *next; // Next node. for (scan = prog; scan != NULL; scan = next) { next = regnext(scan); switch (OP(scan)) { case BOL: if (reginput != regbol) return(0); break; case EOL: if (*reginput != _T('\0')) return(0); break; case ANY: if (*reginput == _T('\0')) return(0); reginput++; break; case EXACTLY: { size_t len; TCHAR *const opnd = OPERAND(scan); // Inline the first character, for speed. if (*opnd != *reginput) return(0); len = _tcslen(opnd); if (len > 1 && _tcsncmp(opnd, reginput, len) != 0) return(0); reginput += len; break; } case ANYOF: if (*reginput == _T('\0') || _tcschr(OPERAND(scan), *reginput) == NULL) return(0); reginput++; break; case ANYBUT: if (*reginput == _T('\0') || _tcschr(OPERAND(scan), *reginput) != NULL) return(0); reginput++; break; case NOTHING: break; case BACK: break; case OPEN+1: case OPEN+2: case OPEN+3: case OPEN+4: case OPEN+5: case OPEN+6: case OPEN+7: case OPEN+8: case OPEN+9: { const int no = OP(scan) - OPEN; TCHAR *const input = reginput; if (regmatch(next)) { // Don't set startp if some later // invocation of the same parentheses // already has. if (startp[no] == NULL) startp[no] = input; return(1); } else return(0); break; } case CLOSE+1: case CLOSE+2: case CLOSE+3: case CLOSE+4: case CLOSE+5: case CLOSE+6: case CLOSE+7: case CLOSE+8: case CLOSE+9: { const int no = OP(scan) - CLOSE; TCHAR *const input = reginput; if (regmatch(next)) { // Don't set endp if some later // invocation of the same parentheses // already has. if (endp[no] == NULL) endp[no] = input; return(1); } else return(0); break; } case BRANCH: { TCHAR *const save = reginput; if (OP(next) != BRANCH) // No choice. next = OPERAND(scan); // Avoid recursion. else { while (OP(scan) == BRANCH) { if (regmatch(OPERAND(scan))) return(1); reginput = save; scan = regnext(scan); } return(0); // NOTREACHED } break; } case STAR: case PLUS: { const TCHAR nextch = (OP(next) == EXACTLY) ? *OPERAND(next) : _T('\0'); size_t no; TCHAR *const save = reginput; const size_t min = (OP(scan) == STAR) ? 0 : 1; for (no = regrepeat(OPERAND(scan)) + 1; no > min; no--) { reginput = save + no - 1; // If it could work, try it. if (nextch == _T('\0') || *reginput == nextch) if (regmatch(next)) return(1); } return(0); break; } case END: return(1); // Success! break; default: TRACE0("regexp corruption\n"); return(0); break; } } // We get here only if there's trouble -- normally "case END" is // the terminating point. TRACE0("corrupted pointers\n"); return(0); } // regrepeat - report how many times something simple would match size_t CRegExp::regrepeat(TCHAR *node) { size_t count; TCHAR *scan; TCHAR ch; switch (OP(node)) { case ANY: return(_tcslen(reginput)); break; case EXACTLY: ch = *OPERAND(node); count = 0; for (scan = reginput; *scan == ch; scan++) count++; return(count); break; case ANYOF: return(_tcsspn(reginput, OPERAND(node))); break; case ANYBUT: return(_tcscspn(reginput, OPERAND(node))); break; default: // Oh dear. Called inappropriately. TRACE0("internal error: bad call of regrepeat\n"); return(0); // Best compromise. break; } // NOTREACHED } // regnext - dig the "next" pointer out of a node TCHAR *CRegExp::regnext(TCHAR *p) { const short &offset; = *((short*)(p+1)); if (offset == 0) return(NULL); return((OP(p) == BACK) ? p-offset : p+offset); } // GetReplaceString - Converts a replace expression to a string // Returns - Pointer to newly allocated string // Caller is responsible for deleting it TCHAR* CRegExp::GetReplaceString( const TCHAR* sReplaceExp ) { TCHAR *src = (TCHAR *)sReplaceExp; TCHAR *buf; TCHAR c; int no; size_t len; if( sReplaceExp == NULL || sFoundText == NULL ) return NULL; // First compute the length of the string int replacelen = 0; while ((c = *src++) != _T('\0')) { if (c == _T('&')) no = 0; else if (c == _T('\\') && isdigit(*src)) no = *src++ - _T('0'); else no = -1; if (no < 0) { // Ordinary character. if (c == _T('\\') && (*src == _T('\\') || *src == _T('&'))) c = *src++; replacelen++; } else if (startp[no] != NULL && endp[no] != NULL && endp[no] > startp[no]) { // Get tagged expression len = endp[no] - startp[no]; replacelen += len; } } // Now allocate buf buf = new TCHAR[replacelen+1]; if( buf == NULL ) return NULL; TCHAR* sReplaceStr = buf; // Add null termination buf[replacelen] = _T('\0'); // Now we can create the string src = (TCHAR *)sReplaceExp; while ((c = *src++) != _T('\0')) { if (c == _T('&')) no = 0; else if (c == _T('\\') && isdigit(*src)) no = *src++ - _T('0'); else no = -1; if (no < 0) { // Ordinary character. if (c == _T('\\') && (*src == _T('\\') || *src == _T('&'))) c = *src++; *buf++ = c; } else if (startp[no] != NULL && endp[no] != NULL && endp[no] > startp[no]) { // Get tagged expression len = endp[no] - startp[no]; int tagpos = startp[no] - startp[0]; _tcsncpy(buf, sFoundText + tagpos, len); buf += len; } } return sReplaceStr; }Here's a function that will do global search and replace using regular expressions. Note that the CStringEx class described in the earlier section is being used here. The main reason for using CStringEx is that it provides the Replace() function which makes our task easier.
int RegSearchReplace( CStringEx& string, LPCTSTR sSearchExp, LPCTSTR sReplaceExp ) { int nPos = 0; int nReplaced = 0; CRegExp r; LPTSTR str = (LPTSTR)(LPCTSTR)string; r.RegComp( sSearchExp ); while( (nPos = r.RegFind((LPTSTR)str)) != -1 ) { nReplaced++; TCHAR *pReplaceStr = r.GetReplaceString( sReplaceExp ); int offset = str-(LPCTSTR)string+nPos; string.Replace( offset, r.GetFindLen(), pReplaceStr ); // Replace might have caused a reallocation str = (LPTSTR)(LPCTSTR)string + offset + _tcslen(pReplaceStr); delete pReplaceStr; } return nReplaced; }
|
Partners
More for Developers
Latest Developer Headlines
- Featured Tip: Convert a PPT File into an Image or HTML File in .NET
- Is Android Destined to Be the Windows of Mobile Computing?
- Software Architecture: Propping Up a House of Cards
- Microsoft Offers Developers New Outlook Tools Under Apache Open-source License
- Exploring Derived Column Transformation in SQL Server Integration Services
Top Authors
- Voted: 13 times.
- Voted: 11 times.
- Voted: 11 times.
- Voted: 10 times.
- Voted: 8 times.
- Paul Kimmel 78 articles
- Tom Archer - MSFT 75 articles
- Zafir Anjum 61 articles
- Bradley Jones 44 articles
- Marius Bancila 32 articles