Safe, simultaneous string replacement

Question

I've written a function which takes as an input:

a string to be modified
a vector of regular expressions on which to match
a vector of replacements
option to ignore case in the match statements

The function will then, from the user's perspective, perform all replacements simultaneously and in a 'safe' fashion. By 'safe' I mean two things: sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are both patterns, the sub-string 'the' in 'they' is ignored), and users can cross-replace (swap 'hey' for 'ho' and 'ho' for 'hey').

Examples:

rcpp_mgsub("hey ho",{"hey","ho"},{"ho","hey"},true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?",{"hey","how","are","you"},{"how","are","you","hey"},true)
// Expect "how, are you hey?"
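
To show why the replacements have to be simultaneous, here is a minimal sketch of the naive alternative, which applies each replacement in turn to the output of the previous one. The name naive_sequential_gsub is hypothetical and is not part of the function under review; it exists only to show the failure mode.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the post): applies the replacements one
// after another, so later patterns also see text produced by earlier ones.
std::string naive_sequential_gsub(std::string text,
                                  std::vector<std::string> const& match,
                                  std::vector<std::string> const& replace)
{
    assert(match.size() == replace.size());
    for (std::size_t i = 0; i < match.size(); ++i) {
        // Each pass rewrites the whole string before the next pattern runs.
        text = std::regex_replace(text, std::regex(match[i]), replace[i]);
    }
    return text;
}
```

With the patterns {"hey","ho"} and replacements {"ho","hey"}, the input "hey ho" becomes "ho ho" after the first pass and "hey hey" after the second, rather than the desired "ho hey". Likewise, with patterns {"the","they"} the input "they" loses its "the" to the shorter pattern before the longer one is ever tried.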

I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.

So, I'm curious what inefficiencies are present in the rcpp_mgsub function that I should focus on improving.

#include <iostream>
#include <string>
#include <regex>

std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic) {
  std::string newString = "";
  std::smatch matches;
  std::string prefix = string;
  std::string detected;
  std::string suffix;

  std::regex::flag_type flags;
  flags |= std::regex_constants::optimize;
  if(ic) flags |= std::regex_constants::icase;

  std::vector<std::regex> r(match.size());
  std::transform(match.begin(), match.end(), r.begin(),
                 [&flags](std::string m) {return std::regex (m, flags);});

  int j=0;
  while(string.size() > 0){
    prefix = string;
    detected = "";
    suffix = "";
    j = 0;
    for(int i=0;i < match.size();i++){
      if (std::regex_search(string,matches, r[i])) {
        std::string pr = matches.prefix();
        std::string m = matches[0];
        if(pr.size() < prefix.size() || (pr.size() == prefix.size() && m.size() > detected.size())){
          prefix = pr;
          detected = m;
          suffix = matches.suffix();
          j=i;
        }
      }
    }
    if(prefix == string){
      newString = newString+string;
      string = "";
    } else {
      newString = newString+prefix+std::regex_replace(detected,r[j],replace[j],std::regex_constants::format_sed);
      string = suffix;
    }
  }
  return newString;
}


int main()
{
   std::cout << rcpp_mgsub("hey ho hey",{"hey","ho"},{"ho","hey"},true) << "\n";
   std::cout << rcpp_mgsub("Hey, how are you?",{"hey","how","are","you"},{"how","are","you","hey"},true) << "\n";
   std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",{"[Dd]opa(.*?mine)","fake"},{"Meta\\1","real"},false);
   return(0);
}

Are you compiling with optimizations enabled? With GCC and a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
– Toby Speight, Commented Jan 19, 2018 at 11:50
Updated answer to fix the bug. Sorry it took so long to return to this!
– Toby Speight, Commented Nov 7, 2018 at 15:22

Toby Speight · Accepted Answer · 2018-11-07 15:19:44Z

Headers

You forgot to include <vector> (and <algorithm>, which declares std::transform).

Only the test program requires <iostream>, so consider moving it to immediately before main(), to make it easier to separate the implementation and its tests when the time comes.

Interface

You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:

Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.
The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a std::regex_constants::syntax_option_type to be used; this would also allow the caller to choose different regex grammars.

I think I would write the interface something like this:

std::string rcpp_mgsub(const std::string& string,
                       std::vector<std::pair<std::regex,std::string>> const& replacements);

// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
                       std::vector<std::string> const& match,
                       std::vector<std::string> const& replace,
                       bool const& ic)
{
    if (match.size() != replace.size())
        throw std::invalid_argument("match/replace lengths differ");

    auto flags = std::regex_constants::optimize;
    if (ic)
        flags |= std::regex_constants::icase;
    std::vector<std::pair<std::regex,std::string>> replacements;
    replacements.reserve(match.size());
    std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
                   [&flags](const std::string& m, const std::string& r) {return std::make_pair(std::regex(m, flags), r);});

    return rcpp_mgsub(string, replacements);
}
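
As an illustration of what the pair-based interface could look like from the caller's side, here is a small sketch. The alias Replacements and the helper make_replacements are hypothetical names introduced only for this example. The point is that each pattern sits next to its replacement, and that the regex flags are spelled out at the call site instead of hiding behind a bool.

```cpp
#include <initializer_list>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias for the list-of-pairs argument type.
using Replacements = std::vector<std::pair<std::regex, std::string>>;

// Hypothetical helper: compile (pattern, replacement) pairs using flags
// chosen by the caller; defaults to the plain ECMAScript grammar.
Replacements make_replacements(
    std::initializer_list<std::pair<std::string, std::string>> pairs,
    std::regex_constants::syntax_option_type flags = std::regex_constants::ECMAScript)
{
    Replacements out;
    out.reserve(pairs.size());
    for (auto const& p : pairs)
        out.emplace_back(std::regex(p.first, flags), p.second);
    return out;
}
```

A call such as make_replacements({{"hey","ho"},{"ho","hey"}}, std::regex_constants::ECMAScript | std::regex_constants::icase) then makes both the pairing and the case-insensitivity visible where the function is used, and a length mismatch between patterns and replacements can no longer occur.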

Algorithm

After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.

Here's an implementation that does this:

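#include <algorithm>   // std::transform, used by the compatibility layer
#include <iterator>    // std::back_inserter, used by the compatibility layer
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>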

std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)
{
    static const std::sregex_iterator no_match = {};

    using IterAndReplacement = std::pair<std::sregex_iterator,const std::string&>;

    std::vector<IterAndReplacement> iterators;
    iterators.reserve(replacements.size());
    for (auto const& r: replacements)
        iterators.emplace_back(std::sregex_iterator{s.begin(), s.end(), r.first}, r.second);

    std::string result = {};
    auto position = s.begin();

    while (true) {
        // find the next match, ignoring any shorter overlapping matches
        IterAndReplacement const *best_match = nullptr;
        for (auto& i: iterators) {
            auto& it = i.first;
            // advance iterators to after last match
            while (it != no_match && (*it)[0].first < position) {
                ++it;
            }
            if (it == no_match) continue;
            if (!best_match) {
                best_match = &i;
                continue;
            }
            auto const& match = (*i.first)[0];
            auto const& best = (*best_match->first)[0];
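            // match starts at or after the end of best: it comes later, keep best
            if (match.first >= best.second)
                continue;
            // take match instead when it ends before best begins, when it
            // starts earlier and is at least as long, or when it starts
            // before best ends and is strictly longer
            if (match.second < best.first
                || (match.first < best.first && match.length() >= best.length())
                || (match.first < best.second && match.length() > best.length()))
            {
                best_match = &i;
            }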
        }

        // if no regex matches, just copy the rest of string and finish
        if (!best_match) {
            result.append(position, s.end());
            return result;
        }

        // otherwise, replace the match and continue to the next one
        auto const& best = (*best_match->first);
        auto const m = best.format(best_match->second, std::regex_constants::format_sed);
        result.append(position, best[0].first).append(m);
        position = best[0].second;
    }
}
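
The implementation above leans on two standard library facilities: std::sregex_iterator, which remembers where a regex last matched and resumes from there, and match_results::format, which expands a replacement string for a single match. The following sketch isolates just those two pieces. The helper name formatted_matches is hypothetical and only serves the illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for every match of re in s, record
// "<start offset>:<replacement expanded for that match>".
std::vector<std::string> formatted_matches(std::string const& s,
                                           std::regex const& re,
                                           std::string const& fmt)
{
    std::vector<std::string> out;
    // A default-constructed iterator is the end-of-sequence marker.
    for (std::sregex_iterator it{s.begin(), s.end(), re}, end{}; it != end; ++it) {
        std::smatch const& m = *it;
        // format_sed makes \1 refer to the first capture group.
        out.push_back(std::to_string(m.position(0)) + ":"
                      + m.format(fmt, std::regex_constants::format_sed));
    }
    return out;
}
```

Because each iterator keeps its own position, incrementing it continues the search after the previous match instead of rescanning the string from the start, which is what saves the repeated regex_search calls of the original.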

Compilation

The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main() to transform a string (using a replacement and its inverse) twenty thousand times, and to print the result so that the optimizer can't discard the work:

#include <iostream>

int main()
{
   std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
                           {"hey","ho" },
                           {"ho", "hey"}, true) << "\n";
   std::cout << rcpp_mgsub("Hey, how are you?",
                           {"hey","how","are","you"},
                           {"how","are","you","hey"}, true) << "\n";
   std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
                           {"[Dd]opa(.*?mine)", "fake"},
                           {"Meta\\1",          "real"}, false) << "\n";

   std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
   for (auto i = 0u;  i < 10000;  ++i) {
       s = rcpp_mgsub(s,
       {"[Dd]opa(.*?mine)", "fake"},
       {"Meta\\1",          "real"}, false);
       s = rcpp_mgsub(s,
       {"Meta(.*?mine)", "fake"},
       {"Dopa\\1",       "real"}, false);
   }
   std::cout << s << std::endl ;
}

I found a large difference between g++ -O0 and g++ -O3 on this code (roughly a factor of 10). Quite a large part of this program comes from expanding templates from the <regex> header (therefore compiled as part of this translation unit, with our compiler). There's also quite a lot that can be inlined or removed by a link-time optimizer.

Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.
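
One way to take such before-and-after measurements is a small timing harness built on std::chrono. This is only a sketch: the name mean_microseconds is hypothetical, and a serious benchmark would also need warm-up runs and several samples.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <ratio>

// Hypothetical helper: average wall-clock time of one call to f,
// in microseconds, measured over the given number of repetitions.
template <typename F>
double mean_microseconds(F&& f, std::size_t repetitions)
{
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;   // monotonic, suited to intervals
    auto const start = clock::now();
    for (std::size_t i = 0; i < repetitions; ++i)
        f();
    auto const elapsed = clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count()
           / static_cast<double>(repetitions);
}
```

The callable passed in should make its result observable, for example by assigning it to a variable that is printed afterwards, for the same reason the test program above prints s. Build with the same optimization flags that the real library will use.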

Wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
– Mark, Commented Jan 25, 2018 at 19:48
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
– Toby Speight, Commented Jan 26, 2018 at 11:11

Snowbody · Accepted Answer · 2018-01-17 14:34:40Z


What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.

Do you really need backreferences? That usually makes regex much slower.

It's very unusual to give a variable the same name as its type (std::string string).

You don't need to initialize prefix = string before the loop; that assignment is the first thing done inside the loop.

Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.
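
For readers unfamiliar with the term, "backreference" can describe two different things, and the sketch below shows both without claiming which one the remark above has in mind. Inside a pattern, \1 must match the same text that group 1 captured. Inside a replacement, \1 (with format_sed) inserts the captured text. The function names are hypothetical, and the second pattern is the one from the question's test program.

```cpp
#include <regex>
#include <string>

// Backreference inside the pattern: \1 must repeat whatever group 1 matched,
// so this finds any lowercase letter that is immediately doubled.
bool has_doubled_letter(std::string const& s)
{
    static const std::regex doubled("([a-z])\\1");
    return std::regex_search(s, doubled);
}

// Group reference inside the replacement (sed style, as in the question):
// \1 is replaced by the text captured by the first parenthesised group.
std::string dopa_to_meta(std::string const& s)
{
    static const std::regex dopa("[Dd]opa(.*?mine)");
    return std::regex_replace(s, dopa, "Meta\\1",
                              std::regex_constants::format_sed);
}
```

Here has_doubled_letter("hello") is true because of the "ll", and dopa_to_meta("Dopazamine is fake") yields "Metazamine is fake". Only the first kind changes how the matching itself has to work; the second is applied after a match has already been found.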


It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
– Mark, Commented Jan 17, 2018 at 14:36
By backreference do you mean the using () and \\1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
– Mark, Commented Jan 17, 2018 at 14:49
So you're on Windows using a recent mingw version of GCC then, I take it.
– Snowbody, Commented Jan 17, 2018 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
– Mark, Commented Jan 17, 2018 at 14:50
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
– Snowbody, Commented Jan 18, 2018 at 13:39
