I've written a function which takes as input:
- a string to be modified
- a vector of regular expressions on which to match
- a vector of replacements
- an option to ignore case when matching
The function then performs all replacements simultaneously, from the user's perspective, and in a 'safe' fashion. By 'safe' I mean two things. First, sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are both patterns, the sub-string 'the' inside 'they' is ignored). Second, cross-replacement is allowed (swap 'hey' for 'ho' and 'ho' for 'hey').
Examples:
    rcpp_mgsub("hey ho",{"hey","ho"},{"ho","hey"},true)
    // Expect "ho hey"
    rcpp_mgsub("Hey, how are you?",{"hey","how","are","you"},{"how","are","you","hey"},true)
    // Expect "how, are you hey?"
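To make the 'safe' rule concrete, here is a minimal sketch that applies it to literal patterns only. It is not the function under review: `literal_mgsub` is a hypothetical name, it uses `std::string::find` instead of regular expressions, it is case-sensitive, and it assumes both vectors have the same length.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only: literal patterns, no regex,
// case-sensitive. Assumes patterns.size() == replacements.size().
std::string literal_mgsub(const std::string& input,
                          const std::vector<std::string>& patterns,
                          const std::vector<std::string>& replacements) {
    std::string out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t best_start = std::string::npos;
        std::size_t best_len = 0;
        std::size_t best_idx = 0;
        for (std::size_t i = 0; i < patterns.size(); ++i) {
            if (patterns[i].empty()) continue;  // an empty pattern would never advance
            const std::size_t found = input.find(patterns[i], pos);
            if (found == std::string::npos) continue;
            // Earliest match wins; on a tie the longer pattern wins,
            // so "they" beats "the" when both start at the same position.
            if (found < best_start ||
                (found == best_start && patterns[i].size() > best_len)) {
                best_start = found;
                best_len = patterns[i].size();
                best_idx = i;
            }
        }
        if (best_start == std::string::npos) break;  // nothing left to replace
        out.append(input, pos, best_start - pos);    // text before the match
        out += replacements[best_idx];               // replacement is never rescanned
        pos = best_start + best_len;
    }
    out.append(input, pos, std::string::npos);       // unmatched tail
    return out;
}
```

Because the replacement text is appended to the output and never scanned again, `literal_mgsub("hey ho", {"hey", "ho"}, {"ho", "hey"})` gives "ho hey" rather than "hey hey", and `literal_mgsub("they the", {"the", "they"}, {"a", "b"})` gives "b a".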
I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R packages can include pre-compiled C++11 code, which can result in vastly faster functions. To this end, I've written a C++11 function, but it is actually significantly slower than the R version (3-7x slower in the various comparisons I've done), so there must be some significant inefficiencies in my function.
So, I'm curious what inefficiencies are present in the rcpp_mgsub function that I should focus on improving.
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic) {
        std::string newString = "";
        std::smatch matches;
        std::string prefix = string;
        std::string detected;
        std::string suffix;
        // Start from a defined value: applying |= to an uninitialized flag_type is undefined behavior.
        std::regex::flag_type flags = std::regex_constants::ECMAScript | std::regex_constants::optimize;
        if (ic) flags |= std::regex_constants::icase;
        // Compile every pattern once, up front.
        std::vector<std::regex> r(match.size());
        std::transform(match.begin(), match.end(), r.begin(),
                       [&flags](std::string m) { return std::regex(m, flags); });
        std::size_t j = 0;
        while (string.size() > 0) {
            prefix = string;
            detected = "";
            suffix = "";
            j = 0;
            // Pick the match that starts earliest; on a tie, prefer the longest.
            for (std::size_t i = 0; i < match.size(); i++) {
                if (std::regex_search(string, matches, r[i])) {
                    std::string pr = matches.prefix();
                    std::string m = matches[0];
                    if (pr.size() < prefix.size() || (pr.size() == prefix.size() && m.size() > detected.size())) {
                        prefix = pr;
                        detected = m;
                        suffix = matches.suffix();
                        j = i;
                    }
                }
            }
            if (prefix == string) {
                // No pattern matched: copy the remainder and stop.
                newString = newString + string;
                string = "";
            } else {
                // Re-run the winning regex on the matched text so back-references such as \1 expand.
                newString = newString + prefix + std::regex_replace(detected, r[j], replace[j], std::regex_constants::format_sed);
                string = suffix;
            }
        }
        return newString;
    }

    int main()
    {
        std::cout << rcpp_mgsub("hey ho hey",{"hey","ho"},{"ho","hey"},true) << "\n";
        std::cout << rcpp_mgsub("Hey, how are you?",{"hey","how","are","you"},{"how","are","you","hey"},true) << "\n";
        std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",{"[Dd]opa(.*?mine)","fake"},{"Meta\\1","real"},false);
        return 0;
    }
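For timing the function outside of R, a small `std::chrono` harness is one option. This is a minimal sketch and not the benchmark behind the numbers quoted above; `time_calls` and its `checksum` parameter are hypothetical names introduced here. The checksum only exists so that the compiler cannot discard the calls being timed.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper: calls f() `repetitions` times and returns the elapsed
// wall-clock time in milliseconds. f must return a std::string (or anything
// with a size() member); the summed sizes are written to `checksum` so the
// work has an observable result.
template <typename F>
double time_calls(F f, int repetitions, std::size_t& checksum) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i) {
        checksum += f().size();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It could be used by wrapping the call in a lambda, for example `time_calls([&] { return rcpp_mgsub(text, match, replace, false); }, 20000, checksum)`, where `text`, `match`, `replace` and `checksum` are variables defined by the caller. The result depends heavily on the optimization level the program was compiled with.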
With a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.