C++ regex_replace with arbitrary function

Question

This is a version of C++17 regex_replace that accepts an arbitrary function to compute each replacement, instead of using regex_replace's weird mini-language ($&, $', and $`). Also, I made it take string_view instead of const string&; that's more to see if people have opinions on string_view than for any practical reason. :) (The standard C++17 regex library is not aware of the existence of string_view.)

template<class F>
std::string regex_replace(std::string_view haystack,
                          const std::regex& rx, const F& f)
{
    std::string result;
    const char *begin = haystack.data();
    const char *end = begin + haystack.size();
    std::cmatch m, lastm;
    if (!std::regex_search(begin, end, m, rx)) {
        return std::string(haystack);
    }
    do {
        lastm = m;
        result.append(m.prefix());
        result.append(f(m));
        begin = m[0].second;
        begin += (begin != end && m[0].length() == 0);
        if (begin == end) break;
    } while (std::regex_search(begin, end, m, rx,
        std::regex_constants::match_prev_avail));
    result.append(lastm.suffix());
    return result;
}

void test()
{
    auto s = "std::sort(std::begin(v), std::end(v))";
    auto t = regex_replace(s, std::regex("\\bstd::(\\w+)"),
        [](auto&& m) { return m[1]; });
    assert(t == "sort(begin(v), end(v))");

    auto u = regex_replace(s, std::regex("\\bstd::(\\w+)"),
        [](auto&& m) { return "my::" + m.str(1); });
    assert(u == "my::sort(my::begin(v), my::end(v))");

    auto v = regex_replace(s, std::regex("\\bstd::(\\w+)"),
        [](auto&& m) {
            std::string result;
            std::transform(m[1].first, m[1].second, back_inserter(result), ::toupper);
            return result;
        });
    assert(v == "SORT(BEGIN(v), END(v))");
}

Any bugs in regex_replace? Any way to shorten it up? Any way to reuse the standard regex_iterator or regex_token_iterator would be greatly appreciated.

Toby Speight · Accepted Answer · 2024-08-28 09:27:41Z

Missing includes

We need:

#include <regex>
#include <string>

And for test(), we also need:

#include <algorithm>
#include <cassert>
#include <cctype>

Instead of using char pointers for begin and end, I think it's simpler and clearer to use iterators:

    auto begin = haystack.begin();
    auto const end = haystack.end();

That might also be a start towards more generic code.

Instead of directly calling f(m), it's more flexible to std::invoke() it.
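
To illustrate what std::invoke() buys us, here is a minimal sketch. Token and apply_replacement are hypothetical names standing in for the match object and for the call site inside regex_replace; they are not part of the reviewed code.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for a match object, to keep the example short.
struct Token {
    std::string text;
    std::string quoted() const { return '"' + text + '"'; }
};

// Hypothetical helper: applies a replacement callable through std::invoke,
// the way the reviewed function could at the point where it calls f(m).
template<class F>
std::string apply_replacement(const F& f, const Token& t)
{
    return std::invoke(f, t);
}

void demo_invoke()
{
    const Token t{"sort"};
    // An ordinary lambda behaves exactly as it would with f(t).
    assert(apply_replacement([](const Token& x) { return "my::" + x.text; }, t)
           == "my::sort");
    // Pointers to member functions and to data members are accepted too;
    // a plain f(t) call would not compile for these.
    assert(apply_replacement(&Token::quoted, t) == "\"sort\"");
    assert(apply_replacement(&Token::text, t) == "sort");
}
```

The lambda case is unchanged; the gain is only that member pointers become valid replacement callables.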

Consider (from C++20) using a suitable Concept to constrain F.

The loop has special treatment for empty matches, but there's no test that this works. That's unfortunate, as there are two bugs here:

We increment begin without copying that character to output.
We don't adjust the last match to remove that character.

So we end up with one character being lost each iteration, plus an extra character after we leave the loop.
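
A small reproduction makes the loss visible. The function below is the question's loop copied unchanged, apart from the hypothetical name original_regex_replace (chosen only to keep it apart from the fixed version); the empty regex matches at every position, so every iteration takes the empty-match path.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <string_view>

// The question's implementation, renamed for this demonstration only.
template<class F>
std::string original_regex_replace(std::string_view haystack,
                                   const std::regex& rx, const F& f)
{
    std::string result;
    const char *begin = haystack.data();
    const char *end = begin + haystack.size();
    std::cmatch m, lastm;
    if (!std::regex_search(begin, end, m, rx)) {
        return std::string(haystack);
    }
    do {
        lastm = m;
        result.append(m.prefix());
        result.append(f(m));
        begin = m[0].second;
        // Skips one character after an empty match, but never copies it.
        begin += (begin != end && m[0].length() == 0);
        if (begin == end) break;
    } while (std::regex_search(begin, end, m, rx,
        std::regex_constants::match_prev_avail));
    result.append(lastm.suffix());
    return result;
}

void demo_empty_match_bug()
{
    const std::string out = original_regex_replace(
        "foo", std::regex{""},
        [](const std::cmatch&) { return std::string{".."}; });
    // A correct implementation would give "..f..o..o..".
    assert(out != "..f..o..o..");
    // The skipped first character never reaches the output.
    assert(out.find('f') == std::string::npos);
}
```

If I trace the loop correctly, the 'f' is skipped in the first iteration and is not part of any later prefix or suffix, which is why it goes missing.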

If we're careful, we can avoid using m after we copy it to lastm, and thus replace the copy with a move or swap. Remember that a regex match object allocates memory, so that's a worthwhile gain.

We should template the function on the character type, so we can use it with wide and/or Unicode strings. We'll probably need an overload that converts C-style strings to C++ views.

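The test has the standard signed/unsigned gotcha with std::toupper(): passing it a plain char with a negative value is undefined behavior, so the argument should be converted to unsigned char first. That applies unless you've defined a ::toupper() differently (don't do that; call it something different).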

Modified code

Incorporating all the above suggestions:

#include <concepts>
#include <functional>
#include <regex>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

template<typename F, typename StringView, typename Regex>
auto regex_replace(StringView haystack, const Regex& rx, const F& f)
    requires std::is_same_v<typename StringView::value_type, typename Regex::value_type>
    &&       std::is_invocable_v<F, std::match_results<typename StringView::iterator>>
{
    using Char = StringView::value_type;
    using Iter = StringView::iterator;
    using String = std::basic_string<Char>;
    using Match = std::match_results<Iter>;

    auto begin = haystack.begin();
    auto const end = haystack.end();
    Match m;
    if (!std::regex_search(begin, end, m, rx)) {
        return String{haystack};
    }

    String result;
    Match lastm;
    do {
        std::swap(m, lastm);
        result.append(lastm.prefix());
        result.append(std::invoke(f, lastm));
        begin = lastm[0].second;
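        if (lastm.length(0) == 0) {
            // Empty match: copy the next character (if any) to the output and
            // re-anchor lastm on it, so the stored suffix excludes it.
            // "[\s\S]" matches any character; "." would skip line terminators.
            static const Regex one_char{String{'[', '\\', 's', '\\', 'S', ']'}};
            if (std::regex_search(begin, end, m, one_char)) {
                result.push_back(*begin++);
                std::swap(m, lastm);
            } else {
                break;
            }
        }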
    } while (std::regex_search(begin, end, m, rx,
                               std::regex_constants::match_prev_avail));
    result.append(lastm.suffix());
    return result;
}

template<typename F, typename Char, typename Regex>
auto regex_replace(Char const *haystack, const Regex& rx, const F& f)
{
    return regex_replace(std::basic_string_view{haystack}, rx, f);
}
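
The question also asked whether the standard regex_iterator could be reused. The following is a rough sketch of that idea, not a drop-in replacement for the code above: replace_via_iterator is a hypothetical name, it handles only narrow characters, and it relies on std::cregex_iterator's own rules for advancing past empty matches.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>
#include <string_view>

// Hypothetical alternative built on std::cregex_iterator.
template<class F>
std::string replace_via_iterator(std::string_view haystack,
                                 const std::regex& rx, const F& f)
{
    const char* const first = haystack.data();
    const char* const last = first + haystack.size();
    const char* tail = first;   // start of the text not yet copied
    std::string result;
    for (std::cregex_iterator it{first, last, rx}, done; it != done; ++it) {
        const std::cmatch& m = *it;
        result.append(tail, m[0].first);    // text between matches
        result.append(std::invoke(f, m));   // replacement for this match
        tail = m[0].second;
    }
    result.append(tail, last);              // text after the final match
    return result;
}

void demo_replace_via_iterator()
{
    const auto out = replace_via_iterator(
        "std::sort(std::begin(v), std::end(v))", std::regex{"std::(\\w+)"},
        [](const std::cmatch& m) { return "my::" + m.str(1); });
    assert(out == "my::sort(my::begin(v), my::end(v))");
}
```

Tracking the tail pointer by hand avoids depending on prefix() and suffix(), so no copy of the last match is needed after the loop.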

My testing uses Google Test for my own convenience, because it gives more informative output when an assertion fails; my main point here though is to show the edge cases that I address, starting with the most trivial:

#include <algorithm>
#include <cctype>
#include <ranges>

#include <gtest/gtest.h>

TEST(regex_replace, empty_string)
{
    auto const replace_with_underscores = [](auto) { return "__"; };
    EXPECT_EQ(regex_replace("", std::regex{""}, replace_with_underscores),
              "__");
    EXPECT_EQ(regex_replace("", std::regex{"\\b"}, replace_with_underscores),
              "");
    EXPECT_EQ(regex_replace("", std::regex{"\\B"}, replace_with_underscores),
              "__");
}

TEST(regex_replace, nomatch)
{
    auto const s = "foo";
    EXPECT_EQ(s, regex_replace(s, std::regex{"foobar"},
                               [](auto) { return std::string{"__"};  }));
}

TEST(regex_replace, char)
{
    auto const s = "std::sort(std::begin(v), std::end(v))";
    auto const t = regex_replace(s, std::regex{R"(\bstd::(\w+))"},
                                 [](auto const& m) { return m[1]; });
    EXPECT_EQ(t, "sort(begin(v), end(v))");
}

TEST(regex_replace, transform)
{
    auto const s = "std::sort(std::begin(v), std::end(v))";
    auto const v = regex_replace(s, std::regex("\\bstd::(\\w+)"),
        [](auto&& m) {
            constexpr auto safe_toupper = [](unsigned char c){ return std::toupper(c); };
            return m[1].str()
                | std::views::transform(safe_toupper)
                | std::ranges::to<std::string>();
        });
    EXPECT_EQ(v, "SORT(BEGIN(v), END(v))");
}

TEST(regex_replace, wchar)
{
    auto const s = L"std::sort(std::begin(v), std::end(v))";
    auto const t = regex_replace(s, std::wregex{LR"(\bstd::(\w+))"},
                               [](auto const& m) { return m[1]; });
    EXPECT_EQ(t, L"sort(begin(v), end(v))");
}

TEST(regex_replace, match_empty)
{
    auto const match_word_boundary = std::wregex{L"\\b"};
    auto const replace_with_dots = [](auto) { return L".."; };
    EXPECT_EQ(regex_replace(L"foo", match_word_boundary, replace_with_dots),
              L"..foo..");
    EXPECT_EQ(regex_replace(L"foo-+*", match_word_boundary, replace_with_dots),
              L"..foo..-+*");
    EXPECT_EQ(regex_replace(L"-+*foo", match_word_boundary, replace_with_dots),
              L"-+*..foo..");
    EXPECT_EQ(regex_replace(L"foo-+*bar", match_word_boundary, replace_with_dots),
              L"..foo..-+*..bar..");

    EXPECT_EQ(regex_replace(L"foo", std::wregex{L""}, replace_with_dots),
              L"..f..o..o..");
}
