McNaughton-Yamada-Thompson algorithm for converting regex to NDFA

Question

The McNaughton-Yamada-Thompson algorithm (aka. Thompson algorithm) is designed to transform a regular expression into a Non-Deterministic Finite Automaton (NDFA) that recognizes the language represented by the regular expression.

There are many Implementations of McNaughton-Yamada-Thompson algorithm in different languages:

Java
Python
Golang
Mathematica, written by me
Rust, written by me

I was trying to rewrite the Python code in C++. The result is consistent.

The C++ code written by me is as follows:

Try it online! \$\quad\$ Ref Python code for verification.

#include <iostream>
#include <string>
#include <unordered_map>
#include <stack>
#include <vector>
#include <set>

// Define the State structure
struct State {
    char label;       // Character label, '\0' for epsilon
    State* edge1;     // First transition
    State* edge2;     // Second transition

    State(char lbl = '\0') : label(lbl), edge1(nullptr), edge2(nullptr) {}
};

// Define the NFA structure
struct NFA {
    State* initial;
    State* accept;

    NFA(State* init = nullptr, State* acc = nullptr) : initial(init), accept(acc) {}
};

// Function to convert infix regex to postfix using the Shunting Yard algorithm
std::string shunt(const std::string& infix) {
    std::unordered_map<char, int> specials = {
        {'*', 60},
        {'+', 55},
        {'?', 50},
        {'.', 40},
        {'|', 20}
    };
    std::string postfix;
    std::stack<char> stack;

    for (char c : infix) {
        if (c == '(') {
            stack.push(c);
        }
        else if (c == ')') {
            while (!stack.empty() && stack.top() != '(') {
                postfix += stack.top();
                stack.pop();
            }
            if (!stack.empty()) {
                stack.pop(); // Remove '('
            }
        }
        else if (specials.find(c) != specials.end()) {
            while (!stack.empty() && specials.find(stack.top()) != specials.end() &&
                   specials[c] <= specials[stack.top()]) {
                postfix += stack.top();
                stack.pop();
            }
            stack.push(c);
        }
        else {
            postfix += c;
        }
    }

    while (!stack.empty()) {
        postfix += stack.top();
        stack.pop();
    }

    return postfix;
}

// Function to compute the epsilon closure of a state
std::set<State*> followes(State* state) {
    std::set<State*> states;
    std::stack<State*> stack;
    stack.push(state);

    while (!stack.empty()) {
        State* s = stack.top();
        stack.pop();
        if (states.find(s) == states.end()) {
            states.insert(s);
            if (s->label == '\0') { // Epsilon transition
                if (s->edge1 != nullptr) {
                    stack.push(s->edge1);
                }
                if (s->edge2 != nullptr) {
                    stack.push(s->edge2);
                }
            }
        }
    }

    return states;
}

// Function to compile postfix regex into an NFA
NFA compileRegex(const std::string& postfix) {
    std::stack<NFA> nfaStack;

    for (char c : postfix) {
        if (c == '*') {
            NFA nfa1 = nfaStack.top(); nfaStack.pop();
            State* initial = new State();
            State* accept = new State();
            initial->edge1 = nfa1.initial;
            initial->edge2 = accept;
            nfa1.accept->edge1 = nfa1.initial;
            nfa1.accept->edge2 = accept;
            nfaStack.push(NFA(initial, accept));
        }
        else if (c == '.') {
            NFA nfa2 = nfaStack.top(); nfaStack.pop();
            NFA nfa1 = nfaStack.top(); nfaStack.pop();
            nfa1.accept->edge1 = nfa2.initial;
            nfaStack.push(NFA(nfa1.initial, nfa2.accept));
        }
        else if (c == '|') {
            NFA nfa2 = nfaStack.top(); nfaStack.pop();
            NFA nfa1 = nfaStack.top(); nfaStack.pop();
            State* initial = new State();
            State* accept = new State();
            initial->edge1 = nfa1.initial;
            initial->edge2 = nfa2.initial;
            nfa1.accept->edge1 = accept;
            nfa2.accept->edge1 = accept;
            nfaStack.push(NFA(initial, accept));
        }
        else if (c == '+') {
            NFA nfa1 = nfaStack.top(); nfaStack.pop();
            State* initial = new State();
            State* accept = new State();
            initial->edge1 = nfa1.initial;
            nfa1.accept->edge1 = nfa1.initial;
            nfa1.accept->edge2 = accept;
            nfaStack.push(NFA(initial, accept));
        }
        else if (c == '?') {
            NFA nfa1 = nfaStack.top(); nfaStack.pop();
            State* initial = new State();
            State* accept = new State();
            initial->edge1 = nfa1.initial;
            initial->edge2 = accept;
            nfa1.accept->edge1 = accept;
            nfaStack.push(NFA(initial, accept));
        }
        else { // Literal character
            State* initial = new State(c);
            State* accept = new State();
            initial->edge1 = accept;
            nfaStack.push(NFA(initial, accept));
        }
    }

    return nfaStack.top();
}

// Function to match a string against the regex
bool matchRegex(const std::string& infix, const std::string& str) {
    std::string postfix = shunt(infix);
    // Uncomment the next line to see the postfix expression
    // std::cout << "Postfix: " << postfix << std::endl;

    NFA nfa = compileRegex(postfix);

    std::set<State*> current = followes(nfa.initial);
    std::set<State*> nextStates;

    for (char c : str) {
        for (State* state : current) {
            if (state->label == c) {
                std::set<State*> follow = followes(state->edge1);
                nextStates.insert(follow.begin(), follow.end());
            }
        }
        current = nextStates;
        nextStates.clear();
    }

    return current.find(nfa.accept) != current.end();
}

int main() {
    std::vector<std::string> infixes = {"a.b.c*", "a.(b|d).c*", "(a.(b|d))*", "a.(b.b)*.c"};
    std::vector<std::string> strings = {"", "abc", "abbc", "abcc", "abad", "abbbc"};

    for (const auto& infix : infixes) {
        for (const auto& str : strings) {
            bool result = matchRegex(infix, str);
            std::cout << (result ? "True " : "False ") << infix << " " << str << std::endl;
        }
        std::cout<<"\n";
    }

    return 0;
}

My question is:

The Python/C++ code snippets both use self-referenced nested object (self referencing structure, i.e., pointer relation graph may have cycle). How to port this kind of code to Non-pointer language(like, Wolfram Mathematica) ?

What about in Rust? (Rust's ownership and borrowing checker prevents simple self referencing structures. This is because Rust needs to know the lifecycle of variables at compile time to ensure memory safety. Use Box Rc; UnsafeCell; Pin?)

Your specific questions are about code not yet written, and/or not present in the question and/or not working, so answering them would be outside the scope of Code Review. I've reviewed the code in the normal way, but can't answer those questions. — Toby Speight
– Toby Speight, Commented Nov 15, 2024 at 11:47

Toby Speight · Accepted Answer · 2024-11-15 11:45:15Z

The code is clean and well-presented, with no warnings when compiled with my usual aggressive checking flags. However, running under Valgrind shows that it leaks memory like a firehose - see below.

I'm assuming that the algorithm is fine if it produces correct results (BTW, consider using a unit-test framework to help you create self-checking tests that don't need human inspection). So I'll walk through the code to see where it might be clearer.

#include <iostream>
#include <string>
#include <unordered_map>
#include <stack>
#include <vector>
#include <set>

A minor improvement is to sort the includes into alphabetical order. That often makes it easier to spot omissions.

// Define the State structure

This comment adds nothing that the C++ code doesn't already say. I recommend removing this one.

struct State {
    char label;       // Character label, '\0' for epsilon
    State* edge1;     // First transition
    State* edge2;     // Second transition

    State(char lbl = '\0') : label(lbl), edge1(nullptr), edge2(nullptr) {}
};

The comments within the struct are much more useful - they explain the why rather than the what. I'd probably change the first one to say it's the character to match rather than a "label", but the information that 0 means the end is definitely helpful.

Consider using in-class initialisers for those members that don't depend on constructor arguments:

struct State
{
    char label;
    State* edge1 = {};
    State* edge2 = {};

    State(char lbl = '\0')
        : label{lbl}
    {}
};

You may find that using an array for the two edges makes the rest of the code easier:

    std::array<State*,2> edge = { nullptr, nullptr };

// Define the NFA structure
struct NFA {
    State* initial;
    State* accept;

    NFA(State* init = nullptr, State* acc = nullptr) : initial(init), accept(acc) {}
};

Again, a redundant comment.

We never default the arguments when we construct an NFA, so we can change the constructor to just:

    NFA(State* init, State* acc)
        : initial(init), accept(acc)
    {}

But that's exactly the same as the default aggregate initialisation we'd get from the compiler, so we can omit it completely:

struct NFA {
    State* initial;
    State* accept;
};

std::string shunt(const std::string& infix) {

Consider using a std::string_view instead. That has cheap conversion from std::string& and from char const*).

    std::unordered_map<char, int> specials = {
        {'*', 60},
        {'+', 55},
        {'?', 50},
        {'.', 40},
        {'|', 20}
    };

This would benefit from an explanatory comment, especially for the value type (I think it's an operator precedence?).

We don't need to construct a new instance each time we call the function, so I suggest we make this a static const object. We just need to change operator[] to .at() in the two places we use it.

Also, it's likely that plain (sorted) std::map is more efficient than std::unordered_map for this small set of integer keys.

    std::string postfix;
    std::stack<char> stack;

    for (char c : infix) {

Since we don't intend to modify c in the body of the loop, we can write for (const char c: infix). That reduces cognitive load for readers.

        if (c == '(') {
            ⋮
        }
        else if (c == ')') {
            ⋮
        }
        else if (specials.find(c) != specials.end()) {
            ⋮
        }
        else {
            ⋮
        }

Consider using switch (c) here. If we don't mind duplicating the keys for the metacharacters, that could just be:

        switch (c) {
        case '(':
            ⋮
            break;
        case ')':
            ⋮
            break;
        case '*':
        case '+':
        case '?':
        case '.':
        case '|':
            ⋮
            break;
        default:
            ⋮
            break;
        }

I think we might be missing a case for '\' so that users can escape the regex metacharacters.

            if (!stack.empty()) {
                stack.pop(); // Remove '('
            }

I think it's an error if we reach the end of stack here, because that means an unmatched (. Consider how you might want to report this (throw an exception, perhaps?).

        else if (specials.find(c) != specials.end()) {
            while (!stack.empty() && specials.find(stack.top()) != specials.end() &&
                   specials[c] <= specials[stack.top()]) {

Here, we lookup c and stack.top() twice in the map (once with find and once with []). We could retain the iterator from find instead of performing a second lookup:

            else if (auto p = specials.find(c);  p != specials.end()) {
                while (!stack.empty()) {
                    if (auto const q = specials.find(stack.top());
                        q == specials.end() || p->second > q->second)
                    {
                        break;
                    }
                    postfix += stack.top();
                    stack.pop();
                }
            }

// Function to compute the epsilon closure of a state
std::set<State*> followes(State* state) {

Is that a typo for follows? Spelling does matter, because it can affect whether we can search large codebases effectively.

NFA compileRegex(const std::string& postfix) {
    std::stack<NFA> nfaStack;

    for (char c : postfix) {
        if (c == '*') {
            NFA nfa1 = nfaStack.top(); nfaStack.pop();
            State* initial = new State();
            State* accept = new State();

We're using raw pointers here, so we need to be careful that the matching deletes don't miss any of the allocations. Otherwise we'll leak memory.

Actually, I've looked through the rest of the code, and it seems we do leak all of these, and that's confirmed by running in Valgrind. So there's a definite bug to fix here.

            initial->edge1 = nfa1.initial;
            initial->edge2 = accept;
            nfa1.accept->edge1 = nfa1.initial;
            nfa1.accept->edge2 = accept;

If we made the edges in State into an array as I suggested above, these four lines become one:

            initial->edge = nfa1.accept->edge = { nfa1.initial, accept };

            std::cout << (result ? "True " : "False ") << infix << " " << str << std::endl;

Instead of writing (result ? "True " : "False "), we could std::cout << std::boolalpha so that the stream will do that for us.

There's no need to flush every line, so I recommend plain '\n' rather than std::endl.

An array might be even faster than a map, considering there are only 256 possible keys. — ayaan098
– ayaan098, Commented Nov 15, 2024 at 16:48
If NFA doesn't have a constructor then maybe it is a good idea to use member initializers, otherwise a default constructed NFA has undefined pointer values. — G. Sliepen
– G. Sliepen, Commented Nov 17, 2024 at 9:36

Louis Thomas · Accepted Answer · 2024-11-15 21:18:07Z

4

To implement pointers in a non-pointer language, create an array of objects. Pointers are now indexes into the array. They're just integers and no one can complain about integers.

I'm not a Rust programmer, but I suspect something similar will work. The array owns the objects and the rest is just bounds checked array access.

If it's a problem, don't recurse; iterate. Throw your item indexes onto a stack (an array and a counter doncha know) and process in a loop until done.

The only store a Turing machine has is one really big array.

edited Nov 15, 2024 at 21:18

answered Nov 15, 2024 at 21:10

Louis Thomas

412 bronze badges

Add a comment |

Stack Exchange Network

McNaughton-Yamada-Thompson algorithm for converting regex to NDFA

2 Answers 2

You must log in to answer this question.

Hot Network Questions

McNaughton-Yamada-Thompson algorithm for converting regex to NDFA

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions