5
\$\begingroup\$

Problem description

Given a text \$T[0..n)\$, a pattern \$P[0..m)\$, and a nonnegative integer \$k\$, report all positions \$j \in [0..n)\$ such that \$ed(P, T(j - l..j]) \leq k\$ for some \$l \geq 0\$, where \$ed\$ is the Levenshtein distance between two input strings.

Question

I am comparing two algorithms: the trivial one, and the one, which incorporates Ukkonen's heuristic for pruning computing the entire distance matrix.

See what I have:

ApproximateStringMatcher.java

package net.coderodde.string.matching.approximate;

import java.util.List;

/**
 * This interface defines the API for approximate string matching algorithms.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Mar 23, 2016)
 */
public interface ApproximateStringMatcher {

    /**
     * Returns the list of all approximate matches of {@code pattern} in 
     * {@code text}. The edit distance between an approximate match and the 
     * pattern is no more than {@code maximumEditDistance}.
     * 
     * @param text                the text to search in.
     * @param pattern             the pattern to search for.
     * @param maximumEditDistance the maximum allowed edit distance.
     * @return a list of the last indices of all approximate matches.
     */
    public List<Integer> match(String text, 
                               String pattern, 
                               int maximumEditDistance);
}

DefaultApproximateStringMatcher.java

package net.coderodde.string.matching.approximate.support;

import java.util.ArrayList;
import java.util.List;
import static net.coderodde.misc.Miscellanea.delta;
import static net.coderodde.misc.Miscellanea.min;
import net.coderodde.string.matching.approximate.ApproximateStringMatcher;

/**
 * This class implements a default approximate string matching algorithm.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Mar 23, 2016)
 */
public class DefaultApproximateStringMatcher 
implements ApproximateStringMatcher {

    @Override
    public List<Integer> match(String text, 
                               String pattern, 
                               int maximumEditDistance) {
        int n = text.length();
        int m = pattern.length();
        int[][] g = new int[m + 1][n + 1];
        List<Integer> matchIndexList = new ArrayList<>();

        for (int i = 0; i < m + 1; ++i) {
            g[i][0] = i;
        }

        for (int j = 1; j < n + 1; ++j) {
            for (int i = 1; i < m + 1; ++i) {
                g[i][j] = min(g[i - 1][j - 1] + delta(text.charAt(j - 1),
                                                      pattern.charAt(i - 1)),
                              g[i - 1][j] + 1,
                              g[i][j - 1] + 1);

            }

            if (g[m][j] <= maximumEditDistance) {
                matchIndexList.add(j);
            }
        }

        return matchIndexList;
    }    
}

UkkonenCutOffAlgorithm.java

package net.coderodde.string.matching.approximate.support;

import java.util.ArrayList;
import java.util.List;
import static net.coderodde.misc.Miscellanea.delta;
import static net.coderodde.misc.Miscellanea.min;
import net.coderodde.string.matching.approximate.ApproximateStringMatcher;

/**
 * This class implements an approximate string matching algorithms with a
 * cut-off heuristic by Esko Ukkonen.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Mar 23, 2016)
 */
public class UkkonenCutOffAlgorithm implements ApproximateStringMatcher {

    @Override
    public List<Integer> match(String text, 
                               String pattern, 
                               int maximumEditDistance) {
        int n = text.length();
        int m = pattern.length();
        int top = min(maximumEditDistance + 1, m);
        int[][] g = new int[m + 1][n + 1];
        List<Integer> matchIndexList = new ArrayList<>();

        for (int i = 1; i <= top; ++i) {
            g[i][0] = i;
        }

        for (int j = 1; j <= n; ++j) {
            for (int i = 1; i <= top; ++i) {
                g[i][j] = min(g[i - 1][j - 1] + delta(pattern.charAt(i - 1),
                                                      text.charAt(j - 1)),
                              g[i - 1][j] + 1,
                              g[i][j - 1] + 1);
            }

            while (g[top][j] > maximumEditDistance) {
                --top;
            }

            if (top == m) {
                matchIndexList.add(j);
            } else {
                g[++top][j] = maximumEditDistance + 1;
            }
        }

        return matchIndexList;
    }
}

Miscellanea.java

package net.coderodde.misc;

import java.util.Random;

/**
 * This class contains various utilities.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Mar 23, 2016)
 */
public class Miscellanea {

    public static int min(int... ints) {
        if (ints.length == 0) {
            throw new IllegalArgumentException("Nothing to return.");
        }

        int min = ints[0];

        for (int i = 1; i < ints.length; ++i) {
            if (min > ints[i]) {
                min = ints[i];
            }
        }

        return min;
    }

    public static int delta(char a, char b) {
        return a == b ? 0 : 1;
    }

    public static String createRandomString(int size, 
                                            char smallest,
                                            char largest,
                                            Random random) {
        StringBuilder sb = new StringBuilder(size);

        for (int i = 0; i < size; ++i) {
            sb.append(smallest + random.nextInt(largest - smallest + 1));
        }

        return sb.toString();
    }
}

Demo.java

import java.util.List;
import java.util.Random;
import static net.coderodde.misc.Miscellanea.createRandomString;
import net.coderodde.string.matching.approximate.ApproximateStringMatcher;
import net.coderodde.string.matching.approximate.support.DefaultApproximateStringMatcher;
import net.coderodde.string.matching.approximate.support.UkkonenCutOffAlgorithm;

public class Demo {

    private static final int TEXT_LENGTH = 1_000_000;
    private static final int PATTERN_LENGTH = 10;
    private static final int MAXIMUM_DISTANCE = 1;

    public static void main(String[] args) {
        long seed = System.currentTimeMillis();
        Random random = new Random(seed);
        String text = createRandomString(TEXT_LENGTH, 'A', 'C', random);
        String pattern = createRandomString(PATTERN_LENGTH, 'A', 'C', random);
        System.out.println("Seed = " + seed);

        ApproximateStringMatcher matcher1 = 
                new DefaultApproximateStringMatcher();
        ApproximateStringMatcher matcher2 = 
                new UkkonenCutOffAlgorithm();

        warmup(random);

        long startTime = System.nanoTime();
        List<Integer> result1 = matcher1.match(text, pattern, MAXIMUM_DISTANCE);
        long endTime = System.nanoTime();

        System.out.printf("%s in %.2f milliseconds.\n",
                          matcher1.getClass().getSimpleName(),
                          (endTime - startTime) / 1e6);

        startTime = System.nanoTime();
        List<Integer> result2 = matcher1.match(text, pattern, MAXIMUM_DISTANCE);
        endTime = System.nanoTime();

        System.out.printf("%s in %.2f milliseconds.\n",
                          matcher2.getClass().getSimpleName(),
                          (endTime - startTime) / 1e6);


        if (result1.equals(result2)) {
            System.out.println("Matches: " + result1.size());
        } else {
            System.out.println("Algorithms disagree, please debug.");
        }
    }

    private static final void warmup(Random random) {
        ApproximateStringMatcher matcher1 =
                new DefaultApproximateStringMatcher();

        ApproximateStringMatcher matcher2 = 
                new UkkonenCutOffAlgorithm();

        for (int i = 0; i < 20; ++i) {
            String text = createRandomString(10_000, 'A', 'Z', random);
            String pattern = createRandomString(10, 'A', 'Z', random);

            matcher1.match(text, pattern, 2);
            matcher2.match(text, pattern, 2);
        }
    }
}

Please, tell me anything that comes to mind.

\$\endgroup\$
2
  • \$\begingroup\$ Very nice question. If applicable, would you have a link to the challenge/task source to include? \$\endgroup\$
    – Phrancis
    Commented Mar 23, 2016 at 15:18
  • \$\begingroup\$ These two are not from a challenge. In the beginning of the post, I intended solely explain what those two algorithms compute. \$\endgroup\$
    – coderodde
    Commented Mar 23, 2016 at 15:44

1 Answer 1

2
\$\begingroup\$

Overall very nice programming. I am glad that you correctly declared and implemented ApproximateStringMatcher.

In your Miscellanea.min method you could have used min = Math.min(min, ints[i]) instead of that if.

I would consider to think about the Miscellanea.delta method. The only thing he is doing is to do a ternary, I wonder if I preferred to have that code in place so I didn't have the need to navigate the method to see what it does. Aside of this what you definitely want to do is to have a variable to store the result so you shorten the g[i][j] assignment.

public List<Integer> match(String text, String pattern, int maximumEditDistance) {
   //...
    for (int j = 1; j <= n; ++j) {
        for (int i = 1; i <= top; ++i) {
            //consider to do the ternary here instead
            int delta = delta(pattern.charAt(i - 1), text.charAt(j - 1)); 
            g[i][j] = min(g[i - 1][j - 1] + delta, g[i - 1][j] + 1, g[i][j - 1] + 1);
        }
        //...
    }
    //...
}
\$\endgroup\$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.