Revisions to indexOf Boyer-Moore String Search algorithm

Post Made Community Wiki by Gareth Rees

occurred Apr 10, 2018 at 22:13

simpler approach for picking a random substring

Source Link

edited Mar 30, 2018 at 11:36

50.1k
3
130
211

To implement this, it's convenient to have a function for choosing a random substring. (See this answer on Stack Overflow for an explanation.)

    from math import sqrt
    import random

    def random_substring(string):
        "Return a non-empty substring of string, chosen uniformly at random."
        n = len(string)
        k = random.randrange(n * (n + 1) // 2)
        stop = int((sqrt(8 * k + 1) - 1) / 2) + 1
        start = k - stop * (stop - 1) // 2
        return string[start:stop]

Now it'sIt's easy to generate random (positive) test cases:

    def test_random(self):
        for n in range(1, 100):
            haystack = ''.join(random.choice('abc') for _ in range(n))
            needlestart, stop = random_substringsorted(haystackrandom.sample(range(n + 1), 2))
            needle = haystack[start:stop]
            self.check(haystack, needle)

To implement this, it's convenient to have a function for choosing a random substring. (See this answer on Stack Overflow for an explanation.)

    from math import sqrt
    import random

    def random_substring(string):
        "Return a non-empty substring of string, chosen uniformly at random."
        n = len(string)
        k = random.randrange(n * (n + 1) // 2)
        stop = int((sqrt(8 * k + 1) - 1) / 2) + 1
        start = k - stop * (stop - 1) // 2
        return string[start:stop]

Now it's easy to generate random (positive) test cases:

    def test_random(self):
        for n in range(1, 100):
            haystack = ''.join(random.choice('abc') for _ in range(n))
            needle = random_substring(haystack)
            self.check(haystack, needle)

It's easy to generate random (positive) test cases:

    def test_random(self):
        for n in range(1, 100):
            haystack = ''.join(random.choice('abc') for _ in range(n))
            start, stop = sorted(random.sample(range(n + 1), 2))
            needle = haystack[start:stop]
            self.check(haystack, needle)

link random testing

Source Link

edited Jan 23, 2018 at 13:21

Gareth Rees

50.1k
3
130
211

Having a suite of unit tests is excellent, better than 99% of submissions to Code Review.

Using assert for the unit tests means that you don't get much information out of a test case failure. For example, if I artifically introduce a bug into the code, then I might get this test output:

 ======================================================================
 FAIL: test_index_of_with_matching_patterns (cr179311.Test)
 ----------------------------------------------------------------------
 Traceback (most recent call last):
   File "cr179311.py", line 38, in test_index_of_with_matching_patterns
     assert index_of('abc', 'c') == 2
 AssertionError
 ----------------------------------------------------------------------

Since we have a test oracle, it would make a lot of sense to generate random test casesrandom test cases. This allows us to explore a much larger space of possibilities than we could do just by writing out test cases by hand.

Having a suite of unit tests is excellent, better than 99% of submissions to Code Review.

Using assert for the unit tests means that you don't get much information out of a test case failure. For example, if I introduce a bug then I might get this test output:

 ======================================================================
 FAIL: test_index_of_with_matching_patterns (cr179311.Test)
 ----------------------------------------------------------------------
 Traceback (most recent call last):
   File "cr179311.py", line 38, in test_index_of_with_matching_patterns
     assert index_of('abc', 'c') == 2
 AssertionError
 ----------------------------------------------------------------------

Since we have a test oracle, it would make a lot of sense to generate random test cases. This allows us to explore a much larger space of possibilities than we could do just by writing out test cases by hand.

Having a suite of unit tests is excellent, better than 99% of submissions to Code Review.

Using assert for the unit tests means that you don't get much information out of a test case failure. For example, if I artifically introduce a bug into the code, then I might get this test output:

 ======================================================================
 FAIL: test_index_of_with_matching_patterns (cr179311.Test)
 ----------------------------------------------------------------------
 Traceback (most recent call last):
   File "cr179311.py", line 38, in test_index_of_with_matching_patterns
     assert index_of('abc', 'c') == 2
 AssertionError
 ----------------------------------------------------------------------

Since we have a test oracle, it would make a lot of sense to generate random test cases. This allows us to explore a much larger space of possibilities than we could do just by writing out test cases by hand.

Source Link

answered Jan 7, 2018 at 11:17

Gareth Rees

50.1k
3
130
211

1. Bug

The code in the post does not work!

>>> index_of('abba', 'bba')
>>> # expected 1 but got None

Read on to see how you could have discovered this.

2. Review of the unit tests

Having a suite of unit tests is excellent, better than 99% of submissions to Code Review.

Using assert for the unit tests means that you don't get much information out of a test case failure. For example, if I introduce a bug then I might get this test output:

 ======================================================================
 FAIL: test_index_of_with_matching_patterns (cr179311.Test)
 ----------------------------------------------------------------------
 Traceback (most recent call last):
   File "cr179311.py", line 38, in test_index_of_with_matching_patterns
     assert index_of('abc', 'c') == 2
 AssertionError
 ----------------------------------------------------------------------

At this point it would be helpful to know what the incorrect value was. The unittest.TestCase class has a bunch of assert methods that produce better output. For example, if I replace the line:

    assert index_of('abc', 'c') == 2

with:

    self.assertEqual(index_of('abc', 'c'), 2)

then the test failure message includes the incorrect value:

    ======================================================================
    FAIL: test_index_of_with_matching_patterns (cr179311.Test)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "cr179311.py", line 38, in test_index_of_with_matching_patterns
        self.assertEqual(index_of('abc', 'c'), 2)
    AssertionError: -1 != 2
    ----------------------------------------------------------------------

Each unit test knows its expected result. But you had to type these in by hand and could have made a mistake. It would be more reliable if the expected result were computed. In this case we could use the built-in str.find method, for example:
```
 self.assertEqual(index_of('abc', 'c'), 'abc'.find('c'))
```

This approach is known as using a test oracle.

The unit tests are very repetitive. Every test has the form:
```
 assert index_of(haystack, needle) == expected_index
```

or if you follow my suggestions above,

    self.assertEqual(index_of(haystack, needle), str.find(haystack, needle))

The repetition could be avoided by making the tests data-driven, like this:

    class TestIndexOf(unittest.TestCase):
        CASES = [
            # Positive cases
            ('abc', ''),             # all strings contain empty string
            ('abc', 'a'),            # single letters are easy
            ('abc', 'b'),
            ('abc', 'c'),
            ('abc', 'ab'),           # multiple letters are harder
            ('abc', 'bc'),
            ('abc', 'abc'),          # all strings contain themselves
            ('aaa', 'a'),            # multiple occurrences
            ('aaa', 'aa'),           # overlapping pattern
            ('thisisabsolutelycrazy', 't'),
            ('abcdef', 'abcdef'),    # all strings contain themselves
            ('ababc', 'abc'),        # overlapping prefix
            ('bananas', 'nas'),      # overlapping prefix
            ('abcabcabc', 'abc'),    # multiple occurrences
            ('abcabcab', 'abc'),     # multiple occurrences
            ('abcabcdef', 'abcd'),   # overlapping prefix
            ('abcabcdef', 'abcdef'), # overlapping prefix
            ('abcabcdabcde', 'abcde'), # overlapping prefix
            ('abcabcdabcde', 'abcd'), # multiple occurrences, overlapping prefix
            ('abra cadabra', 'abra'), # multiple occurrences
            ('abra cadabra', 'adab'), # overlapping prefix

            # Negative cases
            ('abc', 'z'),           # remember to test other letters
            ('abc', 'ac'),          # important to test close cases
            ('abc', 'az'),          # first letter, but not last
            ('abc', 'abz'),         # first 2 letters, but not last
        ]

        def check(self, haystack, needle):
            expected = haystack.find(needle)
            if expected == -1:
                expected = None
            self.assertEqual(index_of(haystack, needle), expected,
                             "index_of({!r}, {!r})".format(haystack, needle))

        def test_cases(self):
            for haystack, needle in self.CASES:
                self.check(haystack, needle)

This makes it easy to add more test cases.

Since we have a test oracle, it would make a lot of sense to generate random test cases. This allows us to explore a much larger space of possibilities than we could do just by writing out test cases by hand.

To implement this, it's convenient to have a function for choosing a random substring. (See this answer on Stack Overflow for an explanation.)

    from math import sqrt
    import random

    def random_substring(string):
        "Return a non-empty substring of string, chosen uniformly at random."
        n = len(string)
        k = random.randrange(n * (n + 1) // 2)
        stop = int((sqrt(8 * k + 1) - 1) / 2) + 1
        start = k - stop * (stop - 1) // 2
        return string[start:stop]

Now it's easy to generate random (positive) test cases:

    def test_random(self):
        for n in range(1, 100):
            haystack = ''.join(random.choice('abc') for _ in range(n))
            needle = random_substring(haystack)
            self.check(haystack, needle)

And if we run these randomly generated test cases, we soon get a test failure as described in §1 above:

    ======================================================================
    FAIL: test_random (cr179311.TestIndexOf)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "cr179311.py", line 91, in test_random
        self.check(haystack, needle)
      File "cr179311.py", line 81, in check
        "index_of({!r}, {!r})".format(haystack, needle))
    AssertionError: None != 3 : index_of('cacaaaaaa', 'aaaaaa')
    ----------------------------------------------------------------------

Stack Exchange Network

Return to Answer

1. Bug

2. Review of the unit tests