Algorithm to find the smallest snippet from searching a document?

Question

I've been going through Skiena's excellent "The Algorithm Design Manual" and got hung up on one of the exercises.

The question is: "Given a search string of three words, find the smallest snippet of the document that contains all three of the search words—i.e. , the snippet with smallest number of words in it. You are given the index positions where these words in occur search strings, such as word1: (1, 4, 5), word2: (4, 9, 10), and word3: (5, 6, 15). Each of the lists are in sorted order, as above."

Anything I come up with is O(n^2)... This question is in the "Sorting and Searching" chapter, so I assume there is a simple and clever way to do it. I'm trying something with graphs right now, but that seems like overkill.

Ideas? Thanks

I need a better illustration. If I understood this right, then I would line things up as: 1,4,4,5,5,6,9,10,15. This gives range 1...15. Now ask: can I throw out 1 to shrink this range? Can I throw out 15? etc. Make it greedy and every time approach from both left and right, so you would need a doubly-linked list probably ... Hopefully answering each question is sub-linear. Data structure might be a bit non-trivial. By the way, how do you define n in this case? P.S. I wish I had to do stuff like that at work more frequently. — Hamish Grubijan
– Hamish Grubijan, Commented Jun 2, 2010 at 2:44
this is exactly how the problem is stated in the book. One thing I don't understand... how word1 and word2 can be both at index 4 and not be the same word. I wonder if this is a careless typo in the problem. — Paul C
– Paul C, Commented Jun 2, 2010 at 2:52
I don't think I understand this question. If the sorted arrays are the index positions where the words occur, why are word1 and word2 both at index 4 and word2 and 3 at index 5? — user85509
– user85509, Commented Jun 2, 2010 at 2:52
Perhaps they consider words like "hell", "hello" and the index is not the index of the word, but the index of the character where the word begins? In that case, my answer needs to be modified a little I guess. — Aryabhatta
– Aryabhatta, Commented Jun 2, 2010 at 2:59
possible duplicate of Google search results: How to find the minimum window that contains all the search keywords? — AnT stands with Russia
– AnT stands with Russia, Commented Jun 2, 2010 at 22:32

Yuval Cohen · Accepted Answer · 2010-06-09 15:19:35Z

Unless I've overlooked something, here's a simple, O(n) algorithm:

We'll represent the snippet by (x, y) where x and y are where the snippet begins and ends respectively.
A snippet is feasible if it contains all 3 search words.
We will start with the infeasible snippet (0,0).
Repeat the following until y reaches end-of-string:
1. If the current snippet (x, y) is feasible, proceed to the snippet (x+1, y)
  Else (the current snippet is infeasible) proceed to the snippet (x, y+1)
Choose the shortest snippet among all feasible snippets we went through.

Running time - in each iteration either x or y is increased by 1, clearly x can't exceed y and y can't exceed string length so total number of iterations is O(n). Also, feasibility can be checked at O(1) in this case since we can track how many occurences of each word are within the current snippet. We can maintain this count at O(1) with each increase of x or y by 1.

Correctness - For each x, we calculate the minimal feasible snippet (x, ?). Thus we must go over the minimal snippet. Also, if y is the smallest y such that (x, y) is feasible then if (x+1, y') is a feasible snippet y' >= y (This bit is why this algorithm is linear and the others aren't).

This is correct yes, and a simpler algorithm than I could think of :-). Notice that it is O(N), if the number of terms is considered constant. If not, than it is O(N log M), where M is the number of search terms, since your algorithms implies a merge of all the sorted lists as a preprocessing step, and that is O(N log M).

Community · Accepted Answer · 2017-05-23 12:25:53Z

7

I already posted a rather straightforward algorithm that solves exactly that problem in this answer

Google search results: How to find the minimum window that contains all the search keywords?

However, in that question we assumed that the input is represented by a text stream and the words are stored in an easily searchable set.

In your case the input is represented slightly differently: as a bunch of vectors with sorted positions for each word. This representation is easily transformable to what is needed for the above algorithm by simply merging all these vectors into a single vector of (position, word) pairs ordered by position. It can be done literally, or it can be done "virtually", by placing the original vectors into the priority queue (ordered in accordance with their first elements). Popping an element from the queue in this case means popping the first element from the first vector in the queue and possibly sinking the first vector into the queue in accordance with its new first element.

Of course, since your statement of the problem explicitly fixes the number of words as three, you can simply check the first elements of all three arrays and pop the smallest one at each iteration. That gives you a O(N) algorithm, where N is the total length of all arrays.

Also, your statement of the problem seems to suggest that target words can overlap in the text, which is rather strange (given that you use the term "word"). Is it intentional? In any case, it doesn't present any problem for the above linked algorithm.

edited May 23, 2017 at 12:25

CommunityBot

11 silver badge

answered Jun 2, 2010 at 22:40

AnT stands with Russia

323k44 gold badges548 silver badges793 bronze badges

10 Comments

Amichai Over a year ago

I disagree with your claim that the time complexity of your algorithm is O(N). You're neglecting the computations necessary to determine the smallest value of the different search word arrays (which is necessary to know which value to pop before the next computation) and the computations necessary to explicitly determine the length of the snippet given the smallest values from all the search word arrays (which is a constant value proportional to the amount of search words being computed). When these factors are taken into consideration, I think my algorithm is the best that you can do.

AnT stands with Russia Over a year ago

@Ami: I stated in my linked post that the complexity is O(N log M), where N is the total number of occurrences of our keywords and M is the number of different keywords. However, I explicitly explained in this post that if the statement of the problem explicitly fixes M as a constant (3 in this case), the algorithm becomes a O(N) algorithm for obvious reasons.

AnT stands with Russia Over a year ago

@Ami: As for your algorithm... I don't really see a complete algorithm is your post, just a bunch of ideas, but I think it is pretty obvious that my algorithm is nothing else than a compact practical single-pass implementation of what you essentially propose in your post. I don't see though how you managed to obtain O(N log N) for your algorithm, which is notably worse than my O(N log M)

Amichai Over a year ago

@Andrey: I want to respond to the critisms you raised about my algorithm (perhaps I should rewrite my answer to make it more clear), but before I do that let me clarify the point I was making earlier. In your linked post you explained the O(log M) as "the complexity of checking whether the current word belongs to the keyword set." You correctly pointed out in your answer above that there is no need to make a check like that in our case because the indecies given in this problem are already known to be part of the keyword set. I claim that you're neglecting a different set of computations....

Amichai Over a year ago

@Andrey part2: Given M different sets of indicies, your algorithm requires deterimining a snippet length from the M smallest index values then popping the smallest index. How does your algorithm know which of the M indecies is smalllest index which is to be popped? To answer this question your algorithm has to iterate through all M indecies and this process is repeated exactly N times. For this reason I think your algorithm is operating at O(NM) not O(N). O(NM) is not ostensibly worse than O(N log N)....

|

8 revs · Accepted Answer · 2010-06-03 02:09:10Z

From the question, it seems that you're given the index locations for each of your n “search words” (word1, word2, word3, ..., word n) in the document. Using a sorting algorithm, the n independent arrays associated with search words can readily be represented as a single array of all the index locations in ascending numerical order and a word label associated with each index in the array (the index array).

The Basic Algorithm:

(Designed to work whether or not the poster of this question intended to allow two different search words to coexist at the same index number.)

First, we define a simple function for measuring the length of a snippet that contains all n labels given a starting point in the index array. (It is obvious from the definition of our array that any starting point on the array will necessarily be the indexed location of one of the n search labels.) The function simply keeps track of the unique search labels seen as the function iterates through the elements in the array until all n labels have been observed. The length of the snippet is defined as the difference between the index of the last unique label found and the index of the starting point in the index array (the first unique label found). If all n labels aren't observed before the end of the array the function returns a null value.

Now, the snippet length function can be run for each element in your array to associate a snippet size containing all n search words starting from each element in the array. The smallest non-Null value returned by the snippet length function over the whole index array is the snippet in your document that you're looking for.

Necessary Optimizations:

Keep track of the value of the current shortest snippet length so that the value will be know immediately after iterating once through the index array.
When iterating through your array terminate the snippet length function if the current snippet under inspection ever surpasses the length of the shortest snippet length previously seen.
When the snippet length function returns null for not locating all n search words in the remaining index array elements, associate a null snippet length to all successive elements in the index array.
If the snippet length function is applied to a word label and the label immediately following it is identical to the starting label, assign a null value to the starting label and move on to the next label.

Computational Complexity:

Obviously the sorting part of the algorithm can be arranged in O(n log n).

Here's how I would work out the time complexity of the second part of the algorithm (any critiques and corrections would be greatly appreciated).

In the best case scenario, the algorithm only applies the snippet length function to the first element in the index array and finds that no snippet containing all the search words exists. This scenario would be computed in just n calculations where n is the size of the index array. Slightly worse than that is if the smallest snippet turns out to be equal to the size of the whole array. In this case the computational complexity will be a little less than 2 n (once through the array to find the smallest snippet length, a second time to demonstrate that no other snippets exist). The shorter the average computed snippet length, the more times the snippet length function will need to be applied over the index array. We can assume that our worse case scenario will be the case where the snippet length function needs to be applied to every element in the index array. To develop a case where the function will be applied to every element in the index array we need to design an index array where the average snippet length over the whole index array is negligible in comparison to the size of the index array as a whole. Using this case we can write out our computational complexity as O(C n) where C is some constant that is significantly smaller then n. Giving a final computational complexity of:

O(n log n + C n)

Where:

C << n

Edit:

AndreyT correctly points out that instead of sorting the word indicies in n log n time, one might just as well merge them (since the sub arrays are already sorted) in n log m time where m is the amount of search word arrays to be merged. This will obviously speed up the algorithm is cases where m < n.

Kaue Silveira · Accepted Answer · 2011-10-24 14:40:33Z

O(n log k) solution, where n is the total number of indices and k is the number of words. The idea is to use a heap to identify the smallest index at each iteration, while also keeping track of the maximum index in the heap. I also put the coordinates of each value in the heap, in order to be able to retrieve the next value in constant time.

#include <algorithm>
#include <cassert>
#include <limits>
#include <queue>
#include <vector>

using namespace std;

int snippet(const vector< vector<int> >& index) {
    // (-index[i][j], (i, j))
    priority_queue< pair< int, pair<size_t, size_t> > > queue;
    int nmax = numeric_limits<int>::min();
    for (size_t i = 0; i < index.size(); ++i) {
        if (!index[i].empty()) {
            int cur = index[i][0];
            nmax = max(nmax, cur);
            queue.push(make_pair(-cur, make_pair(i, 0)));
        }
    }
    int result = numeric_limits<int>::max();
    while (queue.size() == index.size()) {
        int nmin = -queue.top().first;
        size_t i = queue.top().second.first;
        size_t j = queue.top().second.second;
        queue.pop();
        result = min(result, nmax - nmin + 1);
        j++;
        if (j < index[i].size()) {
            int next = index[i][j];
            nmax = max(nmax, next);
            queue.push(make_pair(-next, make_pair(i, j)));
        }
    }
    return result;
}

int main() {
    int data[][3] = {{1, 4, 5}, {4, 9, 10}, {5, 6, 15}};
    vector<vector<int> > index;
    for (int i = 0; i < 3; i++) {
        index.push_back(vector<int>(data[i], data[i] + 3));
    }
    assert(snippet(index) == 2);
}

alampada · Accepted Answer · 2014-09-13 17:26:07Z

Sample implementation in java (tested only with the implementation in the example, there might be bugs). The implementation is based on the replies above.

import java.util.Arrays;


public class SmallestSnippet {
    WordIndex[] words; //merged array of word occurences

    public enum Word {W1, W2, W3};

    public SmallestSnippet(Integer[] word1, Integer[] word2, Integer[] word3) {
        this.words = new WordIndex[word1.length + word2.length + word3.length];
        merge(word1, word2, word3);
        System.out.println(Arrays.toString(words));
    }

    private void merge(Integer[] word1, Integer[] word2, Integer[] word3) {
        int i1 = 0;
        int i2 = 0;
        int i3 = 0;
        int wordIdx = 0;
        while(i1 < word1.length || i2 < word2.length || i3 < word3.length) {
            WordIndex wordIndex = null;
            Word word = getMin(word1, i1, word2, i2, word3, i3);
            if (word == Word.W1) {
                wordIndex = new WordIndex(word, word1[i1++]);
            }
            else if (word == Word.W2) {
                wordIndex = new WordIndex(word, word2[i2++]);
            }
            else {
                wordIndex = new WordIndex(word, word3[i3++]);
            }
            words[wordIdx++] = wordIndex;
        }       
    }

    //determine which word has the smallest index
    private Word getMin(Integer[] word1, int i1, Integer[] word2, int i2, Integer[] word3,
            int i3) {
        Word toReturn = Word.W1;
        if (i1 == word1.length || (i2 < word2.length && word2[i2] < word1[i1])) {
            toReturn  = Word.W2;
        }
        if (toReturn == Word.W1 && i3 < word3.length && word3[i3] < word1[i1])
        {
            toReturn = Word.W3;
        }
        else if (toReturn == Word.W2){
            if (i2 == word2.length || (i3 < word3.length && word3[i3] < word2[i2])) {
                toReturn = Word.W3;
            }
        }
        return toReturn;
    }

    private Snippet calculate() {
        int start = 0;
        int end = 0;
        int max = words.length;
        Snippet minimum = new Snippet(words[0].getIndex(), words[max-1].getIndex());
        while (start < max)
        {
            end = start;
            boolean foundAll = false;
            boolean found[] = new boolean[Word.values().length];
            while (end < max && !foundAll) {
                found[words[end].getWord().ordinal()] = true;
                boolean complete = true;
                for (int i=0 ; i < found.length && complete; i++) {
                    complete = found[i];
                }
                if (complete)
                {
                    foundAll = true;
                }
                else {
                    if (words[end].getIndex()-words[start].getIndex() == minimum.getLength())
                    {
                        // we won't find a minimum no need to search further
                        break;
                    }
                    end++;
                }
            }
            if (foundAll && words[end].getIndex()-words[start].getIndex() < minimum.getLength()) {
                minimum.setEnd(words[end].getIndex());
                minimum.setStart(words[start].getIndex());
            }
            start++;
        }
        return minimum;

    }


    /**
     * @param args
     */
    public static void main(String[] args) {
        Integer[] word1 = {1,4,5};
        Integer[] word2 = {3,9,10};
        Integer[] word3 = {2,6,15};
        SmallestSnippet smallestSnippet = new SmallestSnippet(word1, word2, word3);
        Snippet snippet = smallestSnippet.calculate();
        System.out.println(snippet);

    }
}

Helper classes:

public class Snippet {

    private int start;

    private int end;

//getters, setters etc

    public int getLength()
    {
        return Math.abs(end - start);
    }
}



public class WordIndex
{
    private SmallestSnippet.Word word;
    private int index;
    public WordIndex(SmallestSnippet.Word word, int index) {

        this.word = word;
        this.index = index;
    }
}

Abhijit Sarkar · Accepted Answer · 2019-01-30 05:56:56Z

The other answers are alright, but like me, if you're having trouble understanding the question in the first place, those aren't really helpful. Let's rephrase the question:

Given three sets of integers (call them A, B, and C), find the minimum contiguous range that contains one element from each set.

There is some confusion about what the three sets are. The 2nd edition of the book states them as {1, 4, 5}, {4, 9, 10}, and {5, 6, 15}. However, another version that has been stated in a comment above is {1, 4, 5}, {3, 9, 10}, and {2, 6, 15}. If one word is not a suffix/prefix of another, version 1 isn't possible, so let's go with the second one.

Since a picture is worth a thousand words, lets plot the points:

Simply inspecting the above visually, we can see that there are two answers to this question: [1,3] and [2,4], both of size 3 (three points in each range).

Now, the algorithm. The idea is to start with the smallest valid range, and incrementally try to shrink it by moving the left boundary inwards. We will use zero-based indexing.

MIN-RANGE(A, B, C)
  i = j = k = 0
  minSize = +∞

  while i, j, k is a valid index of the respective arrays, do
    ans = (A[i], B[j], C[k])
    size = max(ans) - min(ans) + 1
    minSize = min(size, minSize)
    x = argmin(ans)
    increment x by 1
  done

  return minSize

where argmin is the index of the smallest element in ans.

+---+---+---+---+--------------------+---------+
| n | i | j | k | (A[i], B[j], C[k]) | minSize |
+---+---+---+---+--------------------+---------+
| 1 | 0 | 0 | 0 | (1, 3, 2)          | 3       |
+---+---+---+---+--------------------+---------+
| 2 | 1 | 0 | 0 | (4, 3, 2)          | 3       |
+---+---+---+---+--------------------+---------+
| 3 | 1 | 0 | 1 | (4, 3, 6)          | 4       |
+---+---+---+---+--------------------+---------+
| 4 | 1 | 1 | 1 | (4, 9, 6)          | 6       |
+---+---+---+---+--------------------+---------+
| 5 | 2 | 1 | 1 | (5, 9, 6)          | 5       |
+---+---+---+---+--------------------+---------+
| 6 | 3 | 1 | 1 |                    |         |
+---+---+---+---+--------------------+---------+

n = iteration

At each step, one of the three indices is incremented, so the algorithm is guaranteed to eventually terminate. In the worst case, i, j, and k are incremented in that order, and the algorithm runs in O(n^2) (9 in this case) time. For the given example, it terminates after 5 iterations.

george · Accepted Answer · 2011-02-18 02:35:13Z

1

O(n)

Pair find(int[][] indices) {
pair.lBound = max int;
pair.rBound = 0;
index = 0;

for i from 0 to indices.lenght{
    if(pair.lBound > indices[i][0]){
        pair.lBound = indices[i][0]
        index = i;
    }
    if(indices[index].lenght > 0)
        pair.rBound = max(pair.rBound, indices[i][0])
}
remove indices[index][0]

return min(pair, find(indices)}

answered Feb 18, 2011 at 2:35

george

111 bronze badge

Collectives™ on Stack Overflow

Algorithm to find the smallest snippet from searching a document?

7 Answers 7

1 Comment

10 Comments

Comments

Comments

Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

1 Comment

10 Comments

Comments

Comments

Comments

Comments

Comments

Linked

Related