The task is to find the longest common substring of a number of long strings. I'm trying to find the best algorithm that doesn't use suffix trees/arrays (as I don't know anything about them yet).
The idea was to use binary search to find the length of the desired substring, instead of simply trying length n, then n - 1, and so on.
My code looks like this:
import Data.List (intersect)

-- find all the substrings of length n
ngrams :: Int -> String -> [String]
ngrams n s
  | n > length s = []
  | otherwise    = take n s : ngrams n (drop 1 s)
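To show what ngrams produces, here is a standalone copy of it with its expected output (verified in GHCi):

```haskell
-- standalone copy of ngrams, sliding a window of length n over the string
ngrams :: Int -> String -> [String]
ngrams n s
  | n > length s = []
  | otherwise    = take n s : ngrams n (drop 1 s)

-- ghci> ngrams 2 "abcd"
-- ["ab","bc","cd"]
-- ghci> ngrams 5 "abcd"
-- []
```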
-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
  where
    -- find the substrings of length n that are common to all strings
    commonSubstrings n = foldr1 intersect (map (ngrams n) xs)

    go l r
      -- the binary search has ended, pick one common substring
      | r - l == 1 = head $ commonSubstrings l
      | otherwise  = case commonSubstrings m of
          [] -> go l m -- the length m is too big, try a smaller one
          _  -> go m r -- try a longer length to find a longer substring
      where
        m = (l + r) `div` 2 -- the middle point
It runs in around 3 s on my dataset (~100 strings of length ~1000), which seems slow to me. Is there any way to clean up and speed up the code? And is there a better way to approach this problem (apart from suffix trees) in general?
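One thing I suspect, but haven't benchmarked: `intersect` on lists is quadratic in the number of n-grams, so replacing it with `Data.Set` intersection should be asymptotically cheaper. A sketch of that variant (the name `commonSubstringsSet` is mine, and `Set.toList` returns the results in sorted order rather than the original order):

```haskell
import qualified Data.Set as Set

-- same windowing helper as above
ngrams :: Int -> String -> [String]
ngrams n s
  | n > length s = []
  | otherwise    = take n s : ngrams n (drop 1 s)

-- intersect the strings' n-gram *sets* instead of their n-gram lists
commonSubstringsSet :: Int -> [String] -> [String]
commonSubstringsSet n xs =
  Set.toList (foldr1 Set.intersection (map (Set.fromList . ngrams n) xs))
```

The binary search would then call `commonSubstringsSet n xs` in place of `commonSubstrings n`.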