The task is to find the longest common substring of a number of long strings. I'm trying to find the best algorithm that doesn't use suffix trees/arrays (as I don't know anything about them yet).
The idea was to use binary search to find the length of the desired substring, instead of simply trying length n, then n - 1, and so on.
My code looks like this:
import Data.List (intersect)

-- find all the substrings of length n
ngrams :: Int -> String -> [String]
ngrams n s
  | n > length s = []
  | otherwise    = take n s : ngrams n (drop 1 s)
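To show what ngrams produces, here is a standalone copy of it with its expected output (verified in GHCi):

```haskell
-- standalone copy of ngrams, sliding a window of length n over the string
ngrams :: Int -> String -> [String]
ngrams n s
  | n > length s = []
  | otherwise    = take n s : ngrams n (drop 1 s)

-- ghci> ngrams 2 "abcd"
-- ["ab","bc","cd"]
-- ghci> ngrams 5 "abcd"
-- []
```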
-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
  where
    -- find the substrings of length n that are common to all strings
    commonSubstrings n = foldr1 intersect (map (ngrams n) xs)

    go l r
      -- the binary search has ended, pick one common substring
      | r - l == 1 = head $ commonSubstrings l
      | otherwise  = case commonSubstrings m of
          [] -> go l m -- the length m is too big, try a smaller one
          _  -> go m r -- try a longer length to find a longer substring
      where
        m = (l + r) `div` 2 -- the middle point
It runs in around 3 s on my dataset (~100 strings of length ~1000), which seems slow to me. Is there any way to clean up and speed up the code? And is there a better way to approach this problem (apart from suffix trees) in general?
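One thing I suspect, but haven't benchmarked: `intersect` on lists is quadratic in the number of n-grams, so replacing it with `Data.Set` intersection should be asymptotically cheaper. A sketch of that variant (the name `commonSubstringsSet` is mine, and `Set.toList` returns the results in sorted order rather than the original order):

```haskell
import qualified Data.Set as Set

-- same windowing helper as above
ngrams :: Int -> String -> [String]
ngrams n s
  | n > length s = []
  | otherwise    = take n s : ngrams n (drop 1 s)

-- intersect the strings' n-gram *sets* instead of their n-gram lists
commonSubstringsSet :: Int -> [String] -> [String]
commonSubstringsSet n xs =
  Set.toList (foldr1 Set.intersection (map (Set.fromList . ngrams n) xs))
```

The binary search would then call `commonSubstringsSet n xs` in place of `commonSubstrings n`.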