For some context, I'm trying to write a streaming tokenizer in Haskell and noticed that I was writing a lot of functions with the type signature String -> Maybe Int that attempted to consume "the longest possible token matching a particular pattern" from the input stream. The pattern might be something like a quoted string literal in JSON.
I tried using Parsec for a while to get a "prefix matcher", but I kept getting tripped up over what the appropriate userstate is supposed, and what the m argument in ParsecT means. Parsec is probably a strict superset of the functionality that this little library exposes.
A "Matcher" is anything with the type String -> Maybe Int. Matchers are supposed to be greedy and produce the longest prefix that they possibly can. There's also an unenforced property that, in cases where the match is a proper prefix of the input string, adding more characters to the end of a string doesn't extend the prefix. The library also encourages / permits users to write sloppy grammars (for instance using symOrExn) that fail in ugly ways.
I could use some helping figuring out how to generalize String -> Maybe Int. I don't know what class I should insist on for the input Eq a => [a] is the first thing that comes to mind, but it might be possible to make something more general than that. As for the output type maybe a type that's constrained to be a monoid or a group? I really just need a notion of zero and the ability to increment for the index. ... although it might also be possible to generalize the notion of a "prefix" to tree-like structures.
I'm hoping to structure this library eventually as a collection of really concrete implementations for a handful of commonly-used text-like types, as well as a generic implementation that can be used in other circumstances. Advice on how to generalize the library is much appreciated.
module Text.PrefixMatcher.String where
import Data.List
-- A simple non-generic prefix matching library
-- with a few simple combinators
--
-- a Matcher has type (String -> Maybe Int)
--
-- It takes a String and returns a number corresponding
-- to the length of the matched prefix.
--
-- The intended use case here is to provide a simple utility
-- to help in converting a string to a stream of tokens.
--
-- makeMatcher -- make a matcher for a constant string
--
type Matcher = String -> Maybe Int
identityMatcher :: Matcher
identityMatcher str = Just 0
makeMatcher :: String -> Matcher
makeMatcher spec str = case isPrefixOf spec str of
True -> Just (length spec)
False -> Nothing
andThen :: Matcher -> Matcher -> Matcher
andThen m1 m2 str = ltotal where
l1 = m1 str
ltotal = case l1 of
Nothing -> Nothing
(Just l1) ->
let rest = drop l1 str in
let l2 = m2 str in
case l2 of
Nothing -> Nothing
(Just l2) -> Just (l1 + l2)
symOrExn :: Matcher -> Matcher -> Matcher
symOrExn m1 m2 str = res where
l1 = m1 str
l2 = m2 str
res = case (l1, l2) of
-- good cases
(Just l1, Nothing) -> Just l1
(Nothing, Just l2) -> Just l2
-- bad cases
(Nothing, Nothing) -> Nothing
-- if you reach this case, the grammar was
-- constructed badly
(Just l1, Just l2) -> error "ambiguous match"
leftOr :: Matcher -> Matcher -> Matcher
leftOr m1 m2 str = result where
l1 = m1 str
l2 = m2 str
result = case (l1, l2) of
(Just l1, _) -> Just l1
(Nothing, Just l2) -> Just l2
(Nothing, Nothing) -> Nothing
andThenList :: [Matcher] -> Matcher
andThenList [] = identityMatcher
andThenList [x] = x
andThenList (x:xs) = andThen x (andThenList xs)
leftOrList :: [Matcher] -> Matcher
leftOrList [] = identityMatcher
leftOrList [x] = x
leftOrList (x:xs) = leftOr x (leftOrList xs)