The new-in-10.1 Sequence*
family of functions should provide newly optimized methods for handling sequence problems. Happily in some cases they do! For example Murta reports 28 times faster performance from SequencePosition
than a ReplaceList
alternative.
However there also seem to be major problems in other cases.
Alternatives
When looking for multiple subsequences using Alternatives
performance degrades.
Let's try counting the appearances of several short sequences within an integer list. Here are three functions to do the same operation, one based on StringCount
, one on ReplaceList
, and one on the new SequenceCount
.
pat = {{1, 1}, {1, 2}, {3, 2, 1}, {2, 2, 1, 0}};
ts = ToString @ Row[#, ","] &;
f1 = StringCount[ts@#, ts /@ pat, Overlaps -> True] &;
f2 = Length @ ReplaceList[#, ({___, ##, ___} -> 1) & @@@ pat] &;
f3 = SequenceCount[#, Alternatives @@ pat, Overlaps -> True] &;
We can see that their results agree:
foo = RandomInteger[3, 500];
#[foo] & /@ {f1, f2, f3}
{57, 57, 57}
The string method has the overhead of converting the list and sublists strings, and ReplaceList
first builds a list of results before counting them with Length
. Surely SequenceCount
should be fastest being written for this exact operation and without superfluous overhead. However that is far from the case.
(First you will need to patch the broken-in-10.1.0 BenchmarkPlot
if you intend to follow along.)
Needs["GeneralUtilities`"]
BenchmarkPlot[{f1, f2, f3}, RandomInteger[3, #] &, TimeConstraint -> 15]
This is catastrophically poor performance from SequenceCount
! (Note the log-log scale.)
Its complexity is through the roof at apparently $O(n^3)$.
Let us introduce a fourth function also using SequenceCount
but eliminating Alternatives
:
f4 = Sum[SequenceCount[#, ss, Overlaps -> True], {ss, pat}] &;
Note that this will scan the list once for each list in pat
so theoretically it should be proportionately slower, however:
BenchmarkPlot[{f1, f2, f3, f4}, RandomInteger[3, #] &, TimeConstraint -> 15]
This gives us $O(n)$ complexity with a lower coefficient than f1
!
SequenceCases
and SequencePosition
exhibit similar behavior regarding Alternatives
.
What is causing this and is there another solution besides scanning the list separately for each subsequence?
Plain patterns
Unlike the literal sequences in the first section the introduction of patterns adds potential complication for matching. For example pattern tests could potentially have side effects. (See: PatternTest not optimized?)
However plain patterns should be handled efficiently. As evidence we can again compare string performance to that of the new Sequence*
functions. First with a fixed-length pattern:
BenchmarkPlot[
{StringPosition[ToString@Row@#, "11" ~~ _ ~~ "0" ~~ _] &,
SequencePosition[#, {1, 1, _, 0, _}] &},
RandomInteger[3, #] &
]
And now with a variable-length pattern:
BenchmarkPlot[
{StringPosition[ToString@Row@#, "11" ~~ __ ~~ "0"] &,
SequencePosition[#, {1, 1, __, 0}] &},
RandomInteger[3, #] &
]
Pretty much catastrophic again. Unlike the Alternatives
issue there doesn't appear to be a work-around so despite the new functions I expect that when working with patterns I will still be recasting problems in terms of String functions which frankly is a shame.
Is there any justification for the poor performance I am witnessing or is this yet another function that was pushed out the door too soon?
Sequence*
functions are implemented in top level, so they were never going to be much faster than using other functions. However, they should not be an order of magnitude slower. I've filed a report internally to get this looked at. $\endgroup$f3
is now equal tof4
in yourAlternatives
example, and for plain patterns (your second example)SequenceCases
will have the same complexity but lower coefficient thanStringPosition
(and hence faster forn
greater than about 100 in my testing). $\endgroup$