43
$\begingroup$

The new-in-10.1 Sequence* family of functions should provide newly optimized methods for handling sequence problems. Happily in some cases they do! For example Murta reports 28 times faster performance from SequencePosition than a ReplaceList alternative.

However there also seem to be major problems in other cases.

Alternatives

When looking for multiple subsequences using Alternatives performance degrades.

Let's try counting the appearances of several short sequences within an integer list. Here are three functions to do the same operation, one based on StringCount, one on ReplaceList, and one on the new SequenceCount.

pat = {{1, 1}, {1, 2}, {3, 2, 1}, {2, 2, 1, 0}};

ts = ToString @ Row[#, ","] &;

f1 = StringCount[ts@#, ts /@ pat, Overlaps -> True] &;
f2 = Length @ ReplaceList[#, ({___, ##, ___} -> 1) & @@@ pat] &;
f3 = SequenceCount[#, Alternatives @@ pat, Overlaps -> True] &;

We can see that their results agree:

foo = RandomInteger[3, 500];
#[foo] & /@ {f1, f2, f3}
{57, 57, 57}

The string method has the overhead of converting the list and sublists strings, and ReplaceList first builds a list of results before counting them with Length. Surely SequenceCount should be fastest being written for this exact operation and without superfluous overhead. However that is far from the case.

(First you will need to patch the broken-in-10.1.0 BenchmarkPlot if you intend to follow along.)

Needs["GeneralUtilities`"]

BenchmarkPlot[{f1, f2, f3}, RandomInteger[3, #] &, TimeConstraint -> 15]

enter image description here

This is catastrophically poor performance from SequenceCount! (Note the log-log scale.)
Its complexity is through the roof at apparently $O(n^3)$.

Let us introduce a fourth function also using SequenceCount but eliminating Alternatives:

f4 = Sum[SequenceCount[#, ss, Overlaps -> True], {ss, pat}] &;

Note that this will scan the list once for each list in pat so theoretically it should be proportionately slower, however:

BenchmarkPlot[{f1, f2, f3, f4}, RandomInteger[3, #] &, TimeConstraint -> 15]

enter image description here

This gives us $O(n)$ complexity with a lower coefficient than f1!

SequenceCases and SequencePosition exhibit similar behavior regarding Alternatives.

What is causing this and is there another solution besides scanning the list separately for each subsequence?

Plain patterns

Unlike the literal sequences in the first section the introduction of patterns adds potential complication for matching. For example pattern tests could potentially have side effects. (See: PatternTest not optimized?)

However plain patterns should be handled efficiently. As evidence we can again compare string performance to that of the new Sequence* functions. First with a fixed-length pattern:

BenchmarkPlot[
 {StringPosition[ToString@Row@#, "11" ~~ _ ~~ "0" ~~ _] &,
  SequencePosition[#, {1, 1, _, 0, _}] &},
 RandomInteger[3, #] &
]

enter image description here

And now with a variable-length pattern:

BenchmarkPlot[
 {StringPosition[ToString@Row@#, "11" ~~ __ ~~ "0"] &,
  SequencePosition[#, {1, 1, __, 0}] &},
 RandomInteger[3, #] &
]

enter image description here

Pretty much catastrophic again. Unlike the Alternatives issue there doesn't appear to be a work-around so despite the new functions I expect that when working with patterns I will still be recasting problems in terms of String functions which frankly is a shame.

Is there any justification for the poor performance I am witnessing or is this yet another function that was pushed out the door too soon?

$\endgroup$
9
  • 6
    $\begingroup$ Disappointment strikes again... $\endgroup$ Commented May 13, 2015 at 7:16
  • 3
    $\begingroup$ The Sequence* functions are implemented in top level, so they were never going to be much faster than using other functions. However, they should not be an order of magnitude slower. I've filed a report internally to get this looked at. $\endgroup$
    – Stefan R
    Commented May 15, 2015 at 15:56
  • 3
    $\begingroup$ It will still be in top level. In the latest builds, the performance of f3 is now equal to f4 in your Alternatives example, and for plain patterns (your second example) SequenceCases will have the same complexity but lower coefficient than StringPosition (and hence faster for n greater than about 100 in my testing). $\endgroup$
    – Stefan R
    Commented May 19, 2015 at 19:17
  • 5
    $\begingroup$ On version 10.3: Alternatives, Plain patterns fixed-length, and Plain patterns variable-length. $\endgroup$
    – Karsten7
    Commented Oct 18, 2015 at 10:16
  • 6
    $\begingroup$ No performance changes in version 11.0 compared to 10.3. $\endgroup$
    – Karsten7
    Commented Aug 16, 2016 at 3:53

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.