Aug 2008
Oh yeah, we’re really getting into Digg-friendly titles now, aren’t we?
The topic of persistent vectors is one of those odd backwaters of functional programming that few dare to approach. The basic idea behind it all is to devise an immutable data structure which has the performance characteristics of a mutable vector. In practical terms, this means you shouldn’t have to deal with O(n) efficiency on random access the way you do with List(s). Instead, accessing an arbitrary index should be constant time (O(1)), as should computing the length. Additionally, modifying an arbitrary index should be reasonably efficient – as close to constant time as possible.
Of course, the word “modifying” is a trifle inaccurate as it implies a mutable data structure. One of the primary qualifications for a purely functional vector is that it is completely immutable. Any changes to the vector result in the creation of a new vector, rather than modifying the old. Basically, it’s an immutable data structure just like any other, but one which retains the brutal efficiency of its mutable counterpart.
Unfortunately, this turns out to be a very tough nut to crack. A number of researchers have attempted different strategies for solving the problem, none of which have been entirely successful. Rich Hickey, the creator of Clojure, has a brilliant presentation that essentially describes the solution I have chosen. For the impatient, the good stuff starts at about 36:00 and lasts for about ten minutes. I’ll elaborate on the problems with functional vectors a bit more in a second, but first a bit of motivational propaganda…
Thou Shalt Not Mutate
There is a single principle which should be drilled into the skulls of all programmers everywhere: mutable data structures are bad news. Don’t get me wrong, I love ArrayList as much as the next Java addict, but such structures cause serious problems, particularly where concurrency is concerned. Consider the trivial example where two threads attempt to populate an array simultaneously:
private static String[] names = new String[6];
private static int index = 0;

public static void main(String[] args) throws InterruptedException {
    Thread thread1 = new Thread() {
        public void run() {
            names[index++] = "Daniel";
            names[index++] = "Chris";
            names[index++] = "Joseph";
        }
    };
    
    Thread thread2 = new Thread() {
        public void run() {
            names[index++] = "Renee";
            names[index++] = "Bethany";
            names[index++] = "Grace";
        }
    };
    
    thread1.start();
    thread2.start();
    
    thread1.join();
    thread2.join();
    
    for (String name : names) {
        System.out.println(name);
    }
}
What does this snippet print? I don’t know. It’s actually indeterminate. Now we can guess that on most machines the result will be essentially interleaved between the two threads, but there is no way to guarantee this. Part of the reason for this is the fact that arrays are mutable. As such, they enable (and indeed, encourage) certain patterns which are highly destructive when employed asynchronously.
However, concurrency is not even the only motivation for immutable data structures. Consider the following example:
List<String> names = ...

for (String name : names) {
    names.add(name.toUpperCase());
}
I’m sure all of us have done something like this, most likely by accident. The result is (of course) a ConcurrentModificationException, caused by the fact that we are attempting to add to a List while we are iterating over its contents. I know that the first time I was faced with this error message I became extremely confused. After all, no threads are being employed, so why is this a problem?
Iterators are extremely dependent on the internal state of their data structure. Anyone who has ever implemented an iterator for a linked list or (even better) a tree will attest to this fact. This means that generally speaking, there is no way for an iterator to guarantee correctness if that structure is changing out from underneath it (so to speak). Things may be fine in a linear structure like a list, but as soon as you get into anything non-linear like a tree or a hash table it becomes difficult to even define what the “correct” behavior should be. Think about it; should the iterator backtrack to hit the missing elements? Should this backtracking include elements which have already been consumed? What if the order changes dramatically and pre-consumed elements are now ahead of the current index? There are a whole host of problems associated with iterating over mutable data structures, and so rather than vainly attempt to solve these issues in a sane and consistent manner, JDK collections simply throw an exception.
All of this becomes moot once you start using immutable data structures. There is no way to modify a structure while iterating over it because there is no way to modify the structure at all! Concurrency is not an issue because without any mutable state to require locking, every thread can operate simultaneously on the structure. Not only is it thread safe, but it is unsynchronized and thread safe. Immutable data structures retain all of the asynchronous throughput of non-locking implementations without any of the race conditions and non-determinacy which usually result.
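To see the contrast concretely, here is a minimal sketch (using the standard Scala List, nothing specific to this article) of what “adding while iterating” becomes when the structure is immutable:

val names = List("Daniel", "Chris", "Joseph")

// "Adding while iterating" is just deriving a new list; the original is
// never touched, so there is nothing for an iterator to invalidate.
val withUpper = names ::: (names map { _.toUpperCase })

println(names)       // List(Daniel, Chris, Joseph)
println(withUpper)   // List(Daniel, Chris, Joseph, DANIEL, CHRIS, JOSEPH)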
A Brief Note on Idioms
At this point, the question you must be asking yourself is: “So if the data structure is immutable, what good is it?” The answer is “for reading”. Data structures spend most of their lives being read and traversed by other code. Immutable data structures can be read in exactly the same fashion as mutable ones. The trick of course is constructing that data structure in the first place. After all, if the data structure is completely immutable, where does it come from? A simple example from a prior article is sufficient to demonstrate both aspects:
def toSet[T](list: List[T]): Set[T] = list match {
  case hd :: tail => toSet(tail) + hd
  case Nil => Set[T]()
}
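For example, toSet(List(1, 2, 2, 3)) recurses to the end of the list and builds up Set(1, 2, 3) as the stack unwinds; at no point is either collection modified in place.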
This is neither the most concise nor the most efficient way to accomplish this task. The only purpose served by this example is to illustrate that it is very possible to build up immutable data structures without undue syntactic overhead. You’ll notice that every time we want to “change” a data structure – either removing from the list or adding to the set – we use a function call and either pass or return the modified structure. In essence, the state is kept entirely on the stack, with each new version of the data structure in question becoming the “changed” version from the previous operation.
This idiom is actually quite powerful and can be applied to even more esoteric (and less uniformly iterative) operations. As long as you are willing to let execution jump from one function to the next, it becomes extremely natural to deal with such immutability. In fact, you start to think of immutable data structures as if they were in fact mutable, simply due to the fact that you are idiomatically “modifying” them at each step. Note that this pattern of modifying data between functions is critical to actor-based programming and any seriously concurrent application design.
Problems Abound
For the sake of argument, let’s assume that my pitiful arguments have convinced you to lay aside your heathen mutating ways and follow the path of functional enlightenment. Let’s also assume that you’re consumed with the desire to create an application which tracks the status of an arbitrary number of buttons. These buttons may be pressed in any order regardless of what other buttons are already pressed. Following the path of immutability and making use of the patron saint of persistent data structures (List), you might come up with the following solution:
class ButtonStrip private (buttons: List[Boolean]) {
  def this(num: Int) = {
    this((0 until num).foldLeft(List[Boolean]()) { (list, i) => false :: list })
  }
  
  def status(index: Int) = buttons(index)
  
  def push(index: Int) = modify(index, true)
  
  def unpush(index: Int) = modify(index, false)
  
  /**
   * Modify buttons list and return a new ButtonStrip with the results.
   */
  private def modify(index: Int, status: Boolean) = {
    val (_, back) = (buttons :\ (buttons.length - 1, List[Boolean]())) { (button, tuple) =>
      val (i, total) = tuple
      (i - 1, (if (i == index) status else button) :: total)
    }
    new ButtonStrip(back)
  }
}
This is a horrible mess of barely-idiomatic functional code. It’s difficult to read and nearly impossible to maintain; but it’s purely immutable! This is not how you want your code to look. In fact, this is an excellent example of what David MacIver would call “bad functional code”.
Perhaps even worse than the readability (or lack thereof) is the inefficiency of this code. It’s terribly slow for just about any sort of operation. Granted, we can imagine this only being used with a list of buttons of limited length, but it’s the principle of the thing. The fact is that we are relying on a number of operations which are extremely inefficient with lists, most prominently length and apply() (accessing an arbitrary index). Not only that, but we’re recreating the entire list every time we change the status of a single button, something which is bad for any number of reasons.
What we really need here is a random-access structure, something which allows us to access and “change” any index with some degree of efficiency. Likely the most intuitive thing to do here would be to just use a good ol’ array of Boolean(s) and make a new copy of this array any time we need to change something. Unfortunately, this is almost as bad as copying the list every time. Normally, when you use an immutable data structure, modifications do not require copying large amounts of data. Our toSet example from above uses almost zero data copying under the surface, due to the way that Set and List are implemented.
Specifically, Set and List are persistent data structures. This doesn’t mean that they live on a filesystem. Rather, the term “persistent” refers to the fact that each instance of the collection may share significant structure with another instance. For example, prepending an element onto an immutable list yields a new list which consists of the new element and a tail which is precisely the original list. Thus, each list contains its predecessor (if you will) within itself. List is an example of a fully persistent data structure; not everything can be so efficient. Set and Map, for example, are usually implemented as some sort of tree structure, and so insertions require some data copying (specifically, the parent nodes). However, this copying is minimized by the nature of the structure. This notion of persistence in the data structure works precisely because these structures are immutable. If you could change an element in a persistent data structure, it would likely result in the modification of that same element in totally disparate instances of that structure across the entire runtime (not a pleasant outcome).
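To make the sharing concrete, here is a tiny sketch using the standard List; the tail of the extended list is the original list itself, not a copy:

val original = List(2, 3, 4)
val extended = 1 :: original   // a new list wrapping the old one

// Reference equality: the new list's tail *is* the original.
println(extended.tail eq original)   // true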
So List is persistent; arrays are not. Even if we treat arrays as being completely immutable, the overhead of copying a potentially huge array on each write operation is rather daunting. What we need is some sort of data structure with the properties of an array (random access, arbitrary modification) and the persistent properties of a list. As I said before, this turns out to be a very difficult problem.
Partitioned Tries
One solution to this problem which provides a compromise between these two worlds is that of a partitioned trie (pronounced “try” by all that is sane and holy). In essence, a partitioned trie is a tree of vectors with a very high branching factor (the number of children per node). Each vector is itself a tree of vectors and so on. Note that these are not like the binary search trees that everyone had to create back in college; partitioned tries can potentially have dozens of branches per node. As it turns out, it is this unusual property which makes these structures so efficient.
To get a handle on this concept, let’s imagine that we have a trie with a branching factor of 3 (much smaller than it should be, but it’s easier to draw). Into this vector we will insert the following data:
Index | Data
------+--------
    0 | Daniel
    7 | Chris
    2 | Joseph
    4 | Renee
    1 | Bethany
    3 | Grace
    9 | Karen
   13 | Larry
    5 | Moya
After all the jumbling necessary to make this structure work, the result will look like the following:
It’s hard to see where the “trie” part of this comes into play, so bear with me. The important thing to notice here is the access time for indexes 0-2: it’s O(1). This is of course a tree, so not all nodes will be one step away from the root, but the point is that we have achieved constant-time access for at least some of the nodes. Mathematically speaking, the worst-case access time for any index n is O(log_3(n)). Not too horrible, but we can do better.
First though, we have to lay some groundwork. I said that the structures we were working with are tries rather than normal trees. A trie implies that the key for each element is encoded in the path from the root to that node. So far, it just appears that we have built a fancy tree with numeric keys and a higher branching factor than “normal”. This would be true if all we were given is the above diagram, but consider the slightly modified version below:
Structurally, nothing has changed, but most of the edges have been renumbered. It is now a bit more apparent that each node has three branches numbered from 0 to 2. Also, with a little more thought, we can put the rest of the pieces together.
Consider the “Moya” node. In our input table, this bit of data has an index of 5. To find its “index” in our renumbered trie, we follow the edges from the root down to the destination node, arriving at a value of 12. However, remember that each node has only 3 branches; intuitively, we should be thinking about base-3 math right about now. And indeed, reading 12 as a base-3 numeral yields a value of 5, indicating that the index of the node is indeed encoded in the path from the root. By the way, this works for any node (try it yourself). The path to “Karen” is 100, which read as base 3 becomes 9, the input index of the element.
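In code, reading a root-to-node path as a base-b numeral is a one-line fold. (pathToIndex is a hypothetical helper for illustration; it is not part of the implementation to come.)

// Fold the branch choices (most significant first) into a logical index.
def pathToIndex(path: List[Int], base: Int): Int =
  path.foldLeft(0) { (acc, digit) => acc * base + digit }

pathToIndex(List(1, 2), 3)      // => 5   (the "Moya" node, path 12)
pathToIndex(List(1, 0, 0), 3)   // => 9   (the "Karen" node, path 100)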
This is all fine and dandy, but we haven’t really solved our original problem yet: how to achieve constant-time access to arbitrary indexes in a persistent structure. To really approach a solution, we must increase the branching factor substantially. Rather than working with a branching factor of 3 (and thus O(log_3(n)) efficiency), let’s dial the branching factor up to 32 and see what happens.
The result is completely undiagrammable, but it does actually provide constant-time access for indexes between 0 and 31. If we were to take our example data set and insert it into our revised trie, the result would be a single layer of nodes, numbered at exactly their logical index value. In the worst case, the efficiency of our more complicated trie is O(log_32(n)). Generally speaking, we can infer (correctly) that for any branching factor b and any index n, the lookup efficiency will be precisely O(log_b(n)). As we increase the branching factor, the read-efficiency of our trie increases exponentially. To put a branching factor of 32 into perspective, this means that reaching index 1,000,000 would take only 3.986 steps! That’s incredibly small, especially given the sheer magnitude of the index in question. It’s not technically constant time, but it’s so incredibly small for all conceivable indexes that we can just pretend that it is. As Rich Hickey says:
…when it’s a small enough value, I really don’t care about it.
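For the record, that 3.986 figure is nothing more exotic than the change-of-base formula:

// log_32(n) = ln(n) / ln(32)
def lookupDepth(n: Double) = Math.log(n) / Math.log(32)

lookupDepth(1000000)   // => 3.986...: levels needed to reach index 1,000,000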
So that takes care of the reading end of life, but what about writing? After all, if all we needed was constant time lookups, we could just use an array. What we really need to take care to do is ensure that modifications are also as fast as we can make them, and that’s where the tricky part comes in.
We can think of an array as a partitioned trie with a branching factor of ∞ – the whole structure is a single giant node. When we modify an array immutably, we must copy every element from the original array into a new one which will contain the modification. This contrasts with List – effectively, a partitioned trie with a branching factor of 1 – which in the best case (prepending) requires none of the elements to be copied. Our 32-trie is obviously somewhere in between. As I said previously, the partitioned trie doesn’t really solve the problem of copying; it just compromises on efficiency somewhat (the difference between 1 and 3.986).
The truth is that to modify a partitioned trie, every sibling of the target node must be copied into a new sub-trie, which then forces the copying of the parent’s level, and so on recursively until the root is reached. Note that the contents of the nodes are not being copied, just the nodes themselves (a shallow copy). Thus, if we return to our example 3-trie from above and attempt to insert a value at index 12, we will have to copy the “Larry” node along with our new node to form the children of a copy of the “Renee” node. Once this is done, the “Grace” and “Moya” nodes must also be copied along with the new “Renee” to form the children of a new “Bethany”. Finally, the “Daniel” and “Joseph” nodes are copied along with the new “Bethany” to form the children of a new root, which is returned as the modified trie. That sounds like a lot of copying, but consider how much went untouched. We never copied the “Karen” or the “Chris” nodes; they just came over with their parents’ copies. Instead of copying 100% of the nodes (as we would have had it been an array), we have only copied 80%. Considering that this was an example contrived to force the maximum copying possible, that’s pretty good!
Actually, we can do even better than this by storing the children of each node within an array (we would have to do this anyway for constant-time access). Thus, only the arrays and the modified nodes need be copied; the other nodes can remain untouched. Using this strategy, we further reduce the copying from 80% to 30%. Suddenly, the advantages of this approach are becoming apparent.
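Here is a sketch of that path-copying idea in isolation, using a hypothetical, simplified Node class (not the Vector developed below) and assuming the target path already exists in the trie:

class Node[T](val children: Array[Node[T]], val value: Option[T])

// Persistent update: copy only the nodes along the path to the target.
// Sibling subtrees are carried over by reference, never copied.
def updated[T](node: Node[T], path: List[Int], newValue: T): Node[T] = path match {
  case Nil =>
    new Node(node.children, Some(newValue))   // copy of the target node only
  case hd :: tail =>
    val newChildren = new Array[Node[T]](node.children.length)
    Array.copy(node.children, 0, newChildren, 0, node.children.length)
    newChildren(hd) = updated(node.children(hd), tail, newValue)
    new Node(newChildren, node.value)         // one shallow array copy per level
}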
Now of course, the higher the branching factor, the larger the arrays in question and so the less efficient the inserts. However, insertion is always going to be more efficient than straight-up arrays so long as the inserted index is greater than the branching factor. Considering that most vectors have more than 32 elements, I think that’s a pretty safe bet.
Implementation
I bet you thought I was going to get to this section first. Foolish reader…
Once we have all this theoretical ground-work, the implementation just falls into place. We start out with a Vector class parameterized covariantly on its element type. Covariant type parameterization just means that a vector of type Vector[B] is a subtype of Vector[A] whenever B is a subtype of A. List works this way as well, as do most immutable collections; but as it turns out, this sort of parameterization is unsafe (meaning it could lead to a breakdown in the type system) when used with mutable collections. This is part of why generics in Java are strictly invariant.
Coming back down to Earth (sort of), we consider for our design that the Vector class will represent the partitioned trie. Since each child node in the trie is a trie unto itself, it is only logical to have each of the nodes also represented by Vector. Thus, a Vector must have three things:
data
length
branches
Put into code, this looks like the following:
class Vector[+T] private (private val data: Option[T], val length: Int,
    private val branches: Seq[Vector[T]]) extends RandomAccessSeq[T] {
  
  def this() = this(None, 0, new Array[Vector[T]](BRANCHING_FACTOR))
  
  // ...
}
RandomAccessSeq is a parent class in the Scala collections API which allows our vector to be treated just like any other collection in the library. You’ll notice that we’re hiding the default constructor and providing a no-args public constructor which instantiates the default. This only makes sense, as all of those fields are implementation-specific and should not be exposed in the public API. It’s also worth noting that the branches field is typed as Seq[Vector[T]] rather than Array[Vector[T]]. This is a bit of a type-system hack to get around the fact that Array is parameterized invariantly (as a mutable sequence) whereas Seq is not.
With this initial design decision, the rest of the implementation follows quite naturally. The only trick is the ability to convert an index given in base-10 to the relevant base-32 (where 32 is the branching factor) values to be handled at each level. After far more pen-and-paper experimentation than I would like to admit, I finally arrived at the following solution to this problem:
def computePath(total: Int, base: Int): List[Int] = {
  if (total < 0) {
    throw new IndexOutOfBoundsException(total.toString)
  } else if (total < base) List(total) else {
    var num = total
    var digits = 0
    
    while (num >= base) {
      num /= base
      digits += 1
    }
    
    val rem = total % (Math.pow(base, digits)).toInt
    val subPath = computePath(rem, base)
    
    num :: (0 until (digits - subPath.length)).foldRight(subPath) { (i, path) =>
      0 :: path
    }
  }
}
As a brief explanation: if our branching factor is 10 and the input index (total) is 20017, the result of this recursive function will be List(2, 0, 0, 1, 7). The final step in the method (dealing with ranges and folding and such) is required to solve the problem of leading zeroes dropping off of subsequent path values and thus corrupting the final coordinates in the trie.
The final step in our implementation (assuming that we’ve got the rest correct) is to implement some of the utility methods common to all collections. Just for demonstration, this is the implementation of the map function. It also happens to be a nice, convenient example of good functional code:
override def map[A](f: (T) => A): Vector[A] = {
  val newBranches = branches map { vec =>
    if (vec == null) null else vec map f
  }
  
  new Vector(data map f, length, newBranches)
}
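Usage then looks like any other Scala collection. A quick sketch, assuming the Vector developed in this article (with its Vector-returning update):

val v0 = new Vector[Int]
val v1 = (v0(0) = 3)             // desugars to v0.update(0, 3)
val v2 = (v1(1) = 4)

val doubled = v2 map { _ * 2 }   // a new vector containing 6 and 8
// v0, v1 and v2 all remain valid, unchanged values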
Properties
Before moving on from this section, it’s worth noting that our implementation of the vector concept has some rather bizarre properties not held by conventional, mutable vectors. For one thing, it has a logically infinite size: it is possible to address any positive integral index within the vector without being concerned about resize penalties. In fact, the only addresses which throw an IndexOutOfBoundsException are negative ones. The length of the vector is defined to be the maximum index which contains a valid element. This actually mirrors the semantics of Ruby’s Array class, which also allows placement at any positive index. Interestingly enough, the worst-case efficiency of addressing arbitrary indexes is actually much better in the persistent trie than in a conventional amortized array-based vector.
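A quick sketch of these semantics, again assuming the article’s Vector:

val v = new Vector[String]
val v2 = (v(1000000) = "hello")   // no resizing, no million-slot copy

v2(1000000)   // => "hello"
v2(42)        // => null, the default value for a reference type
v(1000000)    // => null: the original vector is untouched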
Vectored Buttons
Since we now have an immutable data structure with efficient random access, we may as well rewrite our previous example of the button strip using this structure. Not only is the result far more efficient, but it is also considerably cleaner and easier to read:
class ButtonStrip private (buttons: Vector[Boolean]) {
  def this(num: Int) = this(new Vector[Boolean])   // no worries about length
  
  def status(index: Int) = buttons(index)
  
  def push(index: Int) = new ButtonStrip(buttons(index) = true)
  
  def unpush(index: Int) = new ButtonStrip(buttons(index) = false)
}
You’ll notice that the update method is in fact defined for Vector, but rather than returning Unit it returns the modified Vector. Interestingly enough, we don’t need to worry about length allocation or anything bizarre like that, due to the properties of the persistent vector (infinite length). Just like arrays, a Vector is pre-populated with the default values for its type. In the case of most types, this is null. However, for Int the default value is 0; for Boolean, the default is false. We exploit this property when we simply return the value of dereferencing the vector at any index. Thus, our ButtonStrip class actually manages a strip of infinite length.
Conclusion
The truth is that we didn’t even go as far as we could have in terms of optimization. Clojure has an implementation of a bit-partitioned hash trie which is basically the same thing (conceptually) as the persistent vector implemented in this article. However, there are some important differences.
Rather than jumping through strange mathematical hoops and hacky iteration to develop a “path” to the final node placement, Clojure’s partitioned trie uses bit masking to simply choose the trailing five bits of the index. If this node is taken, then the index is right-shifted and the next five bits are masked off. This is far more efficient in path calculation, but it has a number of interesting consequences on the final shape of the trie. Firstly, the average depth of the tree for random input is lower, usually by around one level. This means that the array copying on insert must occur less frequently, but must copy more references at each step. Literally, the trie is more dense. This probably leads to superior performance. Unfortunately, it also requires that the index for each value be stored along with the node, requiring more memory. Also, this sort of “bucket trie” (to coin a phrase) is a little less efficient in lookups in the average case. Not significantly so, but the difference is there. Finally, this masking technique requires that the branching factor be a power of 2. This isn’t such a bad thing, but it does impose a minor restriction on the flexibility of the trie.
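The masking trick itself is easy to sketch; this is just the general idea, not Clojure’s actual code:

val Bits = 5                 // 2^5 = 32-way branching
val Mask = (1 << Bits) - 1   // binary 11111

// Peel off five bits per level instead of dividing.
def bitPath(index: Int): List[Int] = {
  def loop(i: Int, acc: List[Int]): List[Int] =
    if (i < (1 << Bits)) i :: acc
    else loop(i >>> Bits, (i & Mask) :: acc)
  loop(index, Nil)
}

bitPath(1000000)   // => List(30, 16, 18, 0): the index split into five-bit digits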
Most importantly though, Clojure’s trie uses two arrays to contain the children: one for the head and one for the tail. This is a phenomenally clever optimization which reduces the amount of array copying by 50% across the board. Rather than having a single array of length 32, it has two arrays of length 16 maintained logically “end to end”. The correct array is chosen based on the index and recursively from there on down the line. Naturally, this is substantially more efficient in terms of runtime and heap cost on write.
At the end of the day though, these differences are all fairly academic. Just having an idiomatic partitioned trie for Scala is a fairly significant step toward functional programming relevance in the modern industry. With this structure it is possible to maintain lightning-fast lookup times and decent insertion times without sacrificing the critical benefits of immutable data structures. Hopefully structures like this one (or even better implementations) will eventually find their way into the standard Scala library so that all may benefit from their streamlined efficiency.
- Vector.scala
- VectorSpecs.scala (ScalaCheck rocks)
Update
The implementation of Vector which I present in this article is inherently much less efficient than the one Rich Hickey created for Clojure. I finally broke down and created a line-for-line port of Clojure’s Java implementation of PersistentVector into a Scala class. I strongly suggest that you use this (much faster) implementation rather than my own flawed efforts. You can download the improved Vector here: Vector.scala.
Comments
This is, incidentally, more or less how the new IntMap in 2.7.2 works. The details are different, but the idea is the same.
IntMap is persistent, but it’s not a partitioned trie in the same sense. I had a cursory glance at the source code, and while I’m certainly not grokking everything involved, it doesn’t look like quite the same idea to me. IntMap seems to be a more dynamic tree structure for the branches, whereas a partitioned trie uses an array (for constant lookup time). Actually, IntMap looks like it might be *more* efficient for some operations (like any modification), but I’d have to really sit down and figure it out to be sure.
Finally, IntMap is just that, a Map. It doesn’t implement Seq directly or indirectly (much less RandomAccessSeq), so it is unavailable for a lot of operations. The Vector implementation I gave is an immutable drop-in replacement for Array. That’s not to say that a wrapper could not be implemented *around* IntMap to provide such an interface, but it presently does not satisfy that constraint.
Side note: I was curious with respect to comparative performance: bit-partitioned vs big-endian path partitioning. I hadn’t given much thought to this issue prior to writing the article, so when my fingers wandered off down the conceptual bunny-trail, my interest was piqued.
To answer my own question, I quickly implemented a bit-partitioned trie after the Clojure design: head and tail array, strict branching factor, bit-masking to determine path, etc. I tested the comparative performance by using the VectorSpecs tests and running them separately against Vector and my experimental BitVector. The suite was run 8 times against each implementation with the best and worst results discarded. To my surprise, Vector came out ahead in the average case:
Vector
======
472
478
482
479
467
461
===
473ms
BitVector
========
537
488
491
556
486
484
===
507ms
I suppose this does bear out my analysis of the distinctions between the average lookup times. However, given the fact that the bit-partitioned trie has a lower average depth, one would imagine that this hindrance would be nullified. Considering that the BitVector has a major optimization which is lacking in the Vector implementation (tail and head arrays), I think this is pretty impressive.
It’s also interesting that the bit-partitioned implementation has far more wild fluctuations in its results. I would imagine that this stems from the random nature of the tests and its more complex “worst-case” (since indexes could technically “stack up” linearly like hash buckets).
For the curious, here is my BitVector implementation: http://www.codecommit.com/blog/misc/implementing-persistent-vectors-in-scala/BitVector.scala
Right. I wasn’t saying “Why would you use this when there’s IntMap?!”. They have a fundamentally different interface if nothing else.
Just that the basic idea of a Trie on integer keys is similar. As far as I can tell the two major differences are that IntMap will shortcut boring segments of the keys (so if you have two keys which differ in only a few places it won’t branch for the places where they don’t differ) and that your implementation builds much broader trees (which is great for reads and bad for writes). Which is why I said that it’s fundamentally the same idea, just differing in details.
In particular it’s probably worth considering making your version shortcut similarly to IntMap and my version broaden similarly to Vector.
As a side note, I’ve been sketching the details for implementing a similar structure which is somewhere between this and a finger tree. It would have the characteristics of the Vector but would also have fast add and remove to both ends and fast concatenation (also access to indices near either end point would get a non-trivial speedup).
There’s an easier (and faster) way to compute the path. The trick is to start with the last digit instead of the first. To get the last digit of a number you compute `n % base`. e.g. 1234 % 10 = 4. To remove the last digit you (integer) divide it by base: 1234 / 10 = 123 (x86 processors do both steps in one divmod instruction, but if you choose a power of 2 base you can use two shifts, which is even faster).
In pseudo code (because I don’t speak Scala):
def path(n, base):
  digits = nil
  do {
    digits = cons(n % base, digits)
    n = n / base
  } while n != 0
  digits
(this is a do-while loop because otherwise it wouldn’t work when n=0)
path(1234,10) => [1,2,3,4]
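Rendered in Scala (a sketch under the same assumptions: a non-negative n and base > 1), that loop might read:

// Build the digit list back to front, so the most significant digit ends up first.
def path(n: Int, base: Int): List[Int] = {
  def loop(n: Int, digits: List[Int]): List[Int] = {
    val next = (n % base) :: digits
    if (n / base == 0) next else loop(n / base, next)
  }
  loop(n, Nil)
}

path(1234, 10)   // => List(1, 2, 3, 4)
path(0, 10)      // => List(0), matching the do-while behavior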
I don’t understand this bit:
new Vector(data map f, length, newBranches)
Shouldn’t that be:
new Vector(f(data), length, newBranches)
Otherwise you’re assuming that we’re storing collections in the vector?
Oh I get it. data is an Option, and map is defined for Option like:

Some(x) => Some(f(x))
None => None

Cool!
Couldn’t you set branches to null in the default constructor? There would be some more checks but you would save a bunch of space on arrays that you can never put anything into.
I think it’s a better idea to set branches to the array of the size we actually need. Then if you have, say, 33 elements in the vector, you need just 1 array of size 2 (at the root), 1 array of size 32 (left branch), 1 array of size 1 (right branch) and 1 array of size 0 (since it can be shared between all leaf vectors).
@David
I’ll have to look more closely at IntMap; the branch shortcutting sounds quite interesting. The only problem I see with it is that it sounds like the structural location of a given index is partially dependent on the structure of the rest of the trie. Just talking off the top of my head, but wouldn’t that require that you store the key along with the value in each node? Bit-partitioned tries require this too, and one of the things I noticed in my benchmarking is that it has a deleterious impact on read times (since reads can sometimes become linear on a set of possible targets). Shortcutting doesn’t sound quite as bad as the bit-masking technique of “just taking what you need” from the key, but it sounds like the impact to the structure is in the same conceptual vein. Am I wrong?
@Jules
The path computation algorithm that you described would be a little-endian path rather than the big-endian that I use. Technically speaking, they have the same worst-case performance, but I’m pretty sure that the average write performance is slightly better with big-endian. My reasoning for this is the fact that big-endian will result in repeated writes to the same super-branch when incrementing the index (a common pattern). In turn, this means that a large number of branches retain empty child arrays for a longer period of time, while one child array is copied repeatedly as the key traverses its range. In truth, I’m not sure whether or not this yields better performance. However, if I followed @Asd’s excellent suggestion of avoiding the creation of extra arrays, I would save a tremendous amount of space using big-endian rather than little (since nodes would remain without children for longer, the children being grouped rather than distributed). Bit-partitioned vectors have little-endian-ish paths, though not precisely, since a higher index can come structurally above a lower index with the same trailing bits. In terms of raw performance (ignoring memory), it’s hard to say which is better in the average case. For a random set of indexes, they are identical.
@Alexey
That would lead to truly constant read times, but extremely poor writes. Literally, this would be the same as maintaining an array and copying it every time new data was to be inserted. Partitioned tries are a compromise between this full-copy strategy and the need for fast reads. The branching factor is a direct reflection of this compromise. One of the nice things about a path-partitioned trie rather than bit-partitioned is that the branching factor may be set on instantiation. This would allow the developer to “tune the vector” as it were, reflecting the case-specific needs for better reads or better writes.
I don’t see how it would “be the same as maintaining an array”. I am talking about the following scheme.
Let’s write a Vector(data, count, branches) as (data, count, branches), empty vector as Empty == (None, 0, [0]{}) and the single-element vector as Single(elt) == (Some(elt), 1, Vector[int][0]). The array of Vectors of length l will be written as [l]{elements of the array}
v = new Vector[int] //v == Empty.
v = (v[10] = 1) //v == (None, 11, [11]{Empty,Empty,…,Single(1)})
v = (v[32] = 2) //v == (None, 32, [11]{Empty,(None, 2, [1]{Single(2)}),…,Single(1)})
In other words, the only difference in the insertion code is this: instead of
val newBranches = new Array[Vector[A]](branches.length)
Array.copy(branches, 0, newBranches, 0, branches.length)
we have
val newBranches = new Array[Vector[A]](Max(branches.length, hd + 1))
Array.copy(branches, 0, newBranches, 0, branches.length)
In apply(i) instead of
val node = branches(hd)
we have
val node = if (hd
@Alexey
Oh, I think I see what you’re saying! So you still have a static branching factor, but the individual branch array for each sub-vector is only sized just large enough to contain its maximum index? That’s terribly clever. It does require that the branching factor be maintained as a separate value within the class, rather than relying on the array length. This is a good idea anyway, considering the fact that having sub-vectors with different branching factors would lead to inconsistent results (and crashing).
Anyway, this is almost a more general space-saving solution than the idea given previously of lazily creating branch arrays. Also, it significantly reduces the average insert complexity due to the fact that rarely would an entire array of length=branchingFactor have to be copied linearly. I was planning on continuing to tweak my path-partitioned implementation a bit more; I’ll try applying this technique and we’ll see how much it saves!
That last line should be
val node = if (hd < branches.length) branches(hd) else null
I’ve written up this solution in C#, it’s available at
http://code.google.com/p/functional-dotnet/
There are also finger trees, ropes, and some more useful functional data structures.
I tweaked the implementation to employ Alexey’s optimizations, the left/right split of the branches array, and a tiny bit of memoization on functions like computePath. Surprisingly, the results were only marginally faster than the original implementation, and far less consistent over a random data set:
Vector(2.0)
===========
105
114
97
980
36
972
25
997
979
998
0
31
967
10
963
31
22
991
968
3
===
416ms
I tend to think that there was probably some weird process running on my machine while running these tests, given the dramatic spread we’re seeing. So if I had to pick a number to “believe”, it would probably be in the range of 100ms. If that is indeed the case, then the optimizations have yielded an 80% reduction in test suite runtime. The bigger gain (I imagine) is the memory utilization, which I did not measure, but is likely orders of magnitude lower than the original implementation. In fact, as far as random access sequences go, this implementation is probably even more efficient than a mutable array in terms of memory, especially for wide index spreads.
The revised implementation is here: http://www.codecommit.com/blog/misc/implementing-persistent-vectors-in-scala/Vector2.scala
How’s performance compared to ArrayList?
@Daniel: I’m not sure if I understand what you mean by big and little endian. You mean compute_path(1234,10) => [1,2,3,4] is big endian and [4,3,2,1] is little endian? My example is big endian: compute_path(1234,10) = [1,2,3,4]. It does start with the last digit, but because cons() adds an item at the front of the list the result is cons(1,cons(2,cons(3,cons(4,nil)))).
I think that big endian is not only faster for writes, but also for sequential reads. This is because the i’th item in the vector will require many memory reads that are the same for the (i+1)’th item. So these reads go to the processor’s cache instead of RAM. (aka better memory locality)
Daniel, this is some great stuff. I’d love to see it in the standard library, but in the meantime, why don’t you contribute it to scalax?
@Jesper
I haven’t benchmarked against ArrayList (yet), but read time should be comparable (as in: very, very close). Writes are probably a bit more widely separated, though with a wide index spread you may see closer performance between the two (maybe even superior with the immutable implementation). Where Vector really shines is in the memory utilization with extremely high indexes. Potentially, Vector could be many orders of magnitude smaller in-memory than ArrayList, due to the fact that it does not require large contiguous blocks.
@Jules
Well, if I read the code, that would be helpful.
Sorry, my brain was a bit mushy yesterday; I somehow missed the key points of your algorithm. You are quite correct that using mod and starting from the right is more efficient. I had tried to do it that way originally, but somehow I wasn’t able to make it work. Now I feel a little sheepish about it.
If you don’t mind, I’m going to use your idea and rewrite the computePath method accordingly. Which brings me to…
@Jorge
I would also love to see it make it into the stdlib. I’m guessing this just requires community interest and a ticket in Trac? (as well as a well-tested implementation of course) In the meantime though, I’ll shoot an email to Jamie Webb and attach the current incarnation of the implementation (Mercurial probably won’t work well on my current system). I’m still working on efficiently implementing some of the common collection operations, and once I have that done I’ll get up the nerve to pitch it to LAMP.
This won’t compile:
class ButtonStrip private (buttons: List[Boolean]) {
  def this(num: Int) = {
    val buttons = (0 until num).foldLeft(List[Boolean]()) { (list, i) =>
      false :: list
    }
    this(buttons)
  }
}

error: 'this' expected but 'val' found.
    val buttons = (0 until num).foldLeft(List[Boolean]()) { (list, i) =>
    ^
t3a.scala:5: error: '(' expected but ';' found.
}
^
two errors found
Scala has to call the primary constructor immediately in a secondary constructor:

class ButtonStrip private (buttons: List[Boolean]) {
  def this(num: Int) = {
    this((0 until num).foldLeft(List[Boolean]()) { (list, i) =>
      false :: list })
  }
}
BTW, this is one of my pet peeves with Scala. If you want to do a calculation to pass to the primary constructor, you have to do it in the arguments to this().
@Blair
You’re quite correct of course. I’ll fix the snippet. I suppose that one way of getting around this would be to move the calculation into a method in the companion object, but that’s a little ugly. Scala really should allow vals to be used prior to this(…) in secondary constructors. There’s no problem with this in compilation since vals can be inlined, it’s just a clarity thing.
hey daniel, a very interesting class. in fact i’m always struggling with the problem of concurrent updates of structures. i basically use a mix of immutable (for simple objects) and mutable (for more complex objects – due to the lack of algorithms like yours) approaches.
i have tested your Vector class and wonder if it could be combined with the concept of interval keys, in order to obtain a kind of internal tree (–> http://www.nabble.com/scala.Collection,-Seq,-jcl.Collection,-Buffer-td18033249.html )… what i need is a robust fast data structure that can handle lookup, insertion and deletion of items in O(log(n)) time, where lookup means finding items by a key (so it’s a kind of Map) which is a time interval (given in integer sample frames).
thanks for hints and suggestions!
sorry, typo, should be “interval tree”, or something in that direction (quadtree, k-d tree)
@Sciss
Hmm, I’m not sure I understand completely what you’re trying to do, but I can definitely help you with a log(n) efficient Map!
What you want is a functional, balanced binary search tree. Traditionally, most languages use an immutable Red-Black tree for this case, but Scala actually has an even faster implementation: scala.collection.immutable.Tree. This structure guarantees O(log(n)) for all operations.
It’s actually possible to do a little better though. Clojure has a persistent hash map based on Bagwell’s ideal hash tries which is able to claim O(log_32(n)) for all operations. This is actually even a little better than conventional mutable bucket hashing, which has an absolute worst-case of O(log(n)) (when everything hashes to the same bucket). I haven’t had the chance to port Clojure’s hash map to Scala yet, but I’m working on it!
This persistent hash map is very similar in design to the persistent vector (they’re both based on the same structural concept).
Correction, the Tree class should be linked as follows: http://www.scala-lang.org/docu/files/api/scala/collection/immutable/TreeMap.html
TreeMap uses Tree to implement an essentially Red-Black tree structure.
@Daniel
thanks for the fast reply, and thanks for the hint to TreeMap. in fact, i only understand the conception of scala.collection.immutable after reading your article – coming from the java world, i had always thought that that package is just creating read-only objects like java.util.Collections.unmodifiable(List|Map|…) !
TreeMap however doesn’t work in my case due to two things:
- all range operations are 1-dimensional, where i need two-dimensional keys
(like a (start, stop) tuple)
- i cannot insert two elements at the same key
i just realized the second issue is also true for the Vector class ;-( and i guess it’s a lot of work to modify it so that it accepts duplicate keys. (there is a trick however i read in a paper – you can store a proxy value for a key that points to more than one value; the proxy value contains a list of all objects stored under that key). however that introduces a new problem for the generics. i guess i will need to have a wrapper class to handle that.
Actually, you don’t have to create a new proxy object; Scala already gives you one: Tuple2. I think you can accomplish your goals by using tuples as keys. Thus, for some types A, B and C, your Map type signature will be something like the following:
Map[(A, B), C]
@Daniel
i understand i can use tuples as keys. but they won’t solve the two problems
- duplicate keys
- range searches for boundaries
while i can imagine solving the first problem by using a Map[ Span, Seq[ Object ]], the second problem seems impossible to do with the normal tree / vector: if any object is defined by its Span(start, stop), how would i query for example
“give me all objects that intersect with Span( 10000, 40000 )” ?
if i’m not wrong this requires a two-dimensional tree… or do you see another solution? thanks again!
p.s. here is what i mean. i’m splitting my range object (Span) into two pairs of keys, by start and stop (Long). but when querying a range, the problematic lines are the last four, which destroy the speed…
case class Span( start: Long, stop: Long );
case class Lala( name: String, span: Span );
var t_start = new scala.collection.immutable.TreeMap[ Long, Lala ]()
var t_stop = new scala.collection.immutable.TreeMap[ Long, Lala ]()
val rand = new java.util.Random
(1 to 1000).foreach { i =>
val start = rand.nextInt( 400000 )
val stop = start + rand.nextInt( 400000 )
val span = Span( start, stop )
val lala = Lala( String.valueOf( i ), span )
t_start = t_start.update( span.start, lala )
t_stop = t_stop.update( span.stop, lala )
}
val search = Span( 100000, 200000 )
val coll1 = t_start from search.start
val coll2 = t_stop until search.stop
val c1vals = coll1.values
val c1valSet = new scala.collection.mutable.HashSet[ Lala ]
c1valSet ++= c1vals
val result = coll2.filter { e => c1valSet.contains( e._2 )}
here are some measurements with the above code. t2-t1 is the loop adding elements, t5-t3 is the from and until operations, t8-t6 is creating the intersection between the from and until collections. interestingly, all are O(N), where i had guessed that t5-t3 should be O(logN)? or do i have to use a special kind of iterator generation to get that performance?
// for N=50000
t2-t1 = 1226 ms
t5-t3 = 6 ms
t8-t6 = 159 ms
// for N=100000
t2-t1 = 2688 ms
t5-t3 = 12 ms
t8-t6 = 317 ms
// for N=200000
t2-t1 = 6155 ms
t5-t3 = 20 ms
t8-t6 = 603 ms
Ah, I didn’t realize that you needed range matching (I suppose I should have read your original post…). This is starting to get a bit out of my depth, but it would seem to me that you could match key spans in O(log(n)) time using a balanced binary search tree. This is because you can just treat ranges as trimmed sub-trees. Doing it immutably might be a bit tough, but still possible. It would seem to me that you could take advantage of Scala’s SortedMap#rangeImpl method for this, but I’m not entirely sure.
A higher branching factor (such as is used in Vector) might be interesting to explore, but I don’t think it would help too much. In fact, I think it would hinder performance. For one thing, you would *have* to use a tree structure rather than a trie (otherwise range matches are still linear time). Also, bumping up the branching factor makes it much more complicated to just select out your sub-tree. You would have to trim nodes as well as trees, and that gets to be quite inefficient.
well, i’m reading through two papers now (quite an effort for a not-computer-science person)
Neil Sarnak, Robert Tarjan, “Planar Point Location Using Persistent Search Trees”
and
James Driscoll, Daniel Sleator, Robert Tarjan “Making data structures persistent”
… hopefully with some insight.
Recommended reading: Chris Okasaki’s doctoral thesis was on persistent data structures and ways of making them efficient. It’s very readable and was actually republished as a proper book.