Context
I want to build a transposition table where the game states are binary square matrices which size MATRIX_SIZE is known at compile time. Rather than storing these matrices, I want to store their "hash". More precisely, I want to build a function that, upon being given such a matrix returns a unique identifier. My requirements are:
- Performance: computing a unique identifier must be as fast as possible, since the function will be called a lot.
- Perfect collision-resistance: the unique identifier must be, well, unique. Collisions are not accepted.
- Parallelism: at some point in time, several threads would want to compute uid for different matrices.
The first idea that comes to mind for tackling such a problem is using well-known hash functions. However:
- The basic hash functions in Rust make a trade off between performance and collision resistance.
- I expect to see more than 2^32 matrices, which means I would have to use tags that can't be stored on 64-bit numbers, because of the birthday paradox. This would in tun prevent me from using these tags directly as keys for a structure such that an
HashMapor anIndexMap. - Even the most performant hash functions such as SHA3 make several computations while reading the output.
My idea is that since I know that my inputs are always going to be the same size, I can exploit this property. Fundamentally, what I want to do is to store the matrices that I see. Upon having to give a unique identifier to a matrix, I check whether this matrix has already been given an identifier. Otherwise, I increase a global counter by one and assign it to this matrix.
For instance, an idea that could work is:
- Start with an empty
vec! - Compute a (very) large integer which binary decomposition is given by the matrix
- If this integer is within the
vec!, returns its index within thevec! - Otherwise, insert it into the vec in a sorted way
Inserting and searching are efficient since the vec! is sorted (for instance, one may use the binary_search method of such objects). In fact, this could be a potential solution to my problem, except that the vec will always be growing. From my understanding, this means that for every next power of 2, a new allocation would have to take place, copying the whole vec! each time. Plus, inserting an element is clearly not thread-safe.
The solution I went with (for now?) is to implement a tree that will store the matrices it has seen in its leaves, along with their uid. For instance, let us assume MATRIX_SIZE=4. We start from a tree that is just a root:
Node
|---+---|
None None
Suppose that we now want to compute the unique identifier of [0, 0, 0, 1] (note that we consider the flatten matrices, since we just see them as iterators). The tree will then look like this:
Node
|-------------------------------+-------------------------------|
Node None
|---------------+---------------|
Node None
|-------+-------|
Node None
|---+---|
None Leaf(0)
If we now store [0, 1, 0, 1], the tree will look like this:
Node
|-------------------------------+-------------------------------|
Node None
|---------------+---------------|
Node Node
|-------+-------| |-------+-------|
Node None Node None
|---+---| |---+---|
None Leaf(0) None Leaf(1)
The idea is that:
- We only have to read the data and progress within the tree along with it, there's no additional computations
- As soon as we find a
None, we know that this is a new stream, so we can directly return the new unique identifier and let another thread expand the tree in the background (though this is not yet implemented, I wrote the code with this idea in mind, which is whyexpandis a separate function). - It seems not to suffer from the same problems as
vec! - It seems a little bit more thread-friendly (though I'm not sure about this).
However:
- Just like the
vec!solution, it may require a lot of memory, contrarily to a stateless hash function - It requires a lot of "jumping around" in memory and I'm not sure whether this affects performance.
The code
Here's my attempt at implementing such a tree in Rust. Note that I haven't taken care of the parallelism aspect for now. Since I'm a beginner in Rust, every feedback is much appreciated, be it performance-wise, on good practices or on the algorithm itself.
Here's the main.rs file:
// main.rs
mod constants;
mod uid_assigner;
fn main() {
println!("Hello, world!");
}
Here's the constants.rs file:
// constants.rs
const N_QUBITS: usize = 5;
pub const MATRIX_SIZE: usize = N_QUBITS * N_QUBITS;
And finally, here's the uid_assigner.rs file:
// uid_assigner.rs
use crate::constants::MATRIX_SIZE;
use crate::uid_assigner::UidAssignerTree::{Leaf, Node};
// Only used to initialize an array full of None
const NONE_INIT: Option<Box<UidAssignerTree>> = None;
/// Each Node in the UidAssignerTree has the same number of children. This
/// value must be equal to the largest integer the assigner will ever meet in
/// the stream.
///
/// Since we are dealing with binary streams, its value is currently set to 2.
const TREE_BRANCHES: usize = 2;
/// The enum that represents the tree used for assigning unique identifiers. The
/// data is stored in the leaves only, the nodes are only meant to lead to the
/// leaves.
///
/// The rationale is that it is possible to both access and create an entry in
/// $n$ steps, where $n$ is the length of the stream. Note that this assigner
/// assumes that every stream to give an identifier to has the same length $n$.
/// Otherwise, the `get_uid` function will panic.
enum UidAssignerTree {
// FIXME: This makes the leaves "heavy", since they must be able to be
// casted to an array of pointers. That means that storing a u64 is as
// heavy as storing TREE_BRANCHES usizes
Leaf(u64),
Node([Option<Box<UidAssignerTree>>; TREE_BRANCHES]),
}
/// The UidAssigner is the struct the user will be exposed to in order to assign
/// unique identifiers to streams.
pub(crate) struct UidAssigner {
/// The `uid_assigner` is the structure that contains all previous
/// identifiers and is in charge of creating new ones.
uid_assigner: UidAssignerTree,
/// The `next_id` simply represents the identifier that will be affected to
/// the next stream to which no identifier is associated yet.
next_id: u64,
}
impl UidAssignerTree {
/// Creates the root of a new UidAssignerTree.
fn new() -> Self {
Node([NONE_INIT; TREE_BRANCHES])
}
/// Expands the tree until a new leaf.
///
/// This function is only called whenever a None Node is found when looking
/// for a new uid in the `get_uid` function. It uses the knowledge that all
/// upcoming Nodes are going to be None to expand the tree in a branchless
/// way.
fn expand<'a>(&mut self, mut data: impl Iterator<Item = &'a u8>, next_uid: &mut u64) {
match data.next() {
// If the iterator isn't empty yet, we have to create the children
// of the current node.
Some(index) => {
if let Node(children) = self {
let child_reference = &mut children[*index as usize];
*child_reference = Some(Box::new(UidAssignerTree::new()));
child_reference.as_mut().unwrap().expand(data, next_uid);
} else {
// The nodes we encounter are those we previously created,
// so we know for certain that they're not leaves
unreachable!()
}
}
// Otherwise, the current node should be a Leaf to which we can
// assign the new uid.
None => {
*self = Leaf(*next_uid);
*next_uid += 1;
}
}
}
/// Computes the uid of a stream of data. It does so by either assigning a
/// new uid to the stream if it is the first time the function is called
/// with this particular stream, or by giving back the previously associated
/// uid.
fn get_uid<'a>(&mut self, mut data: impl Iterator<Item = &'a u8>, next_uid: &mut u64) -> u64 {
match self {
Node(children) => {
let child_reference = &mut children[*data.next().unwrap() as usize];
if !child_reference.is_none() {
child_reference.as_mut().unwrap().get_uid(data, next_uid)
} else {
*child_reference = Some(Box::new(UidAssignerTree::new()));
let uid = *next_uid;
child_reference.as_mut().unwrap().expand(data, next_uid);
uid
}
}
Leaf(uid) => *uid,
}
}
}
impl UidAssigner {
pub(crate) fn new() -> Self {
Self {
uid_assigner: UidAssignerTree::new(),
next_id: 0,
}
}
/// Computes the unique identifier of a stream of data.
///
/// The goal of this function is to set a unique identifier to every stream
/// of data it is given. It does so by storing the stream in a tree-like
/// structure, which allows efficient search and creation. It is guaranteed
/// that the same stream will always be associated to the same identifier
/// and that two different streams will never be associated to the same
/// identifier, as long as the number of different streams doesn't exceed
/// $2^64$.
///
/// # Arguments
///
/// * `data` - The stream of data to assign an identifier to. Note that it
/// must satisfy two strict conditions:
/// - Each stream of data must be of the same length.
/// - Each stream of data mustn't contain integers larger than
/// `TREE_BRANCHES`.
/// If one of these conditions isn't satisfied, this function will panic.
pub(crate) fn get_uid<'a>(&mut self, data: impl ExactSizeIterator<Item = &'a u8>) -> u64 {
if data.len() != MATRIX_SIZE {
panic!(
"Expected iterator of size {}, {} given",
MATRIX_SIZE,
data.len()
);
}
self.uid_assigner.get_uid(data, &mut self.next_id)
}
}
#[cfg(test)]
mod tests {
use crate::constants::MATRIX_SIZE;
use crate::uid_assigner::UidAssigner;
/// This test ensures that a stream is always associated to the same uid.
#[test]
fn consistency_test() {
let mut uid_assigner = UidAssigner::new();
let array0: [u8; MATRIX_SIZE] = [0; MATRIX_SIZE];
// Check that asking twice for a uid gives the same uid.
assert_eq!(
uid_assigner.get_uid(array0.iter()),
uid_assigner.get_uid(array0.iter())
);
let array1: [u8; MATRIX_SIZE] =
core::array::from_fn(|i| if i == MATRIX_SIZE - 1 { 0 } else { 1 });
uid_assigner.get_uid(array1.iter());
// Check that the previous property remains true even after having
// inserted another record in the tree.
assert_eq!(
uid_assigner.get_uid(array0.iter()),
uid_assigner.get_uid(array0.iter())
);
}
/// This test ensures that the uid are assigned in a consecutive order
#[test]
fn consecutive_id_test() {
let mut uid_assigner = UidAssigner::new();
let array0: [u8; MATRIX_SIZE] = [0; MATRIX_SIZE];
let array1: [u8; MATRIX_SIZE] =
core::array::from_fn(|i| if i == MATRIX_SIZE - 1 { 0 } else { 1 });
let array2: [u8; MATRIX_SIZE] =
core::array::from_fn(|i| if i == MATRIX_SIZE - 2 { 0 } else { 1 });
let array3: [u8; MATRIX_SIZE] =
core::array::from_fn(|i| if i >= MATRIX_SIZE - 2 { 0 } else { 1 });
// Check that the uid of three different streams are the first natural
// numbers
assert_eq!(uid_assigner.get_uid(array0.iter()), 0);
assert_eq!(uid_assigner.get_uid(array1.iter()), 1);
assert_eq!(uid_assigner.get_uid(array2.iter()), 2);
// Check that the previous property remains true even after asking for
// an already existent uid.
assert_eq!(uid_assigner.get_uid(array0.iter()), 0);
assert_eq!(uid_assigner.get_uid(array3.iter()), 3);
}
/// This test ensures that the `get_uid` function panics upon being given an
/// iterator with a length shorter than `MATRIX_SIZE`.
#[test]
#[should_panic(expected = "Expected iterator of size ")]
fn panic_on_too_short_iterator_test() {
let mut uid_assigner = UidAssigner::new();
let array0: [u8; MATRIX_SIZE - 1] = [0; MATRIX_SIZE - 1];
uid_assigner.get_uid(array0.iter());
}
/// This test ensures that the `get_uid` function panics upon being given an
/// iterator with a length larger than `MATRIX_SIZE`.
#[test]
#[should_panic(expected = "Expected iterator of size ")]
fn panic_on_too_long_iterator_test() {
let mut uid_assigner = UidAssigner::new();
let array0: [u8; MATRIX_SIZE + 1] = [0; MATRIX_SIZE + 1];
uid_assigner.get_uid(array0.iter());
}
}