Assign unique identifiers to binary matrices using a binary tree

Question

Context

I want to build a transposition table where the game states are binary square matrices which size MATRIX_SIZE is known at compile time. Rather than storing these matrices, I want to store their "hash". More precisely, I want to build a function that, upon being given such a matrix returns a unique identifier. My requirements are:

Performance: computing a unique identifier must be as fast as possible, since the function will be called a lot.
Perfect collision-resistance: the unique identifier must be, well, unique. Collisions are not accepted.
Parallelism: at some point in time, several threads would want to compute uid for different matrices.

The first idea that comes to mind for tackling such a problem is using well-known hash functions. However:

The basic hash functions in Rust make a trade off between performance and collision resistance.
I expect to see more than 2^32 matrices, which means I would have to use tags that can't be stored on 64-bit numbers, because of the birthday paradox. This would in tun prevent me from using these tags directly as keys for a structure such that an HashMap or an IndexMap.
Even the most performant hash functions such as SHA3 make several computations while reading the output.

My idea is that since I know that my inputs are always going to be the same size, I can exploit this property. Fundamentally, what I want to do is to store the matrices that I see. Upon having to give a unique identifier to a matrix, I check whether this matrix has already been given an identifier. Otherwise, I increase a global counter by one and assign it to this matrix.

For instance, an idea that could work is:

Start with an empty vec!
Compute a (very) large integer which binary decomposition is given by the matrix
If this integer is within the vec!, returns its index within the vec!
Otherwise, insert it into the vec in a sorted way

Inserting and searching are efficient since the vec! is sorted (for instance, one may use the binary_search method of such objects). In fact, this could be a potential solution to my problem, except that the vec will always be growing. From my understanding, this means that for every next power of 2, a new allocation would have to take place, copying the whole vec! each time. Plus, inserting an element is clearly not thread-safe.

The solution I went with (for now?) is to implement a tree that will store the matrices it has seen in its leaves, along with their uid. For instance, let us assume MATRIX_SIZE=4. We start from a tree that is just a root:

     Node
  |---+---|
None     None

Suppose that we now want to compute the unique identifier of [0, 0, 0, 1] (note that we consider the flatten matrices, since we just see them as iterators). The tree will then look like this:

                                                                Node
                                 |-------------------------------+-------------------------------|
                               Node                                                             None
                |---------------+---------------|
              Node                             None
       |-------+-------|
     Node             None
  |---+---|
None    Leaf(0)

If we now store [0, 1, 0, 1], the tree will look like this:

                                                                Node
                                 |-------------------------------+-------------------------------|
                               Node                                                             None
                |---------------+---------------|
              Node                             Node
       |-------+-------|                |-------+-------|
     Node             None            Node             None
  |---+---|                        |---+---|
None    Leaf(0)                  None    Leaf(1)

The idea is that:

We only have to read the data and progress within the tree along with it, there's no additional computations
As soon as we find a None, we know that this is a new stream, so we can directly return the new unique identifier and let another thread expand the tree in the background (though this is not yet implemented, I wrote the code with this idea in mind, which is why expand is a separate function).
It seems not to suffer from the same problems as vec!
It seems a little bit more thread-friendly (though I'm not sure about this).

However:

Just like the vec! solution, it may require a lot of memory, contrarily to a stateless hash function
It requires a lot of "jumping around" in memory and I'm not sure whether this affects performance.

The code

Here's my attempt at implementing such a tree in Rust. Note that I haven't taken care of the parallelism aspect for now. Since I'm a beginner in Rust, every feedback is much appreciated, be it performance-wise, on good practices or on the algorithm itself.

Here's the main.rs file:

// main.rs
mod constants;
mod uid_assigner;

fn main() {
    println!("Hello, world!");
}

Here's the constants.rs file:

// constants.rs
const N_QUBITS: usize = 5;
pub const MATRIX_SIZE: usize = N_QUBITS * N_QUBITS;

And finally, here's the uid_assigner.rs file:

// uid_assigner.rs
use crate::constants::MATRIX_SIZE;
use crate::uid_assigner::UidAssignerTree::{Leaf, Node};

// Only used to initialize an array full of None
const NONE_INIT: Option<Box<UidAssignerTree>> = None;

/// Each Node in the UidAssignerTree has the same number of children. This
/// value must be equal to the largest integer the assigner will ever meet in
/// the stream.
///
/// Since we are dealing with binary streams, its value is currently set to 2.
const TREE_BRANCHES: usize = 2;

/// The enum that represents the tree used for assigning unique identifiers. The
/// data is stored in the leaves only, the nodes are only meant to lead to the
/// leaves.
///
/// The rationale is that it is possible to both access and create an entry in
/// $n$ steps, where $n$ is the length of the stream. Note that this assigner
/// assumes that every stream to give an identifier to has the same length $n$.
/// Otherwise, the `get_uid` function will panic.
enum UidAssignerTree {
    // FIXME: This makes the leaves "heavy", since they must be able to be
    //  casted to an array of pointers. That means that storing a u64 is as
    //  heavy as storing TREE_BRANCHES usizes
    Leaf(u64),
    Node([Option<Box<UidAssignerTree>>; TREE_BRANCHES]),
}

/// The UidAssigner is the struct the user will be exposed to in order to assign
/// unique identifiers to streams.
pub(crate) struct UidAssigner {
    /// The `uid_assigner` is the structure that contains all previous
    /// identifiers and is in charge of creating new ones.
    uid_assigner: UidAssignerTree,
    /// The `next_id` simply represents the identifier that will be affected to
    /// the next stream to which no identifier is associated yet.
    next_id: u64,
}

impl UidAssignerTree {
    /// Creates the root of a new UidAssignerTree.
    fn new() -> Self {
        Node([NONE_INIT; TREE_BRANCHES])
    }

    /// Expands the tree until a new leaf.
    ///
    /// This function is only called whenever a None Node is found when looking
    /// for a new uid in the `get_uid` function. It uses the knowledge that all
    /// upcoming Nodes are going to be None to expand the tree in a branchless
    /// way.
    fn expand<'a>(&mut self, mut data: impl Iterator<Item = &'a u8>, next_uid: &mut u64) {
        match data.next() {
            // If the iterator isn't empty yet, we have to create the children
            // of the current node.
            Some(index) => {
                if let Node(children) = self {
                    let child_reference = &mut children[*index as usize];
                    *child_reference = Some(Box::new(UidAssignerTree::new()));
                    child_reference.as_mut().unwrap().expand(data, next_uid);
                } else {
                    // The nodes we encounter are those we previously created,
                    // so we know for certain that they're not leaves
                    unreachable!()
                }
            }
            // Otherwise, the current node should be a Leaf to which we can
            // assign the new uid.
            None => {
                *self = Leaf(*next_uid);
                *next_uid += 1;
            }
        }
    }

    /// Computes the uid of a stream of data. It does so by either assigning a
    /// new uid to the stream if it is the first time the function is called
    /// with this particular stream, or by giving back the previously associated
    /// uid.
    fn get_uid<'a>(&mut self, mut data: impl Iterator<Item = &'a u8>, next_uid: &mut u64) -> u64 {
        match self {
            Node(children) => {
                let child_reference = &mut children[*data.next().unwrap() as usize];

                if !child_reference.is_none() {
                    child_reference.as_mut().unwrap().get_uid(data, next_uid)
                } else {
                    *child_reference = Some(Box::new(UidAssignerTree::new()));
                    let uid = *next_uid;
                    child_reference.as_mut().unwrap().expand(data, next_uid);
                    uid
                }
            }
            Leaf(uid) => *uid,
        }
    }
}

impl UidAssigner {
    pub(crate) fn new() -> Self {
        Self {
            uid_assigner: UidAssignerTree::new(),
            next_id: 0,
        }
    }

    /// Computes the unique identifier of a stream of data.
    ///
    /// The goal of this function is to set a unique identifier to every stream
    /// of data it is given. It does so by storing the stream in a tree-like
    /// structure, which allows efficient search and creation. It is guaranteed
    /// that the same stream will always be associated to the same identifier
    /// and that two different streams will never be associated to the same
    /// identifier, as long as the number of different streams doesn't exceed
    /// $2^64$.
    ///
    /// # Arguments
    ///
    /// * `data` - The stream of data to assign an identifier to. Note that it
    /// must satisfy two strict conditions:
    /// - Each stream of data must be of the same length.
    /// - Each stream of data mustn't contain integers larger than
    /// `TREE_BRANCHES`.
    /// If one of these conditions  isn't satisfied, this function will panic.
    pub(crate) fn get_uid<'a>(&mut self, data: impl ExactSizeIterator<Item = &'a u8>) -> u64 {
        if data.len() != MATRIX_SIZE {
            panic!(
                "Expected iterator of size {}, {} given",
                MATRIX_SIZE,
                data.len()
            );
        }

        self.uid_assigner.get_uid(data, &mut self.next_id)
    }
}

#[cfg(test)]
mod tests {
    use crate::constants::MATRIX_SIZE;
    use crate::uid_assigner::UidAssigner;

    /// This test ensures that a stream is always associated to the same uid.
    #[test]
    fn consistency_test() {
        let mut uid_assigner = UidAssigner::new();
        let array0: [u8; MATRIX_SIZE] = [0; MATRIX_SIZE];
        // Check that asking twice for a uid gives the same uid.
        assert_eq!(
            uid_assigner.get_uid(array0.iter()),
            uid_assigner.get_uid(array0.iter())
        );

        let array1: [u8; MATRIX_SIZE] =
            core::array::from_fn(|i| if i == MATRIX_SIZE - 1 { 0 } else { 1 });
        uid_assigner.get_uid(array1.iter());

        // Check that the previous property remains true even after having
        // inserted another record in the tree.
        assert_eq!(
            uid_assigner.get_uid(array0.iter()),
            uid_assigner.get_uid(array0.iter())
        );
    }

    /// This test ensures that the uid are assigned in a consecutive order
    #[test]
    fn consecutive_id_test() {
        let mut uid_assigner = UidAssigner::new();
        let array0: [u8; MATRIX_SIZE] = [0; MATRIX_SIZE];
        let array1: [u8; MATRIX_SIZE] =
            core::array::from_fn(|i| if i == MATRIX_SIZE - 1 { 0 } else { 1 });
        let array2: [u8; MATRIX_SIZE] =
            core::array::from_fn(|i| if i == MATRIX_SIZE - 2 { 0 } else { 1 });
        let array3: [u8; MATRIX_SIZE] =
            core::array::from_fn(|i| if i >= MATRIX_SIZE - 2 { 0 } else { 1 });

        // Check that the uid of three different streams are the first natural
        // numbers
        assert_eq!(uid_assigner.get_uid(array0.iter()), 0);
        assert_eq!(uid_assigner.get_uid(array1.iter()), 1);
        assert_eq!(uid_assigner.get_uid(array2.iter()), 2);

        // Check that the previous property remains true even after asking for
        // an already existent uid.
        assert_eq!(uid_assigner.get_uid(array0.iter()), 0);
        assert_eq!(uid_assigner.get_uid(array3.iter()), 3);
    }

    /// This test ensures that the `get_uid` function panics upon being given an
    /// iterator with a length shorter than `MATRIX_SIZE`.
    #[test]
    #[should_panic(expected = "Expected iterator of size ")]
    fn panic_on_too_short_iterator_test() {
        let mut uid_assigner = UidAssigner::new();
        let array0: [u8; MATRIX_SIZE - 1] = [0; MATRIX_SIZE - 1];
        uid_assigner.get_uid(array0.iter());
    }

    /// This test ensures that the `get_uid` function panics upon being given an
    /// iterator with a length larger than `MATRIX_SIZE`.
    #[test]
    #[should_panic(expected = "Expected iterator of size ")]
    fn panic_on_too_long_iterator_test() {
        let mut uid_assigner = UidAssigner::new();
        let array0: [u8; MATRIX_SIZE + 1] = [0; MATRIX_SIZE + 1];
        uid_assigner.get_uid(array0.iter());
    }
}

If some parts of your matrices are common, I'd use a trie. Are they? ... What do you mean by: several computations while reading the output? — Peter Blackson
– Peter Blackson, Commented Dec 15, 2023 at 17:39
I would try docs.rs/fasthash/latest/fasthash/farm/struct.Hash128.html — Peter Blackson
– Peter Blackson, Commented Dec 15, 2023 at 17:57
@PeterBlackson I don't expect them to unfortunately. I'll take a look at FastHash, thanks!, I didn't know about it! I was considering xxhash as an alternative for now — Tristan Nemoz
– Tristan Nemoz, Commented Dec 18, 2023 at 15:36

Stack Exchange Network

Assign unique identifiers to binary matrices using a binary tree

Context

The code

0

You must log in to answer this question.

Hot Network Questions

Assign unique identifiers to binary matrices using a binary tree

Context

The code

0

You must log in to answer this question.

Related

Hot Network Questions