use std::{cmp::max_by_key, collections::HashMap};

fn longest_common_subsequence(s1: Vec<u8>, s2: Vec<u8>) -> Vec<u8> {
    struct Helper {
        s1: Vec<u8>,
        s2: Vec<u8>,
        cache: HashMap<(usize, usize), Vec<u8>>,
    }
    impl Helper {
        fn helper(&mut self, i: usize, j: usize) -> Vec<u8> {
            if self.cache.contains_key(&(i, j)) {
                return self.cache[&(i, j)].clone();
            }
            let value = {
                if i >= self.s1.len() || j >= self.s2.len() {
                    Vec::new()
                } else if self.s1[i] == self.s2[j] {
                    let mut tmp = vec![self.s1[i]];
                    tmp.append(&mut self.helper(i + 1, j + 1));
                    tmp
                } else {
                    max_by_key(self.helper(i + 1, j), self.helper(i, j + 1), |s| s.len())
                }
            };
            self.cache.insert((i, j), value.clone());
            return value;
        }
    }
    return Helper {
        s1,
        s2,
        cache: HashMap::new(),
    }
    .helper(0, 0);
}

fn main() {
    println!(
        "{}",
        String::from_utf8(longest_common_subsequence(
            String::from("abcd").into_bytes(),
            String::from("adfsfsdfbcd").into_bytes()
        ))
        .unwrap()
    );
}

I'm playing around with Rust (only a few days into it) for a toy project in which I had to find the canonical representation for a given term. I figured the longest common subsequence of the first few Google search result titles might do the trick, so I started implementing it, and this is what I have (a mental translation of how I'd do it in Python).

Among other things, it looks… ugly to me. What can I do better (shorter, more readable, etc.)?


1 Answer


Your code looks dense because you've packed a lot into a single function, without any structuring whitespace. But that is merely cosmetic and can be fixed. For example, you might structure the code with some blank lines …

fn longest_common_subsequence(s1: Vec<u8>, s2: Vec<u8>) -> Vec<u8> {
    return Helper {
        s1,
        s2,
        cache: HashMap::new(),
    }
    .helper(0, 0);

    struct Helper { ... }

    impl Helper { ...  }
}

… or move the helper outside entirely:

fn longest_common_subsequence(s1: Vec<u8>, s2: Vec<u8>) -> Vec<u8> {
    Helper {
        s1,
        s2,
        cache: HashMap::new(),
    }
    .helper(0, 0)
}

struct Helper { ... }

impl Helper { ...  }

I've thought about whether a separate Helper struct even makes sense. In principle, you could just pass values such as the cache as arguments to a helper function, but given that the pair of vecs s1 and s2 should be kept together, a struct makes sense. Often, the alternative would be to simply use a closure, but Rust doesn't directly support recursive closures.

A different debate can be had about whether the helper() function should be a method. I'd argue “no” – this is a purely syntactic choice here, and we could instead write the helper function to take a &mut Helper parameter. But that is primarily opinion-based.
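
To make that concrete, here is a sketch of the free-function shape – same fields and logic as the Helper above, just not written as a method:

// Assumes the same Helper struct and `use` imports as in the question.
fn helper(state: &mut Helper, i: usize, j: usize) -> Vec<u8> {
    if state.cache.contains_key(&(i, j)) {
        return state.cache[&(i, j)].clone();
    }
    let value = if i >= state.s1.len() || j >= state.s2.len() {
        Vec::new()
    } else if state.s1[i] == state.s2[j] {
        let mut tmp = vec![state.s1[i]];
        tmp.append(&mut helper(state, i + 1, j + 1));
        tmp
    } else {
        max_by_key(helper(state, i + 1, j), helper(state, i, j + 1), |s| s.len())
    };
    state.cache.insert((i, j), value.clone());
    value
}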

When I look into the helper() implementation, I note this pattern:

if self.cache.contains_key(&(i, j)) {
    return self.cache[&(i, j)].clone();
}

This unnecessarily makes two accesses to the cache. Instead, we can write:

if let Some(value) = self.cache.get(&(i, j)) {
    return value.clone();
}

The get() method returns an Option<&_>, which we can pattern-match using if let.

Next, we have the pattern

if let Some(value) = self.cache.get(&(i, j)) {
    return value.clone();
}
let value = ...;
self.cache.insert((i, j), value.clone());
value

Again, the get() and insert() accesses do duplicate work. Typically, we can avoid this using the HashMap Entry API. An entry gives access to a slot in the hash map which may be occupied or may be vacant, and we can efficiently insert a value. The inserting function will return a reference to the value in either case, which we can clone:

self.cache.entry((i, j)).or_insert_with(|| ... ).clone()

which is very elegant, except that it doesn't work here: entry() holds a &mut borrow into the Helper's cache so that it can insert an item, but the closure also needs to borrow self mutably to make the recursive calls. For that reason, it makes sense to stick with your general design – but note that this could be avoided by using a different type for the cache, e.g. a Vec instead of a HashMap: a Vec or slice can be split_at() so that different sub-ranges can be borrowed separately (illustrated below). The problem as a whole is also amenable to Dynamic Programming instead of memoization: without recursion, we wouldn't have to hold on to the cache for the duration of the recursive calls.
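
To illustrate the split_at() point in isolation (not wired into the LCS code, just the borrow-checker behaviour it buys us):

fn split_demo() {
    let mut cache = vec![0u32; 8];
    // split_at_mut proves to the compiler that the two halves are disjoint,
    // so both can be mutably borrowed at the same time.
    let (left, right) = cache.split_at_mut(4);
    left[0] = 1;
    right[0] = 2;
    assert_eq!(cache, [1, 0, 0, 0, 2, 0, 0, 0]);
}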

This is a common theme with Rust: things seem like they should be easy, but they're not that easy when thinking about the actual ownership or safety issues.

For example, I was thinking about returning a &[u8] from the helper() function to avoid unnecessary copies. This mostly works and saves a wasted allocation in the tmp.append(...) statement, but will ultimately have little effect since the max_by_key(...) statement has two calls into helper(). If their return value were to borrow the Helper struct, they would conflict with each other. In this particular case, it's also not possible to break the &mut requirement unless we put the cache into a RefCell and guarantee that the vec contents won't be moved, which we could do with a Pin and a bit of unsafe code.

In order to avoid excessive Vec copies, an alternative to references would be to track indexes into an array. This is a common strategy in Rust to avoid lifetime issues. Instead of

struct Helper {
  ...
  cache: HashMap<(usize, usize), Vec<u8>>,
}

we might use:

struct Helper {
  ...
  cache: HashMap<(usize, usize), usize>,
  subseqs: Vec<Vec<u8>>,
}

Then:

fn longest_common_subsequence(s1: Vec<u8>, s2: Vec<u8>) -> Vec<u8> {
    let mut helper = Helper {
        s1,
        s2,
        cache: HashMap::new(),
        subseqs: vec![Vec::new()], // index 0 is the empty subsequence, used by the base case
    };
    let result_index = helper.cached_value_for(0, 0);
    helper.subseqs[result_index].clone()
}

impl Helper {
    fn cached_value_for(&mut self, i: usize, j: usize) -> usize {
        if let Some(&value) = self.cache.get(&(i, j)) {
            return value;
        }
        
        let value = self.computed_value_for(i, j);
        self.cache.insert((i, j), value);
        value
    }
    
    fn computed_value_for(&mut self, i: usize, j: usize) -> usize {
        if i >= self.s1.len() || j >= self.s2.len() {
            return 0; // index of the pre-seeded empty subsequence
        }
        
        if self.s1[i] == self.s2[j] {
            let next_index = self.cached_value_for(i + 1, j + 1);
            let mut tmp = vec![self.s1[i]];
            tmp.extend_from_slice(&self.subseqs[next_index]);
            self.subseqs.push(tmp);
            return self.subseqs.len() - 1;
        }
        
        max_by_key(
            self.cached_value_for(i + 1, j),
            self.cached_value_for(i, j + 1),
            |&s| self.subseqs[s].len(),
        )
    }
}

The above code also separates value construction from cache management. If desired, we could use this to eliminate the recursion by priming the cache in the right order:

for i in (0..s1.len()).rev() {
    for j in (0..s2.len()).rev() {
        cache.insert((i, j), ...);
    }
}
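
Spelled out fully – as a sketch under the assumption that we only want the subsequence itself and can drop the Helper machinery entirely – the iterative version can use a plain table of lengths plus a backtracking pass (the name and slice-based signature here are my own choice, not part of the original code):

// Sketch of a fully iterative (bottom-up DP) variant.
// lengths[i][j] holds the LCS length of s1[i..] and s2[j..].
fn longest_common_subsequence_dp(s1: &[u8], s2: &[u8]) -> Vec<u8> {
    let mut lengths = vec![vec![0usize; s2.len() + 1]; s1.len() + 1];
    for i in (0..s1.len()).rev() {
        for j in (0..s2.len()).rev() {
            lengths[i][j] = if s1[i] == s2[j] {
                lengths[i + 1][j + 1] + 1
            } else {
                lengths[i + 1][j].max(lengths[i][j + 1])
            };
        }
    }
    // Walk the table once, front to back, to reconstruct the subsequence.
    let (mut i, mut j) = (0, 0);
    let mut result = Vec::with_capacity(lengths[0][0]);
    while i < s1.len() && j < s2.len() {
        if s1[i] == s2[j] {
            result.push(s1[i]);
            i += 1;
            j += 1;
        } else if lengths[i + 1][j] >= lengths[i][j + 1] {
            i += 1;
        } else {
            j += 1;
        }
    }
    result
}

This avoids both the recursion and the per-call Vec clones, at the cost of an O(n·m) table.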

A disadvantage of using indices instead of references is that we lose type safety: all those usize values look the same. This can often be addressed with crates like typed-arena.
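
A lightweight alternative to pulling in a crate is a newtype around the index – a hypothetical sketch, not tied to the code above:

// Hypothetical SubseqId newtype: it can only be used where a subsequence
// index is expected, unlike a bare usize.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SubseqId(usize);

struct TypedHelper {
    cache: std::collections::HashMap<(usize, usize), SubseqId>,
    subseqs: Vec<Vec<u8>>,
}

impl TypedHelper {
    fn subseq(&self, id: SubseqId) -> &[u8] {
        &self.subseqs[id.0]
    }
}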

  • Thanks for the detailed answer! – avamsi, Jul 15, 2021 at 3:31
